1
Bachlava E, Taylor CA, Tang S, Bowers JE, Mandel JR, Burke JM, Knapp SJ. SNP discovery and development of a high-density genotyping array for sunflower. PLoS One 2012; 7:e29814. [PMID: 22238659 PMCID: PMC3251610 DOI: 10.1371/journal.pone.0029814] [Citation(s) in RCA: 73] [Impact Index Per Article: 5.6] [Received: 10/13/2011] [Accepted: 12/06/2011] [Indexed: 11/23/2022] Open
Abstract
Recent advances in next-generation DNA sequencing technologies have made possible the development of high-throughput SNP genotyping platforms that allow for the simultaneous interrogation of thousands of single-nucleotide polymorphisms (SNPs). Such resources have the potential to facilitate the rapid development of high-density genetic maps, and to enable genome-wide association studies as well as molecular breeding approaches in a variety of taxa. Herein, we describe the development of a SNP genotyping resource for use in sunflower (Helianthus annuus L.). This work involved the development of a reference transcriptome assembly for sunflower, the discovery of thousands of high quality SNPs based on the generation and analysis of ca. 6 Gb of transcriptome re-sequencing data derived from multiple genotypes, the selection of 10,640 SNPs for inclusion in the genotyping array, and the use of the resulting array to screen a diverse panel of sunflower accessions as well as related wild species. The results of this work revealed a high frequency of polymorphic SNPs and relatively high level of cross-species transferability. Indeed, greater than 95% of successful SNP assays revealed polymorphism, and more than 90% of these assays could be successfully transferred to related wild species. Analysis of the polymorphism data revealed patterns of genetic differentiation that were largely congruent with the evolutionary history of sunflower, though the large number of markers allowed for finer resolution than has previously been possible.
Affiliation(s)
- Eleni Bachlava
- Center for Applied Genetic Technologies, University of Georgia, Athens, Georgia, United States of America
- Christopher A. Taylor
- Center for Applied Genetic Technologies, University of Georgia, Athens, Georgia, United States of America
- Shunxue Tang
- Center for Applied Genetic Technologies, University of Georgia, Athens, Georgia, United States of America
- John E. Bowers
- Center for Applied Genetic Technologies, University of Georgia, Athens, Georgia, United States of America
- Department of Plant Biology, University of Georgia, Athens, Georgia, United States of America
- Jennifer R. Mandel
- Department of Plant Biology, University of Georgia, Athens, Georgia, United States of America
- John M. Burke
- Department of Plant Biology, University of Georgia, Athens, Georgia, United States of America
- Steven J. Knapp
- Center for Applied Genetic Technologies, University of Georgia, Athens, Georgia, United States of America
2
Cassidy-Hanley DM, Cordonnier-Pratt MM, Pratt LH, Devine C, Mozammal Hossain M, Dickerson HW, Clark TG. Transcriptional profiling of stage specific gene expression in the parasitic ciliate Ichthyophthirius multifiliis. Mol Biochem Parasitol 2011; 178:29-39. [PMID: 21524669 DOI: 10.1016/j.molbiopara.2011.04.004] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 03/04/2011] [Revised: 03/30/2011] [Accepted: 04/06/2011] [Indexed: 01/23/2023]
Abstract
The parasitic ciliate Ichthyophthirius multifiliis (Ich) is among the most important protozoan pathogens of freshwater fish. Ichthyophthirius cannot be grown in cell culture, and the development of effective prophylactic and therapeutic treatments has been hampered by a lack of information regarding genes involved in virulence, differentiation and growth. To help address this issue, we have generated EST libraries from the two major stages of the parasite life cycle that infect and develop within host tissues. A total of 25,084 ESTs were generated from non-normalized libraries prepared from polyA+ RNA of infective theronts and host-associated trophonts. Cluster analysis identified 5311 unique transcripts (UniScripts), of which 2091 were contigs and 3220 singletons. Extrapolation of the data based on rates of EST discovery suggests that more than half the expected protein-coding genes of I. multifiliis are represented in these data. BLASTX comparisons against GenBank nr, UniProtKB (SwissProt and TrEMBL), as well as Tetrahymena thermophila, Plasmodium falciparum, and Paramecium tetraurelia protein databases produced 3694 significant (E-value ≤ 1e-10) hits, of which 1178 were annotated using gene ontology (GO) analysis. A high proportion of UniScripts (63%) showed similarity to other ciliate proteins. When combined with expression profiling data, GO analysis of Biological Process, Cellular Component, and Molecular Function revealed interesting differences in gene families expressed in the two stages. Indeed, the most abundant transcripts were highly stage-specific and coincided with the metabolic activities associated with each stage. This work provides an effective genomics resource to further our understanding of Ichthyophthirius biology, and lays the groundwork for the identification of potential drug targets and vaccine candidates for the control of this devastating fish pathogen.
Affiliation(s)
- Donna M Cassidy-Hanley
- Department of Microbiology and Immunology, College of Veterinary Medicine, Cornell University, Ithaca, NY 14853, United States.
3
Troshin PV, Postis VL, Ashworth D, Baldwin SA, McPherson MJ, Barton GJ. PIMS sequencing extension: a laboratory information management system for DNA sequencing facilities. BMC Res Notes 2011; 4:48. [PMID: 21385349 PMCID: PMC3058032 DOI: 10.1186/1756-0500-4-48] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 10/11/2010] [Accepted: 03/07/2011] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND: Facilities that provide a service for DNA sequencing typically support large numbers of users and experiment types. The cost of services is often reduced by the use of liquid handling robots but the efficiency of such facilities is hampered because the software for such robots does not usually integrate well with the systems that run the sequencing machines. Accordingly, there is a need for software systems capable of integrating different robotic systems and managing sample information for DNA sequencing services. In this paper, we describe an extension to the Protein Information Management System (PIMS) that is designed for DNA sequencing facilities. The new version of PIMS has a user-friendly web interface and integrates all aspects of the sequencing process, including sample submission, handling and tracking, together with capture and management of the data. RESULTS: The PIMS sequencing extension has been in production since July 2009 at the University of Leeds DNA Sequencing Facility. It has completely replaced manual data handling and simplified the tasks of data management and user communication. Samples from 45 groups have been processed with an average throughput of 10,000 samples per month. The current version of the PIMS sequencing extension works with Applied Biosystems 3130XL 96-well plate sequencer and MWG 4204 or Aviso Theonyx liquid handling robots, but is readily adaptable for use with other combinations of robots. CONCLUSIONS: PIMS has been extended to provide a user-friendly and integrated data management solution for DNA sequencing facilities that is accessed through a normal web browser and allows simultaneous access by multiple users as well as facility managers. The system integrates sequencing and liquid handling robots, manages the data flow, and provides remote access to the sequencing results. The software is freely available, for academic users, from http://www.pims-lims.org/.
Affiliation(s)
- Peter V Troshin
- College of Life Sciences, University of Dundee, Dundee, DD1 4HN, UK.
4
Expressed sequence tags with cDNA termini: previously overlooked resources for gene annotation and transcriptome exploration in Chlamydomonas reinhardtii. Genetics 2008; 179:83-93. [PMID: 18493042 DOI: 10.1534/genetics.107.085605] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 11/18/2022] Open
Abstract
Many Chlamydomonas reinhardtii expressed sequence tags (ESTs) in GenBank dbEST and community EST assemblies were either over- or undertrimmed in terms of their cDNA termini, which are defined as the diagnostic sequence elements that delineate 3'/5' ends of mRNA transcripts. Overtrimming loses directional, positional, and structural information about transcript ends, whereas undertrimming leaves spurious sequences in ESTs that harm downstream EST-based applications. We examined 309,278 raw EST sequencing trace files of C. reinhardtii and found that only 57% had cDNA termini that matched the expected structures specified in their cDNA library constructions while satisfying our minimum length requirement for their final clean sequences. Using GMAP, 156,963 individual ESTs were mapped to the genome successfully, with their in silico-verified cDNA termini anchored to the genome. Our data analysis suggested strong macro- and microheterogeneity of 3'/5' end positions of individual transcripts derived from the same genes in C. reinhardtii. This work annotating differential ends of individual transcripts in the draft genome presents the research community with a new stream of data that will facilitate accurate determination of gene structures, genome annotation, and exploration of the transcriptome and mRNA metabolism in C. reinhardtii.
5
Conner JA, Goel S, Gunawan G, Cordonnier-Pratt MM, Johnson VE, Liang C, Wang H, Pratt LH, Mullet JE, DeBarry J, Yang L, Bennetzen JL, Klein PE, Ozias-Akins P. Sequence analysis of bacterial artificial chromosome clones from the apospory-specific genomic region of Pennisetum and Cenchrus. PLANT PHYSIOLOGY 2008; 147:1396-411. [PMID: 18508959 PMCID: PMC2442526 DOI: 10.1104/pp.108.119081] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.8] [Received: 03/17/2008] [Accepted: 05/25/2008] [Indexed: 05/18/2023]
Abstract
Apomixis, asexual reproduction through seed, is widespread among angiosperm families. Gametophytic apomixis in Pennisetum squamulatum and Cenchrus ciliaris is controlled by the apospory-specific genomic region (ASGR), which is highly conserved and macrosyntenic between these species. Thirty-two ASGR bacterial artificial chromosomes (BACs) isolated from both species and one ASGR-recombining BAC from P. squamulatum, which together cover approximately 2.7 Mb of DNA, were used to investigate the genomic structure of this region. Phrap assembly of 4,521 high-quality reads generated 1,341 contiguous sequences (contigs; 730 from the ASGR and 30 from the ASGR-recombining BAC in P. squamulatum, plus 581 from the C. ciliaris ASGR). Contigs containing putative protein-coding regions unrelated to transposable elements were identified based on protein similarity after Basic Local Alignment Search Tool X (BLASTX) analysis. These putative coding regions were further analyzed in silico with reference to the rice (Oryza sativa) and sorghum (Sorghum bicolor) genomes using the resources at Gramene (www.gramene.org) and Phytozome (www.phytozome.net) and by hybridization against sorghum BAC filters. The ASGR sequences reveal that the ASGR (1) contains both gene-rich and gene-poor segments, (2) contains several genes that may play a role in apomictic development, (3) has many classes of transposable elements, and (4) does not exhibit large-scale synteny with either rice or sorghum genomes but does contain multiple regions of microsynteny with these species.
Affiliation(s)
- Joann A Conner
- Department of Horticulture, University of Georgia, Tifton, Georgia 31793-0748, USA
6
Freeman RM, Wu M, Cordonnier-Pratt MM, Pratt LH, Gruber CE, Smith M, Lander ES, Stange-Thomann N, Lowe CJ, Gerhart J, Kirschner M. cDNA sequences for transcription factors and signaling proteins of the hemichordate Saccoglossus kowalevskii: efficacy of the expressed sequence tag (EST) approach for evolutionary and developmental studies of a new organism. THE BIOLOGICAL BULLETIN 2008; 214:284-302. [PMID: 18574105 DOI: 10.2307/25470670] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.2] [Indexed: 05/26/2023]
Abstract
We describe a collection of expressed sequence tags (ESTs) for Saccoglossus kowalevskii, a direct-developing hemichordate valuable for evolutionary comparisons with chordates. The 202,175 ESTs represent 163,633 arrayed clones carrying cDNAs prepared from embryonic libraries, and they assemble into 13,677 continuous sequences (contigs), leaving 10,896 singletons (excluding mitochondrial sequences). Of the contigs, 53% had significant matches when BLAST was used to query the NCBI databases (≤ 10^-10), as did 51% of the singletons. Contigs most frequently matched sequences from amphioxus (29%), chordates (67%), and deuterostomes (87%). From the clone array, we isolated 400 full-length sequences for transcription factors and signaling proteins of use for evolutionary and developmental studies. The set includes sequences for fox, pax, tbx, hox, and other homeobox-containing factors, and for ligands and receptors of the TGFbeta, Wnt, Hh, Delta/Notch, and RTK pathways. At least 80% of key sequences have been obtained, when judged against gene lists of model organisms. The median length of these cDNAs is 2.3 kb, including 1.05 kb of 3' untranslated region (UTR). Only 30% are entirely matched by single contigs assembled from ESTs. We conclude that an EST collection based on 150,000 clones is a rich source of sequences for molecular developmental work, and that the EST approach is an efficient way to initiate comparative studies of a new organism.
Affiliation(s)
- R M Freeman
- Department of Systems Biology, Harvard Medical School, Boston, Massachusetts 02115, USA
7
Baerson SR, Dayan FE, Rimando AM, Nanayakkara NPD, Liu CJ, Schröder J, Fishbein M, Pan Z, Kagan IA, Pratt LH, Cordonnier-Pratt MM, Duke SO. A functional genomics investigation of allelochemical biosynthesis in Sorghum bicolor root hairs. J Biol Chem 2007; 283:3231-3247. [PMID: 17998204 DOI: 10.1074/jbc.m706587200] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.6] [Indexed: 11/06/2022] Open
Abstract
Sorghum is considered to be one of the more allelopathic crop species, producing phytotoxins such as the potent benzoquinone sorgoleone (2-hydroxy-5-methoxy-3-[(Z,Z)-8',11',14'-pentadecatriene]-p-benzoquinone) and its analogs. Sorgoleone likely accounts for much of the allelopathy of Sorghum spp., typically representing the predominant constituent of Sorghum bicolor root exudates. Previous and ongoing studies suggest that the biosynthetic pathway for this plant growth inhibitor occurs in root hair cells, involving a polyketide synthase activity that utilizes an atypical 16:3 fatty acyl-CoA starter unit, resulting in the formation of a pentadecatrienyl resorcinol intermediate. Subsequent modifications of this resorcinolic intermediate are likely to be mediated by S-adenosylmethionine-dependent O-methyltransferases and dihydroxylation by cytochrome P450 monooxygenases, although the precise sequence of reactions has not been determined previously. Analyses performed by gas chromatography-mass spectrometry with sorghum root extracts identified a 3-methyl ether derivative of the likely pentadecatrienyl resorcinol intermediate, indicating that dihydroxylation of the resorcinol ring is preceded by O-methylation at the 3'-position by a novel 5-n-alk(en)ylresorcinol-utilizing O-methyltransferase activity. An expressed sequence tag data set consisting of 5,468 sequences selected at random from an S. bicolor root hair-specific cDNA library was generated to identify candidate sequences potentially encoding enzymes involved in the sorgoleone biosynthetic pathway. Quantitative real time reverse transcription-PCR and recombinant enzyme studies with putative O-methyltransferase sequences obtained from the expressed sequence tag data set have led to the identification of a novel O-methyltransferase highly and predominantly expressed in root hairs (designated SbOMT3), which preferentially utilizes alk(en)ylresorcinols among a panel of benzene-derivative substrates tested. 
SbOMT3 is therefore proposed to be involved in the biosynthesis of the allelochemical sorgoleone.
Affiliation(s)
- Scott R Baerson
- United States Department of Agriculture, Agricultural Research Service, Natural Products Utilization Research Unit, University, Mississippi 38677
- Franck E Dayan
- United States Department of Agriculture, Agricultural Research Service, Natural Products Utilization Research Unit, University, Mississippi 38677
- Agnes M Rimando
- United States Department of Agriculture, Agricultural Research Service, Natural Products Utilization Research Unit, University, Mississippi 38677
- N P Dhammika Nanayakkara
- National Center for Natural Products Research, School of Pharmacy, University of Mississippi, University, Mississippi 38677
- Chang-Jun Liu
- Biology Department, Brookhaven National Laboratory, Upton, New York 11973
- Joachim Schröder
- Universität Freiburg, Institut für Biologie II, Schänzlestrasse 1, D-79104 Freiburg, Germany
- Mark Fishbein
- Department of Biology, Portland State University, Portland, Oregon 97207
- Zhiqiang Pan
- United States Department of Agriculture, Agricultural Research Service, Natural Products Utilization Research Unit, University, Mississippi 38677
- Isabelle A Kagan
- United States Department of Agriculture, Agricultural Research Service, Natural Products Utilization Research Unit, University, Mississippi 38677
- Lee H Pratt
- Department of Plant Biology, University of Georgia, Athens, Georgia 30602
- Stephen O Duke
- United States Department of Agriculture, Agricultural Research Service, Natural Products Utilization Research Unit, University, Mississippi 38677
8
Wendl MC, Smith S, Pohl CS, Dooling DJ, Chinwalla AT, Crouse K, Hepler T, Leong S, Carmichael L, Nhan M, Oberkfell BJ, Mardis ER, Hillier LW, Wilson RK. Design and implementation of a generalized laboratory data model. BMC Bioinformatics 2007; 8:362. [PMID: 17897463 PMCID: PMC2194795 DOI: 10.1186/1471-2105-8-362] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 06/05/2007] [Accepted: 09/26/2007] [Indexed: 12/02/2022] Open
Abstract
Background: Investigators in the biological sciences continue to exploit laboratory automation methods and have dramatically increased the rates at which they can generate data. In many environments, the methods themselves also evolve in a rapid and fluid manner. These observations point to the importance of robust information management systems in the modern laboratory. Designing and implementing such systems is non-trivial and it appears that in many cases a database project ultimately proves unserviceable. Results: We describe a general modeling framework for laboratory data and its implementation as an information management system. The model utilizes several abstraction techniques, focusing especially on the concepts of inheritance and meta-data. Traditional approaches commingle event-oriented data with regular entity data in ad hoc ways. Instead, we define distinct regular entity and event schemas, but fully integrate these via a standardized interface. The design allows straightforward definition of a "processing pipeline" as a sequence of events, obviating the need for separate workflow management systems. A layer above the event-oriented schema integrates events into a workflow by defining "processing directives", which act as automated project managers of items in the system. Directives can be added or modified in an almost trivial fashion, i.e., without the need for schema modification or re-certification of applications. Association between regular entities and events is managed via simple "many-to-many" relationships. We describe the programming interface, as well as techniques for handling input/output, process control, and state transitions. Conclusion: The implementation described here has served as the Washington University Genome Sequencing Center's primary information system for several years. It handles all transactions underlying a throughput rate of about 9 million sequencing reactions of various kinds per month and has handily weathered a number of major pipeline reconfigurations. The basic data model can be readily adapted to other high-volume processing environments.
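The separation of entities, events, and their many-to-many association described in this abstract can be illustrated with a small in-memory sketch. It is not the Genome Sequencing Center's schema or programming interface: the class names `Entity`, `Event`, and `Lab`, the step names, and the `next_step` rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Entity:
    """A regular entity such as a plate or sample (hypothetical fields)."""
    entity_id: int
    kind: str


@dataclass(frozen=True)
class Event:
    """Something that happened in the lab, stored apart from entities."""
    event_id: int
    step: str


class Lab:
    """Toy store: events, plus a many-to-many link set to entities."""

    def __init__(self, pipeline):
        self.pipeline = list(pipeline)   # ordered step names (assumed)
        self.events = {}                 # event_id -> Event
        self.links = set()               # (entity_id, event_id) pairs
        self._ids = count(1)

    def record(self, step, entities):
        """Record one event and associate it with any number of entities."""
        event = Event(next(self._ids), step)
        self.events[event.event_id] = event
        self.links.update((e.entity_id, event.event_id) for e in entities)
        return event

    def steps_done(self, entity):
        """Names of all steps linked to this entity."""
        return {self.events[ev].step
                for ent, ev in self.links if ent == entity.entity_id}

    def next_step(self, entity):
        """Directive-like rule: first pipeline step not yet recorded."""
        done = self.steps_done(entity)
        for step in self.pipeline:
            if step not in done:
                return step
        return None


lab = Lab(["pick", "prep", "sequence"])
a, b = Entity(1, "plate"), Entity(2, "plate")
lab.record("pick", [a, b])   # one event linked to two entities
lab.record("prep", [a])
print(lab.next_step(a), lab.next_step(b))
```

In this sketch the workflow lives in the `pipeline` list rather than in the stored records, so reordering or extending the steps changes no entity or event structure, which loosely mirrors the abstract's point about modifying directives without schema changes.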
Affiliation(s)
- Michael C Wendl
- Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA
- Scott Smith
- Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA
- Craig S Pohl
- Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA
- David J Dooling
- Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA
- Asif T Chinwalla
- Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA
- Kevin Crouse
- Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA
- Todd Hepler
- Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA
- Shin Leong
- Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA
- Lynn Carmichael
- Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA
- Mike Nhan
- Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA
- Elaine R Mardis
- Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA
- LaDeana W Hillier
- Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA
- Richard K Wilson
- Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA
9
Kauff F, Cox CJ, Lutzoni F. WASABI: an automated sequence processing system for multigene phylogenies. Syst Biol 2007; 56:523-31. [PMID: 17562476 DOI: 10.1080/10635150701395340] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 10/23/2022] Open
Affiliation(s)
- Frank Kauff
- Department of Biology, Duke University, Durham, NC 27708, USA.
10
A novel approach to sequence validating protein expression clones with automated decision making. BMC Bioinformatics 2007; 8:198. [PMID: 17567908 PMCID: PMC1914086 DOI: 10.1186/1471-2105-8-198] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 02/08/2007] [Accepted: 06/13/2007] [Indexed: 02/02/2023] Open
Abstract
Background: Whereas the molecular assembly of protein expression clones is readily automated and routinely accomplished in high throughput, sequence verification of these clones is still largely performed manually, an arduous and time consuming process. The ultimate goal of validation is to determine if a given plasmid clone matches its reference sequence sufficiently to be "acceptable" for use in protein expression experiments. Given the accelerating increase in availability of tens of thousands of unverified clones, there is a strong demand for rapid, efficient and accurate software that automates clone validation. Results: We have developed an Automated Clone Evaluation (ACE) system, the first comprehensive, multi-platform, web-based plasmid sequence verification software package. ACE automates the clone verification process by defining each clone sequence as a list of multidimensional discrepancy objects, each describing a difference between the clone and its expected sequence including the resulting polypeptide consequences. To evaluate clones automatically, this list can be compared against user acceptance criteria that specify the allowable number of discrepancies of each type. This strategy allows users to re-evaluate the same set of clones against different acceptance criteria as needed for use in other experiments. ACE manages the entire sequence validation process including contig management, identifying and annotating discrepancies, determining if discrepancies correspond to polymorphisms and clone finishing. Designed to manage thousands of clones simultaneously, ACE maintains a relational database to store information about clones at various completion stages, project processing parameters and acceptance criteria. In a direct comparison, the automated analysis by ACE took less time and was more accurate than a manual analysis of a 93 gene clone set. Conclusion: ACE was designed to facilitate high throughput clone sequence verification projects. The software has been used successfully to evaluate more than 55,000 clones at the Harvard Institute of Proteomics. The software dramatically reduced the amount of time and labor required to evaluate clone sequences and decreased the number of missed sequence discrepancies, which commonly occur during manual evaluation. In addition, ACE helped to reduce the number of sequencing reads needed to achieve adequate coverage for making decisions on clones.
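The evaluation idea in this abstract, a list of discrepancies checked against per-type limits, can be sketched in a few lines. This is only an illustration under stated assumptions, not ACE's data model: the `Discrepancy` fields, the type names, and the rule that unlisted types have a limit of zero are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Discrepancy:
    """One difference between a clone and its reference (hypothetical fields)."""
    position: int
    kind: str  # assumed type labels, e.g. "silent", "missense", "frameshift"


def evaluate_clone(discrepancies, max_allowed):
    """Accept a clone if no discrepancy type exceeds its allowed count.

    Assumption: a type missing from max_allowed is not tolerated (limit 0).
    """
    counts = Counter(d.kind for d in discrepancies)
    return all(n <= max_allowed.get(kind, 0) for kind, n in counts.items())


clone = [Discrepancy(120, "silent"), Discrepancy(450, "missense")]
strict = {"silent": 2}                   # no missense changes tolerated
lenient = {"silent": 2, "missense": 1}   # one missense change tolerated

print(evaluate_clone(clone, strict))
print(evaluate_clone(clone, lenient))
```

Because the discrepancy list is stored separately from the criteria, the same clone can be re-evaluated against a different criteria set without re-analysing its sequence, which is the property the abstract highlights.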
11
Liang C, Wang G, Liu L, Ji G, Fang L, Liu Y, Carter K, Webb JS, Dean JFD. ConiferEST: an integrated bioinformatics system for data reprocessing and mining of conifer expressed sequence tags (ESTs). BMC Genomics 2007; 8:134. [PMID: 17535431 PMCID: PMC1894976 DOI: 10.1186/1471-2164-8-134] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 11/02/2006] [Accepted: 05/29/2007] [Indexed: 11/30/2022] Open
Abstract
Background: With the advent of low-cost, high-throughput sequencing, the amount of public domain Expressed Sequence Tag (EST) sequence data available for both model and non-model organisms is growing exponentially. While these data are widely used for characterizing various genomes, they also present a serious challenge for data quality control and validation due to their inherent deficiencies, particularly for species without genome sequences. Description: ConiferEST is an integrated system for data reprocessing, visualization and mining of conifer ESTs. In its current release, Build 1.0, it houses 172,229 loblolly pine EST sequence reads, which were obtained from reprocessing raw DNA sequencer traces using our software, WebTraceMiner. The trace files were downloaded from NCBI Trace Archive. ConiferEST provides biologists unique, easy-to-use data visualization and mining tools for a variety of putative sequence features including cloning vector segments, adapter sequences, restriction endonuclease recognition sites, polyA and polyT runs, and their corresponding Phred quality values. Based on these putative features, verified sequence features such as 3' and/or 5' termini of cDNA inserts in either sense or non-sense strand have been identified in-silico. Interestingly, only 30.03% of the designated 3' ESTs were found to have an authenticated 5' terminus in the non-sense strand (i.e., polyT tails), while fewer than 5.34% of the designated 5' ESTs had a verified 5' terminus in the sense strand. Such previously ignored features provide valuable insight for data quality control and validation of error-prone ESTs, as well as the ability to identify novel functional motifs embedded in large EST datasets. We found that "double-termini adapters" were effective indicators of potential EST chimeras. For all sequences with in-silico verified termini/terminus, we used InterProScan to assign protein domain signatures, results of which are available for in-depth exploration using our biologist-friendly web interfaces. Conclusion: ConiferEST represents a unique and complementary public resource for EST data integration and mining in conifers by reprocessing raw DNA traces, identifying putative sequence features and determining and annotating in-silico verified features. Seamlessly integrated with other public resources, ConiferEST provides biologists powerful tools to verify data, visualize abnormalities, including EST chimeras, and explore large EST datasets.
Affiliation(s)
- Chun Liang
- Department of Botany, Miami University, Oxford, Ohio 45056, USA
- Gang Wang
- Department of Botany, Miami University, Oxford, Ohio 45056, USA
- Lin Liu
- Department of Botany, Miami University, Oxford, Ohio 45056, USA
- Guoli Ji
- Department of Automation, Xiamen University, Xiamen, Fujian, 361005, China
- Lin Fang
- Beijing Genomics Institute, Beijing 101300, China
- Yuansheng Liu
- Department of Botany, Miami University, Oxford, Ohio 45056, USA
- Kikia Carter
- Department of Botany, Miami University, Oxford, Ohio 45056, USA
- Jason S Webb
- Department of Botany, Miami University, Oxford, Ohio 45056, USA
- Jeffrey FD Dean
- Warnell School of Forestry and Natural Resources, University of Georgia, Athens, Georgia 30602, USA
12
Liang C, Wang G, Liu L, Ji G, Liu Y, Chen J, Webb JS, Reese G, Dean JFD. WebTraceMiner: a web service for processing and mining EST sequence trace files. Nucleic Acids Res 2007; 35:W137-42. [PMID: 17488839 PMCID: PMC1933163 DOI: 10.1093/nar/gkm299] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 12/05/2022] Open
Abstract
Expressed sequence tags (ESTs) remain a dominant approach for characterizing the protein-encoding portions of various genomes. Due to inherent deficiencies, they also present serious challenges for data quality control. Before GenBank submission, EST sequences are typically screened and trimmed of vector and adapter/linker sequences, as well as polyA/T tails. Removal of these sequences presents an obstacle for data validation of error-prone ESTs and impedes data mining of certain functional motifs, whose detection relies on accurate annotation of positional information for polyA tails added posttranscriptionally. As raw DNA sequence information is made increasingly available from public repositories, such as NCBI Trace Archive, new tools will be necessary to reanalyze and mine this data for new information. WebTraceMiner (www.conifergdb.org/software/wtm) was designed as a public sequence processing service for raw EST traces, with a focus on detection and mining of sequence features that help characterize 3′ and 5′ termini of cDNA inserts, including vector fragments, adapter/linker sequences, insert-flanking restriction endonuclease recognition sites and polyA or polyT tails. WebTraceMiner complements other public EST resources and should prove to be a unique tool to facilitate data validation and mining of error-prone ESTs (e.g. discovery of new functional motifs).
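As a rough illustration of one feature type mentioned in this abstract, the sketch below locates a polyA or polyT run in a raw read while leaving the read untrimmed, so positional information is preserved. It is not WebTraceMiner's algorithm: the function name `find_poly_run`, the exact-match rule, and the 10-base minimum are assumptions, and a real tool would likely also tolerate miscalled bases and weigh Phred quality values.

```python
import re


def find_poly_run(seq, base, min_len=10):
    """Return (start, end) of the longest uninterrupted run of `base`
    that is at least `min_len` long, or None if there is no such run.

    Coordinates are 0-based and end-exclusive, like Python slices.
    """
    pattern = f"{base}{{{min_len},}}"      # e.g. "A{10,}"
    best = None
    for m in re.finditer(pattern, seq.upper()):
        if best is None or (m.end() - m.start()) > (best[1] - best[0]):
            best = (m.start(), m.end())
    return best


# Hypothetical read: insert sequence, a 12-base polyA tail, then a
# restriction-site-like flank.
read = "GATTACAGGC" + "A" * 12 + "CTCGAG"

print(find_poly_run(read, "A"))   # span of the polyA run
print(find_poly_run(read, "T"))   # no polyT run in this read
```

Reporting the span instead of cutting the read at that point keeps the tail's position available for later checks, such as whether a restriction site or adapter follows it.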
Affiliation(s)
- Chun Liang
- Department of Botany, Miami University, Oxford, Ohio 45056, USA.