1
|
Kozlov S, Grishin E. The mining of toxin-like polypeptides from EST database by single residue distribution analysis. BMC Genomics 2011; 12:88. [PMID: 21281459 PMCID: PMC3040730 DOI: 10.1186/1471-2164-12-88] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Received: 08/16/2010] [Accepted: 01/31/2011] [Indexed: 11/20/2022] Open
Abstract
Background Novel high throughput sequencing technologies require permanent development of bioinformatics data processing methods. Among them, rapid and reliable identification of encoded proteins plays a pivotal role. To search for particular protein families, the amino acid sequence motifs suitable for selective screening of nucleotide sequence databases may be used. In this work, we suggest a novel method for simplified representation of protein amino acid sequences named Single Residue Distribution Analysis, which is applicable both for homology search and database screening. Results Using the procedure developed, a search for amino acid sequence motifs in sea anemone polypeptides was performed, and 14 different motifs with broad and low specificity were discriminated. The adequacy of motifs for mining toxin-like sequences was confirmed by their ability to identify 100% toxin-like anemone polypeptides in the reference polypeptide database. The employment of novel motifs for the search of polypeptide toxins in Anemonia viridis EST dataset allowed us to identify 89 putative toxin precursors. The translated and modified ESTs were scanned using a special algorithm. In addition to direct comparison with the motifs developed, the putative signal peptides were predicted and homology with known structures was examined. Conclusions The suggested method may be used to retrieve structures of interest from the EST databases using simple amino acid sequence motifs as templates. The efficiency of the procedure for directed search of polypeptides is higher than that of most currently used methods. Analysis of 39939 ESTs of sea anemone Anemonia viridis resulted in identification of five protein precursors of earlier described toxins, discovery of 43 novel polypeptide toxins, and prediction of 39 putative polypeptide toxin sequences. In addition, two precursors of novel peptides presumably displaying neuronal function were disclosed.
Affiliation(s)
- Sergey Kozlov
- Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, ul. Miklukho-Maklaya 16/10, 117997 Moscow, Russia
|
2
|
Macagno ER, Gaasterland T, Edsall L, Bafna V, Soares MB, Scheetz T, Casavant T, Da Silva C, Wincker P, Tasiemski A, Salzet M. Construction of a medicinal leech transcriptome database and its application to the identification of leech homologs of neural and innate immune genes. BMC Genomics 2010; 11:407. [PMID: 20579359 PMCID: PMC2996935 DOI: 10.1186/1471-2164-11-407] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Received: 11/12/2009] [Accepted: 06/25/2010] [Indexed: 11/17/2022] Open
Abstract
Background The medicinal leech, Hirudo medicinalis, is an important model system for the study of nervous system structure, function, development, regeneration and repair. It is also a unique species in being presently approved for use in medical procedures, such as clearing of pooled blood following certain surgical procedures. It is a current, and potentially also future, source of medically useful molecular factors, such as anticoagulants and antibacterial peptides, which may have evolved as a result of its parasitizing large mammals, including humans. Despite the broad focus of research on this system, little has been done at the genomic or transcriptomic levels and there is a paucity of openly available sequence data. To begin to address this problem, we constructed whole embryo and adult central nervous system (CNS) EST libraries and created a clustered sequence database of the Hirudo transcriptome that is available to the scientific community. Results A total of ~133,000 EST clones from two directionally-cloned cDNA libraries, one constructed from mRNA derived from whole embryos at several developmental stages and the other from adult CNS cords, were sequenced in one or both directions by three different groups: Genoscope (French National Sequencing Center), the University of Iowa Sequencing Facility and the DOE Joint Genome Institute. These were assembled using the phrap software package into 31,232 unique contigs and singletons, with an average length of 827 nt. The assembled transcripts were then translated in all six frames and compared to proteins in NCBI's non-redundant (NR) and to the Gene Ontology (GO) protein sequence databases, resulting in 15,565 matches to 11,236 proteins in NR and 13,935 matches to 8,073 proteins in GO. 
Searching the database for transcripts of genes homologous to those thought to be involved in the innate immune responses of vertebrates and other invertebrates yielded a set of nearly one hundred evolutionarily conserved sequences, representing all known pathways involved in these important functions. Conclusions The sequences obtained for Hirudo transcripts represent the first major database of genes expressed in this important model system. Comparison of translated open reading frames (ORFs) with the other openly available leech datasets, the genome and transcriptome of Helobdella robusta, shows an average identity at the amino acid level of 58% in matched sequences. Interestingly, comparison with other available Lophotrochozoans shows similar high levels of amino acid identity, where sequences match, for example, 64% with Capitella capitata (a polychaete) and 56% with Aplysia californica (a mollusk), as well as 58% with Schistosoma mansoni (a platyhelminth). Phylogenetic comparisons of putative Hirudo innate immune response genes present within the Hirudo transcriptome database herein described show a strong resemblance to the corresponding mammalian genes, indicating that this important physiological response may have older origins than what has been previously proposed.
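The six-frame translation step mentioned in the abstract above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and plain DNA input; the function names (`reverse_complement`, `translate`, `six_frame_translations`) are hypothetical and are not taken from the authors' pipeline.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1..+3, -1..-3) to their protein translations."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

Each of the six translations would then be compared against a protein database; that search step is outside the scope of this sketch.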
Affiliation(s)
- Eduardo R Macagno
- Division of Biological Sciences, University of California, San Diego, CA, USA.
|
3
|
SeqTrim: a high-throughput pipeline for pre-processing any type of sequence read. BMC Bioinformatics 2010. [PMID: 20089148 DOI: 10.1186/1471-2105-11-38] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022] Open
|
4
|
Falgueras J, Lara AJ, Fernández-Pozo N, Cantón FR, Pérez-Trabado G, Claros MG. SeqTrim: a high-throughput pipeline for pre-processing any type of sequence read. BMC Bioinformatics 2010; 11:38. [PMID: 20089148 PMCID: PMC2832897 DOI: 10.1186/1471-2105-11-38] [Citation(s) in RCA: 142] [Impact Index Per Article: 10.1] [Received: 06/04/2009] [Accepted: 01/20/2010] [Indexed: 12/05/2022] Open
Abstract
BACKGROUND High-throughput automated sequencing has enabled an exponential growth rate of sequencing data. This requires increasing sequence quality and reliability in order to avoid database contamination with artefactual sequences. The arrival of pyrosequencing enhances this problem and necessitates customisable pre-processing algorithms. RESULTS SeqTrim has been implemented both as a Web and as a standalone command line application. Already-published and newly-designed algorithms have been included to identify sequence inserts, to remove low quality, vector, adaptor, low complexity and contaminant sequences, and to detect chimeric reads. The availability of several input and output formats allows its inclusion in sequence processing workflows. Due to its specific algorithms, SeqTrim outperforms other pre-processors implemented as Web services or standalone applications. It performs equally well with sequences from EST libraries, SSH libraries, genomic DNA libraries and pyrosequencing reads and does not lead to over-trimming. CONCLUSIONS SeqTrim is an efficient pipeline designed for pre-processing of any type of sequence read, including next-generation sequencing. It is easily configurable and provides a friendly interface that allows users to know what happened with sequences at every pre-processing stage, and to verify pre-processing of an individual sequence if desired. The recommended pipeline reveals more information about each sequence than previously described pre-processors and can discard more sequencing or experimental artefacts.
Affiliation(s)
- Juan Falgueras
- Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, Málaga, Spain
- Antonio J Lara
- Plataforma Andaluza de Bioinformática, Universidad de Málaga, 29071 Málaga, Spain
- Noé Fernández-Pozo
- Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, 29071 Málaga, Spain
- Francisco R Cantón
- Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, 29071 Málaga, Spain
- Guillermo Pérez-Trabado
- Plataforma Andaluza de Bioinformática, Universidad de Málaga, 29071 Málaga, Spain
- Departamento de Arquitectura de Computadores, Universidad de Málaga, Málaga, Spain
- M Gonzalo Claros
- Plataforma Andaluza de Bioinformática, Universidad de Málaga, 29071 Málaga, Spain
- Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, 29071 Málaga, Spain
|
5
|
Expressed sequence tags: normalization and subtraction of cDNA libraries. Methods Mol Biol 2009. [PMID: 19277560 DOI: 10.1007/978-1-60327-136-3_6] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6]
Abstract
Expressed Sequence Tags (ESTs) provide a rapid and efficient approach for gene discovery and analysis of gene expression in eukaryotes. ESTs have also become particularly important with recent expanded efforts in complete genome sequencing of understudied, nonmodel eukaryotes such as protists and algae. For these projects, ESTs provide an invaluable source of data for gene identification and prediction of exon-intron boundaries. The generation of EST data, although straightforward in concept, requires nonetheless great care to ensure the highest efficiency and return for the investment in time and funds. To this end, key steps in the process include generation of a normalized cDNA library to facilitate a high gene discovery rate followed by serial subtraction of normalized libraries to maintain the discovery rate. Here we describe in detail, protocols for normalization and subtraction of cDNA libraries followed by an example using the toxic dinoflagellate Alexandrium tamarense.
|
6
|
Tang Z, Choi JH, Hemmerich C, Sarangi A, Colbourne JK, Dong Q. ESTPiper--a web-based analysis pipeline for expressed sequence tags. BMC Genomics 2009; 10:174. [PMID: 19383159 PMCID: PMC2676306 DOI: 10.1186/1471-2164-10-174] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 09/26/2008] [Accepted: 04/21/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND EST sequencing projects are increasing in scale and scope as the genome sequencing technologies migrate from core sequencing centers to individual research laboratories. Effectively, generating EST data is no longer a bottleneck for investigators. However, processing large amounts of EST data remains a non-trivial challenge for many. Web-based EST analysis tools are proving to be the most convenient option for biologists when performing their analysis, so these tools must continuously improve their utility to keep in step with the growing needs of research communities. We have developed a web-based EST analysis pipeline called ESTPiper, which streamlines typical large-scale EST analysis components. RESULTS The intuitive web interface guides users through each step of base calling, data cleaning, assembly, genome alignment, annotation, analysis of gene ontology (GO), and microarray oligonucleotide probe design. Each step is modularized, so a user can execute the steps separately or together in batch mode. In addition, the user has control over the parameters used by the underlying programs. Extensive documentation of ESTPiper's functionality is embedded throughout the web site to facilitate understanding of the required input and interpretation of the computational results. The user can also download intermediate results and port files to separate programs for further analysis. In addition, our server provides a time-stamped description of the run history for reproducibility. The pipeline can also be installed locally, allowing researchers to modify ESTPiper to suit their own needs. CONCLUSION ESTPiper streamlines the typical process of EST analysis. The pipeline was initially designed in part to support the Daphnia pulex cDNA sequencing project. A web server hosting ESTPiper is provided at http://estpiper.cgb.indiana.edu/ and now supports projects of all sizes. The software is also freely available from the authors for local installations.
Affiliation(s)
- Zuojian Tang
- The Center for Genomics and Bioinformatics, Indiana University, Bloomington, Indiana, USA
- Jeong-Hyeon Choi
- The Center for Genomics and Bioinformatics, Indiana University, Bloomington, Indiana, USA
- Chris Hemmerich
- The Center for Genomics and Bioinformatics, Indiana University, Bloomington, Indiana, USA
- Ankita Sarangi
- The Center for Genomics and Bioinformatics, Indiana University, Bloomington, Indiana, USA
- John K Colbourne
- The Center for Genomics and Bioinformatics, Indiana University, Bloomington, Indiana, USA
- Qunfeng Dong
- The Center for Genomics and Bioinformatics, Indiana University, Bloomington, Indiana, USA
|
7
|
Scheibye-Alsing K, Hoffmann S, Frankel A, Jensen P, Stadler PF, Mang Y, Tommerup N, Gilchrist MJ, Nygård AB, Cirera S, Jørgensen CB, Fredholm M, Gorodkin J. Sequence assembly. Comput Biol Chem 2008; 33:121-36. [PMID: 19152793 DOI: 10.1016/j.compbiolchem.2008.11.003] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Received: 10/24/2008] [Revised: 11/28/2008] [Accepted: 11/28/2008] [Indexed: 01/20/2023]
Abstract
Despite the rapidly increasing number of sequenced and re-sequenced genomes, many issues regarding the computational assembly of large-scale sequencing data have remained unresolved. Computational assembly is crucial in large genome projects as well as for the evolving high-throughput technologies and plays an important role in processing the information generated by these methods. Here, we provide a comprehensive overview of the current publicly available sequence assembly programs. We describe the basic principles of computational assembly along with the main concerns, such as repetitive sequences in genomic DNA, highly expressed genes and alternative transcripts in EST sequences. We summarize existing comparisons of different assemblers and provide detailed descriptions and directions for download of assembly programs at: http://genome.ku.dk/resources/assembly/methods.html.
Affiliation(s)
- K Scheibye-Alsing
- Division of Genetics and Bioinformatics, IBHV, University of Copenhagen, Grønnegårdsvej 3, 1870 Frederiksberg C, Denmark
|
8
|
Taylor DL, Booth MG, McFarland JW, Herriott IC, Lennon NJ, Nusbaum C, Marr TG. Increasing ecological inference from high throughput sequencing of fungi in the environment through a tagging approach. Mol Ecol Resour 2008; 8:742-52. [DOI: 10.1111/j.1755-0998.2008.02094.x] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.6] [Indexed: 02/06/2023]
|
9
|
Taylor DL, Booth MG, McFarland JW, Herriott IC, Lennon NJ, Nusbaum C, Marr TG. Increasing ecological inference from high throughput sequencing of fungi in the environment through a tagging approach. Mol Ecol Resour 2008. [DOI: 10.1111/j.1471-8286.2008.02094.x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
10
|
Zhang YZ, Chen J, Nie ZM, Lü ZB, Wang D, Jiang CY, He PA, Liu LL, Lou YL, Song L, Wu XF. Expression of open reading frames in silkworm pupal cDNA library. Appl Biochem Biotechnol 2007; 136:327-43. [PMID: 17625237 DOI: 10.1007/s12010-007-9029-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 11/30/1999] [Revised: 11/30/1999] [Accepted: 11/30/1999] [Indexed: 11/24/2022]
Abstract
A cDNA library containing 2409 singletons was constructed from whole silkworm pupae (Bombyx mori). In addition, the types of genes overexpressed in pupa were analyzed. These genes contained 79 types of proteins with the exception of enzyme, mitochondrial DNA, and ribosomal protein. Also analyzed were the expression and nonexpression of open reading frame (ORF) sequences in Escherichia coli. cDNA sequences were compared to the silkworm (B. mori) genome in the GenBank database and the silkworm cDNA database, including the SilkBase and KAIKOBLAST databases, and 498 novel expressed sequence tags (ESTs) and 217 unknown ESTs were found. After comparison with all available ORF-complete mRNA sequences from the same organism (fruitfly, mosquito, and apis) in the RefSeq collection, 1659 full-length cDNAs were identified. In addition, the structure of silkworm mRNA was analyzed, and it was found that 66.8% of silkworm mRNA tailed with poly(A) contained the highly conserved AAUAAA signal, with the signal located 10-17 nucleotides upstream of the putative poly(A). Finally, the composition of nucleotides in the promoter region for all ESTs was surveyed. The results imply that the TTTTA box may possess some functions in regulating transcription and expression of some genes.
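The poly(A) signal check described in the abstract above could be sketched roughly as follows. This is an illustrative Python sketch, not the authors' method: the function names, the minimum tail length of 10 A residues, and the choice to measure the 10-17 nt distance from the end of the hexamer to the first base of the tail are all assumptions.

```python
import re


def find_polya_start(seq, min_tail=10):
    """Return the index where a terminal run of at least min_tail A's begins, or None."""
    match = re.search(r"A{%d,}$" % min_tail, seq)
    return match.start() if match else None


def has_upstream_signal(seq, signal="AATAAA", min_gap=10, max_gap=17, min_tail=10):
    """Check for the signal hexamer ending min_gap..max_gap nt before the poly(A) tail.

    The gap convention (end of hexamer to start of tail) is an assumption
    made for this sketch.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or cDNA spelling
    tail_start = find_polya_start(seq, min_tail)
    if tail_start is None:
        return False
    pos = seq.find(signal)
    while pos != -1 and pos + len(signal) <= tail_start:
        gap = tail_start - (pos + len(signal))
        if min_gap <= gap <= max_gap:
            return True
        pos = seq.find(signal, pos + 1)
    return False
```

Applying such a check to every poly(A)-tailed sequence in a library and dividing the hits by the number of tailed sequences would give a percentage comparable in kind to the 66.8% figure reported above.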
Affiliation(s)
- Yao-Zhou Zhang
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China.
|
11
|
Liang C, Wang G, Liu L, Ji G, Liu Y, Chen J, Webb JS, Reese G, Dean JFD. WebTraceMiner: a web service for processing and mining EST sequence trace files. Nucleic Acids Res 2007; 35:W137-42. [PMID: 17488839 PMCID: PMC1933163 DOI: 10.1093/nar/gkm299] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 12/05/2022] Open
Abstract
Expressed sequence tags (ESTs) remain a dominant approach for characterizing the protein-encoding portions of various genomes. Due to inherent deficiencies, they also present serious challenges for data quality control. Before GenBank submission, EST sequences are typically screened and trimmed of vector and adapter/linker sequences, as well as polyA/T tails. Removal of these sequences presents an obstacle for data validation of error-prone ESTs and impedes data mining of certain functional motifs, whose detection relies on accurate annotation of positional information for polyA tails added posttranscriptionally. As raw DNA sequence information is made increasingly available from public repositories, such as NCBI Trace Archive, new tools will be necessary to reanalyze and mine this data for new information. WebTraceMiner (www.conifergdb.org/software/wtm) was designed as a public sequence processing service for raw EST traces, with a focus on detection and mining of sequence features that help characterize 3′ and 5′ termini of cDNA inserts, including vector fragments, adapter/linker sequences, insert-flanking restriction endonuclease recognition sites and polyA or polyT tails. WebTraceMiner complements other public EST resources and should prove to be a unique tool to facilitate data validation and mining of error-prone ESTs (e.g. discovery of new functional motifs).
Affiliation(s)
- Chun Liang
- Department of Botany, Miami University, Oxford, Ohio 45056, USA.
|
12
|
|
13
|
Agca C, Ries JE, Kolath SJ, Kim JH, Forrester LJ, Antoniou E, Whitworth KM, Mathialagan N, Springer GK, Prather RS, Lucy MC. Luteinization of porcine preovulatory follicles leads to systematic changes in follicular gene expression. Reproduction 2006; 132:133-45. [PMID: 16816339 DOI: 10.1530/rep.1.01163] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.6] [Indexed: 01/01/2023]
Abstract
The LH surge initiates the luteinization of preovulatory follicles and causes hormonal and structural changes that ultimately lead to ovulation and the formation of corpora lutea. The objective of the study was to examine gene expression in ovarian follicles (n = 11) collected from pigs (Sus scrofa domestica) approaching estrus (estrogenic preovulatory follicle; n = 6 follicles from two sows) and in ovarian follicles collected from pigs on the second day of estrus (preovulatory follicles that were luteinized but had not ovulated; n = 5 follicles from two sows). The follicular status within each follicle was confirmed by follicular fluid analyses of estradiol and progesterone ratios. Microarrays were made from expressed sequence tags that were isolated from cDNA libraries of porcine ovary. Gene expression was measured by hybridization of fluorescently labeled cDNA (preovulatory estrogenic or -luteinized) to the microarray. Microarray analyses detected 107 and 43 genes whose expression was decreased or increased (respectively) during the transition from preovulatory estrogenic to -luteinized (P < 0.01). Cells within preovulatory estrogenic follicles had a gene-expression profile of proliferative and metabolically active cells that were responding to oxidative stress. Cells within preovulatory luteinized follicles had a gene-expression profile of nonproliferative and migratory cells with angiogenic properties. Approximately 40% of the discovered genes had unknown function.
Affiliation(s)
- Cansu Agca
- Department of Animal Science, University of Missouri, Columbia, Missouri 65211, USA
|
14
|
Chun CK, Scheetz TE, Bonaldo MDF, Brown B, Clemens A, Crookes-Goodson WJ, Crouch K, DeMartini T, Eyestone M, Goodson MS, Janssens B, Kimbell JL, Koropatnick TA, Kucaba T, Smith C, Stewart JJ, Tong D, Troll JV, Webster S, Winhall-Rice J, Yap C, Casavant TL, McFall-Ngai MJ, Soares MB. An annotated cDNA library of juvenile Euprymna scolopes with and without colonization by the symbiont Vibrio fischeri. BMC Genomics 2006; 7:154. [PMID: 16780587 PMCID: PMC1574308 DOI: 10.1186/1471-2164-7-154] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.9] [Received: 05/10/2006] [Accepted: 06/16/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Biologists are becoming increasingly aware that the interaction of animals, including humans, with their coevolved bacterial partners is essential for health. This growing awareness has been a driving force for the development of models for the study of beneficial animal-bacterial interactions. In the squid-vibrio model, symbiotic Vibrio fischeri induce dramatic developmental changes in the light organ of host Euprymna scolopes over the first hours to days of their partnership. We report here the creation of a juvenile light-organ specific EST database. RESULTS We generated eleven cDNA libraries from the light organ of E. scolopes at developmentally significant time points with and without colonization by V. fischeri. Single pass 3' sequencing efforts generated 42,564 expressed sequence tags (ESTs) of which 35,421 passed our quality criteria and were then clustered via the UIcluster program into 13,962 nonredundant sequences. The cDNA clones representing these nonredundant sequences were sequenced from the 5' end of the vector and 58% of these resulting sequences overlapped significantly with the associated 3' sequence to generate 8,067 contigs with an average sequence length of 1,065 bp. All sequences were annotated with BLASTX (E-value < -03) and Gene Ontology (GO). CONCLUSION Both the number of ESTs generated from each library and GO categorizations are reflective of the activity state of the light organ during these early stages of symbiosis. Future analyses of the sequences identified in these libraries promise to provide valuable information not only about pathways involved in colonization and early development of the squid light organ, but also about pathways conserved in response to bacterial colonization across the animal kingdom.
Affiliation(s)
- Carlene K Chun
- Department of Medical Microbiology and Immunology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Todd E Scheetz
- Department of Ophthalmology and Visual Science, University of Iowa, Iowa City, IA 52242, USA
- Department of Biomedical Engineering, University of Iowa, Iowa City, IA 52242, USA
- Bartley Brown
- Department of Electrical and Computer Engineering, University of Iowa, Iowa City, IA 52242, USA
- Anik Clemens
- Pacific Biomedical Research Center, Kewalo Marine Laboratory, University of Hawaii, Honolulu, HI, 96813, USA
- Wendy J Crookes-Goodson
- Department of Medical Microbiology and Immunology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Keith Crouch
- Department of Pediatrics, University of Iowa, Iowa City, IA 52242, USA
- Tad DeMartini
- Pacific Biomedical Research Center, Kewalo Marine Laboratory, University of Hawaii, Honolulu, HI, 96813, USA
- Mari Eyestone
- Department of Pediatrics, University of Iowa, Iowa City, IA 52242, USA
- Michael S Goodson
- Department of Medical Microbiology and Immunology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Bernadette Janssens
- Pacific Biomedical Research Center, Kewalo Marine Laboratory, University of Hawaii, Honolulu, HI, 96813, USA
- Jennifer L Kimbell
- Pacific Biomedical Research Center, Kewalo Marine Laboratory, University of Hawaii, Honolulu, HI, 96813, USA
- Tanya A Koropatnick
- Pacific Biomedical Research Center, Kewalo Marine Laboratory, University of Hawaii, Honolulu, HI, 96813, USA
- Tamara Kucaba
- Department of Pediatrics, University of Iowa, Iowa City, IA 52242, USA
- Christina Smith
- Children's Memorial Research Center, Northwestern University, Chicago, IL, 60614, USA
- Jennifer J Stewart
- Pacific Biomedical Research Center, Kewalo Marine Laboratory, University of Hawaii, Honolulu, HI, 96813, USA
- Deyan Tong
- Department of Medical Microbiology and Immunology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Joshua V Troll
- Department of Medical Microbiology and Immunology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Sarahrose Webster
- Department of Pediatrics, University of Iowa, Iowa City, IA 52242, USA
- Jane Winhall-Rice
- Pacific Biomedical Research Center, Kewalo Marine Laboratory, University of Hawaii, Honolulu, HI, 96813, USA
- Cory Yap
- Pacific Biomedical Research Center, Kewalo Marine Laboratory, University of Hawaii, Honolulu, HI, 96813, USA
- Thomas L Casavant
- Department of Ophthalmology and Visual Science, University of Iowa, Iowa City, IA 52242, USA
- Department of Biomedical Engineering, University of Iowa, Iowa City, IA 52242, USA
- Department of Electrical and Computer Engineering, University of Iowa, Iowa City, IA 52242, USA
- Margaret J McFall-Ngai
- Department of Medical Microbiology and Immunology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Pacific Biomedical Research Center, Kewalo Marine Laboratory, University of Hawaii, Honolulu, HI, 96813, USA
- M Bento Soares
- Department of Pediatrics, University of Iowa, Iowa City, IA 52242, USA
- Department of Biochemistry, University of Iowa, Iowa City, IA 52242, USA
- Department of Orthopaedics, University of Iowa, Iowa City, IA 52242, USA
- Physiology and Biophysics, University of Iowa, Iowa City, IA 52242, USA
- Children's Memorial Research Center, Northwestern University, Chicago, IL, 60614, USA
|
15
|
Liang C, Sun F, Wang H, Qu J, Freeman RM, Pratt LH, Cordonnier-Pratt MM. MAGIC-SPP: a database-driven DNA sequence processing package with associated management tools. BMC Bioinformatics 2006; 7:115. [PMID: 16522212 PMCID: PMC1421442 DOI: 10.1186/1471-2105-7-115] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 03/14/2005] [Accepted: 03/07/2006] [Indexed: 11/29/2022] Open
Abstract
Background Processing raw DNA sequence data is an especially challenging task for relatively small laboratories and core facilities that produce as many as 5000 or more DNA sequences per week from multiple projects in widely differing species. To meet this challenge, we have developed the flexible, scalable, and automated sequence processing package described here. Results MAGIC-SPP is a DNA sequence processing package consisting of an Oracle 9i relational database, a Perl pipeline, and user interfaces implemented either as JavaServer Pages (JSP) or as a Java graphical user interface (GUI). The database not only serves as a data repository, but also controls processing of trace files. MAGIC-SPP includes an administrative interface, a laboratory information management system, and interfaces for exploring sequences, monitoring quality control, and troubleshooting problems related to sequencing activities. In the sequence trimming algorithm it employs new features designed to improve performance with respect to concerns such as concatenated linkers, identification of the expected start position of a vector insert, and extending the useful length of trimmed sequences by bridging short regions of low quality when the following high quality segment is sufficiently long to justify doing so. Conclusion MAGIC-SPP has been designed to minimize human error, while simultaneously being robust, versatile, flexible and automated. It offers a unique combination of features that permit administration by a biologist with little or no informatics background. It is well suited to both individual research programs and core facilities.
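The idea of bridging short low-quality regions during trimming, as described in the abstract above, might look something like the following Python sketch. It is not MAGIC-SPP's actual algorithm: the function names, the quality threshold of 20, the maximum bridged gap of 3 bases, and the minimum length of 10 for the following high-quality segment are all hypothetical values chosen for illustration.

```python
def high_quality_runs(quals, threshold=20):
    """Return half-open (start, end) intervals where every score is >= threshold."""
    runs, start = [], None
    for i, q in enumerate(quals):
        if q >= threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(quals)))
    return runs


def bridge_runs(runs, max_gap=3, min_next_len=10):
    """Merge a run into the previous one when the low-quality gap is short
    and the following high-quality run is long enough to justify bridging."""
    merged = []
    for start, end in runs:
        if merged and start - merged[-1][1] <= max_gap and end - start >= min_next_len:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def trim(quals, threshold=20, max_gap=3, min_next_len=10):
    """Return the longest (start, end) segment after bridging, or (0, 0) if none."""
    merged = bridge_runs(high_quality_runs(quals, threshold), max_gap, min_next_len)
    return max(merged, key=lambda r: r[1] - r[0], default=(0, 0))
```

For example, a read with 20 good bases, a 2-base dip, and then 15 more good bases would be kept as one 37-base segment under these settings, whereas disabling bridging (`max_gap=0`) would keep only the first 20 bases.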
Affiliation(s)
- Chun Liang
- Laboratory for Genomics and Bioinformatics, Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
- Department of Botany, Miami University, Oxford, OH 45056, USA
- Feng Sun
- Laboratory for Genomics and Bioinformatics, Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
- Nanosphere, Inc., 4088 Commercial Avenue, Northbrook, IL 60062, USA
- Haiming Wang
- Laboratory for Genomics and Bioinformatics, Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
- Department of Genetics, University of Georgia, Athens, GA 30602, USA
- Junfeng Qu
- Department of Computer Science, University of Georgia, Athens, GA 30602, USA
- Robert M Freeman
- Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
- Lee H Pratt
- Laboratory for Genomics and Bioinformatics, Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
16
Abstract
Expressed sequence tag (EST) data are a major contributor to the known plant sequence space. Organization of the data into non-redundant clusters representing tentative unique genes provides snapshots of the gene repertoires of a species. This chapter reviews availability of sequences and sequence analysis results and describes several resources and tools that should facilitate broad-based utilization of EST data for gene structure annotation, gene discovery, and comparative genomics.
Affiliation(s)
- Qunfeng Dong
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa 50011-3260, USA
17
Silverstein KAT, Graham MA, Paape TD, VandenBosch KA. Genome organization of more than 300 defensin-like genes in Arabidopsis. Plant Physiology 2005; 138:600-10. [PMID: 15955924 PMCID: PMC1150381 DOI: 10.1104/pp.105.060079] [Citation(s) in RCA: 180] [Impact Index Per Article: 9.5] [Indexed: 05/03/2023]
Abstract
Defensins represent an ancient and diverse set of small, cysteine-rich, antimicrobial peptides in mammals, insects, and plants. According to published accounts, most species' genomes contain 15 to 50 defensins. Starting with a set of largely nodule-specific defensin-like sequences (DEFLs) from the model legume Medicago truncatula, we built motif models to search the near-complete Arabidopsis (Arabidopsis thaliana) genome. We identified 317 DEFLs, yet 80% were unannotated at The Arabidopsis Information Resource and had no prior evidence of expression. We demonstrate that many of these DEFL genes are clustered in the Arabidopsis genome and that individual clusters have evolved from successive rounds of gene duplication and divergent or purifying selection. Sequencing reverse transcription-PCR products from five DEFL clusters confirmed our gene predictions and verified expression. For four of the largest clusters of DEFLs, we present the first evidence of expression, most frequently in floral tissues. To determine the abundance of DEFLs in other plant families, we used our motif models to search The Institute for Genomic Research's gene indices and identified approximately 1,100 DEFLs. These expressed DEFLs were found mostly in reproductive tissues, consistent with our reverse transcription-PCR results. Sequence-based clustering of all identified DEFLs revealed separate tissue- or taxon-specific subgroups. Previously, we and others showed that more than 300 DEFL genes were expressed in M. truncatula nodules, organs not present in most plants. We have used this information to annotate the Arabidopsis genome and now provide evidence of a large DEFL superfamily present in expressed tissues of all sequenced plants.
Affiliation(s)
- Kevin A T Silverstein
- Department of Plant Biology, University of Minnesota, St. Paul, Minnesota 55108, USA
18
Scheetz TE, Laffin JJ, Berger B, Holte S, Baumes SA, Brown R, Chang S, Coco J, Conklin J, Crouch K, Donohue M, Doonan G, Estes C, Eyestone M, Fishler K, Gardiner J, Guo L, Johnson B, Keppel C, Kreger R, Lebeck M, Marcelino R, Miljkovich V, Perdue M, Qui L, Rehmann J, Reiter RS, Rhoads B, Schaefer K, Smith C, Sunjevaric I, Trout K, Wu N, Birkett CL, Bischof J, Gackle B, Gavin A, Grundstad AJ, Mokrzycki B, Moressi C, O'Leary B, Pedretti K, Roberts C, Robinson NL, Smith M, Tack D, Trivedi N, Kucaba T, Freeman T, Lin JJC, Bonaldo MF, Casavant TL, Sheffield VC, Soares MB. High-throughput gene discovery in the rat. Genome Res 2004; 14:733-41. [PMID: 15060017 PMCID: PMC383320 DOI: 10.1101/gr.1414204] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Indexed: 11/24/2022]
Abstract
The rat is an important animal model for human diseases and is widely used in physiology. In this article we present a new strategy for gene discovery based on the production of ESTs from serially subtracted and normalized cDNA libraries, and we describe its application for the development of a comprehensive nonredundant collection of rat ESTs. Our new strategy appears to yield substantially more EST clusters per ESTs sequenced than do previous approaches that did not use serial subtraction. However, multiple rounds of library subtraction resulted in high frequencies of otherwise rare internally primed cDNAs, defining the limits of this powerful approach. To date, we have generated >200,000 3' ESTs from >100 cDNA libraries representing a wide range of tissues and developmental stages of the laboratory rat. Most importantly, we have contributed to approximately 50,000 rat UniGene clusters. We have identified, arrayed, and derived 5' ESTs from >30,000 unique rat cDNA clones. Complete information, including radiation hybrid mapping data, is also maintained locally at http://genome.uiowa.edu/clcg.html. All of the sequences described in this article have been submitted to the dbEST division of the NCBI.
Affiliation(s)
- Todd E Scheetz
- Center for Bioinformatics and Computational Biology, The University of Iowa, Iowa City, Iowa 52242, USA.
19
Laffin JJS, Scheetz TE, Bonaldo MDF, Reiter RS, Chang S, Eyestone M, Abdulkawy H, Brown B, Roberts C, Tack D, Kucaba T, Lin JJC, Sheffield VC, Casavant TL, Soares MB. A comprehensive nonredundant expressed sequence tag collection for the developing Rattus norvegicus heart. Physiol Genomics 2004; 17:245-52. [PMID: 14762174 DOI: 10.1152/physiolgenomics.00186.2003] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/22/2022]
Abstract
Congenital heart defects affect ∼1,000,000 people in the United States, with 40,000 new births contributing to that number every year. A large percentage of these defects can be attributed to septal defects. We assembled a nonredundant collection of over 12,000 expressed sequence tags (ESTs) from a total of 30,000 ESTs, with the ultimate goal of identifying spatially and/or temporally regulated genes during heart septation. These ESTs were compiled from nonnormalized, normalized, and serially subtracted cDNA libraries derived from two sets of tissue samples. The first includes microdissected rat hearts from embryonic (E) days E13, E15, and E16.5–E18.5 and adult heart. The second includes hearts from embryonic days E17, E19, and E21 and postnatal (P) days P1, P12, P74, and P200. Over 6,000 novel ESTs were identified in the libraries derived from these two sets of tissues, all of which have been contributed to the NCBI rat UniGene collection. It is anticipated that such EST and cDNA clone resources will prove invaluable to gene expression studies aimed at the understanding of the molecular mechanisms underlying heart septation defects.
Affiliation(s)
- Jennifer J S Laffin
- Department of Pediatrics and Interdepartmental-Genetics Graduate Program, The University of Iowa, Iowa City, Iowa 52242, USA.
20
Scheetz TE, Zabner J, Welsh MJ, Coco J, Eyestone MDF, Bonaldo M, Kucaba T, Casavant TL, Soares MB, McCray PB. Large-scale gene discovery in human airway epithelia reveals novel transcripts. Physiol Genomics 2004; 17:69-77. [PMID: 14701920 DOI: 10.1152/physiolgenomics.00188.2003] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Indexed: 11/22/2022]
Abstract
The airway epithelium represents an important barrier between the host and the environment. It is a first site of contact with pathogens, particulates, and other stimuli, and has evolved the means to dynamically respond to these challenges. In an effort to define the transcript profile of airway epithelia, we created and sequenced cDNA libraries from cystic fibrosis (CF) and non-CF epithelia and from human lung tissue. Sequencing of these libraries produced approximately 53,000 3'-expressed sequence tags (3'-ESTs). From these, a nonredundant UniGene set of more than 19,000 sequences was generated. Despite the relatively small contribution of airway epithelia to the total mass of the lung, focused gene discovery in this tissue yielded novel results. The ESTs included several thousand transcripts (6,416) not previously identified from cDNA sequences as expressed in the lung. Among the abundant transcripts were several genes involved in host defense. Most importantly, the set also included 879 3'-ESTs that appear to be novel sequences not previously represented in the National Center for Biotechnology Information UniGene collection. This UniGene set should be useful for studies of pulmonary diseases involving the airway epithelium including cystic fibrosis, respiratory infections and asthma. It also provides a reagent for large-scale expression profiling.
Affiliation(s)
- Todd E Scheetz
- Department of Pediatrics, Roy J. and Lucille A. Carver College of Medicine, University of Iowa, Iowa City, Iowa 52242, USA
21
Tuggle CK, Green JA, Fitzsimmons C, Woods R, Prather RS, Malchenko S, Soares BM, Kucaba T, Crouch K, Smith C, Tack D, Robinson N, O'Leary B, Scheetz T, Casavant T, Pomp D, Edeal BJ, Zhang Y, Rothschild MF, Garwood K, Beavis W. EST-based gene discovery in pig: virtual expression patterns and comparative mapping to human. Mamm Genome 2003; 14:565-79. [PMID: 12925889 DOI: 10.1007/s00335-002-2263-7] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.0] [Received: 01/03/2003] [Accepted: 04/03/2003] [Indexed: 10/26/2022]
Abstract
A molecular understanding of porcine reproduction is of biological interest and economic importance. Our Midwest Consortium has produced cDNA libraries containing the majority of genes expressed in major female reproductive tissues, and we have deposited into public databases 21,499 expressed sequence tag (EST) gene sequences from the 3' end of clones from these libraries. These sequences represent 10,574 different genes, based on sequence comparison among these data, and comparison with existing porcine ESTs and genes indicates as many as 4652 of these EST clusters are novel. In silico analysis identified sequences that are expressed in specific pig tissues or organs and confirmed the broad expression in pig for many genes ubiquitously expressed in human tissues. Furthermore, we have developed computer software to identify sequence similarity of these pig genes with their human counterparts, and to extract the mapping information of these human homologues from genome databases. We demonstrate the utility of this software for comparative mapping by localizing 61 genes on the porcine physical map for Chromosomes (Chrs) 5, 10, and 14.
Affiliation(s)
- Christopher K Tuggle
- Center for Integrated Animal Genomics and Department of Animal Science, Iowa State University, 2255 Kildee Hall, Ames, Iowa 50011, USA.