Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Hazelhurst S, Hide W, Lipták Z, Nogueira R, Starfield R. An overview of the wcd EST clustering tool. Bioinformatics 2008;24:1542-6. [PMID: 18480101 PMCID: PMC2718666 DOI: 10.1093/bioinformatics/btn203] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Indexed: 11/23/2022]

Cited by Other Article(s)

Luan T, Muralidharan HS, Alshehri M, Mittra I, Pop M. SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets. Nucleic Acids Res 2023;51:e46. [PMID: 36912074 PMCID: PMC10164572 DOI: 10.1093/nar/gkad158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2022] [Revised: 02/01/2023] [Accepted: 02/28/2023] [Indexed: 03/14/2023] Open

Rahnama M, Wang B, Dostart J, Novikova O, Yackzan D, Yackzan A, Bruss H, Baker M, Jacob H, Zhang X, Lamb A, Stewart A, Heist M, Hoover J, Calie P, Chen L, Liu J, Farman ML. Telomere Roles in Fungal Genome Evolution and Adaptation. Front Genet 2021;12:676751. [PMID: 34434216 PMCID: PMC8381367 DOI: 10.3389/fgene.2021.676751] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/05/2021] [Accepted: 06/28/2021] [Indexed: 11/27/2022] Open

Abstract

Telomeres form the ends of linear chromosomes and usually comprise protein complexes that bind to simple repeated sequence motifs that are added to the 3′ ends of DNA by the telomerase reverse transcriptase (TERT). One of the primary functions attributed to telomeres is to solve the “end-replication problem” which, if left unaddressed, would cause gradual, inexorable attrition of sequences from the chromosome ends and, eventually, loss of viability. Telomere-binding proteins also protect the chromosome from 5′ to 3′ exonuclease action, and disguise the chromosome ends from the double-strand break repair machinery whose illegitimate action potentially generates catastrophic chromosome aberrations. Telomeres are of special interest in the blast fungus, Pyricularia, because the adjacent regions are enriched in genes controlling interactions with host plants, and the chromosome ends show enhanced polymorphism and genetic instability. Previously, we showed that telomere instability in some P. oryzae strains is caused by novel retrotransposons (MoTeRs) that insert in telomere repeats, generating interstitial telomere sequences that drive frequent, break-induced rearrangements. Here, we sought to gain further insight on telomeric involvement in shaping Pyricularia genome architecture by characterizing sequence polymorphisms at chromosome ends, and surrounding internalized MoTeR loci (relics) and interstitial telomere repeats. This provided evidence that telomere dynamics have played historical, and likely ongoing, roles in shaping the Pyricularia genome. We further demonstrate that even telomeres lacking MoTeR insertions are poorly preserved, such that the telomere-adjacent sequences exhibit frequent presence/absence polymorphism, as well as exchanges with the genome interior. 
Using TERT knockout experiments, we characterized chromosomal responses to failed telomere maintenance which suggested that much of the MoTeR relic-/interstitial telomere-associated polymorphism could be driven by compromised telomere function. Finally, we describe three possible examples of a phenomenon known as “Adaptive Telomere Failure,” where spontaneous losses of telomere maintenance drive rapid accumulation of sequence polymorphism with possible adaptive advantages. Together, our data suggest that telomere maintenance is frequently compromised in Pyricularia but the chromosome alterations resulting from telomere failure are not as catastrophic as prior research would predict, and may, in fact, be potent drivers of adaptive polymorphism.

Affiliation(s)

Mostafa Rahnama Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
Baohua Wang Department of Plant Pathology, University of Kentucky, Lexington, KY, United States; State Key Laboratory for Ecological Pest Control of Fujian and Taiwan Crops, College of Plant Protection, Fujian Agriculture and Forestry University, Fuzhou, China
Jane Dostart Department of Biological Sciences, Eastern Kentucky University, Richmond, KY, United States
Olga Novikova Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
Daniel Yackzan Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
Andrew Yackzan Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
Haley Bruss Department of Biological Sciences, Eastern Kentucky University, Richmond, KY, United States
Maray Baker Department of Biological Sciences, Eastern Kentucky University, Richmond, KY, United States
Haven Jacob Department of Biological Sciences, Eastern Kentucky University, Richmond, KY, United States
Xiaofei Zhang Department of Computer Sciences, University of Kentucky, Lexington, KY, United States
April Lamb Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
Alex Stewart Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
Melanie Heist Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
Joey Hoover Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
Patrick Calie Department of Biological Sciences, Eastern Kentucky University, Richmond, KY, United States
Li Chen Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
Jinze Liu Department of Computer Sciences, University of Kentucky, Lexington, KY, United States
Mark L Farman Department of Plant Pathology, University of Kentucky, Lexington, KY, United States

James BT, Luczak BB, Girgis HZ. MeShClust: an intelligent tool for clustering DNA sequences. Nucleic Acids Res 2019;46:e83. [PMID: 29718317 PMCID: PMC6101578 DOI: 10.1093/nar/gky315] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 02/04/2018] [Accepted: 04/13/2018] [Indexed: 11/13/2022] Open

Mbandi SK, Hesse U, van Heusden P, Christoffels A. Inferring bona fide transfrags in RNA-Seq derived-transcriptome assemblies of non-model organisms. BMC Bioinformatics 2015;16:58. [PMID: 25880035 PMCID: PMC4344733 DOI: 10.1186/s12859-015-0492-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 06/30/2014] [Accepted: 02/06/2015] [Indexed: 11/19/2022] Open

Abstract

BACKGROUND

De novo transcriptome assembly of short transcribed fragments (transfrags) produced from sequencing-by-synthesis technologies often results in redundant datasets with differing levels of unassembled, partially assembled or mis-assembled transcripts. Post-assembly processing intended to reduce redundancy typically involves reassembly or clustering of assembled sequences. However, these approaches are mostly based on common word heuristics and often create clusters of biologically unrelated sequences, resulting in loss of unique transfrags annotations and propagation of mis-assemblies.

RESULTS

Here, we propose a structured framework that consists of a few steps in pipeline architecture for Inferring Functionally Relevant Assembly-derived Transcripts (IFRAT). IFRAT combines 1) removal of identical subsequences, 2) error tolerant CDS prediction, 3) identification of coding potential, and 4) complements BLAST with a multiple domain architecture annotation that reduces non-specific domain annotation. We demonstrate that independent of the assembler, IFRAT selects bona fide transfrags (with CDS and coding potential) from the transcriptome assembly of a model organism without relying on post-assembly clustering or reassembly. The robustness of IFRAT is inferred on RNA-Seq data of Neurospora crassa assembled using de Bruijn graph-based assemblers, in single (Trinity and Oases-25) and multiple (Oases-Merge and additive or pooled) k-mer modes. Single k-mer assemblies contained fewer transfrags compared to the multiple k-mer assemblies. However, Trinity identified a comparable number of predicted coding sequence and gene loci to Oases pooled assembly. IFRAT selects bona fide transfrags representing over 94% of cumulative BLAST-derived functional annotations of the unfiltered assemblies. Between 4-6% are lost when orphan transfrags are excluded and this represents only a tiny fraction of annotation derived from functional transference by sequence similarity. The median length of bona fide transfrags ranged from 1.5kb (Trinity) to 2kb (Oases), which is consistent with the average coding sequence length in fungi. The fraction of transfrags that could be associated with gene ontology terms ranged from 33-50%, which is also high for domain based annotation. We showed that unselected transfrags were mostly truncated and represent sequences from intronic, untranslated (5' and 3') regions and non-coding gene loci.

CONCLUSIONS

IFRAT simplifies post-assembly processing, providing a reference transcriptome enriched with functionally relevant assembly-derived transcripts for non-model organisms.

Bevilacqua V, Pietroleonardo N, Giannino E, Stroppa F, Simone D, Pesole G, Picardi E. EasyCluster2: an improved tool for clustering and assembling long transcriptome reads. BMC Bioinformatics 2014;15 Suppl 15:S7. [PMID: 25474441 PMCID: PMC4271567 DOI: 10.1186/1471-2105-15-s15-s7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Expressed sequences (e.g. ESTs) are a strong source of evidence to improve gene structures and predict reliable alternative splicing events. When a genome assembly is available, ESTs are suitable to generate gene-oriented clusters through the well-established EasyCluster software. Nowadays, EST-like sequences can be massively produced using Next Generation Sequencing (NGS) technologies. In order to handle genome-scale transcriptome data, we present here EasyCluster2, a reimplementation of EasyCluster able to speed up the creation of gene-oriented clusters and facilitate downstream analyses as the assembly of full-length transcripts and the detection of splicing isoforms.

RESULTS

EasyCluster2 has been developed to facilitate the genome-based clustering of EST-like sequences generated through the NGS 454 technology. Reads mapped onto the reference genome can be uploaded using the standard GFF3 file format. Alignment parsing is initially performed to produce a first collection of pseudo-clusters by grouping reads according to the overlap of their genomic coordinates on the same strand. EasyCluster2 then refines read grouping by including in each cluster only reads sharing at least one splice site and optionally performs a Smith-Waterman alignment in the region surrounding splice sites in order to correct for potential alignment errors. In addition, EasyCluster2 can include unspliced reads, which generally account for >50% of 454 datasets, and collapses overlapping clusters. Finally, EasyCluster2 can assemble full-length transcripts using a Directed-Acyclic-Graph-based strategy, simplifying the identification of alternative splicing isoforms, thanks also to the implementation of the widespread AStalavista methodology. Accuracy and performances have been tested on real as well as simulated datasets.

CONCLUSIONS

EasyCluster2 represents a unique tool to cluster and assemble transcriptome reads produced with 454 technology, as well as ESTs and full-length transcripts. The clustering procedure is enhanced with the employment of genome annotations and unspliced reads. Overall, EasyCluster2 is able to perform an effective detection of splicing isoforms, since it can refine exon-exon junctions and explore alternative splicing without known reference transcripts. Results in GFF3 format can be browsed in the UCSC Genome Browser. Therefore, EasyCluster2 is a powerful tool to generate reliable clusters for gene expression studies, facilitating the analysis also to researchers not skilled in bioinformatics.

Li X, Gao W, Guo H, Zhang X, Fang DD, Lin Z. Development of EST-based SNP and InDel markers and their utilization in tetraploid cotton genetic mapping. BMC Genomics 2014;15:1046. [PMID: 25442170 PMCID: PMC4265408 DOI: 10.1186/1471-2164-15-1046] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 06/01/2014] [Accepted: 11/14/2014] [Indexed: 12/19/2022] Open

Abstract

Background

Availability of molecular markers has proven to be an efficient tool in facilitating progress in plant breeding, which is particularly important in the case of less researched crops such as cotton. Considering the obvious advantages of single nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (InDels), expressed sequence tags (ESTs) were analyzed in silico to identify SNPs and InDels in this study, aiming to develop more molecular markers in cotton.

Results

A total of 1,349 EST-based SNP and InDel markers were developed by comparing ESTs between Gossypium hirsutum and G. barbadense, mining G. hirsutum unigenes, and analyzing 3′ untranslated region (3′UTR) sequences. The marker polymorphisms were investigated using the two parents of the mapping population based on the single-strand conformation polymorphism (SSCP) analysis. Of all the markers, 137 (10.16%) were polymorphic, and revealed 142 loci. Linkage analysis using a BC₁ population mapped 133 loci on the 26 chromosomes. Statistical analysis of base variations in SNPs showed that base transitions accounted for 55.78% of the total base variations and gene ontology indicated that cotton genes varied greatly in harboring SNPs ranging from 1.00 to 24.00 SNPs per gene. Sanger sequencing of three randomly selected SNP markers revealed discrepancy between the in silico predicted sequences and the actual sequencing results.

Conclusions

In silico analysis is a double-edged sword for developing EST-SNP/InDel markers. On the one hand, the designed markers can be used effectively in tetraploid cotton genetic mapping, and they help reveal the transition preference and SNP frequency of cotton genes. On the other hand, the efficiency of marker development and the polymorphism of the designed primers are comparatively low.

Rupp O, Becker J, Brinkrolf K, Timmermann C, Borth N, Pühler A, Noll T, Goesmann A. Construction of a public CHO cell line transcript database using versatile bioinformatics analysis pipelines. PLoS One 2014;9:e85568. [PMID: 24427317 PMCID: PMC3888431 DOI: 10.1371/journal.pone.0085568] [Citation(s) in RCA: 56] [Impact Index Per Article: 5.6] [Received: 10/01/2013] [Accepted: 12/03/2013] [Indexed: 11/19/2022] Open

Abstract

Chinese hamster ovary (CHO) cell lines represent the most commonly used mammalian expression system for the production of therapeutic proteins. In this context, detailed knowledge of the CHO cell transcriptome might help to improve biotechnological processes conducted by specific cell lines. Nevertheless, very few assembled cDNA sequences of CHO cells were publicly released until recently, which puts a severe limitation on biotechnological research. Two extended annotation systems and web-based tools, one for browsing eukaryotic genomes (GenDBE) and one for viewing eukaryotic transcriptomes (SAMS), were established as the first step towards a publicly usable CHO cell genome/transcriptome analysis platform. This is complemented by the development of a new strategy to assemble the ca. 100 million reads, sequenced from a broad range of diverse transcripts, to a high quality CHO cell transcript set. The cDNA libraries were constructed from different CHO cell lines grown under various culture conditions and sequenced using Roche/454 and Illumina sequencing technologies in addition to sequencing reads from a previous study. Two pipelines to extend and improve the CHO cell line transcripts were established. First, de novo assemblies were carried out with the Trinity and Oases assemblers, using varying k-mer sizes. The resulting contigs were screened for potential CDS using ESTScan. Redundant contigs were filtered out using cd-hit-est. The remaining CDS contigs were re-assembled with CAP3. Second, a reference-based assembly with the TopHat/Cufflinks pipeline was performed, using the recently published draft genome sequence of CHO-K1 as reference. Additionally, the de novo contigs were mapped to the reference genome using GMAP and merged with the Cufflinks assembly using the cuffmerge software. With this approach 28,874 transcripts located on 16,492 gene loci could be assembled. 
Combining the results of both approaches, 65,561 transcripts were identified for CHO cell lines, which could be clustered by sequence identity into 17,598 gene clusters.

Looso M, Preussner J, Sousounis K, Bruckskotten M, Michel CS, Lignelli E, Reinhardt R, Höffner S, Krüger M, Tsonis PA, Borchardt T, Braun T. A de novo assembly of the newt transcriptome combined with proteomic validation identifies new protein families expressed during tissue regeneration. Genome Biol 2013;14:R16. [PMID: 23425577 PMCID: PMC4054090 DOI: 10.1186/gb-2013-14-2-r16] [Citation(s) in RCA: 87] [Impact Index Per Article: 7.9] [Received: 09/27/2012] [Accepted: 02/20/2013] [Indexed: 11/12/2022] Open

Abstract

Background

Notophthalmus viridescens, a urodelian amphibian, represents an excellent model organism to study regenerative processes, but mechanistic insights into molecular processes driving regeneration have been hindered by a paucity and poor annotation of coding nucleotide sequences. The enormous genome size and the lack of a closely related reference genome have so far prevented assembly of the urodelian genome.

Results

We describe the de novo assembly of the transcriptome of the newt Notophthalmus viridescens and its experimental validation. RNA pools covering embryonic and larval development, different stages of heart, appendage and lens regeneration, as well as a collection of different undamaged tissues were used to generate sequencing datasets on Sanger, Illumina and 454 platforms. Through a sequential de novo assembly strategy, hybrid datasets were converged into one comprehensive transcriptome comprising 120,922 non-redundant transcripts with an N50 of 975. From this, 38,384 putative transcripts were annotated and around 15,000 transcripts were experimentally validated as protein coding by mass spectrometry-based proteomics. Bioinformatical analysis of coding transcripts identified 826 proteins specific for urodeles. Several newly identified proteins establish novel protein families based on the presence of new sequence motifs without counterparts in public databases, while others containing known protein domains extend already existing families and also constitute new ones.

Conclusions

We demonstrate that our multistep assembly approach allows de novo assembly of the newt transcriptome with an annotation grade comparable to well characterized organisms. Our data provide the groundwork for mechanistic experiments to answer the question whether urodeles utilize proprietary sets of genes for tissue regeneration.

Passos MAN, de Cruz VO, Emediato FL, de Teixeira CC, Azevedo VCR, Brasileiro ACM, Amorim EP, Ferreira CF, Martins NF, Togawa RC, Pappas GJ, da Silva OB, Miller RNG. Analysis of the leaf transcriptome of Musa acuminata during interaction with Mycosphaerella musicola: gene assembly, annotation and marker development. BMC Genomics 2013;14:78. [PMID: 23379821 PMCID: PMC3635893 DOI: 10.1186/1471-2164-14-78] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 07/16/2012] [Accepted: 02/01/2013] [Indexed: 11/21/2022] Open

Abstract

BACKGROUND

Although banana (Musa sp.) is an important edible crop, contributing towards poverty alleviation and food security, limited transcriptome datasets are available for use in accelerated molecular-based breeding in this genus. 454 GS-FLX Titanium technology was employed to determine the sequence of gene transcripts in genotypes of Musa acuminata ssp. burmannicoides Calcutta 4 and M. acuminata subgroup Cavendish cv. Grande Naine, contrasting in resistance to the fungal pathogen Mycosphaerella musicola, causal organism of Sigatoka leaf spot disease. To enrich for transcripts under biotic stress responses, full length-enriched cDNA libraries were prepared from whole plant leaf materials, both uninfected and artificially challenged with pathogen conidiospores.

RESULTS

The study generated 846,762 high quality sequence reads, with an average length of 334 bp and totalling 283 Mbp. De novo assembly generated 36,384 and 35,269 unigene sequences for M. acuminata Calcutta 4 and Cavendish Grande Naine, respectively. A total of 64.4% of the unigenes were annotated through Basic Local Alignment Search Tool (BLAST) similarity analyses against public databases. Assembled sequences were functionally mapped to Gene Ontology (GO) terms, with unigene functions covering a diverse range of molecular functions, biological processes and cellular components. Genes from a number of defense-related pathways were observed in transcripts from each cDNA library. Over 99% of contig unigenes mapped to exon regions in the reference M. acuminata DH Pahang whole genome sequence. A total of 4068 genic-SSR loci were identified in Calcutta 4 and 4095 in Cavendish Grande Naine. A subset of 95 potential defense-related gene-derived simple sequence repeat (SSR) loci were validated for specific amplification and polymorphism across M. acuminata accessions. Fourteen loci were polymorphic, with alleles per polymorphic locus ranging from 3 to 8 and polymorphism information content ranging from 0.34 to 0.82.

CONCLUSIONS

A large set of unigenes were characterized in this study for both M. acuminata Calcutta 4 and Cavendish Grande Naine, increasing the number of public domain Musa ESTs. This transcriptome is an invaluable resource for furthering our understanding of biological processes elicited during biotic stresses in Musa. Gene-based markers will facilitate molecular breeding strategies, forming the basis of genetic linkage mapping and analysis of quantitative trait loci.

Affiliation(s)

Marco A N Passos Universidade de Brasília, Campus Universitário Darcy Ribeiro, Instituto de Ciências Biológicas, Departamento de Biologia Celular, CEP 70.910-900, Brasília, D.F, Brazil
Viviane Oliveira de Cruz Universidade de Brasília, Campus Universitário Darcy Ribeiro, Instituto de Ciências Biológicas, Departamento de Biologia Celular, CEP 70.910-900, Brasília, D.F, Brazil
Flavia L Emediato Universidade de Brasília, Campus Universitário Darcy Ribeiro, Instituto de Ciências Biológicas, Departamento de Biologia Celular, CEP 70.910-900, Brasília, D.F, Brazil
Cristiane Camargo de Teixeira Universidade Católica de Brasília, SGAN 916, Módulo B, CEP 70.790-160, Brasília, D.F, Brazil
Vânia C Rennó Azevedo EMBRAPA Recursos Genéticos e Biotecnologia, Parque Estação Biológica, CP 02372, CEP 70.770-900, Brasília, D.F, Brazil
Ana C M Brasileiro EMBRAPA Recursos Genéticos e Biotecnologia, Parque Estação Biológica, CP 02372, CEP 70.770-900, Brasília, D.F, Brazil
Edson P Amorim EMBRAPA Mandioca e Fruticultura Tropical, Rua Embrapa, CEP 44.380-000, Cruz das Almas, BA, Brazil
Claudia F Ferreira EMBRAPA Mandioca e Fruticultura Tropical, Rua Embrapa, CEP 44.380-000, Cruz das Almas, BA, Brazil
Natalia F Martins EMBRAPA Recursos Genéticos e Biotecnologia, Parque Estação Biológica, CP 02372, CEP 70.770-900, Brasília, D.F, Brazil
Roberto C Togawa EMBRAPA Recursos Genéticos e Biotecnologia, Parque Estação Biológica, CP 02372, CEP 70.770-900, Brasília, D.F, Brazil
Georgios J Pappas Universidade de Brasília, Campus Universitário Darcy Ribeiro, Instituto de Ciências Biológicas, Departamento de Biologia Celular, CEP 70.910-900, Brasília, D.F, Brazil
Orzenil Bonfim da Silva EMBRAPA Recursos Genéticos e Biotecnologia, Parque Estação Biológica, CP 02372, CEP 70.770-900, Brasília, D.F, Brazil
Robert NG Miller Universidade de Brasília, Campus Universitário Darcy Ribeiro, Instituto de Ciências Biológicas, Departamento de Biologia Celular, CEP 70.910-900, Brasília, D.F, Brazil

Molnár I, Lopez D, Wisecaver JH, Devarenne TP, Weiss TL, Pellegrini M, Hackett JD. Bio-crude transcriptomics: gene discovery and metabolic network reconstruction for the biosynthesis of the terpenome of the hydrocarbon oil-producing green alga, Botryococcus braunii race B (Showa). BMC Genomics 2012;13:576. [PMID: 23110428 PMCID: PMC3533583 DOI: 10.1186/1471-2164-13-576] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Received: 04/13/2012] [Accepted: 10/19/2012] [Indexed: 12/16/2022] Open

Abstract

Background

Microalgae hold promise for yielding a biofuel feedstock that is sustainable, carbon-neutral, distributed, and only minimally disruptive for the production of food and feed by traditional agriculture. Amongst oleaginous eukaryotic algae, the B race of Botryococcus braunii is unique in that it produces large amounts of liquid hydrocarbons of terpenoid origin. These are comparable to fossil crude oil, and are sequestered outside the cells in a communal extracellular polymeric matrix material. Biosynthetic engineering of terpenoid bio-crude production requires identification of genes and reconstruction of metabolic pathways responsible for production of both hydrocarbons and other metabolites of the alga that compete for photosynthetic carbon and energy.

Results

A de novo assembly of 1,334,609 next-generation pyrosequencing reads from the Showa strain of the B race of B. braunii yielded a transcriptomic database of 46,422 contigs with an average length of 756 bp. Contigs were annotated with pathway, ontology, and protein domain identifiers. Manual curation allowed the reconstruction of pathways that produce terpenoid liquid hydrocarbons from primary metabolites, and pathways that divert photosynthetic carbon into tetraterpenoid carotenoids, diterpenoids, and the prenyl chains of meroterpenoid quinones and chlorophyll. Inventories of machine-assembled contigs are also presented for reconstructed pathways for the biosynthesis of competing storage compounds including triacylglycerol and starch. Regeneration of S-adenosylmethionine, and the extracellular localization of the hydrocarbon oils by active transport and possibly autophagy are also investigated.

Conclusions

The construction of an annotated transcriptomic database, publicly available in a web-based data depository and annotation tool, provides a foundation for metabolic pathway and network reconstruction, and facilitates further omics studies in the absence of a genome sequence for the Showa strain of B. braunii, race B. Further, the transcriptome database empowers future biosynthetic engineering approaches for strain improvement and the transfer of desirable traits to heterologous hosts.

Ng KH, Ho CK, Phon-Amnuaisuk S. A hybrid distance measure for clustering expressed sequence tags originating from the same gene family. PLoS One 2012;7:e47216. [PMID: 23071763 PMCID: PMC3469558 DOI: 10.1371/journal.pone.0047216] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 06/15/2012] [Accepted: 09/10/2012] [Indexed: 01/22/2023] Open

Xu R, Wunsch DC. Clustering algorithms in biomedical research: a review. IEEE Rev Biomed Eng 2012;3:120-54. [PMID: 22275205 DOI: 10.1109/rbme.2010.2083647] [Citation(s) in RCA: 120] [Impact Index Per Article: 10.0] [Indexed: 11/10/2022]

Hackett JD, Wisecaver JH, Brosnahan ML, Kulis DM, Anderson DM, Bhattacharya D, Plumley FG, Erdner DL. Evolution of saxitoxin synthesis in cyanobacteria and dinoflagellates. Mol Biol Evol 2012;30:70-8. [PMID: 22628533 DOI: 10.1093/molbev/mss142] [Citation(s) in RCA: 128] [Impact Index Per Article: 10.7] [Indexed: 11/14/2022] Open

Abstract

Dinoflagellates produce a variety of toxic secondary metabolites that have a significant impact on marine ecosystems and fisheries. Saxitoxin (STX), the cause of paralytic shellfish poisoning, is produced by three marine dinoflagellate genera and is also made by some freshwater cyanobacteria. Genes involved in STX synthesis have been identified in cyanobacteria but are yet to be reported in the massive genomes of dinoflagellates. We have assembled comprehensive transcriptome data sets for several STX-producing dinoflagellates and a related non-toxic species and have identified 265 putative homologs of 13 cyanobacterial STX synthesis genes, including all of the genes directly involved in toxin synthesis. Putative homologs of four proteins group closely in phylogenies with cyanobacteria and are likely the functional homologs of sxtA, sxtG, and sxtB in dinoflagellates. However, the phylogenies do not support the transfer of these genes directly between toxic cyanobacteria and dinoflagellates. SxtA is split into two proteins in the dinoflagellates corresponding to the N-terminal portion containing the methyltransferase and acyl carrier protein domains and a C-terminal portion with the aminotransferase domain. Homologs of sxtB and N-terminal sxtA are present in non-toxic strains, suggesting their functions may not be limited to saxitoxin production. Only homologs of the C-terminus of sxtA and sxtG were found exclusively in toxic strains. A more thorough survey of STX+ dinoflagellates will be needed to determine if these two genes may be specific to STX production in dinoflagellates. The A. tamarense transcriptome does not contain homologs for the remaining STX genes. Nevertheless, we identified candidate genes with similar predicted biochemical activities that account for the missing functions.
These results suggest that the STX synthesis pathway was likely assembled independently in the distantly related cyanobacteria and dinoflagellates, although using some evolutionarily related proteins. The biological role of STX is not well understood in either cyanobacteria or dinoflagellates. However, STX production in these two ecologically distinct groups of organisms suggests that this toxin confers a benefit to producers that we do not yet fully understand.

Dlugosch KM, Bonin A. Allele identification in assembled genomic sequence datasets. Methods Mol Biol 2012;888:197-211. [PMID: 22665283 DOI: 10.1007/978-1-61779-870-2_12] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/01/2023]

Hazelhurst S, Lipták Z. KABOOM! A new suffix array based algorithm for clustering expression data. Bioinformatics 2011;27:3348-55. [PMID: 21984769 DOI: 10.1093/bioinformatics/btr560] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/14/2022]

Zenoni S, D'Agostino N, Tornielli GB, Quattrocchio F, Chiusano ML, Koes R, Zethof J, Guzzo F, Delledonne M, Frusciante L, Gerats T, Pezzotti M. Revealing impaired pathways in the an11 mutant by high-throughput characterization of Petunia axillaris and Petunia inflata transcriptomes. Plant J 2011;68:11-27. [PMID: 21623977 DOI: 10.1111/j.1365-313x.2011.04661.x] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Indexed: 05/20/2023]

Jones DB, Zenger KR, Jerry DR. In silico whole-genome EST analysis reveals 2322 novel microsatellites for the silver-lipped pearl oyster, Pinctada maxima. Mar Genomics 2011;4:287-90. [PMID: 22118641 DOI: 10.1016/j.margen.2011.06.007] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 04/18/2011] [Revised: 06/24/2011] [Accepted: 06/28/2011] [Indexed: 12/01/2022]

Bao E, Jiang T, Kaloshian I, Girke T. SEED: efficient clustering of next-generation sequences. Bioinformatics 2011;27:2502-9. [PMID: 21810899 PMCID: PMC3167058 DOI: 10.1093/bioinformatics/btr447] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Indexed: 11/15/2022]

Abstract

Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.

Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60–85% and 21–41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12–27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.

Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed.

Contact: thomas.girke@ucr.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
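
The clustering idea in the Results paragraph above (hash-table lookup of candidates, then grouping reads within three mismatches of a center) can be illustrated with a small sketch. This is not SEED's block spaced seed method: it swaps in a plain pigeonhole filter (two equal-length reads with at most k mismatches must share at least one of k+1 aligned blocks), uses real reads as centers instead of virtual ones, and ignores overhanging residues. The function names `blocks` and `cluster_reads` are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def blocks(read, n_blocks):
    """Split a read into n_blocks aligned pieces, tagged with their block index."""
    size = len(read) // n_blocks
    pieces = []
    for i in range(n_blocks):
        start = i * size
        end = len(read) if i == n_blocks - 1 else start + size
        pieces.append((i, read[start:end]))
    return pieces


def cluster_reads(reads, max_mismatches=3):
    """Greedy clustering: each unassigned read becomes a center and collects
    every unassigned read of the same length within max_mismatches of it."""
    n_blocks = max_mismatches + 1  # pigeonhole: one block must match exactly
    index = defaultdict(list)      # (length, block index, block text) -> read ids
    for idx, read in enumerate(reads):
        for key in blocks(read, n_blocks):
            index[(len(read),) + key].append(idx)

    assigned = [False] * len(reads)
    clusters = []
    for center_idx, center in enumerate(reads):
        if assigned[center_idx]:
            continue
        assigned[center_idx] = True
        members = [center_idx]
        candidates = set()
        for key in blocks(center, n_blocks):
            candidates.update(index[(len(center),) + key])
        # Verify each candidate, since sharing a block is necessary, not sufficient.
        for cand in sorted(candidates):
            if not assigned[cand] and hamming(center, reads[cand]) <= max_mismatches:
                assigned[cand] = True
                members.append(cand)
        clusters.append(members)
    return clusters
```

Because the first unassigned read always becomes the center, the result depends on input order; that is a property of this simplified sketch, not a claim about SEED.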

Grattapaglia D, Silva-Junior OB, Kirst M, de Lima BM, Faria DA, Pappas GJ. High-throughput SNP genotyping in the highly heterozygous genome of Eucalyptus: assay success, polymorphism and transferability across species. BMC Plant Biol 2011;11:65. [PMID: 21492434 PMCID: PMC3090336 DOI: 10.1186/1471-2229-11-65] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Received: 12/21/2010] [Accepted: 04/14/2011] [Indexed: 05/17/2023]

Abstract

BACKGROUND

High-throughput SNP genotyping has become an essential requirement for molecular breeding and population genomics studies in plant species. Large scale SNP developments have been reported for several mainstream crops. A growing interest now exists to expand the speed and resolution of genetic analysis to outbred species with highly heterozygous genomes. When nucleotide diversity is high, a refined diagnosis of the target SNP sequence context is needed to convert queried SNPs into high-quality genotypes using the Golden Gate Genotyping Technology (GGGT). This issue becomes exacerbated when attempting to transfer SNPs across species, a scarcely explored topic in plants, and likely to become significant for population genomics and interspecific breeding applications in less domesticated and less funded plant genera.

RESULTS

We have successfully developed the first set of 768 SNPs assayed by the GGGT for the highly heterozygous genome of Eucalyptus from a mixed Sanger/454 database with 1,164,695 ESTs and the preliminary 4.5X draft genome sequence for E. grandis. A systematic assessment of in silico SNP filtering requirements showed that stringent constraints on the SNP surrounding sequences have a significant impact on SNP genotyping performance and polymorphism. SNP assay success was high for the 288 SNPs selected with more rigorous in silico constraints; 93% of them provided high quality genotype calls and 71% of them were polymorphic in a diverse panel of 96 individuals of five different species. SNP reliability was high across nine Eucalyptus species belonging to three sections within subgenus Symphomyrtus and still satisfactory across species of two additional subgenera, although polymorphism declined as phylogenetic distance increased.

CONCLUSIONS

This study indicates that the GGGT performs well both within and across species of Eucalyptus notwithstanding its nucleotide diversity ≥ 2%. The development of a much larger array of informative SNPs across multiple Eucalyptus species is feasible, although strongly dependent on having a representative and sufficiently deep collection of sequences from many individuals of each target species. A higher density SNP platform will be instrumental to undertake genome-wide phylogenetic and population genomics studies and to implement molecular breeding by Genomic Selection in Eucalyptus.

Rao DM, Moler JC, Ozden M, Zhang Y, Liang C, Karro JE. PEACE: Parallel Environment for Assembly and Clustering of Gene Expression. Nucleic Acids Res 2010;38:W737-42. [PMID: 20522511 PMCID: PMC2896108 DOI: 10.1093/nar/gkq470] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 12/04/2022] Open

D'Agostino N, Traini A, Frusciante L, Chiusano ML. SolEST database: a "one-stop shop" approach to the study of Solanaceae transcriptomes. BMC Plant Biol 2009;9:142. [PMID: 19948013 PMCID: PMC2794286 DOI: 10.1186/1471-2229-9-142] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 07/31/2009] [Accepted: 11/30/2009] [Indexed: 05/21/2023]

Bragg LM, Stone G. k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage. Bioinformatics 2009;25:2302-8. [PMID: 19570806 PMCID: PMC2735666 DOI: 10.1093/bioinformatics/btp410] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022] Open

Picardi E, Mignone F, Pesole G. EasyCluster: a fast and efficient gene-oriented clustering tool for large-scale transcriptome data. BMC Bioinformatics 2009;10 Suppl 6:S10. [PMID: 19534735 PMCID: PMC2697633 DOI: 10.1186/1471-2105-10-s6-s10] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 01/31/2023] Open

Abstract

Background

ESTs and full-length cDNAs represent an invaluable source of evidence for inferring reliable gene structures and discovering potential alternative splicing events. In newly sequenced genomes, these tasks may not be practicable owing to the lack of appropriate training sets. However, when expression data are available, they can be used to build EST clusters related to specific genomic transcribed loci. Common strategies recently employed to this end are based on sequence similarity between transcripts and can lead, in specific conditions, to inconsistent and erroneous clustering. In order to improve the cluster building and facilitate all downstream annotation analyses, we developed a simple genome-based methodology to generate gene-oriented clusters of ESTs when a genomic sequence and a pool of related expressed sequences are provided. Our procedure has been implemented in the software EasyCluster and takes into account the spliced nature of ESTs after an ad hoc genomic mapping.

Methods

EasyCluster uses the well-known GMAP program in order to perform a very quick EST-to-genome mapping in addition to the detection of reliable splice sites. Given a genomic sequence and a pool of ESTs/FL-cDNAs, EasyCluster starts building genomic and EST local databases and runs GMAP. Subsequently, it parses results creating an initial collection of pseudo-clusters by grouping ESTs according to the overlap of their genomic coordinates on the same strand. In the final step, EasyCluster refines the clustering by again running GMAP on each pseudo-cluster and groups together ESTs sharing at least one splice site.

Results

The higher accuracy of EasyCluster with respect to other clustering tools has been verified by means of a manually curated benchmark of human EST clusters. Additional datasets including the Unigene cluster Hs.122986 and ESTs related to the human HOXA gene family have also been used to demonstrate the better clustering capability of EasyCluster over current genome-based web service tools such as ASmodeler and BIPASS. EasyCluster has also been used to provide a first compilation of gene-oriented clusters in the Ricinus communis oilseed plant for which no Unigene clusters are yet available, as well as an evaluation of the alternative splicing in this plant species.
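
The two grouping steps described under Methods (pseudo-clusters from overlapping same-strand genomic coordinates, then refinement by shared splice sites) can be sketched as follows. This is an illustration only, not EasyCluster's code: it assumes the EST-to-genome mapping has already been done and parsed, and the `MappedEST` record, its fields, and the function names are hypothetical. Treating an unspliced EST as its own cluster is also an assumption of the sketch.

```python
from collections import namedtuple

# Hypothetical record for one mapped EST; coordinates are inclusive and
# splice_sites is a set of genomic positions.
MappedEST = namedtuple("MappedEST", "name chrom strand start end splice_sites")


def pseudo_clusters(ests):
    """Step 1: group ESTs whose genomic intervals overlap on the same strand."""
    ordered = sorted(ests, key=lambda e: (e.chrom, e.strand, e.start))
    groups, current, current_end, current_key = [], [], None, None
    for est in ordered:
        key = (est.chrom, est.strand)
        if current and key == current_key and est.start <= current_end:
            current.append(est)
            current_end = max(current_end, est.end)
        else:
            if current:
                groups.append(current)
            current, current_end, current_key = [est], est.end, key
    if current:
        groups.append(current)
    return groups


def refine_by_splice_sites(group):
    """Step 2: within a pseudo-cluster, join ESTs sharing at least one splice site."""
    parent = {est.name: est.name for est in group}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # splice site -> name of the first EST that uses it
    for est in group:
        for site in est.splice_sites:
            if site in first_seen:
                parent[find(est.name)] = find(first_seen[site])
            else:
                first_seen[site] = est.name

    clusters = {}
    for est in group:
        clusters.setdefault(find(est.name), []).append(est.name)
    return list(clusters.values())


def gene_oriented_clusters(ests):
    """Run both steps and return a list of clusters (lists of EST names)."""
    result = []
    for group in pseudo_clusters(ests):
        result.extend(refine_by_splice_sites(group))
    return result
```

Union-find makes the splice-site relation transitive, so two ESTs with no site in common still end up together when a third EST shares a site with each of them.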

Imelfort M. Sequence Comparison Tools. Bioinformatics 2009. [DOI: 10.1007/978-0-387-92738-1_2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/28/2022] Open