1
Luan T, Muralidharan HS, Alshehri M, Mittra I, Pop M. SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets. Nucleic Acids Res 2023; 51:e46. [PMID: 36912074 PMCID: PMC10164572 DOI: 10.1093/nar/gkad158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2022] [Revised: 02/01/2023] [Accepted: 02/28/2023] [Indexed: 03/14/2023]
Abstract
16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets are growing in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We propose an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the data set, allowing users to stop the clustering process when sufficient clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the clustering process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while being effective at capturing the large clusters in the data set. The experiments also show that SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate) is able to produce operational taxonomic units that are less fragmented than popular tools: UCLUST, CD-HIT and DNACLUST. The algorithm is implemented in the open-source package SCRAPT. The source code used to generate the results presented in this paper is available at https://github.com/hsmurali/SCRAPT.
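The sample, cluster, recruit and iterate loop described in this abstract can be illustrated with a minimal Python sketch. It is not the SCRAPT implementation: the function names, the toy position-wise `identity` measure (a stand-in for alignment identity), and the fixed `sample_frac` are all assumptions made for illustration, and the adaptive sampling and mode-shifting steps are omitted.

```python
import random

def identity(a, b):
    """Fraction of matching positions; a toy stand-in for alignment identity."""
    n = max(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 1.0

def greedy_centres(seqs, threshold):
    """Greedy pass: a sequence becomes a new centre if it matches no existing one."""
    centres = []
    for s in seqs:
        if not any(identity(s, c) >= threshold for c in centres):
            centres.append(s)
    return centres

def iterative_cluster(seqs, threshold=0.9, sample_frac=0.2, max_iters=10, seed=0):
    """Sample, cluster the sample, recruit from the full remainder, then iterate."""
    rng = random.Random(seed)
    remaining = list(seqs)
    clusters = []  # list of (centre, members) pairs
    for _ in range(max_iters):
        if not remaining:
            break
        k = max(1, int(len(remaining) * sample_frac))
        sample = rng.sample(remaining, k)           # hypothetical fixed-rate sampling
        centres = greedy_centres(sample, threshold)
        buckets = {c: [] for c in centres}
        unassigned = []
        for s in remaining:                         # recruit step
            for c in centres:
                if identity(s, c) >= threshold:
                    buckets[c].append(s)
                    break
            else:
                unassigned.append(s)
        clusters.extend(buckets.items())
        remaining = unassigned                      # next round sees only leftovers
    return clusters, remaining
```

Because large clusters are more likely to contribute a sequence to a random sample, they tend to be recovered in early iterations, so a caller can stop the loop once enough clusters are available.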
Affiliation(s)
- Tu Luan
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
- Harihara Subrahmaniam Muralidharan
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
- Marwan Alshehri
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA
- Ipsa Mittra
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA
- Mihai Pop
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
2
Rahnama M, Wang B, Dostart J, Novikova O, Yackzan D, Yackzan A, Bruss H, Baker M, Jacob H, Zhang X, Lamb A, Stewart A, Heist M, Hoover J, Calie P, Chen L, Liu J, Farman ML. Telomere Roles in Fungal Genome Evolution and Adaptation. Front Genet 2021; 12:676751. [PMID: 34434216 PMCID: PMC8381367 DOI: 10.3389/fgene.2021.676751] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/05/2021] [Accepted: 06/28/2021] [Indexed: 11/27/2022]
Abstract
Telomeres form the ends of linear chromosomes and usually comprise protein complexes that bind to simple repeated sequence motifs that are added to the 3′ ends of DNA by the telomerase reverse transcriptase (TERT). One of the primary functions attributed to telomeres is to solve the “end-replication problem” which, if left unaddressed, would cause gradual, inexorable attrition of sequences from the chromosome ends and, eventually, loss of viability. Telomere-binding proteins also protect the chromosome from 5′ to 3′ exonuclease action, and disguise the chromosome ends from the double-strand break repair machinery whose illegitimate action potentially generates catastrophic chromosome aberrations. Telomeres are of special interest in the blast fungus, Pyricularia, because the adjacent regions are enriched in genes controlling interactions with host plants, and the chromosome ends show enhanced polymorphism and genetic instability. Previously, we showed that telomere instability in some P. oryzae strains is caused by novel retrotransposons (MoTeRs) that insert in telomere repeats, generating interstitial telomere sequences that drive frequent, break-induced rearrangements. Here, we sought to gain further insight on telomeric involvement in shaping Pyricularia genome architecture by characterizing sequence polymorphisms at chromosome ends, and surrounding internalized MoTeR loci (relics) and interstitial telomere repeats. This provided evidence that telomere dynamics have played historical, and likely ongoing, roles in shaping the Pyricularia genome. We further demonstrate that even telomeres lacking MoTeR insertions are poorly preserved, such that the telomere-adjacent sequences exhibit frequent presence/absence polymorphism, as well as exchanges with the genome interior. 
Using TERT knockout experiments, we characterized chromosomal responses to failed telomere maintenance which suggested that much of the MoTeR relic-/interstitial telomere-associated polymorphism could be driven by compromised telomere function. Finally, we describe three possible examples of a phenomenon known as “Adaptive Telomere Failure,” where spontaneous losses of telomere maintenance drive rapid accumulation of sequence polymorphism with possible adaptive advantages. Together, our data suggest that telomere maintenance is frequently compromised in Pyricularia but the chromosome alterations resulting from telomere failure are not as catastrophic as prior research would predict, and may, in fact, be potent drivers of adaptive polymorphism.
Affiliation(s)
- Mostafa Rahnama
  - Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
- Baohua Wang
  - Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
  - State Key Laboratory for Ecological Pest Control of Fujian and Taiwan Crops, College of Plant Protection, Fujian Agriculture and Forestry University, Fuzhou, China
- Jane Dostart
  - Department of Biological Sciences, Eastern Kentucky University, Richmond, KY, United States
- Olga Novikova
  - Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
- Daniel Yackzan
  - Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
- Andrew Yackzan
  - Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
- Haley Bruss
  - Department of Biological Sciences, Eastern Kentucky University, Richmond, KY, United States
- Maray Baker
  - Department of Biological Sciences, Eastern Kentucky University, Richmond, KY, United States
- Haven Jacob
  - Department of Biological Sciences, Eastern Kentucky University, Richmond, KY, United States
- Xiaofei Zhang
  - Department of Computer Sciences, University of Kentucky, Lexington, KY, United States
- April Lamb
  - Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
- Alex Stewart
  - Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
- Melanie Heist
  - Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
- Joey Hoover
  - Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
- Patrick Calie
  - Department of Biological Sciences, Eastern Kentucky University, Richmond, KY, United States
- Li Chen
  - Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
- Jinze Liu
  - Department of Computer Sciences, University of Kentucky, Lexington, KY, United States
- Mark L Farman
  - Department of Plant Pathology, University of Kentucky, Lexington, KY, United States
3
James BT, Luczak BB, Girgis HZ. MeShClust: an intelligent tool for clustering DNA sequences. Nucleic Acids Res 2019; 46:e83. [PMID: 29718317 PMCID: PMC6101578 DOI: 10.1093/nar/gky315] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 02/04/2018] [Accepted: 04/13/2018] [Indexed: 11/13/2022]
Abstract
Sequence clustering is a fundamental step in analyzing DNA sequences. Widely-used software tools for sequence clustering utilize greedy approaches that are not guaranteed to produce the best results. These tools are sensitive to one parameter that determines the similarity among sequences in a cluster. Oftentimes, a biologist may not know the exact sequence similarity. Therefore, if the provided parameter is inaccurate, the clusters produced by these tools are unlikely to match the real clusters comprising the data. To overcome this limitation, we adapted the mean shift algorithm, an unsupervised machine-learning algorithm, which has been used successfully thousands of times in fields such as image processing and computer vision. The theory behind the mean shift algorithm, unlike the greedy approaches, guarantees convergence to the modes, i.e. cluster centers. Here we describe the first application of the mean shift algorithm to clustering DNA sequences. MeShClust is one of few applications of the mean shift algorithm in bioinformatics. Further, we applied supervised machine learning to predict the identity score produced by global alignment using alignment-free methods. We demonstrate MeShClust's ability to cluster DNA sequences with high accuracy even when the sequence similarity parameter provided by the user is not very accurate.
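The mean shift idea that this abstract builds on can be shown with a small numeric sketch. This is a generic one-dimensional, flat-kernel mean shift on numbers, not MeShClust's sequence-based procedure; the function name `mean_shift_1d` and the `bandwidth` parameter are illustrative assumptions.

```python
def mean_shift_1d(points, bandwidth, tol=1e-6, max_iters=100):
    """Move each point to the mean of its neighbours until it stops moving.

    Points that converge to the same value share a mode (cluster centre).
    """
    modes = []
    for x in points:
        for _ in range(max_iters):
            # Flat kernel: every point within `bandwidth` counts equally.
            window = [p for p in points if abs(p - x) <= bandwidth]
            if not window:          # defensive guard against rounding at the edge
                break
            new_x = sum(window) / len(window)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)
    return modes

modes = mean_shift_1d([1.0, 1.2, 0.8, 5.0, 5.1, 4.9], bandwidth=1.0)
centres = sorted({round(m, 3) for m in modes})
```

In this toy run the six points settle on two centres, which illustrates why the method does not need the number of clusters in advance; the bandwidth plays a role loosely analogous to the similarity parameter mentioned in the abstract.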
Affiliation(s)
- Benjamin T James
  - Bioinformatics Toolsmith Laboratory, Tandy School of Computer Science, University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA
  - Mathematics Department, University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA
- Brian B Luczak
  - Bioinformatics Toolsmith Laboratory, Tandy School of Computer Science, University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA
  - Mathematics Department, University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA
- Hani Z Girgis
  - Bioinformatics Toolsmith Laboratory, Tandy School of Computer Science, University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA
4
Mbandi SK, Hesse U, van Heusden P, Christoffels A. Inferring bona fide transfrags in RNA-Seq derived-transcriptome assemblies of non-model organisms. BMC Bioinformatics 2015; 16:58. [PMID: 25880035 PMCID: PMC4344733 DOI: 10.1186/s12859-015-0492-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 06/30/2014] [Accepted: 02/06/2015] [Indexed: 11/19/2022]
Abstract
BACKGROUND De novo transcriptome assembly of short transcribed fragments (transfrags) produced from sequencing-by-synthesis technologies often results in redundant datasets with differing levels of unassembled, partially assembled or mis-assembled transcripts. Post-assembly processing intended to reduce redundancy typically involves reassembly or clustering of assembled sequences. However, these approaches are mostly based on common word heuristics and often create clusters of biologically unrelated sequences, resulting in loss of unique transfrags annotations and propagation of mis-assemblies. RESULTS Here, we propose a structured framework that consists of a few steps in pipeline architecture for Inferring Functionally Relevant Assembly-derived Transcripts (IFRAT). IFRAT combines 1) removal of identical subsequences, 2) error tolerant CDS prediction, 3) identification of coding potential, and 4) complements BLAST with a multiple domain architecture annotation that reduces non-specific domain annotation. We demonstrate that independent of the assembler, IFRAT selects bona fide transfrags (with CDS and coding potential) from the transcriptome assembly of a model organism without relying on post-assembly clustering or reassembly. The robustness of IFRAT is inferred on RNA-Seq data of Neurospora crassa assembled using de Bruijn graph-based assemblers, in single (Trinity and Oases-25) and multiple (Oases-Merge and additive or pooled) k-mer modes. Single k-mer assemblies contained fewer transfrags compared to the multiple k-mer assemblies. However, Trinity identified a comparable number of predicted coding sequence and gene loci to Oases pooled assembly. IFRAT selects bona fide transfrags representing over 94% of cumulative BLAST-derived functional annotations of the unfiltered assemblies. Between 4-6% are lost when orphan transfrags are excluded and this represents only a tiny fraction of annotation derived from functional transference by sequence similarity. 
The median length of bona fide transfrags ranged from 1.5 kb (Trinity) to 2 kb (Oases), which is consistent with the average coding sequence length in fungi. The fraction of transfrags that could be associated with gene ontology terms ranged from 33% to 50%, which is also high for domain-based annotation. We showed that unselected transfrags were mostly truncated and represent sequences from intronic, untranslated (5' and 3') regions and non-coding gene loci. CONCLUSIONS IFRAT simplifies post-assembly processing, providing a reference transcriptome enriched with functionally relevant assembly-derived transcripts for non-model organisms.
Affiliation(s)
- Stanley Kimbung Mbandi
  - South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa
- Uljana Hesse
  - South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa
- Peter van Heusden
  - South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa
- Alan Christoffels
  - South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa
5
Bevilacqua V, Pietroleonardo N, Giannino E, Stroppa F, Simone D, Pesole G, Picardi E. EasyCluster2: an improved tool for clustering and assembling long transcriptome reads. BMC Bioinformatics 2014; 15 Suppl 15:S7. [PMID: 25474441 PMCID: PMC4271567 DOI: 10.1186/1471-2105-15-s15-s7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
BACKGROUND Expressed sequences (e.g. ESTs) are a strong source of evidence to improve gene structures and predict reliable alternative splicing events. When a genome assembly is available, ESTs are suitable to generate gene-oriented clusters through the well-established EasyCluster software. Nowadays, EST-like sequences can be massively produced using Next Generation Sequencing (NGS) technologies. In order to handle genome-scale transcriptome data, we present here EasyCluster2, a reimplementation of EasyCluster able to speed up the creation of gene-oriented clusters and facilitate downstream analyses such as the assembly of full-length transcripts and the detection of splicing isoforms. RESULTS EasyCluster2 has been developed to facilitate the genome-based clustering of EST-like sequences generated through the NGS 454 technology. Reads mapped onto the reference genome can be uploaded using the standard GFF3 file format. Alignment parsing is initially performed to produce a first collection of pseudo-clusters by grouping reads according to the overlap of their genomic coordinates on the same strand. EasyCluster2 then refines read grouping by including in each cluster only reads sharing at least one splice site and optionally performs a Smith-Waterman alignment in the region surrounding splice sites in order to correct for potential alignment errors. In addition, EasyCluster2 can include unspliced reads, which generally account for >50% of 454 datasets, and collapses overlapping clusters. Finally, EasyCluster2 can assemble full-length transcripts using a Directed-Acyclic-Graph-based strategy, simplifying the identification of alternative splicing isoforms, thanks also to the implementation of the widespread AStalavista methodology. Accuracy and performance have been tested on real as well as simulated datasets. CONCLUSIONS EasyCluster2 represents a unique tool to cluster and assemble transcriptome reads produced with 454 technology, as well as ESTs and full-length transcripts. The clustering procedure is enhanced by the use of genome annotations and unspliced reads. Overall, EasyCluster2 is able to perform an effective detection of splicing isoforms, since it can refine exon-exon junctions and explore alternative splicing without known reference transcripts. Results in GFF3 format can be browsed in the UCSC Genome Browser. Therefore, EasyCluster2 is a powerful tool to generate reliable clusters for gene expression studies, making the analysis accessible also to researchers not skilled in bioinformatics.
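The first grouping step described in this abstract, forming pseudo-clusters from reads whose genomic coordinates overlap on the same strand, can be sketched as an interval sweep. The Python below is an illustration only, not EasyCluster2 code: the function name `pseudo_clusters` and the tuple layout `(read_id, chrom, strand, start, end)` are assumptions, and the later splice-site refinement is not modelled.

```python
from collections import defaultdict

def pseudo_clusters(reads):
    """Group reads whose [start, end] spans overlap on the same chromosome and strand.

    reads: iterable of (read_id, chrom, strand, start, end), coordinates inclusive.
    Returns a list of clusters, each a list of read ids.
    """
    by_key = defaultdict(list)
    for rid, chrom, strand, start, end in reads:
        by_key[(chrom, strand)].append((start, end, rid))

    clusters = []
    for key in sorted(by_key):
        current, current_end = [], None
        # Sweep left to right; a gap past the running end closes the cluster.
        for start, end, rid in sorted(by_key[key]):
            if current and start > current_end:
                clusters.append(current)
                current, current_end = [], None
            current.append(rid)
            current_end = end if current_end is None else max(current_end, end)
        if current:
            clusters.append(current)
    return clusters

reads = [
    ("r1", "chr1", "+", 100, 200),
    ("r2", "chr1", "+", 150, 300),
    ("r3", "chr1", "-", 150, 300),
    ("r4", "chr1", "+", 400, 500),
]
print(pseudo_clusters(reads))  # [['r1', 'r2'], ['r4'], ['r3']]
```

Keying on strand as well as chromosome keeps `r3` apart from `r1` and `r2` even though its coordinates overlap theirs, which matches the "same strand" condition in the text.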
6
Li X, Gao W, Guo H, Zhang X, Fang DD, Lin Z. Development of EST-based SNP and InDel markers and their utilization in tetraploid cotton genetic mapping. BMC Genomics 2014; 15:1046. [PMID: 25442170 PMCID: PMC4265408 DOI: 10.1186/1471-2164-15-1046] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 06/01/2014] [Accepted: 11/14/2014] [Indexed: 12/19/2022]
Abstract
Background Availability of molecular markers has proven to be an efficient tool in facilitating progress in plant breeding, which is particularly important in the case of less researched crops such as cotton. Considering the obvious advantages of single nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (InDels), expressed sequence tags (ESTs) were analyzed in silico to identify SNPs and InDels in this study, aiming to develop more molecular markers in cotton. Results A total of 1,349 EST-based SNP and InDel markers were developed by comparing ESTs between Gossypium hirsutum and G. barbadense, mining G. hirsutum unigenes, and analyzing 3′ untranslated region (3′UTR) sequences. The marker polymorphisms were investigated using the two parents of the mapping population based on the single-strand conformation polymorphism (SSCP) analysis. Of all the markers, 137 (10.16%) were polymorphic, and revealed 142 loci. Linkage analysis using a BC1 population mapped 133 loci on the 26 chromosomes. Statistical analysis of base variations in SNPs showed that base transitions accounted for 55.78% of the total base variations, and gene ontology indicated that cotton genes varied greatly in harboring SNPs, ranging from 1.00 to 24.00 SNPs per gene. Sanger sequencing of three randomly selected SNP markers revealed discrepancies between the in silico predicted sequences and the actual sequencing results. Conclusions In silico analysis is a double-edged sword for developing EST-SNP/InDel markers. On the one hand, the designed markers can be used effectively in tetraploid cotton genetic mapping, and the analysis helps reveal the transition preference and SNP frequency of cotton genes. On the other hand, the development efficiency of the markers and the polymorphism of the designed primers are comparatively low.
Affiliation(s)
- Zhongxu Lin
  - National Key Laboratory of Crop Genetic Improvement & National Centre of Plant Gene Research (Wuhan), Huazhong Agricultural University, Wuhan 430070, Hubei, China
7
Rupp O, Becker J, Brinkrolf K, Timmermann C, Borth N, Pühler A, Noll T, Goesmann A. Construction of a public CHO cell line transcript database using versatile bioinformatics analysis pipelines. PLoS One 2014; 9:e85568. [PMID: 24427317 PMCID: PMC3888431 DOI: 10.1371/journal.pone.0085568] [Citation(s) in RCA: 56] [Impact Index Per Article: 5.6] [Received: 10/01/2013] [Accepted: 12/03/2013] [Indexed: 11/19/2022]
Abstract
Chinese hamster ovary (CHO) cell lines represent the most commonly used mammalian expression system for the production of therapeutic proteins. In this context, detailed knowledge of the CHO cell transcriptome might help to improve biotechnological processes conducted by specific cell lines. Nevertheless, very few assembled cDNA sequences of CHO cells were publicly released until recently, which puts a severe limitation on biotechnological research. Two extended annotation systems and web-based tools, one for browsing eukaryotic genomes (GenDBE) and one for viewing eukaryotic transcriptomes (SAMS), were established as the first step towards a publicly usable CHO cell genome/transcriptome analysis platform. This is complemented by the development of a new strategy to assemble the ca. 100 million reads, sequenced from a broad range of diverse transcripts, to a high quality CHO cell transcript set. The cDNA libraries were constructed from different CHO cell lines grown under various culture conditions and sequenced using Roche/454 and Illumina sequencing technologies in addition to sequencing reads from a previous study. Two pipelines to extend and improve the CHO cell line transcripts were established. First, de novo assemblies were carried out with the Trinity and Oases assemblers, using varying k-mer sizes. The resulting contigs were screened for potential CDS using ESTScan. Redundant contigs were filtered out using cd-hit-est. The remaining CDS contigs were re-assembled with CAP3. Second, a reference-based assembly with the TopHat/Cufflinks pipeline was performed, using the recently published draft genome sequence of CHO-K1 as reference. Additionally, the de novo contigs were mapped to the reference genome using GMAP and merged with the Cufflinks assembly using the cuffmerge software. With this approach 28,874 transcripts located on 16,492 gene loci could be assembled. 
Combining the results of both approaches, 65,561 transcripts were identified for CHO cell lines, which could be clustered by sequence identity into 17,598 gene clusters.
Affiliation(s)
- Oliver Rupp
  - Center for Biotechnology, Bielefeld University, Bielefeld, Germany
  - Cell Culture Technology, Bielefeld University, Bielefeld, Germany
  - Bioinformatics and Systems Biology, Justus-Liebig-University, Giessen, Germany
- Jennifer Becker
  - Center for Biotechnology, Bielefeld University, Bielefeld, Germany
  - Cell Culture Technology, Bielefeld University, Bielefeld, Germany
- Karina Brinkrolf
  - Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Nicole Borth
  - Department for Biotechnology, Universität für Bodenkultur Wien, Vienna, Austria
  - ACIB, Austrian Center of Industrial Biotechnology, Graz and Vienna, Austria
- Alfred Pühler
  - Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Thomas Noll
  - Center for Biotechnology, Bielefeld University, Bielefeld, Germany
  - Cell Culture Technology, Bielefeld University, Bielefeld, Germany
- Alexander Goesmann
  - Center for Biotechnology, Bielefeld University, Bielefeld, Germany
  - Bioinformatics and Systems Biology, Justus-Liebig-University, Giessen, Germany
8
Looso M, Preussner J, Sousounis K, Bruckskotten M, Michel CS, Lignelli E, Reinhardt R, Höffner S, Krüger M, Tsonis PA, Borchardt T, Braun T. A de novo assembly of the newt transcriptome combined with proteomic validation identifies new protein families expressed during tissue regeneration. Genome Biol 2013; 14:R16. [PMID: 23425577 PMCID: PMC4054090 DOI: 10.1186/gb-2013-14-2-r16] [Citation(s) in RCA: 87] [Impact Index Per Article: 7.9] [Received: 09/27/2012] [Accepted: 02/20/2013] [Indexed: 11/12/2022]
Abstract
Background Notophthalmus viridescens, a urodelian amphibian, represents an excellent model organism to study regenerative processes, but mechanistic insights into molecular processes driving regeneration have been hindered by a paucity and poor annotation of coding nucleotide sequences. The enormous genome size and the lack of a closely related reference genome have so far prevented assembly of the urodelian genome. Results We describe the de novo assembly of the transcriptome of the newt Notophthalmus viridescens and its experimental validation. RNA pools covering embryonic and larval development, different stages of heart, appendage and lens regeneration, as well as a collection of different undamaged tissues were used to generate sequencing datasets on Sanger, Illumina and 454 platforms. Through a sequential de novo assembly strategy, hybrid datasets were converged into one comprehensive transcriptome comprising 120,922 non-redundant transcripts with an N50 of 975. From this, 38,384 putative transcripts were annotated and around 15,000 transcripts were experimentally validated as protein coding by mass spectrometry-based proteomics. Bioinformatical analysis of coding transcripts identified 826 proteins specific for urodeles. Several newly identified proteins establish novel protein families based on the presence of new sequence motifs without counterparts in public databases, while others containing known protein domains extend already existing families and also constitute new ones. Conclusions We demonstrate that our multistep assembly approach allows de novo assembly of the newt transcriptome with an annotation grade comparable to well characterized organisms. Our data provide the groundwork for mechanistic experiments to answer the question whether urodeles utilize proprietary sets of genes for tissue regeneration.
9
Passos MAN, de Cruz VO, Emediato FL, de Teixeira CC, Azevedo VCR, Brasileiro ACM, Amorim EP, Ferreira CF, Martins NF, Togawa RC, Pappas GJ, da Silva OB, Miller RNG. Analysis of the leaf transcriptome of Musa acuminata during interaction with Mycosphaerella musicola: gene assembly, annotation and marker development. BMC Genomics 2013; 14:78. [PMID: 23379821 PMCID: PMC3635893 DOI: 10.1186/1471-2164-14-78] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 07/16/2012] [Accepted: 02/01/2013] [Indexed: 11/21/2022]
Abstract
BACKGROUND Although banana (Musa sp.) is an important edible crop, contributing towards poverty alleviation and food security, limited transcriptome datasets are available for use in accelerated molecular-based breeding in this genus. 454 GS-FLX Titanium technology was employed to determine the sequence of gene transcripts in genotypes of Musa acuminata ssp. burmannicoides Calcutta 4 and M. acuminata subgroup Cavendish cv. Grande Naine, contrasting in resistance to the fungal pathogen Mycosphaerella musicola, causal organism of Sigatoka leaf spot disease. To enrich for transcripts under biotic stress responses, full length-enriched cDNA libraries were prepared from whole plant leaf materials, both uninfected and artificially challenged with pathogen conidiospores. RESULTS The study generated 846,762 high quality sequence reads, with an average length of 334 bp and totalling 283 Mbp. De novo assembly generated 36,384 and 35,269 unigene sequences for M. acuminata Calcutta 4 and Cavendish Grande Naine, respectively. A total of 64.4% of the unigenes were annotated through Basic Local Alignment Search Tool (BLAST) similarity analyses against public databases. Assembled sequences were functionally mapped to Gene Ontology (GO) terms, with unigene functions covering a diverse range of molecular functions, biological processes and cellular components. Genes from a number of defense-related pathways were observed in transcripts from each cDNA library. Over 99% of contig unigenes mapped to exon regions in the reference M. acuminata DH Pahang whole genome sequence. A total of 4068 genic-SSR loci were identified in Calcutta 4 and 4095 in Cavendish Grande Naine. A subset of 95 potential defense-related gene-derived simple sequence repeat (SSR) loci were validated for specific amplification and polymorphism across M. acuminata accessions. 
Fourteen loci were polymorphic, with alleles per polymorphic locus ranging from 3 to 8 and polymorphism information content ranging from 0.34 to 0.82. CONCLUSIONS A large set of unigenes were characterized in this study for both M. acuminata Calcutta 4 and Cavendish Grande Naine, increasing the number of public domain Musa ESTs. This transcriptome is an invaluable resource for furthering our understanding of biological processes elicited during biotic stresses in Musa. Gene-based markers will facilitate molecular breeding strategies, forming the basis of genetic linkage mapping and analysis of quantitative trait loci.
Affiliation(s)
- Marco A N Passos
  - Universidade de Brasília, Campus Universitário Darcy Ribeiro, Instituto de Ciências Biológicas, Departamento de Biologia Celular, CEP 70.910-900, Brasília, D.F, Brazil
- Viviane Oliveira de Cruz
  - Universidade de Brasília, Campus Universitário Darcy Ribeiro, Instituto de Ciências Biológicas, Departamento de Biologia Celular, CEP 70.910-900, Brasília, D.F, Brazil
- Flavia L Emediato
  - Universidade de Brasília, Campus Universitário Darcy Ribeiro, Instituto de Ciências Biológicas, Departamento de Biologia Celular, CEP 70.910-900, Brasília, D.F, Brazil
- Vânia C Rennó Azevedo
  - EMBRAPA Recursos Genéticos e Biotecnologia, Parque Estação Biológica, CP 02372, CEP 70.770-900, Brasília, D.F, Brazil
- Ana C M Brasileiro
  - EMBRAPA Recursos Genéticos e Biotecnologia, Parque Estação Biológica, CP 02372, CEP 70.770-900, Brasília, D.F, Brazil
- Edson P Amorim
  - EMBRAPA Mandioca e Fruticultura Tropical, Rua Embrapa, CEP 44.380-000, Cruz das Almas, BA, Brazil
- Claudia F Ferreira
  - EMBRAPA Mandioca e Fruticultura Tropical, Rua Embrapa, CEP 44.380-000, Cruz das Almas, BA, Brazil
- Natalia F Martins
  - EMBRAPA Recursos Genéticos e Biotecnologia, Parque Estação Biológica, CP 02372, CEP 70.770-900, Brasília, D.F, Brazil
- Roberto C Togawa
  - EMBRAPA Recursos Genéticos e Biotecnologia, Parque Estação Biológica, CP 02372, CEP 70.770-900, Brasília, D.F, Brazil
- Georgios J Pappas
  - Universidade de Brasília, Campus Universitário Darcy Ribeiro, Instituto de Ciências Biológicas, Departamento de Biologia Celular, CEP 70.910-900, Brasília, D.F, Brazil
- Orzenil Bonfim da Silva
  - EMBRAPA Recursos Genéticos e Biotecnologia, Parque Estação Biológica, CP 02372, CEP 70.770-900, Brasília, D.F, Brazil
- Robert NG Miller
  - Universidade de Brasília, Campus Universitário Darcy Ribeiro, Instituto de Ciências Biológicas, Departamento de Biologia Celular, CEP 70.910-900, Brasília, D.F, Brazil
10
|
Molnár I, Lopez D, Wisecaver JH, Devarenne TP, Weiss TL, Pellegrini M, Hackett JD. Bio-crude transcriptomics: gene discovery and metabolic network reconstruction for the biosynthesis of the terpenome of the hydrocarbon oil-producing green alga, Botryococcus braunii race B (Showa). BMC Genomics 2012; 13:576. [PMID: 23110428 PMCID: PMC3533583 DOI: 10.1186/1471-2164-13-576] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Received: 04/13/2012] [Accepted: 10/19/2012] [Indexed: 12/16/2022]
Abstract
Background Microalgae hold promise for yielding a biofuel feedstock that is sustainable, carbon-neutral, distributed, and only minimally disruptive for the production of food and feed by traditional agriculture. Amongst oleaginous eukaryotic algae, the B race of Botryococcus braunii is unique in that it produces large amounts of liquid hydrocarbons of terpenoid origin. These are comparable to fossil crude oil, and are sequestered outside the cells in a communal extracellular polymeric matrix material. Biosynthetic engineering of terpenoid bio-crude production requires identification of genes and reconstruction of metabolic pathways responsible for production of both hydrocarbons and other metabolites of the alga that compete for photosynthetic carbon and energy. Results A de novo assembly of 1,334,609 next-generation pyrosequencing reads from the Showa strain of the B race of B. braunii yielded a transcriptomic database of 46,422 contigs with an average length of 756 bp. Contigs were annotated with pathway, ontology, and protein domain identifiers. Manual curation allowed the reconstruction of pathways that produce terpenoid liquid hydrocarbons from primary metabolites, and pathways that divert photosynthetic carbon into tetraterpenoid carotenoids, diterpenoids, and the prenyl chains of meroterpenoid quinones and chlorophyll. Inventories of machine-assembled contigs are also presented for reconstructed pathways for the biosynthesis of competing storage compounds including triacylglycerol and starch. Regeneration of S-adenosylmethionine, and the extracellular localization of the hydrocarbon oils by active transport and possibly autophagy are also investigated.
Conclusions The construction of an annotated transcriptomic database, publicly available in a web-based data depository and annotation tool, provides a foundation for metabolic pathway and network reconstruction, and facilitates further omics studies in the absence of a genome sequence for the Showa strain of B. braunii, race B. Further, the transcriptome database empowers future biosynthetic engineering approaches for strain improvement and the transfer of desirable traits to heterologous hosts.
Affiliation(s)
- István Molnár
  - Natural Products Center, School of Natural Resources and the Environment, The University of Arizona, Tucson, 85739, USA.
|
11
|
Ng KH, Ho CK, Phon-Amnuaisuk S. A hybrid distance measure for clustering expressed sequence tags originating from the same gene family. PLoS One 2012; 7:e47216. [PMID: 23071763 PMCID: PMC3469558 DOI: 10.1371/journal.pone.0047216] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 06/15/2012] [Accepted: 09/10/2012] [Indexed: 01/22/2023]
Abstract
BACKGROUND Clustering is a key step in the processing of Expressed Sequence Tags (ESTs). The primary goal of clustering is to put ESTs from the same transcript of a single gene into a unique cluster. Recent EST clustering algorithms mostly adopt alignment-free distance measures, which tend to yield acceptable clustering accuracies in reasonable computational time. Although these clustering methods work satisfactorily on a majority of EST datasets, they share a common weakness: they are prone to deliver unsatisfactory clustering results when dealing with ESTs from genes derived from the same family. The root cause is that the distance measures applied to them are not sensitive enough to separate these closely related genes. METHODOLOGY/PRINCIPAL FINDINGS We propose a hybrid distance measure that combines the global and local features extracted from ESTs, with the aim of addressing the clustering problem faced by ESTs derived from the same gene family. The clustering process is implemented using the DBSCAN algorithm. We test the hybrid distance measure on ten EST datasets, and the clustering results are compared with those of two alignment-free EST clustering tools, i.e. wcd and PEACE. The clustering results indicate that the proposed hybrid distance measure performs relatively better (in terms of clustering accuracy) than both EST clustering tools. CONCLUSIONS/SIGNIFICANCE The clustering results provide support for the effectiveness of the proposed hybrid distance measure in solving the clustering problem for ESTs that originate from the same gene family. The improvement of clustering accuracies on the experimental datasets has supported the claim that the sensitivity of the hybrid distance measure is sufficient to solve the clustering problem.
Affiliation(s)
- Keng-Hoong Ng
  - Faculty of Computing and Informatics, Multimedia University, Cyberjaya, Malaysia.
|
12
|
Abstract
Applications of clustering algorithms in biomedical research are ubiquitous, with typical examples including gene expression data analysis, genomic sequence analysis, biomedical document mining, and MRI image analysis. However, due to the diversity of cluster analysis, the differing terminologies, goals, and assumptions underlying different clustering algorithms can be daunting. Thus, determining the right match between clustering algorithms and biomedical applications has become particularly important. This paper is presented to provide biomedical researchers with an overview of the status quo of clustering algorithms, to illustrate examples of biomedical applications based on cluster analysis, and to help biomedical researchers select the most suitable clustering algorithms for their own applications.
Affiliation(s)
- Rui Xu
  - Industrial Artificial Intelligence Laboratory, GE Global Research Center, Niskayuna, NY 12309, USA.
|
13
|
Hackett JD, Wisecaver JH, Brosnahan ML, Kulis DM, Anderson DM, Bhattacharya D, Plumley FG, Erdner DL. Evolution of saxitoxin synthesis in cyanobacteria and dinoflagellates. Mol Biol Evol 2012; 30:70-8. [PMID: 22628533 DOI: 10.1093/molbev/mss142] [Citation(s) in RCA: 128] [Impact Index Per Article: 10.7] [Indexed: 11/14/2022]
Abstract
Dinoflagellates produce a variety of toxic secondary metabolites that have a significant impact on marine ecosystems and fisheries. Saxitoxin (STX), the cause of paralytic shellfish poisoning, is produced by three marine dinoflagellate genera and is also made by some freshwater cyanobacteria. Genes involved in STX synthesis have been identified in cyanobacteria but are yet to be reported in the massive genomes of dinoflagellates. We have assembled comprehensive transcriptome data sets for several STX-producing dinoflagellates and a related non-toxic species and have identified 265 putative homologs of 13 cyanobacterial STX synthesis genes, including all of the genes directly involved in toxin synthesis. Putative homologs of four proteins group closely in phylogenies with cyanobacteria and are likely the functional homologs of sxtA, sxtG, and sxtB in dinoflagellates. However, the phylogenies do not support the transfer of these genes directly between toxic cyanobacteria and dinoflagellates. SxtA is split into two proteins in the dinoflagellates corresponding to the N-terminal portion containing the methyltransferase and acyl carrier protein domains and a C-terminal portion with the aminotransferase domain. Homologs of sxtB and N-terminal sxtA are present in non-toxic strains, suggesting their functions may not be limited to saxitoxin production. Only homologs of the C-terminus of sxtA and sxtG were found exclusively in toxic strains. A more thorough survey of STX+ dinoflagellates will be needed to determine if these two genes may be specific to STX production in dinoflagellates. The A. tamarense transcriptome does not contain homologs for the remaining STX genes. Nevertheless, we identified candidate genes with similar predicted biochemical activities that account for the missing functions.
These results suggest that the STX synthesis pathway was likely assembled independently in the distantly related cyanobacteria and dinoflagellates, although using some evolutionarily related proteins. The biological role of STX is not well understood in either cyanobacteria or dinoflagellates. However, STX production in these two ecologically distinct groups of organisms suggests that this toxin confers a benefit to producers that we do not yet fully understand.
Affiliation(s)
- Jeremiah D Hackett
  - Department of Ecology and Evolutionary Biology, University of Arizona, AZ, USA.
|
14
|
Abstract
Allelic variation within species provides fundamental insights into the evolution and ecology of organisms, and information about this variation is becoming increasingly available in sequence datasets of multiple and/or outbred individuals. Unfortunately, identifying true allelic variants poses a number of challenges, given the presence of both sequencing errors and alleles from other closely related loci. We outline the key considerations involved in this process, including assessing the accuracy of allele resolution in sequence assembly, clustering of alleles within and among individuals, and identifying clusters that are most likely to correspond to true allelic variants of a single locus. Our focus is particularly on the case where alleles must be identified without a fully resolved reference genome, and where sequence depth information cannot be used to infer the putative number of loci sharing a sequence, such as in transcriptome or post-assembly datasets. Throughout, we provide information about publicly available tools to aid allele identification in such cases.
Affiliation(s)
- Katrina M Dlugosch
  - Department of Ecology & Evolutionary Biology, University of Arizona, Tucson, AZ, USA.
|
15
|
Hazelhurst S, Lipták Z. KABOOM! A new suffix array based algorithm for clustering expression data. Bioinformatics 2011; 27:3348-55. [PMID: 21984769 DOI: 10.1093/bioinformatics/btr560] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/14/2022]
Abstract
MOTIVATION Second-generation sequencing technology has reinvigorated research using expression data, and clustering such data remains a significant challenge, with much larger datasets and with different error profiles. Algorithms that rely on all-versus-all comparison of sequences are not practical for large datasets. RESULTS We introduce a new filter for string similarity which has the potential to eliminate the need for all-versus-all comparison in clustering of expression data and other similar tasks. Our filter is based on multiple long exact matches between the two strings, with the additional constraint that these matches must be sufficiently far apart. We give details of its efficient implementation using modified suffix arrays. We demonstrate its efficiency by presenting our new expression clustering tool, wcd-express, which uses this heuristic. We compare it to other current tools and show that it is very competitive both with respect to quality and run time. AVAILABILITY Source code and binaries available under GPL at http://code.google.com/p/wcdest. Runs on Linux and MacOS X. CONTACT scott.hazelhurst@wits.ac.za; zsuzsa@cebitec.uni-bielefeld.de SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
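The filter idea in this abstract (several long exact matches that must lie sufficiently far apart) can be illustrated with a small sketch. This is not the wcd-express implementation: it swaps the modified suffix arrays for a plain set of substrings, and the function name, the parameters `match_len`, `min_matches` and `min_gap`, and the choice to measure separation along the first string are all assumptions made for illustration.

```python
def passes_match_filter(a, b, match_len=16, min_matches=2, min_gap=16):
    """Return True if a and b share enough well-separated exact matches.

    Hypothetical sketch: a "match" is a substring of a with length
    match_len that also occurs somewhere in b.
    """
    # Index every length-match_len substring of b (stand-in for a suffix array).
    kmers_b = {b[i:i + match_len] for i in range(len(b) - match_len + 1)}
    count, next_allowed = 0, 0
    for i in range(len(a) - match_len + 1):
        if i >= next_allowed and a[i:i + match_len] in kmers_b:
            count += 1
            if count >= min_matches:
                return True
            # The next counted match must start at least min_gap after
            # the end of this one (distance measured along a).
            next_allowed = i + match_len + min_gap
    return False
```

Pairs that fail such a filter can be skipped before any expensive comparison, which is how a filter of this kind avoids all-versus-all alignment; the thresholds above are placeholders, not values from the paper.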
Affiliation(s)
- Scott Hazelhurst
  - Wits Bioinformatics, School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa.
|
16
|
Zenoni S, D'Agostino N, Tornielli GB, Quattrocchio F, Chiusano ML, Koes R, Zethof J, Guzzo F, Delledonne M, Frusciante L, Gerats T, Pezzotti M. Revealing impaired pathways in the an11 mutant by high-throughput characterization of Petunia axillaris and Petunia inflata transcriptomes. The Plant Journal 2011; 68:11-27. [PMID: 21623977 DOI: 10.1111/j.1365-313x.2011.04661.x] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Indexed: 05/20/2023]
Abstract
Petunia is an excellent model system, especially for genetic, physiological and molecular studies. Thus far, however, genome-wide expression analysis has been applied rarely because of the lack of sequence information. We applied next-generation sequencing to generate, through de novo read assembly, a large catalogue of transcripts for Petunia axillaris and Petunia inflata. On the basis of both transcriptomes, comprehensive microarray chips for gene expression analysis were established and used for the analysis of global- and organ-specific gene expression in Petunia axillaris and Petunia inflata and to explore the molecular basis of the seed coat defects in a Petunia hybrida mutant, anthocyanin 11 (an11), lacking a WD40-repeat (WDR) transcription regulator. Among the transcripts differentially expressed in an11 seeds compared with wild type, many expected targets of AN11 were found but also several interesting new candidates that might play a role in morphogenesis of the seed coat. Our results validate the combination of next-generation sequencing with microarray analyses strategies to identify the transcriptome of two petunia species without previous knowledge of their genome, and to develop comprehensive chips as useful tools for the analysis of gene expression in P. axillaris, P. inflata and P. hybrida.
Affiliation(s)
- Sara Zenoni
  - Department of Biotechnology, University of Verona, Strada Le Grazie 15, 37134 Verona, Italy
|
17
|
Jones DB, Zenger KR, Jerry DR. In silico whole-genome EST analysis reveals 2322 novel microsatellites for the silver-lipped pearl oyster, Pinctada maxima. Mar Genomics 2011; 4:287-90. [PMID: 22118641 DOI: 10.1016/j.margen.2011.06.007] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 04/18/2011] [Revised: 06/24/2011] [Accepted: 06/28/2011] [Indexed: 12/01/2022]
Abstract
Molecular stock improvement techniques such as marker assisted selection have great potential in accelerating selective breeding programmes for animal production industries. However, the discovery and application of trait/marker associations usually requires a large number of genome-wide polymorphic loci. Here, we present 2322 unique microsatellites for the silver-lipped pearl oyster, Pinctada maxima, a species of aquaculture importance throughout the Indo-Australian Archipelago for production of the highly valued South Sea pearl. More than 1.2 million Roche 454 expressed sequence tag (EST) reads were screened for microsatellite repeat motifs. A total of 12,604 sequences contained either a di, tri, tetra, penta or hexa microsatellite repeat motif (n ≥ 6), with 6435 of these sequences having sufficient flanking regions for primer development. All identified microsatellites with designed primers were condensed into 2322 unique clusters (i.e., unique loci) of which 360 were shown to be polymorphic based on multiple sequence reads with different repeat motifs. Genotyping of five microsatellite loci demonstrated that in silico evaluation of polymorphism levels was a very useful method for identification of polymorphic loci, with the variation uncovered being a lower bound. Gene Ontology annotations of sequences containing microsatellites suggest that most are derived from a diverse array of unique genes. This EST derived microsatellite database will be a valuable resource for future studies in genetic map construction, diversity analysis, quantitative trait loci analysis, association mapping and marker assisted selection, not only for P. maxima, but also closely related species within the genus Pinctada.
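The screening criterion quoted above (a di- to hexanucleotide motif repeated at least six times) can be sketched with a regular expression. This is an illustrative sketch only, not the pipeline used in the study; the function name and the decision to skip homopolymer runs are assumptions.

```python
import re

# Motif lengths 2..6 (di- to hexanucleotide), at least 6 tandem copies,
# mirroring the "n >= 6" criterion quoted in the abstract.
MIN_REPEATS = 6
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (MIN_REPEATS - 1))

def find_microsatellites(seq):
    """Return (start, motif, copies) for each tandem repeat found in seq."""
    hits = []
    for m in SSR_PATTERN.finditer(seq.upper()):
        motif = m.group(1)
        # Assumption: a run of one base (e.g. "AAAA...") is not a microsatellite.
        if len(set(motif)) == 1:
            continue
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits
```

The lazy quantifier makes the shortest repeating unit win, so a run of "ACAC" units is reported with motif "AC". Checking that enough flanking sequence remains for primer design, as the abstract describes, would be a separate step.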
Affiliation(s)
- D B Jones
  - Aquaculture Genetics Research Program, School of Marine and Tropical Biology, James Cook University, Townsville, QLD 4811, Australia.
|
18
|
Abstract
Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60–85% and 21–41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12–27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed. Contact: thomas.girke@ucr.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ergude Bao
  - Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
|
19
|
Grattapaglia D, Silva-Junior OB, Kirst M, de Lima BM, Faria DA, Pappas GJ. High-throughput SNP genotyping in the highly heterozygous genome of Eucalyptus: assay success, polymorphism and transferability across species. BMC Plant Biology 2011; 11:65. [PMID: 21492434 PMCID: PMC3090336 DOI: 10.1186/1471-2229-11-65] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Received: 12/21/2010] [Accepted: 04/14/2011] [Indexed: 05/17/2023]
Abstract
BACKGROUND High-throughput SNP genotyping has become an essential requirement for molecular breeding and population genomics studies in plant species. Large scale SNP developments have been reported for several mainstream crops. A growing interest now exists to expand the speed and resolution of genetic analysis to outbred species with highly heterozygous genomes. When nucleotide diversity is high, a refined diagnosis of the target SNP sequence context is needed to convert queried SNPs into high-quality genotypes using the Golden Gate Genotyping Technology (GGGT). This issue becomes exacerbated when attempting to transfer SNPs across species, a scarcely explored topic in plants, and likely to become significant for population genomics and interspecific breeding applications in less domesticated and less funded plant genera. RESULTS We have successfully developed the first set of 768 SNPs assayed by the GGGT for the highly heterozygous genome of Eucalyptus from a mixed Sanger/454 database with 1,164,695 ESTs and the preliminary 4.5X draft genome sequence for E. grandis. A systematic assessment of in silico SNP filtering requirements showed that stringent constraints on the SNP surrounding sequences have a significant impact on SNP genotyping performance and polymorphism. SNP assay success was high for the 288 SNPs selected with more rigorous in silico constraints; 93% of them provided high quality genotype calls and 71% of them were polymorphic in a diverse panel of 96 individuals of five different species. SNP reliability was high across nine Eucalyptus species belonging to three sections within subgenus Symphomyrtus and still satisfactory across species of two additional subgenera, although polymorphism declined as phylogenetic distance increased. CONCLUSIONS This study indicates that the GGGT performs well both within and across species of Eucalyptus notwithstanding its nucleotide diversity ≥ 2%.
The development of a much larger array of informative SNPs across multiple Eucalyptus species is feasible, although strongly dependent on having a representative and sufficiently deep collection of sequences from many individuals of each target species. A higher density SNP platform will be instrumental to undertake genome-wide phylogenetic and population genomics studies and to implement molecular breeding by Genomic Selection in Eucalyptus.
Affiliation(s)
- Dario Grattapaglia
  - EMBRAPA Genetic Resources and Biotechnology - Estação Parque Biológico, final W5 norte, Brasilia, Brazil
  - Genomic Sciences Program - Universidade Catolica de Brasília- SGAN, 916 modulo B, 70790-160 Brasília - DF, Brazil
- Orzenil B Silva-Junior
  - EMBRAPA Genetic Resources and Biotechnology - Estação Parque Biológico, final W5 norte, Brasilia, Brazil
- Matias Kirst
  - School of Forest Resources and Conservation, Genetics Institute, University of Florida, PO Box 110410, Gainesville, USA
- Bruno Marco de Lima
  - EMBRAPA Genetic Resources and Biotechnology - Estação Parque Biológico, final W5 norte, Brasilia, Brazil
  - Department of Genetics - Universidade de São Paulo - ESALQ/USP - Av. Pádua Dias, 11 - Caixa Postal 9 13418-900 Piracicaba, SP, Brazil
- Danielle A Faria
  - EMBRAPA Genetic Resources and Biotechnology - Estação Parque Biológico, final W5 norte, Brasilia, Brazil
- Georgios J Pappas
  - EMBRAPA Genetic Resources and Biotechnology - Estação Parque Biológico, final W5 norte, Brasilia, Brazil
  - Genomic Sciences Program - Universidade Catolica de Brasília- SGAN, 916 modulo B, 70790-160 Brasília - DF, Brazil
|
20
|
Rao DM, Moler JC, Ozden M, Zhang Y, Liang C, Karro JE. PEACE: Parallel Environment for Assembly and Clustering of Gene Expression. Nucleic Acids Res 2010; 38:W737-42. [PMID: 20522511 PMCID: PMC2896108 DOI: 10.1093/nar/gkq470] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 12/04/2022]
Abstract
We present PEACE, a stand-alone tool for high-throughput ab initio clustering of transcript fragment sequences produced by Next Generation or Sanger Sequencing technologies. It is freely available from www.peace-tools.org. Installed and managed through a downloadable user-friendly graphical user interface (GUI), PEACE can process large data sets of transcript fragments of length 50 bases or greater, grouping the fragments by gene associations with a sensitivity comparable to leading clustering tools. Once clustered, the user can employ the GUI's analysis functions, facilitating the easy collection of statistics and allowing them to single out specific clusters for more comprehensive study or assembly. Using a novel minimum spanning tree-based clustering method, PEACE is the equal of leading tools in the literature, with an interface making it accessible to any user. It produces results of quality virtually identical to those of the WCD tool when applied to Sanger sequences, significantly improved results over WCD and TGICL when applied to the products of Next Generation Sequencing Technology and significantly improved results over Cap3 in both cases. In short, PEACE provides an intuitive GUI and a feature-rich, parallel clustering engine that proves to be a valuable addition to the leading cDNA clustering tools.
Affiliation(s)
- D M Rao
  - Department of Computer Science and Software Engineering, Miami University, Oxford, Ohio 45056, USA
|
21
|
D'Agostino N, Traini A, Frusciante L, Chiusano ML. SolEST database: a "one-stop shop" approach to the study of Solanaceae transcriptomes. BMC Plant Biology 2009; 9:142. [PMID: 19948013 PMCID: PMC2794286 DOI: 10.1186/1471-2229-9-142] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 07/31/2009] [Accepted: 11/30/2009] [Indexed: 05/21/2023]
Abstract
BACKGROUND Since no genome sequences of solanaceous plants have yet been completed, expressed sequence tag (EST) collections represent a reliable tool for broad sampling of Solanaceae transcriptomes, an attractive route for understanding Solanaceae genome functionality and a powerful reference for the structural annotation of emerging Solanaceae genome sequences. DESCRIPTION We describe the SolEST database (http://biosrv.cab.unina.it/solestdb), which integrates different EST datasets from both cultivated and wild Solanaceae species and from two species of the genus Coffea. Background as well as processed data contained in the database, extensively linked to external related resources, represent an invaluable source of information for these plant families. Two novel features differentiate SolEST from other resources: i) the option of accessing and then visualizing Solanaceae EST/TC alignments along the emerging tomato and potato genome sequences; ii) the opportunity to compare different Solanaceae assemblies generated by diverse research groups in the attempt to address a common complaint in the SOL community. CONCLUSION Different databases have been established worldwide for collecting Solanaceae ESTs and are related in concept, content and utility to the one presented herein. However, the SolEST database has several distinguishing features that make it appealing for the research community and facilitate a "one-stop shop" for the study of Solanaceae transcriptomes.
Affiliation(s)
- Nunzio D'Agostino
  - University of Naples 'Federico II', Dept of Soil, Plant, Environmental and Animal Production Sciences, Via Università 100, 80055 Portici, Italy
- Alessandra Traini
  - University of Naples 'Federico II', Dept of Soil, Plant, Environmental and Animal Production Sciences, Via Università 100, 80055 Portici, Italy
- Luigi Frusciante
  - University of Naples 'Federico II', Dept of Soil, Plant, Environmental and Animal Production Sciences, Via Università 100, 80055 Portici, Italy
- Maria Luisa Chiusano
  - University of Naples 'Federico II', Dept of Soil, Plant, Environmental and Animal Production Sciences, Via Università 100, 80055 Portici, Italy
|
22
|
Bragg LM, Stone G. k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage. Bioinformatics 2009; 25:2302-8. [PMID: 19570806 PMCID: PMC2735666 DOI: 10.1093/bioinformatics/btp410] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make achieving the optimal EST clustering a challenge. Single-linkage algorithms are particularly vulnerable to the effects of chimeras. To avoid chimera-facilitated erroneous merges, researchers using single-linkage algorithms are forced to use stringent sequence–similarity thresholds. Such thresholds reduce the sensitivity of the clustering algorithm. Results: We introduce the concept of k-link clustering for EST data. We evaluate how clustering error rates vary over a range of linkage thresholds. Using k-link, we show that Type II error decreases in response to increasing the number of shared ESTs (i.e., links) required. We observe a base level of Type II error likely caused by the presence of unmasked low-complexity or repetitive sequence. We find that Type I error increases gradually with increased linkage. To minimize the Type I error introduced by increased linkage requirements, we propose an extension to k-link which modifies the required number of links with respect to the size of clusters being compared. Availability: k-link is written in C++, is licensed under the GNU General Public License, and can be downloaded from http://www.bioinformatics.csiro.au/products.shtml. Contact: lauren.bragg@csiro.au Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lauren M Bragg
  - CSIRO Mathematical and Information Sciences, North Ryde, NSW, Australia.
|
23
|
Picardi E, Mignone F, Pesole G. EasyCluster: a fast and efficient gene-oriented clustering tool for large-scale transcriptome data. BMC Bioinformatics 2009; 10 Suppl 6:S10. [PMID: 19534735 PMCID: PMC2697633 DOI: 10.1186/1471-2105-10-s6-s10] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 01/31/2023]
Abstract
Background ESTs and full-length cDNAs represent an invaluable source of evidence for inferring reliable gene structures and discovering potential alternative splicing events. In newly sequenced genomes, these tasks may not be practicable owing to the lack of appropriate training sets. However, when expression data are available, they can be used to build EST clusters related to specific genomic transcribed loci. Common strategies recently employed to this end are based on sequence similarity between transcripts and can lead, in specific conditions, to inconsistent and erroneous clustering. In order to improve the cluster building and facilitate all downstream annotation analyses, we developed a simple genome-based methodology to generate gene-oriented clusters of ESTs when a genomic sequence and a pool of related expressed sequences are provided. Our procedure has been implemented in the software EasyCluster and takes into account the spliced nature of ESTs after an ad hoc genomic mapping. Methods EasyCluster uses the well-known GMAP program in order to perform a very quick EST-to-genome mapping in addition to the detection of reliable splice sites. Given a genomic sequence and a pool of ESTs/FL-cDNAs, EasyCluster starts building genomic and EST local databases and runs GMAP. Subsequently, it parses results creating an initial collection of pseudo-clusters by grouping ESTs according to the overlap of their genomic coordinates on the same strand. In the final step, EasyCluster refines the clustering by again running GMAP on each pseudo-cluster and groups together ESTs sharing at least one splice site. Results The higher accuracy of EasyCluster with respect to other clustering tools has been verified by means of a manually curated benchmark of human EST clusters.
Additional datasets including the Unigene cluster Hs.122986 and ESTs related to the human HOXA gene family have also been used to demonstrate the better clustering capability of EasyCluster over current genome-based web service tools such as ASmodeler and BIPASS. EasyCluster has also been used to provide a first compilation of gene-oriented clusters in the Ricinus communis oilseed plant for which no Unigene clusters are yet available, as well as an evaluation of the alternative splicing in this plant species.
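The pseudo-cluster step described in the Methods (grouping ESTs whose genomic coordinates overlap on the same strand) amounts to merging overlapping intervals per strand. The sketch below illustrates only that step, under the assumption that each mapped EST is reduced to a simplified, hypothetical record `(est_id, strand, start, end)`; it does not run GMAP or perform the later splice-site refinement, and it is not EasyCluster's own code.

```python
from collections import defaultdict

def pseudo_clusters(alignments):
    """Group ESTs whose genomic spans overlap on the same strand.

    alignments: iterable of (est_id, strand, start, end) with start <= end,
    in inclusive genomic coordinates (a hypothetical, simplified record).
    """
    by_strand = defaultdict(list)
    for est_id, strand, start, end in alignments:
        by_strand[strand].append((start, end, est_id))

    clusters = []
    for strand in sorted(by_strand):
        current, current_end = [], None
        # Sweep left to right; an EST joins the open cluster if it starts
        # before the furthest end seen so far in that cluster.
        for start, end, est_id in sorted(by_strand[strand]):
            if current and start <= current_end:
                current.append(est_id)
                current_end = max(current_end, end)
            else:
                if current:
                    clusters.append((strand, current))
                current, current_end = [est_id], end
        if current:
            clusters.append((strand, current))
    return clusters
```

For example, two ESTs spanning 100-200 and 150-300 on the plus strand fall into one pseudo-cluster, while an EST at 120-180 on the minus strand stays separate because overlap is only evaluated within a strand.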
Affiliation(s)
- Ernesto Picardi
  - Dipartimento di Biochimica e Biologia Molecolare E, Quagliariello, Università degli Studi di Bari, 70126 Bari, Italy.
|
24
|
|