1
Li R, He X, Dai C, Zhu H, Lang X, Chen W, Li X, Zhao D, Zhang Y, Han X, Niu T, Zhao Y, Cao R, He R, Lu Z, Chi X, Li W, Niu B. Gclust: A Parallel Clustering Tool for Microbial Genomic Data. Genomics Proteomics Bioinformatics 2020; 17:496-502. [PMID: 31917259 PMCID: PMC7056916 DOI: 10.1016/j.gpb.2018.10.008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/07/2018] [Revised: 05/29/2018] [Accepted: 10/23/2018] [Indexed: 11/12/2022]
Abstract
The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses such resources. Building databases for non-redundant reference sequences from massive microbial genomic data based on clustering analysis is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, where clustering is accelerated with a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated based on their maximal exact matches (MEMs). We demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
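As an illustration of an identity measure built on maximal exact matches, the following is a minimal brute-force sketch in Python. It is not Gclust's implementation: Gclust finds MEMs with sparse suffix arrays, and the abstract does not give its identity formula. The function names, the `min_len` parameter, and the "fraction of positions covered by MEMs" measure are assumptions made here for illustration.

```python
def maximal_exact_matches(a: str, b: str, min_len: int = 4):
    """Return (start_in_a, start_in_b, length) for each maximal exact match.

    Brute force, for illustration only (hypothetical helper, not Gclust's API).
    A match is maximal when it cannot be extended to the left or to the right.
    """
    mems = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right until the first mismatch or a sequence end.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                mems.append((i, j, k))
    return mems


def mem_identity(a: str, b: str, min_len: int = 4) -> float:
    """Fraction of positions in `a` covered by at least one MEM shared with `b`.

    This coverage ratio is an assumed, simplified stand-in for an identity measure.
    """
    if not a:
        return 0.0
    covered = [False] * len(a)
    for start, _, length in maximal_exact_matches(a, b, min_len):
        for pos in range(start, start + length):
            covered[pos] = True
    return sum(covered) / len(a)


if __name__ == "__main__":
    print(maximal_exact_matches("ACGTACGTTT", "ACGTACGAAA"))
    print(mem_identity("ACGTACGTTT", "ACGTACGAAA"))
```

The quadratic scan is only practical for short strings; an index such as a suffix array is what makes this kind of comparison feasible on whole genomes.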
Affiliation(s)
- Ruilin Li
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Xiaoyu He
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Chuangchuang Dai
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Haidong Zhu
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Xianyu Lang
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- Wei Chen
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Xiaodong Li
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Dan Zhao
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Yu Zhang
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Xinyin Han
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Tie Niu
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- Yi Zhao
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- Rongqiang Cao
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- Rong He
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- Zhonghua Lu
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- Xuebin Chi
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China; Center of Scientific Computing Applications & Research, Chinese Academy of Sciences, Beijing 100190, China
- Weizhong Li
- J. Craig Venter Institute, La Jolla, CA 92037, USA.
- Beifang Niu
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China; Guizhou University School of Medicine, Guiyang 550025, China.
2
Bragina EY, Tiys ES, Freidin MB, Koneva LA, Demenkov PS, Ivanisenko VA, Kolchanov NA, Puzyrev VP. Insights into pathophysiology of dystropy through the analysis of gene networks: an example of bronchial asthma and tuberculosis. Immunogenetics 2014; 66:457-65. [PMID: 24954693 DOI: 10.1007/s00251-014-0786-1] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 10/02/2013] [Accepted: 06/12/2014] [Indexed: 01/18/2023]
Abstract
Co-existence of bronchial asthma (BA) and tuberculosis (TB) is extremely uncommon (dystropic). We assume that this is caused by the interplay between genes involved in specific pathophysiological pathways that arrest simultaneous manifestation of BA and TB. Identification of common and specific genes may be important to determine the molecular genetic mechanisms leading to the rare co-occurrence of these diseases and may contribute to the identification of susceptibility genes for each of these dystropic diseases. To address the issue, we propose a new methodological strategy that is based on reconstruction of associative networks that represent molecular relationships between proteins/genes associated with BA and TB, thus facilitating a better understanding of the biological context of antagonistic relationships between the diseases. The results of our study revealed a number of proteins/genes important for the development of both BA and TB.
Affiliation(s)
- Elena Yu Bragina
- Laboratory of Population Genetics, Research Institute of Medical Genetics, Siberian Branch of Russian Academy of Medical Sciences, Nabereznaya Ushaiki str. 10, Tomsk, Russian Federation, 634050,
3
Herrera-Galeano JE, Hirschberg DL, Mokashi V, Solka J. OGA: an ontological tool of human phenotypes with genetic associations. BMC Res Notes 2013; 6:511. [PMID: 24308566 PMCID: PMC4234991 DOI: 10.1186/1756-0500-6-511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2013] [Accepted: 11/28/2013] [Indexed: 11/29/2022] Open
Abstract
Background The availability of genetic data has increased dramatically in recent years. The greatest value of this data is its potential for personalized medicine. Many new associations are reported every day from Genome Wide Association Studies (GWAS). However, robust, reproducible associations are elusive for some complex diseases. Ontologies present a potential way to distinguish between spurious associations and those with a potential influence on the phenotype. Such an approach would be based on finding associations of the same genetic variant with closely related, but distinct, phenotypes. This approach can be accomplished with a phenotype ontology that also holds genetic association data. Results Here, we report a structured knowledge application to navigate and to facilitate the discovery of relationships between different phenotypes and their genetic associations. Conclusions OGA allows users to (1) find the intersecting set of genes for phenotypes of interest and (2) find empirical p values for such observations. In addition, (3) OGA outperforms similar applications in the number of total concepts and genes mapped.
4
Ng KH, Ho CK, Phon-Amnuaisuk S. A hybrid distance measure for clustering expressed sequence tags originating from the same gene family. PLoS One 2012; 7:e47216. [PMID: 23071763 PMCID: PMC3469558 DOI: 10.1371/journal.pone.0047216] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 06/15/2012] [Accepted: 09/10/2012] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND Clustering is a key step in the processing of Expressed Sequence Tags (ESTs). The primary goal of clustering is to put ESTs from the same transcript of a single gene into a unique cluster. Recent EST clustering algorithms mostly adopt the alignment-free distance measures, where they tend to yield acceptable clustering accuracies with reasonable computational time. Despite the fact that these clustering methods work satisfactorily on a majority of the EST datasets, they have a common weakness. They are prone to deliver unsatisfactory clustering results when dealing with ESTs from the genes derived from the same family. The root cause is the distance measures applied on them are not sensitive enough to separate these closely related genes. METHODOLOGY/PRINCIPAL FINDINGS We propose a hybrid distance measure that combines the global and local features extracted from ESTs, with the aim to address the clustering problem faced by ESTs derived from the same gene family. The clustering process is implemented using the DBSCAN algorithm. We test the hybrid distance measure on the ten EST datasets, and the clustering results are compared with the two alignment-free EST clustering tools, i.e. wcd and PEACE. The clustering results indicate that the proposed hybrid distance measure performs relatively better (in terms of clustering accuracy) than both EST clustering tools. CONCLUSIONS/SIGNIFICANCE The clustering results provide support for the effectiveness of the proposed hybrid distance measure in solving the clustering problem for ESTs that originate from the same gene family. The improvement of clustering accuracies on the experimental datasets has supported the claim that the sensitivity of the hybrid distance measure is sufficient to solve the clustering problem.
Affiliation(s)
- Keng-Hoong Ng
- Faculty of Computing and Informatics, Multimedia University, Cyberjaya, Malaysia.
5
Abstract
The rapid advances of high-throughput sequencing technologies have dramatically promoted metagenomic studies of microbial communities that exist in various environments. Fundamental questions in metagenomics include the identities, composition and dynamics of microbial populations and their functions and interactions. However, the massive quantity and the comprehensive complexity of these sequence data pose tremendous challenges in data analysis. These challenges include but are not limited to ever-increasing computational demand, biased sequence sampling, sequence errors, sequence artifacts and novel sequences. Sequence clustering methods can directly answer many of the fundamental questions by grouping similar sequences into families. In addition, clustering analysis also addresses the challenges in metagenomics. Thus, a large redundant data set can be represented with a small non-redundant set, where each cluster can be represented by a single entry or a consensus. Artifacts can be rapidly detected through clustering. Errors can be identified, filtered or corrected by using consensus from sequences within clusters.
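The idea of representing a cluster by a consensus and using it to flag likely errors can be sketched as follows. This is a minimal Python illustration that assumes the cluster members are already aligned to equal length. The function names `consensus` and `mismatches` are hypothetical, and a simple per-column majority vote stands in for whatever consensus method a real tool would use.

```python
from collections import Counter


def consensus(aligned: list[str]) -> str:
    """Per-column majority vote over equal-length, pre-aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be non-empty in number and equal in length")
    # zip(*aligned) iterates over alignment columns.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def mismatches(seq: str, cons: str) -> list[int]:
    """Positions where a cluster member disagrees with the consensus."""
    return [i for i, (x, y) in enumerate(zip(seq, cons)) if x != y]


if __name__ == "__main__":
    cluster = ["ACGT", "ACGA", "ACGT"]
    cons = consensus(cluster)
    for member in cluster:
        print(member, mismatches(member, cons))
```

Positions reported by `mismatches` are candidates for sequencing errors only under the assumption that the cluster is deep enough for the majority base to be trustworthy.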
Affiliation(s)
- Weizhong Li
- Center for Research in Biological Systems, University of California San Diego, USA.
6
Abstract
MOTIVATION Second-generation sequencing technology has reinvigorated research using expression data, and clustering such data remains a significant challenge, with much larger datasets and with different error profiles. Algorithms that rely on all-versus-all comparison of sequences are not practical for large datasets. RESULTS We introduce a new filter for string similarity which has the potential to eliminate the need for all-versus-all comparison in clustering of expression data and other similar tasks. Our filter is based on multiple long exact matches between the two strings, with the additional constraint that these matches must be sufficiently far apart. We give details of its efficient implementation using modified suffix arrays. We demonstrate its efficiency by presenting our new expression clustering tool, wcd-express, which uses this heuristic. We compare it to other current tools and show that it is very competitive both with respect to quality and run time. AVAILABILITY Source code and binaries available under GPL at http://code.google.com/p/wcdest. Runs on Linux and MacOS X. CONTACT scott.hazelhurst@wits.ac.za; zsuzsa@cebitec.uni-bielefeld.de SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
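The filter idea in this abstract (several long exact matches that must lie sufficiently far apart) can be illustrated with a minimal Python sketch. It uses a hash set of k-mers rather than the modified suffix arrays of wcd-express, and the function name `passes_filter` and the parameters `k`, `min_gap`, and `required` are hypothetical values chosen for illustration, not the tool's actual settings.

```python
def passes_filter(s: str, t: str, k: int = 8, min_gap: int = 16, required: int = 2) -> bool:
    """Return True if `s` shares at least `required` exact k-mers with `t`
    whose start positions in `s` are at least `min_gap` apart.

    Only pairs that pass this cheap test would go on to a full comparison.
    """
    # Index every k-mer of t (assumption: a set lookup replaces the suffix array).
    t_kmers = {t[j:j + k] for j in range(len(t) - k + 1)}
    chosen = 0
    last = None
    # Greedily take the earliest qualifying match, then skip ahead by min_gap.
    for i in range(len(s) - k + 1):
        if s[i:i + k] in t_kmers and (last is None or i - last >= min_gap):
            chosen += 1
            last = i
            if chosen >= required:
                return True
    return False


if __name__ == "__main__":
    s = "ACGTTGCA" + "N" * 20 + "GGATCCAA"
    t = "GGATCCAA" + "TT" + "ACGTTGCA"
    print(passes_filter(s, t))
```

Requiring separated matches is what keeps a single shared repeat or low-complexity stretch from triggering an expensive comparison.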
Affiliation(s)
- Scott Hazelhurst
- Wits Bioinformatics, School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa.
7
Rashidi P, Cook DJ, Holder LB, Schmitter-Edgecombe M. Discovering Activities to Recognize and Track in a Smart Environment. IEEE Trans Knowl Data Eng 2011; 23:527-539. [PMID: 21617742 PMCID: PMC3100559 DOI: 10.1109/tkde.2010.148] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.2] [Indexed: 05/08/2023]
Abstract
The machine learning and pervasive sensing technologies found in smart homes offer unprecedented opportunities for providing health monitoring and assistance to individuals experiencing difficulties living independently at home. In order to monitor the functional health of smart home residents, we need to design technologies that recognize and track activities that people normally perform as part of their daily routines. Although approaches do exist for recognizing activities, the approaches are applied to activities that have been pre-selected and for which labeled training data is available. In contrast, we introduce an automated approach to activity tracking that identifies frequent activities that naturally occur in an individual's routine. With this capability we can then track the occurrence of regular activities to monitor functional health and to detect changes in an individual's patterns and lifestyle. In this paper we describe our activity mining and tracking approach and validate our algorithms on data collected in physical smart environments.
Affiliation(s)
- Parisa Rashidi
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99163
- Diane J. Cook
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99163
- Lawrence B. Holder
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99163
8
Abstract
We present PEACE, a stand-alone tool for high-throughput ab initio clustering of transcript fragment sequences produced by Next Generation or Sanger Sequencing technologies. It is freely available from www.peace-tools.org. Installed and managed through a downloadable user-friendly graphical user interface (GUI), PEACE can process large data sets of transcript fragments of length 50 bases or greater, grouping the fragments by gene associations with a sensitivity comparable to leading clustering tools. Once clustered, the user can employ the GUI's analysis functions, facilitating the easy collection of statistics and allowing them to single out specific clusters for more comprehensive study or assembly. Using a novel minimum spanning tree-based clustering method, PEACE is the equal of leading tools in the literature, with an interface making it accessible to any user. It produces results of quality virtually identical to those of the WCD tool when applied to Sanger sequences, significantly improved results over WCD and TGICL when applied to the products of Next Generation Sequencing Technology and significantly improved results over Cap3 in both cases. In short, PEACE provides an intuitive GUI and a feature-rich, parallel clustering engine that proves to be a valuable addition to the leading cDNA clustering tools.
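A minimum spanning tree-based clustering of the general kind mentioned in this abstract can be sketched as below. This is a generic Python illustration, not PEACE's algorithm: the abstract does not describe its distance function or how the tree is partitioned, so the function name `mst_clusters`, the user-supplied `dist` callable, and the fixed `threshold` for cutting edges are all assumptions.

```python
def mst_clusters(items, dist, threshold):
    """Build a minimum spanning tree with Prim's algorithm, drop edges longer
    than `threshold`, and return the remaining connected components."""
    n = len(items)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known edge connecting each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree with the cheapest connecting edge.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = dist(items[u], items[v])
                if d < best[v]:
                    best[v], parent[v] = d, u

    # Union-find over the MST edges that survive the cut.
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for u, v, d in edges:
        if d <= threshold:
            root[find(u)] = find(v)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(items[i])
    return list(groups.values())


if __name__ == "__main__":
    # Toy data: numbers with absolute difference as the distance.
    print(mst_clusters([1, 2, 3, 10, 11, 20], lambda a, b: abs(a - b), 2))
```

For sequence data, `dist` would be replaced by a sequence distance; the toy numeric example only shows how cutting long tree edges separates the groups.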
Affiliation(s)
- D M Rao
- Department of Computer Science and Software Engineering, Miami University, Oxford, Ohio 45056, USA
9
Patel S, Malde K, Lanzén A, Olsen RH, Nerland AH. Identification of immune related genes in Atlantic halibut (Hippoglossus hippoglossus L.) following in vivo antigenic and in vitro mitogenic stimulation. Fish Shellfish Immunol 2009; 27:729-738. [PMID: 19751833 DOI: 10.1016/j.fsi.2009.09.008] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 05/02/2009] [Revised: 09/03/2009] [Accepted: 09/03/2009] [Indexed: 05/28/2023]
Abstract
To identify and characterize genes and proteins of the Atlantic halibut (Hippoglossus hippoglossus) immune system, six cDNA libraries were constructed from liver, kidney, spleen, peripheral blood, and thymus. Halibut were injected with nodavirus, infectious pancreatic necrosis virus (IPNV), or vibriosis vaccine and tissue samples were collected at various time points. Leukocytes from peripheral blood and spleen from stimulated and mock-injected fish were isolated and further in vitro activated with the mitogens, concanavalin A (Con A) and phorbol myristate acetate (PMA) to facilitate activation and proliferation. A total of 5117 high quality expressed sequence tags (ESTs) were identified and assembled into 781 contigs and 2796 singletons. Amongst these ESTs, 147 different putative immune related genes were identified. Several genes involved in innate and adaptive immune responses such as complement proteins, immunoglobulins, cell surface receptors, and cytokines and chemokines were identified. Of the immune related genes identified in this study, 44% had no match against any of the publicly available sequence data for halibut and thus can be considered as novel identification in halibut species. The approach of combining in vivo antigenic with in vitro mitogen stimulation, in addition to preparation of cDNA libraries from thymus enabled identification of many of the interesting genes including those involved in T-cell receptor complex.
Affiliation(s)
- Sonal Patel
- Institute of Marine Research (IMR), Bergen, Norway.
10
van Hooff SR, Koster J, Hulsen T, van Schaik BDC, Roos M, van Batenburg MF, Versteeg R, van Kampen AHC. The construction of genome-based transcriptional units. OMICS 2009; 13:105-14. [PMID: 19320556 DOI: 10.1089/omi.2008.0036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Abstract
Gene-oriented sequence clusters (transcriptional units) have found many applications in genomics research including the construction of transcriptome maps and identification of splice variants. We developed a new method to construct transcriptional units that uses the genomic sequence as a template. We present and discuss our method in detail together with an evaluation of the transcriptional units for human. We constructed 33,007 and 27,792 transcriptional units for human and mouse, respectively. The sensitivity (81%) and specificity (90%) of our method compare favorably to other established methods. We evaluated the representation of experimentally validated and predicted intergenic spliced transcripts in humans and show that we correctly represent a large fraction of these cases by single transcriptional units. Our method performs well, but the evaluation of the final set of transcriptional units shows that improvements to the algorithm are still possible. However, because the precise number and types of errors are difficult to track, it is not obvious how to significantly improve the algorithm. We believe that ongoing research efforts are necessary to further improve current methods. This should include detailed documentation, comparison, and evaluation of current methods.
Affiliation(s)
- Sander R van Hooff
- Bioinformatics Laboratory, Academic Medical Center, Meibergdreef 9, Amsterdam, The Netherlands
11
Abstract
Summary: The wcd system is an open source tool for clustering expressed sequence tags (EST) and other DNA and RNA sequences. wcd allows efficient all-versus-all comparison of ESTs using either the d2 distance function or edit distance, improving existing implementations of d2. It supports merging, refinement and reclustering of clusters. It is ‘drop in’ compatible with the StackPack clustering package. wcd supports parallelization under both shared memory and cluster architectures. It is distributed with an EMBOSS wrapper allowing wcd to be installed as part of an EMBOSS installation (and so provided by a web server). Availability: wcd is distributed under a GPL licence and is available from http://code.google.com/p/wcdest Contact: scott.hazelhurst@wits.ac.za Supplementary information: Additional experimental results. The wcd manual, a companion paper describing underlying algorithms, and all datasets used for experimentation can also be found at www.bioinf.wits.ac.za/~scott/wcdsupp.html
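For readers unfamiliar with the d2 distance function, the following is a naive Python sketch of the commonly cited formulation: the minimum, over all pairs of fixed-length windows from the two sequences, of the squared Euclidean distance between their word-count vectors. This is an assumption about the definition rather than a description of wcd's code, which the abstract says improves on existing d2 implementations. The function names and the small default `k` and `window` values are hypothetical, chosen so the example runs on short strings.

```python
from collections import Counter


def word_counts(s: str, k: int) -> Counter:
    """Count every overlapping word of length k in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def window_distance(u: str, v: str, k: int) -> int:
    """Squared Euclidean distance between the word-count vectors of two windows."""
    cu, cv = word_counts(u, k), word_counts(v, k)
    # A Counter returns 0 for words it has not seen.
    return sum((cu[w] - cv[w]) ** 2 for w in cu.keys() | cv.keys())


def d2(s: str, t: str, k: int = 3, window: int = 8) -> int:
    """Minimum window_distance over all window pairs (naive, no incremental update)."""
    ws = [s[i:i + window] for i in range(max(1, len(s) - window + 1))]
    wt = [t[j:j + window] for j in range(max(1, len(t) - window + 1))]
    return min(window_distance(u, v, k) for u in ws for v in wt)


if __name__ == "__main__":
    print(d2("ACGTACGTAA", "TTACGTACGT"))
    print(d2("AAAAAAAA", "CCCCCCCC"))
```

Recomputing the counts for every window pair is wasteful; practical implementations update the counts incrementally as the windows slide.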
Affiliation(s)
- Scott Hazelhurst
- Wits Bioinformatics, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa.
12
Poddar A, Chandra N, Ganapathiraju M, Sekar K, Klein-Seetharaman J, Reddy R, Balakrishnan N. Evolutionary insights from suffix array-based genome sequence analysis. J Biosci 2007; 32:871-81. [PMID: 17914229 DOI: 10.1007/s12038-007-0087-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
Abstract
Gene and protein sequence analyses, central components of studies in modern biology, are easily amenable to string matching and pattern recognition algorithms. The growing need to analyse whole genome sequences more efficiently and thoroughly has led to the emergence of new computational methods. Suffix trees and suffix arrays are data structures that are well known in many other areas and are highly suited for sequence analysis too. Here we report an improvement to the design of construction of suffix arrays. Enhancement in versatility and scalability, enabled by this approach, is demonstrated through the use of real-life examples. The scalability of the algorithm to whole genomes renders it suitable to address many biologically interesting problems. One example is the evolutionary insight gained by analysing unigrams, bi-grams and higher n-grams, indicating that the genetic code has a direct influence on the overall composition of the genome. Further, different proteomes have been analysed for the coverage of the possible peptide space, which indicate that as much as a quarter of the total space at the tetra-peptide level is left un-sampled in prokaryotic organisms, although almost all tri-peptides can be seen in one protein or another in a proteome. Besides, distinct patterns begin to emerge for the counts of particular tetra and higher peptides, indicative of a 'meaning' for tetra and higher n-grams. The toolkit has also been used to demonstrate the usefulness of identifying repeats in whole proteomes efficiently. As an example, 16 members of one COG, coded by the genome of Mycobacterium tuberculosis H37Rv, have been found to contain a repeating sequence of 300 amino acids.
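To make the role of suffix arrays in n-gram counting concrete, here is a minimal Python sketch. It builds the array by plain sorting of suffixes, which is far less efficient than the improved construction the abstract reports and is not taken from that work. The function names `suffix_array` and `count_occurrences` are hypothetical.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of `text`, in lexicographic order.

    Naive construction by sorting suffix strings; fine for short examples only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text: str, sa: list[int], pattern: str) -> int:
    """Count occurrences of `pattern` (an n-gram) with two binary searches.

    Suffixes that start with `pattern` form one contiguous block in `sa`.
    """
    m = len(pattern)

    def prefix(idx: int) -> str:
        return text[sa[idx]:sa[idx] + m]

    # First index whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # First index whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(sa)
    print(count_occurrences(text, sa, "ana"))
```

Once the array exists, the count of any n-gram costs two binary searches, which is what makes exhaustive unigram, bi-gram, and higher n-gram surveys tractable.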
Affiliation(s)
- Anindya Poddar
- Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore 560 012, India
13
Abstract
We present a novel approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared to only the union-sequences representing each cluster; cluster members are then only reconstructed and aligned if the union-sequence achieves a sufficiently high score. Using this approach with BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our method for clustering that uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new Open Source version of BLAST (available from http://www.fsa-blast.org/). As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
Affiliation(s)
- Michael Cameron
- School of Computer Science and Information Technology, RMIT University, Melbourne, Australia.
14
Abstract
MOTIVATION Repeat sequences in ESTs are a source of problems, in particular for clustering. ESTs are therefore commonly masked against a library of known repeats. High quality repeat libraries are available for the widely studied organisms, but for most other organisms the lack of such libraries is likely to compromise the quality of EST analysis. RESULTS We present a fast, flexible and library-less method for masking repeats in EST sequences, based on match statistics within the EST collection. The method is not linked to a particular clustering algorithm. Extensive testing on datasets using different clustering methods and a genomic mapping as reference shows that this method gives results that are better than or as good as those obtained using RepeatMasker with a repeat library. AVAILABILITY The implementation of RBR is available under the terms of the GPL from http://www.ii.uib.no/~ketil/bioinformatics CONTACT ketil.malde@bccs.uib.no SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ketil Malde
- Computational Biology Unit, Bergen Centre for Computational Sciences, University of Bergen, Norway.
15
Abstract
Background The continuous flow of EST data remains one of the richest sources for discoveries in modern biology. The first step in EST data mining is usually associated with EST clustering, the process of grouping of original fragments according to their annotation, similarity to known genomic DNA or each other. Clustered EST data, accumulated in databases such as UniGene, STACK and TIGR Gene Indices have proven to be crucial in research areas from gene discovery to regulation of gene expression. Results We have developed a new nucleotide sequence matching algorithm and its implementation for clustering EST sequences. The program is based on the original CLU match detection algorithm, which has improved performance over the widely used d2_cluster. The CLU algorithm automatically ignores low-complexity regions like poly-tracts and short tandem repeats. Conclusion CLU represents a new generation of EST clustering algorithm with improved performance over current approaches. An early implementation can be applied in small and medium-size projects. The CLU program is available on an open source basis free of charge. It can be downloaded from
Affiliation(s)
- Andrey Ptitsyn
- Pennington Biomedical Research Center, 6400 Perkins Rd. Baton Rouge LA 70808
- Winston Hide
- South African National Bioinformatics Institute, P/b X17 UWC SANBI Bellville 7535
16
Schneeberger K, Malde K, Coward E, Jonassen I. Masking repeats while clustering ESTs. Nucleic Acids Res 2005; 33:2176-80. [PMID: 15831790 PMCID: PMC1079970 DOI: 10.1093/nar/gki511] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 12/20/2004] [Revised: 03/10/2005] [Accepted: 03/28/2005] [Indexed: 11/15/2022] Open
Abstract
A problem in EST clustering is the presence of repeat sequences. To avoid false matches, repeats have to be masked. This can be a time-consuming process, and it depends on available repeat libraries. We present a fast and effective method that aims to eliminate the problems repeats cause in the process of clustering. Unlike traditional methods, repeats are inferred directly from the EST data; we do not rely on any external library of known repeats. This makes the method especially suitable for analysing the ESTs from organisms without good repeat libraries. We demonstrate that the result is very similar to performing standard repeat masking before clustering.
Affiliation(s)
- Ketil Malde
- Department of Informatics, University of Bergen, Bergen, Norway
- Eivind Coward
- Department of Informatics, University of Bergen, Bergen, Norway
- Inge Jonassen
- Computational Biology Unit, University of Bergen, Bergen, Norway
- Department of Informatics, University of Bergen, Bergen, Norway
17
Abstract
MOTIVATION EST sequences constitute an abundant, yet error prone resource for computational biology. Expressed sequences are important in gene discovery and identification, and they are also crucial for the discovery and classification of alternative splicing. An important challenge when processing EST sequences is the reconstruction of mRNA by assembling EST clusters into consensus sequences. RESULTS In contrast to the more established assembly tools, we propose an algorithm that constructs a graph over sequence fragments of fixed size, and produces consensus sequences as traversals of this graph. We provide a tool implementing this algorithm, and perform an experiment where the consensus sequences produced by our implementation, as well as by currently available tools, are compared to mRNA. The results show that our proposed algorithm in a majority of the cases produces consensus of higher quality than the established sequence assemblers and at a competitive speed. AVAILABILITY The source code for the implementation is available under a GPL license from http://www.ii.uib.no/~ketil/bioinformatics/ CONTACT ketil@ii.uib.no.
Affiliation(s)
- Ketil Malde
- Department of Informatics, University of Bergen, Norway.
18
Flikka K, Yadetie F, Laegreid A, Jonassen I. XHM: a system for detection of potential cross hybridizations in DNA microarrays. BMC Bioinformatics 2004; 5:117. [PMID: 15333145 PMCID: PMC517492 DOI: 10.1186/1471-2105-5-117] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.0] [Received: 05/14/2004] [Accepted: 08/27/2004] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Microarrays have emerged as the preferred platform for high throughput gene expression analysis. Cross-hybridization among genes with high sequence similarities can be a source of error reducing the reliability of DNA microarray results. RESULTS We have developed a tool called XHM (cross hybridization on microarrays) for assessment of the reliability of hybridization signals by detecting potential cross-hybridizations on DNA microarrays. This is done by comparing the sequences of the probes against an extensive database representing the transcriptome of the organism in question. XHM is available online at http://www.bioinfo.no/tools/xhm/. CONCLUSIONS Using XHM with its user-adjustable parameters will enable scientists to check their lists of differentially expressed genes from microarray experiments for potential cross-hybridizations. This provides information that may be useful in the validation of the microarray results.
Affiliation(s)
- Kristian Flikka
- Computational Biology Unit, Bergen Center for Computational Science, UNIFOB/UiB, Thormoehlensgt.55, N-5008 Bergen, Norway
- Fekadu Yadetie
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, NO-7489 Trondheim, Norway
- Sars International Centre for Marine Molecular Biology, Bergen High Technology Centre, Thormoehlensgt. 55, N-5008 Bergen, Norway
- Astrid Laegreid
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, NO-7489 Trondheim, Norway
- Inge Jonassen
- Computational Biology Unit, Bergen Center for Computational Science, UNIFOB/UiB, Thormoehlensgt.55, N-5008 Bergen, Norway
- Department of Informatics, University of Bergen, PB. 7800, N-5020 Bergen, Norway