1
Lin HN, Hsu WL. DART: a fast and accurate RNA-seq mapper with a partitioning strategy. Bioinformatics 2018; 34:190-197. [PMID: 28968831 PMCID: PMC5860201 DOI: 10.1093/bioinformatics/btx558] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 05/02/2017] [Revised: 08/29/2017] [Accepted: 09/03/2017] [Indexed: 01/13/2023] Open
Abstract
MOTIVATION In recent years, massively parallel cDNA sequencing (RNA-Seq) technologies have become a powerful tool, providing high-resolution measurement of expression and high sensitivity in detecting low-abundance transcripts. However, analyzing RNA-seq data requires a huge computational effort. The fundamental and critical step is to align each sequence fragment against the reference genome. Various de novo spliced RNA aligners have been developed in recent years. Although these aligners can handle spliced alignment and detect splice junctions, some challenges remain. With the advances in sequencing technologies and the ongoing collection of sequencing data in the ENCODE project, more efficient alignment algorithms are in high demand. Most read mappers follow the conventional seed-and-extend strategy to deal with inexact matches in sequence alignment. However, the extension step is much more time-consuming than the seeding step. RESULTS We propose a novel RNA-seq de novo mapping algorithm, called DART, which adopts a partitioning strategy to avoid the extension step. Experimental results on synthetic datasets and real NGS datasets show that DART is a highly efficient aligner that yields the highest or comparable sensitivity and accuracy compared to most state-of-the-art aligners and, more importantly, spends the least amount of time among the selected aligners. AVAILABILITY AND IMPLEMENTATION https://github.com/hsinnan75/DART. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hsin-Nan Lin
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Wen-Lian Hsu
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
2
Khelik K, Lagesen K, Sandve GK, Rognes T, Nederbragt AJ. NucDiff: in-depth characterization and annotation of differences between two sets of DNA sequences. BMC Bioinformatics 2017; 18:338. [PMID: 28701187 PMCID: PMC5508607 DOI: 10.1186/s12859-017-1748-z] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 09/23/2016] [Accepted: 07/04/2017] [Indexed: 12/05/2022] Open
Abstract
Background Comparing sets of sequences is a situation frequently encountered in bioinformatics, examples being comparing an assembly to a reference genome, or two genomes to each other. The purpose of the comparison is usually to find where the two sets differ, e.g. to find where a subsequence is repeated or deleted, or where insertions have been introduced. Such comparisons can be done using whole-genome alignments. Several tools for making such alignments exist, but none of them 1) provides detailed information about the types and locations of all differences between the two sets of sequences, 2) enables visualisation of alignment results at different levels of detail, and 3) carefully takes genomic repeats into consideration. Results We here present NucDiff, a tool aimed at locating and categorizing differences between two sets of closely related DNA sequences. NucDiff is able to deal with very fragmented genomes, repeated sequences, and various local differences and structural rearrangements. NucDiff determines differences by a rigorous analysis of alignment results obtained by the NUCmer, delta-filter and show-snps programs in the MUMmer sequence alignment package. All differences found are categorized according to a carefully defined classification scheme covering all possible differences between two sequences. Information about the differences is made available as GFF3 files, thus enabling visualisation using genome browsers as well as usage of the results as a component in an analysis pipeline. NucDiff was tested with varying parameters for the alignment step and compared with existing alternatives, called QUAST and dnadiff. Conclusions We have developed a whole genome alignment difference classification scheme together with the program NucDiff for finding such differences. The proposed classification scheme is comprehensive and can be used by other tools. NucDiff performs comparably to QUAST and dnadiff but gives much more detailed results that can easily be visualized. NucDiff is freely available on https://github.com/uio-cels/NucDiff under the MPL license. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1748-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Ksenia Khelik
  - Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Karin Lagesen
  - Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
  - Norwegian Veterinary Institute, PO Box 750 Sentrum, 0106, Oslo, Norway
- Geir Kjetil Sandve
  - Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Torbjørn Rognes
  - Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
  - Department of Microbiology, Oslo University Hospital, Rikshospitalet, PO Box 4950 Nydalen, 0424, Oslo, Norway
- Alexander Johan Nederbragt
  - Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, PO Box 1066 Blindern, 0316, Oslo, Norway
3
Khiste N, Ilie L. E-MEM: efficient computation of maximal exact matches for very large genomes. Bioinformatics 2014; 31:509-14. [DOI: 10.1093/bioinformatics/btu687] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 01/21/2023] Open
4
Heuristic alignment methods. Methods in Molecular Biology (Clifton, N.J.) 2013; 1079:29-43. [PMID: 24170393 DOI: 10.1007/978-1-62703-646-7_2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/26/2022]
Abstract
Computation of a multiple sequence alignment (MSA) is usually formulated as a combinatorial optimization problem over an objective function. Solving the problem for virtually all sensible objective functions is known to be NP-complete, implying that some heuristics must be adopted. Several general strategies have proven effective in obtaining accurate MSAs at reasonable computational cost. This chapter is devoted to a brief summary of the most successful heuristic approaches.
5
Abstract
Motivation: The explosive growth of next-generation sequencing datasets poses a challenge to the mapping of reads to reference genomes in terms of alignment quality and execution speed. With the continuing progress of high-throughput sequencing technologies, read length is constantly increasing and many existing aligners are becoming inefficient as generated reads grow larger. Results: We present CUSHAW2, a parallelized, accurate, and memory-efficient long read aligner. Our aligner is based on the seed-and-extend approach and uses maximal exact matches as seeds to find gapped alignments. We have evaluated and compared CUSHAW2 to the three other long read aligners BWA-SW, Bowtie2 and GASSST, by aligning simulated and real datasets to the human genome. The performance evaluation shows that CUSHAW2 is consistently among the highest-ranked aligners in terms of alignment quality for both single-end and paired-end alignment, while demonstrating highly competitive speed. Furthermore, our aligner shows good parallel scalability with respect to the number of CPU threads. Availability: CUSHAW2, written in C++, and all simulated datasets are available at http://cushaw2.sourceforge.net. Contact: liuy@uni-mainz.de; bertil.schmidt@uni-mainz.de. Supplementary information: Supplementary data are available at Bioinformatics online.
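The seed-and-extend approach named in this abstract can be illustrated with a minimal Python sketch. It is not CUSHAW2's implementation: it swaps in fixed-length k-mer seeds for maximal exact matches and a plain Smith-Waterman score for the gapped extension, and every function name, parameter and scoring value below is a hypothetical choice made for illustration.

```python
from collections import defaultdict


def index_kmers(ref, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def local_align_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def seed_and_extend(ref, read, k=4, pad=2):
    """Return (score, implied_read_start) of the best candidate, or None."""
    index = index_kmers(ref, k)
    # Seeding: each exact k-mer hit implies where the read would start.
    candidates = set()
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            candidates.add(i - j)
    # Extension: score the read against a padded window around each candidate.
    best = None
    for start in sorted(candidates):
        lo = max(0, start - pad)
        hi = min(len(ref), start + len(read) + pad)
        score = local_align_score(ref[lo:hi], read)
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

With the reference `"TTTTTACGGATCCATTTTT"`, the exact read `"ACGGATCC"` yields `(16, 5)` and the read `"ACGGTTCC"`, which carries one substitution, yields `(13, 5)`. The extension loop is where most of the work happens, since each candidate costs a dynamic-programming pass over the window.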
Affiliation(s)
- Yongchao Liu
  - Institut für Informatik, Johannes Gutenberg Universität Mainz, Mainz 55099, Germany
6
Choi JH, Li Y, Guo J, Pei L, Rauch TA, Kramer RS, Macmil SL, Wiley GB, Bennett LB, Schnabel JL, Taylor KH, Kim S, Xu D, Sreekumar A, Pfeifer GP, Roe BA, Caldwell CW, Bhalla KN, Shi H. Genome-wide DNA methylation maps in follicular lymphoma cells determined by methylation-enriched bisulfite sequencing. PLoS One 2010; 5:e13020. [PMID: 20927367 PMCID: PMC2947499 DOI: 10.1371/journal.pone.0013020] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Received: 02/06/2010] [Accepted: 08/21/2010] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Follicular lymphoma (FL) is a form of non-Hodgkin's lymphoma (NHL) that arises from germinal center (GC) B-cells. Despite the significant advances in immunotherapy, FL is still not curable. Beyond transcriptional profiling and genomics datasets, there currently is no epigenome-scale dataset or integrative biology approach that can adequately model this disease and therefore identify novel mechanisms and targets for successful prevention and treatment of FL. METHODOLOGY/PRINCIPAL FINDINGS We performed methylation-enriched genome-wide bisulfite sequencing of FL cells and normal CD19(+) B-cells using 454 sequencing technology. The methylated DNA fragments were enriched with methyl-binding proteins, treated with bisulfite, and sequenced using the Roche-454 GS FLX sequencer. The total number of bases covered in the human genome was 18.2 and 49.3 million including 726,003 and 1.3 million CpGs in FL and CD19(+) B-cells, respectively. 11,971 and 7,882 methylated regions of interest (MRIs) were identified respectively. The genome-wide distribution of these MRIs displayed significant differences between FL and normal B-cells. A reverse trend in the distribution of MRIs between the promoter and the gene body was observed in FL and CD19(+) B-cells. The MRIs identified in FL cells also correlated well with transcriptomic data and ChIP-on-Chip analyses of genome-wide histone modifications such as tri-methyl-H3K27, and tri-methyl-H3K4, indicating a concerted epigenetic alteration in FL cells. CONCLUSIONS/SIGNIFICANCE This study is the first to provide a large scale and comprehensive analysis of the DNA methylation sequence composition and distribution in the FL epigenome. These integrated approaches have led to the discovery of novel and frequent targets of aberrant epigenetic alterations. The genome-wide bisulfite sequencing approach developed here can be a useful tool for profiling DNA methylation in clinical samples.
Affiliation(s)
- Jeong-Hyeon Choi
  - Center of Genomics and Bioinformatics, Indiana University, Bloomington, Indiana, United States of America
- Yajun Li
  - Medical College of Georgia Cancer Center, Medical College of Georgia, Augusta, Georgia, United States of America
- Juyuan Guo
  - Department of Pathology and Anatomical Sciences, University of Missouri, Columbia, Missouri, United States of America
- Lirong Pei
  - Medical College of Georgia Cancer Center, Medical College of Georgia, Augusta, Georgia, United States of America
- Tibor A. Rauch
  - Division of Biology, City of Hope Beckman Research Institute, Duarte, California, United States of America
- Robin S. Kramer
  - Department of Computer Sciences, University of Missouri, Columbia, Missouri, United States of America
- Simone L. Macmil
  - Advanced Center for Genome Technology, University of Oklahoma, Norman, Oklahoma, United States of America
- Graham B. Wiley
  - Advanced Center for Genome Technology, University of Oklahoma, Norman, Oklahoma, United States of America
- Lynda B. Bennett
  - Department of Pathology and Anatomical Sciences, University of Missouri, Columbia, Missouri, United States of America
- Jennifer L. Schnabel
  - Department of Pathology and Anatomical Sciences, University of Missouri, Columbia, Missouri, United States of America
- Kristen H. Taylor
  - Department of Pathology and Anatomical Sciences, University of Missouri, Columbia, Missouri, United States of America
- Sun Kim
  - Center of Genomics and Bioinformatics, Indiana University, Bloomington, Indiana, United States of America
- Dong Xu
  - Division of Biology, City of Hope Beckman Research Institute, Duarte, California, United States of America
- Arun Sreekumar
  - Medical College of Georgia Cancer Center, Medical College of Georgia, Augusta, Georgia, United States of America
- Gerd P. Pfeifer
  - Department of Computer Sciences, University of Missouri, Columbia, Missouri, United States of America
- Bruce A. Roe
  - Advanced Center for Genome Technology, University of Oklahoma, Norman, Oklahoma, United States of America
- Charles W. Caldwell
  - Department of Pathology and Anatomical Sciences, University of Missouri, Columbia, Missouri, United States of America
- Kapil N. Bhalla
  - Medical College of Georgia Cancer Center, Medical College of Georgia, Augusta, Georgia, United States of America
- Huidong Shi
  - Medical College of Georgia Cancer Center, Medical College of Georgia, Augusta, Georgia, United States of America
7
Khan Z, Bloom JS, Kruglyak L, Singh M. A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays. Bioinformatics 2009; 25:1609-16. [PMID: 19389736 DOI: 10.1093/bioinformatics/btp275] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.4] [Indexed: 11/12/2022]
Abstract
MOTIVATION High-throughput sequencing technologies place ever-increasing demands on existing algorithms for sequence analysis. Algorithms for computing maximal exact matches (MEMs) between sequences appear in two contexts where high-throughput sequencing will vastly increase the volume of sequence data: (i) seeding alignments of high-throughput reads for genome assembly and (ii) designating anchor points for genome-genome comparisons. RESULTS We introduce a new algorithm for finding MEMs. The algorithm leverages a sparse suffix array (SA), a text index that stores every K-th position of the text. In contrast to a full-text index that stores every position of the text, a sparse SA occupies much less memory. Even though we use a sparse index, the output of our algorithm is the same as that of a full-text index algorithm as long as the spacing between the indexed suffixes is not greater than the minimum length of a MEM. By relying on partial matches and additional text scanning between indexed positions, the algorithm trades memory for extra computation. The reduced memory usage makes it possible to determine MEMs between significantly longer sequences. AVAILABILITY Source code for the algorithm is available under a BSD open source license at http://compbio.cs.princeton.edu/mems. The implementation can serve as a drop-in replacement for the MEMs algorithm in MUMmer 3.
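The idea in this abstract, indexing only every K-th suffix and recovering full MEMs by scanning between indexed positions, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the names `build_sparse_sa` and `find_mems` are hypothetical, suffixes are sorted naively, and the truncated suffixes are materialised for binary search, which a memory-conscious index would avoid.

```python
from bisect import bisect_left, bisect_right


def build_sparse_sa(text, k):
    """Return the start positions of every k-th suffix, sorted by suffix."""
    return sorted(range(0, len(text), k), key=lambda i: text[i:])


def find_mems(ref, query, min_len, k):
    """Return sorted (ref_start, query_start, length) MEMs of length >= min_len."""
    if not 1 <= k <= min_len:
        raise ValueError("sparseness k must satisfy 1 <= k <= min_len")
    sa = build_sparse_sa(ref, k)
    # Any match of length >= min_len covers an indexed position from which
    # at least `need` characters still match to the right.
    need = min_len - (k - 1)
    # Truncating sorted suffixes keeps them sorted, so binary search works.
    prefixes = [ref[i:i + need] for i in sa]
    mems = set()
    for q in range(len(query) - need + 1):
        seed = query[q:q + need]
        lo = bisect_left(prefixes, seed)
        hi = bisect_right(prefixes, seed)
        for r in sa[lo:hi]:
            # Extend to the right past the seed.
            right = need
            while (r + right < len(ref) and q + right < len(query)
                   and ref[r + right] == query[q + right]):
                right += 1
            # Scan to the left over positions the sparse index skipped.
            left = 0
            while (r - left > 0 and q - left > 0
                   and ref[r - left - 1] == query[q - left - 1]):
                left += 1
            if left + right >= min_len:
                mems.add((r - left, q - left, left + right))
    return sorted(mems)
```

For example, `find_mems("AAACGTGG", "TTACGTCC", min_len=4, k=4)` returns `[(2, 2, 4)]`, the same result as with `k=1` (a full index), even though only reference positions 0 and 4 are indexed. The check `k <= min_len` mirrors the condition stated in the abstract.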
Affiliation(s)
- Zia Khan
  - Department of Computer Science, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
8
Rho M, Choi JH, Kim S, Lynch M, Tang H. De novo identification of LTR retrotransposons in eukaryotic genomes. BMC Genomics 2007; 8:90. [PMID: 17407597 PMCID: PMC1858694 DOI: 10.1186/1471-2164-8-90] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.6] [Received: 02/21/2007] [Accepted: 04/03/2007] [Indexed: 12/03/2022] Open
Abstract
Background LTR retrotransposons are a class of mobile genetic elements containing two similar long terminal repeats (LTRs). Currently, LTR retrotransposons are annotated in eukaryotic genomes mainly through the conventional homology searching approach. Hence, annotation is limited to known elements. Results In this paper, we report a de novo computational method that can identify new LTR retrotransposons without relying on a library of known elements. Specifically, our method identifies intact LTR retrotransposons by using an approximate string matching technique and protein domain analysis. In addition, it identifies partially deleted or solo LTRs using profile Hidden Markov Models (pHMMs). As a result, this method can de novo identify all types of LTR retrotransposons. We tested this method on two pairs of eukaryotic genomes, C. elegans vs. C. briggsae and D. melanogaster vs. D. pseudoobscura. LTR retrotransposons in C. elegans and D. melanogaster have been intensively studied using conventional annotation methods. Compared with previous work, we identified new intact LTR retroelements and new putative families, which implies that there may still be retroelements left to be discovered even in well-studied organisms. To assess the sensitivity and accuracy of our method, we compared our results with a previously published method, LTR_STRUC, which predominantly identifies full-length LTR retrotransposons. In summary, both methods identified a comparable number of intact LTR retroelements, but our method can identify nearly all known elements in C. elegans, while LTR_STRUC missed about 1/3 of them. Our method also identified more known LTR retroelements than LTR_STRUC in the D. melanogaster genome. We also identified some LTR retroelements in the other two genomes, C. briggsae and D. pseudoobscura, which have not been completely finished. In contrast, the conventional method failed to identify those elements. Finally, the phylogenetic and chromosomal distributions of the identified elements are discussed. Conclusion We report a novel method for de novo identification of LTR retrotransposons in eukaryotic genomes with favorable performance over the existing methods.
Affiliation(s)
- Mina Rho
  - Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
- Jeong-Hyeon Choi
  - Center for Genomics and Bioinformatics, Indiana University, Bloomington, IN 47405, USA
- Sun Kim
  - Center for Genomics and Bioinformatics, Indiana University, Bloomington, IN 47405, USA
  - School of Informatics, Indiana University, Bloomington, IN 47408, USA
- Michael Lynch
  - Department of Biology, Indiana University, Bloomington, IN 47405, USA
- Haixu Tang
  - Center for Genomics and Bioinformatics, Indiana University, Bloomington, IN 47405, USA
  - School of Informatics, Indiana University, Bloomington, IN 47408, USA
9
Uchiyama I, Higuchi T, Kobayashi I. CGAT: a comparative genome analysis tool for visualizing alignments in the analysis of complex evolutionary changes between closely related genomes. BMC Bioinformatics 2006; 7:472. [PMID: 17062155 PMCID: PMC1643837 DOI: 10.1186/1471-2105-7-472] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Received: 07/21/2006] [Accepted: 10/24/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The recent accumulation of closely related genomic sequences provides a valuable resource for the elucidation of the evolutionary histories of various organisms. However, although numerous alignment calculation and visualization tools have been developed to date, the analysis of complex genomic changes, such as large insertions, deletions, inversions, translocations and duplications, still presents certain difficulties. RESULTS We have developed a comparative genome analysis tool, named CGAT, which allows detailed comparisons of closely related bacteria-sized genomes mainly through visualizing middle-to-large-scale changes to infer underlying mechanisms. CGAT displays precomputed pairwise genome alignments on both dotplot and alignment viewers with scrolling and zooming functions, and allows users to move along the pre-identified orthologous alignments. Users can place several types of information on this alignment, such as the presence of tandem repeats or interspersed repetitive sequences and changes in G+C contents or codon usage bias, thereby facilitating the interpretation of the observed genomic changes. In addition to displaying precomputed alignments, the viewer can dynamically calculate the alignments between specified regions; this feature is especially useful for examining the alignment boundaries, as these boundaries are often obscure and can vary between programs. Besides the alignment browser functionalities, CGAT also contains an alignment data construction module, which contains various procedures that are commonly used for pre- and post-processing for large-scale alignment calculation, such as the split-and-merge protocol for calculating long alignments, chaining adjacent alignments, and ortholog identification. Indeed, CGAT provides a general framework for the calculation of genome-scale alignments using various existing programs as alignment engines, which allows users to compare the outputs of different alignment programs. Earlier versions of this program have been used successfully in our research to infer the evolutionary history of apparently complex genome changes between closely related eubacteria and archaea. CONCLUSION CGAT is a practical tool for analyzing complex genomic changes between closely related genomes using existing alignment programs and other sequence analysis tools combined with extensive manual inspection.
Affiliation(s)
- Ikuo Uchiyama
  - National Institute for Basic Biology, National Institutes of Natural Sciences, Nishigonaka 38, Myodaiji, Okazaki, Aichi 444-8585, Japan
- Toshio Higuchi
  - INTEC Web and Genome Informatics Corporation, 1-3-3 Shinsuna, Koto-ku, Tokyo 136-0075, Japan
- Ichizo Kobayashi
  - Department of Medical Genome Sciences, Graduate School of Frontier Science & Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
  - Graduate Program of Biophysics and Biochemistry, Graduate School of Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan