1. Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform 2023; 24:bbad320. PMID: 37668049. DOI: 10.1093/bib/bbad320.
Abstract
The Sequence Alignment/Map (SAM) format is the text format used to record alignment information. Alignment is the core of sequencing analysis, and downstream tasks take mapping results as input for further processing. Given the rapid development of the sequencing industry, a comprehensive understanding of the SAM format and related tools is necessary to meet the challenges of data processing and analysis. This paper surveys the broad field of SAM. First, the format is introduced to give an overview of the sequencing-analysis workflow. Then, existing work is systematically classified by generation, compression and application, and the relevant SAM tools are examined in detail. Lastly, a summary and some thoughts on future directions are provided.
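A SAM alignment line has 11 mandatory tab-separated fields defined by the SAM specification; a minimal parser makes the layout concrete. The record below is a made-up example, not from a real dataset.

```python
# Minimal parser for the 11 mandatory SAM fields (per the SAM specification).
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    """Split one alignment line into a dict; optional TAG:TYPE:VALUE fields are kept as a list."""
    cols = line.rstrip("\n").split("\t")
    rec = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        rec[key] = int(rec[key])          # numeric mandatory fields
    rec["OPTIONAL"] = cols[11:]           # anything past field 11 is optional tags
    return rec

record = parse_sam_line("read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0")
print(record["RNAME"], record["POS"], record["CIGAR"])
```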
Affiliation(s)
- Yuansheng Liu, College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangzhen Shen, College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Yongshun Gong, School of Software, Shandong University, 250100, Jinan, China
- Yiping Liu, College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Bosheng Song, College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangxiang Zeng, College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
2. Tang T, Hutvagner G, Wang W, Li J. Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies. Brief Funct Genomics 2022; 21:387-398. PMID: 35848773. DOI: 10.1093/bfgp/elac016.
Abstract
Next-generation sequencing (NGS) has produced enormous amounts of short-read sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both de novo assembly and error correction methods exploit the overlaps between reads, a natural concern is whether the sequencing errors that degrade genome assemblies also hurt the compression of NGS data. This work addresses two questions: whether current error correction algorithms enable compression algorithms to make the sequence data more compact, and whether the reads modified by error correction lead to better de novo contig assembly. As multiple sets of short reads are often produced by a single biomedical project in practice, we propose a graph-based method to reorder the files in a collection of multiple sets and then compress them simultaneously, for a further compression improvement after error correction. We use examples to illustrate that accurate error correction algorithms can significantly reduce the number of mismatched nucleotides in reference-free compression and hence greatly improve compression performance. Extensive tests on practical collections of multiple short-read sets confirm that compression performance on the error-corrected data (whose size is unchanged) significantly outperforms that on the original data, and that the file-reordering idea contributes further gains. Error correction of the original reads also improved the quality of the genome assemblies, sometimes remarkably. However, how to combine appropriate error correction methods with an assembly algorithm so that assembly performance always improves significantly remains an open question.
Affiliation(s)
- Tao Tang, Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia; School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210003, Jiangsu, China
- Gyorgy Hutvagner, School of Biomedical Engineering, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
- Wenjian Wang, School of Computer and Information Technology, Shanxi University, Shanxi Road, 030006, Shanxi, China
- Jinyan Li, Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
3. Lee D, Song G. FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model. Bioinformatics 2022; 38:351-356. PMID: 34623374. DOI: 10.1093/bioinformatics/btab696.
Abstract
MOTIVATION Over the past decades, vast amounts of genome sequencing data have been produced, requiring an enormous amount of storage capacity. The time and resources needed to store and transfer such data cause bottlenecks in genome sequencing analysis. To resolve this issue, various compression techniques have been proposed to reduce the size of raw FASTQ sequencing data, but these remain suboptimal. Moreover, long-read sequencing is becoming dominant in genomics, whereas most existing compression methods focus only on short-read sequencing. RESULTS We designed a compression algorithm based on read reordering, using a novel scoring model, that reduces FASTQ file size with no information loss. We integrated all data-processing steps into a software package called FastqCLS and provide it as a Docker image for easy installation and execution. We compared our method with the major existing FASTQ compression tools on benchmark datasets, including new long-read sequencing data. FastqCLS achieved the best compression ratios for long-read sequencing data. AVAILABILITY AND IMPLEMENTATION FastqCLS can be downloaded from https://github.com/krlucete/FastqCLS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
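FastqCLS's actual scoring model is not reproduced here. As a toy illustration of why reordering can help, the sketch below reorders reads (lexicographically, as a stand-in for a learned score) before handing them to a general-purpose compressor, and records the permutation so the original order is restored losslessly.

```python
# Toy sketch of reordering-before-compression: similar reads placed next to
# each other compress better, and the stored permutation keeps it lossless.
import zlib

reads = ["ACGTACGTAC", "TTTTGGGGCC", "ACGTACGTAA", "TTTTGGGGCA"] * 50

# Lexicographic rank is a stand-in for FastqCLS's scoring model.
order = sorted(range(len(reads)), key=lambda i: reads[i])
reordered = [reads[i] for i in order]

blob = zlib.compress("\n".join(reordered).encode())

# Decompress and undo the permutation to recover the original order.
restored_sorted = zlib.decompress(blob).decode().split("\n")
restored = [None] * len(reads)
for rank, original_index in enumerate(order):
    restored[original_index] = restored_sorted[rank]

print(len(blob), "compressed bytes;", "lossless:", restored == reads)
```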
Affiliation(s)
- Dohyeon Lee, School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
- Giltae Song, School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
4. Huo H, Chen X, Guo X, Vitter JS. Efficient compression and indexing for highly repetitive DNA sequence collections. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:2394-2408. PMID: 31985436. DOI: 10.1109/tcbb.2020.2968323.
Abstract
In this paper, we focus on the important problem of indexing and searching highly repetitive DNA sequence collections. Given a collection G of t sequences S_i of length n each, we can represent G succinctly in 2nH_k(T) + O(n' log log n) + o(qn') + o(tn) bits using O(tn^2 + qn') time, where H_k(T) is the kth-order empirical entropy of the sequence T ∈ G that is used as the reference sequence, n' is the total number of variations between T and the sequences in G, and q is a small fixed constant. We can restore any length-len substring S[sp, ..., sp + len - 1] of S ∈ G in O(n'_s + len (log n)^2 / log log n) time and report all positions where P occurs in G in O(m·t + occ·t·(log n)^2 / log log n) time. In addition, we propose a dynamic programming method to find the variations between T and the sequences in G in a space-efficient way, with which we can build succinct structures to enable efficient search. For highly repetitive sequences, experimental results on the tested data demonstrate that the proposed method has significant advantages in space usage and retrieval time over the current state-of-the-art methods. The source code is available online.
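The space bound above is stated in terms of the kth-order empirical entropy H_k. As a sketch of that definition (not of the paper's index), H_k weights the entropy of each length-k context's next-symbol distribution by the context's frequency:

```python
# Sketch of kth-order empirical entropy: for every length-k context in the
# text, compute the entropy of the distribution of the following symbol,
# weighted by how often that context occurs.
from collections import Counter, defaultdict
from math import log2

def empirical_entropy_k(text, k):
    contexts = defaultdict(Counter)
    for i in range(len(text) - k):
        contexts[text[i:i + k]][text[i + k]] += 1
    total = len(text) - k
    h = 0.0
    for ctx, nxt in contexts.items():
        n = sum(nxt.values())
        h += sum(-c * log2(c / n) for c in nxt.values()) / total
    return h

# In a perfectly periodic string, each 1-symbol context determines its
# successor, so the 1st-order empirical entropy is zero.
print(empirical_entropy_k("ACACACAC", 1))
```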
5. Liu Y, Li J. Hamming-shifting graph of genomic short reads: efficient construction and its application for compression. PLoS Comput Biol 2021; 17:e1009229. PMID: 34280186. PMCID: PMC8321399. DOI: 10.1371/journal.pcbi.1009229.
Abstract
Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of large-scale sequencing data. We present a novel graph definition named the Hamming-shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the lightest-weight edges, and we prove a very high probability of successfully detecting these edges. The resulting graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child read for an optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph, and achieved 10-30% greater file-size reduction compared with the best compression results of existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.
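The edge rule described in the abstract can be sketched as a predicate: two reads are linked when their Hamming distance is small, or one is (almost) a small shift of the other. The thresholds below are arbitrary toy values, not the paper's.

```python
# Toy edge predicate in the spirit of the Hamming-shifting graph.
def hamming(a, b):
    """Hamming distance for equal-length strings, else None."""
    return sum(x != y for x, y in zip(a, b)) if len(a) == len(b) else None

def small_shift(a, b, max_shift):
    """Smallest shift s <= max_shift such that a[s:] matches the prefix of b, else None."""
    for s in range(1, max_shift + 1):
        if a[s:] == b[:len(a) - s]:
            return s
    return None

def linked(a, b, max_mismatch=2, max_shift=2):
    d = hamming(a, b)
    return (d is not None and d <= max_mismatch) or small_shift(a, b, max_shift) is not None

print(linked("ACGTACGT", "ACGTACGA"))  # one mismatch
print(linked("ACGTACGT", "CGTACGTT"))  # shift of one position
```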
Affiliation(s)
- Yuansheng Liu, Data Science Institute, University of Technology Sydney, Sydney, Australia
- Jinyan Li, Data Science Institute, University of Technology Sydney, Sydney, Australia
6. Pal S, Mondal S, Das G, Khatua S, Ghosh Z. Big data in biology: the hope and present-day challenges in it. Gene Reports 2020. DOI: 10.1016/j.genrep.2020.100869.
7. Kowalski TM, Grabowski S. PgRC: pseudogenome-based read compressor. Bioinformatics 2020; 36:2082-2089. PMID: 31893286. DOI: 10.1093/bioinformatics/btz919.
Abstract
MOTIVATION The amount of sequencing data from high-throughput sequencing technologies grows at a pace exceeding the one predicted by Moore's law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources. RESULTS We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 15 and 20% on average, respectively, while being comparably fast in decompression. AVAILABILITY AND IMPLEMENTATION PgRC can be downloaded from https://github.com/kowallus/PgRC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
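PgRC's pseudogenome construction is more elaborate than this, but the core idea of approximating a shortest common superstring can be sketched with the classic greedy overlap-merge heuristic (all inputs here are illustrative):

```python
# Greedy approximation of a shortest common superstring: repeatedly merge
# the pair of reads with the largest suffix-prefix overlap.
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for L in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:L]):
            return L
    return 0

def greedy_superstring(reads):
    reads = list(dict.fromkeys(reads))  # drop exact duplicates
    while len(reads) > 1:
        L, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(reads)
                      for j, b in enumerate(reads) if i != j)
        merged = reads[i] + reads[j][L:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0]

pg = greedy_superstring(["ACGTAC", "GTACGG", "TACGGA"])
print(pg)  # every input read occurs as a substring of the result
```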
Affiliation(s)
- Tomasz M Kowalski, Institute of Applied Computer Science, Lodz University of Technology, Lodz 90-924, Poland
- Szymon Grabowski, Institute of Applied Computer Science, Lodz University of Technology, Lodz 90-924, Poland
8. Liu Y, Yu Z, Dinger ME, Li J. Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression. Bioinformatics 2019; 35:2066-2074. PMID: 30407482. DOI: 10.1093/bioinformatics/bty936.
Abstract
MOTIVATION Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to contract the size of these datasets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has proven to be one of the most effective read-compression approaches. As there is usually no good prior knowledge about the reference sequence, current focus is on the construction of de novo assembled contigs. RESULTS We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and groups those that share the same minimizer. Within each subgroup, a contig is constructed. Pairs of contigs derived from the subgroups are then merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity between two contigs. This merging process is repeated until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole-genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size in 22 of 34 cases, with significant improvement. In the compression of paired-end reads, minicom achieved 20-80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% reduction in compressed file size compared with the best algorithm under the read-order-preserving mode. These performances are mainly attributed to exploiting the redundancy of repetitive substrings in the long contigs. AVAILABILITY AND IMPLEMENTATION https://github.com/yuansliu/minicom. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
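The minimizer indexing step can be made concrete with a small sketch: a (w, k)-minimizer is the lexicographically smallest k-mer among w consecutive k-mers of a window. The w and k values below are arbitrary toy choices, not minicom's parameters.

```python
# Sketch of (w, k)-minimizer selection: reads whose windows share a minimizer
# are likely to overlap, which is why minimizers work well as bucketing keys.
def minimizers(seq, w, k):
    """Return the minimizer of every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)]

mins = minimizers("GATTACAGATTACA", w=4, k=3)
print(mins)
```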
Affiliation(s)
- Yuansheng Liu, Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
- Zuguo Yu, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan, China; School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia
- Marcel E Dinger, Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney, NSW, Australia; St Vincent's Clinical School, University of New South Wales, Sydney, NSW, Australia
- Jinyan Li, Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
9.
Abstract
Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology, in terms of both efficiency and affordability. These developments have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic data sets are being generated. This poses a significant challenge for the storage and transmission of these data. Already, it is more expensive to store genomic data for a decade than it is to obtain the data in the first place. This situation calls for efficient representations of genomic information. In this review, we emphasize the need for designing specialized compressors tailored to genomic data and describe the main solutions already proposed. We also give general guidelines for storing these data and conclude with our thoughts on the future of genomic formats and compressors.
Affiliation(s)
- Mikel Hernaez, Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
- Dmitri Pavlichin, Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Tsachy Weissman, Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Idoia Ochoa, Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
10. Guerra A, Lotero J, Aedo JÉ, Isaza S. Tackling the challenges of FASTQ referential compression. Bioinform Biol Insights 2019; 13:1177932218821373. PMID: 30792576. PMCID: PMC6376532. DOI: 10.1177/1177932218821373.
Abstract
The exponential growth of genomic data has motivated the development of compression algorithms to tackle the storage capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve much higher compression than their non-referential counterparts; however, the latest tools have not yet been able to harness that potential. Reaching it requires an efficient encoding model to represent the differences between the input and the reference. In this article, we introduce a novel approach for referential compression of FASTQ files. The core of our compression scheme is a referential compressor based on combining local alignments with binary encoding optimized for long reads. We present the algorithms and performance tests developed for our read compression algorithm, named UdeACompress. Our compressor achieved the best results when compressing long reads and competitive compression ratios for shorter reads, compared with the best programs in the state of the art. As an added value, it also showed reasonable execution times and memory consumption in comparison with similar tools.
Affiliation(s)
- Aníbal Guerra, Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela; Facultad de Ingeniería, Universidad de Antioquia (UdeA), Medellín, Colombia
- Jaime Lotero, Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- José Édinson Aedo, Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- Sebastián Isaza, Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
11. Wang R, Li J, Bai Y, Zang T, Wang Y. BdBG: a bucket-based method for compressing genome sequencing data with dynamic de Bruijn graphs. PeerJ 2018; 6:e5611. PMID: 30364599. PMCID: PMC6197042. DOI: 10.7717/peerj.5611.
Abstract
Dramatic increases in data produced by next-generation sequencing (NGS) technologies demand data compression tools for saving storage space. However, effective and efficient data compression for genome sequencing data has remained an unresolved challenge in NGS data studies. In this paper, we propose a novel alignment-free and reference-free compression method, BdBG, the first to compress genome sequencing data with dynamic de Bruijn graphs built on bucketed data. Compared with existing de Bruijn graph methods, BdBG stores only a list of bucket indexes and bifurcations for the raw read sequences, which effectively reduces storage space. Experimental results on several genome sequencing datasets show the effectiveness of BdBG over three state-of-the-art methods. BdBG is written in Python and is open-source software distributed under the MIT license, available for download at https://github.com/rongjiewang/BdBG.
Affiliation(s)
- Rongjie Wang, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Junyi Li, School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China
- Yang Bai, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Tianyi Zang, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Yadong Wang, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
12. Turner I, Garimella KV, Iqbal Z, McVean G. Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 2018; 34:2556-2565. PMID: 29554215. PMCID: PMC6061703. DOI: 10.1093/bioinformatics/bty157.
Abstract
Motivation The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input data. Results We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both our de Bruijn graph and the String Graph Assembler (SGA). Finally we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterize the genomic context of drug-resistance genes. Availability and implementation Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, which is available under the MIT license at https://github.com/mcveanlab/mccortex. Supplementary information Supplementary data are available at Bioinformatics online.
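To make the underlying data structure concrete, here is a minimal (unlinked) de Bruijn graph: nodes are (k-1)-mers, and each k-mer contributes one edge. Note how the repeated ACG in the example collapses into a single cycle, illustrating the loss of long-range information that LdBG's annotations are designed to recover; the annotation layer itself is beyond this sketch.

```python
# Minimal de Bruijn graph: (k-1)-mer nodes, one edge per k-mer occurrence.
from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix node -> suffix node
    return graph

# The repeat of "ACG" in this read yields a cycle AC -> CG -> GT -> TA -> AC;
# the graph alone cannot tell how many times the cycle was traversed.
g = de_bruijn(["ACGTACG"], k=3)
print(sorted((u, sorted(v)) for u, v in g.items()))
```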
Affiliation(s)
- Isaac Turner, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Kiran V Garimella, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK; Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
- Zamin Iqbal, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK; European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
- Gil McVean, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK; Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
13. Holley G, Wittler R, Stoye J, Hach F. Dynamic alignment-free and reference-free read compression. J Comput Biol 2018; 25:825-836. PMID: 30011247. DOI: 10.1089/cmb.2018.0068.
Abstract
The advent of high throughput sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa data set, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared with the best performing state-of-the-art HTS-specific compression method in our experiments.
Affiliation(s)
- Guillaume Holley, Genome Informatics, Faculty of Technology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany; International Research Training Group 1906 "Computational Methods for the Analysis of the Diversity and Dynamics of Genomes," Bielefeld University, Bielefeld, Germany
- Roland Wittler, Genome Informatics, Faculty of Technology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany; International Research Training Group 1906 "Computational Methods for the Analysis of the Diversity and Dynamics of Genomes," Bielefeld University, Bielefeld, Germany
- Jens Stoye, Genome Informatics, Faculty of Technology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Faraz Hach, School of Computing Science, Simon Fraser University, Burnaby, Canada; Department of Urologic Sciences, University of British Columbia, Vancouver, Canada; Vancouver Prostate Centre, Vancouver, Canada
14.
Abstract
Codon usage depends on mutation bias, tRNA-mediated selection, and the need for high efficiency and accuracy in translation. One codon in a synonymous codon family is often strongly over-used, especially in highly expressed genes, which often leads to a high dN/dS ratio because dS is very small. Many different codon usage indices have been proposed to measure codon usage and codon adaptation. Sense codons can be misread by release factors, and stop codons can be misread by tRNAs; these rare events also contribute to codon usage. This chapter outlines the conceptual framework of codon evolution, illustrates codon-specific and gene-specific codon usage indices, and presents their applications. A new index for codon adaptation that accounts for background mutation bias (the Index of Translation Elongation) is presented and contrasted with the codon adaptation index (CAI), which does not consider background mutation bias. The two indices are used to re-analyze data from a recent paper claiming that translation elongation efficiency matters little in protein production; the reanalysis disproves the claim.
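As a sketch of one standard codon-specific measure (not necessarily the chapter's own index), relative synonymous codon usage (RSCU) divides a codon's observed count by the mean count over its synonymous family. The single two-codon Phe family below is a toy setup; CAI would additionally weight codons against a reference set of highly expressed genes.

```python
# Toy RSCU computation: observed codon count / mean count in its synonymous family.
from collections import Counter

SYNONYMS = {"F": ["TTT", "TTC"]}  # one two-codon family, for illustration only

def rscu(cds, family):
    codons = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    counts = [codons[c] for c in family]
    mean = sum(counts) / len(family)
    return {c: (codons[c] / mean if mean else 0.0) for c in family}

# Toy CDS with three TTT and one TTC: TTT is over-used within its family.
r = rscu("TTTTTCTTTTTT", SYNONYMS["F"])
print(r)
```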
15. Optimal compressed representation of high throughput sequence data via light assembly. Nat Commun 2018; 9:566. PMID: 29422526. PMCID: PMC5805770. DOI: 10.1038/s41467-017-02480-6.
Abstract
The most effective genomic data compression methods either assemble reads into contigs or replace them with their alignment positions on a reference genome. Such methods require significant computational resources, but faster alternatives that avoid using explicit or de novo-constructed references fail to match their performance. Here, we introduce a new reference-free compressed representation for genomic data based on light de novo assembly of reads, where each read is represented as a node in a (compact) trie. We show how to efficiently build such tries to compactly represent reads and demonstrate that, among all methods using this representation (including all de novo assembly-based methods), our method achieves the shortest possible output. We also provide a lower bound on the compression rate achievable on uniformly sampled genomic read data, which our method approximates well. Our method significantly improves the compression performance of alternatives without compromising speed. Growth in high-throughput sequencing (HTS) data warrants compression methods that facilitate better storage and communication. Here, Ginart et al. introduce Assembltrie, a reference-free compression tool which is guaranteed to achieve optimality for error-free reads.
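The read-as-trie-node representation can be sketched with a plain nested-dict trie: each read is inserted as a path, so shared prefixes are stored once. The paper's compact-trie encoding and optimality argument are beyond this illustration.

```python
# Sketch of a read trie: shared prefixes collapse into shared paths.
def insert(trie, read):
    node = trie
    for base in read:
        node = node.setdefault(base, {})
    node["$"] = True  # mark that a read ends at this node

trie = {}
reads = ["ACGT", "ACGA", "ACG"]
for r in reads:
    insert(trie, r)

def count_nodes(node):
    """Count trie nodes, skipping end-of-read markers."""
    return 1 + sum(count_nodes(child) for key, child in node.items() if key != "$")

# 11 input bases are represented by far fewer trie nodes.
print(count_nodes(trie), "nodes for", sum(map(len, reads)), "bases")
```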
16. ARSDA: a new approach for storing, transmitting and analyzing transcriptomic data. G3 Genes Genomes Genetics 2017; 7:3839-3848. PMID: 29079682. PMCID: PMC5714481. DOI: 10.1534/g3.117.300271.
Abstract
Two major stumbling blocks exist in high-throughput sequencing (HTS) data analysis. The first is the sheer file size, typically in gigabytes when uncompressed, causing problems in storage, transmission, and analysis. However, these files do not need to be so large, and can be reduced without loss of information. Each HTS file, either in compressed .SRA or plain-text .fastq format, contains numerous identical reads stored as separate entries. For example, among 44,603,541 forward reads in the SRR4011234.sra file (from a Bacillus subtilis transcriptomic study) deposited at NCBI's SRA database, one read has 497,027 identical copies. Instead of storing them as separate entries, one can and should store them as a single entry in a SeqID_NumCopy format (which I dub the FASTA+ format). The second is the proper allocation of reads that map equally well to paralogous genes, for which I illustrate a new method in detail. I have developed ARSDA software that implements these new approaches. A number of HTS files for model species are being processed and deposited at http://coevol.rdc.uottawa.ca to demonstrate that this approach not only saves a huge amount of storage space and transmission bandwidth, but also dramatically reduces downstream analysis time: instead of matching the 497,027 identical reads separately against the B. subtilis genome, one only needs to match once. ARSDA includes functions that exploit the new sequence format in downstream analyses such as gene expression characterization. I contrasted gene expression results between ARSDA and Cufflinks so readers can better appreciate the strength of ARSDA. ARSDA is freely available for Windows, Linux, and Macintosh computers at http://dambe.bio.uottawa.ca/ARSDA/ARSDA.aspx.
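The SeqID_NumCopy idea can be sketched directly: collapse identical reads into one entry whose ID carries the copy number. The header naming below is illustrative, not ARSDA's exact format.

```python
# Toy FASTA+-style deduplication: identical reads become a single entry
# whose header records the copy number (SeqID_NumCopy).
from collections import Counter

def to_fasta_plus(reads):
    counts = Counter(reads)
    return "\n".join(f">r{i}_{n}\n{seq}"
                     for i, (seq, n) in enumerate(sorted(counts.items()), 1))

out = to_fasta_plus(["ACGT"] * 3 + ["GGCC"])
print(out)
```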
Collapse
|
17
|
Sarkar H, Patro R. Quark enables semi-reference-based compression of RNA-seq data. Bioinformatics 2017; 33:3380-3386. [DOI: 10.1093/bioinformatics/btx428] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2016] [Accepted: 06/29/2017] [Indexed: 12/19/2022] Open
Affiliation(s)
- Hirak Sarkar
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
| | - Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
| |
Collapse
|
18
|
Holley G, Wittler R, Stoye J, Hach F. Dynamic Alignment-Free and Reference-Free Read Compression. Lecture Notes in Computer Science 2017. [DOI: 10.1007/978-3-319-56970-3_4] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/11/2023]
|
19
|
Comparison of high-throughput sequencing data compression tools. Nat Methods 2016; 13:1005-1008. [PMID: 27776113 DOI: 10.1038/nmeth.4037] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2016] [Accepted: 09/01/2016] [Indexed: 12/27/2022]
Abstract
High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework.
Collapse
|
20
|
Lelieveld SH, Veltman JA, Gilissen C. Novel bioinformatic developments for exome sequencing. Hum Genet 2016; 135:603-14. [PMID: 27075447 PMCID: PMC4883269 DOI: 10.1007/s00439-016-1658-6] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2016] [Accepted: 03/15/2016] [Indexed: 01/19/2023]
Abstract
With the widespread adoption of next generation sequencing technologies by the genetics community and the rapid decrease in costs per base, exome sequencing has become a standard within the repertoire of genetic experiments for both research and diagnostics. Although bioinformatics now offers standard solutions for the analysis of exome sequencing data, many challenges still remain; especially the increasing scale at which exome data are now being generated has given rise to novel challenges in how to efficiently store, analyze and interpret exome data of this magnitude. In this review we discuss some of the recent developments in bioinformatics for exome sequencing and the directions that this is taking us to. With these developments, exome sequencing is paving the way for the next big challenge, the application of whole genome sequencing.
Collapse
Affiliation(s)
- Stefan H Lelieveld
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525 GA, Nijmegen, The Netherlands
| | - Joris A Veltman
- Department of Human Genetics, Donders Centre for Neuroscience, Radboudumc, Geert Grooteplein 10, 6525 GA, Nijmegen, The Netherlands
- Department of Clinical Genetics, GROW-School for Oncology and Developmental Biology, Maastricht University Medical Centre, Universiteitssingel 50, 6229 ER, Maastricht, The Netherlands
| | - Christian Gilissen
- Department of Human Genetics, Donders Centre for Neuroscience, Radboudumc, Geert Grooteplein 10, 6525 GA, Nijmegen, The Netherlands.
| |
Collapse
|
21
|
Lelieveld SH, Veltman JA, Gilissen C. Novel bioinformatic developments for exome sequencing. Hum Genet 2016. [PMID: 27075447 DOI: 10.1007/s00439-016-1658-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023]
Abstract
With the widespread adoption of next generation sequencing technologies by the genetics community and the rapid decrease in costs per base, exome sequencing has become a standard within the repertoire of genetic experiments for both research and diagnostics. Although bioinformatics now offers standard solutions for the analysis of exome sequencing data, many challenges still remain; especially the increasing scale at which exome data are now being generated has given rise to novel challenges in how to efficiently store, analyze and interpret exome data of this magnitude. In this review we discuss some of the recent developments in bioinformatics for exome sequencing and the directions that this is taking us to. With these developments, exome sequencing is paving the way for the next big challenge, the application of whole genome sequencing.
Collapse
Affiliation(s)
- Stefan H Lelieveld
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525 GA, Nijmegen, The Netherlands
| | - Joris A Veltman
- Department of Human Genetics, Donders Centre for Neuroscience, Radboudumc, Geert Grooteplein 10, 6525 GA, Nijmegen, The Netherlands
- Department of Clinical Genetics, GROW-School for Oncology and Developmental Biology, Maastricht University Medical Centre, Universiteitssingel 50, 6229 ER, Maastricht, The Netherlands
| | - Christian Gilissen
- Department of Human Genetics, Donders Centre for Neuroscience, Radboudumc, Geert Grooteplein 10, 6525 GA, Nijmegen, The Netherlands.
| |
Collapse
|
22
|
Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinformatics 2015; 16:288. [PMID: 26370285 PMCID: PMC4570262 DOI: 10.1186/s12859-015-0709-7] [Citation(s) in RCA: 72] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2015] [Accepted: 08/17/2015] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND The volume of data generated by next-generation sequencing (NGS) technologies is now a major concern for both data storage and transmission. This has triggered the need for more efficient methods than general-purpose compression tools, such as the widely used gzip. RESULTS We present a novel reference-free method to compress data produced by high-throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn graph is used to perform a lossy transformation of the quality scores, which allows higher compression rates to be obtained without losing information pertinent to downstream analyses. CONCLUSIONS LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq and metagenomics). In all cases, LEON achieved higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole-genome sequencing dataset, LEON divided the original file size by more than 20. LEON is open source software, distributed under the GNU Affero GPL license, available for download at http://gatb.inria.fr/software/leon/.
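The anchoring-plus-bifurcations encoding in the abstract can be sketched compactly. LEON itself stores the k-mer set in a Bloom filter; the sketch below substitutes an exact Python set so the encoding logic stays visible, and all names and the choice of k are illustrative assumptions, not LEON's implementation:

```python
K = 5  # k-mer length (illustrative; LEON selects k per dataset)

def build_graph(reads, k=K):
    """Exact k-mer set standing in for LEON's probabilistic de Bruijn graph."""
    return {r[i:i+k] for r in reads for i in range(len(r) - k + 1)}

def encode_read(read, graph, k=K):
    """Represent a read as an anchor k-mer plus the branch taken at each bifurcation."""
    anchor, choices = read[:k], []
    for i in range(k, len(read)):
        successors = [b for b in "ACGT" if read[i-k+1:i] + b in graph]
        if len(successors) > 1:          # graph bifurcates: record which branch
            choices.append(successors.index(read[i]))
    return anchor, choices

def decode_read(anchor, choices, length, graph, k=K):
    """Walk the graph from the anchor, consuming one choice per bifurcation."""
    read, branch = anchor, iter(choices)
    while len(read) < length:
        successors = [b for b in "ACGT" if read[-(k-1):] + b in graph]
        read += successors[next(branch) if len(successors) > 1 else 0]
    return read
```

Most positions have a unique successor in the graph, so a read compresses to one k-mer and a short branch list; a real implementation must also handle reads whose k-mers fall outside the graph (or are Bloom-filter false positives).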
Collapse
Affiliation(s)
- Gaëtan Benoit
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| | - Claire Lemaitre
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| | | | - Erwan Drezen
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| | - Thibault Dayris
- University of Bordeaux, CNRS/LaBRI, Talence, F-33405, France.
| | - Raluca Uricaru
- University of Bordeaux, CNRS/LaBRI, Talence, F-33405, France.
- University of Bordeaux, CBiB, Bordeaux, F-33000, France.
| | - Guillaume Rizk
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| |
Collapse
|
23
|
Patro R, Kingsford C. Data-dependent bucketing improves reference-free compression of sequencing reads. Bioinformatics 2015; 31:2770-7. [PMID: 25910696 PMCID: PMC4547610 DOI: 10.1093/bioinformatics/btv248] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2014] [Revised: 04/11/2015] [Accepted: 04/20/2015] [Indexed: 01/24/2023] Open
Abstract
MOTIVATION The storage and transmission of high-throughput sequencing data consumes significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reducing storage and transmission requirements is to compress the sequencing data. RESULTS We present a novel technique to boost the compression of sequencing reads, based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes. AVAILABILITY AND IMPLEMENTATION Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/~ckingsf/software/mince. CONTACT carlk@cs.cmu.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
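Mince's actual scheme involves data-dependent bucket labels and further encoding transforms; the sketch below illustrates only the core bucketing idea, grouping reads by a shared minimizer so that similar reads become neighbors before a general-purpose compressor sees them. All names here are illustrative:

```python
def minimizer(read, w=8):
    """Lexicographically smallest w-mer of the read; overlapping reads tend to share it."""
    return min(read[i:i+w] for i in range(len(read) - w + 1))

def bucket_order(reads, w=8):
    """Reorder reads so members of the same bucket are stored contiguously."""
    buckets = {}
    for r in reads:
        buckets.setdefault(minimizer(r, w), []).append(r)
    # Emitting bucket by bucket clusters near-identical reads, which a
    # downstream compressor (e.g. gzip) exploits far better than input order.
    return [r for key in sorted(buckets) for r in buckets[key]]
```

The reordering is lossless up to read order, so a permutation (or an order-agnostic downstream analysis) recovers the original data.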
Collapse
Affiliation(s)
- Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA
| | - Carl Kingsford
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA
| |
Collapse
|