1
Kredens KV, Martins JV, Dordal OB, Ferrandin M, Herai RH, Scalabrin EE, Ávila BC. Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review. PLoS One 2020; 15:e0232942. [PMID: 32453750 PMCID: PMC7250429 DOI: 10.1371/journal.pone.0232942] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/05/2019] [Accepted: 04/25/2020] [Indexed: 11/19/2022] Open
Abstract
The recent decrease in the cost and time required to sequence and assemble complete genomes has increased the demand for data storage. As a consequence, several strategies for compressing assembled biological data have been created. Vertical compression tools implement strategies that take advantage of the high level of similarity between multiple assembled genomic sequences to achieve better compression results. However, current reviews on vertical compression do not compare the execution flow of each tool, which consists of preprocessing, transformation, and data encoding phases. We performed a systematic literature review to identify and compare existing tools for vertical compression of assembled genomic sequences. The review was centered on PubMed and Scopus, in which 45726 distinct papers were considered. Next, 32 papers were selected according to the following criteria: the paper presents a lossless vertical compression tool; the tool uses the information contained in other sequences for the compression; it can handle genomic sequences in FASTA format; and it requires no prior knowledge. Although we extracted compression performance results, they were not compared because the tools did not use a standardized evaluation protocol. Thus, we conclude that a standard evaluation protocol to be applied by each tool is still lacking.
Affiliation(s)
- Kelvin V. Kredens
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
- Juliano V. Martins
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
- Osmar B. Dordal
- Polytechnic School, Centro Universitário UniDomBosco, Curitiba, Paraná, Brazil
- Mauri Ferrandin
- Department of Control, Automation and Computing Engineering, Universidade Federal de Santa Catarina (UFSC), Blumenau, Brazil
- Roberto H. Herai
- Graduate Program in Health Sciences, School of Medicine, Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba, Paraná, Brazil
- Edson E. Scalabrin
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
- Bráulio C. Ávila
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
2
HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data. BIOMED RESEARCH INTERNATIONAL 2020; 2019:3108950. [PMID: 31915686 PMCID: PMC6930768 DOI: 10.1155/2019/3108950] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/16/2019] [Revised: 09/14/2019] [Accepted: 10/22/2019] [Indexed: 12/22/2022]
Abstract
As genome sequencing technology matures, huge amounts of sequence reads and assembled genomes are being generated. With this explosive growth, the storage and transmission of genomic data face enormous challenges. FASTA, one of the main storage formats for genome sequences, is widely used in the Gene Bank because it eases sequence analysis and gene research and is easy to read. Many compression methods for FASTA genome sequences have been proposed, but they still have room for improvement. For example, compression ratio and speed are neither high nor robust enough, and memory consumption is not ideal. It is therefore important to improve the efficiency, robustness, and practicability of genomic data compression in order to further reduce the storage and transmission cost of genomic data and to promote the research and development of genomic technology. In this manuscript, a hybrid referential compression method (HRCM) for FASTA genome sequences is proposed. HRCM is a lossless compression method able to compress a single sequence as well as large collections of sequences. It is implemented in three stages: sequence information extraction, sequence information matching, and sequence information encoding. HRCM's performance was evaluated in a large number of experiments. Experimental verification shows that HRCM is superior to the best-known methods in genome batch compression. Moreover, HRCM's memory consumption is relatively low, so it can be deployed on standard PCs.
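The general idea behind referential compression, as named in the abstract above, can be illustrated with a small sketch. This is not HRCM's algorithm: the abstract gives no implementation details, so the greedy k-mer matching below is a generic stand-in, and the names `build_index`, `encode`, `decode`, and the parameter `k` are all hypothetical. The sketch covers only a matching step and a trivial token encoding; FASTA header handling and the entropy coding a real tool would apply are omitted.

```python
from typing import Dict, List, Tuple, Union

# A token is either a (reference_position, length) match or a literal string.
Token = Union[Tuple[int, int], str]


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it occurs."""
    index: Dict[str, List[int]] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def encode(target: str, reference: str, k: int = 4) -> List[Token]:
    """Greedily replace stretches of the target with pointers into the reference."""
    index = build_index(reference, k)
    tokens: List[Token] = []
    literal: List[str] = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position sharing the next k-mer; keep the longest match.
        for pos in index.get(target[i:i + k], []):
            length = 0
            while (i + length < len(target)
                   and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            if literal:  # flush pending unmatched bases first
                tokens.append("".join(literal))
                literal = []
            tokens.append((best_pos, best_len))
            i += best_len
        else:
            literal.append(target[i])
            i += 1
    if literal:
        tokens.append("".join(literal))
    return tokens


def decode(tokens: List[Token], reference: str) -> str:
    """Rebuild the target from the tokens and the same reference (lossless)."""
    parts: List[str] = []
    for token in tokens:
        if isinstance(token, str):
            parts.append(token)
        else:
            pos, length = token
            parts.append(reference[pos:pos + length])
    return "".join(parts)


reference = "ACGTACGTTTGACCA"
target = "ACGTACGTATGACCA"  # differs from the reference by one base
tokens = encode(target, reference)
restored = decode(tokens, reference)
```

Because the two sequences differ at a single position, most of the target is represented by pointers into the reference and only the mismatching base is stored literally, which is why highly similar genomes compress well under this kind of scheme.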
3
Bianchi L, Liò P. Opportunities for community awareness platforms in personal genomics and bioinformatics education. Brief Bioinform 2018; 18:1082-1090. [PMID: 27580620 DOI: 10.1093/bib/bbw078] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 03/12/2016] [Indexed: 01/16/2023] Open
Abstract
Precision and personalized medicine will be increasingly based on the integration of various types of information, particularly electronic health records and genome sequences. The availability of cheap genome sequencing services and information interoperability will increase the role of online bioinformatics analysis. Being on the Internet poses constant threats to security and privacy. While we are connected and share information, websites and internet services collect various types of personal data with or without the user's consent. It is likely that genomics will merge with the internet culture of connectivity. This process will increase incidental findings, exposure and vulnerability. Here we discuss the social vulnerability arising from the combined security and privacy weaknesses of genomics and the Internet. This calls for greater efforts in education and social awareness of how biomedical data are analysed and transferred through the internet and how inferential methods could integrate information from different sources. We propose that digital social platforms, used for raising collective awareness in different fields, could be developed for collaborative and bottom-up efforts in education. In this context, bioinformaticians could play a meaningful role in mitigating the future risk of a digital-genomic divide.
4
Reinert K, Dadi TH, Ehrhardt M, Hauswedell H, Mehringer S, Rahn R, Kim J, Pockrandt C, Winkler J, Siragusa E, Urgese G, Weese D. The SeqAn C++ template library for efficient sequence analysis: A resource for programmers. J Biotechnol 2017; 261:157-168. [PMID: 28888961 DOI: 10.1016/j.jbiotec.2017.07.017] [Citation(s) in RCA: 67] [Impact Index Per Article: 9.6] [Received: 02/26/2017] [Revised: 07/17/2017] [Accepted: 07/19/2017] [Indexed: 11/27/2022]
Abstract
BACKGROUND: The use of novel algorithmic techniques is pivotal to many important problems in life science. For example, the sequencing of the human genome (Venter et al., 2001) would not have been possible without advanced assembly algorithms, and the development of practical BWT-based read mappers has been instrumental for NGS analysis. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there was a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use. We previously addressed this by introducing the SeqAn library of efficient data types and algorithms in 2008 (Döring et al., 2008). RESULTS: The SeqAn library has matured considerably since its first publication 9 years ago. In this article we review its status as an established resource for programmers in the field of sequence analysis and its contributions to many analysis tools. CONCLUSIONS: We anticipate that SeqAn will continue to be a valuable resource, especially since it has started to actively support various hardware acceleration techniques in a systematic manner.
Affiliation(s)
- Knut Reinert
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany.
- Temesgen Hailemariam Dadi
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
- Marcel Ehrhardt
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
- Hannes Hauswedell
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
- Svenja Mehringer
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
- René Rahn
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
- Jongkyu Kim
- Efficient Algorithms for -Omics Data, Max Planck Institute for Molecular Genetics, Ihnestrasse 62-73, 14195 Berlin, Germany
- Christopher Pockrandt
- Efficient Algorithms for -Omics Data, Max Planck Institute for Molecular Genetics, Ihnestrasse 62-73, 14195 Berlin, Germany
- Jörg Winkler
- Efficient Algorithms for -Omics Data, Max Planck Institute for Molecular Genetics, Ihnestrasse 62-73, 14195 Berlin, Germany
- Gianvito Urgese
- Department of Control and Computer Engineering, Politecnico di Torino, Italy