1
Jung Y, Han D. BWA-MEME: BWA-MEM emulated with a machine learning approach. Bioinformatics 2022; 38:2404-2413. [PMID: 35253835 DOI: 10.1093/bioinformatics/btac137] [Citation(s) in RCA: 100] [Impact Index Per Article: 33.3] [Received: 09/01/2021] [Revised: 12/30/2021] [Accepted: 03/03/2022] [Indexed: 11/13/2022]
Abstract
MOTIVATION The growing use of next-generation sequencing and enlarged sequencing throughput require efficient short-read alignment, where seeding is one of the major performance bottlenecks. The key challenge in the seeding phase is searching for exact matches of substrings of short reads in the reference DNA sequence. Existing algorithms, however, present limitations in performance due to their frequent memory accesses. RESULTS This paper presents BWA-MEME, the first full-fledged short read alignment software that leverages learned indices for solving the exact match search problem for efficient seeding. BWA-MEME is a practical and efficient seeding algorithm based on a suffix array search algorithm that solves the challenges in utilizing learned indices for SMEM search which is extensively used in the seeding phase. Our evaluation shows that BWA-MEME achieves up to 3.45x speedup in seeding throughput over BWA-MEM2 by reducing the number of instructions by 4.60x, memory accesses by 8.77x, and LLC misses by 2.21x, while ensuring the identical SAM output to BWA-MEM2. AVAILABILITY The source code and test scripts are available for academic use at https://github.com/kaist-ina/BWA-MEME/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
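The exact-match search that the abstract identifies as the core of seeding can be illustrated with a plain suffix-array binary search. This is only a minimal sketch and not BWA-MEME's method: the function names are hypothetical, the suffix array is built naively, and the learned index described in the abstract is replaced here by an ordinary binary search.

```python
def build_suffix_array(text):
    # Naive construction for illustration only: sort suffix start positions
    # by the suffix they denote. Real aligners use far more efficient builders.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact_matches(text, suffix_array, pattern):
    # All suffixes that start with `pattern` occupy one contiguous block of
    # the suffix array, so two binary searches locate the block boundaries.
    m = len(pattern)

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Return the match positions in reference order.
    return sorted(suffix_array[start:lo])


reference = "ACGTACGTGA"  # toy reference, assumed for the example
sa = build_suffix_array(reference)
matches = find_exact_matches(reference, sa, "ACG")
```

Each probe of the binary search touches a different part of the reference, which is the kind of scattered memory access the abstract names as a bottleneck; a learned index would aim to predict the block position directly and shorten this search.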
Affiliation(s)
- Youngmok Jung
  - Department of Electrical Engineering, KAIST, Daejeon, 34141, Republic of Korea
- Dongsu Han
  - Department of Electrical Engineering, KAIST, Daejeon, 34141, Republic of Korea
2
Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Technology dictates algorithms: recent developments in read alignment. Genome Biol 2021; 22:249. [PMID: 34446078 PMCID: PMC8390189 DOI: 10.1186/s13059-021-02443-7] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 10/15/2020] [Accepted: 07/28/2021] [Indexed: 01/08/2023]
Abstract
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
Affiliation(s)
- Mohammed Alser
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Jeremy Rotman
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Dhrithi Deshpande
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
- Kodi Taraszka
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Huwenbo Shi
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Pelin Icer Baykal
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Harry Taegyun Yang
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
  - Bioinformatics Interdepartmental Ph.D. Program, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Victor Xue
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Benjamin D Singer
  - Division of Pulmonary and Critical Care Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
  - Department of Biochemistry & Molecular Genetics, Northwestern University Feinberg School of Medicine, Chicago, USA
  - Simpson Querrey Institute for Epigenetics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Brunilda Balliu
  - Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
- David Koslicki
  - Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16801, USA
  - Biology Department, Pennsylvania State University, University Park, PA, 16801, USA
  - The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16801, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
  - The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
- Can Alkan
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Bilkent-Hacettepe Health Sciences and Technologies Program, Ankara, Turkey
- Onur Mutlu
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
3
Noninvasive Prenatal Testing: Comparison of Two Mappers and Influence in the Diagnostic Yield. BIOMED RESEARCH INTERNATIONAL 2018; 2018:9498140. [PMID: 29977923 PMCID: PMC6011118 DOI: 10.1155/2018/9498140] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/17/2018] [Revised: 04/16/2018] [Accepted: 05/07/2018] [Indexed: 11/18/2022]
Abstract
Objective The aim of this study was to determine whether the use of different mappers for NIPT may vary the results considerably. Methods Peripheral blood was collected from 217 pregnant women, 58 pathological (34 pregnancies with trisomy 21, 18 with trisomy 18, and 6 with trisomy 13) and 159 euploid. MPS was performed following a manufacturer's modified protocol of semiconductor sequencing. The reads obtained were mapped with two different software programs, TMAP and HPG-Aligner, and the results were compared. Results Using TMAP, 57 pathological samples were correctly detected (sensitivity 98.28%, specificity 93.08%): 33 samples as trisomy 21 (sensitivity 97.06%, specificity 99.45%), 16 as trisomy 18 (sensitivity 88.89%, specificity 93.97%), and 6 as trisomy 13 (sensitivity 100%, specificity 100%). 11 false positives, 1 false negative, and 2 samples incorrectly identified were obtained. Using HPG-Aligner, all the 58 pathological samples were correctly identified (sensitivity 100%, specificity 96.86%): 34 as trisomy 21 (sensitivity 100%, specificity 98.91%), 18 as trisomy 18 (sensitivity 100%, specificity 98.99%), and 6 as trisomy 13 (sensitivity 100%, specificity 99.53%). 5 false positives were obtained. Conclusion Different mappers use slightly different algorithms, so the use of one mapper or another with the same batch file can provide different results.
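The overall figures quoted in this abstract follow from the standard definitions of sensitivity and specificity. The sketch below recomputes them from the counts given in the text (58 pathological and 159 euploid samples; 1 false negative and 11 false positives for TMAP, 0 and 5 for HPG-Aligner). The helper names are hypothetical, and the true-negative counts are derived here as 159 minus the false positives, which is an assumption about how the study tallied them.

```python
def sensitivity(true_positives, false_negatives):
    # Fraction of pathological samples that were detected.
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    # Fraction of euploid samples that were correctly reported as negative.
    return true_negatives / (true_negatives + false_positives)


# TMAP: 57 of 58 pathological samples detected, 11 false positives among 159 euploid.
tmap_sensitivity = round(100 * sensitivity(57, 1), 2)
tmap_specificity = round(100 * specificity(159 - 11, 11), 2)

# HPG-Aligner: all 58 pathological samples detected, 5 false positives among 159 euploid.
hpg_sensitivity = round(100 * sensitivity(58, 0), 2)
hpg_specificity = round(100 * specificity(159 - 5, 5), 2)
```

Under these assumptions the computed percentages agree with the overall values reported in the abstract (98.28% and 93.08% for TMAP, 100% and 96.86% for HPG-Aligner).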
4
Lin HN, Hsu WL. Kart: a divide-and-conquer algorithm for NGS read alignment. Bioinformatics 2018; 33:2281-2287. [PMID: 28379292 PMCID: PMC5860120 DOI: 10.1093/bioinformatics/btx189] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 11/08/2016] [Accepted: 04/05/2017] [Indexed: 02/02/2023]
Abstract
Motivation Next-generation sequencing (NGS) provides a great opportunity to investigate genome-wide variation at nucleotide resolution. Due to the huge amount of data, NGS applications require very fast and accurate alignment algorithms. Most existing algorithms for read mapping adopt a seed-and-extend strategy, which is sequential in nature and takes much longer on longer reads. Results We develop a divide-and-conquer algorithm, called Kart, which can process long reads as fast as short reads by dividing a read into small fragments that can be aligned independently. Our experimental results indicate that the average size of fragments requiring the more time-consuming gapped alignment is around 20 bp regardless of the original read length. Furthermore, it can tolerate much higher error rates. The experiments show that Kart spends much less time on longer reads than other aligners and still produces reliable alignments even when the error rate is as high as 15%. Availability and Implementation Kart is available at https://github.com/hsinnan75/Kart/. Contact hsu@iis.sinica.edu.tw. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hsin-Nan Lin
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Wen-Lian Hsu
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
5
Juanes JM, Gallego A, Tárraga J, Chaves FJ, Marín-Garcia P, Medina I, Arnau V, Dopazo J. VISMapper: ultra-fast exhaustive cartography of viral insertion sites for gene therapy. BMC Bioinformatics 2017; 18:421. [PMID: 28931371 PMCID: PMC5607581 DOI: 10.1186/s12859-017-1837-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2017] [Accepted: 09/12/2017] [Indexed: 11/10/2022]
Abstract
BACKGROUND The possibility of integrating viral vectors to become a persistent part of the host genome makes them a crucial element of clinical gene therapy. However, viral integration has associated risks, such as the unintentional activation of oncogenes that can result in cancer. Therefore, the analysis of integration sites of retroviral vectors is a crucial step in developing safer vectors for therapeutic use. RESULTS Here we present VISMapper, a vector integration site analysis web server, to analyze next-generation sequencing data for retroviral vector integration sites. VISMapper can be found at: http://vismapper.babelomics.org . CONCLUSIONS Because it uses novel mapping algorithms VISMapper is remarkably faster than previous available programs. It also provides a useful graphical interface to analyze the integration sites found in the genomic context.
Affiliation(s)
- José M Juanes
  - Departamento de Informática, Escuela Técnica Superior de Ingeniería (ETSE), Universidad de Valencia, 46100, Valencia, Burjassot, Spain
  - Computational Genomics Department, Prince Felipe Research Center, 46012, Valencia, Spain
- Asunción Gallego
  - Clinical Bioinformatics Research Area, Fundación Progreso y Salud, Hospital Virgen del Rocío, 41013, Sevilla, Spain
  - Bioinformatics in Rare Diseases (BiER), Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), Hospital Virgen del Rocío, 41013, Sevilla, Spain
- Joaquín Tárraga
  - Computational Genomics Department, Prince Felipe Research Center, 46012, Valencia, Spain
  - HPC Service, University Information Services, University of Cambridge, Cambridge, UK
- Felipe J Chaves
  - Genotyping and Genetic Diagnosis Unit, Health Research Institute, INCLIVA, Valencia, Spain
  - CIBERDem, Health Institute Carlos III, Madrid, Spain
- Pablo Marín-Garcia
  - Genotyping and Genetic Diagnosis Unit, Health Research Institute, INCLIVA, Valencia, Spain
  - Institute for Integrative Systems Biology (I2SysBio), Universidad de Valencia-CSIC, 46980, Valencia, Paterna, Spain
- Ignacio Medina
  - HPC Service, University Information Services, University of Cambridge, Cambridge, UK
- Vicente Arnau
  - Departamento de Informática, Escuela Técnica Superior de Ingeniería (ETSE), Universidad de Valencia, 46100, Valencia, Burjassot, Spain
  - Computational Genomics Department, Prince Felipe Research Center, 46012, Valencia, Spain
  - Institute for Integrative Systems Biology (I2SysBio), Universidad de Valencia-CSIC, 46980, Valencia, Paterna, Spain
- Joaquín Dopazo
  - Clinical Bioinformatics Research Area, Fundación Progreso y Salud, Hospital Virgen del Rocío, 41013, Sevilla, Spain
  - Bioinformatics and Data Analysis Unit, Genomic Medicine Institute Imegen, Valencia, Spain
  - Functional Genomics Node, INB-ELIXIR-es, Hospital Virgen del Rocío, 42013, Sevilla, Spain
6
A new parallel pipeline for DNA methylation analysis of long reads datasets. BMC Bioinformatics 2017; 18:161. [PMID: 28274198 PMCID: PMC5343294 DOI: 10.1186/s12859-017-1574-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 11/17/2016] [Accepted: 03/01/2017] [Indexed: 12/02/2022]
Abstract
Background DNA methylation is an important mechanism of epigenetic regulation in development and disease. New generation sequencers allow genome-wide measurements of the methylation status by reading short stretches of the DNA sequence (Methyl-seq). Several software tools for methylation analysis have been proposed over recent years. However, the current trend is that the new sequencers and the ones expected for an upcoming future yield sequences of increasing length, making these software tools inefficient and obsolete. Results In this paper, we propose a new software based on a strategy for methylation analysis of Methyl-seq sequencing data that requires much shorter execution times while yielding a better level of sensitivity, particularly for datasets composed of long reads. This strategy can be exported to other methylation, DNA and RNA analysis tools. Conclusions The developed software tool achieves execution times one order of magnitude shorter than the existing tools, while yielding equal sensitivity for short reads and even better sensitivity for long reads. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1574-3) contains supplementary material, which is available to authorized users.
7
Tarraga J, Gallego A, Arnau V, Medina I, Dopazo J. HPG pore: an efficient and scalable framework for nanopore sequencing data. BMC Bioinformatics 2016; 17:107. [PMID: 26921234 PMCID: PMC4769497 DOI: 10.1186/s12859-016-0966-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 09/03/2015] [Accepted: 02/22/2016] [Indexed: 12/03/2022]
Abstract
BACKGROUND The use of nanopore technologies is expected to spread in the future because they are portable and can sequence long fragments of DNA molecules without prior amplification. The first nanopore sequencer available, the MinION™ from Oxford Nanopore Technologies, is a USB-connected, portable device that allows real-time DNA analysis. In addition, other new instruments are expected to be released soon, which promise to outperform the current short-read technologies in terms of throughput. Despite the flood of data expected from this technology, the data analysis solutions currently available are only designed to manage small projects and are not scalable. RESULTS Here we present HPG Pore, a toolkit for exploring and analysing nanopore sequencing data. HPG Pore can run on both individual computers and in the Hadoop distributed computing framework, which allows easy scale-up to manage the large amounts of data expected to result from extensive use of nanopore technologies in the future. CONCLUSIONS HPG Pore allows for virtually unlimited sequencing data scalability, thus guaranteeing its continued management in near future scenarios. HPG Pore is available in GitHub at http://github.com/opencb/hpg-pore.
Affiliation(s)
- Joaquin Tarraga
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
- Asunción Gallego
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
- Vicente Arnau
  - Departamento de Informática, ETSE, Universidad de Valencia, Valencia, Spain
- Ignacio Medina
  - HPC Service, University Information Services, University of Cambridge, Cambridge, UK
- Joaquin Dopazo
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
  - Bioinformatics of Rare Diseases (BIER), CIBER de Enfermedades Raras (CIBERER), Valencia, Spain
  - Functional Genomics Node, (INB) at CIPF, Valencia, 46012, Spain
8
Al-Ghalith GA, Montassier E, Ward HN, Knights D. NINJA-OPS: Fast Accurate Marker Gene Alignment Using Concatenated Ribosomes. PLoS Comput Biol 2016; 12:e1004658. [PMID: 26820746 PMCID: PMC4731464 DOI: 10.1371/journal.pcbi.1004658] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 09/01/2015] [Accepted: 11/12/2015] [Indexed: 11/28/2022]
Abstract
The explosion of bioinformatics technologies in the form of next generation sequencing (NGS) has facilitated a massive influx of genomics data in the form of short reads. Short read mapping is therefore a fundamental component of next generation sequencing pipelines which routinely match these short reads against reference genomes for contig assembly. However, such techniques have seldom been applied to microbial marker gene sequencing studies, which have mostly relied on novel heuristic approaches. We propose NINJA Is Not Just Another OTU-Picking Solution (NINJA-OPS, or NINJA for short), a fast and highly accurate novel method enabling reference-based marker gene matching (picking Operational Taxonomic Units, or OTUs). NINJA takes advantage of the Burrows-Wheeler (BW) alignment using an artificial reference chromosome composed of concatenated reference sequences, the “concatesome,” as the BW input. Other features include automatic support for paired-end reads with arbitrary insert sizes. NINJA is also free and open source and implements several pre-filtering methods that elicit substantial speedup when coupled with existing tools. We applied NINJA to several published microbiome studies, obtaining accuracy similar to or better than previous reference-based OTU-picking methods while achieving an order of magnitude or more speedup and using a fraction of the memory footprint. NINJA is a complete pipeline that takes a FASTA-formatted input file and outputs a QIIME-formatted taxonomy-annotated BIOM file for an entire MiSeq run of human gut microbiome 16S genes in under 10 minutes on a dual-core laptop. The analysis of the microbial communities in and around us is a growing field of study, partly because of its major implications for human health, and partly because high-throughput DNA sequencing technology has only recently emerged to enable us to quantitatively study them. 
One of the most fundamental steps in analyzing these microbial communities is matching the microbial marker genes in environmental samples with existing databases to determine which microbes are present. The current techniques for doing this analysis are either slow or closed-source. We present an alternative technique that takes advantage of a high-speed Burrows-Wheeler alignment procedure combined with rapid filtering and parsing of the data to remove bottlenecks in the pipeline. We achieve an order-of-magnitude speedup over conventional techniques without sacrificing accuracy or memory use, and in some cases improving both significantly. Thus our method allows more biologists to process their own sequencing data without specialized computing resources, and it obtains more accurate and even optimal taxonomic annotation for their marker gene sequencing data.
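The "concatesome" idea in this abstract, aligning against one artificial chromosome built from many reference sequences, implies a bookkeeping step: a hit position in the concatenated sequence has to be translated back to the reference it came from. The sketch below shows one way to do that with a sorted offset table. It is an illustration under assumptions, not NINJA's implementation: the function names, the OTU labels, and the toy sequences are all hypothetical, and the alignment step itself is omitted.

```python
from bisect import bisect_right


def build_concatesome(references):
    # `references` is a list of (name, sequence) pairs. The sequences are
    # joined end to end and the start offset of each one is recorded.
    names, starts, parts = [], [], []
    offset = 0
    for name, sequence in references:
        names.append(name)
        starts.append(offset)
        parts.append(sequence)
        offset += len(sequence)
    return "".join(parts), names, starts


def locate(position, names, starts):
    # Find the last reference whose start offset is <= position, then
    # convert the global position to a position within that reference.
    index = bisect_right(starts, position) - 1
    return names[index], position - starts[index]


# Toy marker-gene references, assumed for the example.
concatesome, names, starts = build_concatesome(
    [("otu1", "ACGT"), ("otu2", "GGCC"), ("otu3", "TTA")]
)
hit = locate(5, names, starts)
```

A production pipeline would also need to reject alignments that span the boundary between two concatenated references; that check is left out of this sketch.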
Affiliation(s)
- Gabriel A. Al-Ghalith
  - Biomedical Informatics and Computational Biology, University of Minnesota, Minneapolis, Minnesota, United States of America
- Emmanuel Montassier
  - University of Nantes, Nantes, France
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, United States of America
- Henry N. Ward
  - Lawrence University, Appleton, Wisconsin, United States of America
- Dan Knights
  - Biomedical Informatics and Computational Biology, University of Minnesota, Minneapolis, Minnesota, United States of America
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, United States of America
9
Medina I, Tárraga J, Martínez H, Barrachina S, Castillo MI, Paschall J, Salavert-Torres J, Blanquer-Espert I, Hernández-García V, Quintana-Ortí ES, Dopazo J. Highly sensitive and ultrafast read mapping for RNA-seq analysis. DNA Res 2016; 23:93-100. [PMID: 26740642 PMCID: PMC4833417 DOI: 10.1093/dnares/dsv039] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 03/13/2015] [Accepted: 11/21/2015] [Indexed: 01/24/2023]
Abstract
As sequencing technologies progress, the amount of data produced grows exponentially, shifting the bottleneck of discovery towards the data analysis phase. In particular, currently available mapping solutions for RNA-seq leave room for improvement in terms of sensitivity and performance, hindering an efficient analysis of transcriptomes by massive sequencing. Here, we present an innovative approach that combines re-engineering, optimization and parallelization. This solution results in a significant increase in mapping sensitivity over a wide range of read lengths and substantially shorter runtimes compared with currently available RNA-seq mapping methods.
Affiliation(s)
- I Medina
  - HPC Service, UIS, University of Cambridge, Cambridge, UK
- J Tárraga
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
- H Martínez
  - Departamento de Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón de la Plana, Spain
- S Barrachina
  - Departamento de Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón de la Plana, Spain
- M I Castillo
  - Departamento de Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón de la Plana, Spain
- J Paschall
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
- J Salavert-Torres
  - Instituto de Instrumentación para Imagen Molecular, Universitat Politècnica de València, Valencia, Spain
- I Blanquer-Espert
  - Instituto de Instrumentación para Imagen Molecular, Universitat Politècnica de València, Valencia, Spain
  - Grupo de Investigación Biomédica de Imagen (GIBI 2^30), La Fe Polytechnic University Hospital, Valencia, Spain
- V Hernández-García
  - Instituto de Instrumentación para Imagen Molecular, Universitat Politècnica de València, Valencia, Spain
- E S Quintana-Ortí
  - Departamento de Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón de la Plana, Spain
- J Dopazo
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
  - Functional Genomics Node, (INB) at CIPF, Valencia, Spain
  - CIBER de Enfermedades Raras (CIBERER), Valencia, Spain
10
Tárraga J, Pérez M, Orduña JM, Duato J, Medina I, Dopazo J. A parallel and sensitive software tool for methylation analysis on multicore platforms. Bioinformatics 2015; 31:3130-8. [PMID: 26069264 PMCID: PMC4679392 DOI: 10.1093/bioinformatics/btv357] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 09/08/2014] [Accepted: 06/05/2015] [Indexed: 11/17/2022]
Abstract
Motivation: DNA methylation analysis suffers from very long processing time, as the advent of Next-Generation Sequencers has shifted the bottleneck of genomic studies from the sequencers that obtain the DNA samples to the software that performs the analysis of these samples. The existing software for methylation analysis does not seem to scale efficiently with either the size of the dataset or the length of the reads to be analyzed. As it is expected that the sequencers will provide longer and longer reads in the near future, efficient and scalable methylation software should be developed. Results: We present a new software tool, called HPG-Methyl, which efficiently maps bisulphite sequencing reads on DNA, analyzing DNA methylation. The strategy used by this software consists of leveraging the speed of the Burrows–Wheeler Transform to map a large number of DNA fragments (reads) rapidly, as well as the accuracy of the Smith–Waterman algorithm, which is exclusively employed to deal with the most ambiguous and shortest reads. Experimental results on platforms with Intel multicore processors show that HPG-Methyl significantly outperforms state-of-the-art software such as Bismark, BS-Seeker or BSMAP in both execution time and sensitivity, particularly for long bisulphite reads. Availability and implementation: Software in the form of C libraries and functions, together with instructions to compile and execute this software. Available by sftp to anonymous@clariano.uv.es (password ‘anonymous’). Contact: juan.orduna@uv.es or jdopazo@cipf.es
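The Smith–Waterman step that this abstract reserves for the most ambiguous reads is a dynamic-programming local alignment. The sketch below computes only the best local alignment score with a linear gap penalty. It is a generic textbook version, not HPG-Methyl's code: the function name and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration, and bisulphite-specific handling is not modelled.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    # Keep only the previous row of the scoring matrix; every cell is
    # clamped at zero so an alignment can start anywhere (local alignment).
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


score = smith_waterman_score("TTACGTTT", "GGACGGG")
```

The cost grows with the product of the two sequence lengths, which is consistent with the abstract's choice to apply this step only to the reads that the faster Burrows–Wheeler mapping cannot resolve.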
Affiliation(s)
- Joaquín Tárraga
  - Department of Computational Genomics, Centro de Investigación Príncipe Felipe
- Mariano Pérez
  - Departamento de Informática, Universidad de Valencia
- Juan M Orduña
  - Departamento de Informática, Universidad de Valencia
- José Duato
  - DISCA, Universidad Politécnica de Valencia, Valencia, Spain
- Ignacio Medina
  - Department of Computational Genomics, Centro de Investigación Príncipe Felipe
- Joaquín Dopazo
  - Department of Computational Genomics, Centro de Investigación Príncipe Felipe