1
Ji F, Zhou Q, Ruan J, Zhu Z, Liu X. A compressive seeding algorithm in conjunction with reordering-based compression. Bioinformatics 2024; 40:btae100. [PMID: 38377404 PMCID: PMC10955252 DOI: 10.1093/bioinformatics/btae100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Revised: 01/29/2024] [Accepted: 02/19/2024] [Indexed: 02/22/2024] Open
Abstract
MOTIVATION: Seeding is a rate-limiting stage in sequence alignment for next-generation sequencing reads. Existing optimization algorithms typically use hardware and machine-learning techniques to accelerate seeding. However, an efficient solution provided by professional next-generation sequencing compressors has so far been largely overlooked. In addition to achieving remarkable compression ratios by reordering reads, these compressors provide valuable insights for downstream alignment: they reveal repetitive computations that account for more than 50% of the seeding procedure in the commonly used short-read aligner BWA-MEM at typical sequencing coverage. Nevertheless, this redundancy information has not been fully exploited. RESULTS: In this study, we present a compressive seeding algorithm, named CompSeed, to fill the gap. CompSeed, in collaboration with existing reordering-based compression tools, finishes the BWA-MEM seeding process in about half the time by caching all intermediate seeding results in compact trie structures to directly answer repetitive inquiries that frequently cause random memory accesses. Furthermore, CompSeed performs better as sequencing coverage increases, as it focuses solely on the small informative portion of sequencing reads after compression. This innovative strategy highlights the promising potential of integrating sequence compression and alignment to tackle the ever-growing volume of sequencing data. AVAILABILITY AND IMPLEMENTATION: CompSeed is available at https://github.com/i-xiaohu/CompSeed.
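The caching idea in this abstract can be illustrated with a toy sketch. It is not CompSeed's implementation: the names `SeedCache` and `count_occurrences` are hypothetical, and a naive substring count stands in for the real index query. The sketch only shows how a trie keyed by read characters lets a later read reuse results already computed for a shared prefix.

```python
class TrieNode:
    __slots__ = ("children", "result")

    def __init__(self):
        self.children = {}   # base -> TrieNode
        self.result = None   # cached query result for the prefix ending here


class SeedCache:
    """Toy prefix cache (hypothetical): memoizes a per-prefix query in a trie."""

    def __init__(self, lookup):
        self.root = TrieNode()
        self.lookup = lookup  # stand-in for an expensive index query
        self.misses = 0       # how many times `lookup` was actually called

    def query_prefixes(self, read):
        """Return lookup(read[:i]) for i = 1..len(read), reusing cached nodes."""
        node, results = self.root, []
        for i, base in enumerate(read, start=1):
            child = node.children.get(base)
            if child is None:
                child = TrieNode()
                node.children[base] = child
            if child.result is None:
                # Cache miss: run the real query once and store it in the trie.
                child.result = self.lookup(read[:i])
                self.misses += 1
            results.append(child.result)
            node = child
        return results


def count_occurrences(reference):
    """Stand-in index: counts (overlapping) occurrences of a pattern."""
    def lookup(pattern):
        return sum(reference.startswith(pattern, i) for i in range(len(reference)))
    return lookup


cache = SeedCache(count_occurrences("ACGTACGTAC"))
first = cache.query_prefixes("ACGT")   # four real queries
second = cache.query_prefixes("ACGA")  # only the last prefix is new
```

When reads are reordered so that similar reads sit next to each other, most prefixes of the second read are already in the trie, which is the kind of redundancy the abstract describes.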
Affiliation(s)
- Fahu Ji
  - School of Computer Science and Technology, Harbin Institute of Technology, Nan Gang District, Harbin 150080, China
- Qian Zhou
  - Peng Cheng Laboratory, Nanshan District, Shenzhen 518055, China
- Jue Ruan
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Dapeng District, Shenzhen 518120, China
- Zexuan Zhu
  - College of Computer Science and Software Engineering, Shenzhen University, Nanshan District, Shenzhen 518060, China
- Xianming Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Nan Gang District, Harbin 150080, China
  - Peng Cheng Laboratory, Nanshan District, Shenzhen 518055, China
2
Wei ZG, Zhang XD, Fan XG, Qian Y, Liu F, Wu FX. pathMap: a path-based mapping tool for long noisy reads with high sensitivity. Brief Bioinform 2024; 25:bbae107. [PMID: 38517696 PMCID: PMC10959152 DOI: 10.1093/bib/bbae107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2023] [Revised: 12/25/2023] [Accepted: 02/28/2024] [Indexed: 03/24/2024] Open
Abstract
With the rapid development of single-molecule sequencing (SMS) technologies, the output read length is continuously increasing. Mapping such reads onto a reference genome is one of the most fundamental tasks in sequence analysis. Mapping sensitivity is becoming a major concern since high sensitivity can detect more aligned regions on the reference and obtain more aligned bases, which are useful for downstream analysis. In this study, we present pathMap, a novel k-mer graph-based mapper that is specifically designed for mapping SMS reads with high sensitivity. By viewing the alignment chain as a path containing as many anchors as possible in the matched k-mer graph, pathMap treats chaining as a path selection problem in the directed graph. pathMap iteratively searches the longest path in the remaining nodes; more candidate chains with high quality can be effectively detected and aligned. Compared to other state-of-the-art mapping methods such as minimap2 and Winnowmap2, experiment results on simulated and real-life datasets demonstrate that pathMap obtains the number of mapped chains at least 11.50% more than its closest competitor and increases the mapping sensitivity by 17.28% and 13.84% of bases over the next-best mapper for Pacific Biosciences and Oxford Nanopore sequencing data, respectively. In addition, pathMap is more robust to sequence errors and more sensitive to species- and strain-specific identification of pathogens using MinION reads.
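Treating chaining as repeated longest-path selection can be sketched as follows. This is a minimal illustration, not pathMap's algorithm: anchors are assumed to be `(reference_pos, read_pos)` pairs, an edge is assumed to exist whenever both coordinates strictly increase, and every anchor has weight 1. The function names and the `min_anchors` cutoff are hypothetical.

```python
def longest_chain(anchors):
    """Longest path in the DAG where a -> b iff both coordinates strictly increase."""
    order = sorted(anchors)
    if not order:
        return []
    best = [1] * len(order)    # best[j]: anchors on the longest path ending at j
    prev = [-1] * len(order)   # back-pointers for path recovery
    for j, (rj, qj) in enumerate(order):
        for i in range(j):
            ri, qi = order[i]
            if ri < rj and qi < qj and best[i] + 1 > best[j]:
                best[j], prev[j] = best[i] + 1, i
    end = max(range(len(order)), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(order[end])
        end = prev[end]
    return chain[::-1]


def iterative_chains(anchors, min_anchors=2):
    """Repeatedly take the longest path among the remaining anchors."""
    remaining = set(anchors)
    chains = []
    while remaining:
        chain = longest_chain(remaining)
        if len(chain) < min_anchors:
            break
        chains.append(chain)
        remaining.difference_update(chain)  # used anchors leave the graph
    return chains


anchors = [(10, 1), (20, 2), (30, 3), (100, 2), (110, 3), (5, 50)]
chains = iterative_chains(anchors)
```

The quadratic dynamic program is chosen for clarity; a real mapper would bound the look-back window and score gaps between anchors.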
Affiliation(s)
- Ze-Gang Wei
  - School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
  - Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Xiao-Dan Zhang
  - School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Xing-Guo Fan
  - School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Yu Qian
  - School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Fei Liu
  - School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Fang-Xiang Wu
  - Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
3
Sahlin K, Baudeau T, Cazaux B, Marchet C. A survey of mapping algorithms in the long-reads era. Genome Biol 2023; 24:133. [PMID: 37264447 PMCID: PMC10236595 DOI: 10.1186/s13059-023-02972-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 08/03/2022] [Accepted: 05/12/2023] [Indexed: 06/03/2023] Open
Abstract
It has been over a decade since the first publication of a method dedicated entirely to mapping long-reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings ( http://bcazaux.polytech-lille.net/Minimap2/ ).
Affiliation(s)
- Kristoffer Sahlin
  - Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91, Stockholm, Sweden
- Thomas Baudeau
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
- Bastien Cazaux
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
- Camille Marchet
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
4
Jain C, Rhie A, Hansen NF, Koren S, Phillippy AM. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods 2022; 19:705-710. [PMID: 35365778 PMCID: PMC10510034 DOI: 10.1038/s41592-022-01457-8] [Citation(s) in RCA: 106] [Impact Index Per Article: 35.3] [Received: 05/22/2021] [Accepted: 03/17/2022] [Indexed: 01/10/2023]
Abstract
Approximately 5-10% of the human genome remains inaccessible due to the presence of repetitive sequences such as segmental duplications and tandem repeat arrays. We show that existing long-read mappers often yield incorrect alignments and variant calls within long, near-identical repeats, as they remain vulnerable to allelic bias. In the presence of a nonreference allele within a repeat, a read sampled from that region could be mapped to an incorrect repeat copy. To address this limitation, we developed a new long-read mapping method, Winnowmap2, by using minimal confidently alignable substrings. Winnowmap2 computes each read mapping through a collection of confident subalignments. This approach is more tolerant of structural variation and more sensitive to paralog-specific variants within repeats. Our experiments highlight that Winnowmap2 successfully addresses the issue of allelic bias, enabling more accurate downstream variant calls in repetitive sequences.
Affiliation(s)
- Chirag Jain
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India
  - Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD, USA
- Arang Rhie
  - Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD, USA
- Nancy F Hansen
  - Comparative Genomics Analysis Unit, National Human Genome Research Institute, Bethesda, MD, USA
- Sergey Koren
  - Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD, USA
- Adam M Phillippy
  - Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD, USA
5
Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Technology dictates algorithms: recent developments in read alignment. Genome Biol 2021; 22:249. [PMID: 34446078 PMCID: PMC8390189 DOI: 10.1186/s13059-021-02443-7] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 10/15/2020] [Accepted: 07/28/2021] [Indexed: 01/08/2023] Open
Abstract
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
Affiliation(s)
- Mohammed Alser
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Jeremy Rotman
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Dhrithi Deshpande
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
- Kodi Taraszka
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Huwenbo Shi
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Pelin Icer Baykal
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Harry Taegyun Yang
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
  - Bioinformatics Interdepartmental Ph.D. Program, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Victor Xue
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Benjamin D Singer
  - Division of Pulmonary and Critical Care Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
  - Department of Biochemistry & Molecular Genetics, Northwestern University Feinberg School of Medicine, Chicago, USA
  - Simpson Querrey Institute for Epigenetics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Brunilda Balliu
  - Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
- David Koslicki
  - Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16801, USA
  - Biology Department, Pennsylvania State University, University Park, PA, 16801, USA
  - The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16801, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
  - The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
- Can Alkan
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Bilkent-Hacettepe Health Sciences and Technologies Program, Ankara, Turkey
- Onur Mutlu
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
6
Nodehi HM, Tabatabaiefar MA, Sehhati M. Selection of Optimal Bioinformatic Tools and Proper Reference for Reducing the Alignment Error in Targeted Sequencing Data. JOURNAL OF MEDICAL SIGNALS & SENSORS 2021; 11:37-44. [PMID: 34026589 PMCID: PMC8043119 DOI: 10.4103/jmss.jmss_7_20] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/20/2020] [Revised: 01/28/2020] [Accepted: 02/12/2020] [Indexed: 11/04/2022]
Abstract
Background: Careful design in the primary steps of a next-generation sequencing study is critical for obtaining successful results in downstream analysis. Methods: In this study, a framework is proposed to evaluate and improve sequence mapping in targeted regions of the reference genome. Simulated short reads were produced from the coding regions of the human genome and mapped to a Customized Target-Based Reference (CTBR) by recently introduced alignment tools. The short reads produced by different sequencing technologies were aligned to the standard genome and also to the CTBR, with and without well-defined mutation types, and the number of unmapped and misaligned reads and the runtime were measured for comparison. Results: The mapping accuracy of reads generated from Illumina HiSeq 2500, using Stampy as the alignment tool and the CTBR as the reference, was significantly better than that of the other evaluated pipelines. Using the CTBR for alignment significantly decreased the mapping error in comparison with other expanded or more limited references. When intentional mutations were introduced into the reads, Stampy showed the minimum error of 1.67% using the CTBR. However, the lowest errors obtained by Stampy using the whole genome and one chromosome as references were 3.78% and 20%, respectively. Maximum and minimum misalignment errors were observed on chromosomes Y and 20, respectively. Conclusion: Using the proposed framework in a clinical targeted sequencing study may therefore help to predict the error and improve the performance of variant calling for the genomic regions targeted in the study.
Affiliation(s)
- Hannane Mohammadi Nodehi
  - Department of Bioelectric and Biomedical Engineering, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- Mohammad Amin Tabatabaiefar
  - Department of Medical Genetics, School of Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
  - Department of Bioinformatics, Medical Image and Signal Processing Research Center, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- Mohammadreza Sehhati
  - Department of Bioelectric and Biomedical Engineering, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
7
Wei ZG, Zhang SW, Liu F. smsMap: mapping single molecule sequencing reads by locating the alignment starting positions. BMC Bioinformatics 2020; 21:341. [PMID: 32753028 PMCID: PMC7430848 DOI: 10.1186/s12859-020-03698-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 04/26/2020] [Accepted: 07/23/2020] [Indexed: 01/09/2023] Open
Abstract
Background: Single Molecule Sequencing (SMS) technology can produce longer reads, but with a higher sequencing error rate. Mapping these reads to a reference genome is often the most fundamental and computing-intensive step for downstream analysis. Most existing mapping tools adopt the traditional seed-and-extend strategy, and the candidate aligned regions for each query read are selected either by counting the number of matched seeds or by chaining a group of seeds. However, for the existing mapping tools, the coverage ratio of the alignment region to the query read is low, and the read alignment quality and efficiency need to be improved. Here, we introduce smsMap, a novel mapping tool specifically designed to map long SMS reads to a reference genome. Results: smsMap was evaluated against seven other existing SMS mapping tools (e.g., BLASR, minimap2, and BWA-MEM) on both simulated and real-life SMS datasets. The experimental results show that smsMap efficiently achieves a higher aligned read coverage ratio and has higher sensitivity, aligning more sequences and bases to the reference genome. Additionally, smsMap is more robust to sequencing errors. Conclusions: smsMap is computationally efficient for aligning SMS reads, especially for larger reference genomes (e.g., the H. sapiens genome with over 3 billion base pairs). The source code of smsMap can be freely downloaded from https://github.com/NWPU-903PR/smsMap.
Affiliation(s)
- Ze-Gang Wei
  - Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
  - Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Shao-Wu Zhang
  - Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
- Fei Liu
  - Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
8
Kumar S, Agarwal S, Ranvijay. Fast and memory efficient approach for mapping NGS reads to a reference genome. J Bioinform Comput Biol 2020; 17:1950008. [PMID: 31057068 DOI: 10.1142/s0219720019500082] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/22/2022]
Abstract
New-generation sequencing machines such as Illumina and Solexa can generate millions of short reads from a given genome sequence in a single run. Aligning these reads to a reference genome is a core step in next-generation sequencing data analysis, for example in genetic variation studies and genome re-sequencing. There is therefore a need for a new approach that is efficient in both memory and time when aligning these enormous read sets to the reference genome. Existing techniques such as MAQ, Bowtie, BWA, BWBBLE, Subread, Kart, and Minimap2 require huge memory for whole-reference-genome indexing and read alignment. Gapped-alignment versions of these techniques are also 20-40% slower than their respective normal versions. In this paper, an efficient approach, WIT, is proposed for reference genome indexing and read alignment using the Burrows-Wheeler Transform (BWT) and Wavelet Tree (WT). It supports both exact and approximate alignment. Experimental work shows that the proposed approach WIT performs the best in the case of protein sequence indexing. For indexing the reference genome, the space required by WIT is 0.6 N (N is the size of the reference genome), whereas the existing techniques BWA, Subread, Kart, and Minimap2 require between 1.25 N and 5 N. Experimentally, it is also observed that, even with such a small index, the alignment time of the proposed approach is comparable to that of BWA, Subread, Kart, and Minimap2. Other alignment parameters, accuracy and confidentiality, are also experimentally shown to be better than those of Minimap2. The source code of the proposed approach WIT is available at http://www.algorithm-skg.com/wit/home.html.
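For readers unfamiliar with BWT-based indexing, exact matching over a Burrows-Wheeler Transform can be sketched as below. This is a textbook backward search, not WIT's code: the function names are hypothetical, the transform is built from sorted rotations, and a plain `str.count` stands in for the rank queries that a wavelet tree would answer efficiently.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations, with '$' as sentinel."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def count_matches(bwt_str, pattern):
    """Count exact occurrences of `pattern` using backward search on the BWT."""
    # first_row[c]: index of the first row (in sorted order) that starts with c.
    first_row = {}
    for i, c in enumerate(sorted(bwt_str)):
        first_row.setdefault(c, i)
    lo, hi = 0, len(bwt_str)  # half-open range of rows prefixed by the current suffix
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        # rank(c, k) is computed naively here; a wavelet tree would answer it faster.
        lo = first_row[c] + bwt_str.count(c, 0, lo)
        hi = first_row[c] + bwt_str.count(c, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("banana")
hits = count_matches(transformed, "ana")
```

The sketch rebuilds the sorted first column on every call for brevity; an index would precompute it once.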
Affiliation(s)
- Ranvijay
  - CSED, NIT Allahabad, 211004, India
9
Morgulis A, Agarwala R. SRPRISM (Single Read Paired Read Indel Substitution Minimizer): an efficient aligner for assemblies with explicit guarantees. Gigascience 2020; 9:giaa023. [PMID: 32315028 PMCID: PMC7172022 DOI: 10.1093/gigascience/giaa023] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/13/2018] [Revised: 08/15/2019] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND Alignment of sequence reads generated by next-generation sequencing is an integral part of most pipelines analyzing next-generation sequencing data. A number of tools designed to quickly align a large volume of sequences are already available. However, most existing tools lack explicit guarantees about their output. They also do not support searching genome assemblies, such as the human genome assembly GRCh38, that include primary and alternate sequences and placement information for alternate sequences to primary sequences in the assembly. FINDINGS This paper describes SRPRISM (Single Read Paired Read Indel Substitution Minimizer), an alignment tool for aligning reads without splices. SRPRISM has features not available in most tools, such as (i) support for searching genome assemblies with alternate sequences, (ii) partial alignment of reads with a specified region of reads to be included in the alignment, (iii) choice of ranking schemes for alignments, and (iv) explicit criteria for search sensitivity. We compare the performance of SRPRISM to GEM, Kart, STAR, BWA-MEM, Bowtie2, Hobbes, and Yara using benchmark sets for paired and single reads of lengths 100 and 250 bp generated using DWGSIM. SRPRISM found the best results for most benchmark sets with error rate of up to ∼2.5% and GEM performed best for higher error rates. SRPRISM was also more sensitive than other tools even when sensitivity was reduced to improve run time performance. CONCLUSIONS We present SRPRISM as a flexible read mapping tool that provides explicit guarantees on results.
Affiliation(s)
- Aleksandr Morgulis
  - National Center for Biotechnology Information, National Library of Medicine, 8600 Rockville Pike Bethesda, MD 20894, USA
- Richa Agarwala
  - National Center for Biotechnology Information, National Library of Medicine, 8600 Rockville Pike Bethesda, MD 20894, USA
10
O'Neill K, Brocks D, Hammell MG. Mobile genomics: tools and techniques for tackling transposons. Philos Trans R Soc Lond B Biol Sci 2020; 375:20190345. [PMID: 32075565 PMCID: PMC7061981 DOI: 10.1098/rstb.2019.0345] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Accepted: 11/20/2019] [Indexed: 12/22/2022] Open
Abstract
Next-generation sequencing approaches have fundamentally changed the types of questions that can be asked about gene function and regulation. With the goal of approaching truly genome-wide quantifications of all the interaction partners and downstream effects of particular genes, these quantitative assays have allowed for an unprecedented level of detail in exploring biological interactions. However, many challenges remain in our ability to accurately describe and quantify the interactions that take place in those hard to reach and extremely repetitive regions of our genome comprised mostly of transposable elements (TEs). Tools dedicated to TE-derived sequences have lagged behind, making the inclusion of these sequences in genome-wide analyses difficult. Recent improvements, both computational and experimental, allow for the better inclusion of TE sequences in genomic assays and a renewed appreciation for the importance of TE biology. This review will discuss the recent improvements that have been made in the computational analysis of TE-derived sequences as well as the areas where such analysis still proves difficult. This article is part of a discussion meeting issue 'Crossroads between transposons and gene regulation'.
Affiliation(s)
- Kathryn O'Neill
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- David Brocks
  - Department of Computer Science and Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel
- Molly Gale Hammell
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
11
Lin HN, Hsu WL. GSAlign: an efficient sequence alignment tool for intra-species genomes. BMC Genomics 2020; 21:182. [PMID: 32093618 PMCID: PMC7041101 DOI: 10.1186/s12864-020-6569-1] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 09/27/2019] [Accepted: 02/10/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Personal genomics and comparative genomics are becoming more important in clinical practice and genome research. Both fields require sequence alignment to discover sequence conservation and variation. Though many methods have been developed, some are designed for small genome comparison and others are not efficient for large genome comparison. Moreover, most existing genome comparison tools have not had the correctness of their sequence alignments evaluated systematically. A wrong sequence alignment would produce false sequence variants. RESULTS: In this study, we present GSAlign, an efficient sequence alignment tool for intra-species genomes that handles large genome sequence alignment efficiently and identifies sequence variants from the alignment result. We estimate performance by measuring the correctness of predicted sequence variations. The experimental results demonstrate that GSAlign is not only faster than most existing state-of-the-art methods but also identifies sequence variants with high accuracy. CONCLUSIONS: As more genome sequences become available, the demand for genome comparison is increasing. Therefore, an efficient and robust algorithm is most desirable. We believe GSAlign can be a useful tool. It exhibits ultra-fast alignment as well as high accuracy and sensitivity for detecting sequence variations.
Affiliation(s)
- Hsin-Nan Lin
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Wen-Lian Hsu
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
12
Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2019; 34:3094-3100. [PMID: 29750242 DOI: 10.1093/bioinformatics/bty191] [Citation(s) in RCA: 7379] [Impact Index Per Article: 1229.8] [Received: 01/01/2018] [Accepted: 05/04/2018] [Indexed: 12/30/2022] Open
Abstract
Motivation Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb in average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs are unable or inefficient to process such data at scale, which presses for the development of new alignment algorithms. Results Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at error rate ∼15%, full-length noisy Direct RNA or cDNA reads and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment. Availability and implementation https://github.com/lh3/minimap2. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Heng Li
  - Department of Medical Population Genetics Program, Broad Institute, Cambridge, MA, USA
13
Sedlazeck FJ, Lee H, Darby CA, Schatz MC. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat Rev Genet 2019; 19:329-346. [PMID: 29599501 DOI: 10.1038/s41576-018-0003-4] [Citation(s) in RCA: 320] [Impact Index Per Article: 53.3] [Indexed: 01/11/2023]
Abstract
Several new genomics technologies have become available that offer long-read sequencing or long-range mapping with higher throughput and higher resolution analysis than ever before. These long-range technologies are rapidly advancing the field with improved reference genomes, more comprehensive variant identification and more complete views of transcriptomes and epigenomes. However, they also require new bioinformatics approaches to take full advantage of their unique characteristics while overcoming their complex errors and modalities. Here, we discuss several of the most important applications of the new technologies, focusing on both the currently available bioinformatics tools and opportunities for future research.
Collapse
Affiliation(s)
- Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Hayan Lee
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Charlotte A Darby
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
|
14
|
Deorowicz S, Debudaj-Grabysz A, Gudyś A, Grabowski S. Whisper: read sorting allows robust mapping of DNA sequencing data. Bioinformatics 2019; 35:2043-2050. [PMID: 30407485 DOI: 10.1093/bioinformatics/bty927] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 12/28/2017] [Revised: 10/16/2018] [Accepted: 11/06/2018] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. The reduction of sequencing costs implies a need for algorithms able to process increasing amounts of generated data in reasonable time. RESULTS: We present Whisper, an accurate and high-performance mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism, together with storing temporary data on disk, yields superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known BWA-MEM and Bowtie2 tools at a comparable accuracy, validated in a variant calling pipeline. AVAILABILITY AND IMPLEMENTATION: Whisper is available for free from https://github.com/refresh-bio/Whisper or http://sun.aei.polsl.pl/REFRESH/Whisper/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
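The idea named in this abstract (sort the reads, then look them up in suffix arrays built for the reference and its reverse complement) can be illustrated with a toy sketch. This is not Whisper's implementation: it handles exact matches only, builds the suffix arrays naively, and all function names (`build_suffix_array`, `find_exact`, `map_sorted_reads`) are hypothetical. Sorting is used here only to skip identical neighbouring reads, which is an assumption about one benefit of read sorting.

```python
# Toy illustration only: exact matching of sorted reads against suffix arrays
# of a reference and its reverse complement. Names are hypothetical.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_suffix_array(text):
    """Naive suffix array construction; adequate for a tiny reference only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, sa, pattern):
    """Return sorted start positions of exact occurrences of pattern in text."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    # Binary search for the first suffix whose m-character prefix is >= pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Matching suffixes are contiguous in the suffix array.
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


def map_sorted_reads(reference, reads):
    """Map reads in sorted order; identical neighbours are looked up once."""
    rc_reference = revcomp(reference)
    sa_fwd = build_suffix_array(reference)
    sa_rev = build_suffix_array(rc_reference)
    results = {}
    previous = None
    for read in sorted(reads):
        if read == previous:
            continue  # duplicate of the read just processed
        hits = [("+", p) for p in find_exact(reference, sa_fwd, read)]
        # Convert reverse-complement coordinates back to forward coordinates.
        hits += [("-", len(reference) - p - len(read))
                 for p in find_exact(rc_reference, sa_rev, read)]
        results[read] = hits
        previous = read
    return results
```

For the toy reference `ACGTTGCAAC`, calling `map_sorted_reads("ACGTTGCAAC", ["GTTG", "AAAA", "GTTG"])` looks up `GTTG` once and reports a forward hit plus a reverse-strand hit, while `AAAA` has no hits.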
Affiliation(s)
- Sebastian Deorowicz
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice, PL, Poland
| | - Agnieszka Debudaj-Grabysz
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice, PL, Poland
| | - Adam Gudyś
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice, PL, Poland
| | - Szymon Grabowski
- Institute of Applied Computer Science, Faculty of Electrical, Electronic, Computer and Control Engineering, Lodz University of Technology, Stefanowskiego 18/22, Łódź, PL, Poland
|
15
|
Yang LA, Chang YJ, Chen SH, Lin CY, Ho JM. SQUAT: a Sequencing Quality Assessment Tool for data quality assessments of genome assemblies. BMC Genomics 2019; 19:238. [PMID: 30999844 PMCID: PMC7402383 DOI: 10.1186/s12864-019-5445-3] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 08/07/2018] [Accepted: 01/10/2019] [Indexed: 01/03/2023] Open
Abstract
Background: With the rapid increase in genome sequencing projects for non-model organisms, numerous genome assemblies are currently in progress or available as drafts, but not made available as satisfactory, usable genomes. Data quality assessment of genome assemblies is gaining importance not only for people who perform the assembly/re-assembly processes, but also for those who attempt to use assemblies as maps in downstream analyses. Recent studies of the quality control and quality evaluation/assessment of genome assemblies have focused on either quality control of reads before assembly or evaluation of the assemblies with respect to their contiguity and correctness. However, correctness assessment depends on a reference and is not applicable to de novo assembly projects. Hence, methods that provide both pre-assembly and post-assembly quality assessment reports for examining the quality/correctness of de novo assemblies and the input reads are worth studying. Results: We present SQUAT, an efficient tool for both pre-assembly and post-assembly quality assessment of de novo genome assemblies. The pre-assembly module of SQUAT computes quality statistics of reads and presents the analysis in a well-designed interface to visualize the distribution of high- and poor-quality reads in a portable HTML report. The post-assembly module of SQUAT provides read mapping analytics in an HTML format. We categorized reads into several groups, including uniquely mapped, multiply mapped, and unmapped reads; uniquely mapped reads were further categorized into perfectly matched, with substitutions, containing clips, and others. We carefully divided the poorly mapped (PM) reads into several groups to prevent the underestimation of unmapped reads; indeed, a high PM% would be a sign of a poor assembly that requires researchers’ attention for further examination or improvements before using the assembly.
Finally, we evaluate SQUAT with six datasets, including the genome assemblies for eel, worm, mushroom, and three bacteria. The results show that SQUAT reports provide useful information with details for assessing the quality of assemblies and reads. Availability The SQUAT software with links to both its docker image and the on-line manual is freely available at https://github.com/luke831215/SQUAT. Electronic supplementary material The online version of this article (10.1186/s12864-019-5445-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Li-An Yang
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Yu-Jung Chang
- Institute of Information Science, Academia Sinica, Taipei, Taiwan.
| | - Shu-Hwa Chen
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Chung-Yen Lin
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Jan-Ming Ho
- Institute of Information Science, Academia Sinica, Taipei, Taiwan; Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
|
16
|
Gamaarachchi H, Parameswaran S, Smith MA. Featherweight long read alignment using partitioned reference indexes. Sci Rep 2019; 9:4318. [PMID: 30867495 PMCID: PMC6416333 DOI: 10.1038/s41598-019-40739-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 08/20/2018] [Accepted: 02/22/2019] [Indexed: 02/06/2023] Open
Abstract
The advent of Nanopore sequencing has realised portable genomic research and applications. However, state of the art long read aligners and large reference genomes are not compatible with most mobile computing devices due to their high memory requirements. We show how memory requirements can be reduced through parameter optimisation and reference genome partitioning, but highlight the associated limitations and caveats of these approaches. We then demonstrate how these issues can be overcome through an appropriate merging technique. We incorporated multi-index merging into the Minimap2 aligner and demonstrate that long read alignment to the human genome can be performed on a system with 2 GB RAM with negligible impact on accuracy.
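As a rough illustration of why merging matters when a reference is split into partitions, the sketch below maps a read against each partition separately and then merges the per-partition candidates into one ranked list. It is a toy under stated assumptions, not the multi-index merging added to Minimap2: the scoring is a simple ungapped match count, and the names `best_hits_in_partition` and `merge_partition_hits` are hypothetical.

```python
# Toy illustration only: per-partition mapping followed by a merge step.
# Scoring (ungapped match count) and all names are assumptions.


def best_hits_in_partition(partition, read):
    """Return the best ungapped placement per contig as (score, name, pos).

    partition maps contig names to sequences; only this partition needs to be
    held in memory while it is being searched.
    """
    hits = []
    for name, seq in partition.items():
        best = None
        for pos in range(len(seq) - len(read) + 1):
            window = seq[pos:pos + len(read)]
            score = sum(a == b for a, b in zip(read, window))
            if best is None or score > best[0]:
                best = (score, name, pos)
        if best is not None:
            hits.append(best)
    return hits


def merge_partition_hits(per_partition_hits):
    """Merge candidate lists from all partitions, best score first."""
    merged = [hit for hits in per_partition_hits for hit in hits]
    merged.sort(key=lambda h: (-h[0], h[1], h[2]))
    return merged


partitions = [{"chr1": "ACGTACGTAA"}, {"chr2": "TTTTGGGGCC"}]
read = "GGGG"
merged = merge_partition_hits(
    [best_hits_in_partition(p, read) for p in partitions]
)
```

Without the merge, the best hit reported from the first partition alone (`chr1`) would look like the read's placement even though `chr2` contains a perfect match; ranking candidates across partitions recovers the same ordering as searching a single unpartitioned index under this toy scoring.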
Affiliation(s)
- Hasindu Gamaarachchi
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, 370 Victoria St, Darlinghurst, NSW, Australia; School of Computer Science and Engineering, UNSW Sydney, Kensington, NSW, Australia
| | - Sri Parameswaran
- School of Computer Science and Engineering, UNSW Sydney, Kensington, NSW, Australia
| | - Martin A Smith
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, 370 Victoria St, Darlinghurst, NSW, Australia; St-Vincent's Clinical School, Faculty of Medicine, UNSW Sydney, Darlinghurst, NSW, Australia
|
17
|
Farashi S, Kryza T, Clements J, Batra J. Post-GWAS in prostate cancer: from genetic association to biological contribution. Nat Rev Cancer 2019; 19:46-59. [PMID: 30538273 DOI: 10.1038/s41568-018-0087-3] [Citation(s) in RCA: 69] [Impact Index Per Article: 11.5] [Indexed: 02/07/2023]
Abstract
Genome-wide association studies (GWAS) have been successful in deciphering the genetic component of predisposition to many human complex diseases including prostate cancer. Germline variants identified by GWAS progressively unravelled the substantial knowledge gap concerning prostate cancer heritability. With the beginning of the post-GWAS era, more and more studies reveal that, in addition to their value as risk markers, germline variants can exert active roles in prostate oncogenesis. Consequently, current research efforts focus on exploring the biological mechanisms underlying specific susceptibility loci known as causal variants by applying novel and precise analytical methods to available GWAS data. Results obtained from these post-GWAS analyses have highlighted the potential of exploiting prostate cancer risk-associated germline variants to identify new gene networks and signalling pathways involved in prostate tumorigenesis. In this Review, we describe the molecular basis of several important prostate cancer-causal variants with an emphasis on using post-GWAS analysis to gain insight into cancer aetiology. In addition to discussing the current status of post-GWAS studies, we also summarize the main molecular mechanisms of potential causal variants at prostate cancer risk loci and explore the major challenges in moving from association to functional studies and their implication in clinical translation.
Affiliation(s)
- Samaneh Farashi
- Cancer Program, School of Biomedical Sciences, Institute of Health and Biomedical Innovation, Queensland University of Technology, Brisbane, Queensland, Australia
- Australian Prostate Cancer Research Centre - Queensland, Queensland University of Technology, Translational Research Institute, Woolloongabba, Queensland, Australia
| | - Thomas Kryza
- Cancer Program, School of Biomedical Sciences, Institute of Health and Biomedical Innovation, Queensland University of Technology, Brisbane, Queensland, Australia
- Australian Prostate Cancer Research Centre - Queensland, Queensland University of Technology, Translational Research Institute, Woolloongabba, Queensland, Australia
| | - Judith Clements
- Cancer Program, School of Biomedical Sciences, Institute of Health and Biomedical Innovation, Queensland University of Technology, Brisbane, Queensland, Australia
- Australian Prostate Cancer Research Centre - Queensland, Queensland University of Technology, Translational Research Institute, Woolloongabba, Queensland, Australia
| | - Jyotsna Batra
- Cancer Program, School of Biomedical Sciences, Institute of Health and Biomedical Innovation, Queensland University of Technology, Brisbane, Queensland, Australia.
- Australian Prostate Cancer Research Centre - Queensland, Queensland University of Technology, Translational Research Institute, Woolloongabba, Queensland, Australia.
|
18
|
Lin HN, Hsu WL. DART: a fast and accurate RNA-seq mapper with a partitioning strategy. Bioinformatics 2018; 34:190-197. [PMID: 28968831 PMCID: PMC5860201 DOI: 10.1093/bioinformatics/btx558] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 05/02/2017] [Revised: 08/29/2017] [Accepted: 09/03/2017] [Indexed: 01/13/2023] Open
Abstract
MOTIVATION: In recent years, massively parallel cDNA sequencing (RNA-Seq) technologies have become a powerful tool, providing high-resolution measurement of expression and high sensitivity in detecting low-abundance transcripts. However, RNA-seq data require a huge amount of computational effort. The most fundamental and critical step is to align each sequence fragment against the reference genome. Various de novo spliced RNA aligners have been developed in recent years. Though these aligners can handle spliced alignment and detect splice junctions, some challenges remain to be solved. With the advances in sequencing technologies and the ongoing collection of sequencing data in the ENCODE project, more efficient alignment algorithms are in high demand. Most read mappers follow the conventional seed-and-extend strategy to deal with inexact matches for sequence alignment. However, the extension is much more time consuming than the seeding step. RESULTS: We propose a novel RNA-seq de novo mapping algorithm, called DART, which adopts a partitioning strategy to avoid the extension step. The experimental results on synthetic datasets and real NGS datasets showed that DART is a highly efficient aligner that yields the highest or comparable sensitivity and accuracy compared to most state-of-the-art aligners and, more importantly, spends the least amount of time among the selected aligners. AVAILABILITY AND IMPLEMENTATION: https://github.com/hsinnan75/DART. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hsin-Nan Lin
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
|
19
|
Lee H, Lee KW, Lee T, Park D, Chung J, Lee C, Park WY, Son DS. Performance evaluation method for read mapping tool in clinical panel sequencing. Genes Genomics 2017; 40:189-197. [PMID: 29568413 PMCID: PMC5846869 DOI: 10.1007/s13258-017-0621-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 08/31/2017] [Accepted: 10/11/2017] [Indexed: 01/28/2023]
Abstract
Alongside the rapid advancement of Next-Generation Sequencing (NGS) technology, clinical panel sequencing is being used increasingly in clinical studies and tests. However, the tools used in NGS data analysis have not been comparatively evaluated for their performance on panel sequencing. This study aimed to evaluate the tools used in the alignment process, the first procedure in bioinformatics analysis, by comparing tools that have been widely used with ones that have been introduced recently. Using accumulated panel sequencing data, lists of detected variants were cataloged and inserted into simulated reads produced from the reference genome (hg19). The numbers of unmapped and misaligned reads, the mapping quality distribution, and the runtime were measured as criteria for comparison. The most widely used tools, Bowtie2 and BWA-MEM, each performed well, with AUCs of 0.9984 and 0.9970, respectively. Kart, which had a superior runtime and fewer misaligned reads, also achieved a high AUC (0.9723). Such a method for selecting and optimizing tools appropriate for panel sequencing can be applied in fields requiring error minimization, such as clinical applications and liquid biopsy studies.
Affiliation(s)
- Hojun Lee
- Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea
| | - Ki-Wook Lee
- Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea; Department of Digital Health, SAIHST, Sungkyunkwan University, Seoul, 06351 South Korea
| | - Taeseob Lee
- Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea
| | - Donghyun Park
- Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea
| | - Jongsuk Chung
- Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea; Department of Molecular Cell Biology, Sungkyunkwan University School of Medicine, Suwon, 16419 South Korea
| | - Chung Lee
- Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea; Department of Health Sciences and Technology, SAIHST, Sungkyunkwan University, Seoul, 06351 South Korea
| | - Woong-Yang Park
- Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea; Department of Molecular Cell Biology, Sungkyunkwan University School of Medicine, Suwon, 16419 South Korea; Department of Health Sciences and Technology, SAIHST, Sungkyunkwan University, Seoul, 06351 South Korea
| | - Dae-Soon Son
- Samsung Genome Institute (SGI), Samsung Medical Center (SMC), Seoul, 06351 South Korea
|