1
|
Guo L, Huo H. An efficient Burrows-Wheeler transform-based aligner for short read mapping. Comput Biol Chem 2024; 110:108050. [PMID: 38447272 DOI: 10.1016/j.compbiolchem.2024.108050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2023] [Revised: 02/15/2024] [Accepted: 03/01/2024] [Indexed: 03/08/2024]
Abstract
Read mapping as the foundation of computational biology is a bottleneck task under the pressure of sequencing throughput explodes. In this work, we present an efficient Burrows-Wheeler transform-based aligner for next-generation sequencing (NGS) short read. Firstly, we propose a difference-aware classification strategy to assign specific reads to the computationally more economical search modes, and present some acceleration techniques, such as a seed pruning method based on the property of maximum coverage interval to reduce the redundant locating for candidate regions, redesigning LF calculation to support fast query. Then, we propose a heuristic verification to determine the best mapping from amounts of flanking sequences. Incorporated with low-distortion string embedding, most dissimilar sequences are filtered out cheaply, and the highly similar sequences left are just right for the wavefront alignment algorithm's preference. We provide a full spectrum benchmark with different read lengths, the results show that our method is 1.3-1.4 times faster than state-of-the-art Burrows-Wheeler transform-based methods (including bowtie2, bwa-MEM, and hisat2) over 101bp reads and has a speedup with 1.5-13 times faster over 750bp to 1000bp reads; meanwhile, our method has comparable memory usage and accuracy. However, hash-based methods (including Strobealign, Minimap2, and Accel-Align) are significantly faster, in part because Burrows-Wheeler transform-based methods calculate on the compressed space. The source code is available: https://github.com/Lilu-guo/Effaln.
Collapse
Affiliation(s)
- Lilu Guo
- Department of Computer Science, Xidian University, Xi'an, Shaanxi, 710071, China
| | - Hongwei Huo
- Department of Computer Science, Xidian University, Xi'an, Shaanxi, 710071, China.
| |
Collapse
|
2
|
Avila Cartes J, Bonizzoni P, Ciccolella S, Della Vedova G, Denti L, Didelot X, Monti DC, Pirola Y. RecGraph: recombination-aware alignment of sequences to variation graphs. BIOINFORMATICS (OXFORD, ENGLAND) 2024; 40:btae292. [PMID: 38676570 DOI: 10.1093/bioinformatics/btae292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/12/2023] [Revised: 02/23/2024] [Accepted: 04/25/2024] [Indexed: 04/29/2024]
Abstract
MOTIVATION Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools that are developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, but this fact is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species. However, the concept of sequence-to-graph alignment that captures the presence of recombinations has not been previously investigated. RESULTS In this paper, we present the extension of the notion of sequence-to-graph alignment to a variation graph that incorporates a recombination, so that the latter are explicitly represented and evaluated in an alignment. Moreover, we present a dynamic programming approach for the special case where there is at most a recombination-we implement this case as RecGraph. From a modelling point of view, a recombination corresponds to identifying a new path of the variation graph, where the new arc is composed of two halves, each extracted from an original path, possibly joined by a new arc. Our experiments show that RecGraph accurately aligns simulated recombinant bacterial sequences that have at most a recombination, providing evidence for the presence of recombination events. AVAILABILITY AND IMPLEMENTATION Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.
Collapse
Affiliation(s)
- Jorge Avila Cartes
- Department of Informatics, Systems and Communication, University of Milano - Bicocca. Viale Sarca 336, Milano 20126, Italy
| | - Paola Bonizzoni
- Department of Informatics, Systems and Communication, University of Milano - Bicocca. Viale Sarca 336, Milano 20126, Italy
| | - Simone Ciccolella
- Department of Informatics, Systems and Communication, University of Milano - Bicocca. Viale Sarca 336, Milano 20126, Italy
| | - Gianluca Della Vedova
- Department of Informatics, Systems and Communication, University of Milano - Bicocca. Viale Sarca 336, Milano 20126, Italy
| | - Luca Denti
- Department of Informatics, Systems and Communication, University of Milano - Bicocca. Viale Sarca 336, Milano 20126, Italy
| | - Xavier Didelot
- Department of Statistics and School of Life Sciences, University of Warwick, Coventry CV4 7AL, United Kingdom
| | - Davide Cesare Monti
- Department of Informatics, Systems and Communication, University of Milano - Bicocca. Viale Sarca 336, Milano 20126, Italy
| | - Yuri Pirola
- Department of Informatics, Systems and Communication, University of Milano - Bicocca. Viale Sarca 336, Milano 20126, Italy
| |
Collapse
|
3
|
Duchen D, Clipman SJ, Vergara C, Thio CL, Thomas DL, Duggal P, Wojcik GL. A hepatitis B virus (HBV) sequence variation graph improves alignment and sample-specific consensus sequence construction. PLoS One 2024; 19:e0301069. [PMID: 38669259 PMCID: PMC11051683 DOI: 10.1371/journal.pone.0301069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2023] [Accepted: 03/09/2024] [Indexed: 04/28/2024] Open
Abstract
Nearly 300 million individuals live with chronic hepatitis B virus (HBV) infection (CHB), for which no curative therapy is available. As viral diversity is associated with pathogenesis and immunological control of infection, improved methods to characterize this diversity could aid drug development efforts. Conventionally, viral sequencing data are mapped/aligned to a reference genome, and only the aligned sequences are retained for analysis. Thus, reference selection is critical, yet selecting the most representative reference a priori remains difficult. We investigate an alternative pangenome approach which can combine multiple reference sequences into a graph which can be used during alignment. Using simulated short-read sequencing data generated from publicly available HBV genomes and real sequencing data from an individual living with CHB, we demonstrate alignment to a phylogenetically representative 'genome graph' can improve alignment, avoid issues of reference ambiguity, and facilitate the construction of sample-specific consensus sequences more genetically similar to the individual's infection. Graph-based methods can, therefore, improve efforts to characterize the genetics of viral pathogens, including HBV, and have broader implications in host-pathogen research.
Collapse
Affiliation(s)
- Dylan Duchen
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States of America
- Center for Biomedical Data Science, Yale School of Medicine, New Haven, CT, United States of America
| | - Steven J. Clipman
- Division of Infectious Diseases, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America
| | - Candelaria Vergara
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States of America
| | - Chloe L. Thio
- Division of Infectious Diseases, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America
| | - David L. Thomas
- Division of Infectious Diseases, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America
| | - Priya Duggal
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States of America
| | - Genevieve L. Wojcik
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States of America
| |
Collapse
|
4
|
Schloissnig S, Pani S, Rodriguez-Martin B, Ebler J, Hain C, Tsapalou V, Söylev A, Hüther P, Ashraf H, Prodanov T, Asparuhova M, Hunt S, Rausch T, Marschall T, Korbel JO. Long-read sequencing and structural variant characterization in 1,019 samples from the 1000 Genomes Project. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.04.18.590093. [PMID: 38659906 PMCID: PMC11042266 DOI: 10.1101/2024.04.18.590093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/26/2024]
Abstract
Structural variants (SVs) contribute significantly to human genetic diversity and disease 1-4 . Previously, SVs have remained incompletely resolved by population genomics, with short-read sequencing facing limitations in capturing the whole spectrum of SVs at nucleotide resolution 5-7 . Here we leveraged nanopore sequencing 8 to construct an intermediate coverage resource of 1,019 long-read genomes sampled within 26 human populations from the 1000 Genomes Project. By integrating linear and graph-based approaches for SV analysis via pangenome graph-augmentation, we uncover 167,291 sequence-resolved SVs in these samples, considerably advancing SV characterization compared to population-wide short-read sequencing studies 3,4 . Our analysis details diverse SV classes-deletions, duplications, insertions, and inversions-at population-scale. LINE-1 and SVA retrotransposition activities frequently mediate transductions 9,10 of unique sequences, with both mobile element classes transducing sequences at either the 3'- or 5'-end, depending on the source element locus. Furthermore, analyses of SV breakpoint junctions suggest a continuum of homology-mediated rearrangement processes are integral to SV formation, and highlight evidence for SV recurrence involving repeat sequences. Our open-access dataset underscores the transformative impact of long-read sequencing in advancing the characterisation of polymorphic genomic architectures, and provides a resource for guiding variant prioritisation in future long-read sequencing-based disease studies.
Collapse
|
5
|
Gustafson JA, Gibson SB, Damaraju N, Zalusky MPG, Hoekzema K, Twesigomwe D, Yang L, Snead AA, Richmond PA, De Coster W, Olson ND, Guarracino A, Li Q, Miller AL, Goffena J, Anderson Z, Storz SHR, Ward SA, Sinha M, Gonzaga-Jauregui C, Clarke WE, Basile AO, Corvelo A, Reeves C, Helland A, Musunuri RL, Revsine M, Patterson KE, Paschal CR, Zakarian C, Goodwin S, Jensen TD, Robb E, McCombie WR, Sedlazeck FJ, Zook JM, Montgomery SB, Garrison E, Kolmogorov M, Schatz MC, McLaughlin RN, Dashnow H, Zody MC, Loose M, Jain M, Eichler EE, Miller DE. Nanopore sequencing of 1000 Genomes Project samples to build a comprehensive catalog of human genetic variation. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.03.05.24303792. [PMID: 38496498 PMCID: PMC10942501 DOI: 10.1101/2024.03.05.24303792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/19/2024]
Abstract
Less than half of individuals with a suspected Mendelian condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control datasets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project ONT Sequencing Consortium aims to generate LRS data from at least 800 of the 1000 Genomes Project samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37x and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
Collapse
Affiliation(s)
- Jonas A. Gustafson
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA
| | - Sophia B. Gibson
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Nikhita Damaraju
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Institute for Public Health Genetics, University of Washington, Seattle, WA, USA
| | - Miranda PG Zalusky
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
| | - Kendra Hoekzema
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - David Twesigomwe
- Sydney Brenner Institute for Molecular Bioscience, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
| | - Lei Yang
- Pacific Northwest Research Institute, Seattle, WA, USA
| | | | | | - Wouter De Coster
- Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, VIB, Antwerp, Belgium
- Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
| | - Nathan D. Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Human Technopole, Milan, Italy
| | - Qiuhui Li
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Angela L. Miller
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
| | - Joy Goffena
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
| | - Zachery Anderson
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
| | - Sophie HR Storz
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
| | - Sydney A. Ward
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
| | - Maisha Sinha
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
| | - Claudia Gonzaga-Jauregui
- International Laboratory for Human Genome Research, Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México
| | - Wayne E. Clarke
- New York Genome Center, New York, NY, USA
- Outlier Informatics Inc., Saskatoon, SK, Canada
| | | | | | | | | | | | - Mahler Revsine
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | | | - Cate R. Paschal
- Department of Laboratories, Seattle Children’s Hospital, Seattle, WA, USA
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA, USA
| | - Christina Zakarian
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Sara Goodwin
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | | | - Esther Robb
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | | | | | | | | | - Fritz J. Sedlazeck
- Human Genome Sequencing Center Baylor College of Medicine, Houston, TX, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Justin M. Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | | | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Mikhail Kolmogorov
- Cancer Data Science Laboratory, National Cancer Institute, NIH, Bethesda, MD, USA
| | - Michael C. Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Richard N. McLaughlin
- Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA
- Pacific Northwest Research Institute, Seattle, WA, USA
| | - Harriet Dashnow
- Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
| | | | - Matt Loose
- Deep Seq, School of Life Sciences, University of Nottingham, Nottingham, England
| | - Miten Jain
- Department of Bioengineering, Department of Physics, Khoury College of Computer Sciences, Northeastern University, Boston, MA
| | - Evan E. Eichler
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Danny E. Miller
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA, USA
- Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, USA
| |
Collapse
|
6
|
de Oliveira Martins L, Mather AE, Page AJ. Scalable neighbour search and alignment with uvaia. PeerJ 2024; 12:e16890. [PMID: 38464752 PMCID: PMC10924453 DOI: 10.7717/peerj.16890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2023] [Accepted: 01/15/2024] [Indexed: 03/12/2024] Open
Abstract
Despite millions of SARS-CoV-2 genomes being sequenced and shared globally, manipulating such data sets is still challenging, especially selecting sequences for focused phylogenetic analysis. We present a novel method, uvaia, which is based on partial and exact sequence similarity for quickly extracting database sequences similar to query sequences of interest. Many SARS-CoV-2 phylogenetic analyses rely on very low numbers of ambiguous sites as a measure of quality since ambiguous sites do not contribute to single nucleotide polymorphism (SNP) differences. Uvaia overcomes this limitation by using measures of sequence similarity which consider partially ambiguous sites, allowing for more ambiguous sequences to be included in the analysis if needed. Such fine-grained definition of similarity allows not only for better phylogenetic analyses, but could also lead to improved classification and biogeographical inferences. Uvaia works natively with compressed files, can use multiple cores and efficiently utilises memory, being able to analyse large data sets on a standard desktop.
Collapse
Affiliation(s)
| | - Alison E. Mather
- Quadram Institute Bioscience, Norwich, United Kingdom
- University of East Anglia, Norwich, United Kingdom
| | | |
Collapse
|
7
|
Groot Koerkamp R, Ivanov P. Exact global alignment using A* with chaining seed heuristic and match pruning. Bioinformatics 2024; 40:btae032. [PMID: 38265119 PMCID: PMC10932610 DOI: 10.1093/bioinformatics/btae032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2023] [Revised: 11/14/2023] [Accepted: 01/20/2024] [Indexed: 01/25/2024] Open
Abstract
MOTIVATION Sequence alignment has been at the core of computational biology for half a century. Still, it is an open problem to design a practical algorithm for exact alignment of a pair of related sequences in linear-like time. RESULTS We solve exact global pairwise alignment with respect to edit distance by using the A* shortest path algorithm. In order to efficiently align long sequences with high divergence, we extend the recently proposed seed heuristic with match chaining, gap costs, and inexact matches. We additionally integrate the novel match pruning technique and diagonal transition to improve the A* search. We prove the correctness of our algorithm, implement it in the A*PA aligner, and justify our extensions intuitively and empirically. On random sequences of divergence d=4% and length n, the empirical runtime of A*PA scales near-linearly with length (best fit n1.06, n≤107 bp). A similar scaling remains up to d=12% (best fit n1.24, n≤107 bp). For n=107 bp and d=4%, A*PA reaches >500× speedup compared to the leading exact aligners Edlib and BiWFA. The performance of A*PA is highly influenced by long gaps. On long (n>500kb) ONT reads of a human sample it efficiently aligns sequences with d<10%, leading to 3× median speedup compared to Edlib and BiWFA. When the sequences come from different human samples, A*PA performs 1.7× faster than Edlib and BiWFA. AVAILABILITY AND IMPLEMENTATION github.com/RagnarGrootKoerkamp/astar-pairwise-aligner.
Collapse
Affiliation(s)
- Ragnar Groot Koerkamp
- Department of Computer Science, ETH Zurich, Rämistrasse 101, Zurich 8092, Switzerland
| | - Pesho Ivanov
- Department of Computer Science, ETH Zurich, Rämistrasse 101, Zurich 8092, Switzerland
| |
Collapse
|
8
|
Song B, Buckler ES, Stitzer MC. New whole-genome alignment tools are needed for tapping into plant diversity. TRENDS IN PLANT SCIENCE 2024; 29:355-369. [PMID: 37749022 DOI: 10.1016/j.tplants.2023.08.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/24/2023] [Revised: 07/19/2023] [Accepted: 08/23/2023] [Indexed: 09/27/2023]
Abstract
Genome alignment is one of the most foundational methods for genome sequence studies. With rapid advances in sequencing and assembly technologies, these newly assembled genomes present challenges for alignment tools to meet the increased complexity and scale. Plant genome alignment is technologically challenging because of frequent whole-genome duplications (WGDs) as well as chromosome rearrangements and fractionation, high nucleotide diversity, widespread structural variation, and high transposable element (TE) activity causing large proportions of repeat elements. We summarize classical pairwise and multiple genome alignment (MGA) methods, and highlight techniques that are widely used or are being developed by the plant research community. We also outline the remaining challenges for precise genome alignment and the interpretation of alignment results in plants.
Collapse
Affiliation(s)
- Baoxing Song
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agriculture Sciences in Weifang, Weifang, Shandong 261325, China; Key Laboratory of Maize Biology and Genetic Breeding in Arid Area of Northwest Region of the Ministry of Agriculture, College of Agronomy, Northwest A&F University, Yangling, Shaanxi 712100, China.
| | - Edward S Buckler
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA; Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853, USA; Agricultural Research Service, United States Department of Agriculture, Ithaca, NY 14853, USA
| | - Michelle C Stitzer
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA; Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA.
| |
Collapse
|
9
|
Holt JM, Saunders CT, Rowell WJ, Kronenberg Z, Wenger AM, Eberle M. HiPhase: jointly phasing small, structural, and tandem repeat variants from HiFi sequencing. Bioinformatics 2024; 40:btae042. [PMID: 38269623 PMCID: PMC10868326 DOI: 10.1093/bioinformatics/btae042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2023] [Revised: 12/13/2023] [Accepted: 01/22/2024] [Indexed: 01/26/2024] Open
Abstract
MOTIVATION In diploid organisms, phasing is the problem of assigning the alleles at heterozygous variants to one of two haplotypes. Reads from PacBio HiFi sequencing provide long, accurate observations that can be used as the basis for both calling and phasing variants. HiFi reads also excel at calling larger classes of variation, such as structural or tandem repeat variants. However, current phasing tools typically only phase small variants, leaving larger variants unphased. RESULTS We developed HiPhase, a tool that jointly phases SNVs, indels, structural, and tandem repeat variants. The main benefits of HiPhase are (i) dual mode allele assignment for detecting large variants, (ii) a novel application of the A*-algorithm to phasing, and (iii) logic allowing phase blocks to span breaks caused by alignment issues around reference gaps and homozygous deletions. In our assessment, HiPhase produced an average phase block NG50 of 480 kb with 929 switchflip errors and fully phased 93.8% of genes, improving over the current state of the art. Additionally, HiPhase jointly phases SNVs, indels, structural, and tandem repeat variants and includes innate multi-threading, statistics gathering, and concurrent phased alignment output generation. AVAILABILITY AND IMPLEMENTATION HiPhase is available as source code and a pre-compiled Linux binary with a user guide at https://github.com/PacificBiosciences/HiPhase.
Collapse
Affiliation(s)
- James M Holt
- Computational Biology, PacBio, 1305 O’Brien Drive, Menlo Park, CA 94025, United States
| | | | - William J Rowell
- Computational Biology, PacBio, 1305 O’Brien Drive, Menlo Park, CA 94025, United States
| | - Zev Kronenberg
- Computational Biology, PacBio, 1305 O’Brien Drive, Menlo Park, CA 94025, United States
| | - Aaron M Wenger
- Computational Biology, PacBio, 1305 O’Brien Drive, Menlo Park, CA 94025, United States
| | - Michael Eberle
- Computational Biology, PacBio, 1305 O’Brien Drive, Menlo Park, CA 94025, United States
| |
Collapse
|
10
|
Benoit G, Raguideau S, James R, Phillippy AM, Chikhi R, Quince C. High-quality metagenome assembly from long accurate reads with metaMDBG. Nat Biotechnol 2024:10.1038/s41587-023-01983-6. [PMID: 38168989 DOI: 10.1038/s41587-023-01983-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2023] [Accepted: 09/08/2023] [Indexed: 01/05/2024]
Abstract
We introduce metaMDBG, a metagenomics assembler for PacBio HiFi reads. MetaMDBG combines a de Bruijn graph assembly in a minimizer space with an iterative assembly over sequences of minimizers to address variations in genome coverage depth and an abundance-based filtering strategy to simplify strain complexity. For complex communities, we obtained up to twice as many high-quality circularized prokaryotic metagenome-assembled genomes as existing methods and had better recovery of viruses and plasmids.
Collapse
Affiliation(s)
- Gaëtan Benoit
- Organisms and Ecosystems, Earlham Institute, Norwich, UK
| | | | - Robert James
- Gut Microbes and Health, Quadram Institute, Norwich, UK
| | - Adam M Phillippy
- Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD, USA
| | - Rayan Chikhi
- Sequence Bioinformatics, Department of Computational Biology, Institut Pasteur, Paris, France
| | - Christopher Quince
- Organisms and Ecosystems, Earlham Institute, Norwich, UK.
- Gut Microbes and Health, Quadram Institute, Norwich, UK.
- School of Biological Sciences, University of East Anglia, Norwich, UK.
- Warwick Medical School, University of Warwick, Coventry, UK.
| |
Collapse
|
11
|
LoTempio J, Delot E, Vilain E. Benchmarking long-read genome sequence alignment tools for human genomics applications. PeerJ 2023; 11:e16515. [PMID: 38130927 PMCID: PMC10734412 DOI: 10.7717/peerj.16515] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2023] [Accepted: 11/02/2023] [Indexed: 12/23/2023] Open
Abstract
Background The utility of long-read genome sequencing platforms has been shown in many fields including whole genome assembly, metagenomics, and amplicon sequencing. Less clear is the applicability of long reads to reference-guided human genomics, which is the foundation of genomic medicine. Here, we benchmark available platform-agnostic alignment tools on datasets from nanopore and single-molecule real-time platforms to understand their suitability in producing a genome representation. Results For this study, we leveraged publicly-available data from sample NA12878 generated on Oxford Nanopore and sample NA24385 on Pacific Biosciences platforms. We employed state of the art sequence alignment tools including GraphMap2, long-read aligner (LRA), Minimap2, CoNvex Gap-cost alignMents for Long Reads (NGMLR), and Winnowmap2. Minimap2 and Winnowmap2 were computationally lightweight enough for use at scale, while GraphMap2 was not. NGMLR took a long time and required many resources, but produced alignments each time. LRA was fast, but only worked on Pacific Biosciences data. Each tool widely disagreed on which reads to leave unaligned, affecting the end genome coverage and the number of discoverable breakpoints. No alignment tool independently resolved all large structural variants (1,001-100,000 base pairs) present in the Database of Genome Variants (DGV) for sample NA12878 or the truthset for NA24385. Conclusions These results suggest a combined approach is needed for LRS alignments for human genomics. Specifically, leveraging alignments from three tools will be more effective in generating a complete picture of genomic variability. It should be best practice to use an analysis pipeline that generates alignments with both Minimap2 and Winnowmap2 as they are lightweight and yield different views of the genome. Depending on the question at hand, the data available, and the time constraints, NGMLR and LRA are good options for a third tool. If computational resources and time are not a factor for a given case or experiment, NGMLR will provide another view, and another chance to resolve a case. LRA, while fast, did not work on the nanopore data for our cluster, but PacBio results were promising in that those computations completed faster than Minimap2. Due to its significant burden on computational resources and slow run time, Graphmap2 is not an ideal tool for exploration of a whole human genome generated on a long-read sequencing platform.
Collapse
Affiliation(s)
- Jonathan LoTempio
- Institute for Clinical and Translational Science, University of California, Irvine, CA, United States of America
- International Research Laboratory (IRL2006) “Epigenetics, Data, Politics (EpiDaPo)”, Centre National de la Recherche Scientifique, Washington, DC, United States of America
| | - Emmanuele Delot
- Center for Genetic Medicine Research, Children’s National Hospital, Washington, DC, United States of America
- Department of Genomics and Precision Medicine, George Washington University, Washington, DC, United States of America
| | - Eric Vilain
- Institute for Clinical and Translational Science, University of California, Irvine, CA, United States of America
- International Research Laboratory (IRL2006) “Epigenetics, Data, Politics (EpiDaPo)”, Centre National de la Recherche Scientifique, Washington, DC, United States of America
| |
Collapse
|
12
|
Dunn T, Narayanasamy S. vcfdist: accurately benchmarking phased small variant calls in human genomes. Nat Commun 2023; 14:8149. [PMID: 38071244 PMCID: PMC10710436 DOI: 10.1038/s41467-023-43876-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2023] [Accepted: 11/22/2023] [Indexed: 12/18/2023] Open
Abstract
Accurately benchmarking small variant calling accuracy is critical for the continued improvement of human whole genome sequencing. In this work, we show that current variant calling evaluations are biased towards certain variant representations and may misrepresent the relative performance of different variant calling pipelines. We propose solutions, first exploring the affine gap parameter design space for complex variant representation and suggesting a standard. Next, we present our tool vcfdist and demonstrate the importance of enforcing local phasing for evaluation accuracy. We then introduce the notion of partial credit for mostly-correct calls and present an algorithm for clustering dependent variants. Lastly, we motivate using alignment distance metrics to supplement precision-recall curves for understanding variant calling performance. We evaluate the performance of 64 phased Truth Challenge V2 submissions and show that vcfdist improves measured insertion and deletion performance consistency across variant representations from R2 = 0.97243 for baseline vcfeval to 0.99996 for vcfdist.
Collapse
Affiliation(s)
- Tim Dunn
- Computer Science and Engineering, University of Michigan, 2260 Hayward Street, Ann Arbor, MI, 48109, USA.
| | - Satish Narayanasamy
- Computer Science and Engineering, University of Michigan, 2260 Hayward Street, Ann Arbor, MI, 48109, USA
| |
Collapse
|
13
|
Aguado-Puig Q, Doblas M, Matzoros C, Espinosa A, Moure JC, Marco-Sola S, Moreto M. WFA-GPU: gap-affine pairwise read-alignment using GPUs. Bioinformatics 2023; 39:btad701. [PMID: 37975878 PMCID: PMC10697739 DOI: 10.1093/bioinformatics/btad701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2023] [Revised: 11/09/2023] [Accepted: 11/16/2023] [Indexed: 11/19/2023] Open
Abstract
MOTIVATION Advances in genomics and sequencing technologies demand faster and more scalable analysis methods that can process longer sequences with higher accuracy. However, classical pairwise alignment methods, based on dynamic programming (DP), impose impractical computational requirements to align long and noisy sequences like those produced by PacBio and Nanopore technologies. The recently proposed wavefront alignment (WFA) algorithm paves the way for more efficient alignment tools, improving time and memory complexity over previous methods. However, high-performance computing (HPC) platforms require efficient parallel algorithms and tools to exploit the computing resources available on modern accelerator-based architectures. RESULTS This paper presents WFA-GPU, a GPU (graphics processing unit)-accelerated tool to compute exact gap-affine alignments based on the WFA algorithm. We present the algorithmic adaptations and performance optimizations that allow exploiting the massively parallel capabilities of modern GPU devices to accelerate the alignment computations. In particular, we propose a CPU-GPU co-design capable of performing inter-sequence and intra-sequence parallel sequence alignment, combining a succinct WFA-data representation with an efficient GPU implementation. As a result, we demonstrate that our implementation outperforms the original multi-threaded WFA implementation by up to 4.3× and up to 18.2× when using heuristic methods on long and noisy sequences. Compared to other state-of-the-art tools and libraries, the WFA-GPU is up to 29× faster than other GPU implementations and up to four orders of magnitude faster than other CPU implementations. Furthermore, WFA-GPU is the only GPU solution capable of correctly aligning long reads using a commodity GPU. AVAILABILITY AND IMPLEMENTATION WFA-GPU code and documentation are publicly available at https://github.com/quim0/WFA-GPU.
Collapse
Affiliation(s)
- Quim Aguado-Puig
- Departament d’Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona 08193, Spain
| | - Max Doblas
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
| | - Christos Matzoros
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
| | - Antonio Espinosa
- Departament d’Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona 08193, Spain
| | - Juan Carlos Moure
- Departament d’Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona 08193, Spain
| | - Santiago Marco-Sola
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
- Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
| | - Miquel Moreto
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
- Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
| |
Collapse
|
14
|
Rice ES, Alberdi A, Alfieri J, Athrey G, Balacco JR, Bardou P, Blackmon H, Charles M, Cheng HH, Fedrigo O, Fiddaman SR, Formenti G, Frantz LAF, Gilbert MTP, Hearn CJ, Jarvis ED, Klopp C, Marcos S, Mason AS, Velez-Irizarry D, Xu L, Warren WC. A pangenome graph reference of 30 chicken genomes allows genotyping of large and complex structural variants. BMC Biol 2023; 21:267. [PMID: 37993882 PMCID: PMC10664547 DOI: 10.1186/s12915-023-01758-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2023] [Accepted: 11/02/2023] [Indexed: 11/24/2023] Open
Abstract
BACKGROUND The red junglefowl, the wild outgroup of domestic chickens, has historically served as a reference for genomic studies of domestic chickens. These studies have provided insight into the etiology of traits of commercial importance. However, the use of a single reference genome does not capture diversity present among modern breeds, many of which have accumulated molecular changes due to drift and selection. While reference-based resequencing is well-suited to cataloging simple variants such as single-nucleotide changes and short insertions and deletions, it is mostly inadequate to discover more complex structural variation in the genome. METHODS We present a pangenome for the domestic chicken consisting of thirty assemblies of chickens from different breeds and research lines. RESULTS We demonstrate how this pangenome can be used to catalog structural variants present in modern breeds and untangle complex nested variation. We show that alignment of short reads from 100 diverse wild and domestic chickens to this pangenome reduces reference bias by 38%, which affects downstream genotyping results. This approach also allows for the accurate genotyping of a large and complex pair of structural variants at the K feathering locus using short reads, which would not be possible using a linear reference. CONCLUSIONS We expect that this new paradigm of genomic reference will allow better pinpointing of exact mutations responsible for specific phenotypes, which will in turn be necessary for breeding chickens that meet new sustainability criteria and are resilient to quickly evolving pathogen threats.
Collapse
Affiliation(s)
- Edward S Rice
- Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Faculty of Veterinary Medicine, Ludwig-Maximilians-Universität, Munich, Germany
| | - Antton Alberdi
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen (UCPH), Copenhagen, Denmark
| | - James Alfieri
- Department of Ecology & Evolutionary Biology, Texas A&M University, College Station, TX, USA
| | - Giridhar Athrey
- Department of Poultry Science, Texas A&M University, College Station, TX, USA
| | - Jennifer R Balacco
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Philippe Bardou
- Sigenae, GenPhySE, Université de Toulouse, INRAE, ENVT, Castanet Tolosan, 31326, France
| | - Heath Blackmon
- Department of Biology, Texas A&M University, College Station, TX, USA
| | - Mathieu Charles
- University Paris-Saclay, INRAE, AgroParisTech, GABI, Sigenae, Jouy-en-Josas, France
| | - Hans H Cheng
- Avian Disease and Oncology Laboratory, USDA, ARS, USNPRC, East Lansing, MI, USA
| | - Olivier Fedrigo
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | | | - Giulio Formenti
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Laurent A F Frantz
- Faculty of Veterinary Medicine, Ludwig-Maximilians-Universität, Munich, Germany
- School of Biological and Behavioural Sciences, Queen Mary University of London, London, E1 4DQ, UK
| | - M Thomas P Gilbert
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen (UCPH), Copenhagen, Denmark
| | - Cari J Hearn
- Avian Disease and Oncology Laboratory, USDA, ARS, USNPRC, East Lansing, MI, USA
| | - Erich D Jarvis
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- The Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Christophe Klopp
- Sigenae, Genotoul Bioinfo, MIAT UR875, INRAE, Castanet Tolosan, France
| | - Sofia Marcos
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen (UCPH), Copenhagen, Denmark
- Applied Genomics and Bioinformatics, University of the Basque Country (UPV/EHU), Leioa, Bilbao, Spain
| | | | | | - Luohao Xu
- Key Laboratory of Freshwater Fish Reproduction and Development (Ministry of Education), Key Laboratory of Aquatic Science of Chongqing, School of Life Sciences, Southwest University, Chongqing, 400715, China
| | - Wesley C Warren
- Department of Animal Sciences, University of Missouri, Columbia, MO, USA.
| |
Collapse
|
15
|
Kille B, Garrison E, Treangen TJ, Phillippy AM. Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation. Bioinformatics 2023; 39:btad512. [PMID: 37603771 PMCID: PMC10505501 DOI: 10.1093/bioinformatics/btad512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2023] [Revised: 07/19/2023] [Accepted: 08/18/2023] [Indexed: 08/23/2023] Open
Abstract
MOTIVATION The Jaccard similarity on k-mer sets has shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. RESULTS To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. AVAILABILITY AND IMPLEMENTATION MashMap3 is available at https://github.com/marbl/MashMap.
Collapse
Affiliation(s)
- Bryce Kille
- Department of Computer Science, Rice University, Houston, TX, United States
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, United States
| | - Todd J Treangen
- Department of Computer Science, Rice University, Houston, TX, United States
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, United States
| |
Collapse
|
16
|
Ayad LAK, Chikhi R, Pissis SP. Seedability: optimizing alignment parameters for sensitive sequence comparison. BIOINFORMATICS ADVANCES 2023; 3:vbad108. [PMID: 37621456 PMCID: PMC10444664 DOI: 10.1093/bioadv/vbad108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 04/21/2023] [Revised: 08/02/2023] [Accepted: 08/10/2023] [Indexed: 08/26/2023]
Abstract
Motivation Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2 , use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability , a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences. Results The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2 . We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments. Availability and implementation https://github.com/lorrainea/Seedability (distributed under GPL v3.0).
Collapse
Affiliation(s)
- Lorraine A K Ayad
- Department of Computer Science, Brunel University London, London UB8 3PH, UK
| | - Rayan Chikhi
- G5 Sequence Bioinformatics, Institut Pasteur, Université Paris Cité, 75015 Paris, France
| | - Solon P Pissis
- Networks & Optimization, CWI, 1098 XG Amsterdam, The Netherlands
- Department of Computer Science, Vrije Universiteit, 1081 HV Amsterdam, The Netherlands
| |
Collapse
|
17
|
Liu D, Steinegger M. Block Aligner: an adaptive SIMD-accelerated aligner for sequences and position-specific scoring matrices. Bioinformatics 2023; 39:btad487. [PMID: 37535681 PMCID: PMC10457662 DOI: 10.1093/bioinformatics/btad487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2022] [Revised: 06/10/2023] [Accepted: 08/01/2023] [Indexed: 08/05/2023] Open
Abstract
MOTIVATION Efficiently aligning sequences is a fundamental problem in bioinformatics. Many recent algorithms for computing alignments through Smith-Waterman-Gotoh dynamic programming (DP) exploit Single Instruction Multiple Data (SIMD) operations on modern CPUs for speed. However, these advances have largely ignored difficulties associated with efficiently handling complex scoring matrices or large gaps (insertions or deletions). RESULTS We propose a new SIMD-accelerated algorithm called Block Aligner for aligning nucleotide and protein sequences against other sequences or position-specific scoring matrices. We introduce a new paradigm that uses blocks in the DP matrix that greedily shift, grow, and shrink. This approach allows regions of the DP matrix to be adaptively computed. Our algorithm reaches over 5-10 times faster than some previous methods while incurring an error rate of less than 3% on protein and long read datasets, despite large gaps and low sequence identities. AVAILABILITY AND IMPLEMENTATION Our algorithm is implemented for global, local, and X-drop alignments. It is available as a Rust library (with C bindings) at https://github.com/Daniel-Liu-c0deb0t/block-aligner.
Collapse
Affiliation(s)
- Daniel Liu
- University of California Los Angeles, Los Angeles, CA, United States
| | - Martin Steinegger
- School of Biological Sciences, Artificial Intelligence Institute, Institute of Molecular Biology and Genetics, Seoul National University, Seoul, South Korea
| |
Collapse
|
18
|
Benoit G, Raguideau S, James R, Phillippy AM, Chikhi R, Quince C. Efficient High-Quality Metagenome Assembly from Long Accurate Reads using Minimizer-space de Bruijn Graphs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.07.07.548136. [PMID: 37786716 PMCID: PMC10541625 DOI: 10.1101/2023.07.07.548136] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/04/2023]
Abstract
We introduce a novel metagenomics assembler for high-accuracy long reads. Our approach, implemented as metaMDBG, combines highly efficient de Bruijn graph assembly in minimizer space, with both a multi-k' approach for dealing with variations in genome coverage depth and an abundance-based filtering strategy for simplifying strain complexity. The resulting algorithm is more efficient than the state-of-the-art but with better assembly results. metaMDBG was 1.5 to 12 times faster than competing assemblers and requires between one-tenth and one-thirtieth of the memory across a range of data sets. We obtained up to twice as many high-quality circularised prokaryotic metagenome assembled genomes (MAGs) on the most complex communities, and a better recovery of viruses and plasmids. metaMDBG performs particularly well for abundant organisms whilst being robust to the presence of strain diversity. The result is that for the first time it is possible to efficiently reconstruct the majority of complex communities by abundance as near-complete MAGs.
Collapse
Affiliation(s)
- Gaëtan Benoit
- Organisms and Ecosystems, Earlham Institute, Norwich, NR4 7UZ, UK
| | | | - Robert James
- Gut Microbes and Health, Quadram Institute, Norwich, NR4 7UQ, UK
| | - Adam M. Phillippy
- Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD, USA
| | - Rayan Chikhi
- Sequence Bioinformatics, Department of Computational Biology, Institut Pasteur, Paris, France
| | - Christopher Quince
- Organisms and Ecosystems, Earlham Institute, Norwich, NR4 7UZ, UK
- Gut Microbes and Health, Quadram Institute, Norwich, NR4 7UQ, UK
- Warwick Medical School, University of Warwick, Coventry, CV4 7AL, UK
| |
Collapse
|
19
|
Shaw J, Yu YW. Proving sequence aligners can guarantee accuracy in almost O( m log n) time through an average-case analysis of the seed-chain-extend heuristic. Genome Res 2023; 33:1175-1187. [PMID: 36990779 PMCID: PMC10538486 DOI: 10.1101/gr.277637.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2023] [Accepted: 03/16/2023] [Indexed: 03/31/2023]
Abstract
Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment used by modern sequence aligners. Although effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation Assume we are given a random nucleotide sequence of length ∼n that is indexed (or seeded) and a mutated substring of length ∼m ≤ n with mutation rate θ < 0.206. We prove that we can find a k = Θ(log n) for the k-mer size such that the expected runtime of seed-chain-extend under optimal linear-gap cost chaining and quadratic time gap extension is O(mn f (θ) log n), where f(θ) < 2.43 · θ holds as a loose bound. The alignment also turns out to be good; we prove that more than [Formula: see text] fraction of the homologous bases is recoverable under an optimal chain. We also show that our bounds work when k-mers are sketched, that is, only a subset of all k-mers is selected, and that sketching reduces chaining time without increasing alignment time or decreasing accuracy too much, justifying the effectiveness of sketching as a practical speedup in sequence alignment. We verify our results in simulation and on real noisy long-read data and show that our theoretical runtimes can predict real runtimes accurately. We conjecture that our bounds can be improved further, and in particular, f(θ) can be further reduced.
Collapse
Affiliation(s)
- Jim Shaw
- Department of Mathematics, University of Toronto, Toronto, Ontario M5S 2E4, Canada;
| | - Yun William Yu
- Department of Mathematics, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Computer and Mathematical Sciences, University of Toronto at Scarborough, Toronto, Ontario M1C 1A4, Canada
| |
Collapse
|
20
|
Santus L, Garriga E, Deorowicz S, Gudyś A, Notredame C. Towards the accurate alignment of over a million protein sequences: Current state of the art. Curr Opin Struct Biol 2023; 80:102577. [PMID: 37012200 DOI: 10.1016/j.sbi.2023.102577] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2022] [Revised: 02/21/2023] [Accepted: 02/27/2023] [Indexed: 04/04/2023]
Abstract
Large-scale genomics requires highly scalable and accurate multiple sequence alignment methods. Results collected over this last decade suggest accuracy loss when scaling up over a few thousand sequences. This issue has been actively addressed with a number of innovative algorithmic solutions that combine low-level hardware optimization with novel higher-level heuristics. This review provides an extensive critical overview of these recent methods. Using established reference datasets we conclude that albeit significant progress has been achieved, a unified framework able to consistently and efficiently produce high-accuracy large-scale multiple alignments is still lacking.
Collapse
|
21
|
Sahlin K, Baudeau T, Cazaux B, Marchet C. A survey of mapping algorithms in the long-reads era. Genome Biol 2023; 24:133. [PMID: 37264447 PMCID: PMC10236595 DOI: 10.1186/s13059-023-02972-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2022] [Accepted: 05/12/2023] [Indexed: 06/03/2023] Open
Abstract
It has been over a decade since the first publication of a method dedicated entirely to mapping long-reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings ( http://bcazaux.polytech-lille.net/Minimap2/ ).
Collapse
Affiliation(s)
- Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91, Stockholm, Sweden.
| | - Thomas Baudeau
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
| | - Bastien Cazaux
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
| | - Camille Marchet
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France.
| |
Collapse
|
22
|
Kille B, Garrison E, Treangen TJ, Phillippy AM. Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.16.540882. [PMID: 37325780 PMCID: PMC10268037 DOI: 10.1101/2023.05.16.540882] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/17/2023]
Abstract
Motivation The Jaccard similarity on k -mer sets has shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. Results To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k -mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications.
Collapse
Affiliation(s)
- Bryce Kille
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Todd J Treangen
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| |
Collapse
|
23
|
Garrison E, Guarracino A, Heumos S, Villani F, Bao Z, Tattini L, Hagmann J, Vorbrugg S, Marco-Sola S, Kubica C, Ashbrook DG, Thorell K, Rusholme-Pilcher RL, Liti G, Rudbeck E, Nahnsen S, Yang Z, Moses MN, Nobrega FL, Wu Y, Chen H, de Ligt J, Sudmant PH, Soranzo N, Colonna V, Williams RW, Prins P. Building pangenome graphs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.04.05.535718. [PMID: 37066137 PMCID: PMC10104075 DOI: 10.1101/2023.04.05.535718] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/18/2023]
Abstract
Pangenome graphs can represent all variation between multiple genomes, but existing methods for constructing them are biased due to reference-guided approaches. In response, we have developed PanGenome Graph Builder (PGGB), a reference-free pipeline for constructing unbi-ased pangenome graphs. PGGB uses all-to-all whole-genome alignments and learned graph embeddings to build and iteratively refine a model in which we can identify variation, measure conservation, detect recombination events, and infer phylogenetic relationships.
Collapse
|
24
|
Marco-Sola S, Eizenga JM, Guarracino A, Paten B, Garrison E, Moreto M. Optimal gap-affine alignment in O(s) space. Bioinformatics 2023; 39:7030690. [PMID: 36749013 PMCID: PMC9940620 DOI: 10.1093/bioinformatics/btad074] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2022] [Revised: 01/02/2023] [Indexed: 02/08/2023] Open
Abstract
MOTIVATION Pairwise sequence alignment remains a fundamental problem in computational biology and bioinformatics. Recent advances in genomics and sequencing technologies demand faster and scalable algorithms that can cope with the ever-increasing sequence lengths. Classical pairwise alignment algorithms based on dynamic programming are strongly limited by quadratic requirements in time and memory. The recently proposed wavefront alignment algorithm (WFA) introduced an efficient algorithm to perform exact gap-affine alignment in O(ns) time, where s is the optimal score and n is the sequence length. Notwithstanding these bounds, WFA's O(s2) memory requirements become computationally impractical for genome-scale alignments, leading to a need for further improvement. RESULTS In this article, we present the bidirectional WFA algorithm, the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining WFA's time complexity of O(ns). As a result, this work improves the lowest known memory bound O(n) to compute gap-affine alignments. In practice, our implementation never requires more than a few hundred MBs aligning noisy Oxford Nanopore Technologies reads up to 1 Mbp long while maintaining competitive execution times. AVAILABILITY AND IMPLEMENTATION All code is publicly available at https://github.com/smarco/BiWFA-paper. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Santiago Marco-Sola
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain.,Departament d'Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona 08193, Spain
| | - Jordan M Eizenga
- Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
| | - Andrea Guarracino
- Genomics Research Centre, Human Technopole, Milan 20157, Italy.,Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Benedict Paten
- Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Miquel Moreto
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain.,Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
| |
Collapse
|
25
|
Duchen D, Clipman S, Vergara C, Thio CL, Thomas DL, Duggal P, Wojcik GL. A hepatitis B virus (HBV) sequence variation graph improves sequence alignment and sample-specific consensus sequence construction for genetic analysis of HBV. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.11.523611. [PMID: 36711598 PMCID: PMC9882026 DOI: 10.1101/2023.01.11.523611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/14/2023]
Abstract
Hepatitis B virus (HBV) remains a global public health concern, with over 250 million individuals living with chronic HBV infection (CHB) and no curative therapy currently available. Viral diversity is associated with CHB pathogenesis and immunological control of infection. Improved methods to characterize the viral genome at both the population and intra-host level could aid drug development efforts. Conventionally, HBV sequencing data are aligned to a linear reference genome and only sequences capable of aligning to the reference are captured for analysis. Reference selection has additional consequences, including sample-specific 'consensus' sequence construction. It remains unclear how to select a reference from available sequences and whether a single reference is sufficient for genetic analyses. Using simulated short-read sequencing data generated from full-length publicly available HBV genome sequences and HBV sequencing data from a longitudinally sampled individual with CHB, we investigate alternative graph-based alignment approaches. We demonstrate that using a phylogenetically representative 'genome graph' for alignment, rather than linear reference sequences, avoids issues of reference ambiguity, improves alignment, and facilitates the construction of sample-specific consensus sequences genetically similar to an individual's infection. Graph-based methods can therefore improve efforts to characterize the genetics of viral pathogens, including HBV, and may have broad implications in host pathogen research.
Collapse
Affiliation(s)
- Dylan Duchen
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
| | - Steven Clipman
- Division of Infectious Diseases, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
| | - Candelaria Vergara
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
| | - Chloe L Thio
- Division of Infectious Diseases, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
| | - David L Thomas
- Division of Infectious Diseases, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
| | - Priya Duggal
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
| | - Genevieve L Wojcik
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
| |
Collapse
|
26
|
Kovaka S, Ou S, Jenike KM, Schatz MC. Approaching complete genomes, transcriptomes and epi-omes with accurate long-read sequencing. Nat Methods 2023; 20:12-16. [PMID: 36635537 PMCID: PMC10068675 DOI: 10.1038/s41592-022-01716-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/13/2023]
Abstract
The year 2022 will be remembered as the turning point for accurate long-read sequencing, which now establishes the gold standard for speed and accuracy at competitive costs. We discuss the key bioinformatics techniques needed to power long reads across application areas and close with our vision for long-read sequencing over the coming years.
Collapse
Affiliation(s)
- Sam Kovaka
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Shujun Ou
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Molecular Genetics, Ohio State University, Columbus, OH, USA
| | - Katharine M Jenike
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
- Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA.
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA.
| |
Collapse
|
27
|
Abstract
MOTIVATION Pangenome variation graphs model the mutual alignment of collections of DNA sequences. A set of pairwise alignments implies a variation graph, but there are no scalable methods to generate such a graph from these alignments. Existing related approaches depend on a single reference, a specific ordering of genomes or a de Bruijn model based on a fixed k-mer length. A scalable, self-contained method to build pangenome graphs without such limitations would be a key step in pangenome construction and manipulation pipelines. RESULTS We design the seqwish algorithm, which builds a variation graph from a set of sequences and alignments between them. We first transform the alignment set into an implicit interval tree. To build up the variation graph, we query this tree-based representation of the alignments to reduce transitive matches into single DNA segments in a sequence graph. By recording the mapping from input sequence to output graph, we can trace the original paths through this graph, yielding a pangenome variation graph. We present an implementation that operates in external memory, using disk-backed data structures and lock-free parallel methods to drive the core graph induction step. We demonstrate that our method scales to very large graph induction problems by applying it to build pangenome graphs for several species. AVAILABILITY AND IMPLEMENTATION seqwish is published as free software under the MIT open source license. Source code and documentation are available at https://github.com/ekg/seqwish. seqwish can be installed via Bioconda https://bioconda.github.io/recipes/seqwish/README.html or GNU Guix https://github.com/ekg/guix-genomics/blob/master/seqwish.scm.
Collapse
Affiliation(s)
| | - Andrea Guarracino
- Genomics Research Centre, Human Technopole, Viale Rita Levi-Montalcini 1, Milan 20157, Italy
| |
Collapse
|
28
|
Sahlin K. Strobealign: flexible seed size enables ultra-fast and accurate read alignment. Genome Biol 2022; 23:260. [PMID: 36522758 PMCID: PMC9753264 DOI: 10.1186/s13059-022-02831-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2022] [Accepted: 12/02/2022] [Indexed: 12/23/2022] Open
Abstract
Read alignment is often the computational bottleneck in analyses. Recently, several advances have been made on seeding methods for fast sequence comparison. We combine two such methods, syncmers and strobemers, in a novel seeding approach for constructing dynamic-sized fuzzy seeds and implement the method in a short-read aligner, strobealign. The seeding is fast to construct and effectively reduces repetitiveness in the seeding step, as shown using a novel metric E-hits. strobealign is several times faster than traditional aligners at similar and sometimes higher accuracy while being both faster and more accurate than more recently proposed aligners for short reads of lengths 150nt and longer. Availability: https://github.com/ksahlin/strobealign.
Collapse
Affiliation(s)
- Kristoffer Sahlin
- grid.10548.380000 0004 1936 9377Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91 Stockholm, Sweden
| |
Collapse
|
29
|
Chromosome-scale haplotype-resolved pangenomics. Trends Genet 2022; 38:1103-1107. [PMID: 35817620 DOI: 10.1016/j.tig.2022.06.011] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2022] [Revised: 06/14/2022] [Accepted: 06/16/2022] [Indexed: 01/24/2023]
Abstract
Complete pangenomics is crucial for understanding genetic diversity and evolution across the tree of life. Chromosome-scale, haplotype-resolved pangenomics allows complex structural variations, long-range interactions, and associated functions to be discerned in species populations. We explore the need for high-resolution pangenomes, discuss computational strategies for their development, and describe applications in biodiversity and human health.
Collapse
|
30
|
Lightweight Pattern Matching Method for DNA Sequencing in Internet of Medical Things. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:6980335. [PMID: 36120669 PMCID: PMC9477578 DOI: 10.1155/2022/6980335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/23/2022] [Revised: 06/28/2022] [Accepted: 07/29/2022] [Indexed: 11/18/2022]
Abstract
An area of medical science, that is, gaining prominence, is DNA sequencing. Genetic mutations responsible for the disease have been detected using DNA sequencing. The research is focusing on pattern identification methodologies for dealing with DNA-sequencing problems relating to various applications. A few examples of such problems are alignment and assembly of short reads from next generation sequencing (NGS), comparing DNA sequences, and determining the frequency of a pattern in a sequence. The approximate matching of DNA sequences is also well suited for many applications equivalent to the exact matching of the sequence since the DNA sequences are often subject to mutation. Consequently, recognizing pattern similarity becomes necessary. Furthermore, it can also be used in virtually every application that calls for pattern matching, for example, spell-checking, spam filtering, and search engines. According to the traditional approach, finding a similar pattern in the case where the sequence length is ls and the pattern length is lp occurs in O (ls∗lp). This heavy processing is caused by comparing every character of the sequence repeatedly with the pattern. The research intended to reduce the time complexity of the pattern matching by introducing an approach named “optimized pattern similarity identification” (OPSI). This methodology constructs a table, entitled “shift beyond for avoiding redundant comparison” (SBARC), to bypass the characters in the texts that are already compared with the pattern. The table pertains to the information about the character distance to be skipped in the matching. OPSI discovers at most spots of similar patterns occur in the sequence (by ignoring è mismatches). The experiment resulted in the time complexity identified as O (ls. è). In comparison to the size of the pattern, the allowed number of mismatches will be much smaller. Aspects such as scalability, generalizability, and performance of the OPSI algorithm are discussed. In comparison with the hamming distance-based approximate pattern matching algorithm, the proposed algorithm is found to be 69% more efficient.
Collapse
|
31
|
Jain C, Rhie A, Hansen NF, Koren S, Phillippy AM. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods 2022; 19:705-710. [PMID: 35365778 PMCID: PMC10510034 DOI: 10.1038/s41592-022-01457-8] [Citation(s) in RCA: 51] [Impact Index Per Article: 25.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2021] [Accepted: 03/17/2022] [Indexed: 01/10/2023]
Abstract
Approximately 5-10% of the human genome remains inaccessible due to the presence of repetitive sequences such as segmental duplications and tandem repeat arrays. We show that existing long-read mappers often yield incorrect alignments and variant calls within long, near-identical repeats, as they remain vulnerable to allelic bias. In the presence of a nonreference allele within a repeat, a read sampled from that region could be mapped to an incorrect repeat copy. To address this limitation, we developed a new long-read mapping method, Winnowmap2, by using minimal confidently alignable substrings. Winnowmap2 computes each read mapping through a collection of confident subalignments. This approach is more tolerant of structural variation and more sensitive to paralog-specific variants within repeats. Our experiments highlight that Winnowmap2 successfully addresses the issue of allelic bias, enabling more accurate downstream variant calls in repetitive sequences.
Collapse
Affiliation(s)
- Chirag Jain
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India.
- Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD, USA.
| | - Arang Rhie
- Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD, USA
| | - Nancy F Hansen
- Comparative Genomics Analysis Unit, National Human Genome Research Institute, Bethesda, MD, USA
| | - Sergey Koren
- Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD, USA
| | - Adam M Phillippy
- Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD, USA
| |
Collapse
|
32
|
Song B, Marco-Sola S, Moreto M, Johnson L, Buckler ES, Stitzer MC. AnchorWave: Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication. Proc Natl Acad Sci U S A 2022; 119:e2113075119. [PMID: 34934012 PMCID: PMC8740769 DOI: 10.1073/pnas.2113075119] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/15/2021] [Indexed: 12/04/2022] Open
Abstract
Millions of species are currently being sequenced, and their genomes are being compared. Many of them have more complex genomes than model systems and raise novel challenges for genome alignment. Widely used local alignment strategies often produce limited or incongruous results when applied to genomes with dispersed repeats, long indels, and highly diverse sequences. Moreover, alignment using many-to-many or reciprocal best hit approaches conflicts with well-studied patterns between species with different rounds of whole-genome duplication. Here, we introduce Anchored Wavefront alignment (AnchorWave), which performs whole-genome duplication-informed collinear anchor identification between genomes and performs base pair-resolved global alignment for collinear blocks using a two-piece affine gap cost strategy. This strategy enables AnchorWave to precisely identify multikilobase indels generated by transposable element (TE) presence/absence variants (PAVs). When aligning two maize genomes, AnchorWave successfully recalled 87% of previously reported TE PAVs. By contrast, other genome alignment tools showed low power for TE PAV recall. AnchorWave precisely aligns up to three times more of the genome as position matches or indels than the closest competitive approach when comparing diverse genomes. Moreover, AnchorWave recalls transcription factor-binding sites at a rate of 1.05- to 74.85-fold higher than other tools with significantly lower false-positive alignments. AnchorWave complements available genome alignment tools by showing obvious improvement when applied to genomes with dispersed repeats, active TEs, high sequence diversity, and whole-genome duplication variation.
Collapse
Affiliation(s)
- Baoxing Song
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853;
| | - Santiago Marco-Sola
- Department of Computer Sciences, Barcelona Supercomputing Center, Barcelona 08034, Spain
- Departament d'Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona 08193, Spain
| | - Miquel Moreto
- Department of Computer Sciences, Barcelona Supercomputing Center, Barcelona 08034, Spain
- Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
| | - Lynn Johnson
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
| | - Edward S Buckler
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853;
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853
- Agricultural Research Service, US Department of Agriculture, Ithaca, NY 14853
| | - Michelle C Stitzer
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853;
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853
| |
Collapse
|
33
|
Alser M, Lindegger J, Firtina C, Almadhoun N, Mao H, Singh G, Gomez-Luna J, Mutlu O. From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures. Comput Struct Biotechnol J 2022; 20:4579-4599. [PMID: 36090814 PMCID: PMC9436709 DOI: 10.1016/j.csbj.2022.08.019] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2022] [Revised: 08/08/2022] [Accepted: 08/08/2022] [Indexed: 02/01/2023] Open
|
34
|
Abu‐Hashem M, Gutub A. Efficient computation of Hash Hirschberg protein alignment utilizing hyper threading multi‐core sharing technology. CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY 2021. [DOI: 10.1049/cit2.12070] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Affiliation(s)
- Muhannad Abu‐Hashem
- Department of Geomatics Faculty of Architecture and Planning King Abdulaziz University Jeddah Saudi Arabia
| | - Adnan Gutub
- Department of Computer Engineering College of Computer & Information Systems Umm Al‐Qura University Makkah Saudi Arabia
| |
Collapse
|
35
|
Yan Y, Chaturvedi N, Appuswamy R. Accel-Align: a fast sequence mapper and aligner based on the seed-embed-extend method. BMC Bioinformatics 2021; 22:257. [PMID: 34016035 PMCID: PMC8139006 DOI: 10.1186/s12859-021-04162-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2020] [Accepted: 05/04/2021] [Indexed: 12/30/2022] Open
Abstract
Background Improvements in sequencing technology continue to drive sequencing cost towards $100 per genome. However, mapping sequenced data to a reference genome remains a computationally-intensive task due to the dependence on edit distance for dealing with INDELs and mismatches introduced by sequencing. All modern aligners use seed–filter–extend methodology and rely on filtration heuristics to reduce the overhead of edit distance computation. However, filtering has inherent performance–accuracy trade-offs that limits its effectiveness. Results Motivated by algorithmic advances in randomized low-distortion embedding, we introduce SEE, a new methodology for developing sequence mappers and aligners. While SFE focuses on eliminating sub-optimal candidates, SEE focuses instead on identifying optimal candidates. To do so, SEE transforms the read and reference strings from edit distance regime to the Hamming regime by embedding them using a randomized algorithm, and uses Hamming distance over the embedded set to identify optimal candidates. To show that SEE performs well in practice, we present Accel-Align an SEE-based short-read sequence mapper and aligner that is 3–12\documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\times$$\end{document}× faster than state-of-the-art aligners on commodity CPUs, without any special-purpose hardware, while providing comparable accuracy. Conclusions As sequencing technologies continue to increase read length while improving throughput and accuracy, we believe that randomized embeddings open up new avenues for optimization that cannot be achieved by using edit distance. Thus, the techniques presented in this paper have a much broader scope as they can be used for other applications like graph alignment, multiple sequence alignment, and sequence assembly. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04162-z.
Collapse
Affiliation(s)
- Yiqing Yan
- Data Science Department, EURECOM, 06410, Biot, France
| | | | | |
Collapse
|
36
|
Silvestre-Ryan J, Holmes I. Pair consensus decoding improves accuracy of neural network basecallers for nanopore sequencing. Genome Biol 2021; 22:38. [PMID: 33468205 PMCID: PMC7814537 DOI: 10.1186/s13059-020-02255-1] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2020] [Accepted: 12/20/2020] [Indexed: 01/09/2023] Open
Abstract
We develop a general computational approach for improving the accuracy of basecalling with Oxford Nanopore's 1D2 and related sequencing protocols. Our software PoreOver ( https://github.com/jordisr/poreover ) finds the consensus of two neural networks by aligning their probability profiles, and is compatible with multiple nanopore basecallers. When applied to the recently-released Bonito basecaller, our method reduces the median sequencing error by more than half.
Collapse
Affiliation(s)
- Jordi Silvestre-Ryan
- grid.47840.3f0000 0001 2181 7878Department of Bioengineering, University of California, Berkeley, 94720 USA
| | - Ian Holmes
- grid.47840.3f0000 0001 2181 7878Department of Bioengineering, University of California, Berkeley, 94720 USA
| |
Collapse
|