1
Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform 2023; 24:bbad320. [PMID: 37668049 DOI: 10.1093/bib/bbad320] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/16/2023] [Revised: 08/16/2023] [Accepted: 08/18/2023] [Indexed: 09/06/2023] Open
Abstract
The Sequence Alignment/Map (SAM) format file is the text file used to record alignment information. Alignment is the core of sequencing analysis, and downstream tasks accept mapping results for further processing. Given the rapid development of the sequencing industry today, a comprehensive understanding of the SAM format and related tools is necessary to meet the challenges of data processing and analysis. This paper is devoted to retrieving knowledge in the broad field of SAM. First, the format of SAM is introduced to understand the overall process of the sequencing analysis. Then, existing work is systematically classified in accordance with generation, compression and application, and the involved SAM tools are specifically mined. Lastly, a summary and some thoughts on future directions are provided.
Affiliation(s)
- Yuansheng Liu
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangzhen Shen
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Yongshun Gong
  - School of Software, Shandong University, 250100, Jinan, China
- Yiping Liu
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Bosheng Song
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangxiang Zeng
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
2
Pitman A, Huang X, Marth GT, Qiao Y. quickBAM: a parallelized BAM file access API for high-throughput sequence analysis informatics. Bioinformatics 2023; 39:btad463. [PMID: 37498562 PMCID: PMC10412403 DOI: 10.1093/bioinformatics/btad463] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Revised: 05/31/2023] [Accepted: 07/26/2023] [Indexed: 07/28/2023] Open
Abstract
MOTIVATION In time-critical clinical settings, such as precision medicine, genomic data needs to be processed as fast as possible to arrive at data-informed treatment decisions in a timely fashion. While sequencing throughput has dramatically increased over the past decade, bioinformatics analysis throughput has not been able to keep up with the pace of computer hardware improvement, and consequently has now turned into the primary bottleneck. Modern computer hardware is capable of much higher performance than current genomic informatics algorithms can typically utilize, which presents opportunities for significant performance improvement. Accessing the raw sequencing data from BAM files, for example, is a necessary and time-consuming step in nearly all sequence analysis tools; however, existing programming libraries for BAM access do not take full advantage of the parallel input/output capabilities of storage devices. RESULTS In an effort to stimulate the development of a new generation of faster sequence analysis tools, we developed quickBAM, a software library that accelerates sequencing data access by exploiting the parallelism in widely available commodity storage hardware. We demonstrate that analysis software ported to quickBAM consistently outperforms its current version, in some cases finishing an analysis in under 3 min while the original version took 1.5 h, using the same storage solution. AVAILABILITY AND IMPLEMENTATION quickBAM is open source and freely available at https://gitlab.com/yiq/quickbam/. We envision that quickBAM will enable a new generation of high-performance informatics tools, either by directly boosting their performance if they are currently bottlenecked on data access, or by allowing data access to keep up with further optimizations in algorithms and compute techniques.
Affiliation(s)
- Anders Pitman
  - UTAH Center for Genetic Discovery, Department of Human Genetics, University of Utah School of Medicine, 15 N 2030 E, Salt Lake City, UT 84112, United States
- Xiaomeng Huang
  - UTAH Center for Genetic Discovery, Department of Human Genetics, University of Utah School of Medicine, 15 N 2030 E, Salt Lake City, UT 84112, United States
- Gabor T Marth
  - UTAH Center for Genetic Discovery, Department of Human Genetics, University of Utah School of Medicine, 15 N 2030 E, Salt Lake City, UT 84112, United States
- Yi Qiao
  - UTAH Center for Genetic Discovery, Department of Human Genetics, University of Utah School of Medicine, 15 N 2030 E, Salt Lake City, UT 84112, United States
3
Sibbesen JA, Eizenga JM, Novak AM, Sirén J, Chang X, Garrison E, Paten B. Haplotype-aware pantranscriptome analyses using spliced pangenome graphs. Nat Methods 2023; 20:239-247. [PMID: 36646895 DOI: 10.1101/2021.03.26.437240] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/18/2021] [Accepted: 11/28/2022] [Indexed: 05/24/2023]
Abstract
Pangenomics is emerging as a powerful computational paradigm in bioinformatics. This field uses population-level genome reference structures, typically consisting of a sequence graph, to mitigate reference bias and facilitate analyses that were challenging with previous reference-based methods. In this work, we extend these methods into transcriptomics to analyze sequencing data using the pantranscriptome: a population-level transcriptomic reference. Our toolchain, which consists of additions to the VG toolkit and a standalone tool, RPVG, can construct spliced pangenome graphs, map RNA sequencing data to these graphs, and perform haplotype-aware expression quantification of transcripts in a pantranscriptome. We show that this workflow improves accuracy over state-of-the-art RNA sequencing mapping methods, and that it can efficiently quantify haplotype-specific transcript expression without needing to characterize the haplotypes of a sample beforehand.
Affiliation(s)
- Adam M Novak
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Jouni Sirén
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Xian Chang
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Erik Garrison
  - University of Tennessee Health Science Center, Memphis, TN, USA
4
Linderman MD, Paudyal C, Shakeel M, Kelley W, Bashir A, Gelb BD. NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data. Gigascience 2021; 10:giab046. [PMID: 34195837 PMCID: PMC8246072 DOI: 10.1093/gigascience/giab046] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/21/2020] [Revised: 05/04/2021] [Accepted: 06/07/2021] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Structural variants (SVs) play a causal role in numerous diseases but are difficult to detect and accurately genotype (determine zygosity) in whole-genome next-generation sequencing data. SV genotypers that assume that the aligned sequencing data uniformly reflect the underlying SV or use existing SV call sets as training data can only partially account for variant and sample-specific biases. RESULTS We introduce NPSV, a machine learning-based approach for genotyping previously discovered SVs that uses next-generation sequencing simulation to model the combined effects of the genomic region, sequencer, and alignment pipeline on the observed SV evidence. We evaluate NPSV alongside existing SV genotypers on multiple benchmark call sets. We show that NPSV consistently achieves or exceeds state-of-the-art genotyping accuracy across SV call sets, samples, and variant types. NPSV can specifically identify putative de novo SVs in a trio context and is robust to offset SV breakpoints. CONCLUSIONS Growing SV databases and the increasing availability of SV calls from long-read sequencing make stand-alone genotyping of previously identified SVs an increasingly important component of genome analyses. By treating potential biases as a "black box" that can be simulated, NPSV provides a framework for accurately genotyping a broad range of SVs in both targeted and genome-scale applications.
Affiliation(s)
- Michael D Linderman
  - Department of Computer Science, Middlebury College, 14 Old Chapel Road, Middlebury, VT 05753, USA
- Crystal Paudyal
  - Department of Computer Science, Middlebury College, 14 Old Chapel Road, Middlebury, VT 05753, USA
- Musab Shakeel
  - Department of Computer Science, Middlebury College, 14 Old Chapel Road, Middlebury, VT 05753, USA
- William Kelley
  - Department of Computer Science, Middlebury College, 14 Old Chapel Road, Middlebury, VT 05753, USA
- Ali Bashir
  - Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
- Bruce D Gelb
  - Mindich Child Health and Development Institute and the Departments of Pediatrics and Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, One Gustave Levy Place, Box 1040, New York, NY 10029, USA
5
Dewhurst SM, Yao X, Rosiene J, Tian H, Behr J, Bosco N, Takai KK, de Lange T, Imieliński M. Structural variant evolution after telomere crisis. Nat Commun 2021; 12:2093. [PMID: 33828097 PMCID: PMC8027843 DOI: 10.1038/s41467-021-21933-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 09/17/2020] [Accepted: 02/17/2021] [Indexed: 01/14/2023] Open
Abstract
Telomere crisis contributes to cancer genome evolution, yet only a subset of cancers display breakage-fusion-bridge (BFB) cycles and chromothripsis, hallmarks of experimental telomere crisis identified in previous studies. We examine the spectrum of structural variants (SVs) instigated by natural telomere crisis. Eight spontaneous post-crisis clones did not show prominent patterns of BFB cycles or chromothripsis. Their crisis-induced genome rearrangements varied from infrequent simple SVs to more frequent and complex SVs. In contrast, BFB cycles and chromothripsis occurred in MRC5 fibroblast clones that escaped telomere crisis after CRISPR-controlled telomerase activation. This system revealed convergent evolutionary lineages altering one allele of chromosome 12p, where a short telomere likely predisposed to fusion. Remarkably, the 12p chromothripsis and BFB events were stabilized by independent fusions to chromosome 21. The data establish that telomere crisis can generate a wide spectrum of SVs, implying that a lack of BFB patterns and chromothripsis in cancer genomes does not indicate the absence of past telomere crisis.
Affiliation(s)
- Sally M Dewhurst
  - Laboratory of Cell Biology and Genetics, Rockefeller University, New York, NY, USA
- Xiaotong Yao
  - Tri-Institutional Ph.D. Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, NY, USA
  - Department of Pathology and Laboratory Medicine, Englander Institute for Precision Medicine, Institute for Computational Biomedicine, and Meyer Cancer Center, Weill Cornell Medicine, New York, NY, USA
  - New York Genome Center, New York, NY, USA
- Joel Rosiene
  - Department of Pathology and Laboratory Medicine, Englander Institute for Precision Medicine, Institute for Computational Biomedicine, and Meyer Cancer Center, Weill Cornell Medicine, New York, NY, USA
  - New York Genome Center, New York, NY, USA
- Huasong Tian
  - Department of Pathology and Laboratory Medicine, Englander Institute for Precision Medicine, Institute for Computational Biomedicine, and Meyer Cancer Center, Weill Cornell Medicine, New York, NY, USA
  - New York Genome Center, New York, NY, USA
- Julie Behr
  - Tri-Institutional Ph.D. Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, NY, USA
  - Department of Pathology and Laboratory Medicine, Englander Institute for Precision Medicine, Institute for Computational Biomedicine, and Meyer Cancer Center, Weill Cornell Medicine, New York, NY, USA
  - New York Genome Center, New York, NY, USA
- Nazario Bosco
  - Laboratory of Cell Biology and Genetics, Rockefeller University, New York, NY, USA
  - Department of Biochemistry and Molecular Pharmacology, Institute for Systems Genetics, NYU Langone Health, New York, NY, USA
- Kaori K Takai
  - Laboratory of Cell Biology and Genetics, Rockefeller University, New York, NY, USA
- Titia de Lange
  - Laboratory of Cell Biology and Genetics, Rockefeller University, New York, NY, USA
- Marcin Imieliński
  - Department of Pathology and Laboratory Medicine, Englander Institute for Precision Medicine, Institute for Computational Biomedicine, and Meyer Cancer Center, Weill Cornell Medicine, New York, NY, USA
  - New York Genome Center, New York, NY, USA
6
Wala JA, Bandopadhayay P, Greenwald NF, O'Rourke R, Sharpe T, Stewart C, Schumacher S, Li Y, Weischenfeldt J, Yao X, Nusbaum C, Campbell P, Getz G, Meyerson M, Zhang CZ, Imielinski M, Beroukhim R. SvABA: genome-wide detection of structural variants and indels by local assembly. Genome Res 2018. [PMID: 29535149 PMCID: PMC5880247 DOI: 10.1101/gr.221028.117] [Citation(s) in RCA: 274] [Impact Index Per Article: 39.1] [Indexed: 12/30/2022]
Abstract
Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA's performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs and substantially improves detection performance for variants in the 20–300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (<1000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types and found that short templated-sequence insertions occur in ∼4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized (50–300 bp) SVs.
Affiliation(s)
- Jeremiah A Wala
  - The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA; Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA; Bioinformatics and Integrative Genomics, Harvard University, Cambridge, Massachusetts 02138, USA; Harvard Medical School, Boston, Massachusetts 02115, USA
- Pratiti Bandopadhayay
  - The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA; Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA
- Noah F Greenwald
  - The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA; Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA
- Ryan O'Rourke
  - The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA; Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA
- Ted Sharpe
  - The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA
- Chip Stewart
  - The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA
- Steve Schumacher
  - The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA; Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA
- Yilong Li
  - Seven Bridges Genomics, Cambridge, Massachusetts 02142, USA; Cancer Genome Project, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, United Kingdom
- Joachim Weischenfeldt
  - The Finsen Laboratory, Rigshospitalet, University of Copenhagen, DK-2200 Copenhagen, Denmark
- Xiaotong Yao
  - Tri-Institutional PhD Program in Computational Biology and Medicine, New York, New York 10065, USA; New York Genome Center, New York, New York 10013, USA
- Chad Nusbaum
  - The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA
- Peter Campbell
  - Cancer Genome Project, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, United Kingdom; Department of Haematology, University of Cambridge, Cambridge CB2 2XY, United Kingdom
- Gad Getz
  - The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA; Bioinformatics and Integrative Genomics, Harvard University, Cambridge, Massachusetts 02138, USA; Harvard Medical School, Boston, Massachusetts 02115, USA; Department of Pathology and Cancer Center, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
- Matthew Meyerson
  - The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA; Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA; Bioinformatics and Integrative Genomics, Harvard University, Cambridge, Massachusetts 02138, USA; Harvard Medical School, Boston, Massachusetts 02115, USA
- Cheng-Zhong Zhang
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115, USA
- Marcin Imielinski
  - New York Genome Center, New York, New York 10013, USA; Department of Pathology and Laboratory Medicine, Englander Institute for Precision Medicine, Institute for Computational Biomedicine, and Meyer Cancer Center, Weill Cornell Medicine, New York, New York 10065, USA
- Rameen Beroukhim
  - The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA; Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA; Bioinformatics and Integrative Genomics, Harvard University, Cambridge, Massachusetts 02138, USA; Harvard Medical School, Boston, Massachusetts 02115, USA