1
Zhou Q, Ghezelji M, Hari A, Ford MKB, Holley C, Mirabello L, Chanock S, Sahinalp SC, Numanagić I. Geny: A Genotyping Tool for Allelic Decomposition of Killer Cell Immunoglobulin-Like Receptor Genes. bioRxiv 2024:2024.02.27.582413. [PMID: 38529502 PMCID: PMC10962708 DOI: 10.1101/2024.02.27.582413] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/27/2024]
Abstract
Accurate genotyping of Killer cell Immunoglobulin-like Receptor (KIR) genes plays a pivotal role in enhancing our understanding of innate immune responses, disease correlations, and the advancement of personalized medicine. However, due to the high variability of the KIR region and high level of sequence similarity among different KIR genes, the currently available genotyping methods are unable to accurately infer copy numbers, genotypes and haplotypes of individual KIR genes from next-generation sequencing data. Here we introduce Geny, a new computational tool for precise genotyping of KIR genes. Geny utilizes available KIR haplotype databases and proposes a novel combination of expectation-maximization filtering schemes and integer linear programming-based combinatorial optimization models to resolve ambiguous reads, provide accurate copy number estimation and estimate the haplotype of each copy for the genes within the KIR region. We evaluated Geny on a large set of simulated short-read datasets covering the known validated KIR region assemblies and a set of Illumina short-read samples sequenced from 25 validated samples from the Human Pangenome Reference Consortium collection and showed that it outperforms the existing genotyping tools in terms of accuracy, precision and recall. We envision Geny becoming a valuable resource for understanding immune system response and consequently advancing the field of patient-centric medicine.

2
Shugg T, Ly RC, Osei W, Rowe EJ, Granfield CA, Lynnes TC, Medeiros EB, Hodge JC, Breman AM, Schneider BP, Sahinalp SC, Numanagić I, Salisbury BA, Bray SM, Ratcliff R, Skaar TC. Computational pharmacogenotype extraction from clinical next-generation sequencing. Front Oncol 2023; 13:1199741. [PMID: 37469403 PMCID: PMC10352904 DOI: 10.3389/fonc.2023.1199741] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Accepted: 05/22/2023] [Indexed: 07/21/2023]
Abstract
Background Next-generation sequencing (NGS), including whole genome sequencing (WGS) and whole exome sequencing (WES), is increasingly being used for clinical care. While NGS data have the potential to be repurposed to support clinical pharmacogenomics (PGx), current computational approaches have not been widely validated using clinical data. In this study, we assessed the accuracy of the Aldy computational method to extract PGx genotypes from WGS and WES data for 14 and 13 major pharmacogenes, respectively.
Methods Germline DNA was isolated from whole blood samples collected for 264 patients seen at our institutional molecular solid tumor board. DNA was used for panel-based genotyping within our institutional Clinical Laboratory Improvement Amendments- (CLIA-) certified PGx laboratory. DNA was also sent to other CLIA-certified commercial laboratories for clinical WGS or WES. Aldy v3.3 and v4.4 were used to extract PGx genotypes from these NGS data, and results were compared to the panel-based genotyping reference standard that contained 45 star allele-defining variants within CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, CYP3A4, CYP3A5, CYP4F2, DPYD, G6PD, NUDT15, SLCO1B1, TPMT, and VKORC1.
Results Mean WGS read depth was >30x for all variant regions except for G6PD (average read depth was 29 reads), and mean WES read depth was >30x for all variant regions. For 94 patients with WGS, Aldy v3.3 diplotype calls were concordant with those from the genotyping reference standard in 99.5% of cases when excluding diplotypes with additional major star alleles not tested by targeted genotyping, ambiguous phasing, and CYP2D6 hybrid alleles. Aldy v3.3 identified 15 additional clinically actionable star alleles not covered by genotyping within CYP2B6, CYP2C19, DPYD, SLCO1B1, and NUDT15. Within the WGS cohort, Aldy v4.4 diplotype calls were concordant with those from genotyping in 99.7% of cases. When excluding patients with CYP2D6 copy number variation, all Aldy v4.4 diplotype calls except for one CYP3A4 diplotype call were concordant with genotyping for 161 patients in the WES cohort.
Conclusion Aldy v3.3 and v4.4 called diplotypes for major pharmacogenes from clinical WES and WGS data with >99% accuracy. These findings support the use of Aldy to repurpose clinical NGS data to inform clinical PGx.
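The concordance figures reported in this abstract compare diplotype calls against a reference standard. The following is a minimal sketch of how such a percentage could be computed, not the study's actual pipeline: the function names, the `(patient, gene)` keying, and the sample data are all hypothetical, and it assumes a diplotype is written as two star alleles separated by `/` whose order does not matter.

```python
def normalize(diplotype: str) -> tuple:
    """Treat '*2/*1' and '*1/*2' as the same diplotype (assumed convention)."""
    return tuple(sorted(diplotype.split("/")))


def concordance(calls: dict, reference: dict) -> float:
    """Fraction of (patient, gene) pairs present in both inputs whose
    diplotypes agree after normalization."""
    shared = calls.keys() & reference.keys()
    if not shared:
        raise ValueError("no overlapping (patient, gene) pairs to compare")
    agree = sum(normalize(calls[key]) == normalize(reference[key]) for key in shared)
    return agree / len(shared)


# Hypothetical example data, not taken from the study.
caller_calls = {
    ("P1", "CYP2C19"): "*2/*1",
    ("P1", "TPMT"): "*1/*3A",
    ("P2", "CYP2C19"): "*1/*17",
    ("P2", "TPMT"): "*1/*1",
}
panel_reference = {
    ("P1", "CYP2C19"): "*1/*2",
    ("P1", "TPMT"): "*1/*3A",
    ("P2", "CYP2C19"): "*1/*1",
    ("P2", "TPMT"): "*1/*1",
}

print(f"{concordance(caller_calls, panel_reference):.1%}")
```

The study additionally excludes certain categories of calls (for example ambiguous phasing) before computing concordance; that filtering is not modeled here.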
Affiliation(s)
- Tyler Shugg
  - Division of Clinical Pharmacology, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, United States
- Reynold C. Ly
  - Division of Diagnostic Genetics and Genomics, Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States
- Wilberforce Osei
  - Division of Clinical Pharmacology, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, United States
- Elizabeth J. Rowe
  - Division of Clinical Pharmacology, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, United States
- Caitlin A. Granfield
  - Division of Diagnostic Genetics and Genomics, Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States
- Ty C. Lynnes
  - Division of Diagnostic Genetics and Genomics, Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States
- Elizabeth B. Medeiros
  - Division of Diagnostic Genetics and Genomics, Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States
- Jennelle C. Hodge
  - Division of Diagnostic Genetics and Genomics, Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States
- Amy M. Breman
  - Division of Diagnostic Genetics and Genomics, Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States
- Bryan P. Schneider
  - Division of Hematology/Oncology, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, United States
- S. Cenk Sahinalp
  - Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Ibrahim Numanagić
  - Department of Computer Science, University of Victoria, Victoria, BC, Canada
- Todd C. Skaar
  - Division of Clinical Pharmacology, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, United States

3
Hari A, Zhou Q, Gonzaludo N, Harting J, Scott SA, Qin X, Scherer S, Sahinalp SC, Numanagić I. An efficient genotyper and star-allele caller for pharmacogenomics. Genome Res 2023; 33:61-70. [PMID: 36657977 PMCID: PMC9977157 DOI: 10.1101/gr.277075.122] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 08/11/2022] [Accepted: 12/12/2022] [Indexed: 01/20/2023]
Abstract
High-throughput sequencing provides sufficient means for determining genotypes of clinically important pharmacogenes that can be used to tailor medical decisions to individual patients. However, pharmacogene genotyping, also known as star-allele calling, is a challenging problem that requires accurate copy number calling, structural variation identification, variant calling, and phasing within each pharmacogene copy present in the sample. Here we introduce Aldy 4, a fast and efficient tool for genotyping pharmacogenes that uses combinatorial optimization for accurate star-allele calling across different sequencing technologies. Aldy 4 adds support for long reads and uses a novel phasing model and improved copy number and variant calling models. We compare Aldy 4 against the current state-of-the-art star-allele callers on a large and diverse set of samples and genes sequenced by various sequencing technologies, such as whole-genome and targeted Illumina sequencing, barcoded 10x Genomics, and Pacific Biosciences (PacBio) HiFi. We show that Aldy 4 is the most accurate star-allele caller with near-perfect accuracy in all evaluated contexts, and hope that Aldy remains an invaluable tool in the clinical toolbox even with the advent of long-read sequencing technologies.
Affiliation(s)
- Ananth Hari
  - Department of Electrical and Computer Engineering, University of Maryland, College Park, Maryland 20742, USA; Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Qinghui Zhou
  - Department of Computer Science, University of Victoria, Victoria, British Columbia V8P 5C2, Canada
- John Harting
  - Pacific Biosciences, Menlo Park, California 94025, USA
- Stuart A. Scott
  - Department of Pathology, Stanford University, Palo Alto, California 94304, USA
- Xiang Qin
  - Baylor College of Medicine Human Genome Sequencing Center, Houston, Texas 77030, USA
- Steve Scherer
  - Baylor College of Medicine Human Genome Sequencing Center, Houston, Texas 77030, USA
- S. Cenk Sahinalp
  - Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Ibrahim Numanagić
  - Department of Computer Science, University of Victoria, Victoria, British Columbia V8P 5C2, Canada

4
Osei WA, Shugg T, Ly RC, Bray SM, Salisbury BA, Ratcliff RR, Pratt VM, Numanagić I, Skaar T. Abstract 1151: Pharmacogenomics genotyping from clinical somatic whole exome sequencing: Aldy, a computational tool. Cancer Res 2022. [DOI: 10.1158/1538-7445.am2022-1151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Background Pharmacogenomics (PGx) testing can reduce toxicities and improve efficacy of several drugs used to treat cancer and associated symptoms. PGx results can be determined from germline whole-exome sequencing (WES), but somatic mutations may cause discordance between tumor and germline DNA. Since clinical diagnostic sequencing in oncology frequently only includes tumor DNA, there would be clinical value in calling germline PGx genotypes from tumor DNA. Thus, the purpose of this study was to assess the feasibility of using somatic WES data to call germline PGx genotypes.
Methods Germline and somatic WES data were obtained as part of the clinical workflow for 64 patients treated at the solid molecular tumor board clinic at Indiana University. Aldy v3.3 was implemented in LifeOmic’s Precision Health Cloud™ to call PGx genotypes from somatic WES. Somatic Aldy calls were compared with previously validated Aldy germline calls for 8 genes: CYP2C9, CYP2C19, CYP2D6, CYP3A4, CYP3A5, CYP4F2, DPYD, and TPMT. Somatic read depth was >100x, except for the intronic CYP3A4*22 variant, which was >30x.
Results Somatic and germline Aldy calls were compared for a total of 512 genotypes and 56 (11%) calls were discordant. Discordant calls were most common for CYP2B6 (23.4%), followed by CYP2D6 (14.1%), CYP2C19 (10.9%), CYP2C8 (6.3%), and DPYD (6.3%). In contrast, all Aldy calls were concordant for CYP3A5 and TPMT. 38 out of 64 subjects (59%) had discordant calls for at least one gene. The most common first cancer diagnoses in our cohort were colorectal (9.3%), breast (7.8%), and pancreatic (7.8%), and the rates of discordant Aldy calls did not differ by cancer type (p>0.05 for all cancer types). Based on our analyses of discordant calls, we anticipate that adjusting Aldy’s thresholds for variant calling may allow Aldy to determine genotypes from somatic WES data.
Conclusion In most cases, genotype calls of drug metabolism genes from tumor DNA reflected the germline genotypes; however, additional work needs to be done to determine if the remaining discordant calls can be corrected by modifying the informatics tools or if they are due to somatic mutations.
Citation Format: Wilberforce A. Osei, Tyler Shugg, Reynold C. Ly, Steven M. Bray, Benjamin A. Salisbury, Ryan R. Ratcliff, Victoria M. Pratt, Ibrahim Numanagić, Todd Skaar. Pharmacogenomics genotyping from clinical somatic whole exome sequencing: Aldy, a computational tool [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 1151.
Affiliation(s)
- Tyler Shugg
  - Indiana University School of Medicine, Indianapolis, IN
- Reynold C. Ly
  - Indiana University School of Medicine, Indianapolis, IN
- Todd Skaar
  - Indiana University School of Medicine, Indianapolis, IN

5
Smajlović H, Shajii A, Berger B, Cho H, Numanagić I. Sequre: a high-performance framework for rapid development of secure bioinformatics pipelines. IEEE Int Symp Parallel Distrib Process Workshops Phd Forum 2022; 2022:164-165. [PMID: 35958356 PMCID: PMC9364365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Affiliation(s)
- Hyunghoon Cho
  - Broad Institute of MIT and Harvard, Massachusetts, USA

6
Gaedigk A, Boone EC, Scherer SE, Lee SB, Numanagić I, Sahinalp C, Smith JD, McGee S, Radhakrishnan A, Qin X, Wang WY, Farrow EG, Gonzaludo N, Halpern AL, Nickerson DA, Miller NA, Pratt VM, Kalman LV. CYP2C8, CYP2C9, and CYP2C19 Characterization Using Next-Generation Sequencing and Haplotype Analysis: A GeT-RM Collaborative Project. J Mol Diagn 2022; 24:337-350. [PMID: 35134542 PMCID: PMC9069873 DOI: 10.1016/j.jmoldx.2021.12.011] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 11/10/2021] [Revised: 12/09/2021] [Accepted: 12/28/2021] [Indexed: 01/13/2023]
Abstract
Pharmacogenetic tests typically target selected sequence variants to identify haplotypes that are often defined by star (∗) allele nomenclature. Due to their design, these targeted genotyping assays are unable to detect novel variants that may change the function of the gene product and thereby affect phenotype prediction and patient care. In the current study, 137 DNA samples that were previously characterized by the Genetic Testing Reference Material (GeT-RM) program using a variety of targeted genotyping methods were recharacterized using targeted and whole genome sequencing analysis. Sequence data were analyzed using three genotype calling tools to identify star allele diplotypes for CYP2C8, CYP2C9, and CYP2C19. The genotype calls from next-generation sequencing (NGS) correlated well to those previously reported, except when novel alleles were present in a sample. Six novel alleles and 38 novel suballeles were identified in the three genes due to identification of variants not covered by targeted genotyping assays. In addition, several ambiguous genotype calls from a previous study were resolved using the NGS and/or long-read NGS data. Diplotype calls were mostly consistent between the calling algorithms, although several discrepancies were noted. This study highlights the utility of NGS for pharmacogenetic testing and demonstrates that there are many novel alleles that are yet to be discovered, even in highly characterized genes such as CYP2C9 and CYP2C19.
Affiliation(s)
- Andrea Gaedigk
  - Division of Clinical Pharmacology, Toxicology and Therapeutic Innovation, Children's Mercy Kansas City, Kansas City, Missouri; University of Missouri-Kansas City School of Medicine, Kansas City, Missouri
- Erin C Boone
  - Division of Clinical Pharmacology, Toxicology and Therapeutic Innovation, Children's Mercy Kansas City, Kansas City, Missouri
- Steven E Scherer
  - Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Seung-Been Lee
  - Precision Medicine Institute, Macrogen Inc., Seongnam, Republic of Korea
- Ibrahim Numanagić
  - Department of Computer Science, University of Victoria, Victoria, British Columbia, Canada
- Cenk Sahinalp
  - Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, Maryland
- Joshua D Smith
  - Department of Genome Sciences, University of Washington, Seattle, Washington
- Sean McGee
  - Department of Genome Sciences, University of Washington, Seattle, Washington
- Xiang Qin
  - Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Wendy Y Wang
  - Division of Clinical Pharmacology, Toxicology and Therapeutic Innovation, Children's Mercy Kansas City, Kansas City, Missouri
- Emily G Farrow
  - University of Missouri-Kansas City School of Medicine, Kansas City, Missouri; Center for Genomic Medicine, Children's Mercy Kansas City, Kansas City, Missouri
- Nina Gonzaludo
  - Medical Genomics Research, Illumina Inc., San Diego, California
- Aaron L Halpern
  - Medical Genomics Research, Illumina Inc., San Diego, California
- Deborah A Nickerson
  - Department of Genome Sciences, University of Washington, Seattle, Washington
- Neil A Miller
  - University of Missouri-Kansas City School of Medicine, Kansas City, Missouri; Center for Genomic Medicine, Children's Mercy Kansas City, Kansas City, Missouri
- Victoria M Pratt
  - Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, Indiana
- Lisa V Kalman
  - Informatics and Data Science Branch, Division of Laboratory Systems, Centers for Disease Control and Prevention, Atlanta, Georgia

7
Išerić H, Alkan C, Hach F, Numanagić I. Fast characterization of segmental duplication structure in multiple genome assemblies. Algorithms Mol Biol 2022; 17:4. [PMID: 35303886 PMCID: PMC8932185 DOI: 10.1186/s13015-022-00210-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/16/2021] [Accepted: 02/08/2022] [Indexed: 11/29/2022]
Abstract
Motivation The increasing availability of high-quality genome assemblies raised interest in the characterization of genomic architecture. Major architectural elements, such as common repeats and segmental duplications (SDs), increase genome plasticity that stimulates further evolution by changing the genomic structure and inventing new genes. Optimal computation of SDs within a genome requires quadratic-time local alignment algorithms that are impractical due to the size of most genomes. Additionally, to perform evolutionary analysis, one needs to characterize SDs in multiple genomes and find relations between those SDs and unique (non-duplicated) segments in other genomes. A naïve approach consisting of multiple sequence alignment would make the optimal solution to this problem even more impractical. Thus there is a need for fast and accurate algorithms to characterize SD structure in multiple genome assemblies to better understand the evolutionary forces that shaped the genomes of today.
Results Here we introduce a new approach, BISER, to quickly detect SDs in multiple genomes and identify elementary SDs and core duplicons that drive the formation of such SDs. BISER improves earlier tools by (i) scaling the detection of SDs with low homology to multiple genomes while introducing further 7–33× speed-ups over the existing tools, and by (ii) characterizing elementary SDs and detecting core duplicons to help trace the evolutionary history of duplications to as far as 300 million years.
Availability and implementation BISER is implemented in the Seq programming language and is publicly available at https://github.com/0xTCG/biser.

8
Ly R, Shugg T, Ratcliff R, Osei W, Pratt V, Schneider B, Radovich M, Bray S, Salisbury B, Parikh B, Sahinalp SC, Numanagić I, Skaar T. eP373: Analytical validation of a computational method for pharmacogenetic genotyping from clinical exome sequencing. Genet Med 2022. [DOI: 10.1016/j.gim.2022.01.408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]

9
Shajii A, Numanagić I, Leighton AT, Greenyer H, Amarasinghe S, Berger B. A Python-based programming language for high-performance computational genomics. Nat Biotechnol 2021; 39:1062-1064. [PMID: 34282326 PMCID: PMC8542382 DOI: 10.1038/s41587-021-00985-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/07/2023]
Affiliation(s)
- Ariya Shajii
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Ibrahim Numanagić
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Department of Computer Science, University of Victoria, Victoria, British Columbia, Canada
- Alexander T Leighton
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
- Haley Greenyer
  - Department of Computer Science, University of Victoria, Victoria, British Columbia, Canada
- Saman Amarasinghe
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA

10
Berger E, Yorukoglu D, Zhang L, Nyquist SK, Shalek AK, Kellis M, Numanagić I, Berger B. Improved haplotype inference by exploiting long-range linking and allelic imbalance in RNA-seq datasets. Nat Commun 2020; 11:4662. [PMID: 32938926 PMCID: PMC7494856 DOI: 10.1038/s41467-020-18320-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 04/15/2019] [Accepted: 08/07/2020] [Indexed: 01/04/2023]
Abstract
Haplotype reconstruction of distant genetic variants remains an unsolved problem due to the short-read length of common sequencing data. Here, we introduce HapTree-X, a probabilistic framework that utilizes latent long-range information to reconstruct unspecified haplotypes in diploid and polyploid organisms. It introduces the observation that differential allele-specific expression can link genetic variants from the same physical chromosome, thus even enabling using reads that cover only individual variants. We demonstrate HapTree-X's feasibility on in-house sequenced Genome in a Bottle RNA-seq and various whole exome, genome, and 10X Genomics datasets. HapTree-X produces more complete phases (up to 25%), even in clinically important genes, and phases more variants than other methods while maintaining similar or higher accuracy and being up to 10× faster than other tools. The advantage of HapTree-X's ability to use multiple lines of evidence, as well as to phase polyploid genomes in a single integrative framework, substantially grows as the amount of diverse data increases.
Affiliation(s)
- Emily Berger
  - Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
  - Department of Mathematics, UC Berkeley, Berkeley, CA, 94720, USA
- Deniz Yorukoglu
  - Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Lillian Zhang
  - Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Sarah K Nyquist
  - Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Alex K Shalek
  - Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Manolis Kellis
  - Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Ibrahim Numanagić
  - Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
  - Department of Computer Science, University of Victoria, Victoria, BC, V8P 5C2, Canada
- Bonnie Berger
  - Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA

11
Abstract
The scope and scale of biological data are increasing at an exponential rate, as technologies like next-generation sequencing are becoming radically cheaper and more prevalent. Over the last two decades, the cost of sequencing a genome has dropped from $100 million to nearly $100 (a factor of over 10^6), and the amount of data to be analyzed has increased proportionally. Yet, as Moore's Law continues to slow, computational biologists can no longer rely on computing hardware to compensate for the ever-increasing size of biological datasets. In a field where many researchers are primarily focused on biological analysis over computational optimization, the unfortunate solution to this problem is often to simply buy larger and faster machines. Here, we introduce Seq, the first language tailored specifically to bioinformatics, which marries the ease and productivity of Python with C-like performance. Seq starts with a subset of Python, and is in many cases a drop-in replacement, yet also incorporates novel bioinformatics- and computational genomics-oriented data types, language constructs and optimizations. Seq enables users to write high-level, Pythonic code without having to worry about low-level or domain-specific optimizations, and allows for the seamless expression of the algorithms, idioms and patterns found in many genomics or bioinformatics applications. We evaluated Seq on several standard computational genomics tasks like reverse complementation, k-mer manipulation, sequence pattern matching and large genomic index queries. On equivalent CPython code, Seq attains a performance improvement of up to two orders of magnitude, and a 160× improvement once domain-specific language features and optimizations are used. With parallelism, we demonstrate up to a 650× improvement. Compared to optimized C++ code, which is already difficult for most biologists to produce, Seq frequently attains up to a 2× improvement, and with shorter, cleaner code. Thus, Seq opens the door to an age of democratization of highly-optimized bioinformatics software.
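Two of the benchmark tasks named in this abstract, reverse complementation and k-mer manipulation, can be illustrated with a short sketch. This is ordinary CPython, not Seq code, and it is not the authors' benchmark; the function names are chosen here for illustration, and the canonical k-mer convention (lexicographic minimum of a k-mer and its reverse complement) is a common choice assumed for the example.

```python
# Translation table mapping each base to its complement (DNA alphabet only).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def canonical_kmers(seq: str, k: int) -> list:
    """Strand-independent k-mers: the smaller of each k-mer and its
    reverse complement (an assumed, commonly used convention)."""
    return [min(kmer, reverse_complement(kmer)) for kmer in kmers(seq, k)]


print(reverse_complement("ACGTT"))
print(list(kmers("ACGT", 2)))
print(canonical_kmers("ACGT", 2))
```

Pure-Python loops of this kind are the sort of code the abstract reports as the CPython baseline; the reported speed-ups refer to the authors' own measurements, not to this sketch.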
Affiliation(s)
- Ariya Shajii
  - MIT CSAIL, 77 Massachusetts Ave, Cambridge, MA, 02139, USA
- Bonnie Berger
  - MIT CSAIL, 77 Massachusetts Ave, Cambridge, MA, 02139, USA

12
Abstract
Motivation Segmental duplications (SDs), or low-copy repeats, are segments of DNA > 1 Kbp with high sequence identity that are copied to other regions of the genome. SDs are among the most important sources of evolution and a common cause of genomic structural variation, and several are associated with diseases of genomic origin including schizophrenia and autism. Despite their functional importance, SDs present one of the major hurdles for de novo genome assembly due to the ambiguity they cause in building and traversing both state-of-the-art overlap-layout-consensus and de Bruijn graphs. This causes SD regions to be misassembled, collapsed into a unique representation, or completely missing from assembled reference genomes for various organisms. In turn, this missing or incorrect information limits our ability to fully understand the evolution and the architecture of the genomes. Despite the essential need to accurately characterize SDs in assemblies, there has been only one tool that was developed for this purpose, called Whole-Genome Assembly Comparison (WGAC); its primary goal is SD detection. WGAC comprises several steps that employ different tools and custom scripts, which makes this strategy difficult and time-consuming to use. Thus there is still a need for algorithms to characterize within-assembly SDs quickly, accurately, and in a user-friendly manner.
Results Here we introduce SEgmental Duplication Evaluation Framework (SEDEF) to rapidly detect SDs through sophisticated filtering strategies based on Jaccard similarity and local chaining. We show that SEDEF accurately detects SDs while maintaining a substantial speed-up over WGAC that translates into practical run times of minutes instead of weeks. Notably, our algorithm captures up to 25% 'pairwise error' between segments, whereas previous studies focused on only 10%, allowing us to more deeply track the evolutionary history of the genome.
Availability and implementation SEDEF is available at https://github.com/vpc-ccg/sedef.
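The abstract names Jaccard similarity as the basis of SEDEF's filtering but does not describe the filter itself. As a minimal sketch of the underlying measure only, the following computes the textbook Jaccard similarity between the k-mer sets of two sequences; the function names and the default k are assumptions for illustration and do not reflect SEDEF's actual implementation.

```python
def kmer_set(seq: str, k: int) -> set:
    """All distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 4) -> float:
    """|A ∩ B| / |A ∪ B| over the k-mer sets of the two sequences."""
    set_a, set_b = kmer_set(a, k), kmer_set(b, k)
    if not set_a and not set_b:
        return 1.0  # two empty sets are treated as identical here
    return len(set_a & set_b) / len(set_a | set_b)


print(jaccard_similarity("ACGTACGT", "ACGTACGT"))
print(jaccard_similarity("ACGT", "ACGA", k=2))
```

A high score suggests two regions share most of their k-mers and are worth aligning in full, which is the general idea behind using such a measure as a cheap pre-filter before expensive alignment.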
Affiliation(s)
- Ibrahim Numanagić
  - Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
- Alim S Gökkaya
  - Department of Computer Engineering, Bilkent University, Ankara, Turkey
- Lillian Zhang
  - Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
- Can Alkan
  - Department of Computer Engineering, Bilkent University, Ankara, Turkey
- Faraz Hach
  - Vancouver Prostate Centre, Vancouver, Canada
  - Department of Urologic Sciences, University of British Columbia, Vancouver, Canada

13
Lin YY, Gawronski A, Hach F, Li S, Numanagić I, Sarrafi I, Mishra S, McPherson A, Collins CC, Radovich M, Tang H, Sahinalp SC. Computational identification of micro-structural variations and their proteogenomic consequences in cancer. Bioinformatics 2018; 34:1672-1681. [PMID: 29267878 PMCID: PMC5946953 DOI: 10.1093/bioinformatics/btx807] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 08/25/2017] [Revised: 11/24/2017] [Accepted: 12/15/2017] [Indexed: 12/18/2022]
Abstract
Motivation Rapid advancement in high throughput genome and transcriptome sequencing (HTS) and mass spectrometry (MS) technologies has enabled the acquisition of the genomic, transcriptomic and proteomic data from the same tissue sample. We introduce a computational framework, ProTIE, to integratively analyze all three types of omics data for a complete molecular profile of a tissue sample. Our framework features MiStrVar, a novel algorithmic method to identify micro structural variants (microSVs) on genomic HTS data. Coupled with deFuse, a popular gene fusion detection method we developed earlier, MiStrVar can accurately profile structurally aberrant transcripts in tumors. Given the breakpoints obtained by MiStrVar and deFuse, our framework can then identify all relevant peptides that span the breakpoint junctions and match them with unique proteomic signatures. Observing structural aberrations in all three types of omics data validates their presence in the tumor samples. Results We have applied our framework to all The Cancer Genome Atlas (TCGA) breast cancer Whole Genome Sequencing (WGS) and/or RNA-Seq datasets, spanning all four major subtypes, for which proteomics data from Clinical Proteomic Tumor Analysis Consortium (CPTAC) have been released. A recent study on this dataset focusing on SNVs has reported many that lead to novel peptides. Complementing and significantly broadening this study, we detected 244 novel peptides from 432 candidate genomic or transcriptomic sequence aberrations. Many of the fusions and microSVs we discovered have not been reported in the literature. Interestingly, the vast majority of these translated aberrations, fusions in particular, were private, demonstrating the extensive inter-genomic heterogeneity present in breast cancer. Many of these aberrations also have matching out-of-frame downstream peptides, potentially indicating novel protein sequence and structure. 
Availability and implementation MiStrVar is available for download at https://bitbucket.org/compbio/mistrvar, and ProTIE is available at https://bitbucket.org/compbio/protie. Contact cenksahi@indiana.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yen-Yi Lin
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
  - Vancouver Prostate Centre, Vancouver, BC, Canada
- Faraz Hach
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
  - Vancouver Prostate Centre, Vancouver, BC, Canada
  - Department of Urologic Sciences, University of British Columbia, Vancouver, BC, Canada
- Sujun Li
  - Department of Computer Science, Indiana University, Bloomington, IN, USA
- Ibrahim Numanagić
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
- Iman Sarrafi
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
  - Vancouver Prostate Centre, Vancouver, BC, Canada
- Swati Mishra
  - Department of Surgery, Indiana University, School of Medicine, Indianapolis, IN, USA
- Andrew McPherson
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
- Colin C Collins
  - Vancouver Prostate Centre, Vancouver, BC, Canada
  - Department of Urologic Sciences, University of British Columbia, Vancouver, BC, Canada
- Milan Radovich
  - Department of Surgery, Indiana University, School of Medicine, Indianapolis, IN, USA
- Haixu Tang
  - Department of Computer Science, Indiana University, Bloomington, IN, USA
- S Cenk Sahinalp
  - Vancouver Prostate Centre, Vancouver, BC, Canada
  - Department of Computer Science, Indiana University, Bloomington, IN, USA
|
14
|
Shajii A, Numanagić I, Berger B. Latent Variable Model for Aligning Barcoded Short-Reads Improves Downstream Analyses. Res Comput Mol Biol 2018; 10812:280-282. [PMID: 29888346 PMCID: PMC5989713] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Affiliation(s)
- Ariya Shajii
  - Computer Science and AI Lab, MIT, Cambridge, MA, USA
- Ibrahim Numanagić
  - Computer Science and AI Lab, MIT, Cambridge, MA, USA
  - Department of Mathematics, MIT, Cambridge, MA, USA
- Bonnie Berger
  - Computer Science and AI Lab, MIT, Cambridge, MA, USA
  - Department of Mathematics, MIT, Cambridge, MA, USA
|
15
|
Numanagić I, Malikić S, Ford M, Qin X, Toji L, Radovich M, Skaar TC, Pratt VM, Berger B, Scherer S, Sahinalp SC. Allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes. Nat Commun 2018; 9:828. [PMID: 29483503 PMCID: PMC5826927 DOI: 10.1038/s41467-018-03273-1] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 05/10/2017] [Accepted: 02/01/2018] [Indexed: 12/30/2022]
Abstract
High-throughput sequencing provides the means to determine the allelic decomposition for any gene of interest, that is, the number of copies and the exact sequence content of each copy of a gene. Although many clinically and functionally important genes are highly polymorphic and have undergone structural alterations, no high-throughput sequencing data analysis tool has yet been designed to effectively solve the full allelic decomposition problem. Here we introduce a combinatorial optimization framework that successfully resolves this challenging problem, including for genes with structural alterations. We provide an associated computational tool, Aldy, that performs allelic decomposition of highly polymorphic, multi-copy genes using whole or targeted genome sequencing data. For a large, diverse sequencing data set, Aldy identifies multiple rare and novel alleles for several important pharmacogenes, significantly improving upon the accuracy and utility of current genotyping assays. As more data sets become available, we expect Aldy to become an essential component of genotyping toolkits.
Affiliation(s)
- Ibrahim Numanagić
  - School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Salem Malikić
  - School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
- Michael Ford
  - School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
- Xiang Qin
  - Baylor College of Medicine Human Genome Sequencing Center, Houston, TX, 77030, USA
- Lorraine Toji
  - Coriell Institute for Medical Research, Camden, NJ, 08103, USA
- Milan Radovich
  - Indiana University School of Medicine, Indianapolis, IN, 46202, USA
- Todd C Skaar
  - Indiana University School of Medicine, Indianapolis, IN, 46202, USA
- Victoria M Pratt
  - Indiana University School of Medicine, Indianapolis, IN, 46202, USA
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Steve Scherer
  - Baylor College of Medicine Human Genome Sequencing Center, Houston, TX, 77030, USA
- S Cenk Sahinalp
  - Department of Computer Science, Indiana University, Bloomington, IN, 47405, USA
|
16
|
Abstract
MOTIVATION Despite recent advances in algorithms designed to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. RESULT Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects. AVAILABILITY AND IMPLEMENTATION Pamir is available at https://github.com/vpc-ccg/pamir. CONTACT fhach@{sfu.ca, prostatecentre.com} or calkan@cs.bilkent.edu.tr. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Pınar Kavak
  - Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
- Yen-Yi Lin
  - School of Computing Science, Simon Fraser University, Burnaby, Canada
- Ibrahim Numanagić
  - School of Computing Science, Simon Fraser University, Burnaby, Canada
- Hossein Asghari
  - School of Computing Science, Simon Fraser University, Burnaby, Canada
- Tunga Güngör
  - Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
- Can Alkan
  - Department of Computer Engineering, Bilkent University, Ankara, Turkey
- Faraz Hach
  - School of Computing Science, Simon Fraser University, Burnaby, Canada
  - Vancouver Prostate Centre, Vancouver, Canada
  - Department of Urologic Sciences, University of British Columbia, Vancouver, Canada
|
17
|
Abstract
Motivation: CYP2D6 is a highly polymorphic gene that encodes the CYP2D6 enzyme, which is involved in the metabolism of 20–25% of all clinically prescribed drugs and other xenobiotics in the human body. CYP2D6 genotyping is recommended prior to treatment decisions involving one or more of the numerous drugs sensitive to CYP2D6 allelic composition. In this context, high-throughput sequencing (HTS) technologies provide a promising time-efficient and cost-effective alternative to currently used genotyping techniques. To achieve accurate interpretation of HTS data, however, one needs to overcome several obstacles such as high sequence similarity and genetic recombinations between CYP2D6 and evolutionarily related pseudogenes CYP2D7 and CYP2D8, high copy number variation among individuals and short read lengths generated by HTS technologies. Results: In this work, we present the first algorithm to computationally infer CYP2D6 genotype at basepair resolution from HTS data. Our algorithm is able to resolve complex genotypes, including alleles that are the products of duplication, deletion and fusion events involving CYP2D6 and its evolutionarily related cousin CYP2D7. Through extensive experiments using simulated and real datasets, we show that our algorithm accurately solves this important problem with potential clinical implications. Availability and implementation: Cypiripi is available at http://sfu-compbio.github.io/cypiripi. Contact: cenk@sfu.ca.
Affiliation(s)
- Ibrahim Numanagić
  - School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada, Department of Medicine, Division of Clinical Pharmacology, Indiana University School of Medicine, Indianapolis, IN 46202, USA and School of Informatics and Computing, Indiana University, Bloomington, IN 47401, USA
- Salem Malikić
  - School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada, Department of Medicine, Division of Clinical Pharmacology, Indiana University School of Medicine, Indianapolis, IN 46202, USA and School of Informatics and Computing, Indiana University, Bloomington, IN 47401, USA
- Victoria M Pratt
  - School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada, Department of Medicine, Division of Clinical Pharmacology, Indiana University School of Medicine, Indianapolis, IN 46202, USA and School of Informatics and Computing, Indiana University, Bloomington, IN 47401, USA
- Todd C Skaar
  - School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada, Department of Medicine, Division of Clinical Pharmacology, Indiana University School of Medicine, Indianapolis, IN 46202, USA and School of Informatics and Computing, Indiana University, Bloomington, IN 47401, USA
- David A Flockhart
  - School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada, Department of Medicine, Division of Clinical Pharmacology, Indiana University School of Medicine, Indianapolis, IN 46202, USA and School of Informatics and Computing, Indiana University, Bloomington, IN 47401, USA
- S Cenk Sahinalp
  - School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada, Department of Medicine, Division of Clinical Pharmacology, Indiana University School of Medicine, Indianapolis, IN 46202, USA and School of Informatics and Computing, Indiana University, Bloomington, IN 47401, USA
|
18
|
Dao P, Numanagić I, Lin YY, Hach F, Karakoc E, Donmez N, Collins C, Eichler EE, Sahinalp SC. ORMAN: optimal resolution of ambiguous RNA-Seq multimappings in the presence of novel isoforms. Bioinformatics 2013; 30:644-51. [PMID: 24130305 DOI: 10.1093/bioinformatics/btt591] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 11/13/2022]
Abstract
MOTIVATION RNA-Seq technology is promising to uncover many novel alternative splicing events, gene fusions and other variations in RNA transcripts. For an accurate detection and quantification of transcripts, it is important to resolve the mapping ambiguity for those RNA-Seq reads that can be mapped to multiple loci: >17% of the reads from mouse RNA-Seq data and 50% of the reads from some plant RNA-Seq data have multiple mapping loci. In this study, we show how to resolve the mapping ambiguity in the presence of novel transcriptomic events such as exon skipping and novel indels towards accurate downstream analysis. We introduce ORMAN (Optimal Resolution of Multimapping Ambiguity of RNA-Seq Reads), which aims to compute the minimum number of potential transcript products for each gene and to assign each multimapping read to one of these transcripts based on the estimated distribution of the region covering the read. ORMAN achieves this objective through a combinatorial optimization formulation, which is solved through well-known approximation algorithms, integer linear programs and heuristics. RESULTS On a simulated RNA-Seq dataset including a random subset of transcripts from the UCSC database, the performance of several state-of-the-art methods for identifying and quantifying novel transcripts, such as Cufflinks, IsoLasso and CLIIQ, is significantly improved through the use of ORMAN. Furthermore, in an experiment using real RNA-Seq reads, we show that ORMAN is able to resolve multimapping to produce coverage values that are similar to the original distribution, even in genes with highly non-uniform coverage. AVAILABILITY ORMAN is available at http://orman.sf.net
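The abstract above describes choosing a minimum number of transcripts that explain the reads and mentions "well-known approximation algorithms" without naming them. One classic algorithm of that flavor is greedy set cover; the sketch below shows it on toy data purely as an illustration, and it should not be read as ORMAN's actual formulation. The function name `greedy_set_cover` and the read and transcript labels are hypothetical.

```python
def greedy_set_cover(universe: set[str], candidates: dict[str, set[str]]) -> list[str]:
    """Pick candidate sets greedily until every element of universe is covered.

    At each step the candidate covering the most still-uncovered elements
    is chosen. This is the standard greedy approximation for set cover,
    used here only as an illustrative stand-in.
    """
    uncovered = set(universe)
    chosen: list[str] = []
    while uncovered:
        # Candidate that explains the largest number of uncovered elements.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("some elements are not covered by any candidate")
        chosen.append(best)
        uncovered -= gained
    return chosen


if __name__ == "__main__":
    # Hypothetical reads and the reads each candidate transcript could explain.
    reads = {"r1", "r2", "r3", "r4"}
    transcripts = {
        "t1": {"r1", "r2", "r3"},
        "t2": {"r3", "r4"},
        "t3": {"r1"},
    }
    print(greedy_set_cover(reads, transcripts))
```

In this toy input, two transcripts suffice to explain all four reads, so the third candidate is never selected. Assigning each multimapping read to one of the selected transcripts according to coverage, as the abstract describes, would be a separate step not shown here.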
Affiliation(s)
- Phuong Dao
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, Department of Genome Sciences, University of Washington, Seattle, WA, USA, Vancouver Prostate Centre & Department of Urologic Sciences, University of British Columbia, Vancouver, BC, Canada and Division of Computer Science, School of Informatics and Computing, Indiana University, Bloomington, IN, USA
|