1
Abdelwahab O, Torkamaneh D. Artificial intelligence in variant calling: a review. Frontiers in Bioinformatics 2025; 5:1574359. [PMID: 40337525 PMCID: PMC12055765 DOI: 10.3389/fbinf.2025.1574359] [Received: 02/10/2025] [Accepted: 04/08/2025] [Indexed: 05/09/2025] Open
Abstract
Artificial intelligence (AI) has revolutionized numerous fields, including genomics, where it has significantly impacted variant calling, a crucial process in genomic analysis. Variant calling involves the detection of genetic variants such as single nucleotide polymorphisms (SNPs), insertions/deletions (InDels), and structural variants from high-throughput sequencing data. Traditionally, statistical approaches have dominated this task, but the advent of AI led to the development of sophisticated tools that promise higher accuracy, efficiency, and scalability. This review explores the state-of-the-art AI-based variant calling tools, including DeepVariant, DNAscope, DeepTrio, Clair, Clairvoyante, Medaka, and HELLO. We discuss their underlying methodologies, strengths, limitations, and performance metrics across different sequencing technologies, alongside their computational requirements, focusing primarily on SNP and InDel detection. By comparing these AI-driven techniques with conventional methods, we highlight the transformative advancements AI has introduced and its potential to further enhance genomic research.
Affiliation(s)
- Omar Abdelwahab
- Département de Phytologie, Université Laval, Québec City, QC, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec City, QC, Canada
- Centre de recherche et d’innovation sur les végétaux (CRIV), Université Laval, Québec City, QC, Canada
- Institut intelligence et données (IID), Université Laval, Québec City, QC, Canada
- Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec City, QC, Canada
- Centre de recherche et d’innovation sur les végétaux (CRIV), Université Laval, Québec City, QC, Canada
- Institut intelligence et données (IID), Université Laval, Québec City, QC, Canada
2
Li Q, Keskus AG, Wagner J, Izydorczyk MB, Timp W, Sedlazeck FJ, Klein AP, Zook JM, Kolmogorov M, Schatz MC. Unraveling the hidden complexity of cancer through long-read sequencing. Genome Res 2025; 35:599-620. [PMID: 40113261 PMCID: PMC12047254 DOI: 10.1101/gr.280041.124] [Indexed: 03/22/2025]
Abstract
Cancer is fundamentally a disease of the genome, characterized by extensive genomic, transcriptomic, and epigenomic alterations. Most current studies predominantly use short-read sequencing, gene panels, or microarrays to explore these alterations; however, these technologies can systematically miss or misrepresent certain types of alterations, especially structural variants, complex rearrangements, and alterations within repetitive regions. Long-read sequencing is rapidly emerging as a transformative technology for cancer research by providing a comprehensive view across the genome, transcriptome, and epigenome, including the ability to detect alterations that previous technologies have overlooked. In this Perspective, we explore the current applications of long-read sequencing for both germline and somatic cancer analysis. We provide an overview of the computational methodologies tailored to long-read data and highlight key discoveries and resources within cancer genomics that were previously inaccessible with prior technologies. We also address future opportunities and persistent challenges, including the experimental and computational requirements needed to scale to larger sample sizes, the hurdles in sequencing and analyzing complex cancer genomes, and opportunities for leveraging machine learning and artificial intelligence technologies for cancer informatics. We further discuss how the telomere-to-telomere genome and the emerging human pangenome could enhance the resolution of cancer genome analysis, potentially revolutionizing early detection and disease monitoring in patients. Finally, we outline strategies for transitioning long-read sequencing from research applications to routine clinical practice.
Affiliation(s)
- Qiuhui Li
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Ayse G Keskus
- Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, Bethesda, Maryland 20892, USA
- Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, USA
- Michal B Izydorczyk
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Winston Timp
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
- Department of Computer Science, Rice University, Houston, Texas 77251, USA
- Alison P Klein
- Sidney Kimmel Comprehensive Cancer Center, Department of Oncology, Johns Hopkins Medicine, Baltimore, Maryland 21031, USA
- Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, USA
- Mikhail Kolmogorov
- Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, Bethesda, Maryland 20892, USA
- Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Sidney Kimmel Comprehensive Cancer Center, Department of Oncology, Johns Hopkins Medicine, Baltimore, Maryland 21031, USA
3
Zheng Z, Ren Y, Chen L, Wong AOK, Li S, Yu X, Lam TW, Luo R. Repun: an accurate small variant representation unification method for multiple sequencing platforms. Brief Bioinform 2024; 26:bbae613. [PMID: 39584701 PMCID: PMC11586763 DOI: 10.1093/bib/bbae613] [Received: 09/19/2024] [Revised: 10/31/2024] [Accepted: 11/11/2024] [Indexed: 11/26/2024] Open
Abstract
Ensuring a unified variant representation aligning the sequencing data is critical for downstream analysis as variant representation may differ across platforms and sequencing conditions. Current approaches typically treat variant unification as a post-step following variant calling and are incapable of measuring the correct variant representation from the outset. Aligning variant representations with the alignment before variant calling has benefits like providing reliable training labels for deep learning-based variant caller model training and enabling direct assessment of alignment quality. However, it also poses challenges due to the large number of candidates to handle. Here, we present Repun, a haplotype-aware variant-alignment unification algorithm that harmonizes the variant representation between provided variants and alignments in different sequencing platforms. Repun leverages phasing to facilitate equivalent haplotype matches between variants and alignments. Our approach reduced the comparisons between variant haplotypes and candidate haplotypes by utilizing haplotypes with read evidence to speed up the unification process. Repun achieved >99.99% precision and > 99.5% recall through extensive evaluations of various Genome in a Bottle Consortium samples encompassing three sequencing platforms: Oxford Nanopore Technology, Pacific Biosciences, and Illumina. Repun is open-source and available at (https://github.com/zhengzhenxian/Repun).
Affiliation(s)
- Zhenxian Zheng
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam Road, Hong Kong, 999077, China
- Yingxuan Ren
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam Road, Hong Kong, 999077, China
- Lei Chen
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam Road, Hong Kong, 999077, China
- Angel On Ki Wong
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam Road, Hong Kong, 999077, China
- Shumin Li
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam Road, Hong Kong, 999077, China
- Xian Yu
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam Road, Hong Kong, 999077, China
- Faculty of Computing, Harbin Institute of Technology, 92 Xidazhi Street, Nangang District, Harbin, Heilongjiang 150001, China
- Tak-Wah Lam
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam Road, Hong Kong, 999077, China
- Ruibang Luo
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam Road, Hong Kong, 999077, China
4
O’Fallon B, Bolia A, Durtschi J, Yang L, Fredrickson E, Best H. Generative haplotype prediction outperforms statistical methods for small variant detection in next-generation sequencing data. Bioinformatics 2024; 40:btae565. [PMID: 39298478 PMCID: PMC11549014 DOI: 10.1093/bioinformatics/btae565] [Received: 03/19/2024] [Revised: 07/12/2024] [Accepted: 09/18/2024] [Indexed: 09/21/2024] Open
Abstract
MOTIVATION: Detection of germline variants in next-generation sequencing data is an essential component of modern genomics analysis. Variant detection tools typically rely on statistical algorithms such as de Bruijn graphs or Hidden Markov models, and are often coupled with heuristic techniques and thresholds to maximize accuracy. Despite significant progress in recent years, current methods still generate thousands of false-positive detections in a typical human whole genome, creating a significant manual review burden. RESULTS: We introduce a new approach that replaces the handcrafted statistical techniques of previous methods with a single deep generative model. Using a standard transformer-based encoder and double-decoder architecture, our model learns to construct diploid germline haplotypes in a generative fashion identical to modern large language models. We train our model on 37 whole genome sequences from Genome-in-a-Bottle samples, and demonstrate that our method learns to produce accurate haplotypes with correct phase and genotype for all classes of small variants. We compare our method, called Jenever, to FreeBayes, GATK HaplotypeCaller, Clair3, and DeepVariant, and demonstrate that our method has superior overall accuracy compared to other methods. At F1-maximizing quality thresholds, our model delivers the highest sensitivity, precision, and the fewest genotyping errors for insertion and deletion variants. For single nucleotide variants, our model demonstrates the highest sensitivity but at somewhat lower precision, and achieves the highest overall F1 score among all callers we tested. AVAILABILITY AND IMPLEMENTATION: Jenever is implemented as a python-based command line tool. Source code is available at https://github.com/ARUP-NGS/jenever/.
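The phrase "F1-maximizing quality thresholds" in the abstract above can be illustrated with a small sketch. This is not Jenever's code or its evaluation pipeline: the function names, the `(quality, is_true_positive)` input format, and the toy data are all assumptions made for illustration. The idea is to sweep a quality cutoff over a caller's output and keep the cutoff where the harmonic mean of precision and recall peaks.

```python
from typing import List, Tuple

# Each call is (quality score, whether it matches the truth set).
# This input shape is a hypothetical simplification of a benchmark comparison.
Call = Tuple[float, bool]


def f1_at_threshold(calls: List[Call], truth_total: int, threshold: float) -> float:
    """F1 score when only calls with quality >= threshold are kept."""
    kept = [is_tp for quality, is_tp in calls if quality >= threshold]
    tp = sum(kept)                      # true positives among kept calls
    precision = tp / len(kept) if kept else 0.0
    recall = tp / truth_total if truth_total else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_threshold(calls: List[Call], truth_total: int) -> float:
    """Return the observed quality value that maximizes F1."""
    candidates = sorted({quality for quality, _ in calls})
    return max(candidates, key=lambda t: f1_at_threshold(calls, truth_total, t))


if __name__ == "__main__":
    toy_calls = [(5, False), (10, False), (20, True), (30, True)]
    print(best_threshold(toy_calls, truth_total=2))
```

Comparing callers at their own F1-maximizing cutoffs, rather than at a single fixed cutoff, avoids penalizing a caller whose quality scores are simply scaled differently.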
Affiliation(s)
- Brendan O’Fallon
- Institute for Research and Innovation, ARUP Labs, Salt Lake City, UT 84108, United States
- Institute for Clinical and Experimental Pathology, ARUP Labs, Salt Lake City, UT 84108, United States
- Ashini Bolia
- Institute for Research and Innovation, ARUP Labs, Salt Lake City, UT 84108, United States
- Jacob Durtschi
- Institute for Research and Innovation, ARUP Labs, Salt Lake City, UT 84108, United States
- Institute for Clinical and Experimental Pathology, ARUP Labs, Salt Lake City, UT 84108, United States
- Luobin Yang
- Institute for Research and Innovation, ARUP Labs, Salt Lake City, UT 84108, United States
- Eric Fredrickson
- Institute for Research and Innovation, ARUP Labs, Salt Lake City, UT 84108, United States
- Hunter Best
- Institute for Research and Innovation, ARUP Labs, Salt Lake City, UT 84108, United States
5
Wang S, Ye K. Deep-learning based representation and recognition for genome variants-from SNVs to structural variants. Natl Sci Rev 2024; 11:nwae335. [PMID: 39606147 PMCID: PMC11601977 DOI: 10.1093/nsr/nwae335] [Received: 06/06/2024] [Revised: 09/13/2024] [Accepted: 09/17/2024] [Indexed: 11/29/2024] Open
Affiliation(s)
- Songbo Wang
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, China
- Kai Ye
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, China
- MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, China
- School of Life Science and Technology, Xi'an Jiaotong University, China
- Faculty of Science, Leiden University, The Netherlands
- Genome Institute, The First Affiliated Hospital of Xi'an Jiaotong University, China
6
Cui M, Liu Y, Yu X, Guo H, Jiang T, Wang Y, Liu B. miniSNV: accurate and fast single nucleotide variant calling from nanopore sequencing data. Brief Bioinform 2024; 25:bbae473. [PMID: 39331016 PMCID: PMC11428505 DOI: 10.1093/bib/bbae473] [Received: 05/14/2024] [Revised: 06/18/2024] [Accepted: 09/12/2024] [Indexed: 09/28/2024] Open
Abstract
Nanopore sequence technology has demonstrated a longer read length and enabled to potentially address the limitations of short-read sequencing including long-range haplotype phasing and accurate variant calling. However, there is still room for improvement in terms of the performance of single nucleotide variant (SNV) identification and computing resource usage for the state-of-the-art approaches. In this work, we introduce miniSNV, a lightweight SNV calling algorithm that simultaneously achieves high performance and yield. miniSNV utilizes known common variants in populations as variation backgrounds and leverages read pileup, read-based phasing, and consensus generation to identify and genotype SNVs for Oxford Nanopore Technologies (ONT) long reads. Benchmarks on real and simulated ONT data under various error profiles demonstrate that miniSNV has superior sensitivity and comparable accuracy on SNV detection and runs faster with outstanding scalability and lower memory than most state-of-the-art variant callers. miniSNV is available from https://github.com/CuiMiao-HIT/miniSNV.
Affiliation(s)
- Miao Cui
- Faculty of Computing, Harbin Institute of Technology, 92 Xidazhi Street, Nangang District, Harbin, Heilongjiang 150001, China
- Yadong Liu
- Faculty of Computing, Harbin Institute of Technology, 92 Xidazhi Street, Nangang District, Harbin, Heilongjiang 150001, China
- Zhengzhou Research Institute, Harbin Institute of Technology, 26 Longyuan East 7th Street, Zhengdong New District, Zhengzhou, Henan 450000, China
- Xian Yu
- Faculty of Computing, Harbin Institute of Technology, 92 Xidazhi Street, Nangang District, Harbin, Heilongjiang 150001, China
- Hongzhe Guo
- Faculty of Computing, Harbin Institute of Technology, 92 Xidazhi Street, Nangang District, Harbin, Heilongjiang 150001, China
- Zhengzhou Research Institute, Harbin Institute of Technology, 26 Longyuan East 7th Street, Zhengdong New District, Zhengzhou, Henan 450000, China
- Tao Jiang
- Faculty of Computing, Harbin Institute of Technology, 92 Xidazhi Street, Nangang District, Harbin, Heilongjiang 150001, China
- Zhengzhou Research Institute, Harbin Institute of Technology, 26 Longyuan East 7th Street, Zhengdong New District, Zhengzhou, Henan 450000, China
- Yadong Wang
- Faculty of Computing, Harbin Institute of Technology, 92 Xidazhi Street, Nangang District, Harbin, Heilongjiang 150001, China
- Zhengzhou Research Institute, Harbin Institute of Technology, 26 Longyuan East 7th Street, Zhengdong New District, Zhengzhou, Henan 450000, China
- Bo Liu
- Faculty of Computing, Harbin Institute of Technology, 92 Xidazhi Street, Nangang District, Harbin, Heilongjiang 150001, China
- Zhengzhou Research Institute, Harbin Institute of Technology, 26 Longyuan East 7th Street, Zhengdong New District, Zhengzhou, Henan 450000, China
7
Rothschild D, Susanto TT, Sui X, Spence JP, Rangan R, Genuth NR, Sinnott-Armstrong N, Wang X, Pritchard JK, Barna M. Diversity of ribosomes at the level of rRNA variation associated with human health and disease. Cell Genomics 2024; 4:100629. [PMID: 39111318 PMCID: PMC11480859 DOI: 10.1016/j.xgen.2024.100629] [Received: 01/18/2024] [Revised: 05/07/2024] [Accepted: 07/14/2024] [Indexed: 09/14/2024]
Abstract
With hundreds of copies of rDNA, it is unknown whether they possess sequence variations that form different types of ribosomes. Here, we developed an algorithm for long-read variant calling, termed RGA, which revealed that variations in human rDNA loci are predominantly insertion-deletion (indel) variants. We developed full-length rRNA sequencing (RIBO-RT) and in situ sequencing (SWITCH-seq), which showed that translating ribosomes possess variation in rRNA. Over 1,000 variants are lowly expressed. However, tens of variants are abundant and form distinct rRNA subtypes with different structures near indels as revealed by long-read rRNA structure probing coupled to dimethyl sulfate sequencing. rRNA subtypes show differential expression in endoderm/ectoderm-derived tissues, and in cancer, low-abundance rRNA variants can become highly expressed. Together, this study identifies the diversity of ribosomes at the level of rRNA variants, their chromosomal location, and unique structure as well as the association of ribosome variation with tissue-specific biology and cancer.
Affiliation(s)
- Daphna Rothschild
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Xin Sui
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Jeffrey P Spence
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Ramya Rangan
- Biophysics Program, Stanford University, Stanford, CA 94305, USA
- Naomi R Genuth
- Department of Genetics, Stanford University, Stanford, CA 94305, USA; Department of Biology, Stanford University, Stanford, CA 94305, USA
- Xiao Wang
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Jonathan K Pritchard
- Department of Genetics, Stanford University, Stanford, CA 94305, USA; Department of Biology, Stanford University, Stanford, CA 94305, USA
- Maria Barna
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
8
Ricci CA, Crysup B, Phillips NR, Ray WC, Santillan MK, Trask AJ, Woerner AE, Goulopoulou S. Machine learning: a new era for cardiovascular pregnancy physiology and cardio-obstetrics research. Am J Physiol Heart Circ Physiol 2024; 327:H417-H432. [PMID: 38847756 PMCID: PMC11442027 DOI: 10.1152/ajpheart.00149.2024] [Received: 03/11/2024] [Revised: 05/31/2024] [Accepted: 05/31/2024] [Indexed: 06/10/2024]
Abstract
The maternal cardiovascular system undergoes functional and structural adaptations during pregnancy and postpartum to support increased metabolic demands of offspring and placental growth, labor, and delivery, as well as recovery from childbirth. Thus, pregnancy imposes physiological stress upon the maternal cardiovascular system, and in the absence of an appropriate response it imparts potential risks for cardiovascular complications and adverse outcomes. The proportion of pregnancy-related maternal deaths from cardiovascular events has been steadily increasing, contributing to high rates of maternal mortality. Despite advances in cardiovascular physiology research, there is still no comprehensive understanding of maternal cardiovascular adaptations in healthy pregnancies. Furthermore, current approaches for the prognosis of cardiovascular complications during pregnancy are limited. Machine learning (ML) offers new and effective tools for investigating mechanisms involved in pregnancy-related cardiovascular complications as well as the development of potential therapies. The main goal of this review is to summarize existing research that uses ML to understand mechanisms of cardiovascular physiology during pregnancy and develop prediction models for clinical application in pregnant patients. We also provide an overview of ML platforms that can be used to comprehensively understand cardiovascular adaptations to pregnancy and discuss the interpretability of ML outcomes, the consequences of model bias, and the importance of ethical consideration in ML use.
Affiliation(s)
- Contessa A Ricci
- College of Nursing, Washington State University, Spokane, Washington, United States
- IREACH: Institute for Research and Education to Advance Community Health, Washington State University, Seattle, Washington, United States
- Elson S. Floyd College of Medicine, Washington State University, Spokane, Washington, United States
- Benjamin Crysup
- Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, Fort Worth, Texas, United States
- Center for Human Identification, University of North Texas Health Science Center, Fort Worth, Texas, United States
- Nicole R Phillips
- Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, Fort Worth, Texas, United States
- William C Ray
- Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio, United States
- Mark K Santillan
- Department of Obstetrics and Gynecology, University of Iowa Carver College of Medicine, Iowa City, Iowa, United States
- Aaron J Trask
- Center for Cardiovascular Research, The Abigail Wexner Research Institute at Nationwide Children's Hospital, Columbus, Ohio, United States
- Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio, United States
- August E Woerner
- Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, Fort Worth, Texas, United States
- Center for Human Identification, University of North Texas Health Science Center, Fort Worth, Texas, United States
- Styliani Goulopoulou
- Lawrence D. Longo Center for Perinatal Biology, Departments of Basic Sciences, Gynecology and Obstetrics, Loma Linda University, Loma Linda, California, United States
9
Junjun R, Zhengqian Z, Ying W, Jialiang W, Yongzhuang L. A comprehensive review of deep learning-based variant calling methods. Brief Funct Genomics 2024; 23:303-313. [PMID: 38366908 DOI: 10.1093/bfgp/elae003] [Received: 12/01/2023] [Revised: 01/14/2024] [Accepted: 01/18/2024] [Indexed: 02/18/2024] Open
Abstract
Genome sequencing data have become increasingly important in the field of personalized medicine and diagnosis. However, accurately detecting genomic variations remains a challenging task. Traditional variation detection methods rely on manual inspection or predefined rules, which can be time-consuming and prone to errors. Consequently, deep learning-based approaches for variation detection have gained attention due to their ability to automatically learn genomic features that distinguish between variants. In our review, we discuss the recent advancements in deep learning-based algorithms for detecting small variations and structural variations in genomic data, as well as their advantages and limitations.
Affiliation(s)
- Ren Junjun
- Harbin Institute of Technology, School of Computer Science and Technology, Harbin 150001, China
- Zhang Zhengqian
- Harbin Institute of Technology, School of Computer Science and Technology, Harbin 150001, China
- Wu Ying
- Harbin Institute of Technology, School of Computer Science and Technology, Harbin 150001, China
- Wang Jialiang
- Harbin Institute of Technology, School of Computer Science and Technology, Harbin 150001, China
- Liu Yongzhuang
- Harbin Institute of Technology, School of Computer Science and Technology, Harbin 150001, China
10
Grochowski CM, Bengtsson JD, Du H, Gandhi M, Lun MY, Mehaffey MG, Park K, Höps W, Benito E, Hasenfeld P, Korbel JO, Mahmoud M, Paulin LF, Jhangiani SN, Hwang JP, Bhamidipati SV, Muzny DM, Fatih JM, Gibbs RA, Pendleton M, Harrington E, Juul S, Lindstrand A, Sedlazeck FJ, Pehlivan D, Lupski JR, Carvalho CMB. Inverted triplications formed by iterative template switches generate structural variant diversity at genomic disorder loci. Cell Genomics 2024; 4:100590. [PMID: 38908378 PMCID: PMC11293582 DOI: 10.1016/j.xgen.2024.100590] [Received: 11/13/2023] [Revised: 12/27/2023] [Accepted: 05/31/2024] [Indexed: 06/24/2024]
Abstract
The duplication-triplication/inverted-duplication (DUP-TRP/INV-DUP) structure is a complex genomic rearrangement (CGR). Although it has been identified as an important pathogenic DNA mutation signature in genomic disorders and cancer genomes, its architecture remains unresolved. Here, we studied the genomic architecture of DUP-TRP/INV-DUP by investigating the DNA of 24 patients identified by array comparative genomic hybridization (aCGH) on whom we found evidence for the existence of 4 out of 4 predicted structural variant (SV) haplotypes. Using a combination of short-read genome sequencing (GS), long-read GS, optical genome mapping, and single-cell DNA template strand sequencing (strand-seq), the haplotype structure was resolved in 18 samples. The point of template switching in 4 samples was shown to be a segment of ∼2.2-5.5 kb of 100% nucleotide similarity within inverted repeat pairs. These data provide experimental evidence that inverted low-copy repeats act as recombinant substrates. This type of CGR can result in multiple conformers generating diverse SV haplotypes in susceptible dosage-sensitive loci.
Affiliation(s)
- Haowei Du
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Mira Gandhi
- Pacific Northwest Research Institute, Seattle, WA 98122, USA
- Ming Yin Lun
- Pacific Northwest Research Institute, Seattle, WA 98122, USA
- KyungHee Park
- Pacific Northwest Research Institute, Seattle, WA 98122, USA
- Wolfram Höps
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Eva Benito
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Patrick Hasenfeld
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Jan O Korbel
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Medhat Mahmoud
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Luis F Paulin
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Shalini N Jhangiani
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- James Paul Hwang
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Sravya V Bhamidipati
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Donna M Muzny
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Jawid M Fatih
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Richard A Gibbs
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Sissel Juul
- Oxford Nanopore Technologies, New York, NY 10013, USA
- Anna Lindstrand
- Department of Molecular Medicine and Surgery, Karolinska Institutet, 171 76 Stockholm, Sweden; Department of Clinical Genetics and Genomics, Karolinska University Hospital, 171 76 Stockholm, Sweden
- Fritz J Sedlazeck
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA; Department of Computer Science, Rice University, Houston, TX 77030, USA
- Davut Pehlivan
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Section of Neurology and Developmental Neuroscience, Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA; Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA; Texas Children's Hospital, Houston, TX 77030, USA; Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston, TX 77030, USA
- James R Lupski
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA; Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA; Texas Children's Hospital, Houston, TX 77030, USA
11
Kramer M, Goodwin S, Wappel R, Borio M, Offit K, Feldman DR, Stadler ZK, McCombie WR. Exploring the genetic and epigenetic underpinnings of early-onset cancers: Variant prioritization for long read whole genome sequencing from family cancer pedigrees. bioRxiv: the preprint server for biology 2024:2024.06.27.601096. [PMID: 39005350 PMCID: PMC11244929 DOI: 10.1101/2024.06.27.601096] [Indexed: 07/16/2024]
Abstract
Despite significant advances in our understanding of genetic cancer susceptibility, known inherited cancer predisposition syndromes explain at most 20% of early-onset cancers. As early-onset cancer prevalence continues to increase, the need to assess previously inaccessible areas of the human genome, harnessing a trio or quad family-based architecture for variant filtration, may reveal further insights into cancer susceptibility. To assess a broader spectrum of variation than can be ascertained by multi-gene panel sequencing, or even whole genome sequencing with short reads, we employed long read whole genome sequencing using an Oxford Nanopore Technology (ONT) PromethION of 3 families containing an early-onset cancer proband using a trio or quad family architecture. Analysis included 2 early-onset colorectal cancer family trios and one quad consisting of two siblings with testicular cancer, all with unaffected parents. Structural variants (SVs), epigenetic profiles and single nucleotide variants (SNVs) were determined for each individual, and a filtering strategy was employed to refine and prioritize candidate variants based on the family architecture. The family architecture enabled us to focus on inapposite variants while filtering variants shared with the unaffected parents, significantly decreasing background variation that can hamper identification of potentially disease causing differences. Candidate de novo and compound heterozygous variants were identified in this way. Gene expression, in matched neoplastic and pre-neoplastic lesions, was assessed for one trio. Our study demonstrates the feasibility of a streamlined analysis of genomic variants from long read ONT whole genome sequencing and a way to prioritize key variants for further evaluation of pathogenicity, while revealing what may be missing from panel based analyses.
12
Kalleberg J, Rissman J, Schnabel RD. Overcoming Limitations to Deep Learning in Domesticated Animals with TrioTrain. bioRxiv 2024:2024.04.15.589602. [PMID: 38659907 PMCID: PMC11042298 DOI: 10.1101/2024.04.15.589602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/26/2024]
Abstract
Variant calling across diverse species remains challenging as most bioinformatics tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a "universal" algorithm has magnified the unknown impacts when used with non-human genomes. Here, we use bovine genomes to assess the limits of human-genome-trained models in other species. We introduce the first multi-species DV model that achieves a lower Mendelian Inheritance Error (MIE) rate during single-sample genotyping. Our novel approach, TrioTrain, automates extending DV for species without Genome In A Bottle (GIAB) resources and uses region shuffling to mitigate barriers for SLURM-based clusters. To offset imperfect truth labels for animal genomes, we remove Mendelian discordant variants before training, where models are tuned to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to build 30 model iterations across five phases. We observe remarkable performance across phases when testing the GIAB human trios with a mean SNP F1 score >0.990. In HG002, our phase 4 bovine model identifies more variants at a lower MIE rate than DeepTrio. In bovine F1-hybrid genomes, our model substantially reduces inheritance errors with a mean MIE rate of 0.03 percent. Although constrained by imperfect labels, we find that multi-species, trio-based training produces a robust variant calling model. Our research demonstrates that exclusively training with human genomes restricts the application of deep-learning approaches for comparative genomics.
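As a rough illustration of the Mendelian Inheritance Error (MIE) rate mentioned in this abstract, the sketch below counts trio genotypes that cannot be explained by one allele from each parent. It is a simplified stand-in, not TrioTrain's implementation: it assumes diploid autosomal sites, unphased genotypes given as allele-index pairs, and no missing calls; the function names are hypothetical.

```python
from typing import List, Tuple

Genotype = Tuple[int, int]  # unphased diploid genotype as allele indices (assumption)
Trio = Tuple[Genotype, Genotype, Genotype]  # (child, mother, father)


def is_mendelian_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """A genotype is consistent if one child allele can come from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def mie_rate(trios: List[Trio]) -> float:
    """Fraction of trio sites whose child genotype violates Mendelian inheritance."""
    if not trios:
        return 0.0
    errors = sum(not is_mendelian_consistent(c, m, f) for c, m, f in trios)
    return errors / len(trios)


# Hypothetical genotypes for illustration only.
example_trios = [
    ((0, 1), (0, 0), (0, 1)),  # consistent: 0 from mother, 1 from father
    ((1, 1), (0, 0), (0, 1)),  # error: mother cannot transmit allele 1
    ((0, 0), (0, 0), (0, 0)),  # consistent
    ((0, 1), (1, 1), (1, 1)),  # error: neither parent carries allele 0
]
example_rate = mie_rate(example_trios)
```

The abstract reports MIE as a percentage; multiplying the returned fraction by 100 gives the same unit.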
Affiliation(s)
- Jenna Kalleberg
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
- Jacob Rissman
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
- Robert D Schnabel
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
  - University of Missouri, Genetics Area Program, Columbia, MO, 65201 USA
13
Barbitoff YA, Ushakov MO, Lazareva TE, Nasykhova YA, Glotov AS, Predeus AV. Bioinformatics of germline variant discovery for rare disease diagnostics: current approaches and remaining challenges. Brief Bioinform 2024; 25:bbad508. [PMID: 38271481 PMCID: PMC10810331 DOI: 10.1093/bib/bbad508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Revised: 11/18/2023] [Accepted: 12/12/2023] [Indexed: 01/27/2024] Open
Abstract
Next-generation sequencing (NGS) has revolutionized the field of rare disease diagnostics. Whole exome and whole genome sequencing are now routinely used for diagnostic purposes; however, the overall diagnosis rate remains lower than expected. In this work, we review current approaches used for calling and interpretation of germline genetic variants in the human genome, and discuss the most important challenges that persist in the bioinformatic analysis of NGS data in medical genetics. We describe and attempt to quantitatively assess the remaining problems, such as the quality of the reference genome sequence, reproducible coverage biases, or variant calling accuracy in complex regions of the genome. We also discuss the prospects of switching to the complete human genome assembly or the human pan-genome and important caveats associated with such a switch. We touch on arguably the hardest problem of NGS data analysis for medical genomics, namely, the annotation of genetic variants and their subsequent interpretation. We highlight the most challenging aspects of annotation and prioritization of both coding and non-coding variants. Finally, we demonstrate the persistent prevalence of pathogenic variants in the coding genome, and outline research directions that may enhance the efficiency of NGS-based disease diagnostics.
Affiliation(s)
- Yury A Barbitoff
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
  - Bioinformatics Institute, Kentemirovskaya st. 2A, 197342, St. Petersburg, Russia
- Mikhail O Ushakov
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Tatyana E Lazareva
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Yulia A Nasykhova
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Andrey S Glotov
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Alexander V Predeus
  - Bioinformatics Institute, Kentemirovskaya st. 2A, 197342, St. Petersburg, Russia
14
De Meulenaere K, Cuypers WL, Gauglitz JM, Guetens P, Rosanas-Urgell A, Laukens K, Cuypers B. Selective whole-genome sequencing of Plasmodium parasites directly from blood samples by nanopore adaptive sampling. mBio 2024; 15:e0196723. [PMID: 38054750 PMCID: PMC10790762 DOI: 10.1128/mbio.01967-23] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/20/2023] [Accepted: 10/20/2023] [Indexed: 12/07/2023] Open
Abstract
IMPORTANCE Malaria is caused by parasites of the genus Plasmodium, and reached a global disease burden of 247 million cases in 2021. To study drug resistance mutations and parasite population dynamics, whole-genome sequencing of patient blood samples is commonly performed. However, the predominance of human DNA in these samples imposes the need for time-consuming laboratory procedures to enrich Plasmodium DNA. We used the Oxford Nanopore Technologies' adaptive sampling feature to circumvent this problem and enrich Plasmodium reads directly during the sequencing run. We demonstrate that adaptive nanopore sequencing efficiently enriches Plasmodium reads, which simplifies and shortens the timeline from blood collection to parasite sequencing. In addition, we show that the obtained data can be used for monitoring genetic markers, or to generate nearly complete genomes. Finally, owing to its inherent mobility, this technology can be easily applied on-site in endemic areas where patients would benefit the most from genomic surveillance.
Affiliation(s)
- Katlijn De Meulenaere
  - Department of Computer Science, Adrem Data Lab, University of Antwerp, Wilrijk, Belgium
  - Department of Biomedical Sciences, Malariology Unit, Institute of Tropical Medicine, Antwerp, Belgium
- Wim L. Cuypers
  - Department of Computer Science, Adrem Data Lab, University of Antwerp, Wilrijk, Belgium
- Julia M. Gauglitz
  - Department of Computer Science, Adrem Data Lab, University of Antwerp, Wilrijk, Belgium
- Pieter Guetens
  - Department of Biomedical Sciences, Malariology Unit, Institute of Tropical Medicine, Antwerp, Belgium
- Anna Rosanas-Urgell
  - Department of Biomedical Sciences, Malariology Unit, Institute of Tropical Medicine, Antwerp, Belgium
- Kris Laukens
  - Department of Computer Science, Adrem Data Lab, University of Antwerp, Wilrijk, Belgium
  - Excellence centre for Microbial Systems Technology, University of Antwerp, Wilrijk, Belgium
- Bart Cuypers
  - Department of Computer Science, Adrem Data Lab, University of Antwerp, Wilrijk, Belgium
  - Excellence centre for Microbial Systems Technology, University of Antwerp, Wilrijk, Belgium
15
Abdelwahab O, Belzile F, Torkamaneh D. Performance analysis of conventional and AI-based variant callers using short and long reads. BMC Bioinformatics 2023; 24:472. [PMID: 38097928 PMCID: PMC10720095 DOI: 10.1186/s12859-023-05596-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 08/30/2023] [Accepted: 12/04/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND The accurate detection of variants is essential for genomics-based studies. Currently, there are various tools designed to detect genomic variants; however, it has always been a challenge to decide which tool to use, especially when various major genome projects have chosen to use different tools. Thus far, most of the existing tools were mainly developed to work on short-read data (i.e., Illumina); however, other sequencing technologies (e.g., PacBio and Oxford Nanopore) have recently shown that they can also be used for variant calling. In addition, with the emergence of artificial intelligence (AI)-based variant calling tools, there is a pressing need to compare these tools in terms of efficiency, accuracy, computational power, and ease of use. RESULTS In this study, we evaluated five of the most widely used conventional and AI-based variant calling tools (BCFTools, GATK4, Platypus, DNAscope, and DeepVariant) in terms of accuracy and computational cost using both short-read and long-read data derived from three different sequencing technologies (Illumina, PacBio HiFi, and ONT) for the same set of samples from the Genome In A Bottle project. The analysis showed that AI-based variant calling tools outperform conventional ones for calling SNVs and INDELs using both long and short reads in most aspects. In addition, we demonstrate the advantages and drawbacks of each tool while ranking them in each aspect of these comparisons. CONCLUSION This study provides best practices for variant calling using AI-based and conventional variant callers with different types of sequencing data.
Affiliation(s)
- Omar Abdelwahab
  - Département de Phytologie, Université Laval, Québec, Canada
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada
  - Centre de recherche et d'innovation sur les végétaux (CRIV), Université Laval, Québec, Canada
  - Institut intelligence et données (IID), Université Laval, Québec, Canada
- François Belzile
  - Département de Phytologie, Université Laval, Québec, Canada
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada
  - Centre de recherche et d'innovation sur les végétaux (CRIV), Université Laval, Québec, Canada
- Davoud Torkamaneh
  - Département de Phytologie, Université Laval, Québec, Canada
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada
  - Centre de recherche et d'innovation sur les végétaux (CRIV), Université Laval, Québec, Canada
  - Institut intelligence et données (IID), Université Laval, Québec, Canada
16
|
Bond DM, Ortega-Recalde O, Laird MK, Hayakawa T, Richardson KS, Reese FCB, Kyle B, McIsaac-Williams BE, Robertson BC, van Heezik Y, Adams AL, Chang WS, Haase B, Mountcastle J, Driller M, Collins J, Howe K, Go Y, Thibaud-Nissen F, Lister NC, Waters PD, Fedrigo O, Jarvis ED, Gemmell NJ, Alexander A, Hore TA. The admixed brushtail possum genome reveals invasion history in New Zealand and novel imprinted genes. Nat Commun 2023; 14:6364. [PMID: 37848431 PMCID: PMC10582058 DOI: 10.1038/s41467-023-41784-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Accepted: 09/13/2023] [Indexed: 10/19/2023] Open
Abstract
Combining genome assembly with population and functional genomics can provide valuable insights into development and evolution, as well as tools for species management. Here, we present a chromosome-level genome assembly of the common brushtail possum (Trichosurus vulpecula), a model marsupial threatened in parts of its native range in Australia, but also a major introduced pest in New Zealand. Functional genomics reveals post-natal activation of chemosensory and metabolic genes, reflecting unique adaptations to altricial birth and delayed weaning, a hallmark of marsupial development. Nuclear and mitochondrial analyses trace New Zealand possums to distinct Australian subspecies, which have subsequently hybridised. This admixture allowed phasing of parental alleles genome-wide, ultimately revealing at least four genes with imprinted, parent-specific expression not yet detected in other species (MLH1, EPM2AIP1, UBP1 and GPX7). We find that reprogramming of possum germline imprints, and the wider epigenome, is similar to eutherian mammals except onset occurs after birth. Together, this work is useful for genetic-based control and conservation of possums, and contributes to understanding of the evolution of novel mammalian epigenetic traits.
Affiliation(s)
- Donna M Bond
  - Department of Anatomy, University of Otago, Dunedin, New Zealand
- Melanie K Laird
  - Department of Anatomy, University of Otago, Dunedin, New Zealand
- Takashi Hayakawa
  - Faculty of Environmental Earth Science, Hokkaido University, Sapporo, Hokkaido, 060-0808, Japan
- Kyle S Richardson
  - Department of Anatomy, University of Otago, Dunedin, New Zealand
  - Biology Department, University of Montana Western, Dillon, MT, 59725, USA
- Finlay C B Reese
  - Department of Anatomy, University of Otago, Dunedin, New Zealand
- Bruce Kyle
  - Department of Anatomy, University of Otago, Dunedin, New Zealand
- Amy L Adams
  - Department of Zoology, University of Otago, Dunedin, New Zealand
- Wei-Shan Chang
  - School of Life and Environmental Science, Faculty of Science, The University of Sydney, Sydney, NSW, Australia
  - Health and Biosecurity, CSIRO, Canberra, ACT, Australia
- Bettina Haase
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Joanna Collins
  - Tree of Life, Wellcome Sanger Institute, Hinxton, Cambridge, UK
- Kerstin Howe
  - Tree of Life, Wellcome Sanger Institute, Hinxton, Cambridge, UK
- Yasuhiro Go
  - Graduate School of Information Science, Hyogo University, Hyogo, Japan
  - Cognitive Genomics Research Group, Exploratory Research Center on Life and Living Systems (ExCELLS), National Institutes of Natural Sciences, Aichi, Japan
  - Department of System Neuroscience, National Institute for Physiological Sciences, Aichi, Japan
- Francoise Thibaud-Nissen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Nicholas C Lister
  - School of Biotechnology and Biomolecular Science, Faculty of Science, UNSW Sydney, Sydney, NSW, 2052, Australia
- Paul D Waters
  - School of Biotechnology and Biomolecular Science, Faculty of Science, UNSW Sydney, Sydney, NSW, 2052, Australia
- Olivier Fedrigo
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Erich D Jarvis
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
  - Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, 10065, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, 20815, USA
- Neil J Gemmell
  - Department of Anatomy, University of Otago, Dunedin, New Zealand
- Alana Alexander
  - Department of Anatomy, University of Otago, Dunedin, New Zealand
- Timothy A Hore
  - Department of Anatomy, University of Otago, Dunedin, New Zealand
17
Majidian S, Agustinho DP, Chin CS, Sedlazeck FJ, Mahmoud M. Genomic variant benchmark: if you cannot measure it, you cannot improve it. Genome Biol 2023; 24:221. [PMID: 37798733 PMCID: PMC10552390 DOI: 10.1186/s13059-023-03061-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 11/28/2022] [Accepted: 09/18/2023] [Indexed: 10/07/2023] Open
Abstract
Genomic benchmark datasets are essential to driving the field of genomics and bioinformatics. They provide a snapshot of the performances of sequencing technologies and analytical methods and highlight future challenges. However, they depend on sequencing technology, reference genome, and available benchmarking methods. Thus, creating a genomic benchmark dataset is laborious and highly challenging, often involving multiple sequencing technologies, different variant calling tools, and laborious manual curation. In this review, we discuss the available benchmark datasets and their utility. Additionally, we focus on the most recent benchmark of genes with medical relevance and challenging genomic complexity.
Affiliation(s)
- Sina Majidian
  - Department of Computational Biology, University of Lausanne, 1015, Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Fritz J Sedlazeck
  - Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, 77030, USA
  - Department of Computer Science, Rice University, 6100 Main Street, Houston, TX, 77005, USA
- Medhat Mahmoud
  - Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
18
Grochowski CM, Bengtsson JD, Du H, Gandhi M, Lun MY, Mehaffey MG, Park K, Höps W, Benito-Garagorri E, Hasenfeld P, Korbel JO, Mahmoud M, Paulin LF, Jhangiani SN, Muzny DM, Fatih JM, Gibbs RA, Pendleton M, Harrington E, Juul S, Lindstrand A, Sedlazeck FJ, Pehlivan D, Lupski JR, Carvalho CMB. Break-induced replication underlies formation of inverted triplications and generates unexpected diversity in haplotype structures. bioRxiv 2023:2023.10.02.560172. [PMID: 37873367 PMCID: PMC10592851 DOI: 10.1101/2023.10.02.560172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
Background The duplication-triplication/inverted-duplication (DUP-TRP/INV-DUP) structure is a type of complex genomic rearrangement (CGR) hypothesized to result from replicative repair of DNA due to replication fork collapse. It is often mediated by a pair of inverted low-copy repeats (LCR) followed by iterative template switches resulting in at least two breakpoint junctions in cis. Although it has been identified as an important mutation signature of pathogenicity for genomic disorders and cancer genomes, its architecture remains unresolved and is predicted to display at least four structural variation (SV) haplotypes. Results Here we studied the genomic architecture of DUP-TRP/INV-DUP by investigating the genomic DNA of 24 patients with neurodevelopmental disorders identified by array comparative genomic hybridization (aCGH), in whom we found evidence for the existence of 4 out of 4 predicted SV haplotypes. Using a combination of short-read genome sequencing (GS), long-read GS, optical genome mapping and StrandSeq, the haplotype structure was resolved in 18 samples. This approach refined the point of template switching between inverted LCRs in 4 samples, revealing a DNA segment of ∼2.2-5.5 kb of 100% nucleotide similarity. A prediction model was developed to infer the LCR used to mediate the non-allelic homology repair. Conclusions These data provide experimental evidence supporting the hypothesis that inverted LCRs act as a recombinant substrate in replication-based repair mechanisms. Such inverted repeats are particularly relevant for formation of copy-number associated inversions, including the DUP-TRP/INV-DUP structures. Moreover, this type of CGR can result in multiple conformers, which contribute to generating diverse SV haplotypes in susceptible loci.
19
Zhang B, Bassani-Sternberg M. Current perspectives on mass spectrometry-based immunopeptidomics: the computational angle to tumor antigen discovery. J Immunother Cancer 2023; 11:e007073. [PMID: 37899131 PMCID: PMC10619091 DOI: 10.1136/jitc-2023-007073] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Accepted: 07/21/2023] [Indexed: 10/31/2023] Open
Abstract
Identification of tumor antigens presented by the human leucocyte antigen (HLA) molecules is essential for the design of effective and safe cancer immunotherapies that rely on T cell recognition and killing of tumor cells. Mass spectrometry (MS)-based immunopeptidomics enables high-throughput, direct identification of HLA-bound peptides from a variety of cell lines, tumor tissues, and healthy tissues. It involves immunoaffinity purification of HLA complexes followed by MS profiling of the extracted peptides using data-dependent acquisition, data-independent acquisition, or targeted approaches. By incorporating DNA, RNA, and ribosome sequencing data into immunopeptidomics data analysis, the proteogenomic approach provides a powerful means for identifying tumor antigens encoded within the canonical open reading frames of annotated coding genes and non-canonical tumor antigens derived from presumably non-coding regions of our genome. We discuss emerging computational challenges in immunopeptidomics data analysis and tumor antigen identification, highlighting key considerations in the proteogenomics-based approach, including accurate DNA, RNA and ribosomal sequencing data analysis, careful incorporation of predicted novel protein sequences into reference protein databases, special quality control in MS data analysis due to the expanded and heterogeneous search space, cancer-specificity determination, and immunogenicity prediction. Advancements in technology and computation are continually enabling us to identify tumor antigens with higher sensitivity and accuracy, paving the way toward the development of more effective cancer immunotherapies.
Affiliation(s)
- Bing Zhang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Michal Bassani-Sternberg
  - Ludwig Institute for Cancer Research, University of Lausanne, Lausanne, Switzerland
  - Department of Oncology, Centre Hospitalier Universitaire Vaudois, Lausanne, Switzerland
  - Agora Cancer Research Centre, Lausanne, Switzerland
20
Zeibich R, Kwan P, O’Brien TJ, Perucca P, Ge Z, Anderson A. Applications for Deep Learning in Epilepsy Genetic Research. Int J Mol Sci 2023; 24:14645. [PMID: 37834093 PMCID: PMC10572791 DOI: 10.3390/ijms241914645] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Revised: 09/11/2023] [Accepted: 09/21/2023] [Indexed: 10/15/2023] Open
Abstract
Epilepsy is a group of brain disorders characterised by an enduring predisposition to generate unprovoked seizures. Fuelled by advances in sequencing technologies and computational approaches, more than 900 genes have now been implicated in epilepsy. The development and optimisation of tools and methods for analysing the vast quantity of genomic data is a rapidly evolving area of research. Deep learning (DL) is a subset of machine learning (ML) that brings opportunity for novel investigative strategies that can be harnessed to gain new insights into the genomic risk of people with epilepsy. DL is being harnessed to address limitations in accuracy of long-read sequencing technologies, which improve on short-read methods. Tools that predict the functional consequence of genetic variation can break new ground in addressing critical knowledge gaps, while methods that integrate independent but complementary data enhance the predictive power of genetic data. We provide an overview of these DL tools and discuss how they may be applied to the analysis of genetic data for epilepsy research.
Affiliation(s)
- Robert Zeibich
  - Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia
- Patrick Kwan
  - Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia
  - Department of Neurology, Alfred Health, Melbourne, VIC 3004, Australia
  - Department of Neurology, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
  - Department of Medicine, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
- Terence J. O’Brien
  - Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia
  - Department of Neurology, Alfred Health, Melbourne, VIC 3004, Australia
  - Department of Neurology, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
  - Department of Medicine, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
- Piero Perucca
  - Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia
  - Department of Neurology, Alfred Health, Melbourne, VIC 3004, Australia
  - Department of Neurology, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
  - Epilepsy Research Centre, Department of Medicine, Austin Health, The University of Melbourne, Melbourne, VIC 3084, Australia
  - Bladin-Berkovic Comprehensive Epilepsy Program, Department of Neurology, Austin Health, The University of Melbourne, Melbourne, VIC 3084, Australia
- Zongyuan Ge
  - Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia
  - Monash-Airdoc Research, Monash University, Melbourne, VIC 3800, Australia
- Alison Anderson
  - Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia
  - Department of Medicine, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
21
Ahsan MU, Liu Q, Perdomo JE, Fang L, Wang K. A survey of algorithms for the detection of genomic structural variants from long-read sequencing data. Nat Methods 2023; 20:1143-1158. [PMID: 37386186 PMCID: PMC11208083 DOI: 10.1038/s41592-023-01932-w] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 07/16/2021] [Accepted: 05/31/2023] [Indexed: 07/01/2023]
Abstract
As long-read sequencing technologies are becoming increasingly popular, a number of methods have been developed for the discovery and analysis of structural variants (SVs) from long reads. Long reads enable detection of SVs that could not be previously detected from short-read sequencing, but computational methods must adapt to the unique challenges and opportunities presented by long-read sequencing. Here, we summarize over 50 long-read-based methods for SV detection, genotyping and visualization, and discuss how new telomere-to-telomere genome assemblies and pangenome efforts can improve the accuracy and drive the development of SV callers in the future.
Affiliation(s)
- Mian Umair Ahsan
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Qian Liu
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Jonathan Elliot Perdomo
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - School of Biomedical Engineering, Drexel University, Philadelphia, PA, USA
- Li Fang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Genetics and Biomedical Informatics, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, China
- Kai Wang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
22
Spealman P, De T, Chuong JN, Gresham D. Best Practices in Microbial Experimental Evolution: Using Reporters and Long-Read Sequencing to Identify Copy Number Variation in Experimental Evolution. J Mol Evol 2023; 91:356-368. [PMID: 37012421 PMCID: PMC10275804 DOI: 10.1007/s00239-023-10102-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2022] [Accepted: 02/21/2023] [Indexed: 04/05/2023]
Abstract
Copy number variants (CNVs), comprising gene amplifications and deletions, are a pervasive class of heritable variation. CNVs play a key role in rapid adaptation in both natural, and experimental, evolution. However, despite the advent of new DNA sequencing technologies, detection and quantification of CNVs in heterogeneous populations has remained challenging. Here, we summarize recent advances in the use of CNV reporters that provide a facile means of quantifying de novo CNVs at a specific locus in the genome, and nanopore sequencing, for resolving the often complex structures of CNVs. We provide guidance for the engineering and analysis of CNV reporters and practical guidelines for single-cell analysis of CNVs using flow cytometry. We summarize recent advances in nanopore sequencing, discuss the utility of this technology, and provide guidance for the bioinformatic analysis of these data to define the molecular structure of CNVs. The combination of reporter systems for tracking and isolating CNV lineages and long-read DNA sequencing for characterizing CNV structures enables unprecedented resolution of the mechanisms by which CNVs are generated and their evolutionary dynamics.
Affiliation(s)
- Pieter Spealman
  - Department of Biology, New York University, New York, NY, 10003, USA
  - Center for Genomics and Systems Biology, New York University, New York, NY, 10003, USA
- Titir De
  - Department of Biology, New York University, New York, NY, 10003, USA
  - Center for Genomics and Systems Biology, New York University, New York, NY, 10003, USA
- Julie N Chuong
  - Department of Biology, New York University, New York, NY, 10003, USA
  - Center for Genomics and Systems Biology, New York University, New York, NY, 10003, USA
- David Gresham
  - Department of Biology, New York University, New York, NY, 10003, USA
  - Center for Genomics and Systems Biology, New York University, New York, NY, 10003, USA
23
Dunn T, Blaauw D, Das R, Narayanasamy S. nPoRe: n-polymer realigner for improved pileup-based variant calling. BMC Bioinformatics 2023; 24:98. [PMID: 36927439 PMCID: PMC10022090 DOI: 10.1186/s12859-023-05193-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/03/2022] [Accepted: 02/19/2023] [Indexed: 03/18/2023] Open
Abstract
Despite recent improvements in nanopore basecalling accuracy, germline variant calling of small insertions and deletions (INDELs) remains poor. Although precision and recall for single nucleotide polymorphisms (SNPs) now exceed 99.5%, INDEL recall remains below 80% for standard R9.4.1 flow cells. We show that read phasing and realignment can recover a significant portion of false negative INDELs. In particular, we extend Needleman-Wunsch affine gap alignment by introducing new gap penalties for more accurately aligning repeated n-polymer sequences such as homopolymers ([Formula: see text]) and tandem repeats ([Formula: see text]). At the same precision, haplotype phasing improves INDEL recall from 63.76 to [Formula: see text] and nPoRe realignment improves it further to [Formula: see text].
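For readers unfamiliar with the baseline that this abstract extends, the following is a minimal sketch of standard affine-gap global alignment scoring (the Gotoh formulation of Needleman-Wunsch). The function name `affine_gap_score` and all scoring values are illustrative assumptions; the n-polymer-specific gap penalties introduced by nPoRe are not reproduced here.

```python
NEG_INF = float("-inf")


def affine_gap_score(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Best global alignment score of a and b under affine gap penalties.

    A gap of length k costs gap_open + k * gap_extend (hypothetical values).
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap pays gap_open + gap_extend; extending pays gap_extend only.
            X[i][j] = max(M[i - 1][j] + gap_open + gap_extend,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open + gap_extend,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these example penalties, one gap of length two (cost -4) is preferred over two separate gaps of length one (cost -6). A uniform penalty of this kind treats every gap alike regardless of sequence context, which is the behaviour the abstract's n-polymer-aware penalties are meant to refine.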
Affiliation(s)
- Tim Dunn
  - University of Michigan, Ann Arbor, USA
24
Senanayake A, Gamaarachchi H, Herath D, Ragel R. DeepSelectNet: deep neural network based selective sequencing for oxford nanopore sequencing. BMC Bioinformatics 2023; 24:31. [PMID: 36709261 PMCID: PMC9883605 DOI: 10.1186/s12859-023-05151-0] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 10/24/2022] [Accepted: 01/17/2023] [Indexed: 01/30/2023] Open
Abstract
BACKGROUND Nanopore sequencing allows selective sequencing, the ability to programmatically reject unwanted reads in a sample. Selective sequencing has many present and future applications in genomics research and the classification of species from a pool of species is an example. Existing methods for selective sequencing for species classification are still immature and the accuracy highly varies depending on the datasets. For the five datasets we tested, the accuracy of existing methods varied in the range of [Formula: see text] 77 to 97% (average accuracy < 89%). Here we present DeepSelectNet, an accurate deep-learning-based method that can directly classify nanopore current signals belonging to a particular species. DeepSelectNet utilizes novel data preprocessing techniques and improved neural network architecture for regularization. RESULTS For the five datasets tested, DeepSelectNet's accuracy varied between [Formula: see text] 91 and 99% (average accuracy [Formula: see text] 95%). At its best performance, DeepSelectNet achieved a nearly 12% accuracy increase compared to its deep learning-based predecessor SquiggleNet. Furthermore, precision and recall evaluated for DeepSelectNet on average were always > 89% (average [Formula: see text] 95%). In terms of execution performance, DeepSelectNet outperformed SquiggleNet by [Formula: see text] 13% on average. Thus, DeepSelectNet is a practically viable method to improve the effectiveness of selective sequencing. CONCLUSIONS Compared to base alignment and deep learning predecessors, DeepSelectNet can significantly improve the accuracy to enable real-time species classification using selective sequencing. The source code of DeepSelectNet is available at https://github.com/AnjanaSenanayake/DeepSelectNet .
Affiliation(s)
- Anjana Senanayake
  - Department of Computer Engineering, University of Peradeniya, Peradeniya, Sri Lanka
- Hasindu Gamaarachchi
  - Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Darlinghurst, Australia
  - School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
- Damayanthi Herath
  - Department of Computer Engineering, University of Peradeniya, Peradeniya, Sri Lanka
- Roshan Ragel
  - Department of Computer Engineering, University of Peradeniya, Peradeniya, Sri Lanka
25
Huang N, Xu M, Nie F, Ni P, Xiao CL, Luo F, Wang J. NanoSNP: a progressive and haplotype-aware SNP caller on low-coverage nanopore sequencing data. Bioinformatics 2023; 39:btac824. [PMID: 36548365 PMCID: PMC9822538 DOI: 10.1093/bioinformatics/btac824] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/13/2022] [Revised: 11/16/2022] [Accepted: 12/21/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Oxford Nanopore sequencing has great potential and advantages in population-scale studies. Due to the cost of sequencing, the depth of whole-genome sequencing for per individual sample must be small. However, the existing single nucleotide polymorphism (SNP) callers are aimed at high-coverage Nanopore sequencing reads. Detecting the SNP variants on low-coverage Nanopore sequencing data is still a challenging problem. RESULTS We developed a novel deep learning-based SNP calling method, NanoSNP, to identify the SNP sites (excluding short indels) based on low-coverage Nanopore sequencing reads. In this method, we design a multi-step, multi-scale and haplotype-aware SNP detection pipeline. First, the pileup model in NanoSNP utilizes the naive pileup feature to predict a subset of SNP sites with a Bi-long short-term memory (LSTM) network. These SNP sites are phased and used to divide the low-coverage Nanopore reads into different haplotypes. Finally, the long-range haplotype feature and short-range pileup feature are extracted from each haplotype. The haplotype model combines two features and predicts the genotype for the candidate site using a Bi-LSTM network. To evaluate the performance of NanoSNP, we compared NanoSNP with Clair, Clair3, Pepper-DeepVariant and NanoCaller on the low-coverage (∼16×) Nanopore sequencing reads. We also performed cross-genome testing on six human genomes HG002-HG007, respectively. Comprehensive experiments demonstrate that NanoSNP outperforms Clair, Pepper-DeepVariant and NanoCaller in identifying SNPs on low-coverage Nanopore sequencing data, including the difficult-to-map regions and major histocompatibility complex regions in the human genome. NanoSNP is comparable to Clair3 when the coverage exceeds 16×. AVAILABILITY AND IMPLEMENTATION https://github.com/huangnengCSU/NanoSNP.git. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Neng Huang
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China
- Minghua Xu
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China
- Fan Nie
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China
- Peng Ni
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China
- Chuan-Le Xiao
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Feng Luo
  - School of Computing, Clemson University, Clemson, SC 29634, USA
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China
26
Wiewiórka M, Szmurło A, Stankiewicz P, Gambin T. Cloud-native distributed genomic pileup operations. Bioinformatics 2022; 39:6900922. [PMID: 36515465 PMCID: PMC9848050 DOI: 10.1093/bioinformatics/btac804] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2022] [Revised: 11/16/2022] [Accepted: 12/13/2022] [Indexed: 12/15/2022] Open
Abstract
MOTIVATION Pileup analysis is a building block of many bioinformatics pipelines, including variant calling and genotyping. This step tends to become a bottleneck of the entire assay since the straightforward pileup implementations involve processing of all base calls from all alignments sequentially. On the other hand, a distributed version of the algorithm faces the intrinsic challenge of splitting reads-oriented file formats into self-contained partitions to avoid costly data exchange between computational nodes. RESULTS Here, we present a scalable, distributed and efficient implementation of a pileup algorithm that is suitable for deploying in cloud computing environments. In particular, we implemented: (i) our custom data-partitioning algorithm optimized to work with the alignment reads, (ii) a novel and unique approach to process alignment events from sequencing reads using the MD tags, (iii) the source code micro-optimizations for recurrent operations, and (iv) a modular structure of the algorithm. We have proven that our novel approach consistently and significantly outperforms other state-of-the-art distributed tools in terms of execution time (up to 6.5× faster) and memory usage (up to 2× less), resulting in a substantial cloud cost reduction. SeQuiLa is a cloud-native solution that can be easily deployed using any managed Kubernetes and Hadoop services available in public clouds, like Microsoft Azure Cloud, Google Cloud Platform, or Amazon Web Services. Together with the already implemented distributed range join and coverage calculations, our package provides end-users with a unified SQL interface for convenient analyses of population-scale genomic data in an interactive way. AVAILABILITY AND IMPLEMENTATION https://biodatageeks.github.io/sequila/.
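To make the pileup operation described in this abstract concrete, here is a minimal single-process sketch that counts bases per reference position by walking each read's CIGAR string. It is not SeQuiLa's distributed implementation: the function name `pileup` and the input tuple layout are assumptions made for illustration, and MD tags, base qualities, and data partitioning are all ignored.

```python
import re
from collections import Counter, defaultdict

# SAM CIGAR operations: a length followed by one operation character.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def pileup(reads):
    """Count bases per reference position.

    reads: iterable of (ref_start, cigar, seq) with 0-based ref_start
    (a hypothetical input layout). Deleted reference positions are
    counted under the symbol '-'.
    """
    columns = defaultdict(Counter)
    for ref_start, cigar, seq in reads:
        ref_pos, read_pos = ref_start, 0
        for length, op in CIGAR_RE.findall(cigar):
            length = int(length)
            if op in "M=X":        # consumes reference and read
                for k in range(length):
                    columns[ref_pos + k][seq[read_pos + k]] += 1
                ref_pos += length
                read_pos += length
            elif op in "IS":       # consumes read only
                read_pos += length
            elif op == "D":        # consumes reference only
                for k in range(length):
                    columns[ref_pos + k]["-"] += 1
                ref_pos += length
            elif op == "N":        # skipped reference region
                ref_pos += length
            # H and P consume neither reference nor read
    return dict(columns)


reads = [(0, "4M", "ACGT"), (1, "1M1D2M", "CTA"), (2, "1S2M", "AGT")]
result = pileup(reads)
```

Every base of every alignment is visited once, in order, which illustrates why the abstract describes straightforward pileup implementations as a sequential bottleneck.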
Affiliation(s)
- Paweł Stankiewicz
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
27
Zheng Z, Li S, Su J, Leung AWS, Lam TW, Luo R. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. NATURE COMPUTATIONAL SCIENCE 2022; 2:797-803. [PMID: 38177392 DOI: 10.1038/s43588-022-00387-x] [Citation(s) in RCA: 113] [Impact Index Per Article: 37.7] [Received: 07/21/2022] [Accepted: 11/30/2022] [Indexed: 01/06/2024]
Abstract
Deep learning-based variant callers are becoming the standard and have achieved superior single nucleotide polymorphisms calling performance using long reads. Here we present Clair3, which leverages two major method categories: pileup calling handles most variant candidates with speed, and full-alignment tackles complicated candidates to maximize precision and recall. Clair3 runs faster than any of the other state-of-the-art variant callers and demonstrates improved performance, especially at lower coverage.
Affiliation(s)
- Zhenxian Zheng
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Shumin Li
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Junhao Su
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Amy Wing-Sze Leung
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Tak-Wah Lam
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Ruibang Luo
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
28
Chander V, Mahmoud M, Hu J, Dardas Z, Grochowski CM, Dawood M, Khayat MM, Li H, Li S, Jhangiani S, Korchina V, Shen H, Weissenberger G, Meng Q, Gingras MC, Muzny DM, Doddapaneni H, Posey JE, Lupski JR, Sabo A, Murdock DR, Sedlazeck FJ, Gibbs RA. Long read sequencing and expression studies of AHDC1 deletions in Xia-Gibbs syndrome reveal a novel genetic regulatory mechanism. Hum Mutat 2022; 43:2033-2053. [PMID: 36054313 PMCID: PMC10167679 DOI: 10.1002/humu.24461] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/02/2022] [Revised: 08/17/2022] [Accepted: 08/30/2022] [Indexed: 01/25/2023]
Abstract
Xia-Gibbs syndrome (XGS; MIM# 615829) is a rare mendelian disorder characterized by Development Delay (DD), intellectual disability (ID), and hypotonia. Individuals with XGS typically harbor de novo protein-truncating mutations in the AT-Hook DNA binding motif containing 1 (AHDC1) gene, although some missense mutations can also cause XGS. Large de novo heterozygous deletions that encompass the AHDC1 gene have also been ascribed as diagnostic for the disorder, without substantial evidence to support their pathogenicity. We analyzed 19 individuals with large contiguous deletions involving AHDC1, along with other genes. One individual bore the smallest known contiguous AHDC1 deletion (∼350 Kb), encompassing eight other genes within chr1p36.11 (Feline Gardner-Rasheed, IFI6, FAM76A, STX12, PPP1R8, THEMIS2, RPA2, SMPDL3B) and terminating within the first intron of AHDC1. The breakpoint junctions and phase of the deletion were identified using both short and long read sequencing (Oxford Nanopore). Quantification of RNA expression patterns in whole blood revealed that AHDC1 exhibited a mono-allelic expression pattern with no deficiency in overall AHDC1 expression levels, in contrast to the other deleted genes, which exhibited a 50% reduction in mRNA expression. These results suggest that AHDC1 expression in this individual is compensated by a novel regulatory mechanism and advances understanding of mutational and regulatory mechanisms in neurodevelopmental disorders.
Affiliation(s)
- Varuna Chander
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Medhat Mahmoud
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Jianhong Hu
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Zain Dardas
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Moez Dawood
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Michael M. Khayat
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- He Li
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Shoudong Li
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Shalini Jhangiani
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Viktoriya Korchina
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Hua Shen
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Qingchang Meng
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Marie-Claude Gingras
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Donna M. Muzny
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Harsha Doddapaneni
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Jennifer E. Posey
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- James R. Lupski
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
  - Texas Children’s Hospital, Houston, Texas, USA
  - Department of Pediatrics, Baylor College of Medicine, Houston, Texas, USA
- Aniko Sabo
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- David R. Murdock
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Fritz J. Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
  - Department of Computer Science, Rice University, Houston, Texas, USA
- Richard A. Gibbs
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
29
Holt GS, Batty LE, Alobaidi BKS, Smith HE, Oud MS, Ramos L, Xavier MJ, Veltman JA. Phasing of de novo mutations using a scaled-up multiple amplicon long-read sequencing approach. Hum Mutat 2022; 43:1545-1556. [PMID: 36047340 PMCID: PMC9826063 DOI: 10.1002/humu.24450] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/06/2022] [Revised: 08/11/2022] [Accepted: 08/18/2022] [Indexed: 01/11/2023]
Abstract
De novo mutations (DNMs) play an important role in severe genetic disorders that reduce fitness. To better understand their role in disease, it is important to determine the parent-of-origin and timing of mutational events that give rise to these mutations, especially in sex-specific developmental disorders such as male infertility. However, currently available short-read sequencing approaches are not ideally suited for phasing, as this requires long continuous DNA strands that span both the DNM and one or more informative single-nucleotide polymorphisms. To overcome these challenges, we optimized and implemented a multiplexed long-read sequencing approach using Oxford Nanopore technologies MinION platform. We focused on improving target amplification, integrating long-read sequenced data with high-quality short-read sequence data, and developing an anchored phasing computational method. This approach handled the inherent phasing challenges of long-range target amplification and the normal accumulation of sequencing error associated with long-read sequencing. In total, 77 of 109 DNMs (71%) were successfully phased and parent-of-origin identified. The majority of phased DNMs were prezygotic (90%), the accuracy of which is highlighted by an average mutant allele frequency of 49.6% and standard error of 0.84%. This study demonstrates the benefits of employing an integrated short-read and long-read sequencing approach for large-scale DNM phasing.
Affiliation(s)
- Giles S. Holt
  - Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
- Lois E. Batty
  - Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
- Bilal K. S. Alobaidi
  - Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
- Hannah E. Smith
  - Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
- Manon S. Oud
  - Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboudumc, Nijmegen, The Netherlands
- Liliana Ramos
  - Department of Obstetrics and Gynecology, Division of Reproductive Medicine, Radboudumc, Nijmegen, The Netherlands
- Miguel J. Xavier
  - Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
- Joris A. Veltman
  - Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
30
NanoCross: A pipeline that detecting recombinant crossover using ONT sequencing data. Genomics 2022; 114:110499. [PMID: 36174880 DOI: 10.1016/j.ygeno.2022.110499] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/06/2022] [Revised: 08/30/2022] [Accepted: 09/25/2022] [Indexed: 01/14/2023]
Abstract
Meiotic recombination is crucial for eukaryotes but varies among taxonomic scales (between individuals, groups, species, etc.) and genome resolutions. Studying how and why recombination rates change can help us understand the molecular basis and mechanisms of genetics and evolution. We introduce a genome-wide identification script called NanoCross, which uses ONT sequences to detect pooled gamete DNA cross recombination events. NanoCross first reduced sequencing errors and then constructed individual haplotypes based on homopolymer-filtered ONT sequences. Then, each molecule read is used to estimate cross recombination. In the case of moderate heterozygous variation density and sequencing depth, simulations revealed that our technique offers a good level of sensitivity and specificity. We constructed a high-resolution recombination map of wild boar genomes using NanoCross and compared it to recombination maps of male breeding pig populations. NanoCross provides us with a method and scripts for constructing a high-resolution individual genome recombination map utilizing long-read sequencing, as well as a novel approach for examining the variation in individual recombination rate. The source code and data mechanism are hosted on GitHub (https://github.com/zuoquanchen/NanoCross).
31
Increased volatile thiol release during beer fermentation using constructed interspecies yeast hybrids. Eur Food Res Technol 2022. [DOI: 10.1007/s00217-022-04132-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Abstract
Interspecies hybridization has been shown to be a powerful tool for developing and improving brewing yeast in a number of industry-relevant respects. Thanks to the popularity of heavily hopped ‘India Pale Ale’-style beers, there is an increased demand from brewers for strains that can boost hop aroma. Here, we explored whether hybridization could be used to construct strains with an enhanced ability to release hop-derived flavours through β-lyase activity, which releases desirable volatile thiols. Wild Saccharomyces strains were shown to possess high β-lyase activity compared to brewing strains; however, they also produced phenolic off-flavours (POF) and showed poor attenuation. To overcome these limitations, interspecies hybrids were constructed by crossing pairs of one of three brewing and one of three wild Saccharomyces strains (S. uvarum and S. eubayanus). Hybrids were screened for fermentation ability and β-lyase activity, and selected hybrids showed improved fermentation and formation of both volatile thiols (4MMP, 3MH and 3MH-acetate) and aroma-active esters compared to the parent strains. Undesirable traits (e.g. POF) could be removed from the hybrid by sporulation. To conclude, it was possible to boost the release of desirable hop-derived thiols in brewing yeast by hybridization with wild yeast. This allows production of beer with boosted hop aroma with less hops (thus improving sustainability issues).
32
Su J, Zheng Z, Ahmed SS, Lam TW, Luo R. Clair3-trio: high-performance Nanopore long-read variant calling in family trios with trio-to-trio deep neural networks. Brief Bioinform 2022; 23:bbac301. [PMID: 35849103 PMCID: PMC9487642 DOI: 10.1093/bib/bbac301] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/04/2022] [Revised: 06/16/2022] [Accepted: 07/02/2022] [Indexed: 11/14/2022] Open
Abstract
Accurate identification of genetic variants from family child-mother-father trio sequencing data is important in genomics. However, state-of-the-art approaches treat variant calling from trios as three independent tasks, which limits their calling accuracy for Nanopore long-read sequencing data. For better trio variant calling, we introduce Clair3-Trio, the first variant caller tailored for family trio data from Nanopore long-reads. Clair3-Trio employs a Trio-to-Trio deep neural network model, which allows it to input the trio sequencing information and output all of the trio's predicted variants within a single model to improve variant calling. We also present MCVLoss, a novel loss function tailor-made for variant calling in trios, leveraging the explicit encoding of the Mendelian inheritance. Clair3-Trio showed comprehensive improvement in experiments. It predicted far fewer Mendelian inheritance violation variations than current state-of-the-art methods. We also demonstrated that our Trio-to-Trio model is more accurate than competing architectures. Clair3-Trio is accessible as a free, open-source project at https://github.com/HKU-BAL/Clair3-Trio.
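The Mendelian inheritance violations mentioned in this abstract can be illustrated with a small check on diploid genotypes. This is a sketch under simplifying assumptions (autosomal sites, no missing genotypes, no allowance for true de novo mutations); the function names are hypothetical, and the code is not Clair3-Trio's MCVLoss, which is a training loss rather than a post hoc filter.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed from one maternal and one paternal allele.

    Genotypes are pairs of allele indices, e.g. (0, 1) for a heterozygous site.
    """
    target = tuple(sorted(child))
    return any(tuple(sorted((m, f))) == target for m, f in product(mother, father))


def count_violations(trio_calls):
    """Count sites whose (child, mother, father) genotypes violate Mendelian inheritance."""
    return sum(not is_mendelian_consistent(c, m, f) for c, m, f in trio_calls)


calls = [
    ((0, 1), (0, 0), (1, 1)),  # consistent: 0 from mother, 1 from father
    ((1, 1), (0, 0), (0, 1)),  # violation: mother cannot transmit allele 1
    ((0, 0), (0, 1), (0, 1)),  # consistent
]
violations = count_violations(calls)
```

Calling each family member independently gives no guarantee that the three genotypes at a site pass this check, which is the limitation the abstract attributes to treating a trio as three separate tasks.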
Affiliation(s)
- Junhao Su
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Zhenxian Zheng
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Syed Shakeel Ahmed
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Tak-Wah Lam
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Ruibang Luo
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
33
Evaluation of the Available Variant Calling Tools for Oxford Nanopore Sequencing in Breast Cancer. Genes (Basel) 2022; 13:genes13091583. [PMID: 36140751 PMCID: PMC9498802 DOI: 10.3390/genes13091583] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/10/2022] [Revised: 08/30/2022] [Accepted: 08/31/2022] [Indexed: 11/23/2022] Open
Abstract
The goal of biomarker testing, in the field of personalized medicine, is to guide treatments to achieve the best possible results for each patient. The accurate and reliable identification of each individual's genome variants is essential for the success of clinical genomics employing third-generation sequencing. Different variant calling techniques have been used and recommended by both Oxford Nanopore Technologies (ONT) and the Nanopore community. A thorough examination of the variant callers might give critical guidance for third-generation sequencing-based clinical genomics. In this study, two reference genome sample datasets (NA12878 and NA24385) and the set of high-confidence variant calls provided by the Genome in a Bottle (GIAB) consortium were used to evaluate the performance of six variant calling tools (Human-SNP-wf, Clair3, Clair, NanoCaller, Longshot, and Medaka) as an integral step in the in-house variant detection workflow. Of the six variant callers under study, Clair3 and Human-SNP-wf, which incorporates Clair3, achieved the highest performance. Results were expressed in terms of precision, recall, and F1-score, using hap.py for the comparison. In conclusion, our findings give important insights for identifying accurate variants from third-generation sequencing of personal genomes using different variant detection tools available for long-read sequencing.
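The three metrics named in this abstract follow directly from counts of true positives (TP), false positives (FP), and false negatives (FN) against a truth set. The sketch below uses a hypothetical helper name and made-up counts; it shows only the arithmetic, not how a benchmarking tool matches variant representations before counting.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from variant-level counts.

    precision = TP / (TP + FP), recall = TP / (TP + FN),
    F1 = harmonic mean of precision and recall. Empty denominators yield 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts for illustration only: 90 correct calls, 10 spurious, 30 missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

Because F1 is a harmonic mean, a caller with high precision but poor recall (or the reverse) scores lower than one that balances the two.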
34
Makhoul M, Chawla HS, Wittkop B, Stahl A, Voss-Fels KP, Zetzsche H, Snowdon RJ, Obermeier C. Long-Amplicon Single-Molecule Sequencing Reveals Novel, Trait-Associated Variants of VERNALIZATION1 Homoeologs in Hexaploid Wheat. FRONTIERS IN PLANT SCIENCE 2022; 13:942461. [PMID: 36420025 PMCID: PMC9676936 DOI: 10.3389/fpls.2022.942461] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/12/2022] [Accepted: 06/03/2022] [Indexed: 05/26/2023]
Abstract
The gene VERNALIZATION1 (VRN1) is a key controller of vernalization requirement in wheat. The genome of hexaploid wheat (Triticum aestivum) harbors three homoeologous VRN1 loci on chromosomes 5A, 5B, and 5D. Structural sequence variants including small and large deletions and insertions and single nucleotide polymorphisms (SNPs) in the three homoeologous VRN1 genes not only play an important role in the control of vernalization requirement, but also have been reported to be associated with other yield related traits of wheat. Here we used single-molecule sequencing of barcoded long-amplicons to assay the full-length sequences (∼13 kbp plus 700 bp from the promoter sequence) of the three homoeologous VRN1 genes in a panel of 192 predominantly European winter wheat cultivars. Long read sequences revealed previously undetected duplications, insertions and single-nucleotide polymorphisms in the three homoeologous VRN1 genes. All the polymorphisms were confirmed by Sanger sequencing. Sequence analysis showed the predominance of the winter alleles vrn-A1, vrn-B1, and vrn-D1 across the investigated cultivars. Associations of SNPs and structural variations within the three VRN1 genes with 20 economically relevant traits including yield, nodal root-angle index and quality related traits were evaluated at the levels of alleles, haplotypes, and copy number variants. Cultivars carrying structural variants within VRN1 genes showed lower grain yield, protein yield and biomass compared to those with intact genes. Cultivars carrying a single vrn-A1 copy and a unique haplotype with a high number of SNPs were found to have elevated grain yield, kernels per spike and kernels per m2 along with lower grain sedimentation values. In addition, we detected a novel SNP polymorphism within the G-quadruplex region of the promoter of vrn-A1 that was associated with deeper roots in winter wheat. 
Our findings show that multiplex, single-molecule long-amplicon sequencing is a useful tool for detecting variants in target genes within large plant populations, and can be used to simultaneously assay sequence variants among multiple target gene homoeologs in polyploid crops. Numerous novel VRN1 haplotypes and alleles were identified that showed significant associations with economically important traits. These polymorphisms were converted into PCR or KASP assays for use in marker-assisted breeding.
Collapse
Affiliation(s)
- Manar Makhoul
- Department of Plant Breeding, Justus Liebig University Giessen, Giessen, Germany
| | - Harmeet S. Chawla
- Department of Plant Breeding, Justus Liebig University Giessen, Giessen, Germany
- Department of Plant Sciences, Crop Development Centre, University of Saskatchewan, Saskatoon, SK, Canada
| | - Benjamin Wittkop
- Department of Plant Breeding, Justus Liebig University Giessen, Giessen, Germany
| | - Andreas Stahl
- Institute for Resistance Research and Stress Tolerance, Julius Kühn Institute, Quedlinburg, Germany
| | - Kai Peter Voss-Fels
- Institute for Grapevine Breeding, Hochschule Geisenheim University, Geisenheim, Germany
| | - Holger Zetzsche
- Institute for Resistance Research and Stress Tolerance, Julius Kühn Institute, Quedlinburg, Germany
| | - Rod J. Snowdon
- Department of Plant Breeding, Justus Liebig University Giessen, Giessen, Germany
| | - Christian Obermeier
- Department of Plant Breeding, Justus Liebig University Giessen, Giessen, Germany
|
35
|
Toffoli M, Chen X, Sedlazeck FJ, Lee CY, Mullin S, Higgins A, Koletsi S, Garcia-Segura ME, Sammler E, Scholz SW, Schapira AHV, Eberle MA, Proukakis C. Comprehensive short and long read sequencing analysis for the Gaucher and Parkinson's disease-associated GBA gene. Commun Biol 2022; 5:670. [PMID: 35794204 PMCID: PMC9259685 DOI: 10.1038/s42003-022-03610-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 12/28/2021] [Accepted: 06/21/2022] [Indexed: 11/30/2022] Open
Abstract
Carriers of GBA variants are at increased risk of Parkinson's disease (PD) and Lewy body dementia (LBD). The presence of the pseudogene GBAP1 predisposes to structural variants, complicating genetic analysis. We present two methods to resolve recombinant alleles and other variants in GBA: Gauchian, a tool for short-read, whole-genome sequencing data analysis, and Oxford Nanopore sequencing after PCR enrichment. Both methods were concordant for 42 samples carrying a range of recombinants and GBAP1-related mutations, and Gauchian outperformed the GATK Best Practices pipeline. Applying Gauchian to sequencing of over 10,000 individuals shows that copy number variants (CNVs) spanning GBAP1 are relatively common in Africans. CNV frequencies in PD and LBD are similar to controls. Gains may coexist with other mutations in patients, and a modifying effect cannot be excluded. Gauchian detects more GBA variants in LBD than PD, especially severe ones. These findings highlight the importance of accurate GBA analysis in these patients.
Affiliation(s)
- Marco Toffoli
- Department of Clinical and Movement Neurosciences, Queen Square Institute of Neurology, University College London, London, NW3 2PF, United Kingdom
| | - Xiao Chen
- Illumina Inc., San Diego, CA, USA
- Pacific Biosciences, 1305 O'Brien Dr., Menlo Park, CA, 94025, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Chiao-Yin Lee
- Department of Clinical and Movement Neurosciences, Queen Square Institute of Neurology, University College London, London, NW3 2PF, United Kingdom
| | - Stephen Mullin
- Department of Clinical and Movement Neurosciences, Queen Square Institute of Neurology, University College London, London, NW3 2PF, United Kingdom
- Institute of Translational and Stratified Medicine, University of Plymouth School of Medicine, Plymouth, United Kingdom
| | - Abigail Higgins
- Department of Clinical and Movement Neurosciences, Queen Square Institute of Neurology, University College London, London, NW3 2PF, United Kingdom
| | - Sofia Koletsi
- Department of Clinical and Movement Neurosciences, Queen Square Institute of Neurology, University College London, London, NW3 2PF, United Kingdom
| | - Monica Emili Garcia-Segura
- Department of Clinical and Movement Neurosciences, Queen Square Institute of Neurology, University College London, London, NW3 2PF, United Kingdom
| | - Esther Sammler
- MRC Protein Phosphorylation and Ubiquitylation Unit, School of Life Sciences, University of Dundee, Dundee, United Kingdom
- Molecular and Clinical Medicine, School of Medicine, University of Dundee, Dundee, United Kingdom
| | - Sonja W Scholz
- Neurodegenerative Diseases Research Unit, National Institute of Neurological Disorders and Stroke, Bethesda, MD, 20892, USA
- Department of Neurology, Johns Hopkins University Medical Center, Baltimore, MD, 21287, USA
| | - Anthony H V Schapira
- Department of Clinical and Movement Neurosciences, Queen Square Institute of Neurology, University College London, London, NW3 2PF, United Kingdom
| | - Michael A Eberle
- Illumina Inc., San Diego, CA, USA.
- Pacific Biosciences, 1305 O'Brien Dr., Menlo Park, CA, 94025, USA.
| | - Christos Proukakis
- Department of Clinical and Movement Neurosciences, Queen Square Institute of Neurology, University College London, London, NW3 2PF, United Kingdom.
|
36
|
Yang H, Gu F, Zhang L, Hua XS. Using generative adversarial networks for genome variant calling from low depth ONT sequencing data. Sci Rep 2022; 12:8725. [PMID: 35637238 PMCID: PMC9151722 DOI: 10.1038/s41598-022-12346-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/25/2021] [Accepted: 05/10/2022] [Indexed: 11/21/2022] Open
Abstract
Genome variant calling is a challenging yet critical task for subsequent studies. Almost all existing methods rely on high-depth DNA sequencing data, and their performance drops considerably on low-depth data. Using public human Oxford Nanopore (ONT) data from the Genome in a Bottle (GIAB) Consortium, we trained a generative adversarial network for low-depth variant calling. Our method, denoted LDV-Caller, can project high-depth sequencing information from low-depth data. It achieves a 94.25% F1 score on low-depth data, while the F1 score of the state-of-the-art method on data with twice the depth is 94.49%. This could substantially reduce the cost of genome-wide sequencing. In addition, we validated the trained LDV-Caller model on 157 public severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) samples. The mean sequencing depth of these samples is 2982. LDV-Caller yields a 92.77% F1 score using only 22× sequencing depth, which demonstrates that our method has the potential to analyze different species with only low-depth sequencing data.
|
37
|
Lüth T, Schaake S, Grünewald A, May P, Trinh J, Weissensteiner H. Benchmarking Low-Frequency Variant Calling With Long-Read Data on Mitochondrial DNA. Front Genet 2022; 13:887644. [PMID: 35664331 PMCID: PMC9161029 DOI: 10.3389/fgene.2022.887644] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Accepted: 04/18/2022] [Indexed: 11/13/2022] Open
Abstract
Background: Sequencing quality has improved over the last decade for long-reads, allowing for more accurate detection of somatic low-frequency variants. In this study, we used mixtures of mitochondrial samples with different haplogroups (i.e., a specific set of mitochondrial variants) to investigate the applicability of nanopore sequencing for low-frequency single nucleotide variant detection. Methods: We investigated the impact of base-calling, alignment/mapping, quality control steps, and variant calling by comparing the results to a previously derived short-read gold standard generated on the Illumina NextSeq. For nanopore sequencing, six mixtures of four different haplotypes were prepared, allowing us to reliably check for expected variants at the predefined 5%, 2%, and 1% mixture levels. We used two different versions of Guppy for base-calling, two aligners (i.e., Minimap2 and Ngmlr), and three variant callers (i.e., Mutserve2, Freebayes, and Nanopanel2) to compare low-frequency variants. We used F1 score measurements to assess the performance of variant calling. Results: We observed a mean read length of 11 kb and a mean overall read quality of 15. Ngmlr showed not only higher F1 scores but also higher allele frequencies (AF) of false-positive calls across the mixtures (mean F1 score = 0.83; false-positive AF < 0.17) compared to Minimap2 (mean F1 score = 0.82; false-positive AF < 0.06). Mutserve2 had the highest F1 scores (5% level: F1 score >0.99, 2% level: F1 score >0.54, and 1% level: F1 score >0.70) across all callers and mixture levels. Conclusion: Here we present a benchmark of low-frequency variant calling with nanopore sequencing and identify current limitations.
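The F1 score used in the abstract above to rank callers is the harmonic mean of precision and recall against a set of expected variants. A minimal Python sketch, assuming each variant is represented as a (position, alternate allele) tuple; the function name `f1_score` and the example variants are hypothetical and are not taken from the study:

```python
def f1_score(expected, called):
    """Return (precision, recall, F1) for called variants
    compared against the expected (truth) variants."""
    expected, called = set(expected), set(called)
    tp = len(expected & called)   # true positives: expected and called
    fp = len(called - expected)   # false positives: called but not expected
    fn = len(expected - called)   # false negatives: expected but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example data: (position, alternate allele)
expected = {(73, "G"), (263, "G"), (750, "G"), (1438, "G")}
called = {(73, "G"), (263, "G"), (750, "G"), (3010, "A")}
precision, recall, f1 = f1_score(expected, called)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 weights false positives and false negatives equally, a caller that misses expected low-frequency variants is penalized as much as one that reports spurious ones.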
Affiliation(s)
- Theresa Lüth
- Institute of Neurogenetics, University of Lübeck and University Hospital Schleswig-Holstein, Lübeck, Germany
| | - Susen Schaake
- Institute of Neurogenetics, University of Lübeck and University Hospital Schleswig-Holstein, Lübeck, Germany
| | - Anne Grünewald
- Institute of Neurogenetics, University of Lübeck and University Hospital Schleswig-Holstein, Lübeck, Germany
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belvaux, Luxembourg
| | - Patrick May
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belvaux, Luxembourg
| | - Joanne Trinh
- Institute of Neurogenetics, University of Lübeck and University Hospital Schleswig-Holstein, Lübeck, Germany
- *Correspondence: Joanne Trinh; Hansi Weissensteiner
| | - Hansi Weissensteiner
- Institute of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria
- *Correspondence: Joanne Trinh; Hansi Weissensteiner
|
38
|
Olson ND, Wagner J, McDaniel J, Stephens SH, Westreich ST, Prasanna AG, Johanson E, Boja E, Maier EJ, Serang O, Jáspez D, Lorenzo-Salazar JM, Muñoz-Barrera A, Rubio-Rodríguez LA, Flores C, Kyriakidis K, Malousi A, Shafin K, Pesout T, Jain M, Paten B, Chang PC, Kolesnikov A, Nattestad M, Baid G, Goel S, Yang H, Carroll A, Eveleigh R, Bourgey M, Bourque G, Li G, Ma C, Tang L, Du Y, Zhang S, Morata J, Tonda R, Parra G, Trotta JR, Brueffer C, Demirkaya-Budak S, Kabakci-Zorlu D, Turgut D, Kalay Ö, Budak G, Narcı K, Arslan E, Brown R, Johnson IJ, Dolgoborodov A, Semenyuk V, Jain A, Tetikol HS, Jain V, Ruehle M, Lajoie B, Roddey C, Catreux S, Mehio R, Ahsan MU, Liu Q, Wang K, Ebrahim Sahraeian SM, Fang LT, Mohiyuddin M, Hung C, Jain C, Feng H, Li Z, Chen L, Sedlazeck FJ, Zook JM. PrecisionFDA Truth Challenge V2: Calling variants from short and long reads in difficult-to-map regions. CELL GENOMICS 2022; 2:S2666-979X(22)00058-1. [PMID: 35720974 PMCID: PMC9205427 DOI: 10.1016/j.xgen.2022.100129] [Citation(s) in RCA: 85] [Impact Index Per Article: 28.3] [Received: 12/07/2020] [Revised: 11/01/2021] [Accepted: 04/08/2022] [Indexed: 11/19/2022]
Abstract
The precisionFDA Truth Challenge V2 aimed to assess the state of the art of variant calling in challenging genomic regions. Starting with FASTQs, 20 challenge participants applied their variant-calling pipelines and submitted 64 variant call sets for one or more sequencing technologies (Illumina, PacBio HiFi, and Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with updated Genome in a Bottle benchmark sets and genome stratifications. Challenge submissions included numerous innovative methods, with graph-based and machine learning methods scoring best for short-read and long-read datasets, respectively. With machine learning approaches, combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.
Affiliation(s)
- Nathan D. Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
| | - Jennifer McDaniel
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
| | | | | | | | - Elaine Johanson
- Office of Health Informatics, Office of the Chief Scientist, Office of the Commissioner, US Food and Drug Administration, Silver Spring, MD, USA
| | - Emily Boja
- Office of Health Informatics, Office of the Chief Scientist, Office of the Commissioner, US Food and Drug Administration, Silver Spring, MD, USA
| | - Ezekiel J. Maier
- Booz Allen Hamilton, 8283 Greensboro Drive, Mclean, VA 22102, USA
| | - Omar Serang
- DNAnexus, Inc., 1975 W El Camino Real #204, Mountain View, CA 94040, USA
| | - David Jáspez
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
| | - José M. Lorenzo-Salazar
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
| | - Adrián Muñoz-Barrera
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
| | - Luis A. Rubio-Rodríguez
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
| | - Carlos Flores
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
- CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain
- Research Unit, Hospital Universitario N.S. de Candelaria, Santa Cruz de Tenerife, Spain
- Instituto de Tecnologías Biomédicas (ITB), Universidad de La Laguna, 38200 San Cristóbal de La Laguna, Spain
| | - Konstantinos Kyriakidis
- School of Pharmacy, Aristotle University of Thessaloniki (AUTH), 541 24 Thessaloniki, Greece
- Genomics and Epigenomics Translational Research (GENeTres), Center for Interdisciplinary Research and Innovation, 570 01 Thessaloniki, Greece
| | - Andigoni Malousi
- Genomics and Epigenomics Translational Research (GENeTres), Center for Interdisciplinary Research and Innovation, 570 01 Thessaloniki, Greece
- Laboratory of Biological Chemistry, School of Medicine, Aristotle University of Thessaloniki (AUTH), 541 24 Thessaloniki, Greece
| | - Kishwar Shafin
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA, USA
| | - Trevor Pesout
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA, USA
| | - Miten Jain
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA, USA
| | - Pi-Chuan Chang
- Google Inc, 1600 Amphitheater Pkwy, Mountain View, CA 94040, USA
| | | | - Maria Nattestad
- Google Inc, 1600 Amphitheater Pkwy, Mountain View, CA 94040, USA
| | - Gunjan Baid
- Google Inc, 1600 Amphitheater Pkwy, Mountain View, CA 94040, USA
| | - Sidharth Goel
- Google Inc, 1600 Amphitheater Pkwy, Mountain View, CA 94040, USA
| | - Howard Yang
- Google Inc, 1600 Amphitheater Pkwy, Mountain View, CA 94040, USA
| | - Andrew Carroll
- Google Inc, 1600 Amphitheater Pkwy, Mountain View, CA 94040, USA
| | - Robert Eveleigh
- The Canadian Center for Computational Genomics (C3G), Montréal, QC, Canada
| | - Mathieu Bourgey
- The Canadian Center for Computational Genomics (C3G), Montréal, QC, Canada
| | - Guillaume Bourque
- The Canadian Center for Computational Genomics (C3G), Montréal, QC, Canada
| | - Gen Li
- HuXinDao, QingZhuHu TaiYangShan Road, KaiFu, ChangSha, HuNan, China
| | - ChouXian Ma
- HuXinDao, QingZhuHu TaiYangShan Road, KaiFu, ChangSha, HuNan, China
| | - LinQi Tang
- HuXinDao, QingZhuHu TaiYangShan Road, KaiFu, ChangSha, HuNan, China
| | - YuanPing Du
- HuXinDao, QingZhuHu TaiYangShan Road, KaiFu, ChangSha, HuNan, China
| | - ShaoWei Zhang
- HuXinDao, QingZhuHu TaiYangShan Road, KaiFu, ChangSha, HuNan, China
| | - Jordi Morata
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Raúl Tonda
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Genís Parra
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Jean-Rémi Trotta
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Christian Brueffer
- Division of Oncology, Department of Clinical Sciences, Lund University, Lund, Sweden
| | | | | | - Deniz Turgut
- Seven Bridges Genomics, Inc, Charlestown, MA, USA
| | - Özem Kalay
- Seven Bridges Genomics, Inc, Charlestown, MA, USA
| | - Gungor Budak
- Seven Bridges Genomics, Inc, Charlestown, MA, USA
| | - Kübra Narcı
- Seven Bridges Genomics, Inc, Charlestown, MA, USA
| | - Elif Arslan
- Seven Bridges Genomics, Inc, Charlestown, MA, USA
| | | | | | | | | | - Amit Jain
- Seven Bridges Genomics, Inc, Charlestown, MA, USA
| | | | | | | | | | | | | | | | - Mian Umair Ahsan
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Qian Liu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | | | - Li Tai Fang
- Roche Sequencing Solutions, Santa Clara, CA 95050, USA
| | | | | | - Chirag Jain
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | | | | | - Fritz J. Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | - Justin M. Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
|
39
|
Miller DE, Lee L, Galey M, Kandhaya-Pillai R, Tischkowitz M, Amalnath D, Vithlani A, Yokote K, Kato H, Maezawa Y, Takada-Watanabe A, Takemoto M, Martin GM, Eichler EE, Hisama FM, Oshima J. Targeted long-read sequencing identifies missing pathogenic variants in unsolved Werner syndrome cases. J Med Genet 2022; 59:jmedgenet-2022-108485. [PMID: 35534204 PMCID: PMC9613861 DOI: 10.1136/jmedgenet-2022-108485] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 02/05/2022] [Accepted: 04/14/2022] [Indexed: 11/05/2022]
Abstract
BACKGROUND Werner syndrome (WS) is an autosomal recessive progeroid syndrome caused by variants in WRN. The International Registry of Werner Syndrome has identified biallelic pathogenic variants in 179/188 cases of classical WS. In the remaining nine cases, only one heterozygous pathogenic variant has been identified. METHODS Targeted long-read sequencing (T-LRS) on an Oxford Nanopore platform was used to search for a second pathogenic variant in WRN. Previously, T-LRS was successfully used to identify missing variants and analyse complex rearrangements. RESULTS We identified a second pathogenic variant in eight of nine unsolved WS cases. In five cases, T-LRS identified intronic splice variants that were confirmed by either RT-PCR or exon trapping to affect splicing; in one case, T-LRS identified a 339 kbp deletion, and in two cases, pathogenic missense variants. Phasing of long reads predicted all newly identified variants were on a different haplotype than the previously known variant. Finally, in one case, RT-PCR previously identified skipping of exon 20; however, T-LRS did not detect a pathogenic DNA sequence variant. CONCLUSION T-LRS is an effective method for identifying missing pathogenic variants. Although limitations with computational prediction algorithms can hinder the interpretation of variants, T-LRS is particularly effective in identifying intronic variants.
Affiliation(s)
- Danny E Miller
- Department of Pediatrics, Division of Genetic Medicine, University of Washington, Seattle, Washington, USA
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, USA
| | - Lin Lee
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington, USA
| | - Miranda Galey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, USA
| | - Renuka Kandhaya-Pillai
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington, USA
| | - Marc Tischkowitz
- Department of Medical Genetics, National Institute for Health Research Cambridge Biomedical Research Centre, University of Cambridge, Cambridge, UK
| | - Deepak Amalnath
- Department of Medicine, Jawaharlal Institute of Postgraduate Medical Education and Research, Puducherry, India
| | - Avadh Vithlani
- Department of Medicine, Jawaharlal Institute of Postgraduate Medical Education and Research, Puducherry, India
| | - Koutaro Yokote
- Department of Endocrinology, Hematology and Gerontology, Chiba University Graduate School of Medicine, Chiba, Japan
| | - Hisaya Kato
- Department of Endocrinology, Hematology and Gerontology, Chiba University Graduate School of Medicine, Chiba, Japan
| | - Yoshiro Maezawa
- Department of Endocrinology, Hematology and Gerontology, Chiba University Graduate School of Medicine, Chiba, Japan
| | - Aki Takada-Watanabe
- Department of Endocrinology, Hematology and Gerontology, Chiba University Graduate School of Medicine, Chiba, Japan
| | - Minoru Takemoto
- Department of Diabetes, Metabolism and Endocrinology, International University of Health and Welfare, Otawara, Japan
| | - George M Martin
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, Washington, USA
| | - Fuki M Hisama
- Department of Medicine, Division of Medical Genetics, University of Washington, Seattle, Washington, USA
| | - Junko Oshima
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington, USA
- Department of Endocrinology, Hematology and Gerontology, Chiba University Graduate School of Medicine, Chiba, Japan
|
40
|
Lang J, Sun J, Yang Z, He L, He Y, Chen Y, Huang L, Li P, Li J, Qin L. Nano2NGS-Muta: a framework for converting nanopore sequencing data to NGS-liked sequencing data for hotspot mutation detection. NAR Genom Bioinform 2022; 4:lqac033. [PMID: 35464239 PMCID: PMC9022462 DOI: 10.1093/nargab/lqac033] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/14/2021] [Revised: 03/30/2022] [Accepted: 04/13/2022] [Indexed: 12/12/2022] Open
Abstract
Nanopore sequencing, also known as single-molecule real-time sequencing, is a third/fourth generation sequencing technology that enables deciphering single DNA/RNA molecules without the polymerase chain reaction. Although nanopore sequencing has made significant progress in scientific research and clinical practice, its application has been limited compared with next-generation sequencing (NGS) due to its specific design principles and data characteristics, especially in hotspot mutation detection. Therefore, we developed Nano2NGS-Muta as a data analysis framework for hotspot mutation detection based on long reads from nanopore sequencing. Nano2NGS-Muta is characterized by applying nanopore sequencing data to NGS-like data analysis pipelines. Long reads can be converted into short reads and then processed through existing NGS analysis pipelines in combination with statistical methods for hotspot mutation detection. Nano2NGS-Muta not only effectively avoids false positive/negative results caused by non-random errors and unexpected insertions-deletions (indels) of nanopore sequencing data and improves the detection accuracy of hotspot mutations compared to conventional nanopore sequencing data analysis algorithms, but also breaks the barriers of data analysis methods between short-read sequencing and long-read sequencing. We hope Nano2NGS-Muta can serve as a reference method for nanopore sequencing data and promote a broader application scope of nanopore sequencing technology in scientific research and clinical practice.
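The abstract above does not say how long reads are converted into short reads. One way such a conversion could look is a sliding window over each read; the sketch below is an illustration only, not the Nano2NGS-Muta implementation. The function name `fragment_read`, the 150 bp fragment length, and the 75 bp step are all assumptions, and base qualities are ignored.

```python
def fragment_read(sequence, fragment_len=150, step=75):
    """Split one long read into overlapping fixed-length fragments.

    fragment_len and step are assumed values for illustration,
    not parameters reported in the abstract.
    """
    if len(sequence) <= fragment_len:
        return [sequence]
    starts = list(range(0, len(sequence) - fragment_len + 1, step))
    # Add a final window so the tail of the read is not dropped.
    if starts[-1] + fragment_len < len(sequence):
        starts.append(len(sequence) - fragment_len)
    return [sequence[s:s + fragment_len] for s in starts]


long_read = "ACGT" * 100          # a 400 bp toy read
fragments = fragment_read(long_read)
print(len(fragments), {len(f) for f in fragments})
```

Overlapping windows mean each base is covered by more than one fragment, which is a design choice of this sketch rather than something the source confirms.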
Affiliation(s)
- Jidong Lang
- Bioinformatics and Product Development Department, Qitan Technology (Beijing) Co., Ltd, Beijing 100192, China
| | - Jiguo Sun
- Bioinformatics and Product Development Department, Qitan Technology (Beijing) Co., Ltd, Beijing 100192, China
| | - Zhi Yang
- Bioinformatics and Product Development Department, Qitan Technology (Beijing) Co., Ltd, Beijing 100192, China
| | - Lei He
- Bioinformatics and Product Development Department, Qitan Technology (Beijing) Co., Ltd, Beijing 100192, China
| | - Yu He
- Bioinformatics and Product Development Department, Qitan Technology (Beijing) Co., Ltd, Beijing 100192, China
| | - Yanmei Chen
- Bioinformatics and Product Development Department, Qitan Technology (Beijing) Co., Ltd, Beijing 100192, China
| | - Lei Huang
- Bioinformatics and Product Development Department, Qitan Technology (Beijing) Co., Ltd, Beijing 100192, China
| | - Ping Li
- Bioinformatics and Product Development Department, Qitan Technology (Beijing) Co., Ltd, Beijing 100192, China
| | - Jialin Li
- Bioinformatics and Product Development Department, Qitan Technology (Beijing) Co., Ltd, Beijing 100192, China
| | - Liu Qin
- Bioinformatics and Product Development Department, Qitan Technology (Beijing) Co., Ltd, Beijing 100192, China
|
41
|
Clinical Metagenomic Sequencing for Species Identification and Antimicrobial Resistance Prediction in Orthopedic Device Infection. J Clin Microbiol 2022; 60:e0215621. [PMID: 35354286 PMCID: PMC9020354 DOI: 10.1128/jcm.02156-21] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 12/13/2022] Open
Abstract
Diagnosis of orthopedic device-related infection is challenging, and causative pathogens may be difficult to culture. Metagenomic sequencing can diagnose infections without culture, but attempts to detect antimicrobial resistance (AMR) determinants using metagenomic data have been less successful. Human DNA depletion may maximize the amount of microbial DNA sequence data available for analysis. Human DNA depletion by saponin was tested in 115 sonication fluid samples generated following revision arthroplasty surgery, comprising 67 where pathogens were detected by culture and 48 culture-negative samples. Metagenomic sequencing was performed on the Oxford Nanopore Technologies GridION platform. Filtering thresholds for detection of true species versus contamination or taxonomic misclassification were determined. Mobile and chromosomal genetic AMR determinants were identified in Staphylococcus aureus-positive samples. Of 114 samples generating sequence data, species-level positive percent agreement between metagenomic sequencing and culture was 50/65 (77%; 95% confidence interval [CI], 65 to 86%) and negative percent agreement was 103/114 (90%; 95% CI, 83 to 95%). Saponin treatment reduced the proportion of human bases sequenced in comparison to 5-μm filtration from a median (interquartile range [IQR]) of 98.1% (87.0% to 99.9%) to 11.9% (0.4% to 67.0%), improving reference genome coverage at a 10-fold depth from 18.7% (0.30% to 85.7%) to 84.3% (12.9% to 93.8%). Metagenomic sequencing predicted 13/15 (87%) resistant and 74/74 (100%) susceptible phenotypes where sufficient data were available for analysis. Metagenomic nanopore sequencing coupled with human DNA depletion has the potential to detect AMR in addition to species detection in orthopedic device-related infection. Further work is required to develop pathogen-agnostic human DNA depletion methods, improving AMR determinant detection and allowing its application to other infection types.
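The percent agreement figures in the abstract above are proportions with 95% confidence intervals. A minimal Python sketch of that calculation for the reported 50/65 positive percent agreement follows. It assumes a Wilson score interval; the abstract does not state which interval method the authors used, so the output need not match the reported bounds exactly. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion
    (z = 1.96 corresponds to approximately 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Counts taken from the abstract: 50 of 65 species-level agreements.
agreements, comparisons = 50, 65
ppa = agreements / comparisons
low, high = wilson_interval(agreements, comparisons)
print(f"PPA = {ppa:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

The same function applies to the negative percent agreement and to the resistance prediction proportions by substituting the corresponding counts.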
|
42
|
Liu Y, Kearney J, Mahmoud M, Kille B, Sedlazeck FJ, Treangen TJ. Rescuing low frequency variants within intra-host viral populations directly from Oxford Nanopore sequencing data. Nat Commun 2022; 13:1321. [PMID: 35288552 PMCID: PMC8921239 DOI: 10.1038/s41467-022-28852-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 08/26/2021] [Accepted: 02/10/2022] [Indexed: 12/28/2022] Open
Abstract
Infectious disease monitoring on Oxford Nanopore Technologies (ONT) platforms offers rapid turnaround times and low cost. Tracking low frequency intra-host variants provides important insights with respect to elucidating within-host viral population dynamics and transmission. However, given the higher error rate of ONT, accurate identification of intra-host variants with low allele frequencies remains an open challenge with no viable computational solutions available. In response to this need, we present Variabel, a novel approach and first method designed for rescuing low frequency intra-host variants from ONT data alone. We evaluate Variabel on both synthetic data (SARS-CoV-2) and patient derived datasets (Ebola virus, norovirus, SARS-CoV-2); our results show that Variabel can accurately identify low frequency variants below 0.5 allele frequency, outperforming existing state-of-the-art ONT variant callers for this task. Variabel is open-source and available for download at: www.gitlab.com/treangenlab/variabel. Tracking low frequency intra-host variants has helped in understanding within-host viral population dynamics and transmission. Precise tracking, however, depends partially on the error rate of the sequencing platforms used. Here, Liu et al. present Variabel, a method to rescue low frequency intra-host variants from Oxford Nanopore Technologies (ONT) platforms and validate their approach on Ebola virus, norovirus, and SARS-CoV-2 datasets.
|
43
|
Leung AWS, Leung HCM, Wong CL, Zheng ZX, Lui WW, Luk HM, Lo IFM, Luo R, Lam TW. ECNano: A cost-effective workflow for target enrichment sequencing and accurate variant calling on 4800 clinically significant genes using a single MinION flowcell. BMC Med Genomics 2022; 15:43. [PMID: 35246132 PMCID: PMC8895767 DOI: 10.1186/s12920-022-01190-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/26/2021] [Accepted: 02/22/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The application of long-read sequencing using the Oxford Nanopore Technologies (ONT) MinION sequencer is becoming more diverse in the medical field. However, the high sequencing error rate of ONT and the limited throughput of a single MinION flowcell limit its applicability for accurate variant detection. Medical exome sequencing (MES) targets clinically significant exon regions, allowing rapid and comprehensive screening of pathogenic variants. By applying MES with MinION sequencing, the technology can achieve a more uniform capture of the target regions, shorter turnaround time, and lower sequencing cost per sample. METHOD We introduced a cost-effective optimized workflow, ECNano, comprising a wet-lab protocol and bioinformatics analysis, for accurate variant detection in 4800 clinically important genes and regions using a single MinION flowcell. The ECNano wet-lab protocol was optimized to perform long-read target enrichment and ONT library preparation to stably generate high-quality MES data with adequate coverage. The subsequent variant-calling workflow, Clair-ensemble, adopted a fast RNN-based variant caller, Clair, and was optimized for target enrichment data. To evaluate its performance and practicality, ECNano was tested on both reference DNA samples and patient samples. RESULTS ECNano achieved deep on-target depth of coverage (DoC) at average > 100× and > 98% uniformity using one MinION flowcell. For accurate ONT variant calling, the generated reads sufficiently covered 98.9% of pathogenic positions listed in ClinVar, with 98.96% having at least 30× DoC. ECNano obtained an average read length of 1000 bp. The long reads of ECNano also covered the adjacent splice sites well, with 98.5% of positions having ≥ 30× DoC. Clair-ensemble achieved > 99% recall and accuracy for SNV calling. The whole workflow from wet-lab protocol to variant detection was completed within three days.
CONCLUSION We presented ECNano, an out-of-the-box workflow comprising (1) a wet-lab protocol for ONT target enrichment sequencing and (2) a downstream variant detection workflow, Clair-ensemble. The workflow is cost-effective, with a short turnaround time for high accuracy variant calling in 4800 clinically significant genes and regions using a single MinION flowcell. The long-read exon captured data has potential for further development, promoting the application of long-read sequencing in personalized disease treatment and risk prediction.
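The depth-of-coverage figures quoted in this abstract (for example, the share of positions with at least 30× DoC) reduce to a threshold count over per-position depths. The following is a minimal sketch under that reading, not the ECNano implementation; the function name `fraction_at_depth` and the depth values are hypothetical and chosen only for illustration.

```python
from typing import Iterable


def fraction_at_depth(depths: Iterable[int], min_depth: int = 30) -> float:
    """Return the fraction of positions whose depth is >= min_depth."""
    depth_list = list(depths)
    if not depth_list:
        raise ValueError("no positions supplied")
    # Count positions that meet or exceed the depth threshold.
    covered = sum(1 for d in depth_list if d >= min_depth)
    return covered / len(depth_list)


# Made-up per-position depths (illustration only, not data from the paper).
example_depths = [120, 95, 31, 30, 12, 0, 210, 45]
print(f"{fraction_at_depth(example_depths):.3f}")
```

In practice the per-position depths would come from an alignment file restricted to the target regions; that step is outside the scope of this sketch.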
Affiliation(s)
- Amy Wing-Sze Leung
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Chak-Lim Wong
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Zhen-Xian Zheng
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Wui-Wang Lui
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Ho-Ming Luk
  - Department of Health, Clinical Genetic Service, Hong Kong, SAR, China
- Ivan Fai-Man Lo
  - Department of Health, Clinical Genetic Service, Hong Kong, SAR, China
- Ruibang Luo
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Tak-Wah Lam
  - Department of Computer Science, The University of Hong Kong, Hong Kong, China
44
Steinig E, Duchêne S, Aglua I, Greenhill A, Ford R, Yoannes M, Jaworski J, Drekore J, Urakoko B, Poka H, Wurr C, Ebos E, Nangen D, Manning L, Laman M, Firth C, Smith S, Pomat W, Tong SYC, Coin L, McBryde E, Horwood P. Phylodynamic Inference of Bacterial Outbreak Parameters Using Nanopore Sequencing. Mol Biol Evol 2022; 39:msac040. [PMID: 35171290 PMCID: PMC8963328 DOI: 10.1093/molbev/msac040] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/16/2022] Open
Abstract
Nanopore sequencing and phylodynamic modeling have been used to reconstruct the transmission dynamics of viral epidemics, but their application to bacterial pathogens has remained challenging. Cost-effective bacterial genome sequencing and variant calling on nanopore platforms would greatly enhance surveillance and outbreak response in communities without access to sequencing infrastructure. Here, we adapt random forest models for single nucleotide polymorphism (SNP) polishing developed by Sanderson and colleagues (2020. High precision Neisseria gonorrhoeae variant and antimicrobial resistance calling from metagenomic nanopore sequencing. Genome Res. 30(9):1354-1363) to estimate divergence and effective reproduction numbers (Re) of two methicillin-resistant Staphylococcus aureus (MRSA) outbreaks from remote communities in Far North Queensland and Papua New Guinea (PNG; n = 159). Successive barcoded panels of S. aureus isolates (2 × 12 per MinION) sequenced at low coverage (>5× to 10×) provided sufficient data to accurately infer genotypes with high recall when compared with Illumina references. Random forest models achieved high resolution on ST93 outbreak sequence types (>90% accuracy and precision) and enabled phylodynamic inference of epidemiological parameters using birth-death skyline models. Our method reproduced phylogenetic topology, origin of the outbreaks, and indications of epidemic growth (Re > 1). Nextflow pipelines implement SNP polisher training, evaluation, and outbreak alignments, enabling reconstruction of within-lineage transmission dynamics for infection control of bacterial disease outbreaks on portable nanopore platforms. Our study shows that nanopore technology can be used for bacterial outbreak reconstruction at competitive costs, providing opportunities for infection control in hospitals and communities without access to sequencing infrastructure, such as in remote northern Australia and PNG.
Affiliation(s)
- Eike Steinig
  - Department of Infectious Diseases, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
  - Australian Institute of Tropical Health and Medicine, James Cook University, Townsville and Cairns, Australia
- Sebastián Duchêne
  - Department of Infectious Diseases, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
- Izzard Aglua
  - Joseph Nombri Memorial-Kundiawa General Hospital, Kundiawa, Papua New Guinea
- Andrew Greenhill
  - Papua New Guinea Institute of Medical Research, Goroka, Papua, Papua New Guinea
- Rebecca Ford
  - Papua New Guinea Institute of Medical Research, Goroka, Papua, Papua New Guinea
- Mition Yoannes
  - Papua New Guinea Institute of Medical Research, Goroka, Papua, Papua New Guinea
- Jan Jaworski
  - Joseph Nombri Memorial-Kundiawa General Hospital, Kundiawa, Papua New Guinea
- Jimmy Drekore
  - Simbu Children's Foundation, Kundiawa, Papua New Guinea
- Bohu Urakoko
  - Joseph Nombri Memorial-Kundiawa General Hospital, Kundiawa, Papua New Guinea
- Harry Poka
  - Joseph Nombri Memorial-Kundiawa General Hospital, Kundiawa, Papua New Guinea
- Clive Wurr
  - Surgical Department, Goroka General Hospital, Goroka, Papua New Guinea
- Eri Ebos
  - Surgical Department, Goroka General Hospital, Goroka, Papua New Guinea
- David Nangen
  - Surgical Department, Goroka General Hospital, Goroka, Papua New Guinea
- Laurens Manning
  - Department of Infectious Diseases, Fiona Stanley Hospital, Murdoch, Australia
  - Medical School, University of Western Australia, Harry Perkins Research Institute, Fiona Stanley Hospital, Murdoch, Australia
- Moses Laman
  - Papua New Guinea Institute of Medical Research, Goroka, Papua, Papua New Guinea
- Cadhla Firth
  - Australian Institute of Tropical Health and Medicine, James Cook University, Townsville and Cairns, Australia
- Simon Smith
  - Cairns Hospital and Hinterland Health Service, Queensland Health, Cairns, Australia
- William Pomat
  - Papua New Guinea Institute of Medical Research, Goroka, Papua, Papua New Guinea
- Steven Y C Tong
  - Department of Infectious Diseases, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
  - Victorian Infectious Diseases Service, The Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
- Lachlan Coin
  - Department of Infectious Diseases, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
- Emma McBryde
  - Australian Institute of Tropical Health and Medicine, James Cook University, Townsville and Cairns, Australia
- Paul Horwood
  - Papua New Guinea Institute of Medical Research, Goroka, Papua, Papua New Guinea
  - College of Public Health, Medical & Veterinary Sciences, James Cook University, Townsville, Australia
45
Barbitoff YA, Abasov R, Tvorogova VE, Glotov AS, Predeus AV. Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics 2022; 23:155. [PMID: 35193511 PMCID: PMC8862519 DOI: 10.1186/s12864-022-08365-3] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 07/29/2021] [Accepted: 02/03/2022] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND: Accurate variant detection in the coding regions of the human genome is a key requirement for molecular diagnostics of Mendelian disorders. Efficiency of variant discovery from next-generation sequencing (NGS) data depends on multiple factors, including reproducible coverage biases of NGS methods and the performance of read alignment and variant calling software. Although variant caller benchmarks are published constantly, no previous publications have leveraged the full extent of available gold standard whole-genome (WGS) and whole-exome (WES) sequencing datasets.
RESULTS: In this work, we systematically evaluated the performance of 4 popular short read aligners (Bowtie2, BWA, Isaac, and Novoalign) and 9 novel and well-established variant calling and filtering methods (Clair3, DeepVariant, Octopus, GATK, FreeBayes, and Strelka2) using a set of 14 "gold standard" WES and WGS datasets available from Genome In A Bottle (GIAB) consortium. Additionally, we have indirectly evaluated each pipeline's performance using a set of 6 non-GIAB samples of African and Russian ethnicity. In our benchmark, Bowtie2 performed significantly worse than other aligners, suggesting it should not be used for medical variant calling. When other aligners were considered, the accuracy of variant discovery mostly depended on the variant caller and not the read aligner. Among the tested variant callers, DeepVariant consistently showed the best performance and the highest robustness. Other actively developed tools, such as Clair3, Octopus, and Strelka2, also performed well, although their efficiency had greater dependence on the quality and type of the input data. We have also compared the consistency of variant calls in GIAB and non-GIAB samples. With few important caveats, best-performing tools have shown little evidence of overfitting.
CONCLUSIONS: The results show surprisingly large differences in the performance of cutting-edge tools even in high confidence regions of the coding genome. This highlights the importance of regular benchmarking of quickly evolving tools and pipelines. We also discuss the need for a more diverse set of gold standard genomes that would include samples of African, Hispanic, or mixed ancestry. Additionally, there is also a need for better variant caller assessment in the repetitive regions of the coding genome.
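Benchmarks of this kind compare a caller's output against a gold-standard truth set and summarise the agreement as precision, recall, and F1. The sketch below illustrates that arithmetic only; it is not the pipeline used in the study. It assumes variants can be matched exactly on (chromosome, position, ref, alt), which is a simplification because real comparisons must also reconcile differing variant representations. The names `benchmark` and `Metrics` and all variant records are hypothetical.

```python
from typing import NamedTuple, Set, Tuple

# A variant keyed as (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    precision: float
    recall: float
    f1: float


def benchmark(calls: Set[Variant], truth: Set[Variant]) -> Metrics:
    """Compare called variants with a truth set using exact matching."""
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(precision, recall, f1)


# Made-up records for illustration only.
truth_set = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
             ("chr2", 75, "G", "GA"), ("chr2", 900, "T", "C")}
call_set = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
            ("chr2", 75, "G", "GA"), ("chr3", 10, "A", "T")}
print(benchmark(call_set, truth_set))
```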
Affiliation(s)
- Yury A Barbitoff
  - Bioinformatics Institute, St. Petersburg, Russia
  - Department of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology and Reproductology, St. Petersburg, Russia
  - Department of Genetics and Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Ruslan Abasov
  - Bioinformatics Institute, St. Petersburg, Russia
  - Dmitry Rogachev National Research Center of Pediatric Hematology-Oncology and Immunology, Moscow, Russia
- Varvara E Tvorogova
  - Bioinformatics Institute, St. Petersburg, Russia
  - Department of Genetics and Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Andrey S Glotov
  - Department of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology and Reproductology, St. Petersburg, Russia
46
Lou H, Gao Y, Xie B, Wang Y, Zhang H, Shi M, Ma S, Zhang X, Liu C, Xu S. Haplotype-resolved de novo assembly of a Tujia genome suggests the necessity for high-quality population-specific genome references. Cell Syst 2022; 13:321-333.e6. [PMID: 35180379 DOI: 10.1016/j.cels.2022.01.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/14/2021] [Revised: 11/09/2021] [Accepted: 01/27/2022] [Indexed: 12/17/2022]
Abstract
Even though the human reference genome assembly is continually being improved, it remains debatable whether a population-specific reference is necessary for every ethnic group. Here, we de novo assembled an individual genome (TJ1) from the Tujia population, an ethnic minority group most closely related to the Han Chinese. TJ1 provided a high-quality haplotype-resolved assembly of chromosome-scale with a scaffold N50 size >78 Mb. Compared with GRCh38 and other de novo assemblies, TJ1 improved short-read mapping, enhanced calling precision for structural variants, and detected rare and low-frequency variants. This revealed fine-scale differences between the closely related Han and Tujia populations, such as population-stratified variants of LCT and UBXN8, and improved screening for ancestry informative markers. We demonstrated that TJ1 could reduce false positives in clinical diagnosis and analyzed the PRSS1-PRSS2 locus as a test case. Our results suggest that population-specific assemblies are necessary for genetic and medical analysis, especially when closely related populations are studied. A record of this paper's transparent peer review process is included in the supplemental information.
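The scaffold N50 quoted above is a standard contiguity statistic: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. A minimal sketch of that definition follows; the function name `n50` and the lengths are hypothetical and are not taken from the TJ1 assembly.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("no lengths supplied")
    total = sum(lengths)
    running = 0
    # Walk from longest to shortest until half of the total length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Made-up scaffold lengths in Mb (illustration only).
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```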
Affiliation(s)
- Haiyi Lou
  - State Key Laboratory of Genetic Engineering, Collaborative Innovation Center of Genetics and Development, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai 200438, China
- Yang Gao
  - School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China
  - Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Bo Xie
  - Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Yimin Wang
  - Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Miao Shi
  - Berry Genomics, Beijing 102200, China
- Sen Ma
  - Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Xiaoxi Zhang
  - School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China
  - Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Chang Liu
  - Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Shuhua Xu
  - State Key Laboratory of Genetic Engineering, Collaborative Innovation Center of Genetics and Development, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai 200438, China
  - School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China
  - Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
  - Department of Liver Surgery and Transplantation Liver Cancer Institute, Zhongshan Hospital, Fudan University, Shanghai 200032, China
  - Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China
  - Jiangsu Key Laboratory of Phylogenomics and Comparative Genomics, School of Life Sciences, Jiangsu Normal University, Xuzhou 221116, China
  - Henan Institute of Medical and Pharmaceutical Sciences, Zhengzhou University, Zhengzhou 450052, China
  - Ministry of Education Key Laboratory of Contemporary Anthropology, Human Phenome Institute, Fudan University, Shanghai 201203, China
47
Han U, Kang T, Im J, Hong J. A small data-driven predictive model for adsorption properties in polymeric thin film. Chem Commun (Camb) 2022; 58:10953-10956. [DOI: 10.1039/d2cc03567g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
Artificial intelligence allowing data-driven prediction of physicochemical properties of polymer is rapidly emerging as a powerful tool for advancing material science. Here, we provide a methodology to perceive polymer adsorption...
48
Kuno A, Ikeda Y, Ayabe S, Kato K, Sakamoto K, Suzuki SR, Morimoto K, Wakimoto A, Mikami N, Ishida M, Iki N, Hamada Y, Takemura M, Daitoku Y, Tanimoto Y, Dinh TTH, Murata K, Hamada M, Muratani M, Yoshiki A, Sugiyama F, Takahashi S, Mizuno S. DAJIN enables multiplex genotyping to simultaneously validate intended and unintended target genome editing outcomes. PLoS Biol 2022; 20:e3001507. [PMID: 35041655 PMCID: PMC8765641 DOI: 10.1371/journal.pbio.3001507] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/24/2021] [Accepted: 12/07/2021] [Indexed: 12/24/2022] Open
Abstract
Genome editing can introduce designed mutations into a target genomic site. Recent research has revealed that it can also induce various unintended events such as structural variations, small indels, and substitutions at, and in some cases, away from the target site. These rearrangements may result in confounding phenotypes in biomedical research samples and cause a concern in clinical or agricultural applications. However, current genotyping methods do not allow a comprehensive analysis of diverse mutations for phasing and mosaic variant detection. Here, we developed a genotyping method with an on-target site analysis software named Determine Allele mutations and Judge Intended genotype by Nanopore sequencer (DAJIN) that can automatically identify and classify both intended and unintended diverse mutations, including point mutations, deletions, inversions, and cis double knock-in at single-nucleotide resolution. Our approach with DAJIN can handle approximately 100 samples under different editing conditions in a single run. With its high versatility, scalability, and convenience, DAJIN-assisted multiplex genotyping may become a new standard for validating genome editing outcomes.
Affiliation(s)
- Akihiro Kuno
  - Department of Anatomy and Embryology, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
  - Ph.D Program in Human Biology, School of Integrative and Global Majors, University of Tsukuba, Tsukuba, Japan
- Yoshihisa Ikeda
  - Doctoral Program in Biomedical Sciences, Graduate School of Comprehensive Human Sciences, University of Tsukuba, Tsukuba, Japan
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Shinya Ayabe
  - Experimental Animal Division, RIKEN BioResource Research Center, Tsukuba, Japan
- Kanako Kato
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Kotaro Sakamoto
  - Ph.D Program in Human Biology, School of Integrative and Global Majors, University of Tsukuba, Tsukuba, Japan
  - Department of Computer Science, University of Tsukuba, Tsukuba, Japan
- Sayaka R. Suzuki
  - Ph.D Program in Human Biology, School of Integrative and Global Majors, University of Tsukuba, Tsukuba, Japan
  - Bioinformatics Laboratory, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Kento Morimoto
  - Doctoral Program in Medical Sciences, Graduate School of Comprehensive Human Sciences, University of Tsukuba, Tsukuba, Japan
- Arata Wakimoto
  - Department of Anatomy and Embryology, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
  - Ph.D Program in Human Biology, School of Integrative and Global Majors, University of Tsukuba, Tsukuba, Japan
- Natsuki Mikami
  - Ph.D Program in Human Biology, School of Integrative and Global Majors, University of Tsukuba, Tsukuba, Japan
- Miyuki Ishida
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Natsumi Iki
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Yuko Hamada
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Megumi Takemura
  - Department of Anatomy and Embryology, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Yoko Daitoku
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Yoko Tanimoto
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Tra Thi Huong Dinh
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Kazuya Murata
  - Ph.D Program in Human Biology, School of Integrative and Global Majors, University of Tsukuba, Tsukuba, Japan
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Michito Hamada
  - Department of Anatomy and Embryology, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Masafumi Muratani
  - Department of Genome Biology, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Atsushi Yoshiki
  - Experimental Animal Division, RIKEN BioResource Research Center, Tsukuba, Japan
- Fumihiro Sugiyama
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Satoru Takahashi
  - Department of Anatomy and Embryology, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
- Seiya Mizuno
  - Laboratory Animal Resource Center, Transborder Medical Research Center, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
49
Xie S, Leung AWS, Zheng Z, Zhang D, Xiao C, Luo R, Luo M, Zhang S. Applications and potentials of nanopore sequencing in the (epi)genome and (epi)transcriptome era. Innovation (N Y) 2021; 2:100153. [PMID: 34901902 PMCID: PMC8640597 DOI: 10.1016/j.xinn.2021.100153] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/09/2021] [Accepted: 08/09/2021] [Indexed: 02/08/2023] Open
Abstract
The Human Genome Project opened an era of (epi)genomic research and also provided a platform for the development of new sequencing technologies. During and after the project, several sequencing technologies have continued to dominate nucleic acid sequencing markets. Currently, Illumina (short-read), PacBio (long-read), and Oxford Nanopore (long-read) are the most popular sequencing technologies. Unlike PacBio or the popular short-read sequencers before it, which, as examples of second-generation or so-called Next-Generation Sequencing platforms, need to synthesize during sequencing, nanopore technology directly sequences native DNA and RNA molecules. Nanopore sequencing therefore avoids converting mRNA into cDNA molecules, which not only allows the sequencing of extremely long native DNA and full-length RNA molecules but also documents modifications that have been made to those native DNA or RNA bases. In this review on direct DNA sequencing and direct RNA sequencing using Oxford Nanopore technology, we focus on their development and application achievements and discuss their challenges and future perspectives. We also address the problems researchers may encounter when applying these approaches to their research topics, and how to resolve them.
Affiliation(s)
- Shangqian Xie
  - Key Laboratory of Ministry of Education for Genetics and Germplasm Innovation of Tropical Special Trees and Ornamental Plants, College of Forestry, Hainan University, Haikou 570228, China
- Amy Wing-Sze Leung
  - Department of Computer Science, The University of Hong Kong, Hong Kong 999077, China
- Zhenxian Zheng
  - Department of Computer Science, The University of Hong Kong, Hong Kong 999077, China
- Dake Zhang
  - Beijing Advanced Innovation Centre for Biomedical Engineering, Key Laboratory for Biomechanics and Mechanobiology of Ministry of Education, School of Biological Science and Medical Engineering, Beihang University, Beijing 100083, China
- Chuanle Xiao
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Centre, Sun Yat-sen University, Guangzhou 510060, China
- Ruibang Luo
  - Department of Computer Science, The University of Hong Kong, Hong Kong 999077, China
- Ming Luo
  - Agriculture and Biotechnology Research Center, Guangdong Provincial Key Laboratory of Applied Botany, Center of Economic Botany, Core Botanical Gardens, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- Shoudong Zhang
  - School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong 999077, China
  - Center for Soybean Research of the State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong 999077, China
50
Vollrath P, Chawla HS, Alnajar D, Gabur I, Lee H, Weber S, Ehrig L, Koopmann B, Snowdon RJ, Obermeier C. Dissection of Quantitative Blackleg Resistance Reveals Novel Variants of Resistance Gene Rlm9 in Elite Brassica napus. FRONTIERS IN PLANT SCIENCE 2021; 12:749491. [PMID: 34868134 PMCID: PMC8636856 DOI: 10.3389/fpls.2021.749491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2021] [Accepted: 10/29/2021] [Indexed: 05/15/2023]
Abstract
Blackleg is one of the major fungal diseases in oilseed rape/canola worldwide. Most commercial cultivars carry R gene-mediated qualitative resistances that confer a high level of race-specific protection against Leptosphaeria maculans, the causal fungus of blackleg disease. However, monogenic resistances of this kind can potentially be rapidly overcome by mutations in the pathogen's avirulence genes. To counteract pathogen adaptation in this evolutionary arms race, there is a tremendous demand for quantitative background resistance to enhance durability and efficacy of blackleg resistance in oilseed rape. In this study, we characterized genomic regions contributing to quantitative L. maculans resistance by genome-wide association studies in a multiparental mapping population derived from six parental elite varieties exhibiting quantitative resistance, which were all crossed to one common susceptible parental elite variety. Resistance was screened using a fungal isolate with no corresponding avirulence (AvrLm) to major R genes present in the parents of the mapping population. Genome-wide association studies revealed eight significantly associated quantitative trait loci (QTL) on chromosomes A07 and A09, with small effects explaining 3-6% of the phenotypic variance. Unexpectedly, the qualitative blackleg resistance gene Rlm9 was found to be located within a resistance-associated haploblock on chromosome A07. Furthermore, long-range sequence data spanning this haploblock revealed high levels of single-nucleotide and structural variants within the Rlm9 coding sequence among the parents of the mapping population. The results suggest that novel variants of Rlm9 could play a previously unknown role in expression of quantitative disease resistance in oilseed rape.
Affiliation(s)
- Paul Vollrath
  - Department of Plant Breeding, IFZ Research Centre for Biosystems, Land Use and Nutrition, Justus Liebig University Giessen, Giessen, Germany
- Harmeet S. Chawla
  - Department of Plant Sciences, Crop Development Centre, University of Saskatchewan, Saskatoon, SK, Canada
- Dima Alnajar
  - Plant Pathology and Crop Protection Division, Department of Crop Sciences, Georg August University of Göttingen, Göttingen, Germany
- Iulian Gabur
  - Department of Plant Breeding, IFZ Research Centre for Biosystems, Land Use and Nutrition, Justus Liebig University Giessen, Giessen, Germany
  - Department of Plant Sciences, Faculty of Agriculture, Iasi University of Life Sciences, Iaşi, Romania
- HueyTyng Lee
  - Department of Plant Breeding, IFZ Research Centre for Biosystems, Land Use and Nutrition, Justus Liebig University Giessen, Giessen, Germany
- Sven Weber
  - Department of Plant Breeding, IFZ Research Centre for Biosystems, Land Use and Nutrition, Justus Liebig University Giessen, Giessen, Germany
- Lennard Ehrig
  - Department of Plant Breeding, IFZ Research Centre for Biosystems, Land Use and Nutrition, Justus Liebig University Giessen, Giessen, Germany
- Birger Koopmann
  - Plant Pathology and Crop Protection Division, Department of Crop Sciences, Georg August University of Göttingen, Göttingen, Germany
- Rod J. Snowdon
  - Department of Plant Breeding, IFZ Research Centre for Biosystems, Land Use and Nutrition, Justus Liebig University Giessen, Giessen, Germany
- Christian Obermeier
  - Department of Plant Breeding, IFZ Research Centre for Biosystems, Land Use and Nutrition, Justus Liebig University Giessen, Giessen, Germany