1
Harary Y, Snapir P, Tov SS, Kruphman C, Rechef E, Jahshan Z, Garzon E, Yavits L. GCOC: A Genome Classifier-On-Chip Based on Similarity Search Content Addressable Memory. IEEE Transactions on Biomedical Circuits and Systems 2025; 19:484-495. [PMID: 39196751 DOI: 10.1109/tbcas.2024.3449788] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/30/2024]
Abstract
GCOC is a genome classification system-on-chip (SoC) that classifies genomes by $k$-mer matching, an approach that divides a DNA query sequence into a set of short DNA fragments of size $k$, which are searched in a reference genome database, with the underlying assumption that sequenced DNA reads of the same organism (or its close variants) share most of such $k$-mers. At the core of GCOC is a similarity, or approximate search-capable Content Addressable Memory (SAS-CAM), which in addition to exact match also supports approximate, or Hamming distance tolerant, search. The classification operation is controlled by an embedded RISC-V processor. The GCOC classification platform was designed and manufactured in a commercial 65 nm process. We conduct a thorough analysis of GCOC classification efficiency as well as its performance, silicon area, and power consumption using silicon measurements. GCOC classifies 769.2K short DNA reads/sec. The silicon area of the GCOC SoC is 3.12 $\mathrm{mm}^{2}$ and its power consumption is 1.27 $\mathrm{mW}$. We envision GCOC deployed as a portable field classifier (for example, at points of care) where classification is required to be real-time, easy to operate, and energy efficient.
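The $k$-mer matching idea in the abstract above can be illustrated with a minimal Python sketch. The function names, the dictionary layout of the reference, and the majority-vote rule are illustrative assumptions, not the GCOC design; a content addressable memory compares a query against all stored entries in parallel in hardware, whereas the loop below is only a functional stand-in.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def classify(read, reference, k, max_dist=0):
    """Vote for the class whose reference k-mers match the most read k-mers.

    reference: dict mapping a class label to a collection of k-mers
               (hypothetical layout chosen for this sketch).
    max_dist:  Hamming distance tolerated per k-mer; 0 means exact match.
    """
    votes = Counter()
    for q in kmers(read, k):
        for label, ref_kmers in reference.items():
            if any(hamming(q, r) <= max_dist for r in ref_kmers):
                votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy reference database with two classes (made-up sequences).
reference = {
    "A": set(kmers("ACGTACGTGA", 4)),
    "B": set(kmers("TTGGCCAATT", 4)),
}
read = "ACGAACGT"  # differs from class A by one substitution
exact = classify(read, reference, k=4, max_dist=0)
approx = classify(read, reference, k=4, max_dist=1)
```

With `max_dist=0` only the unmutated $k$-mer of the read contributes a vote; raising `max_dist` to 1 lets the $k$-mers that overlap the substitution contribute as well, which is the practical benefit of a Hamming distance tolerant search.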
2
Marini S, Barquero A, Wadhwani AA, Bian J, Ruiz J, Boucher C, Prosperi M. OCTOPUS: Disk-based, Multiplatform, Mobile-friendly Metagenomics Classifier. AMIA ... Annual Symposium Proceedings. AMIA Symposium 2025; 2024:798-807. [PMID: 40417475 PMCID: PMC12099329] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2025]
Abstract
Portable genomic sequencers such as Oxford Nanopore's MinION enable real-time applications in clinical and environmental health. However, there is a bottleneck in the downstream analytics when bioinformatics pipelines are unavailable, e.g., when cloud processing is unreachable due to the absence of an Internet connection, or when only low-end computing devices can be carried on site. Here we present platform-friendly software for portable metagenomic analysis of Nanopore data, the Oligomer-based Classifier of Taxonomic Operational and Pan-genome Units via Singletons (OCTOPUS). OCTOPUS is written in Java and reimplements several features of the popular Kraken2 and KrakenUniq software, with original components for improving metagenomics classification on incomplete/sampled reference databases, making it ideal for running on smartphones or tablets. OCTOPUS obtains sensitivity and precision comparable to Kraken2, while dramatically decreasing (4- to 16-fold) the false positive rate, and yielding high correlation on real-world data. OCTOPUS is available along with customized databases at https://github.com/DataIntellSystLab/OCTOPUS and https://github.com/Ruiz-HCI-Lab/OctopusMobile.
Affiliation(s)
- Simone Marini
  - Department of Epidemiology, University of Florida, Gainesville, USA
  - Emerging Pathogens Institute, University of Florida, Gainesville, USA
- Alexander Barquero
  - Department of Computer and Information Science and Engineering, University of Florida, USA
- Anisha Ashok Wadhwani
  - Department of Computer and Information Science and Engineering, University of Florida, USA
- Jiang Bian
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, USA
- Jaime Ruiz
  - Department of Computer and Information Science and Engineering, University of Florida, USA
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, USA
- Mattia Prosperi
  - Department of Epidemiology, University of Florida, Gainesville, USA
3
Ni K, Yu G, Zheng Z, Lu Y, Poe D, Chen Y, Sanborn M, Wang Z, Zhou S, Zhan X, Wang W, Xing J. LivecellX: A Scalable Deep Learning Framework for Single-Cell Object-Oriented Analysis in Live-Cell Imaging. bioRxiv: The Preprint Server for Biology 2025:2025.02.23.639532. [PMID: 40060645 PMCID: PMC11888277 DOI: 10.1101/2025.02.23.639532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/25/2025]
Abstract
Quantitative analysis of single-cell dynamics in live-cell imaging is pivotal for understanding cellular heterogeneity, disease mechanisms, and drug responses. However, this analysis demands stringent accuracy in cell segmentation and tracking. A single segmentation error can significantly impact trajectory analyses, leading to error cascades, despite recent advances that have improved segmentation precision. To tackle these challenges, we introduce LivecellX, a deep-learning-based, object-oriented framework designed for scalable analysis of live-cell dynamics. We have defined a new task: segmentation correction for both over-segmentation and under-segmentation errors, and developed innovative evaluation metrics and machine learning techniques to address this issue. Our work includes annotating a novel imaging dataset from two distinct microscope types and training a Corrective Segmentation Network (CS-Net). The network leverages normalized distance transforms and synthetic augmentation to rectify segmentation inaccuracies. We also propose trajectory-level correction algorithms that use temporal consistency and CS-Net to resolve errors at the trajectory level. After tracking, LivecellX facilitates biological process detection, diverse feature extraction, and lineage reconstruction across different datasets and imaging platforms. Its object-oriented architecture enables efficient data management and seamless integration across multiple datasets. Enhanced by Napari GUI support and parallelized computation, LivecellX offers a robust and extensible infrastructure for high-throughput single-cell imaging analysis, paving the way for future developments in live-cell foundation models.
4
Gao R, Hu H, Jiang Z, Cao S, Wang G, Zhao Y, Jiang T. SVHunter: long-read-based structural variation detection through the transformer model. Brief Bioinform 2025; 26:bbaf203. [PMID: 40341921 PMCID: PMC12062572 DOI: 10.1093/bib/bbaf203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2024] [Revised: 03/31/2025] [Accepted: 04/15/2025] [Indexed: 05/11/2025]
Abstract
Structural variations (SVs) are genomic rearrangements larger than 50 bp that are widely present in the human genome and are associated with various complex diseases. Existing long-read-based SV detection tools often rely on fixed rules or heuristic algorithms, which can oversimplify the complexity of SV signatures. Therefore, these methods usually lack flexibility and cannot fully capture SV signals, leading to reduced accuracy and robustness. To address these issues, we propose SVHunter, a transformer-based method for long-read SV detection. SVHunter combines convolutional neural networks and transformers to capture both local and global SV signatures, enabling accurate identification of SVs. Additionally, SVHunter employs the mean shift clustering algorithm, which dynamically adjusts bandwidth parameters to accommodate different types of SVs without requiring a preset number of clusters, thus allowing precise breakpoint clustering. Validation across multiple sequencing platforms and datasets demonstrates that SVHunter excels at detecting various types of SVs, with a notable reduction in the false discovery rate. This highlights its strong potential for both research and clinical applications.
Affiliation(s)
- Runtian Gao
  - College of Life Science, Northeast Forestry University, Harbin 150000, China
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150000, China
- Heng Hu
  - College of Life Science, Northeast Forestry University, Harbin 150000, China
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150000, China
- Zhongjun Jiang
  - College of Life Science, Northeast Forestry University, Harbin 150000, China
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150000, China
- Shuqi Cao
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Guohua Wang
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150000, China
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Yuming Zhao
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150000, China
- Tao Jiang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
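The SVHunter abstract above relies on mean shift clustering, which groups points without a preset number of clusters. A minimal one-dimensional Python sketch with a flat kernel is given below for orientation. It uses a fixed bandwidth, so it does not reproduce the dynamic bandwidth adjustment the abstract describes; the function name, the merge rule, and the breakpoint coordinates are assumptions made for illustration only.

```python
def mean_shift_1d(points, bandwidth, tol=1e-6, max_iter=100):
    """Cluster 1-D points with a flat-kernel mean shift (fixed bandwidth).

    Each point is shifted repeatedly to the mean of all points within
    `bandwidth` of its current position until it stops moving. Points whose
    final positions (modes) nearly coincide are grouped into one cluster.
    """
    modes = []
    for p in points:
        m = float(p)
        for _ in range(max_iter):
            window = [q for q in points if abs(q - m) <= bandwidth]
            new_m = sum(window) / len(window)
            if abs(new_m - m) < tol:
                break
            m = new_m
        modes.append(m)

    # Merge neighbouring points whose modes are close together
    # (half the bandwidth is an arbitrary threshold for this sketch).
    clusters = []  # list of (mode, members) pairs
    for p, m in sorted(zip(points, modes)):
        if clusters and abs(clusters[-1][0] - m) <= bandwidth / 2:
            clusters[-1][1].append(p)
        else:
            clusters.append((m, [p]))
    return [members for _, members in clusters]


# Made-up breakpoint positions (in bp) reported by different reads.
breakpoints = [1000, 1002, 1005, 5000, 5003, 9000]
clusters = mean_shift_1d(breakpoints, bandwidth=50)
```

The number of clusters falls out of the data and the bandwidth, which is why the method suits breakpoint clustering where the number of variants is unknown in advance.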
5
Depuydt L, Ahmed OY, Fostier J, Langmead B, Gagie T. Run-length compressed metagenomic read classification with SMEM-finding and tagging. bioRxiv: The Preprint Server for Biology 2025:2025.02.25.640119. [PMID: 40060500 PMCID: PMC11888359 DOI: 10.1101/2025.02.25.640119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/15/2025]
Abstract
Metagenomic read classification is a fundamental task in computational biology, yet it remains challenging due to the scale, diversity, and complexity of sequencing datasets. We propose a novel, run-length compressed index based on the move structure that enables efficient multi-class metagenomic classification in O(r) space, where r is the number of character runs in the BWT of the reference text. Our method identifies all super-maximal exact matches (SMEMs) of length at least L between a read and the reference dataset and associates each SMEM with one class identifier using a sampled tag array. A consensus algorithm then compacts these SMEMs with their class identifier into a single classification per read. We are the first to perform run-length compressed read classification based on full SMEMs instead of semi-SMEMs. We evaluate our approach on both long and short reads in two conceptually distinct datasets: a large bacterial pan-genome with few metagenomic classes and a smaller 16S rRNA gene database spanning thousands of genera or classes. Our method consistently outperforms SPUMONI 2 in accuracy and runtime while maintaining the same asymptotic memory complexity of O(r). Compared to Cliffy, we demonstrate better memory efficiency while achieving superior accuracy on the simpler dataset and comparable performance on the more complex one. Overall, our implementation carefully balances accuracy, runtime, and memory usage, offering a versatile solution for metagenomic classification across diverse datasets. The open-source C++11 implementation is available at https://github.com/biointec/tagger under the AGPL-3.0 license.
Affiliation(s)
- Lore Depuydt
  - Department of Information Technology - IDLab, Ghent University - imec
- Omar Y. Ahmed
  - Department of Computer Science, Johns Hopkins University
- Jan Fostier
  - Department of Information Technology - IDLab, Ghent University - imec
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University
- Travis Gagie
  - Faculty of Computer Science, Dalhousie University
6
Kong T, Wang Y, Liu B. xRead: a coverage-guided approach for scalable construction of read overlapping graph. Gigascience 2025; 14:giaf007. [PMID: 39960665 PMCID: PMC11831799 DOI: 10.1093/gigascience/giaf007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2024] [Revised: 11/29/2024] [Accepted: 01/10/2025] [Indexed: 02/20/2025]
Abstract
BACKGROUND: The development of long-read sequencing is promising for high-quality and comprehensive de novo assembly of various species around the world. However, it is still challenging for assemblers to handle thousands of genomes, tens of gigabase-level assembly sizes, and terabase-level datasets efficiently, which is a bottleneck for large-scale de novo sequencing studies. A major cause is read overlapping graph construction, for which state-of-the-art tools usually require terabyte-level RAM space and tens of days for large genomes. Such limited performance and scalability are not suited to handling the numerous samples being sequenced. FINDINGS: Herein, we propose xRead, a novel iterative overlapping graph construction approach that achieves high performance, scalability, and yield simultaneously. Under the guidance of its coverage-based model, xRead converts read overlapping into heuristic read-mapping and incremental graph construction tasks with highly controllable RAM space and faster speed. It enables the processing of very large datasets (such as the 1.28 Tb Ambystoma mexicanum dataset) with less than 64 GB RAM and markedly lower time costs. Moreover, benchmarks suggest that it can produce highly accurate and well-connected overlapping graphs, which also support various kinds of downstream assembly strategies. CONCLUSIONS: xRead is able to break through the major bottleneck in graph construction and lays a new foundation for de novo assembly. This tool is suited to handling a large number of datasets from large genomes and may play important roles in many de novo sequencing studies.
Affiliation(s)
- Tangchao Kong
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
  - Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Yadong Wang
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
  - Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Bo Liu
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
  - Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
7
Zakeri M, Brown NK, Ahmed OY, Gagie T, Langmead B. Movi: A fast and cache-efficient full-text pangenome index. iScience 2024; 27:111464. [PMID: 39758981 PMCID: PMC11696632 DOI: 10.1016/j.isci.2024.111464] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/07/2024] [Revised: 10/11/2024] [Accepted: 11/20/2024] [Indexed: 01/07/2025]
Abstract
Pangenome indexes are promising tools for many applications, including classification of nanopore sequencing reads. Move structure is a compressed-index data structure based on the Burrows-Wheeler Transform (BWT). It offers simultaneous O(1)-time queries and O(r) space, where r is the number of BWT runs (consecutive sequence of identical characters). We developed Movi based on the move structure for indexing and querying pangenomes. Movi scales very well for repetitive text as its size grows strictly by r. Movi computes sophisticated matching queries for classification such as pseudo-matching lengths and backward search up to 30 times faster than existing methods by minimizing the number of cache misses and using memory prefetching to attain a degree of latency hiding. Movi's fast constant-time query loop makes it well suited to real-time applications like adaptive sampling for nanopore sequencing, where decisions must be made in a small and predictable time interval.
Affiliation(s)
- Mohsen Zakeri
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, US
- Nathaniel K. Brown
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, US
- Omar Y. Ahmed
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, US
- Travis Gagie
  - Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, US
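The quantity r in the Movi abstract above, the number of BWT runs, can be made concrete with a small Python sketch. It builds the BWT by sorting rotations, which is quadratic and suitable only for toy inputs; it is not how Movi or any move-structure index is constructed, and the function names and the `$` sentinel are assumptions of this example.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations (toy inputs only).

    The sentinel is assumed to be absent from `text` and to sort before
    every other character, which holds for "$" and uppercase/lowercase letters.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s):
    """Number of maximal runs of identical consecutive characters."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


# A repetitive text, standing in for a collection of near-identical genomes.
single = bwt("GATTACA")
repeated = bwt("GATTACA" * 8)
r_single = count_runs(single)
r_repeated = count_runs(repeated)
```

Repeating the text eight times multiplies its length by roughly eight, but `r_repeated` stays far below the length of `repeated`, because identical contexts bring identical characters together in the BWT. This is the property that lets an O(r)-space index scale well on repetitive pangenomes.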
8
Liu Y, Li Y, Chen E, Xu J, Zhang W, Zeng X, Luo X. Repeat and haplotype aware error correction in nanopore sequencing reads with DeChat. Commun Biol 2024; 7:1678. [PMID: 39702496 DOI: 10.1038/s42003-024-07376-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/28/2024] [Accepted: 12/05/2024] [Indexed: 12/21/2024]
Abstract
Error self-correction is crucial for analyzing long-read sequencing data, but existing methods often struggle with noisy data or are tailored to technologies like PacBio HiFi. There is a gap in methods optimized for Nanopore R10 simplex reads, which typically have error rates below 2%. We introduce DeChat, a novel approach designed specifically for these reads. DeChat enables repeat- and haplotype-aware error correction, leveraging the strengths of both de Bruijn graphs and variant-aware multiple sequence alignment to create a synergistic approach. This approach avoids read overcorrection, ensuring that variants in repeats and haplotypes are preserved while sequencing errors are accurately corrected. Benchmarking on simulated and real datasets shows that DeChat-corrected reads have significantly fewer errors (up to two orders of magnitude lower) compared to other methods, without losing read information. Furthermore, DeChat-corrected reads clearly improve genome assembly and taxonomic classification.
Affiliation(s)
- Yuansheng Liu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Yichen Li
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Enlian Chen
  - College of Biology, Hunan University, Changsha, China
- Jialu Xu
  - College of Biology, Hunan University, Changsha, China
- Wenhai Zhang
  - College of Biology, Hunan University, Changsha, China
- Xiangxiang Zeng
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Xiao Luo
  - College of Biology, Hunan University, Changsha, China
9
Fuhrmann L, Langer B, Topolsky I, Beerenwinkel N. VILOCA: sequencing quality-aware viral haplotype reconstruction and mutation calling for short-read and long-read data. NAR Genom Bioinform 2024; 6:lqae152. [PMID: 39633724 PMCID: PMC11616694 DOI: 10.1093/nargab/lqae152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2024] [Revised: 09/15/2024] [Accepted: 10/25/2024] [Indexed: 12/07/2024]
Abstract
RNA viruses exist as large heterogeneous populations within their host. The structure and diversity of virus populations affect disease progression and treatment outcomes. Next-generation sequencing allows detailed viral population analysis, but inferring diversity from error-prone reads is challenging. Here, we present VILOCA (VIral LOcal haplotype reconstruction and mutation CAlling for short and long read data), a method for mutation calling and reconstruction of local haplotypes from short- and long-read viral sequencing data. Local haplotypes refer to genomic regions that have approximately the length of the input reads. VILOCA recovers local haplotypes by using a Dirichlet process mixture model to cluster reads around their unobserved haplotypes and leveraging quality scores of the sequencing reads. We assessed the performance of VILOCA in terms of mutation calling and haplotype reconstruction accuracy on simulated and experimental Illumina, PacBio and Oxford Nanopore data. On simulated and experimental Illumina data, VILOCA performed better than or similarly to existing methods. On the simulated long-read data, VILOCA is able to recover on average [Formula: see text] of the ground truth mutations with perfect precision, compared to only [Formula: see text] recall and [Formula: see text] precision for the second-best method. In summary, VILOCA provides significantly improved accuracy in mutation and haplotype calling, especially for long-read sequencing data, and therefore facilitates the comprehensive characterization of heterogeneous within-host viral populations.
Affiliation(s)
- Lara Fuhrmann
  - Department of Biosystems Science and Engineering, ETH Zurich, Klingelbergstrasse 48, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Quartier Sorge - Bâtiment Amphipôle, Lausanne 1015, Switzerland
- Benjamin Langer
  - Department of Biosystems Science and Engineering, ETH Zurich, Klingelbergstrasse 48, Basel 4056, Switzerland
- Ivan Topolsky
  - Department of Biosystems Science and Engineering, ETH Zurich, Klingelbergstrasse 48, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Quartier Sorge - Bâtiment Amphipôle, Lausanne 1015, Switzerland
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, Klingelbergstrasse 48, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Quartier Sorge - Bâtiment Amphipôle, Lausanne 1015, Switzerland
10
Luo J, Wang J, Wei J, Yan C, Luo H. DeepHapNet: a haplotype assembly method based on RetNet and deep spectral clustering. Brief Bioinform 2024; 26:bbae656. [PMID: 39690881 DOI: 10.1093/bib/bbae656] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2024] [Revised: 10/18/2024] [Accepted: 12/05/2024] [Indexed: 12/19/2024]
Abstract
Gene polymorphism originates from single-nucleotide polymorphisms (SNPs), and the analysis and study of SNPs are of great significance in the field of biogenetics. The haplotype, which consists of the sequence of SNP loci, carries more genetic information than a single SNP. Haplotype assembly plays a significant role in understanding gene function, diagnosing complex diseases, and pinpointing species genes. We propose a novel method, DeepHapNet, for haplotype assembly through the clustering of reads and learning correlations between read pairs. We employ a sequence model called Retentive Network (RetNet), which utilizes a multiscale retention mechanism to extract read features and learn the global relationships among them. Based on the feature representation of reads learned from the RetNet model, the clustering process of reads is implemented using the SpectralNet model, and, finally, haplotypes are constructed based on the read clusters. Experiments with simulated and real datasets show that the method performs well in the haplotype assembly problem of diploid and polyploid based on either long or short reads. The code implementation of DeepHapNet and the processing scripts for experimental data are publicly available at https://github.com/wjj6666/DeepHapNet.
Affiliation(s)
- Junwei Luo
  - School of Software, Henan Polytechnic University, Century Road 2001, Jiaozuo 454003, China
- Jiaojiao Wang
  - School of Software, Henan Polytechnic University, Century Road 2001, Jiaozuo 454003, China
- Jingjing Wei
  - College of Chemical and Environmental Engineering, Anyang Institute of Technology, West Section of Huanghe Avenue, Anyang 455000, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, North Section of Jinming Avenue, Kaifeng 475001, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, North Section of Jinming Avenue, Kaifeng 475001, China
11
Zong P, Deng W, Liu J, Ruan J. TSTA: thread and SIMD-based trapezoidal pairwise/multiple sequence-alignment method. GigaByte 2024; 2024:gigabyte141. [PMID: 39539520 PMCID: PMC11558659 DOI: 10.46471/gigabyte.141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2024] [Accepted: 11/01/2024] [Indexed: 11/16/2024]
Abstract
The rapid increase in sequencing read length necessitates the adoption of increasingly efficient sequence alignment algorithms. The Needleman-Wunsch method introduced the foundational dynamic-programming matrix calculation for global alignment, which evaluates the overall alignment of sequences. However, this method is known to be highly time-consuming. The proposed TSTA algorithm leverages both vector-level and thread-level parallelism to accelerate pairwise and multiple sequence alignments. Availability and implementation: Source code is available at https://github.com/bxskdh/TSTA.
Affiliation(s)
- Peiyu Zong
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No 7, Pengfei Road, Dapeng District, Shenzhen, 518120, Guangdong, China
- Wenpeng Deng
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No 7, Pengfei Road, Dapeng District, Shenzhen, 518120, Guangdong, China
- Jian Liu
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No 7, Pengfei Road, Dapeng District, Shenzhen, 518120, Guangdong, China
- Jue Ruan
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No 7, Pengfei Road, Dapeng District, Shenzhen, 518120, Guangdong, China
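For readers unfamiliar with the dynamic-programming matrix that the TSTA abstract above refers to, here is a minimal scalar Python sketch of Needleman-Wunsch global alignment scoring. It is the textbook recurrence with a linear gap penalty, not the TSTA algorithm; the scoring values and the function name are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    score[i][j] holds the best score for aligning a[:i] with b[:j].
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]


identical = needleman_wunsch("GATTACA", "GATTACA")
one_gap = needleman_wunsch("ACGT", "AGT")
```

Every cell depends only on its upper, left, and upper-left neighbours, so cells on the same anti-diagonal are independent of each other. That independence is what makes the matrix fill amenable to the vector-level and thread-level parallelism the abstract mentions.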
12
Gao X, Liu K, Luo S, Tang M, Liu N, Jiang C, Fang J, Li S, Hou Y, Guo C, Qu K. Comparative analysis of methodologies for detecting extrachromosomal circular DNA. Nat Commun 2024; 15:9208. [PMID: 39448595 PMCID: PMC11502736 DOI: 10.1038/s41467-024-53496-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Accepted: 10/14/2024] [Indexed: 10/26/2024]
Abstract
Extrachromosomal circular DNA (eccDNA) is crucial in oncogene amplification, gene transcription regulation, and intratumor heterogeneity. While various analysis pipelines and experimental methods have been developed for eccDNA identification, their detection efficiencies have not been systematically assessed. To address this, we evaluate the performance of seven analysis pipelines using seven simulated datasets, in terms of accuracy, identity, duplication rate, and computational resource consumption. We also compare the eccDNA detection efficiency of seven experimental methods through twenty-one real sequencing datasets. Here, we show that Circle-Map and Circle_finder (bwa-mem-samblaster) outperform the other short-read pipelines. However, Circle_finder (bwa-mem-samblaster) exhibits notable redundancy in its outcomes. CReSIL is the most effective pipeline for eccDNA detection in long-read sequencing data at depths higher than 10X. Moreover, long-read sequencing-based Circle-Seq shows superior efficiency in detecting copy number-amplified eccDNA over 10 kb in length. These results offer valuable insights for researchers choosing suitable methods for eccDNA research.
Affiliation(s)
- Xuyuan Gao
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Ke Liu
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Songwen Luo
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Meifang Tang
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
  - Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
- Nianping Liu
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Chen Jiang
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
  - Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
- Jingwen Fang
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
  - HanGene Biotech, Xiaoshan Innovation Polis, Hangzhou, Zhejiang, China
- Shouzhen Li
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Yanbing Hou
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Chuang Guo
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
  - School of Pharmacy, Bengbu Medical University, Bengbu, China
  - Department of Rheumatology and Immunology, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
- Kun Qu
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
  - Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
  - School of Biomedical Engineering, Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
| |
|
13
|
Giurgiu M, Wittstruck N, Rodriguez-Fos E, Chamorro González R, Brückner L, Krienelke-Szymansky A, Helmsauer K, Hartebrodt A, Euskirchen P, Koche RP, Haase K, Reinert K, Henssen AG. Reconstructing extrachromosomal DNA structural heterogeneity from long-read sequencing data using Decoil. Genome Res 2024; 34:1355-1364. [PMID: 39111816 PMCID: PMC11529853 DOI: 10.1101/gr.279123.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Accepted: 07/29/2024] [Indexed: 08/23/2024]
Abstract
Circular extrachromosomal DNA (ecDNA) is a form of oncogene amplification found across cancer types and associated with poor outcome in patients. ecDNA can be structurally complex and can contain rearranged DNA sequences derived from multiple chromosome locations. As the structure of ecDNA can impact oncogene regulation and may indicate mechanisms of its formation, disentangling it at high resolution from sequencing data is essential. Even though methods have been developed to identify and reconstruct ecDNA in cancer genome sequencing, it remains challenging to resolve complex ecDNA structures, in particular amplicons with shared genomic footprints. We here introduce Decoil, a computational method that combines a breakpoint-graph approach with LASSO regression to reconstruct complex ecDNA and deconvolve co-occurring ecDNA elements with overlapping genomic footprints from long-read nanopore sequencing. Decoil outperforms de novo assembly and alignment-based methods in simulated long-read sequencing data for both simple and complex ecDNAs. Applying Decoil on whole-genome sequencing data uncovered different ecDNA topologies and explored ecDNA structure heterogeneity in neuroblastoma tumors and cell lines, indicating that this method may improve ecDNA structural analyses in cancer.
Affiliation(s)
- Mădălina Giurgiu
- Department of Pediatric Oncology and Hematology, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
- Experimental and Clinical Research Center of the Max Delbrück Center and Charité Berlin, 13125 Berlin, Germany
- Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany
- Freie Universität Berlin, 14195 Berlin, Germany
| | - Nadine Wittstruck
- Department of Pediatric Oncology and Hematology, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
- Experimental and Clinical Research Center of the Max Delbrück Center and Charité Berlin, 13125 Berlin, Germany
- Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany
| | - Elias Rodriguez-Fos
- Department of Pediatric Oncology and Hematology, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
- Experimental and Clinical Research Center of the Max Delbrück Center and Charité Berlin, 13125 Berlin, Germany
- Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany
| | - Rocío Chamorro González
- Department of Pediatric Oncology and Hematology, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
- Experimental and Clinical Research Center of the Max Delbrück Center and Charité Berlin, 13125 Berlin, Germany
- Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany
- Max Delbrück Center for Molecular Medicine, 13125 Berlin, Germany
| | - Lotte Brückner
- Department of Pediatric Oncology and Hematology, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
- Experimental and Clinical Research Center of the Max Delbrück Center and Charité Berlin, 13125 Berlin, Germany
- Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany
- Max Delbrück Center for Molecular Medicine, 13125 Berlin, Germany
| | - Annabell Krienelke-Szymansky
- Department of Pediatric Oncology and Hematology, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
- Experimental and Clinical Research Center of the Max Delbrück Center and Charité Berlin, 13125 Berlin, Germany
- Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany
| | - Konstantin Helmsauer
- Department of Pediatric Oncology and Hematology, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
- Experimental and Clinical Research Center of the Max Delbrück Center and Charité Berlin, 13125 Berlin, Germany
- Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany
| | - Anne Hartebrodt
- Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
| | - Philipp Euskirchen
- German Cancer Consortium (DKTK), partner site Berlin, a partnership between DKFZ and Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany
- Department of Neuropathology, Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353 Berlin, Germany
| | - Richard P Koche
- Center for Epigenetics Research, Memorial Sloan Kettering Cancer Center, New York, New York 10065, USA
| | - Kerstin Haase
- Department of Pediatric Oncology and Hematology, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
- Experimental and Clinical Research Center of the Max Delbrück Center and Charité Berlin, 13125 Berlin, Germany
- Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany
| | | | - Anton G Henssen
- Department of Pediatric Oncology and Hematology, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
- Experimental and Clinical Research Center of the Max Delbrück Center and Charité Berlin, 13125 Berlin, Germany
- Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany
- Max Delbrück Center for Molecular Medicine, 13125 Berlin, Germany
| |
|
14
|
Chanin RB, West PT, Wirbel J, Gill MO, Green GZM, Park RM, Enright N, Miklos AM, Hickey AS, Brooks EF, Lum KK, Cristea IM, Bhatt AS. Intragenic DNA inversions expand bacterial coding capacity. Nature 2024; 634:234-242. [PMID: 39322669 DOI: 10.1038/s41586-024-07970-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Accepted: 08/20/2024] [Indexed: 09/27/2024]
Abstract
Bacterial populations that originate from a single bacterium are not strictly clonal and often contain subgroups with distinct phenotypes [1]. Bacteria can generate heterogeneity through phase variation, a preprogrammed, reversible mechanism that alters gene expression levels across a population [1]. One well-studied type of phase variation involves enzyme-mediated inversion of specific regions of genomic DNA [2]. Frequently, these DNA inversions flip the orientation of promoters, turning transcription of adjacent coding regions on or off [2]. Through this mechanism, inversion can affect fitness, survival or group dynamics [3,4]. Here, we describe the development of PhaVa, a computational tool that identifies DNA inversions using long-read datasets. We also identify 372 'intragenic invertons', a novel class of DNA inversions found entirely within genes, in genomes of bacterial and archaeal isolates. Intragenic invertons allow a gene to encode two or more versions of a protein by flipping a DNA sequence within the coding region, thereby increasing coding capacity without increasing genome size. We validate ten intragenic invertons in the gut commensal Bacteroides thetaiotaomicron, and experimentally characterize an intragenic inverton in the thiamine biosynthesis gene thiC.
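The idea that flipping a segment inside a coding region yields a second protein version can be illustrated with a toy example. The sketch below is not PhaVa and does not reproduce any result from the paper; the gene sequence and inverton coordinates are made up for illustration. It replaces an internal segment with its reverse complement and lists the codons before and after.

```python
# Toy illustration only: the sequence and coordinates below are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def invert_segment(seq, start, end):
    """Replace seq[start:end] (0-based, end-exclusive) with its reverse complement."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


def codons(seq):
    """Split a coding sequence into complete codons, reading from position 0."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


gene = "ATGGCTAAACGTTTAGGCTGA"          # hypothetical 21-nt coding sequence
flipped = invert_segment(gene, 3, 12)   # hypothetical inverton covering codons 2-4

print(codons(gene))     # codons in the original orientation
print(codons(flipped))  # same length, different internal codons
```

Because the inverted segment has the same length as the original, the genome size is unchanged and the downstream reading frame is preserved, which is the property the abstract highlights.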
Affiliation(s)
- Rachael B Chanin
- Department of Medicine, Division of Hematology, Stanford University, Stanford, CA, USA
| | - Patrick T West
- Department of Medicine, Division of Hematology, Stanford University, Stanford, CA, USA
| | - Jakob Wirbel
- Department of Medicine, Division of Hematology, Stanford University, Stanford, CA, USA
| | - Matthew O Gill
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Gabriella Z M Green
- Department of Medicine, Division of Hematology, Stanford University, Stanford, CA, USA
| | - Ryan M Park
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Nora Enright
- Department of Bioengineering, Stanford University, Stanford, CA, USA
| | - Arjun M Miklos
- Department of Medicine, Division of Hematology, Stanford University, Stanford, CA, USA
| | - Angela S Hickey
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Erin F Brooks
- Department of Medicine, Division of Hematology, Stanford University, Stanford, CA, USA
| | - Krystal K Lum
- Department of Molecular Biology, Princeton University, Princeton, NJ, USA
| | - Ileana M Cristea
- Department of Molecular Biology, Princeton University, Princeton, NJ, USA
| | - Ami S Bhatt
- Department of Medicine, Division of Hematology, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
| |
|
15
|
Baudeau T, Sahlin K. Improved sub-genomic RNA prediction with the ARTIC protocol. Nucleic Acids Res 2024; 52:e82. [PMID: 39149898 PMCID: PMC11417393 DOI: 10.1093/nar/gkae687] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 07/18/2024] [Accepted: 07/25/2024] [Indexed: 08/17/2024] Open
Abstract
Viral subgenomic RNA (sgRNA) plays a major role in SARS-CoV-2's replication, pathogenicity, and evolution. Recent sequencing protocols, such as the ARTIC protocol, have been established. However, due to the viral-specific biological processes, analyzing sgRNA through viral-specific read sequencing data is a computational challenge. Current methods rely on computational tools designed for eukaryotic genomes, resulting in a gap in the tools designed specifically for sgRNA detection. To address this, we make two contributions. First, we present sgENERATE, an evaluation pipeline to study the accuracy and efficacy of sgRNA detection tools using the popular ARTIC sequencing protocol. Using sgENERATE, we evaluate periscope, a recently introduced tool that detects sgRNA from ARTIC sequencing data. We find that periscope has biased predictions and high computational costs. Second, using the information produced from sgENERATE, we redesign the algorithm in periscope to use multiple references from canonical sgRNAs to mitigate alignment issues and improve sgRNA and non-canonical sgRNA detection. We evaluate periscope and our algorithm, periscope_multi, on simulated and biological sequencing datasets and demonstrate periscope_multi's enhanced sgRNA detection accuracy. Our contribution advances tools for studying viral sgRNA, paving the way for more accurate and efficient analyses in the context of viral RNA discovery.
Affiliation(s)
- Thomas Baudeau
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
| | - Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91 Stockholm, Sweden
| |
|
16
|
Huang Y, Gao Y, Ly K, Lin L, Lambooij JP, King EG, Janssen A, Wei KHC, Lee YCG. Varying recombination landscapes between individuals are driven by polymorphic transposable elements. bioRxiv 2024:2024.09.17.613564. [PMID: 39345575 PMCID: PMC11429682 DOI: 10.1101/2024.09.17.613564] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/01/2024]
Abstract
Meiotic recombination is a prominent force shaping genome evolution, and understanding the causes of varying recombination landscapes within and between species has remained a central, though challenging, question. Recombination rates are widely observed to negatively associate with the abundance of transposable elements (TEs), selfish genetic elements that move between genomic locations. While such associations are usually interpreted as recombination influencing the efficacy of selection at removing TEs, accumulating findings suggest that TEs could instead be the cause rather than the consequence. To test this prediction, we formally investigated the influence of polymorphic, putatively active TEs on recombination rates. We developed and benchmarked a novel approach that uses PacBio long-read sequencing to efficiently, accurately, and cost-effectively identify crossovers (COs), a key recombination product, among large numbers of pooled recombinant individuals. By applying this approach to Drosophila strains with distinct TE insertion profiles, we found that polymorphic TEs, especially RNA-based TEs and TEs with local enrichment of repressive marks, reduce the occurrence of COs. Such an effect leads to different CO frequencies between homologous sequences with and without TEs, contributing to varying CO maps between individuals. The suppressive effect of TEs on CO is further supported by two orthogonal approaches: analyzing the distributions of COs in panels of recombinant inbred lines in relation to TE polymorphism, and applying marker-assisted estimations of CO frequencies to isogenic strains with and without transgenically inserted TEs. Our investigations reveal how the constantly changing mobilome can actively modify recombination landscapes, shaping genome evolution within and between species.
Affiliation(s)
- Yuheng Huang
- Department of Ecology and Evolutionary Biology, University of California, Irvine, CA, USA
| | - Yi Gao
- Department of Ecology and Evolutionary Biology, University of California, Irvine, CA, USA
| | - Kayla Ly
- Department of Ecology and Evolutionary Biology, University of California, Irvine, CA, USA
| | - Leila Lin
- Department of Ecology and Evolutionary Biology, University of California, Irvine, CA, USA
| | - Jan Paul Lambooij
- Center for Molecular Medicine, University Medical Center Utrecht, the Netherlands
| | | | - Aniek Janssen
- Center for Molecular Medicine, University Medical Center Utrecht, the Netherlands
| | - Kevin H.-C. Wei
- Department of Zoology, University of British Columbia, Canada
| | - Yuh Chwen G. Lee
- Department of Ecology and Evolutionary Biology, University of California, Irvine, CA, USA
| |
|
17
|
Marini S, Barquero A, Wadhwani AA, Bian J, Ruiz J, Boucher C, Prosperi M. OCTOPUS: Disk-based, Multiplatform, Mobile-friendly Metagenomics Classifier. bioRxiv 2024:2024.03.15.585215. [PMID: 38559026 PMCID: PMC10979967 DOI: 10.1101/2024.03.15.585215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Portable genomic sequencers such as Oxford Nanopore's MinION enable real-time applications in clinical and environmental health. However, there is a bottleneck in the downstream analytics when bioinformatics pipelines are unavailable, e.g., when cloud processing is unreachable due to the absence of an Internet connection, or when only low-end computing devices can be carried on site. Here we present platform-friendly software for portable metagenomic analysis of Nanopore data, the Oligomer-based Classifier of Taxonomic Operational and Pan-genome Units via Singletons (OCTOPUS). OCTOPUS is written in Java and reimplements several features of the popular Kraken2 and KrakenUniq software, with original components for improving metagenomics classification on incomplete/sampled reference databases, making it ideal for running on smartphones or tablets. OCTOPUS obtains sensitivity and precision comparable to Kraken2, while dramatically decreasing (4- to 16-fold) the false positive rate, and yielding high correlation on real-world data. OCTOPUS is available along with customized databases at https://github.com/DataIntellSystLab/OCTOPUS and https://github.com/Ruiz-HCI-Lab/OctopusMobile.
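For readers unfamiliar with oligomer-based classification, the following Python sketch shows the general idea of voting with k-mers that occur in a single reference taxon. It is an assumption-laden toy, not the OCTOPUS, Kraken2, or KrakenUniq algorithm (OCTOPUS itself is written in Java); the taxon names, reference sequences, and the `classify` helper are all hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = defaultdict(set)
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(taxon)
    return index


def classify(read, index, k):
    """Return the taxon with the most taxon-specific k-mer hits, or None."""
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer)
        if taxa and len(taxa) == 1:      # count only k-mers unique to one taxon
            votes[next(iter(taxa))] += 1
    if not votes:
        return None                      # no informative k-mer: leave unclassified
    return votes.most_common(1)[0][0]


# Hypothetical reference database with two made-up taxa.
references = {
    "taxon_A": "ACGTACGGTCAGTTGCAAGT",
    "taxon_B": "TTGACCATGGCATCGATCGA",
}
index = build_index(references, k=5)
print(classify("GGTCAGTTGC", index, k=5))
print(classify("CCCCCCCCCC", index, k=5))
```

Counting only taxon-specific k-mers is one simple way to limit false positives when the reference database is incomplete; real classifiers use more elaborate taxonomy-aware rules.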
Affiliation(s)
- Simone Marini
- Department of Epidemiology, University of Florida, Gainesville, USA
- Emerging Pathogens Institute, University of Florida, Gainesville, USA
| | - Alexander Barquero
- Department of Computer and Information Science and Engineering, University of Florida, USA
| | - Anisha Ashok Wadhwani
- Department of Computer and Information Science and Engineering, University of Florida, USA
| | - Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, University of Florida, USA
| | - Jaime Ruiz
- Department of Computer and Information Science and Engineering, University of Florida, USA
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, USA
| | - Mattia Prosperi
- Department of Epidemiology, University of Florida, Gainesville, USA
| |
|
18
|
Liu C, Wu P, Wu X, Zhao X, Chen F, Cheng X, Zhu H, Wang O, Xu M. AsmMix: an efficient haplotype-resolved hybrid de novo genome assembling pipeline. Front Genet 2024; 15:1421565. [PMID: 39130747 PMCID: PMC11310137 DOI: 10.3389/fgene.2024.1421565] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Accepted: 07/05/2024] [Indexed: 08/13/2024] Open
Abstract
Accurate haplotyping facilitates distinguishing allele-specific expression, identifying cis-regulatory elements, and characterizing genomic variations, which enables more precise investigations into the relationship between genotype and phenotype. Recent advances in third-generation single-molecule long-read and synthetic co-barcoded read sequencing techniques have harnessed long-range information to simplify the assembly graph and improve the assembled genomic sequence. However, it remains methodologically challenging to reconstruct complete haplotypes due to the high sequencing error rates of long reads and the limited capturing efficiency of co-barcoded reads. Here we present a pipeline, AsmMix, for generating both contiguous and accurate diploid genomes. It first assembles co-barcoded reads to generate accurate haplotype-resolved assemblies that may contain many gaps, while the long-read assembly is contiguous but susceptible to errors. The two assembly sets are then integrated into haplotype-resolved assemblies with reduced misassemblies. Through extensive evaluation on multiple synthetic datasets, AsmMix consistently demonstrates high precision and recall rates for haplotyping across diverse sequencing platforms, coverage depths, read lengths, and read accuracies, significantly outperforming other existing tools in the field. Furthermore, we validate the effectiveness of our pipeline using a human whole genome dataset (HG002), and produce highly contiguous, accurate, and haplotype-resolved assemblies. These assemblies are evaluated using the GIAB benchmarks, confirming the accuracy of variant calling. Our results demonstrate that AsmMix offers a straightforward yet highly efficient approach that effectively leverages both long reads and co-barcoded reads for haplotype-resolved assembly.
Affiliation(s)
- Chao Liu
- BGI, Tianjin, China
- BGI Research, Shenzhen, China
| | - Pei Wu
- BGI, Tianjin, China
- BGI Research, Shenzhen, China
| | - Xue Wu
- BGI Research, Shenzhen, China
| | | | | | | | - Hongmei Zhu
- BGI, Tianjin, China
- BGI Research, Shenzhen, China
| | - Ou Wang
- BGI Research, Shenzhen, China
| | - Mengyang Xu
- BGI Research, Shenzhen, China
- BGI Research, Qingdao, China
| |
|
19
|
Shao H, Ruan J. BSAlign: A Library for Nucleotide Sequence Alignment. Genomics Proteomics Bioinformatics 2024; 22:qzae025. [PMID: 39209796 PMCID: PMC12016559 DOI: 10.1093/gpbjnl/qzae025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2023] [Revised: 03/03/2024] [Accepted: 03/12/2024] [Indexed: 09/04/2024]
Abstract
Increasing the accuracy of nucleotide sequence alignment is an essential issue in genomics research. Although classic dynamic programming (DP) algorithms (e.g., Smith-Waterman and Needleman-Wunsch) are guaranteed to produce the optimal result, their time complexity hinders their application to large-scale sequence alignment. Optimization efforts that aim to accelerate the alignment process generally come from three perspectives: redesigning data structures [e.g., diagonal or striped Single Instruction Multiple Data (SIMD) implementations], increasing the degree of parallelism in SIMD operations (e.g., difference recurrence relation), or reducing the search space (e.g., banded DP). However, no existing method combines all three aspects to build an ultra-fast algorithm. In this study, we developed a Banded Striped Aligner (BSAlign) library that delivers accurate alignment results at an ultra-fast speed by knitting together a series of novel methods that take advantage of all three perspectives, with highlights such as an active F-loop in striped vectorization and a striped move in banded DP. We applied our new acceleration design to both regular and edit distance pairwise alignment. BSAlign achieved a 2-fold speed-up over other SIMD-based implementations for regular pairwise alignment, and a 1.5-fold to 4-fold speed-up in edit distance-based implementations for long reads. BSAlign is implemented in the C programming language and is available at https://github.com/ruanjue/bsalign.
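Of the three perspectives named in the abstract, search-space reduction by banded DP is the easiest to show in a few lines. The Python sketch below is a plain, scalar banded edit distance written for illustration; it is not BSAlign's implementation (which is in C and combines banding with striped SIMD), and the function name and `band` parameter are assumptions introduced here.

```python
def banded_edit_distance(a, b, band):
    """Edit distance computed only over DP cells with |i - j| <= band.

    Returns None when the length difference alone exceeds the band.
    The result is exact whenever the true edit distance is <= band;
    otherwise it is only an upper bound.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = float("inf")
    # Row 0: turning the empty prefix of a into b[:j] costs j insertions.
    prev = [j if j <= band else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        # For clarity each row is allocated at full width; a real
        # implementation would store only the cells inside the band.
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        if lo == 0:
            cur[0] = i                      # delete the first i characters of a
        for j in range(max(1, lo), hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # deletion from a
                         cur[j - 1] + 1)      # insertion into a
        prev = cur
    return prev[m]


print(banded_edit_distance("kitten", "sitting", band=3))
```

The inner loop visits at most 2 * band + 1 cells per row instead of all m + 1, which is where the speed-up of banding comes from.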
Affiliation(s)
- Haojing Shao
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
| | - Jue Ruan
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
| |
|
20
|
Gamaarachchi H, Ferguson JM, Samarakoon H, Liyanage K, Deveson IW. Simulation of nanopore sequencing signal data with tunable parameters. Genome Res 2024; 34:778-783. [PMID: 38692839 PMCID: PMC11216307 DOI: 10.1101/gr.278730.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Accepted: 04/24/2024] [Indexed: 05/03/2024]
Abstract
In silico simulation of high-throughput sequencing data is a technique used widely in the genomics field. However, there is currently a lack of effective tools for creating simulated data from nanopore sequencing devices, which measure DNA or RNA molecules in the form of time-series current signal data. Here, we introduce Squigulator, a fast and simple tool for simulation of realistic nanopore signal data. Squigulator takes a reference genome, a transcriptome, or read sequences, and generates corresponding raw nanopore signal data. This is compatible with basecalling software from Oxford Nanopore Technologies (ONT) and other third-party tools, thereby providing a useful substrate for development, testing, debugging, validation, and optimization at every stage of a nanopore analysis workflow. The user may generate data with preset parameters emulating specific ONT protocols or noise-free "ideal" data, or they may deterministically modify a range of experimental variables and/or noise parameters to shape the data to their needs. We present a brief example of Squigulator's use, creating simulated data to model the degree to which different parameters impact the accuracy of ONT basecalling and downstream variant detection. This analysis reveals new insights into the nature of ONT data and basecalling algorithms. We provide Squigulator as an open-source tool for the nanopore community.
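As a rough picture of what turning a sequence into time-series current data can look like, the Python sketch below maps each k-mer of a sequence to a mean current level and emits noisy samples around it. This is an assumed, simplified scheme, not Squigulator's actual model or interface; the pore model values, the `simulate_signal` function, and all of its parameters are hypothetical.

```python
import random
from itertools import product

# Hypothetical per-base current levels (arbitrary units), NOT a real ONT pore model.
BASE_LEVEL = {"A": 80.0, "C": 95.0, "G": 110.0, "T": 125.0}


def toy_pore_model(k):
    """Assign every k-mer a mean current equal to the average of its base levels."""
    return {"".join(p): sum(BASE_LEVEL[b] for b in p) / k
            for p in product("ACGT", repeat=k)}


def simulate_signal(seq, pore_model, k, samples_per_kmer=10, noise_sd=0.0, seed=0):
    """Emit samples_per_kmer Gaussian samples around each k-mer's mean current.

    noise_sd=0.0 yields noise-free "ideal" data; a fixed seed makes runs repeatable.
    """
    rng = random.Random(seed)
    signal = []
    for i in range(len(seq) - k + 1):
        mean_current = pore_model[seq[i:i + k]]
        for _ in range(samples_per_kmer):
            signal.append(rng.gauss(mean_current, noise_sd))
    return signal


model = toy_pore_model(3)
ideal = simulate_signal("ACGTAC", model, k=3, samples_per_kmer=4)
noisy = simulate_signal("ACGTAC", model, k=3, samples_per_kmer=4, noise_sd=2.0)
print(len(ideal), ideal[:4])
print(noisy[:4])
```

Exposing the noise level and dwell length as parameters mirrors, in a very reduced form, the tunable-parameter idea described in the abstract.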
Affiliation(s)
- Hasindu Gamaarachchi
- School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales 2052, Australia
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
| | - James M Ferguson
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
| | - Hiruna Samarakoon
- School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales 2052, Australia
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
| | - Kisaru Liyanage
- School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales 2052, Australia
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
| | - Ira W Deveson
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
- St Vincent's Clinical School, Faculty of Medicine, University of New South Wales, Sydney, New South Wales 2052, Australia
| |
|
21
|
Hämälä T, Moore C, Cowan L, Carlile M, Gopaulchan D, Brandrud MK, Birkeland S, Loose M, Kolář F, Koch MA, Yant L. Impact of whole-genome duplications on structural variant evolution in Cochlearia. Nat Commun 2024; 15:5377. [PMID: 38918389 PMCID: PMC11199601 DOI: 10.1038/s41467-024-49679-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2023] [Accepted: 06/14/2024] [Indexed: 06/27/2024] Open
Abstract
Polyploidy, the result of whole-genome duplication (WGD), is a major driver of eukaryote evolution. Yet WGDs are hugely disruptive mutations, and we still lack a clear understanding of their fitness consequences. Here, we study whether WGDs result in greater diversity of genomic structural variants (SVs) and how they influence evolutionary dynamics in a plant genus, Cochlearia (Brassicaceae). By using long-read sequencing and a graph-based pangenome, we find both negative and positive interactions between WGDs and SVs. Masking of recessive mutations due to WGDs leads to a progressive accumulation of deleterious SVs across four ploidal levels (from diploids to octoploids), likely reducing the adaptive potential of polyploid populations. However, we also discover putative benefits arising from SV accumulation, as more ploidy-specific SVs harbor signals of local adaptation in polyploids than in diploids. Together, our results suggest that SVs play diverse and contrasting roles in the evolutionary trajectories of young polyploids.
Affiliation(s)
- Tuomas Hämälä
- School of Life Sciences, University of Nottingham, Nottingham, UK
- Production Systems, Natural Resources Institute Finland, Jokioinen, Finland
| | | | - Laura Cowan
- School of Life Sciences, University of Nottingham, Nottingham, UK
| | - Matthew Carlile
- School of Life Sciences, University of Nottingham, Nottingham, UK
| | | | | | - Siri Birkeland
- Natural History Museum, University of Oslo, Oslo, Norway
- Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, Norway
| | - Matthew Loose
- School of Life Sciences, University of Nottingham, Nottingham, UK
| | - Filip Kolář
- Department of Botany, Faculty of Science, Charles University, Prague, Czech Republic
- Institute of Botany, Czech Academy of Sciences, Průhonice, Czech Republic
| | - Marcus A Koch
- Centre for Organismal Studies, University of Heidelberg, Heidelberg, Germany
| | - Levi Yant
- School of Life Sciences, University of Nottingham, Nottingham, UK
- Department of Botany, Faculty of Science, Charles University, Prague, Czech Republic
| |
|
22
|
Li X, Chen K, Shao M. Efficient Seeding for Error-Prone Sequences with SubseqHash2. bioRxiv 2024:2024.05.30.596711. [PMID: 38895288 PMCID: PMC11185578 DOI: 10.1101/2024.05.30.596711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
Seeding is an essential preparatory step for large-scale sequence comparisons. Substring-based seeding methods such as k-mers are ideal for sequences with low error rates but struggle to achieve high sensitivity while maintaining a reasonable precision for error-prone long reads. SubseqHash, a novel subsequence-based seeding method we recently developed, achieves superior accuracy to substring-based methods in seeding sequences with high mutation/error rates; its only drawback is its computation speed. In this paper, we propose SubseqHash2, an improved algorithm that can compute multiple sets of seeds in one run by defining k orders over all length-k subsequences and identifying the optimal subsequence under each of the k orders in a single dynamic programming framework. The algorithm is further accelerated using SIMD instructions. SubseqHash2 achieves a 10-50× speedup over repeating SubseqHash while maintaining the high accuracy of seeds. We demonstrate that SubseqHash2 drastically outperforms popular substring-based methods including k-mers, minimizers, syncmers, and strobemers for three fundamental applications. In read mapping, SubseqHash2 can generate adequate seed-matches for aligning hard reads that minimap2 fails on. In sequence alignment, SubseqHash2 achieves high coverage of correct seeds and low coverage of incorrect seeds. In overlap detection, seeds produced by SubseqHash2 lead to more correct overlapping pairs at the same false-positive rate. With all the algorithmic breakthroughs of SubseqHash2, we clear the path for the wide adoption of subsequence-based seeds in long-read analysis. SubseqHash2 is available at https://github.com/Shao-Group/SubseqHash2.
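The notion of a subsequence-based seed, the smallest length-k subsequence of a window under some order, can be made concrete with a brute-force Python sketch. This is only an illustration: it enumerates all subsequences instead of using the dynamic programming framework the abstract describes, and the salted SHA-256 ordering is an assumption swapped in here, not the order defined by SubseqHash or SubseqHash2. It is practical only for short windows.

```python
import hashlib
from itertools import combinations


def order_key(subseq, salt):
    """Pseudo-random total order over strings, induced by a salted hash (assumed)."""
    return hashlib.sha256((salt + subseq).encode()).hexdigest()


def min_subsequence_seed(window, k, salt="order0"):
    """Return the smallest length-k subsequence of window under order_key.

    Brute force over all C(len(window), k) index choices.
    """
    candidates = {"".join(window[i] for i in idx)
                  for idx in combinations(range(len(window)), k)}
    return min(candidates, key=lambda s: order_key(s, salt))


def kmer_set(seq, k):
    """All substrings of length k, for comparison with substring-based seeding."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


original = "ACGTTGCAAC"
mutated = "ACGTAGCAAC"   # one substitution in the middle (made-up example)

print(min_subsequence_seed(original, 6), min_subsequence_seed(mutated, 6))
print(len(kmer_set(original, 6) & kmer_set(mutated, 6)), "shared 6-mers")
```

Changing `salt` gives a different order and therefore a different seed for the same window, which loosely corresponds to computing several seed sets under several orders.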
Affiliation(s)
- Xiang Li
- Department of Computer Science and Engineering, The Pennsylvania State University, United States
| | - Ke Chen
- Department of Computer Science and Engineering, The Pennsylvania State University, United States
| | - Mingfu Shao
- Department of Computer Science and Engineering, The Pennsylvania State University, United States
- Huck Institutes of the Life Sciences, The Pennsylvania State University, United States
| |
|
23
|
Hu H, Gao R, Gao W, Gao B, Jiang Z, Zhou M, Wang G, Jiang T. SVDF: enhancing structural variation detect from long-read sequencing via automatic filtering strategies. Brief Bioinform 2024; 25:bbae336. [PMID: 38980375 PMCID: PMC11232458 DOI: 10.1093/bib/bbae336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Revised: 06/03/2024] [Accepted: 06/27/2024] [Indexed: 07/10/2024] Open
Abstract
Structural variation (SV) is an important form of genomic variation that influences gene function and expression by altering the structure of the genome. Although long-read data have been proven to better characterize SVs, SVs detected from noisy long-read data still include a considerable portion of false-positive calls. To accurately detect SVs in long-read data, we present SVDF, a method that employs a learning-based noise filtering strategy and an SV signature-adaptive clustering algorithm, for effectively reducing the likelihood of false-positive events. Benchmarking results from multiple orthogonal experiments demonstrate that, across different sequencing platforms and depths, SVDF achieves higher calling accuracy for each sample compared to several existing general SV calling tools. We believe that, with its meticulous and sensitive SV detection capability, SVDF can bring new opportunities and advancements to cutting-edge genomic research.
Affiliation(s)
- Heng Hu
- College of Life Sciences, Northeast Forestry University, Harbin 150000, China
- Runtian Gao
- College of Life Sciences, Northeast Forestry University, Harbin 150000, China
- Wentao Gao
- College of Life Sciences, Northeast Forestry University, Harbin 150000, China
- Bo Gao
- Department of Radiology, The Second Affiliated Hospital of Harbin Medical University, Harbin 150000, China
- Zhongjun Jiang
- College of Life Sciences, Northeast Forestry University, Harbin 150000, China
- Murong Zhou
- College of Life Sciences, Northeast Forestry University, Harbin 150000, China
- Guohua Wang
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150000, China
- State Key Laboratory of Tree Genetics and Breeding, Harbin 150000, China
- Tao Jiang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150000, China
|
24
|
Wang W, Li Y, Ko S, Feng N, Zhang M, Liu JJ, Zheng S, Ren B, Yu YP, Luo JH, Tseng GC, Liu S. IFDlong: an isoform and fusion detector for accurate annotation and quantification of long-read RNA-seq data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.11.593690. [PMID: 38798496 PMCID: PMC11118288 DOI: 10.1101/2024.05.11.593690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
Advancements in long-read transcriptome sequencing (long-RNA-seq) technology have revolutionized the study of isoform diversity. These full-length transcripts enhance the detection of various transcriptome structural variations, including novel isoforms, alternative splicing events, and fusion transcripts. By shifting the open reading frame or altering gene expressions, studies have proved that these transcript alterations can serve as crucial biomarkers for disease diagnosis and therapeutic targets. In this project, we proposed IFDlong, a bioinformatics and biostatistics tool to detect isoform and fusion transcripts using bulk or single-cell long-RNA-seq data. Specifically, the software performed gene and isoform annotation for each long-read, defined novel isoforms, quantified isoform expression by a novel expectation-maximization algorithm, and profiled the fusion transcripts. For evaluation, IFDlong pipeline achieved overall the best performance when compared with several existing tools in large-scale simulation studies. In both isoform and fusion transcript quantification, IFDlong is able to reach more than 0.8 Spearman's correlation with the truth, and more than 0.9 cosine similarity when distinguishing multiple alternative splicing events. In novel isoform simulation, IFDlong can successfully balance the sensitivity (higher than 90%) and specificity (higher than 90%). Furthermore, IFDlong has proved its accuracy and robustness in diverse in-house and public datasets on healthy tissues, cell lines and multiple types of diseases. Besides bulk long-RNA-seq, IFDlong pipeline has proved its compatibility to single-cell long-RNA-seq data. This new software may hold promise for significant impact on long-read transcriptome analysis. The IFDlong software is available at https://github.com/wenjiaking/IFDlong.
Affiliation(s)
- Wenjia Wang
- Department of Biostatistics, School of Public Health, University of Pittsburgh, Pittsburgh, PA
- Yuzhen Li
- Department of Surgery, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Sungjin Ko
- Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Ning Feng
- Department of Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Manling Zhang
- Department of Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Jia-Jun Liu
- Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Songyang Zheng
- Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Baoguo Ren
- Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Yan P. Yu
- Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Jian-Hua Luo
- Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Hillman Cancer Center, University of Pittsburgh Medical Center, Pittsburgh, PA
- George C. Tseng
- Department of Biostatistics, School of Public Health, University of Pittsburgh, Pittsburgh, PA
- Silvia Liu
- Department of Pathology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
- Pittsburgh Liver Research Center, University of Pittsburgh, Pittsburgh, PA
- Hillman Cancer Center, University of Pittsburgh Medical Center, Pittsburgh, PA
- Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA
|
25
|
Su Y, Yu Z, Jin S, Ai Z, Yuan R, Chen X, Xue Z, Guo Y, Chen D, Liang H, Liu Z, Liu W. Comprehensive assessment of mRNA isoform detection methods for long-read sequencing data. Nat Commun 2024; 15:3972. [PMID: 38730241 PMCID: PMC11087464 DOI: 10.1038/s41467-024-48117-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2022] [Accepted: 04/19/2024] [Indexed: 05/12/2024] Open
Abstract
The advancement of Long-Read Sequencing (LRS) techniques has significantly increased the length of sequencing to several kilobases, thereby facilitating the identification of alternative splicing events and isoform expressions. Recently, numerous computational tools for isoform detection using long-read sequencing data have been developed. Nevertheless, there remains a deficiency in comparative studies that systemically evaluate the performance of these tools, which are implemented with different algorithms, under various simulations that encompass potential influencing factors. In this study, we conducted a benchmark analysis of thirteen methods implemented in nine tools capable of identifying isoform structures from long-read RNA-seq data. We evaluated their performances using simulated data, which represented diverse sequencing platforms generated by an in-house simulator, RNA sequins (sequencing spike-ins) data, as well as experimental data. Our findings demonstrate IsoQuant as a highly effective tool for isoform detection with LRS, with Bambu and StringTie2 also exhibiting strong performance. These results offer valuable guidance for future research on alternative splicing analysis and the ongoing improvement of tools for isoform detection using LRS data.
Affiliation(s)
- Yaqi Su
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
- Zhejian Yu
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
- Siqian Jin
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
- Zhipeng Ai
- Division of Human Reproduction and Developmental Genetics, Women's Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310006, Zhejiang, China
- Ruihong Yuan
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
- Xinyi Chen
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
- Ziwei Xue
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
- Yixin Guo
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
- Di Chen
- Center for Reproductive Medicine of the Second Affiliated Hospital Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre for Regeneration and Cell Therapy of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
- Hongqing Liang
- Division of Human Reproduction and Developmental Genetics, Women's Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310006, Zhejiang, China
- Zuozhu Liu
- Zhejiang University-Angel Align Inc. R&D Center for Intelligent Healthcare, Zhejiang University-University of Illinois at Urbana-Champaign Institute (ZJU-UIUC Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
- Wanlu Liu
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China.
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China.
- Future Health Laboratory, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China.
- Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Zhejiang University, Hangzhou, 310058, Zhejiang, China.
|
26
|
Schulz T, Medvedev P. ESKEMAP: exact sketch-based read mapping. Algorithms Mol Biol 2024; 19:19. [PMID: 38704605 PMCID: PMC11069465 DOI: 10.1186/s13015-024-00261-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Accepted: 03/19/2024] [Indexed: 05/06/2024] Open
Abstract
BACKGROUND Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. RESULTS In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in $O(|t| + |p| + \ell^{2})$ time and $O(\ell \log \ell)$ space, where $|t|$ is the number of $k$-mers inside the sketch of the reference, $|p|$ is the number of $k$-mers inside the read's sketch and $\ell$ is the number of times that $k$-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm's performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2.
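The three quantities in the complexity bound above can be made concrete with a toy example. The following is a minimal sketch, assuming that a "sketch" is simply the ordered list of all k-mers of a sequence (a practical mapper would subsample them, for example with minimizers) and that repeated k-mers in the pattern are counted once. The helper names `kmer_sketch` and `count_occurrences` are hypothetical and are not taken from ESKEMAP.

```python
def kmer_sketch(seq: str, k: int) -> list[str]:
    """Toy sketch: every k-mer of seq, in order, with no subsampling (assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_occurrences(pattern_sketch: list[str], text_sketch: list[str]) -> int:
    """ell: how many entries of the text sketch are k-mers that also appear in the pattern sketch."""
    wanted = set(pattern_sketch)
    return sum(1 for kmer in text_sketch if kmer in wanted)


text = "ACGTACGTTT"   # stands in for the reference
read = "ACGTA"        # stands in for the read (pattern)

t = kmer_sketch(text, 3)         # |t| = 8
p = kmer_sketch(read, 3)         # |p| = 3
ell = count_occurrences(p, t)    # ell = 5
```

Because the time bound is quadratic in `ell` but only linear in `|t|` and `|p|`, repetitive regions (where the same k-mers recur many times in the text sketch) are where the running time would be expected to grow.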
Affiliation(s)
- Tizian Schulz
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany.
- Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University, Bielefeld, Germany.
- Graduate School "Digital Infrastructure for the Life Sciences" (DILS), Bielefeld University, Bielefeld, Germany.
- Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA.
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, USA.
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, USA.
|
27
|
Deng WJ, Li QQ, Shuai HN, Wu RX, Niu SF, Wang QH, Miao BB. Whole-Genome Sequencing Analyses Reveal the Evolution Mechanisms of Typical Biological Features of Decapterus maruadsi. Animals (Basel) 2024; 14:1202. [PMID: 38672351 PMCID: PMC11047736 DOI: 10.3390/ani14081202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2024] [Revised: 04/11/2024] [Accepted: 04/15/2024] [Indexed: 04/28/2024] Open
Abstract
Decapterus maruadsi is a typical representative of small pelagic fish characterized by fast growth rate, small body size, and high fecundity. It is a high-quality marine commercial fish with high nutritional value. However, the underlying genetics and genomics research focused on D. maruadsi is not comprehensive. Herein, a high-quality chromosome-level genome of a male D. maruadsi was assembled. The assembled genome length was 716.13 Mb with contig N50 of 19.70 Mb. Notably, we successfully anchored 95.73% contig sequences into 23 chromosomes with a total length of 685.54 Mb and a scaffold N50 of 30.77 Mb. A total of 22,716 protein-coding genes, 274.90 Mb repeat sequences, and 10,060 ncRNAs were predicted, among which 22,037 (97%) genes were successfully functionally annotated. The comparative genome analysis identified 459 unique, 73 expanded, and 52 contracted gene families. Moreover, 2804 genes were identified as candidates for positive selection, of which some that were related to the growth and development of bone, muscle, cardioid, and ovaries, such as some members of the TGF-β superfamily, were likely involved in the evolution of typical biological features in D. maruadsi. The study provides an accurate and complete chromosome-level reference genome for further genetic conservation, genomic-assisted breeding, and adaptive evolution research for D. maruadsi.
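For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of the usual definition: the largest length L such that sequences of length at least L together cover at least half of the assembly. The function name `n50` and the toy lengths are illustrative assumptions and are not data from the study.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("n50 is undefined for an empty list")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


toy_contigs = [2, 3, 4, 5, 6]      # total length 20
toy_n50 = n50(toy_contigs)         # 6 + 5 = 11 >= 10, so N50 is 5

# Fraction of predicted genes with a functional annotation, from the abstract's counts.
annotated_percent = round(100 * 22037 / 22716, 1)
```

The same function applies to scaffolds; only the input lengths change.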
Affiliation(s)
- Su-Fang Niu
- College of Fisheries, Guangdong Ocean University, Zhanjiang 524088, China; (W.-J.D.); (Q.-Q.L.); (H.-N.S.); (R.-X.W.); (Q.-H.W.); (B.-B.M.)
|
28
|
Holmes MJ, Mahjour B, Castro CP, Farnum GA, Diehl AG, Boyle AP. HaplotagLR: An efficient and configurable utility for haplotagging long reads. PLoS One 2024; 19:e0298688. [PMID: 38478504 PMCID: PMC10936807 DOI: 10.1371/journal.pone.0298688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Accepted: 01/30/2024] [Indexed: 03/17/2024] Open
Abstract
Understanding the functional effects of sequence variation is crucial in genomics. Individual human genomes contain millions of variants that contribute to phenotypic variability and disease risks at the population level. Because variants rarely act in isolation, we must consider potential interactions of neighboring variants to accurately predict functional effects. We can accomplish this using haplotagging, which matches sequencing reads to their parental haplotypes using alleles observed at known heterozygous variants. However, few published tools for haplotagging exist and these share several technical and usability-related shortcomings that limit applicability, in particular a lack of insight or control over error rates, and lack of key metrics on the underlying sources of haplotagging error. Here we present HaplotagLR: a user-friendly tool that haplotags long sequencing reads based on a multinomial model and existing phased variant lists. HaplotagLR is user-configurable and includes a basic error model to control the empirical FDR in its output. We show that HaplotagLR outperforms the leading haplotagging method in simulated datasets, especially at high levels of specificity, and displays 7% greater sensitivity in haplotagging real data. HaplotagLR advances both the immediate utility of haplotagging and paves the way for further improvements to this important method.
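The idea of matching a read to a parental haplotype from the alleles it carries at known heterozygous sites can be sketched as a log-likelihood ratio. The following is a deliberately simplified illustration and is not HaplotagLR's multinomial model: it assumes independent sites, a single per-base error rate, and errors spread evenly over the three other bases. The function `haplotag` and its parameters `error_rate` and `min_llr` are hypothetical.

```python
import math


def haplotag(read_alleles, hap1, hap2, error_rate=0.05, min_llr=2.0):
    """Assign a read to 'H1', 'H2' or 'unassigned' from alleles at phased het sites.

    read_alleles, hap1 and hap2 map a variant position to a base.
    Returns (label, log-likelihood ratio of H1 versus H2).
    """
    llr = 0.0
    for pos, base in read_alleles.items():
        if pos not in hap1 or pos not in hap2:
            continue  # site is not a phased heterozygous variant
        p1 = 1 - error_rate if base == hap1[pos] else error_rate / 3
        p2 = 1 - error_rate if base == hap2[pos] else error_rate / 3
        llr += math.log(p1) - math.log(p2)
    if llr >= min_llr:
        return "H1", llr
    if llr <= -min_llr:
        return "H2", llr
    return "unassigned", llr


hap1 = {100: "A", 200: "C", 300: "G"}
hap2 = {100: "T", 200: "G", 300: "A"}

label_a, llr_a = haplotag({100: "A", 200: "C", 300: "G"}, hap1, hap2)  # all sites support H1
label_b, llr_b = haplotag({100: "A", 200: "G"}, hap1, hap2)            # conflicting evidence
```

Raising `min_llr` leaves more reads unassigned in exchange for fewer wrong assignments, which is the kind of error-rate control the abstract describes in general terms.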
Affiliation(s)
- Monica J. Holmes
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Babak Mahjour
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, Michigan, United States of America
- Christopher P. Castro
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Gregory A. Farnum
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Adam G. Diehl
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Alan P. Boyle
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Human Genetics, University of Michigan, Ann Arbor, Michigan, United States of America
|
29
|
Jahshan Z, Yavits L. ViTAL: Vision TrAnsformer based Low coverage SARS-CoV-2 lineage assignment. Bioinformatics 2024; 40:btae093. [PMID: 38374486 PMCID: PMC10913383 DOI: 10.1093/bioinformatics/btae093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 02/04/2024] [Accepted: 02/18/2024] [Indexed: 02/21/2024] Open
Abstract
MOTIVATION Rapid spread of viral diseases such as Coronavirus disease 2019 (COVID-19) highlights an urgent need for efficient surveillance of virus mutation and transmission dynamics, which requires fast, inexpensive and accurate viral lineage assignment. The first two goals might be achieved through low-coverage whole-genome sequencing (LC-WGS) which enables rapid genome sequencing at scale and at reduced costs. Unfortunately, LC-WGS significantly diminishes the genomic details, rendering accurate lineage assignment very challenging. RESULTS We present ViTAL, a novel deep learning algorithm specifically designed to perform lineage assignment of low coverage-sequenced genomes. ViTAL utilizes a combination of MinHash for genomic feature extraction and Vision Transformer for fine-grain genome classification and lineage assignment. We show that ViTAL outperforms state-of-the-art tools across diverse coverage levels, reaching up to 87.7% lineage assignment accuracy at 1× coverage where state-of-the-art tools such as UShER and Kraken2 achieve the accuracy of 5.4% and 27.4% respectively. ViTAL achieves comparable accuracy results with up to 8× lower coverage than state-of-the-art tools. We explore ViTAL's ability to identify the lineages of novel genomes, i.e. genomes the Vision Transformer was not trained on. We show how ViTAL can be applied to preliminary phylogenetic placement of novel variants. AVAILABILITY AND IMPLEMENTATION The data underlying this article are available in https://github.com/zuherJahshan/vital and can be accessed with 10.5281/zenodo.10688110.
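MinHash, named above as the feature-extraction step, can be illustrated in isolation. The following is a generic MinHash sketch over k-mer sets and is not ViTAL's pipeline: the choice of BLAKE2b with a per-hash salt, the values `k=4` and `num_hashes=64`, and the function names are all assumptions made for the example.

```python
import hashlib


def kmers(seq: str, k: int) -> set[str]:
    """Set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq: str, k: int = 4, num_hashes: int = 64) -> list[int]:
    """For each of num_hashes salted hash functions, keep the minimum hash over all k-mers."""
    kmer_set = kmers(seq, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(), "big")
            for km in kmer_set))
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of positions where the minima agree; estimates Jaccard similarity of the k-mer sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


sig_ref = minhash_signature("ACGTACGTGGCATTAGC")
sig_same = minhash_signature("ACGTACGTGGCATTAGC")
sig_other = minhash_signature("TTTTTTTTTTTTTTTTT")
```

The signature has a fixed length regardless of genome size, which is what makes it usable as a fixed-size input to a downstream classifier.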
Affiliation(s)
- Zuher Jahshan
- EnICS Labs, Engineering Department, Bar-Ilan University, Ramat Gan, Tel Aviv 5290002, Israel
- Leonid Yavits
- EnICS Labs, Engineering Department, Bar-Ilan University, Ramat Gan, Tel Aviv 5290002, Israel
|
30
|
Zakeri M, Brown NK, Ahmed OY, Gagie T, Langmead B. Movi: a fast and cache-efficient full-text pangenome index. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.11.04.565615. [PMID: 37961660 PMCID: PMC10635132 DOI: 10.1101/2023.11.04.565615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
Efficient pangenome indexes are promising tools for many applications, including rapid classification of nanopore sequencing reads. Recently, a compressed-index data structure called the "move structure" was proposed as an alternative to other BWT-based indexes like the FM index and r-index. The move structure uniquely achieves both O(r) space and O(1)-time queries, where r is the number of runs in the pangenome BWT. We implemented Movi, an efficient tool for building and querying move-structure pangenome indexes. While Movi's index is larger than the r-index, it grows more slowly for pangenome references, as its size is exactly proportional to r, the number of runs in the BWT of the reference. Movi can compute sophisticated matching queries needed for classification - such as pseudo-matching lengths and backward search - at least ten times faster than the fastest available methods, and in some cases more than 30-fold faster. Movi achieves this speed by leveraging the move structure's strong locality of reference, incurring close to the minimum possible number of cache misses for queries against large pangenomes. We achieve still further speed improvements by using memory prefetching to attain a degree of latency hiding that would be difficult with other index structures like the r-index. Movi's fast constant-time query loop makes it well suited to real-time applications like adaptive sampling for nanopore sequencing, where decisions must be made in a small and predictable time interval.
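The parameter r used above, the number of runs in the BWT, can be computed directly for small inputs. The following is a minimal sketch that builds the BWT by sorting all rotations, which is only practical for toy strings and is not how Movi or any real index is constructed. The function names and the `$` terminator convention are assumptions for illustration.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations of text + '$' (toy construction)."""
    s = text + "$"  # '$' sorts before the letters used here
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """r: number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


banana_bwt = bwt("banana")                    # "annb$aa"
banana_runs = count_runs(banana_bwt)          # 5 runs: a, nn, b, $, aa
repeat_runs = count_runs(bwt("ACGT" * 8))     # highly repetitive input
```

In the second example the input has 32 characters but its BWT has only 5 runs, which illustrates why an index whose size is proportional to r can stay small on repetitive collections such as pangenomes.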
Affiliation(s)
- Mohsen Zakeri
- Department of Computer Science, Johns Hopkins University
- Omar Y. Ahmed
- Department of Computer Science, Johns Hopkins University
- Travis Gagie
- Faculty of Computer Science, Dalhousie University
- Ben Langmead
- Department of Computer Science, Johns Hopkins University
|
31
|
Ding L, Wu S, Hou Z, Li A, Xu Y, Feng H, Pan W, Ruan J. Improving error-correcting capability in DNA digital storage via soft-decision decoding. Natl Sci Rev 2024; 11:nwad229. [PMID: 38213525 PMCID: PMC10776348 DOI: 10.1093/nsr/nwad229] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 05/20/2023] [Revised: 08/03/2023] [Accepted: 08/15/2023] [Indexed: 01/13/2024] Open
Abstract
Error-correcting codes (ECCs) employed in the state-of-the-art DNA digital storage (DDS) systems suffer from a trade-off between error-correcting capability and the proportion of redundancy. To address this issue, in this study, we introduce a soft-decision decoding approach into DDS by proposing a DNA-specific error prediction model and a series of novel strategies. We demonstrate the effectiveness of our approach through a proof-of-concept DDS system based on Reed-Solomon (RS) code, named Derrick. Derrick shows significant improvement in error-correcting capability without involving additional redundancy in both in vitro and in silico experiments, using various sequencing technologies such as Illumina, PacBio and Oxford Nanopore Technology (ONT). Notably, in vitro experiments using ONT sequencing at a depth of 7× reveal that Derrick, compared with the traditional hard-decision decoding strategy, doubles the error-correcting capability of RS code, decreases the proportion of matrices with decoding failure by 229-fold, and amplifies the potential maximum storage volume by an impressive 32 388-fold. Also, Derrick surpasses 'state-of-the-art' DDS systems by comprehensively considering the information density and the minimum sequencing depth required for complete information recovery. Crucially, the soft-decision decoding strategy and key steps of Derrick are generalizable to other ECCs' decoding algorithms.
Affiliation(s)
- Lulu Ding
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Shigang Wu
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Zhihao Hou
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Guangdong Provincial Key Laboratory of Plant Molecular Breeding, State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, South China Agricultural University, Guangzhou 510642, China
- Alun Li
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Yaping Xu
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Hu Feng
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Weihua Pan
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Jue Ruan
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
|
32
|
Rajput J, Chandra G, Jain C. Co-linear chaining on pangenome graphs. Algorithms Mol Biol 2024; 19:4. [PMID: 38279113 PMCID: PMC11288099 DOI: 10.1186/s13015-024-00250-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 01/02/2024] [Indexed: 01/28/2024] Open
Abstract
Pangenome reference graphs are useful in genomics because they compactly represent the genetic diversity within a species, a capability that linear references lack. However, efficiently aligning sequences to these graphs with complex topology and cycles can be challenging. The seed-chain-extend based alignment algorithms use co-linear chaining as a standard technique to identify a good cluster of exact seed matches that can be combined to form an alignment. Recent works show how the co-linear chaining problem can be efficiently solved for acyclic pangenome graphs by exploiting their small width and how incorporating gap cost in the scoring function improves alignment accuracy. However, it remains open how to effectively generalize these techniques for general pangenome graphs which contain cycles. Here we present the first practical formulation and an exact algorithm for co-linear chaining on cyclic pangenome graphs. We rigorously prove the correctness and computational complexity of the proposed algorithm. We evaluate the empirical performance of our algorithm by aligning simulated long reads from the human genome to a cyclic pangenome graph constructed from 95 publicly available haplotype-resolved human genome assemblies. While the existing heuristic-based algorithms are faster, the proposed algorithm provides a significant advantage in terms of accuracy. Implementation: https://github.com/at-cg/PanAligner.
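Co-linear chaining is easiest to see on a linear reference. The following is a minimal quadratic-time sketch for that simpler setting only; it is not the graph algorithm of the paper and does not handle cycles. It assumes anchors are exact matches given as `(query_start, ref_start, length)`, requires chained anchors not to overlap, and uses an assumed gap cost equal to `gap_weight` times the difference in diagonal between consecutive anchors. The function name `chain_anchors` is hypothetical.

```python
def chain_anchors(anchors, gap_weight=0.5):
    """Return (best_score, best_chain) for co-linear chaining on a linear reference.

    anchors: list of (query_start, ref_start, length) exact matches.
    Score of a chain: sum of anchor lengths minus gap_weight * diagonal shift between neighbours.
    """
    if not anchors:
        return 0.0, []
    anchors = sorted(anchors)                     # by query start, then reference start
    score = [float(a[2]) for a in anchors]        # best chain score ending at each anchor
    prev = [-1] * len(anchors)                    # back-pointers for traceback
    for j, (qj, rj, lj) in enumerate(anchors):
        for i in range(j):
            qi, ri, li = anchors[i]
            # Anchor i may precede j only if it ends before j starts on both sequences.
            if qi + li <= qj and ri + li <= rj:
                gap = abs((qj - qi) - (rj - ri))
                candidate = score[i] + lj - gap_weight * gap
                if candidate > score[j]:
                    score[j] = candidate
                    prev[j] = i
    best = max(range(len(anchors)), key=score.__getitem__)
    best_score = score[best]
    chain = []
    while best != -1:
        chain.append(anchors[best])
        best = prev[best]
    return best_score, chain[::-1]


seeds = [(0, 100, 10), (12, 112, 10), (30, 500, 10), (25, 125, 10)]
best_score, best_chain = chain_anchors(seeds)
```

Here the three anchors on the same diagonal chain together with score 30, while the off-diagonal anchor at reference position 500 is left out because the gap penalty outweighs its length.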
Affiliation(s)
- Jyotshna Rajput
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, 560012, Karnataka, India
- Ghanshyam Chandra
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, 560012, Karnataka, India
- Chirag Jain
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, 560012, Karnataka, India.
|
33
|
Wei ZG, Zhang XD, Fan XG, Qian Y, Liu F, Wu FX. pathMap: a path-based mapping tool for long noisy reads with high sensitivity. Brief Bioinform 2024; 25:bbae107. [PMID: 38517696 PMCID: PMC10959152 DOI: 10.1093/bib/bbae107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2023] [Revised: 12/25/2023] [Accepted: 02/28/2024] [Indexed: 03/24/2024] Open
Abstract
With the rapid development of single-molecule sequencing (SMS) technologies, the output read length is continuously increasing. Mapping such reads onto a reference genome is one of the most fundamental tasks in sequence analysis. Mapping sensitivity is becoming a major concern, since a highly sensitive mapper detects more aligned regions on the reference and obtains more aligned bases, which are useful for downstream analysis. In this study, we present pathMap, a novel k-mer graph-based mapper that is specifically designed for mapping SMS reads with high sensitivity. By viewing the alignment chain as a path containing as many anchors as possible in the matched k-mer graph, pathMap treats chaining as a path selection problem in the directed graph. By iteratively searching for the longest path among the remaining nodes, pathMap can effectively detect and align more high-quality candidate chains. Compared to other state-of-the-art mapping methods such as minimap2 and Winnowmap2, experimental results on simulated and real-life datasets demonstrate that pathMap obtains at least 11.50% more mapped chains than its closest competitor and increases the mapping sensitivity by 17.28% and 13.84% of bases over the next-best mapper for Pacific Biosciences and Oxford Nanopore sequencing data, respectively. In addition, pathMap is more robust to sequence errors and more sensitive to species- and strain-specific identification of pathogens using MinION reads.
Affiliation(s)
- Ze-Gang Wei
  - School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
  - Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Xiao-Dan Zhang
  - School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Xing-Guo Fan
  - School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Yu Qian
  - School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Fei Liu
  - School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China
- Fang-Xiang Wu
  - Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada

34
Chu J, Rong J, Feng X, Li H. ntsm: an alignment-free, ultra-low-coverage, sequencing technology agnostic, intraspecies sample comparison tool for sample swap detection. Gigascience 2024; 13:giae024. [PMID: 38832466 PMCID: PMC11148594 DOI: 10.1093/gigascience/giae024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Revised: 02/13/2024] [Accepted: 04/30/2024] [Indexed: 06/05/2024]
Abstract
BACKGROUND Due to human error, sample swapping remains a common issue in large cohort studies with heterogeneous data types (e.g., a mix of Oxford Nanopore Technologies, Pacific Biosciences, and Illumina data). At present, all sample swapping detection methods require costly and sometimes unnecessary (e.g., if data are only used for genome assembly) alignment, positional sorting, and indexing of the data in order to compare similarity. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important. FINDINGS The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio-based test. Per-sample error rate and coverage bias (i.e., missing sites) can also be estimated with this information, which can be used to determine whether a spatially indexed principal component analysis (PCA)-based prescreening method can be used; such prescreening can greatly speed up analysis by preventing exhaustive all-to-all comparisons. CONCLUSIONS Because this tool processes raw data, is faster than alignment, and can be used on very low-coverage data, it can save an immense amount of computational resources in standard quality control (QC) pipelines. It is robust enough to be used on different sequencing data types, which is important in studies that leverage the strengths of different sequencing technologies. In addition to its primary use case of sample swap detection, this method also provides information useful in QC, such as error rate and coverage bias, as well as population-level PCA ancestry analysis visualization.
Affiliation(s)
- Justin Chu
  - Dana-Farber Cancer Institute, Department of Data Sciences, Boston, MA 02215, USA
  - Harvard Medical School, Department of Biomedical Informatics, Boston, MA 02115, USA
- Jiazhen Rong
  - Genomics and Computational Biology Graduate Program, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Xiaowen Feng
  - Dana-Farber Cancer Institute, Department of Data Sciences, Boston, MA 02215, USA
  - Harvard Medical School, Department of Biomedical Informatics, Boston, MA 02115, USA
- Heng Li
  - Dana-Farber Cancer Institute, Department of Data Sciences, Boston, MA 02215, USA
  - Harvard Medical School, Department of Biomedical Informatics, Boston, MA 02115, USA

35
Constantinides B, Hunt M, Crook DW. Hostile: accurate decontamination of microbial host sequences. Bioinformatics 2023; 39:btad728. [PMID: 38039142 PMCID: PMC10749771 DOI: 10.1093/bioinformatics/btad728] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 07/27/2023] [Revised: 11/11/2023] [Accepted: 11/29/2023] [Indexed: 12/03/2023]
Abstract
MOTIVATION Microbial sequences generated from clinical samples are often contaminated with human host sequences that must be removed for ethical and legal reasons. Care must be taken to excise host sequences without inadvertently removing target microbial sequences to the detriment of downstream analyses such as variant calling and de novo assembly. RESULTS To facilitate accurate host decontamination of both short and long sequencing reads, we developed Hostile, a tool capable of accurate host read removal using a laptop. We demonstrate that our approach removes at least 99.6% of real human reads and retains at least 99.989% of simulated bacterial reads. Using Hostile with a masked reference genome further increases bacterial read retention (≥99.997%) with negligible (≤0.001%) reduction in human read removal performance. Compared with an existing tool, Hostile removes 21%-23% more human short reads and 21-43 times fewer bacterial reads, typically in less time. AVAILABILITY AND IMPLEMENTATION Hostile is implemented as an MIT-licensed Python package available from https://github.com/bede/hostile together with supplementary material.
Affiliation(s)
- Bede Constantinides
  - NDM Experimental Medicine, University of Oxford, John Radcliffe Hospital, Oxfordshire OX3 9DU, United Kingdom
  - The National Institute for Health Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance, University of Oxford, John Radcliffe Hospital, Oxfordshire OX3 9DU, United Kingdom
- Martin Hunt
  - NDM Experimental Medicine, University of Oxford, John Radcliffe Hospital, Oxfordshire OX3 9DU, United Kingdom
  - EMBL-EBI, Wellcome Genome Campus, Cambridgeshire CB10 1SD, United Kingdom
- Derrick W Crook
  - NDM Experimental Medicine, University of Oxford, John Radcliffe Hospital, Oxfordshire OX3 9DU, United Kingdom
  - The National Institute for Health Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance, University of Oxford, John Radcliffe Hospital, Oxfordshire OX3 9DU, United Kingdom
  - The National Institute for Health Research Oxford Biomedical Research Centre, University of Oxford, John Radcliffe Hospital, Oxfordshire OX3 9DU, United Kingdom

36
Wei ZG, Bu PY, Zhang XD, Liu F, Qian Y, Wu FX. invMap: a sensitive mapping tool for long noisy reads with inversion structural variants. Bioinformatics 2023; 39:btad726. [PMID: 38058196 PMCID: PMC11320709 DOI: 10.1093/bioinformatics/btad726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 11/02/2023] [Accepted: 12/05/2023] [Indexed: 12/08/2023]
Abstract
MOTIVATION Longer reads produced by PacBio or Oxford Nanopore sequencers could more frequently span the breakpoints of structural variations (SVs) than shorter reads. Therefore, existing long-read mapping methods often generate wrong alignments and variant calls. Compared to deletions and insertions, inversion events are more difficult to detect, since the anchors in inversion regions are nonlinear to those in SV-free regions. To address this issue, this study presents a novel long-read mapping algorithm (named invMap). RESULTS For each long noisy read, invMap first locates the aligned region with a specifically designed scoring method for chaining, then checks the remaining anchors in the aligned region to discover potential inversions. We benchmark invMap on simulated datasets across different genomes and sequencing coverages; experimental results demonstrate that invMap locates aligned regions and calls inversion SVs more accurately than the competing methods. The real human genome sequencing dataset of NA12878 illustrates that invMap can effectively find more candidate variant calls for inversions than the competing methods. AVAILABILITY AND IMPLEMENTATION The invMap software is available at https://github.com/zhang134/invMap.git.
Affiliation(s)
- Ze-Gang Wei
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
  - Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Peng-Yu Bu
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
- Xiao-Dan Zhang
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
- Fei Liu
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
- Yu Qian
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
- Fang-Xiang Wu
  - Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada

37
Magi A, Mattei G, Mingrino A, Caprioli C, Ronchini C, Frigè G, Semeraro R, Baragli M, Bolognini D, Colombo E, Mazzarella L, Pelicci PG. GASOLINE: detecting germline and somatic structural variants from long-reads data. Sci Rep 2023; 13:20817. [PMID: 38012350 PMCID: PMC10682169 DOI: 10.1038/s41598-023-48285-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 11/24/2023] [Indexed: 11/29/2023]
Abstract
Long-read sequencing allows analyses of single nucleic-acid molecules and produces sequences on the order of tens to hundreds of kilobases. Its application to whole-genome analyses allows identification of complex genomic structural variants (SVs) with unprecedented resolution. SV identification, however, requires complex computational methods, based on either read-depth or intra- and inter-alignment signature approaches, which are limited by the size or type of SVs. Moreover, most currently available tools only detect germline variants, thus requiring separate computation of sample pairs for comparative analyses. To overcome these limits, we developed a novel tool (Germline And SOmatic structuraL varIants detectioN and gEnotyping; GASOLINE) that groups SV signatures using a sophisticated clustering procedure based on a modified reciprocal overlap criterion, and is designed to identify germline SVs from single samples and somatic SVs from paired test and control samples. GASOLINE is a collection of Perl, R and Fortran codes; it analyzes aligned data in BAM format and produces VCF files with statistically significant somatic SVs. Germline or somatic analysis of 30× sequencing coverage experiments requires 4-5 h with 20 threads. GASOLINE outperformed currently available methods in the detection of both germline and somatic SVs in synthetic and real long-reads datasets. Notably, when applied to a pair of metastatic melanoma and matched-normal samples, GASOLINE identified five genuine somatic SVs that were missed using five different sequencing technologies and state-of-the-art SV calling approaches. Thus, GASOLINE identifies germline and somatic SVs with unprecedented accuracy and resolution, outperforming currently available state-of-the-art WGS long-reads computational methods.
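The abstract above mentions clustering SV signatures by a modified reciprocal overlap criterion but does not define the modification. As a point of reference only, the following minimal sketch shows the plain (unmodified) reciprocal overlap between two half-open intervals; the function name and the 0.5 threshold in the usage lines are illustrative assumptions, not GASOLINE's actual implementation.

```python
def reciprocal_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Plain reciprocal overlap of two half-open intervals [start, end).

    Returns the smaller of the two fractions (overlap / length of A) and
    (overlap / length of B), or 0.0 if the intervals do not overlap.
    This is the standard criterion, not the modified one used by GASOLINE.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


# Hypothetical usage: treat two SV signatures as the same event when their
# reciprocal overlap reaches an assumed threshold of 0.5.
same_event = reciprocal_overlap(100, 200, 150, 250) >= 0.5
```

Taking the minimum of the two fractions keeps a small event nested inside a much larger one from being merged with it, which is the usual reason for preferring reciprocal over one-sided overlap.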
Affiliation(s)
- Alberto Magi
  - Department of Information Engineering, University of Florence, 50100, Florence, Italy
  - Institute for Biomedical Technologies, National Research Council, Segrate, Milan, Italy
- Gianluca Mattei
  - Department of Information Engineering, University of Florence, 50100, Florence, Italy
- Alessandra Mingrino
  - Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy
- Chiara Caprioli
  - Department of Experimental Oncology, IEO European Institute of Oncology IRCCS, Milan, Italy
  - Department of Oncology and Hemato-Oncology, University of Milan, Milan, Italy
- Chiara Ronchini
  - Department of Experimental Oncology, IEO European Institute of Oncology IRCCS, Milan, Italy
- Gianmaria Frigè
  - Department of Experimental Oncology, IEO European Institute of Oncology IRCCS, Milan, Italy
  - Department of Oncology and Hemato-Oncology, University of Milan, Milan, Italy
- Roberto Semeraro
  - Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy
- Marta Baragli
  - Department of Information Engineering, University of Florence, 50100, Florence, Italy
- Davide Bolognini
  - Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy
- Emanuela Colombo
  - Department of Experimental Oncology, IEO European Institute of Oncology IRCCS, Milan, Italy
  - Department of Oncology and Hemato-Oncology, University of Milan, Milan, Italy
- Luca Mazzarella
  - Department of Experimental Oncology, IEO European Institute of Oncology IRCCS, Milan, Italy
- Pier Giuseppe Pelicci
  - Department of Experimental Oncology, IEO European Institute of Oncology IRCCS, Milan, Italy
  - Department of Oncology and Hemato-Oncology, University of Milan, Milan, Italy

38
Chandra G, Jain C. Gap-Sensitive Colinear Chaining Algorithms for Acyclic Pangenome Graphs. J Comput Biol 2023; 30:1182-1197. [PMID: 37902967 DOI: 10.1089/cmb.2023.0186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
A pangenome graph can serve as a better reference for genomic studies because it allows a compact representation of multiple genomes within a species. Aligning sequences to a graph is critical for pangenome-based resequencing. The seed-chain-extend heuristic works by finding short exact matches between a sequence and a graph. In this heuristic, colinear chaining helps identify a good cluster of exact matches that can be combined to form an alignment. Colinear chaining algorithms have been extensively studied for aligning two sequences with various gap costs, including linear, concave, and convex cost functions. However, extending these algorithms for sequence-to-graph alignment presents significant challenges. Recently, Makinen et al. introduced a sparse dynamic programming framework that exploits the small path cover property of acyclic pangenome graphs, enabling efficient chaining. However, this framework does not consider gap costs, limiting its practical effectiveness. We address this limitation by developing novel problem formulations and provably good chaining algorithms that support a variety of gap cost functions. These functions are carefully designed to enable fast chaining algorithms whose time requirements are parameterized in terms of the size of the minimum path cover. Through an empirical evaluation, we demonstrate the superior performance of our algorithm compared with existing aligners. When mapping simulated long reads to a pangenome graph comprising 95 human haplotypes, we achieved 98.7% precision while leaving <2% of reads unmapped.
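To make the chaining step concrete, here is a minimal sketch of colinear chaining between two linear sequences with a linear gap cost, using a simple quadratic-time dynamic program. It is not the sparse, path-cover-based graph algorithm described in the abstract; the anchor representation `(query_start, ref_start, length)`, the function name, and the `gap_weight` parameter are assumptions made for illustration.

```python
from typing import List, Tuple

Anchor = Tuple[int, int, int]  # (query_start, ref_start, length) of an exact match


def chain_anchors(anchors: List[Anchor], gap_weight: float = 0.5) -> Tuple[float, List[Anchor]]:
    """Return the best-scoring colinear chain and its score.

    Score of a chain = sum of anchor lengths minus gap_weight times the
    difference between the query gap and the reference gap of each
    consecutive anchor pair (a simple linear gap cost).
    """
    if not anchors:
        return 0.0, []
    order = sorted(anchors)  # any valid predecessor starts earlier on the query
    n = len(order)
    score = [float(a[2]) for a in order]  # best chain score ending at each anchor
    prev = [-1] * n                       # back-pointers for traceback
    for j in range(n):
        qj, rj, lj = order[j]
        for i in range(j):
            qi, ri, li = order[i]
            # Anchor i may precede j only if it ends before j starts on both sequences.
            if qi + li <= qj and ri + li <= rj:
                gap = abs((qj - (qi + li)) - (rj - (ri + li)))
                candidate = score[i] + lj - gap_weight * gap
                if candidate > score[j]:
                    score[j] = candidate
                    prev[j] = i
    best = max(range(n), key=score.__getitem__)
    chain = []
    k = best
    while k != -1:
        chain.append(order[k])
        k = prev[k]
    chain.reverse()
    return score[best], chain


# Hypothetical anchors: three on one diagonal and one off-diagonal outlier.
best_score, best_chain = chain_anchors([(0, 0, 10), (12, 12, 10), (25, 40, 10), (24, 24, 10)])
```

In this example the off-diagonal anchor `(25, 40, 10)` is excluded because joining it would incur a large gap penalty, which illustrates why gap-sensitive scoring tends to select cleaner chains.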
Affiliation(s)
- Ghanshyam Chandra
  - Department of Computational and Data Sciences, Indian Institute of Science Bengaluru, India
- Chirag Jain
  - Department of Computational and Data Sciences, Indian Institute of Science Bengaluru, India

39
Guo Y, Feng X, Li H. Evaluation of haplotype-aware long-read error correction with hifieval. Bioinformatics 2023; 39:btad631. [PMID: 37851384 PMCID: PMC10612404 DOI: 10.1093/bioinformatics/btad631] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/05/2023] [Revised: 09/18/2023] [Accepted: 10/17/2023] [Indexed: 10/19/2023]
Abstract
SUMMARY The PacBio High-Fidelity (HiFi) sequencing technology produces long reads of >99% in accuracy. It has enabled the development of a new generation of de novo sequence assemblers, which all have sequencing error correction (EC) as the first step. As HiFi is a new data type, this critical step has not been evaluated before. Here, we introduced hifieval, a new command-line tool for measuring over- and under-corrections produced by EC algorithms. We assessed the accuracy of the EC components of existing HiFi assemblers on the CHM13 and the HG002 datasets and further investigated the performance of EC methods in challenging regions such as homopolymer regions, centromeric regions, and segmental duplications. Hifieval will help HiFi assemblers to improve EC and assembly quality in the long run. AVAILABILITY AND IMPLEMENTATION The source code is available at https://github.com/magspho/hifieval.
Affiliation(s)
- Yujie Guo
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215, United States
- Xiaowen Feng
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States
- Heng Li
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215, United States

40
Zhang Y, Lu HW, Ruan J. GAEP: a comprehensive genome assembly evaluating pipeline. J Genet Genomics 2023; 50:747-754. [PMID: 37245652 DOI: 10.1016/j.jgg.2023.05.009] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 05/09/2023] [Revised: 05/19/2023] [Accepted: 05/23/2023] [Indexed: 05/30/2023]
Abstract
With the rapid development of sequencing technologies, especially the maturity of third-generation sequencing technologies, there has been a significant increase in the number and quality of published genome assemblies. The emergence of these high-quality genomes has raised higher requirements for genome evaluation. Although numerous computational methods have been developed to evaluate assembly quality from various perspectives, the selective use of these evaluation methods can be arbitrary and inconvenient for fairly comparing the assembly quality. To address this issue, we have developed the Genome Assembly Evaluating Pipeline (GAEP), which provides a comprehensive assessment pipeline for evaluating genome quality from multiple perspectives, including continuity, completeness, and correctness. Additionally, GAEP includes new functions for detecting misassemblies and evaluating the assembly redundancy, which performs well in our testing. GAEP is publicly available at https://github.com/zy-optimistic/GAEP under the GPL3.0 License. With GAEP, users can quickly obtain accurate and reliable evaluation results, facilitating the comparison and selection of high-quality genome assemblies.
Affiliation(s)
- Yong Zhang
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Hong-Wei Lu
  - State Key Laboratory of Rice Biology and Breeding, China National Rice Research Institute, Chinese Academy of Agricultural Sciences, Hangzhou, Zhejiang 311401, China
- Jue Ruan
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China

41
Poszewiecka B, Gogolewski K, Karolak JA, Stankiewicz P, Gambin A. PhaseDancer: a novel targeted assembler of segmental duplications unravels the complexity of the human chromosome 2 fusion going from 48 to 46 chromosomes in hominin evolution. Genome Biol 2023; 24:205. [PMID: 37697406 PMCID: PMC10496407 DOI: 10.1186/s13059-023-03022-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2022] [Accepted: 07/25/2023] [Indexed: 09/13/2023]
Abstract
Resolving complex genomic regions rich in segmental duplications (SDs) is challenging due to the high error rate of long-read sequencing. Here, we describe a targeted approach with a novel genome assembler, PhaseDancer, that iteratively extends SD-rich regions of interest. We validate its robustness and efficiency using a gold-standard set of human BAC clones and in silico-generated SDs with predefined evolutionary scenarios. PhaseDancer enables extension of the incomplete complex SD-rich subtelomeric regions of Great Ape chromosomes orthologous to the human chromosome 2 (HSA2) fusion site, informing a model of HSA2 formation and unravelling the evolution of human and Great Ape genomes.
Affiliation(s)
- Barbara Poszewiecka
  - Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland
- Krzysztof Gogolewski
  - Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland
- Justyna A. Karolak
  - Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, 77030 Houston, TX USA
  - Chair and Department of Genetics and Pharmaceutical Microbiology, Poznan University of Medical Sciences, 60-806 Poznan, Poland
- Paweł Stankiewicz
  - Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, 77030 Houston, TX USA
- Anna Gambin
  - Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland

42
Ayad LAK, Chikhi R, Pissis SP. Seedability: optimizing alignment parameters for sensitive sequence comparison. Bioinformatics Advances 2023; 3:vbad108. [PMID: 37621456 PMCID: PMC10444664 DOI: 10.1093/bioadv/vbad108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Revised: 08/02/2023] [Accepted: 08/10/2023] [Indexed: 08/26/2023]
Abstract
MOTIVATION Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences. RESULTS The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments. AVAILABILITY AND IMPLEMENTATION https://github.com/lorrainea/Seedability (distributed under GPL v3.0).
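The dependence of seeding sensitivity on k can be illustrated with a small sketch that counts the distinct exact k-mers two sequences share for different values of k. This is only an illustration of the seed concept; it is not Seedability's estimation procedure, and the function names and example sequences are hypothetical.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of distinct k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_seed_count(a: str, b: str, k: int) -> int:
    """Count distinct exact k-mers present in both sequences (candidate seeds)."""
    return len(kmers(a, k) & kmers(b, k))


# Two hypothetical sequences that differ by a single substitution.
seq_a = "ACGTACGTGA"
seq_b = "ACGTTCGTGA"
seeds_k4 = shared_seed_count(seq_a, seq_b, 4)  # short seeds survive the mismatch
seeds_k6 = shared_seed_count(seq_a, seq_b, 6)  # every 6-mer spans the mismatch here
```

With these toy sequences a shorter k still yields shared seeds while a longer k yields none, which is the kind of trade-off a parameter-selection method has to balance against the higher rate of spurious hits at small k.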
Affiliation(s)
- Lorraine A K Ayad
  - Department of Computer Science, Brunel University London, London UB8 3PH, UK
- Rayan Chikhi
  - G5 Sequence Bioinformatics, Institut Pasteur, Université Paris Cité, 75015 Paris, France
- Solon P Pissis
  - Networks & Optimization, CWI, 1098 XG Amsterdam, The Netherlands
  - Department of Computer Science, Vrije Universiteit, 1081 HV Amsterdam, The Netherlands

43
Yang X, Wang X, Zou Y, Zhang S, Xia M, Fu L, Vollger MR, Chen NC, Taylor DJ, Harvey WT, Logsdon GA, Meng D, Shi J, McCoy RC, Schatz MC, Li W, Eichler EE, Lu Q, Mao Y. Characterization of large-scale genomic differences in the first complete human genome. Genome Biol 2023; 24:157. [PMID: 37403156 PMCID: PMC10320979 DOI: 10.1186/s13059-023-02995-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 11/14/2022] [Accepted: 06/23/2023] [Indexed: 07/06/2023]
Abstract
BACKGROUND The first telomere-to-telomere (T2T) human genome assembly (T2T-CHM13) release is a milestone in human genomics. The T2T-CHM13 genome assembly extends our understanding of telomeres, centromeres, segmental duplication, and other complex regions. The current human genome reference (GRCh38) has been widely used in various human genomic studies. However, the large-scale genomic differences between these two important genome assemblies have not yet been characterized in detail. RESULTS Here, in addition to the previously reported "non-syntenic" regions, we find 67 additional large-scale discrepant regions and precisely categorize them into four structural types with a newly developed website tool called SynPlotter. The discrepant regions (~21.6 Mbp), excluding telomeric and centromeric regions, are highly structurally polymorphic in humans, where the deletions or duplications are likely associated with various human diseases, such as immune and neurodevelopmental disorders. The analyses of a newly identified discrepant region, the KLRC gene cluster, show that the depletion of KLRC2 by a single-deletion event is associated with natural killer cell differentiation in ~20% of humans. Meanwhile, the rapid amino acid replacements observed within KLRC3 are probably a result of natural selection in primate evolution. CONCLUSION Our study provides a foundation for understanding the large-scale structural genomic differences between the two crucial human reference genomes, and is thereby important for future human genomics studies.
Collapse
Affiliation(s)
- Xiangyu Yang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Xuankai Wang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Yawen Zou
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Shilong Zhang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Manying Xia
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Lianting Fu
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Dylan J Taylor
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Dan Meng
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Junfeng Shi
- Shanghai Engineering Research Center of Advanced Dental Technology and Materials, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai Ninth People's Hospital, College of Stomatology, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Rajiv C McCoy
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Weidong Li
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Qing Lu
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Yafei Mao
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China.
- Shanghai Key Laboratory of Stomatology, Shanghai Ninth People's Hospital, College of Stomatology, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
| |
|
44
|
Ahmed O, Rossi M, Boucher C, Langmead B. Efficient taxa identification using a pangenome index. Genome Res 2023; 33:1069-1077. [PMID: 37258301 PMCID: PMC10538492 DOI: 10.1101/gr.277642.123] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 01/04/2023] [Accepted: 05/22/2023] [Indexed: 06/02/2023]
Abstract
Tools that classify sequencing reads against a database of reference sequences require efficient index data-structures. The r-index is a compressed full-text index that answers substring presence/absence, count, and locate queries in space proportional to the amount of distinct sequence in the database: [Formula: see text] space, where r is the number of Burrows-Wheeler runs. To date, the r-index has lacked the ability to quickly classify matches according to which reference sequences (or sequence groupings, i.e., taxa) a match overlaps. We present new algorithms and methods for solving this problem. Specifically, given a collection D of d documents, [Formula: see text] over an alphabet of size σ, we extend the r-index with [Formula: see text] additional words to support document listing queries for a pattern [Formula: see text] that occurs in [Formula: see text] documents in D in [Formula: see text] time and [Formula: see text] space, where w is the machine word size. Applied in a bacterial mock community experiment, our method is up to three times faster than a comparable method that uses the standard r-index locate queries. We show that our method classifies both simulated and real nanopore reads at the strain level with higher accuracy compared with other approaches. Finally, we present strategies for compacting this structure in applications in which read lengths or match lengths can be bounded.
Affiliation(s)
- Omar Ahmed
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA;
| | - Massimiliano Rossi
- Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, Florida 32611, USA
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, Florida 32611, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
| |
|
45
|
Li X, Shi Q, Chen K, Shao M. Seeding with minimized subsequence. Bioinformatics 2023; 39:i232-i241. [PMID: 37387132 DOI: 10.1093/bioinformatics/btad218] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION: Modern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly, etc.) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to handle the ever-growing large-scale data. Seeding methods using kmers (substrings of length k) have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates as kmers cannot tolerate errors. RESULTS: We propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order over all length-k strings. Finding the smallest subsequence of a string by enumeration is impractical as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and the probability of hash collision using the ABC order is close to the Jaccard index. We then show that SubseqHash overwhelmingly outperforms the substring-based seeding methods in producing high-quality seed-matches for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling the high error rates and we expect it to be widely adapted for long-reads analysis. AVAILABILITY AND IMPLEMENTATION: SubseqHash is freely available at https://github.com/Shao-Group/subseqhash.
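The idea of mapping a string to its smallest length-k subsequence can be illustrated with a brute-force sketch. It assumes plain lexicographic order in place of the ABC order (which the abstract names but does not define) and enumerates every subsequence, which is exactly the exponential approach the abstract calls impractical. The function name `smallest_subsequence_bruteforce` is hypothetical and is not part of the SubseqHash code base.

```python
from itertools import combinations


def smallest_subsequence_bruteforce(s: str, k: int) -> str:
    """Return the lexicographically smallest length-k subsequence of s.

    Assumption: lexicographic order stands in for the paper's ABC order.
    Enumeration visits C(len(s), k) candidates, so this is only usable
    on very short strings.
    """
    if not 0 < k <= len(s):
        raise ValueError("k must satisfy 0 < k <= len(s)")
    # Each index combination is increasing, so it selects a subsequence
    # (characters kept in their original order, gaps allowed).
    return min(
        "".join(s[i] for i in idx)
        for idx in combinations(range(len(s)), k)
    )


if __name__ == "__main__":
    # Two strings differing by one substitution can still share a seed,
    # because the seed may skip over the mismatching position.
    print(smallest_subsequence_bruteforce("GATTACA", 4))
    print(smallest_subsequence_bruteforce("GATCACA", 4))
```

In this toy case both strings map to the same seed even though they share no common 4-mer covering the substituted position, which is the error tolerance that motivates subsequence-based seeding.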
Affiliation(s)
- Xiang Li
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
| | - Qian Shi
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
| | - Ke Chen
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
| | - Mingfu Shao
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA
| |
|
46
|
Guo Y, Feng X, Li H. Evaluation of haplotype-aware long-read error correction with hifieval. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.06.05.543788. [PMID: 37333189 PMCID: PMC10274712 DOI: 10.1101/2023.06.05.543788] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/20/2023]
Abstract
The PacBio High-Fidelity (HiFi) sequencing technology produces long reads of >99% in accuracy. It has enabled the development of a new generation of de novo sequence assemblers, which all have sequencing error correction as the first step. As HiFi is a new data type, this critical step has not been evaluated before. Here, we introduced hifieval, a new command-line tool for measuring over- and under-corrections produced by error correction algorithms. We assessed the accuracy of the error correction components of existing HiFi assemblers on the CHM13 and the HG002 datasets and further investigated the performance of error correction methods in challenging regions such as homopolymer regions, centromeric regions, and segmental duplications. Hifieval will help HiFi assemblers to improve error correction and assembly quality in the long run.
Affiliation(s)
- Yujie Guo
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA, 02215
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA, 02215
| | - Xiaowen Feng
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA, 02215
| | - Heng Li
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA, 02215
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA, 02215
| |
|
47
|
Ahmed OY, Rossi M, Gagie T, Boucher C, Langmead B. SPUMONI 2: improved classification using a pangenome index of minimizer digests. Genome Biol 2023; 24:122. [PMID: 37202771 PMCID: PMC10197461 DOI: 10.1186/s13059-023-02958-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2022] [Accepted: 05/03/2023] [Indexed: 05/20/2023] Open
Abstract
Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array. By incorporating minimizers, SPUMONI 2's index is 65 times smaller than minimap2's for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency in practical scenarios such as adaptive sampling, contamination detection and multi-class metagenomics classification.
Affiliation(s)
- Omar Y. Ahmed
- Department of Computer Science, Johns Hopkins University, Baltimore, MD USA
| | - Massimiliano Rossi
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL USA
| | - Travis Gagie
- Faculty of Computer Science, Dalhousie University, Halifax, NS Canada
| | - Christina Boucher
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD USA
| |
|
48
|
Chamorro González R, Conrad T, Stöber MC, Xu R, Giurgiu M, Rodriguez-Fos E, Kasack K, Brückner L, van Leen E, Helmsauer K, Dorado Garcia H, Stefanova ME, Hung KL, Bei Y, Schmelz K, Lodrini M, Mundlos S, Chang HY, Deubzer HE, Sauer S, Eggert A, Schulte JH, Schwarz RF, Haase K, Koche RP, Henssen AG. Parallel sequencing of extrachromosomal circular DNAs and transcriptomes in single cancer cells. Nat Genet 2023; 55:880-890. [PMID: 37142849 PMCID: PMC10181933 DOI: 10.1038/s41588-023-01386-y] [Citation(s) in RCA: 35] [Impact Index Per Article: 17.5] [Received: 12/20/2021] [Accepted: 03/28/2023] [Indexed: 05/06/2023]
Abstract
Extrachromosomal DNAs (ecDNAs) are common in cancer, but many questions about their origin, structural dynamics and impact on intratumor heterogeneity are still unresolved. Here we describe single-cell extrachromosomal circular DNA and transcriptome sequencing (scEC&T-seq), a method for parallel sequencing of circular DNAs and full-length mRNA from single cells. By applying scEC&T-seq to cancer cells, we describe intercellular differences in ecDNA content while investigating their structural heterogeneity and transcriptional impact. Oncogene-containing ecDNAs were clonally present in cancer cells and drove intercellular oncogene expression differences. In contrast, other small circular DNAs were exclusive to individual cells, indicating differences in their selection and propagation. Intercellular differences in ecDNA structure pointed to circular recombination as a mechanism of ecDNA evolution. These results demonstrate scEC&T-seq as an approach to systematically characterize both small and large circular DNA in cancer cells, which will facilitate the analysis of these DNA elements in cancer and beyond.
Affiliation(s)
- Rocío Chamorro González
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
- Experimental and Clinical Research Center of the MDC and Charité Berlin, Berlin, Germany
| | - Thomas Conrad
- Genomics Technology Platform, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
| | - Maja C Stöber
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Charité-Universitätsmedizin Berlin, Berlin, Germany
- Faculty of Life Science, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Robin Xu
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
- Experimental and Clinical Research Center of the MDC and Charité Berlin, Berlin, Germany
| | - Mădălina Giurgiu
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
- Experimental and Clinical Research Center of the MDC and Charité Berlin, Berlin, Germany
- Freie Universität Berlin, Berlin, Germany
| | - Elias Rodriguez-Fos
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
- Experimental and Clinical Research Center of the MDC and Charité Berlin, Berlin, Germany
| | - Katharina Kasack
- Fraunhofer Institute for Cell Therapy and Immunology, Branch Bioanalytics and Bioprocesses IZI-BB, Potsdam, Germany
| | - Lotte Brückner
- Experimental and Clinical Research Center of the MDC and Charité Berlin, Berlin, Germany
- Max-Delbrück-Centrum für Molekulare Medizin, Berlin, Germany
| | - Eric van Leen
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
- Experimental and Clinical Research Center of the MDC and Charité Berlin, Berlin, Germany
| | - Konstantin Helmsauer
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
- Experimental and Clinical Research Center of the MDC and Charité Berlin, Berlin, Germany
| | - Heathcliff Dorado Garcia
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
- Experimental and Clinical Research Center of the MDC and Charité Berlin, Berlin, Germany
| | - Maria E Stefanova
- RG Development and Disease, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Institute for Medical Genetics, Charité-Universitätsmedizin Berlin, Berlin, Germany
| | - King L Hung
- Center for Personal Dynamic Regulomes, Stanford University School of Medicine, Stanford, CA, USA
| | - Yi Bei
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
- Experimental and Clinical Research Center of the MDC and Charité Berlin, Berlin, Germany
| | - Karin Schmelz
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Marco Lodrini
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Stefan Mundlos
- RG Development and Disease, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Institute for Medical Genetics, Charité-Universitätsmedizin Berlin, Berlin, Germany
- Berlin-Brandenburg Center for Regenerative Therapies, Charité-Universitätsmedizin Berlin, Berlin, Germany
| | - Howard Y Chang
- Center for Personal Dynamic Regulomes, Stanford University School of Medicine, Stanford, CA, USA
- Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA, USA
| | - Hedwig E Deubzer
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
- Experimental and Clinical Research Center of the MDC and Charité Berlin, Berlin, Germany
- German Cancer Consortium, partner site Berlin, and German Cancer Research Center, Heidelberg, Germany
- Berlin Institute of Health, Berlin, Germany
| | - Sascha Sauer
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
| | - Angelika Eggert
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
- German Cancer Consortium, partner site Berlin, and German Cancer Research Center, Heidelberg, Germany
- Berlin Institute of Health, Berlin, Germany
| | - Johannes H Schulte
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
- German Cancer Consortium, partner site Berlin, and German Cancer Research Center, Heidelberg, Germany
- Berlin Institute of Health, Berlin, Germany
| | - Roland F Schwarz
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Institute for Computational Cancer Biology, Center for Integrated Oncology, Cancer Research Center Cologne Essen Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany
- Berlin Institute for the Foundations of Learning and Data, Berlin, Germany
| | - Kerstin Haase
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
- Experimental and Clinical Research Center of the MDC and Charité Berlin, Berlin, Germany
- German Cancer Consortium, partner site Berlin, and German Cancer Research Center, Heidelberg, Germany
| | - Richard P Koche
- Center for Epigenetics Research, Memorial Sloan Kettering Cancer Center, New York, NY, USA
| | - Anton G Henssen
- Department of Pediatric Oncology and Hematology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany.
- Experimental and Clinical Research Center of the MDC and Charité Berlin, Berlin, Germany.
- Max-Delbrück-Centrum für Molekulare Medizin, Berlin, Germany.
- German Cancer Consortium, partner site Berlin, and German Cancer Research Center, Heidelberg, Germany.
| |
|
49
|
Popic V, Rohlicek C, Cunial F, Hajirasouliha I, Meleshko D, Garimella K, Maheshwari A. Cue: a deep-learning framework for structural variant discovery and genotyping. Nat Methods 2023; 20:559-568. [PMID: 36959322 PMCID: PMC10152467 DOI: 10.1038/s41592-023-01799-x] [Citation(s) in RCA: 25] [Impact Index Per Article: 12.5] [Received: 04/30/2022] [Accepted: 01/29/2023] [Indexed: 03/25/2023]
Abstract
Structural variants (SVs) are a major driver of genetic diversity and disease in the human genome and their discovery is imperative to advances in precision medicine. Existing SV callers rely on hand-engineered features and heuristics to model SVs, which cannot scale to the vast diversity of SVs nor fully harness the information available in sequencing datasets. Here we propose an extensible deep-learning framework, Cue, to call and genotype SVs that can learn complex SV abstractions directly from the data. At a high level, Cue converts alignments to images that encode SV-informative signals and uses a stacked hourglass convolutional neural network to predict the type, genotype and genomic locus of the SVs captured in each image. We show that Cue outperforms the state of the art in the detection of several classes of SVs on synthetic and real short-read data and that it can be easily extended to other sequencing platforms, while achieving competitive performance.
Affiliation(s)
| | | | - Fabio Cunial
- Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Iman Hajirasouliha
- Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY, USA
- Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine, New York, NY, USA
| | - Dmitry Meleshko
- Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine, New York, NY, USA
- Tri-Institutional Computational Biology and Medicine Program, Weill Cornell Medicine, New York, NY, USA
| | - Kiran Garimella
- Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | | |
|
50
|
Firtina C, Park J, Alser M, Kim JS, Cali D, Shahroodi T, Ghiasi N, Singh G, Kanellopoulos K, Alkan C, Mutlu O. BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis. NAR Genom Bioinform 2023; 5:lqad004. [PMID: 36685727 PMCID: PMC9853099 DOI: 10.1093/nargab/lqad004] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 07/29/2022] [Revised: 12/16/2022] [Accepted: 01/10/2023] [Indexed: 01/22/2023] Open
Abstract
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds causes either (i) increasing the use of the costly sequence alignment or (ii) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds with a single lookup of their hash values, called fuzzy seed matches. BLEND (i) utilizes a technique called SimHash, that can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4×-83.9× (on average 19.3×), has a lower memory footprint by 0.9×-14.1× (on average 3.8×), and finds higher quality overlaps leading to accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (on average 1.7×) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
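The SimHash technique mentioned in the abstract can be sketched generically as follows. This is a textbook SimHash over a set of k-mers, not BLEND's implementation: the abstract does not say how BLEND builds its sets or which hash function it uses, so the 32-bit width, the BLAKE2b item hash, and the names `item_hash`, `simhash`, and `kmers` are all assumptions made for illustration.

```python
import hashlib


def item_hash(item: str, bits: int = 32) -> int:
    """Hash one set element to a `bits`-wide integer (assumed hash: BLAKE2b)."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(items, bits: int = 32) -> int:
    """Combine element hashes by per-bit majority vote (generic SimHash).

    Each element votes +1 for every bit set in its hash and -1 for every
    bit that is clear; the output has a 1 wherever the total is positive.
    Sets that share most elements therefore tend to agree on most bits.
    """
    counts = [0] * bits
    for item in items:
        h = item_hash(item, bits)
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def kmers(seq: str, k: int):
    """All overlapping substrings of length k (hypothetical seed-to-set step)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    a = simhash(kmers("GATTACAGATTACA", 4))
    b = simhash(kmers("GATTACAGATCACA", 4))
    # Number of differing bits between the two fuzzy hashes.
    print(bin(a ^ b).count("1"))
```

Because the vote is a sum over elements, the result does not depend on the order in which the set is traversed, and a single-element set hashes to that element's own hash value.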
Affiliation(s)
| | - Jisung Park
- ETH Zurich, Zurich 8092, Switzerland
- POSTECH, Pohang 37673, Republic of Korea
| | | | | | | | | | | | | | | | - Can Alkan
- Bilkent University, Ankara 06800, Turkey
| | | |
|