1
Alser M, Eudine J, Mutlu O. Taming large-scale genomic analyses via sparsified genomics. Nat Commun 2025; 16:876. [PMID: 39837860 PMCID: PMC11751491 DOI: 10.1038/s41467-024-55762-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2023] [Accepted: 12/20/2024] [Indexed: 01/23/2025] Open
Abstract
Searching for similar genomic sequences is an essential and fundamental step in biomedical research. State-of-the-art computational methods performing such comparisons fail to cope with the exponential growth of genomic sequencing data. We introduce the concept of sparsified genomics where we systematically exclude a large number of bases from genomic sequences and enable faster and memory-efficient processing of the sparsified, shorter genomic sequences, while providing comparable accuracy to processing non-sparsified sequences. Sparsified genomics provides benefits to many genomic analyses and has broad applicability. Sparsifying genomic sequences accelerates the state-of-the-art read mapper (minimap2) by 2.57-5.38x, 1.13-2.78x, and 3.52-6.28x using real Illumina, HiFi, and ONT reads, respectively, while providing comparable memory footprint, 2x smaller index size, and more correctly detected variations compared to minimap2. Sparsifying genomic sequences makes containment search through very large genomes and large databases 72.7-75.88x (1.62-1.9x when indexing is preprocessed) faster and 723.3x more storage-efficient than searching through non-sparsified genomic sequences (with CMash and KMC3). Sparsifying genomic sequences enables robust microbiome discovery by providing 54.15-61.88x (1.58-1.71x when indexing is preprocessed) faster and 720x more storage-efficient taxonomic profiling of metagenomic samples over the state-of-the-art tool (Metalign).
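As a rough illustration of the sparsification idea described in this abstract, the sketch below drops bases according to a repeating binary keep/skip pattern. The abstract does not say how the excluded bases are chosen, so the pattern scheme, the function name `sparsify`, and the default pattern `"10"` are all assumptions made for illustration, not the authors' method.

```python
def sparsify(sequence: str, pattern: str = "10") -> str:
    """Keep base i only when the repeating pattern has '1' at position i.

    The keep/skip pattern is a hypothetical scheme chosen for this sketch.
    """
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of 0s and 1s")
    return "".join(
        base
        for i, base in enumerate(sequence)
        if pattern[i % len(pattern)] == "1"
    )


read = "ACGTACGTAC"
print(sparsify(read))         # keeps every other base
print(sparsify(read, "110"))  # keeps two of every three bases
```

With the pattern `"10"`, each sequence shrinks to about half its length, which is the kind of reduction that would make later comparison and indexing steps cheaper under such a scheme.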
Affiliation(s)
- Mohammed Alser
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
  - Department of Clinical Pharmacy, University of Southern California, Los Angeles, CA, USA
- Julien Eudine
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
- Onur Mutlu
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
2
Yu C, Zhao Y, Zhao C, Jin J, Mao K, Wang G. MiniDBG: A Novel and Minimal De Bruijn Graph for Read Mapping. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:129-142. [PMID: 38060353 DOI: 10.1109/tcbb.2023.3340251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/07/2024]
Abstract
The De Bruijn graph (DBG) has been widely used in algorithms for indexing or organizing read and reference sequences in bioinformatics. However, a DBG model that can locate each node, edge, and path on the sequence has not been proposed so far. Recently, the DBG has been used to represent reference sequences in read mapping tasks. In this setting, there is no one-to-one correspondence between the paths of the DBG and the substrings of the reference sequence. This results in false paths on the DBG, that is, paths that no substring of the reference produces. Moreover, if a candidate path of a read is true, we need to locate it and verify the candidate on the sequence. To solve these problems, we proposed a DBG model, called MiniDBG, which stores the position lists of a minimal set of edges. With the position lists, MiniDBG can locate any node, edge, and path efficiently. We also proposed algorithms for generating MiniDBG from an original DBG and algorithms for locating edges or paths on the sequence. We designed and ran experiments on real datasets to compare them with BWT-based and position list-based methods. The experimental results show that MiniDBG can locate edges and paths efficiently with lower memory costs.
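The false-path problem mentioned in this abstract can be seen on a toy reference. The sketch below is not MiniDBG: it stores a position list for every edge (k-mer) rather than for a minimal set of edges, and the function names and the example reference are hypothetical. It only shows why a walk through the graph can spell a string the reference does not contain, and how position lists let a candidate be located and verified.

```python
from collections import defaultdict


def edge_positions(reference: str, k: int) -> dict:
    """Map every k-mer (a DBG edge) to the positions where it occurs."""
    positions = defaultdict(list)
    for i in range(len(reference) - k + 1):
        positions[reference[i:i + k]].append(i)
    return positions


def is_path(query: str, positions: dict, k: int) -> bool:
    """True if every k-mer of query is an edge, so query spells a DBG path."""
    if len(query) < k:
        return False
    return all(query[i:i + k] in positions for i in range(len(query) - k + 1))


def locate(query: str, positions: dict, k: int) -> list:
    """Start positions where consecutive edges of query are also consecutive
    on the reference; an empty list means the path is a false path."""
    if len(query) < k:
        return []
    kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
    starts = set(positions.get(kmers[0], []))
    for offset, kmer in enumerate(kmers[1:], start=1):
        # An occurrence at p supports a start at p - offset.
        starts &= {p - offset for p in positions.get(kmer, [])}
    return sorted(starts)


reference = "ACGTCGA"
pos = edge_positions(reference, k=3)
print(is_path("ACGA", pos, 3), locate("ACGA", pos, 3))  # a path, but not locatable
print(is_path("CGTC", pos, 3), locate("CGTC", pos, 3))  # a path that is locatable
```

Here `ACGA` uses the edges `ACG` and `CGA`, both of which exist in the graph, yet the string never occurs in the reference; intersecting the shifted position lists exposes that.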
3
Wang J, Wang J, Kuang G, Wu W, Yang L, Yang W, Pan H, Han X, Yang T, Shi M, Feng Y. Meta-transcriptomics for the diversity of tick-borne virus in Nujiang, Yunnan Province. Front Cell Infect Microbiol 2023; 13:1283019. [PMID: 38179426 PMCID: PMC10766107 DOI: 10.3389/fcimb.2023.1283019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Accepted: 11/17/2023] [Indexed: 01/06/2024] Open
Abstract
Ticks are arthropods that transmit various pathogens, including viruses, bacteria, and fungi, and pose a perpetual public health concern. A total of 2,570 ticks collected from Nujiang Prefecture in Yunnan Province between 2017 and 2022 were included in the study. Through meta-transcriptomic sequencing of four locally distributed tick species, we identified 13 RNA viruses belonging to eight viral families, namely, Phenuiviridae, Nairoviridae, Peribunyaviridae, Flaviviridae, Chuviridae, Rhabdoviridae, Orthomyxoviridae, and Totiviridae. The most prevalent viruses were members of the order Bunyavirales: three were classified as Phenuiviridae, two as Peribunyaviridae, and one as Nairoviridae. However, whether they pose a threat to human health remains unclear. This study revealed the genetic diversity of tick species and tick-borne viruses in Nujiang Prefecture based on the COI gene and tick-borne virus research. These data clarified the genetic evolution of some RNA viruses and furthered our understanding of the distribution pattern of tick-borne pathogens, highlighting the importance and necessity of monitoring tick-borne pathogens.
Affiliation(s)
- Juan Wang
  - Yunnan Provincial Key Laboratory for Zoonosis Control and Prevention, Yunnan Institute of Endemic Disease Control and Prevention, Dali, China
- Jing Wang
  - State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Sun Yat-sen University, Shenzhen, China
- Guopeng Kuang
  - Yunnan Provincial Key Laboratory for Zoonosis Control and Prevention, Yunnan Institute of Endemic Disease Control and Prevention, Dali, China
- Weichen Wu
  - State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Sun Yat-sen University, Shenzhen, China
- Lifen Yang
  - Yunnan Provincial Key Laboratory for Zoonosis Control and Prevention, Yunnan Institute of Endemic Disease Control and Prevention, Dali, China
- Weihong Yang
  - Yunnan Provincial Key Laboratory for Zoonosis Control and Prevention, Yunnan Institute of Endemic Disease Control and Prevention, Dali, China
- Hong Pan
  - Yunnan Provincial Key Laboratory for Zoonosis Control and Prevention, Yunnan Institute of Endemic Disease Control and Prevention, Dali, China
- Xi Han
  - Yunnan Provincial Key Laboratory for Zoonosis Control and Prevention, Yunnan Institute of Endemic Disease Control and Prevention, Dali, China
- Tian Yang
  - School of Public Health, Dali University, Dali, China
- Mang Shi
  - State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Sun Yat-sen University, Shenzhen, China
- Yun Feng
  - Yunnan Provincial Key Laboratory for Zoonosis Control and Prevention, Yunnan Institute of Endemic Disease Control and Prevention, Dali, China
  - School of Public Health, Dali University, Dali, China
  - State Key Laboratory of Remote Sensing Science, Center for Global Change and Public Health, Faculty of Geographical Science, Beijing Normal University, Beijing, China
4
Jung Y, Han D. BWA-MEME: BWA-MEM emulated with a machine learning approach. Bioinformatics 2022; 38:2404-2413. [PMID: 35253835 DOI: 10.1093/bioinformatics/btac137] [Citation(s) in RCA: 100] [Impact Index Per Article: 33.3] [Received: 09/01/2021] [Revised: 12/30/2021] [Accepted: 03/03/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: The growing use of next-generation sequencing and increased sequencing throughput require efficient short-read alignment, where seeding is one of the major performance bottlenecks. The key challenge in the seeding phase is searching for exact matches of substrings of short reads in the reference DNA sequence. Existing algorithms, however, are limited in performance by their frequent memory accesses.
RESULTS: This paper presents BWA-MEME, the first full-fledged short-read alignment software that leverages learned indices to solve the exact match search problem for efficient seeding. BWA-MEME is a practical and efficient seeding algorithm based on a suffix array search algorithm that solves the challenges in utilizing learned indices for SMEM search, which is extensively used in the seeding phase. Our evaluation shows that BWA-MEME achieves up to 3.45x speedup in seeding throughput over BWA-MEM2 by reducing the number of instructions by 4.60x, memory accesses by 8.77x, and LLC misses by 2.21x, while producing SAM output identical to that of BWA-MEM2.
AVAILABILITY: The source code and test scripts are available for academic use at https://github.com/kaist-ina/BWA-MEME/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
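To make the idea of a learned index over a suffix array more concrete, here is a toy sketch. It is not BWA-MEME's design: the single least-squares line, the base-5 prefix encoding, the fallback to a full binary search, and all function names are assumptions chosen for illustration. The model guesses where a pattern should sit in the sorted suffixes, and a short binary search around that guess finishes the exact-match lookup.

```python
CODE = {"A": 1, "C": 2, "G": 3, "T": 4}  # 0 is reserved for padding


def prefix_key(text: str, width: int) -> int:
    """Encode the first `width` bases as a base-5 integer, padding with 0."""
    key = 0
    for i in range(width):
        key = key * 5 + (CODE[text[i]] if i < len(text) else 0)
    return key


def build(reference: str, width: int = 4):
    """Return a suffix array and a linear model rank ~ slope * key + intercept.

    Assumes a non-empty reference; the naive sort is only suitable for toys.
    """
    sa = sorted(range(len(reference)), key=lambda i: reference[i:])
    keys = [prefix_key(reference[i:], width) for i in sa]
    n = len(sa)
    mean_k, mean_r = sum(keys) / n, (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (r - mean_r) for r, k in enumerate(keys))
    slope = cov / var if var else 0.0
    intercept = mean_r - slope * mean_k
    # Largest prediction error seen on the indexed suffixes, rounded up.
    err = int(max(abs(slope * k + intercept - r) for r, k in enumerate(keys))) + 1
    return sa, (slope, intercept, err)


def find(reference: str, sa: list, model: tuple, pattern: str, width: int = 4) -> list:
    """Sorted start positions of exact matches of pattern in reference."""
    slope, intercept, err = model
    n, m = len(sa), len(pattern)

    def before(rank: int) -> bool:
        # True if the suffix at this rank sorts strictly before the pattern.
        return reference[sa[rank]:sa[rank] + m] < pattern

    guess = int(slope * prefix_key(pattern, width) + intercept)
    lo = min(max(guess - err, 0), n)
    hi = min(max(guess + err + 1, 0), n)
    # Trust the predicted window only if it brackets the answer.
    if (lo > 0 and not before(lo - 1)) or (hi < n and before(hi)):
        lo, hi = 0, n
    while lo < hi:  # lower-bound binary search inside the window
        mid = (lo + hi) // 2
        if before(mid):
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < n and reference[sa[lo]:sa[lo] + m] == pattern:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


reference = "ACGTACGTTACG"
sa, model = build(reference)
print(find(reference, sa, model, "ACG"))
print(find(reference, sa, model, "TTA"))
```

Because the window is checked before it is trusted, the lookup stays correct even when the model predicts poorly; a good model only narrows the range that the binary search has to touch, which is where fewer memory accesses would come from.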
Affiliation(s)
- Youngmok Jung
  - Department of Electrical Engineering, KAIST, Daejeon 34141, Republic of Korea
- Dongsu Han
  - Department of Electrical Engineering, KAIST, Daejeon 34141, Republic of Korea
5
Feng Y, Gou QY, Yang WH, Wu WC, Wang J, Holmes EC, Liang G, Shi M. A time-series meta-transcriptomic analysis reveals the seasonal, host, and gender structure of mosquito viromes. Virus Evol 2022; 8:veac006. [PMID: 35242359 PMCID: PMC8887699 DOI: 10.1093/ve/veac006] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 09/27/2021] [Revised: 01/25/2022] [Accepted: 01/27/2022] [Indexed: 11/21/2022] Open
Abstract
Although metagenomic sequencing has revealed high numbers of viruses in mosquitoes sampled globally, our understanding of how their diversity and abundance vary in time and space, as well as by host species and gender, remains limited. To address this, we collected 23,109 mosquitoes over the course of 12 months from a bat-dwelling cave and a nearby village in Yunnan province, China. These samples were organized by mosquito species, mosquito gender, and sampling time for meta-transcriptomic sequencing. A total of 162 eukaryotic virus species were identified, of which 101 were novel, including representatives of seventeen RNA virus multi-family supergroups and four species of DNA virus from the families Parvoviridae, Circoviridae, and Nudiviridae. In addition, two known vector-borne viruses, Japanese encephalitis virus and Banna virus, were found. Analyses of the entire virome revealed strikingly different viral compositions and abundance levels in warmer compared to colder months, a strong host structure at the level of mosquito species, and no substantial differences between the viruses harbored by male and female mosquitoes. At the scale of individual viruses, some were found to be ubiquitous throughout the year and across four mosquito species, while most of the other viruses were season and/or host specific. Collectively, this study reveals the diversity, dynamics, and evolution of the mosquito virome at a single location and sheds new light on the ecology of these important vector animals.
Affiliation(s)
- Yun Feng
  - Department of Viral and Rickettsial Disease Control, Yunnan Provincial Key Laboratory for Zoonosis Control and Prevention, Yunnan Institute of Endemic Disease Control and Prevention, No. 5 Wenhua Road, Xiaguan, Dali, Yunnan 671000, China
- Qin-yu Gou
  - Shenzhen Campus of Sun Yat-sen University, Guangming New District, Shenzhen, Guangdong 518107, China
- Wei-hong Yang
  - Department of Viral and Rickettsial Disease Control, Yunnan Provincial Key Laboratory for Zoonosis Control and Prevention, Yunnan Institute of Endemic Disease Control and Prevention, No. 5 Wenhua Road, Xiaguan, Dali, Yunnan 671000, China
- Wei-chen Wu
  - Shenzhen Campus of Sun Yat-sen University, Guangming New District, Shenzhen, Guangdong 518107, China
- Juan Wang
  - Department of Viral and Rickettsial Disease Control, Yunnan Provincial Key Laboratory for Zoonosis Control and Prevention, Yunnan Institute of Endemic Disease Control and Prevention, No. 5 Wenhua Road, Xiaguan, Dali, Yunnan 671000, China
- Edward C Holmes
  - Sydney Institute for Infectious Diseases, School of Life and Environmental Sciences and School of Medical Sciences, The University of Sydney, Sydney, NSW 2006, Australia
- Guodong Liang
  - State Key Laboratory of Infectious Disease Prevention and Control, National Institute for Viral Disease Control and Prevention, Chinese Center for Disease Control and Prevention, 155 Changbai Road, Changping District, Beijing 102206, China
- Mang Shi
  - Shenzhen Campus of Sun Yat-sen University, Guangming New District, Shenzhen, Guangdong 518107, China