1. Martí JM, Kok CR, Thissen JB, Mulakken NJ, Avila-Herrera A, Jaing CJ, Allen JE, Be NA. Addressing the dynamic nature of reference data: a new nucleotide database for robust metagenomic classification. mSystems 2025; 10:e0123924. [PMID: 40111052 PMCID: PMC12013259 DOI: 10.1128/msystems.01239-24] [Received: 09/12/2024] [Accepted: 02/14/2025] [Indexed: 03/22/2025]
Abstract
Accurate metagenomic classification relies on comprehensive, up-to-date, and validated reference databases. While the NCBI BLAST Nucleotide (nt) database, encompassing a vast collection of sequences from all domains of life, represents an invaluable resource, its massive size (currently exceeding 10^12 nucleotides) and exponential growth pose significant challenges for researchers seeking to maintain current nt-based indices for metagenomic classification. Recognizing that no current nt-based indices exist for the widely used Centrifuge classifier, and the last public version currently available was released in 2018, we addressed this critical gap by leveraging advanced high-performance computing resources. We present new Centrifuge-compatible nt databases, meticulously constructed using a novel pipeline incorporating different quality control measures, including reference decontamination and filtering. These measures demonstrably reduce spurious classifications, as shown through our reanalysis of published metagenomic data where Plasmodium annotations were dramatically reduced using our decontaminated database, highlighting how database quality can significantly impact research conclusions. Through temporal comparisons, we also reveal how our approach minimizes inconsistencies in taxonomic assignments stemming from asynchronous updates between public sequence and taxonomy databases. These discrepancies are particularly evident in taxa such as Listeria monocytogenes and Naegleria fowleri, where classification accuracy varied significantly across database versions. These new databases, made available as pre-built Centrifuge indexes, respond to the need for an open, robust, nt-based pipeline for taxonomic classification in metagenomics. Applications such as environmental metagenomics, forensics, and clinical metagenomics, which require comprehensive taxonomic coverage, will benefit from this resource.
Our work highlights the importance of treating reference databases as dynamic entities, subject to ongoing quality control and validation akin to software development best practices. This approach is crucial for ensuring accuracy and reliability of metagenomic analysis, especially as databases continue to expand in size and complexity. IMPORTANCE Accurately identifying the diverse microbes present in a sample, whether from the human gut, a soil sample, or a crime scene, is crucial for fields ranging from medicine to environmental science. Researchers rely on comprehensive DNA databases to match sequenced DNA fragments to known microbial species. However, the widely used NCBI nt database, while vast, poses significant challenges. Its massive size makes it difficult for many researchers to use effectively with taxonomic classifiers, and inconsistencies and contamination within the database can impact the accuracy of microbial identification. This work addresses these challenges by providing cleaned, updated, and validated nt-based databases specifically optimized for the widely used Centrifuge classification tool. This new resource demonstrably reduces errors and improves the reliability of microbial identification across diverse taxonomic groups. Moreover, by providing readily usable indexes, we overcome the size barrier, enabling researchers to leverage the full potential of the nt database for metagenomic analysis. Our findings underscore the need to treat reference databases as dynamic entities, emphasizing continuous quality control and versioning as essential practices for robust and reproducible metagenomics research.
Affiliation(s)
- Jose Manuel Martí
  - Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, California, USA
- Car Reen Kok
  - Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, California, USA
- James B. Thissen
  - Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, California, USA
- Nisha J. Mulakken
  - Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, California, USA
- Aram Avila-Herrera
  - Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, California, USA
- Crystal J. Jaing
  - Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, California, USA
- Jonathan E. Allen
  - Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, California, USA
- Nicholas A. Be
  - Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, California, USA
2. Decadt H, Díaz-Muñoz C, Vermote L, Pradal I, De Vuyst L, Weckx S. Long-read metagenomics gives a more accurate insight into the microbiota of long-ripened gouda cheeses. Front Microbiol 2025; 16:1543079. [PMID: 40196035 PMCID: PMC11973332 DOI: 10.3389/fmicb.2025.1543079] [Received: 12/10/2024] [Accepted: 03/04/2025] [Indexed: 04/09/2025]
Abstract
Metagenomic studies of the Gouda cheese microbiota and starter cultures are scarce. During the present study, short-read metagenomic sequencing (Illumina) was applied on 89 Gouda cheese and processed milk samples, which have been investigated before concerning their metabolite and taxonomic composition, the latter applying amplicon-based, high-throughput sequencing (HTS) of the full-length 16S rRNA gene. Selected samples were additionally investigated using long-read metagenomic sequencing (Oxford Nanopore Technologies, ONT). Whereas the species identified by amplicon-based HTS and metagenomic sequencing were identical, the relative abundances of the major species differed significantly. Lactococcus cremoris was more abundant in the metagenomics-based taxonomic analysis compared to the amplicon-based one, whereas the opposite was true for the non-starter lactic acid bacteria (NSLAB). This discrepancy was related to a higher fragmentation of the lactococcal DNA compared with the DNA of other species when applying ONT. Possibly, a higher fragmentation was linked with a higher percentage of dead or metabolically inactive cells, suggesting that full-length 16S rRNA gene amplicon-based HTS might give a more accurate view on active cells. Further, fungi were not abundantly present in the Gouda cheeses examined, whereas about 2% of the metagenomic sequence reads was related to phages, with higher relative abundances in the cheese rinds and long-ripened cheeses. Intraspecies differences found by short-read metagenomic sequencing were in agreement with the amplicon sequence variants obtained previously, confirming the ability of full-length 16S rRNA gene amplicon-based HTS to reach a taxonomic assignment below species level. Metagenome-assembled genomes (MAGs) were retrieved for 15 species, among which the starter cultures Lc. cremoris and Lactococcus lactis and the NSLAB Lacticaseibacillus paracasei, Loigolactobacillus rennini, and Tetragenococcus halophilus, although obtaining MAGs from Lc. cremoris and Lc. lactis was more challenging because of a high intraspecies diversity and high similarity between these species. Long-read metagenomic sequencing could not improve the retrieval of lactococcal MAGs, but, overall, MAGs obtained by long-read metagenomic sequencing solely were superior compared with those obtained by short-read metagenomic sequencing solely, reaching a high-quality draft status of the genomes.
Affiliation(s)
- Stefan Weckx
  - Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Faculty of Sciences and Bioengineering Sciences, Vrije Universiteit Brussel, Brussels, Belgium
3. M. Ascensión A, Gorostidi-Aicua M, Otaegui-Chivite A, Alberro A, Bravo-Miana RDC, Castillo-Trivino T, Moles L, Otaegui D. A proposed workflow to robustly analyze bacterial transcripts in RNAseq data from extracellular vesicles. Front Microbiol 2025; 16:1486661. [PMID: 40207155 PMCID: PMC11981554 DOI: 10.3389/fmicb.2025.1486661] [Received: 08/26/2024] [Accepted: 02/27/2025] [Indexed: 04/11/2025]
Abstract
Introduction The microbiota has been unequivocally linked to various diseases, yet the mechanisms underlying these associations remain incompletely understood. One potential contributor to this relationship is the extracellular vesicles produced by bacteria (bEVs). However, the detection of these bEVs is challenging. Therefore, we propose a novel workflow to identify bacterial RNA present in circulating extracellular vesicles using Total EV RNA-seq data. As a proof of concept, we applied this workflow to a dataset from individuals with multiple sclerosis (MS). Methods We analyzed total EV RNA-seq data from blood samples of healthy controls and individuals with MS, encompassing both the Relapsing-Remitting (RR) and Secondary Progressive (SP) phases of the disease. Our workflow incorporates multiple reference mapping steps against the host genome, followed by a consensus selection of bacterial genera based on various taxonomic profiling tools. This consensus approach utilizes a flagging system to exclude genera with low abundance across profilers. Additionally, we included EVs derived from two cultured species that serve as biological controls, as well as artificially generated reads from 60 species as a technical control, to validate the specificity of this workflow. Results Our findings demonstrate that bacterial RNA can indeed be detected in total EV RNA-seq from blood samples, suggesting that this workflow can be a powerful tool for reanalyzing RNA-seq data from EV studies. Additionally, we identified promising bacterial candidates with differential expression between the RR and SP phases of MS. Discussion This approach provides valuable insights into the potential role of bEVs in the microbiota-host communication. Finally, this approach is translatable to other experiments using total RNA, where the lack of a robust pipeline can lead to an increased false positive detection of microbial genera. 
The workflow and instructions on how to use it are available at the following repository: https://github.com/NanoNeuro/EV_taxprofiling.
Affiliation(s)
- Alex M. Ascensión
  - Neuroimmunology Group, Biogipuzkoa Health Research Institute, P/ Doctor Begiristain s/n, Donostia-San Sebastián, Spain
- Miriam Gorostidi-Aicua
  - Neuroimmunology Group, Biogipuzkoa Health Research Institute, P/ Doctor Begiristain s/n, Donostia-San Sebastián, Spain
  - Neurodegenerative Diseases Research Area of CIBER (CIBERNED), Carlos III Health Institute (ISCIII), Madrid, Spain
- Ane Otaegui-Chivite
  - Neuroimmunology Group, Biogipuzkoa Health Research Institute, P/ Doctor Begiristain s/n, Donostia-San Sebastián, Spain
  - Neurodegenerative Diseases Research Area of CIBER (CIBERNED), Carlos III Health Institute (ISCIII), Madrid, Spain
- Ainhoa Alberro
  - Neuroimmunology Group, Biogipuzkoa Health Research Institute, P/ Doctor Begiristain s/n, Donostia-San Sebastián, Spain
  - Neurodegenerative Diseases Research Area of CIBER (CIBERNED), Carlos III Health Institute (ISCIII), Madrid, Spain
- Rocio del Carmen Bravo-Miana
  - Neuroimmunology Group, Biogipuzkoa Health Research Institute, P/ Doctor Begiristain s/n, Donostia-San Sebastián, Spain
  - Neurodegenerative Diseases Research Area of CIBER (CIBERNED), Carlos III Health Institute (ISCIII), Madrid, Spain
- Tamara Castillo-Trivino
  - Neuroimmunology Group, Biogipuzkoa Health Research Institute, P/ Doctor Begiristain s/n, Donostia-San Sebastián, Spain
  - Neurodegenerative Diseases Research Area of CIBER (CIBERNED), Carlos III Health Institute (ISCIII), Madrid, Spain
- Laura Moles
  - Neuroimmunology Group, Biogipuzkoa Health Research Institute, P/ Doctor Begiristain s/n, Donostia-San Sebastián, Spain
  - Neurodegenerative Diseases Research Area of CIBER (CIBERNED), Carlos III Health Institute (ISCIII), Madrid, Spain
- David Otaegui
  - Neuroimmunology Group, Biogipuzkoa Health Research Institute, P/ Doctor Begiristain s/n, Donostia-San Sebastián, Spain
  - Neurodegenerative Diseases Research Area of CIBER (CIBERNED), Carlos III Health Institute (ISCIII), Madrid, Spain
4. Singh NP, Khan J, Patro R. Alevin-fry-atac enables rapid and memory frugal mapping of single-cell ATAC-seq data using virtual colors for accurate genomic pseudoalignment. bioRxiv 2025:2024.11.27.625771. [PMID: 39677745 PMCID: PMC11642815 DOI: 10.1101/2024.11.27.625771] [Indexed: 12/17/2024]
Abstract
Ultrafast mapping of short reads via lightweight mapping techniques such as pseudoalignment has significantly accelerated transcriptomic and metagenomic analyses, often with minimal accuracy loss compared to alignment-based methods. However, applying pseudoalignment to large genomic references, like chromosomes, is challenging due to their size and repetitive sequences. We introduce a new and modified pseudoalignment scheme that partitions each reference into "virtual colors". These are essentially overlapping bins of fixed maximal extent on the reference sequences that are treated as distinct "colors" from the perspective of the pseudoalignment algorithm. We apply this modified pseudoalignment procedure to process and map single-cell ATAC-seq data in our new tool alevin-fry-atac. We compare alevin-fry-atac to both Chromap and Cell Ranger ATAC. Alevin-fry-atac is highly scalable and, when using 32 threads, is approximately 2.8 times faster than Chromap (the second fastest approach) while using approximately one third of the memory and mapping slightly more reads. The resulting peaks and clusters generated from alevin-fry-atac show high concordance with those obtained from both Chromap and the Cell Ranger ATAC pipeline, demonstrating that virtual color-enhanced pseudoalignment directly to the genome provides a fast, memory-frugal, and accurate alternative to existing approaches for single-cell ATAC-seq processing. The development of alevin-fry-atac brings single-cell ATAC-seq processing into a unified ecosystem with single-cell RNA-seq processing (via alevin-fry) to work toward providing a truly open alternative to many of the varied capabilities of CellRanger. Furthermore, our modified pseudoalignment approach should be easily applicable and extendable to other genome-centric mapping-based tasks and modalities such as standard DNA-seq, DNase-seq, ChIP-seq and Hi-C.
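The abstract describes virtual colors as overlapping bins of fixed maximal extent along each reference. The following is a minimal sketch of that partitioning idea only, not the alevin-fry-atac implementation: the function names and the `bin_size` and `overlap` parameters are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def virtual_colors(ref_len: int, bin_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Split [0, ref_len) into overlapping half-open bins of at most bin_size bases.

    bin_size and overlap are illustrative parameters, not values from the paper.
    """
    if bin_size <= 0 or not 0 <= overlap < bin_size:
        raise ValueError("require bin_size > 0 and 0 <= overlap < bin_size")
    if ref_len <= 0:
        return []
    step = bin_size - overlap  # distance between consecutive bin starts
    bins = []
    start = 0
    while True:
        end = min(start + bin_size, ref_len)
        bins.append((start, end))
        if end >= ref_len:  # the last bin reaches the end of the reference
            break
        start += step
    return bins


def colors_for_position(pos: int, bins: List[Tuple[int, int]]) -> List[int]:
    """Return the indices of all bins (colors) that contain position pos."""
    return [i for i, (start, end) in enumerate(bins) if start <= pos < end]


bins = virtual_colors(ref_len=10, bin_size=4, overlap=1)
print(bins)                          # [(0, 4), (3, 7), (6, 10)]
print(colors_for_position(3, bins))  # [0, 1]
```

Because adjacent bins overlap, a position near a bin boundary belongs to two colors, which is what lets a read spanning the boundary still fall entirely inside at least one bin when the overlap is at least the read length.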
5. Diener C, Holscher HD, Filek K, Corbin KD, Moissl-Eichinger C, Gibbons SM. Metagenomic estimation of dietary intake from human stool. Nat Metab 2025; 7:617-630. [PMID: 39966520 PMCID: PMC11949708 DOI: 10.1038/s42255-025-01220-1] [Received: 02/14/2024] [Accepted: 01/16/2025] [Indexed: 02/20/2025]
Abstract
Dietary intake is tightly coupled to gut microbiota composition, human metabolism and the incidence of virtually all major chronic diseases. Dietary and nutrient intake are usually assessed using self-reporting methods, including dietary questionnaires and food records, which suffer from reporting biases and require strong compliance from study participants. Here, we present Metagenomic Estimation of Dietary Intake (MEDI): a method for quantifying food-derived DNA in human faecal metagenomes. We show that DNA-containing food components can be reliably detected in stool-derived metagenomic data, even when present at low abundances (more than ten reads). We show how MEDI dietary intake profiles can be converted into detailed metabolic representations of nutrient intake. MEDI identifies the onset of solid food consumption in infants, shows significant agreement with food frequency questionnaire responses in an adult population and shows agreement with food and nutrient intake in two controlled-feeding studies. Finally, we identify specific dietary features associated with metabolic syndrome in a large clinical cohort without dietary records, providing a proof-of-concept for detailed tracking of individual-specific, health-relevant dietary patterns without the need for questionnaires.
Affiliation(s)
- Christian Diener
  - Diagnostic and Research Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
  - Institute for Systems Biology, Seattle, WA, USA
- Hannah D Holscher
  - Department of Food Science and Human Nutrition, University of Illinois Urbana-Champaign, Urbana, IL, USA
- Klara Filek
  - Diagnostic and Research Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Karen D Corbin
  - AdventHealth Translational Research Institute, Orlando, FL, USA
- Christine Moissl-Eichinger
  - Diagnostic and Research Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
  - BioTechMed Graz, Graz, Austria
- Sean M Gibbons
  - Institute for Systems Biology, Seattle, WA, USA
  - Department of Bioengineering, University of Washington, Seattle, WA, USA
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
  - eScience Institute, University of Washington, Seattle, WA, USA
6. Li H, Chen Y, Xia Z, Zhuang D, Cong F, Lian YX. Metagenomic investigation of viruses in green sea turtles (Chelonia mydas). Front Microbiol 2025; 16:1492038. [PMID: 39911250 PMCID: PMC11794262 DOI: 10.3389/fmicb.2025.1492038] [Received: 09/06/2024] [Accepted: 01/07/2025] [Indexed: 02/07/2025]
Abstract
Green sea turtles are listed on the International Union for Conservation of Nature's Red List of Threatened Species. Thus, conservation efforts, including investigation of factors affecting the health of green sea turtles, are critical. Viral communities play vital roles in maintaining animal health. In the present study, shotgun metagenomics was used for the first time to survey viruses in the feces of green sea turtles. Most viral contigs were DNA viruses that mainly belonged to Caudoviricetes, followed by Crassvirales. Additionally, most of the viral contigs were not assigned to any known family or genus, implying a large knowledge gap in the taxonomy of green sea turtle gut viruses. Host prediction showed that most viruses were connected to two phyla: Bacteroidetes and Firmicutes. Furthermore, KEGG enrichment analysis showed that the viral genes were mainly involved in phage-associated and metabolic pathways. Phylogenetic tree reconstruction of Caudovirales terminase large-subunit (TerL) protein showed that most of the sequences were phylogenetically distant. This study expands our understanding of the viral diversity in green sea turtles. In particular, analysis of the virome RNA fraction is exceedingly important for investigating intestinal viromes; therefore, future studies could use metatranscriptomics to study RNA viruses.
Affiliation(s)
- Hongwei Li
  - School of Life Science, Huizhou University, Huizhou, China
- Yuan Chen
  - School of Life Science, Huizhou University, Huizhou, China
- Zhongrong Xia
  - Guangdong Huidong Sea Turtle National Nature Reserve Bureau, Sea Turtle Bay, Huizhou, China
- Daohua Zhuang
  - State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, School of Life Sciences, Yunnan University, Kunming, China
- Feng Cong
  - Guangdong Laboratory Animal Monitoring Institute and Guangdong Provincial Key Laboratory of Laboratory Animals, Guangzhou, China
- Yue-Xiao Lian
  - Guangdong Laboratory Animal Monitoring Institute and Guangdong Provincial Key Laboratory of Laboratory Animals, Guangzhou, China
7. Gao Y, Luo H, Lyu H, Yang H, Yousuf S, Huang S, Liu YX. Benchmarking short-read metagenomics tools for removing host contamination. Gigascience 2025; 14:giaf004. [PMID: 40036691 PMCID: PMC11878760 DOI: 10.1093/gigascience/giaf004] [Received: 08/06/2024] [Revised: 10/31/2024] [Accepted: 01/09/2025] [Indexed: 03/06/2025]
Abstract
BACKGROUND The rapid evolution of metagenomic sequencing technology offers remarkable opportunities to explore the intricate roles of microbiome in host health and disease, as well as to uncover the unknown structure and functions of microbial communities. However, the swift accumulation of metagenomic data poses substantial challenges for data analysis. Contamination from host DNA can substantially compromise result accuracy and increase additional computational resources by including nontarget sequences. RESULTS In this study, we assessed the impact of computational host DNA decontamination on downstream analyses, highlighting its importance in producing accurate results efficiently. We also evaluated the performance of conventional tools like KneadData, Bowtie2, BWA, KMCP, Kraken2, and KrakenUniq, each offering unique advantages for different applications. Furthermore, we highlighted the importance of an accurate host reference genome, noting that its absence negatively affected the decontamination performance across all tools. CONCLUSIONS Our findings underscore the need for careful selection of decontamination tools and reference genomes to enhance the accuracy of metagenomic analyses. These insights provide valuable guidance for improving the reliability and reproducibility of microbiome research.
Affiliation(s)
- Yunyun Gao
  - Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Hao Luo
  - Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Hujie Lyu
  - Department of Life Sciences, Imperial College of London, London SW7 2AZ, UK
- Haifei Yang
  - Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
  - College of Life Sciences, Qingdao Agricultural University, Qingdao 266000, China
- Salsabeel Yousuf
  - Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Shi Huang
  - Faculty of Dentistry, The University of Hong Kong, Hong Kong SAR, China
- Yong-Xin Liu
  - Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
8. Chang WS, Harvey E, Mahar JE, Firth C, Shi M, Simon-Loriere E, Geoghegan JL, Wille M. Improving the reporting of metagenomic virome-scale data. Commun Biol 2024; 7:1687. [PMID: 39706917 DOI: 10.1038/s42003-024-07212-3] [Received: 05/29/2024] [Accepted: 11/04/2024] [Indexed: 12/23/2024]
Abstract
Over the last decade metagenomic sequencing has facilitated an increasing number of virome-scale studies, leading to an exponential expansion in understanding of virus diversity. This is partially driven by the decreasing costs of metagenomic sequencing, improvements in computational tools for revealing novel viruses, and an increased understanding of the key role that viruses play in human and animal health. A central concern associated with this remarkable increase in the number of virome-scale studies is the lack of broadly accepted "gold standards" for reporting the data and results generated. This is of particular importance for animal virome studies as there are a multitude of nuanced approaches for both data presentation and analysis, all of which impact the resulting outcomes. As such, the results of published studies can be difficult to contextualise and may be of reduced utility due to reporting deficiencies. Herein, we aim to address these reporting issues by outlining recommendations for the presentation of virome data, encouraging a transparent communication of findings that can be interpreted in evolutionary and ecological contexts.
Affiliation(s)
- Wei-Shan Chang
  - School of Medical Sciences, The University of Sydney, Sydney, NSW, Australia
  - Health and Biosecurity, Commonwealth Scientific and Industrial Research Organisation, Canberra, ACT, Australia
- Erin Harvey
  - School of Medical Sciences, The University of Sydney, Sydney, NSW, Australia
- Jackie E Mahar
  - School of Medical Sciences, The University of Sydney, Sydney, NSW, Australia
  - Australian Animal Health Laboratory and Health and Biosecurity, Commonwealth Scientific and Industrial Research Organisation, Geelong, VIC, Australia
- Cadhla Firth
  - College of Public Health, Medical, and Veterinary Sciences, James Cook University, Townsville, Australia
- Mang Shi
  - Sun Yat-Sen University, Shenzhen campus of Sun Yat-Sen University, Shenzhen, China
- Etienne Simon-Loriere
  - Evolutionary Genomics of RNA Viruses, Institut Pasteur, Université Paris Cité, Paris, France
- Jemma L Geoghegan
  - Department of Microbiology and Immunology, University of Otago, Dunedin, New Zealand
  - Institute of Environmental Science and Research, Wellington, New Zealand
- Michelle Wille
  - School of Medical Sciences, The University of Sydney, Sydney, NSW, Australia
  - Centre for Pathogen Genomics, Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
9. Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024]
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
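As a concrete illustration of the two notions this abstract mentions, k-mer counting and absent sequences, here is a minimal Python sketch. It is not taken from the review; the function names are hypothetical, and enumerating all 4^k candidates is only practical for small k.

```python
from collections import Counter
from itertools import product
from typing import List

DNA = "ACGT"


def count_kmers(seq: str, k: int) -> Counter:
    """Count every length-k substring of seq that contains only A, C, G, T."""
    seq = seq.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(DNA):  # skip windows with N or other ambiguity codes
            counts[kmer] += 1
    return counts


def absent_kmers(seq: str, k: int) -> List[str]:
    """List all DNA k-mers that never occur in seq (nullomer-style absent sequences)."""
    present = count_kmers(seq, k)
    return ["".join(p) for p in product(DNA, repeat=k) if "".join(p) not in present]


counts = count_kmers("ACGTAC", 2)
print(counts["AC"])                    # 2
print(len(absent_kmers("ACGTAC", 2)))  # 12
```

Real tools replace the dictionary with compact structures (hash tables of encoded integers, minimizers, or sketches) to handle genome-scale inputs, but the counting logic is the same.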
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
10. Zárate A, Díaz-González L, Taboada B. VirDetect-AI: a residual and convolutional neural network-based metagenomic tool for eukaryotic viral protein identification. Brief Bioinform 2024; 26:bbaf001. [PMID: 39808116 PMCID: PMC11729733 DOI: 10.1093/bib/bbaf001] [Received: 05/28/2024] [Revised: 11/12/2024] [Accepted: 08/01/2025] [Indexed: 01/16/2025]
Abstract
This study addresses the challenging task of identifying viruses within metagenomic data, which encompasses a broad array of biological samples, including animal reservoirs, environmental sources, and the human body. Traditional methods for virus identification often face limitations due to the diversity and rapid evolution of viral genomes. In response, recent efforts have focused on leveraging artificial intelligence (AI) techniques to enhance accuracy and efficiency in virus detection. However, existing AI-based approaches are primarily binary classifiers, lacking specificity in identifying viral types and reliant on nucleotide sequences. To address these limitations, VirDetect-AI, a novel tool specifically designed for the identification of eukaryotic viruses within metagenomic datasets, is introduced. The VirDetect-AI model employs a combination of convolutional neural networks and residual neural networks to effectively extract hierarchical features and detailed patterns from complex amino acid genomic data. The results demonstrated that the model has outstanding results in all metrics, with a sensitivity of 0.97, a precision of 0.98, and an F1-score of 0.98. VirDetect-AI improves our comprehension of viral ecology and can accurately classify metagenomic sequences into 980 viral protein classes, hence enabling the identification of new viruses. These classes encompass an extensive array of viral genera and families, as well as protein functions and hosts.
Affiliation(s)
- Alida Zárate
- Doctorado en Ciencias, Instituto de Investigación en Ciencias Básicas Aplicadas (IICBA), Universidad Autónoma del Estado de Morelos, Cuernavaca, Morelos 62210, México
| | - Lorena Díaz-González
- Centro de Investigación en Ciencias, Universidad Autónoma del Estado de Morelos, Cuernavaca, Morelos 62210, México
| | - Blanca Taboada
- Departamento de Genética del Desarrollo y Fisiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
|
11
|
Shaw J, Yu YW. Rapid species-level metagenome profiling and containment estimation with sylph. Nat Biotechnol 2024:10.1038/s41587-024-02412-y. [PMID: 39379646 DOI: 10.1038/s41587-024-02412-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Accepted: 08/28/2024] [Indexed: 10/10/2024]
Abstract
Profiling metagenomes against databases allows for the detection and quantification of microorganisms, even at low abundances where assembly is not possible. We introduce sylph, a species-level metagenome profiler that estimates genome-to-metagenome containment average nucleotide identity (ANI) through zero-inflated Poisson k-mer statistics, enabling ANI-based taxa detection. On the Critical Assessment of Metagenome Interpretation II (CAMI2) Marine dataset, sylph was the most accurate profiling method of seven tested. For multisample profiling, sylph took >10-fold less central processing unit time compared to Kraken2 and used 30-fold less memory. Sylph's ANI estimates provided an orthogonal signal to abundance, allowing for an ANI-based metagenome-wide association study for Parkinson disease (PD) against 289,232 genomes while confirming known butyrate-PD associations at the strain level. Sylph took <1 min and 16 GB of random-access memory to profile metagenomes against 85,205 prokaryotic and 2,917,516 viral genomes, detecting 30-fold more viral sequences in the human gut compared to RefSeq. Sylph offers precise, efficient profiling with accurate containment ANI estimation even for low-coverage genomes.
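The containment-ANI idea in this abstract can be illustrated with a minimal Python sketch. It is not sylph's implementation: the function names, the default k-mer size, and the simple ratio-based coverage estimate are assumptions made for illustration. The sketch computes the fraction of a genome's k-mers found in a read set, corrects that fraction for low coverage under a Poisson model, and converts it to ANI as containment^(1/k).

```python
import math
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (no canonicalization, for simplicity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def containment_ani(genome, reads, k=21):
    """Estimate containment ANI of `genome` in `reads` (hypothetical helper).

    Returns (naive_ani, corrected_ani). The correction assumes k-mer
    multiplicities follow a Poisson distribution with mean lam, so a genome
    k-mer is observed at least once with probability 1 - exp(-lam).
    """
    genome_kmers = set(kmers(genome, k))
    read_counts = Counter(km for r in reads for km in kmers(r, k))
    # Multiplicity in the reads of each genome k-mer that was seen at all.
    seen = [read_counts[km] for km in genome_kmers if read_counts[km] > 0]
    if not seen:
        return 0.0, 0.0
    naive_c = len(seen) / len(genome_kmers)
    n1 = sum(1 for c in seen if c == 1)
    n2 = sum(1 for c in seen if c == 2)
    # For a Poisson variable, P(X=2) / P(X=1) = lam / 2.
    if n1 > 0 and n2 > 0:
        lam = 2.0 * n2 / n1
        corrected_c = min(1.0, naive_c / (1.0 - math.exp(-lam)))
    else:
        corrected_c = naive_c  # not enough signal to estimate coverage
    return naive_c ** (1.0 / k), corrected_c ** (1.0 / k)
```

The coverage correction matters most when lam is small: at low depth many genome k-mers are simply unsampled, and the naive containment would understate ANI.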
Affiliation(s)
- Jim Shaw
- Department of Mathematics, University of Toronto, Toronto, Ontario, Canada.
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| | - Yun William Yu
- Department of Mathematics, University of Toronto, Toronto, Ontario, Canada.
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.
|
12
|
Acheampong DA, Jenjaroenpun P, Wongsurawat T, Kurilung A, Pomyen Y, Kandel S, Kunadirek P, Chuaypen N, Kusonmano K, Nookaew I. CAIM: coverage-based analysis for identification of microbiome. Brief Bioinform 2024; 25:bbae424. [PMID: 39222062 PMCID: PMC11367759 DOI: 10.1093/bib/bbae424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2024] [Revised: 06/26/2024] [Accepted: 08/13/2024] [Indexed: 09/04/2024] Open
Abstract
Accurate taxonomic profiling of microbial taxa in a metagenomic sample is vital to gain insights into microbial ecology. Recent advancements in sequencing technologies have contributed tremendously toward understanding these microbes at species resolution through a whole shotgun metagenomic approach. In this study, we developed a new bioinformatics tool, coverage-based analysis for identification of microbiome (CAIM), for accurate taxonomic classification and quantification within both long- and short-read metagenomic samples using an alignment-based method. CAIM depends on two different containment techniques to identify species in metagenomic samples, using their genome coverage information to filter out false positives rather than the traditional approach of relative abundance. In addition, we propose a nucleotide-count-based abundance estimation, which yields a lower root mean square error than the traditional read-count approach. We evaluated the performance of CAIM on 28 metagenomic mock communities and 2 synthetic datasets by comparing it with other top-performing tools. CAIM performed consistently well across datasets, and better than other tools, in identifying microbial taxa and in estimating relative abundances. CAIM was then applied to a real dataset sequenced on both Nanopore (with and without amplification) and Illumina sequencing platforms and found high similarity of taxonomic profiles between the sequencing platforms. Lastly, CAIM was applied to fecal shotgun metagenomic datasets of 232 colorectal cancer patients and 229 controls obtained from 4 different countries and 44 primary liver cancer patients and 76 controls. The predictive performance of models using the genome-coverage cutoff was better than that of models using the relative-abundance cutoffs in discriminating colorectal cancer and primary liver cancer patients from healthy controls with high-confidence species markers.
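To make the two ideas in this abstract concrete, the sketch below contrasts read-count and nucleotide-count relative abundance and applies a genome-coverage (breadth) cutoff. It is an illustration, not CAIM's code: the alignment tuple layout, the function name `profile`, and the 0.1 cutoff are all assumptions.

```python
from collections import defaultdict


def profile(alignments, genome_lengths, min_breadth=0.1):
    """Summarize alignments given as (species, start, end), end exclusive.

    Returns {species: (read_fraction, nucleotide_fraction, breadth)} for
    species whose breadth of coverage is at least min_breadth.
    """
    reads = defaultdict(int)
    bases = defaultdict(int)
    covered = defaultdict(set)
    for species, start, end in alignments:
        reads[species] += 1
        bases[species] += end - start
        covered[species].update(range(start, end))  # fine for toy genomes

    breadth = {s: len(covered[s]) / genome_lengths[s] for s in reads}
    kept = [s for s in reads if breadth[s] >= min_breadth]
    total_reads = sum(reads[s] for s in kept)
    total_bases = sum(bases[s] for s in kept)
    return {
        s: (reads[s] / total_reads, bases[s] / total_bases, breadth[s])
        for s in kept
    }


# Toy input: one long read for species A, three short reads for B, and
# three reads piled on the same spot of C (high count, low breadth).
alignments = [
    ("A", 0, 900),
    ("B", 0, 100), ("B", 200, 300), ("B", 400, 500),
    ("C", 0, 10), ("C", 0, 10), ("C", 0, 10),
]
result = profile(alignments, {"A": 1000, "B": 1000, "C": 1000})
```

With this toy input, species A holds 25% of the retained reads but 75% of the aligned nucleotides, which is the kind of gap that appears when read lengths vary widely. Species C is dropped by the breadth cutoff even though it has as many reads as B.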
Affiliation(s)
- Daniel A Acheampong
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Stowers Institute for Medical Research, 1000 E 50 St, Kansas City, MO 64110, United States
| | - Piroon Jenjaroenpun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, 2 Wang Lang Road, Siriraj, Bangkok Noi, Bangkok 10700, Thailand
| | - Thidathip Wongsurawat
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, 2 Wang Lang Road, Siriraj, Bangkok Noi, Bangkok 10700, Thailand
| | - Alongkorn Kurilung
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
| | - Yotsawat Pomyen
- Translational Research Unit, Chulabhorn Research Institute, 54 Kamphaeng Phet Rd., Laksi, Bangkok 10210, Thailand
| | - Sangam Kandel
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Influenza Research Institute, Department of Pathobiological Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, 575 Science Drive, Madison, WI 53711, United States
| | - Pattapon Kunadirek
- Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Rama 4 road, Pathumwan, Bangkok 10330, Thailand
| | - Natthaya Chuaypen
- Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Rama 4 road, Pathumwan, Bangkok 10330, Thailand
| | - Kanthida Kusonmano
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut’s University of Technology Thonburi, 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Road, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand
- Systems Biology and Bioinformatics Research Laboratory, Pilot Plant Development and Training Institute, King Mongkut’s University of Technology Thonburi, 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Road, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand
| | - Intawat Nookaew
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Division of Endocrinology, Department of Medicine, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Department of Physiology and Cell Biology, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Department of Biochemistry, Faculty of Medicine Siriraj Hospital, Mahidol University, 2 Wang Lang Road, Siriraj, Bangkok Noi, Bangkok 10700, Thailand
|
13
|
Ulrich JU, Renard BY. Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters. Genome Res 2024; 34:914-924. [PMID: 38886068 PMCID: PMC11293544 DOI: 10.1101/gr.278623.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 05/23/2024] [Indexed: 06/20/2024]
Abstract
Metagenomic long-read sequencing is gaining popularity for various applications, including pathogen detection and microbiome studies. To analyze the large data created in those studies, software tools need to taxonomically classify the sequenced molecules and estimate the relative abundances of organisms in the sequenced sample. Because of the exponential growth of reference genome databases, the current taxonomic classification methods have large computational requirements. This issue motivated us to develop a new data structure for fast and memory-efficient querying of long reads. Here, we present Taxor as a new tool for long-read metagenomic classification using a hierarchical interleaved XOR filter data structure for indexing and querying large reference genome sets. Taxor implements several k-mer-based approaches, such as syncmers, for pseudoalignment to classify reads and an expectation-maximization algorithm for metagenomic profiling. Our results show that Taxor outperforms state-of-the-art tools regarding precision while having a similar recall for long-read taxonomic classification. Most notably, Taxor reduces the memory requirements and index size by >50% and is among the fastest tools regarding query times. This enables real-time metagenomics analysis with large reference databases on a small laptop in the field.
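For readers unfamiliar with the underlying data structure, here is a minimal sketch of a plain XOR filter for approximate set membership. It is not Taxor's hierarchical interleaved variant, and it is not taken from the Taxor code base: the helper names, the 8-bit fingerprint, the BLAKE2b-based hashing, and the sizing constants are assumptions chosen for illustration.

```python
import hashlib


def _slots_and_fingerprint(key, seed, block):
    """Hash key to one slot in each of three blocks plus an 8-bit fingerprint."""
    digest = hashlib.blake2b(
        key.encode(), digest_size=16, salt=seed.to_bytes(8, "little")
    ).digest()
    slots = tuple(
        j * block + int.from_bytes(digest[4 * j:4 * j + 4], "little") % block
        for j in range(3)
    )
    return slots, digest[12]


def build_xor_filter(keys, max_seeds=64):
    """Build a plain XOR filter; returns (seed, block, table)."""
    keys = sorted(set(keys))
    block = (int(1.23 * len(keys)) + 32) // 3
    size = 3 * block
    for seed in range(max_seeds):
        hashed = [_slots_and_fingerprint(k, seed, block) for k in keys]
        count = [0] * size
        xor_idx = [0] * size
        for i, (slots, _) in enumerate(hashed):
            for s in slots:
                count[s] += 1
                xor_idx[s] ^= i
        # Peel: repeatedly remove a key that is alone in some slot.
        queue = [s for s in range(size) if count[s] == 1]
        order = []
        while queue:
            s = queue.pop()
            if count[s] != 1:
                continue
            i = xor_idx[s]
            order.append((i, s))
            for t in hashed[i][0]:
                count[t] -= 1
                xor_idx[t] ^= i
                if count[t] == 1:
                    queue.append(t)
        if len(order) == len(keys):
            # Assign in reverse peel order so each key's own slot is still free.
            table = [0] * size
            for i, s in reversed(order):
                (a, b, c), fp = hashed[i]
                table[s] = fp ^ table[a] ^ table[b] ^ table[c]
            return seed, block, table
    raise RuntimeError("construction failed; try more seeds")


def contains(filt, key):
    """True for every inserted key; rarely true for keys never inserted."""
    seed, block, table = filt
    (a, b, c), fp = _slots_and_fingerprint(key, seed, block)
    return (table[a] ^ table[b] ^ table[c]) == fp
```

A query touches exactly three table cells, and the table needs roughly 1.23 fingerprints per key, which is why structures of this family are attractive for large k-mer sets. With 8-bit fingerprints the expected false-positive rate is about 1/256.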
Affiliation(s)
- Jens-Uwe Ulrich
- Data Analytics and Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany;
- Phylogenomics Unit, Center for Artificial Intelligence in Public Health Research, Robert Koch Institute, 15745 Wildau, Germany
- Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
| | - Bernhard Y Renard
- Data Analytics and Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany;
|
14
|
Acheampong DA, Jenjaroenpun P, Wongsurawat T, Krulilung A, Pomyen Y, Kandel S, Kunadirek P, Chuaypen N, Kusonmano K, Nookaew I. CAIM: Coverage-based Analysis for Identification of Microbiome. bioRxiv 2024:2024.04.25.591018. [PMID: 38746391 PMCID: PMC11091946 DOI: 10.1101/2024.04.25.591018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
Accurate taxonomic profiling of microbial taxa in a metagenomic sample is vital to gain insights into microbial ecology. Recent advancements in sequencing technologies have contributed tremendously toward understanding these microbes at species resolution through a whole shotgun metagenomic (WMS) approach. In this study, we developed a new bioinformatics tool, CAIM, for accurate taxonomic classification and quantification within both long- and short-read metagenomic samples using an alignment-based method. CAIM depends on two different containment techniques to identify species in metagenomic samples, using their genome coverage information to filter out false positives rather than the traditional approach of relative abundance. In addition, we propose a nucleotide-count-based abundance estimation, which yields a lower root mean square error than the traditional read-count approach. We evaluated the performance of CAIM on 28 metagenomic mock communities and 2 synthetic datasets by comparing it with other top-performing tools. CAIM performed consistently well across datasets, and better than other tools, in identifying microbial taxa and in estimating relative abundances. CAIM was then applied to a real dataset sequenced on both Nanopore (with and without amplification) and Illumina sequencing platforms and found high similarity of taxonomic profiles between the sequencing platforms. Lastly, CAIM was applied to fecal shotgun metagenomic datasets of 232 colorectal cancer patients and 229 controls obtained from 4 different countries and 44 primary liver cancer patients and 76 controls. The predictive performance of models using the genome-coverage cutoff was better than that of models using the relative-abundance cutoffs in discriminating colorectal cancer and primary liver cancer patients from healthy controls with high-confidence species markers.
Affiliation(s)
- Daniel A. Acheampong
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Piroon Jenjaroenpun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
| | - Thidathip Wongsurawat
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
| | - Alongkorn Krulilung
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Yotsawat Pomyen
- Translational Research Unit, Chulabhorn Research Institute, Bangkok, 10210, Thailand
| | - Sangam Kandel
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Pattapon Kunadirek
- Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
| | - Natthaya Chuaypen
- Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
| | - Kanthida Kusonmano
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut’s University of Technology Thonburi, Bangkok, 10150, Thailand
- Systems Biology and Bioinformatics Research Laboratory, Pilot Plant Development and Training Institute, King Mongkut’s University of Technology Thonburi, Bangkok, 10150, Thailand
| | - Intawat Nookaew
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
|
15
|
Song L, Langmead B. Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification. Genome Biol 2024; 25:106. [PMID: 38664753 PMCID: PMC11046777 DOI: 10.1186/s13059-024-03244-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Accepted: 04/10/2024] [Indexed: 04/28/2024] Open
Abstract
Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the memory footprint by half compared to other FM-index-based approaches. Furthermore, the lossless compression and the unconstrained match length help Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels.
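As background for the compression scheme named in this abstract, the following sketch builds a Burrows-Wheeler transform and run-length encodes it. This is plain run-length encoding, not Centrifuger's run-block compression, whose details the abstract does not give. The function names and the `$` sentinel are assumptions for illustration.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text via sorted rotations.

    Quadratic in memory; real indexes build the BWT from a suffix array.
    """
    if sentinel in text:
        raise ValueError("text must not contain the sentinel")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s):
    """Collapse runs of equal characters into (character, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def inverse_bwt(transformed, sentinel="$"):
    """Invert the transform by repeatedly prepending and sorting columns."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]
```

The transform groups characters that precede similar contexts, so repetitive sequence tends to produce long runs that compress well, and `inverse_bwt` shows that nothing is lost in the process.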
Affiliation(s)
- Li Song
- Department of Biomedical Data Science, Dartmouth College, Hanover, NH, USA.
- Department of Computer Science, Dartmouth College, Hanover, NH, USA.
- Department of Microbiology and Immunology, Dartmouth College, Hanover, NH, USA.
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
|
16
|
Pinto Y, Chakraborty M, Jain N, Bhatt AS. Phage-inclusive profiling of human gut microbiomes with Phanta. Nat Biotechnol 2024; 42:651-662. [PMID: 37231259 DOI: 10.1038/s41587-023-01799-4] [Citation(s) in RCA: 19] [Impact Index Per Article: 19.0] [Received: 08/04/2022] [Accepted: 04/20/2023] [Indexed: 05/27/2023]
Abstract
Due to technical limitations, most gut microbiome studies have focused on prokaryotes, overlooking viruses. Phanta, a virome-inclusive gut microbiome profiling tool, overcomes the limitations of assembly-based viral profiling methods by using customized k-mer-based classification tools and incorporating recently published catalogs of gut viral genomes. Phanta's optimizations consider the small genome size of viruses, sequence homology with prokaryotes and interactions with other gut microbes. Extensive testing of Phanta on simulated data demonstrates that it quickly and accurately quantifies prokaryotes and viruses. When applied to 245 fecal metagenomes from healthy adults, Phanta identifies ~200 viral species per sample, ~5× more than standard assembly-based methods. We observe a ~2:1 ratio between DNA viruses and bacteria, with higher interindividual variability of the gut virome compared to the gut bacteriome. In another cohort, we observe that Phanta performs equally well on bulk versus virus-enriched metagenomes, making it possible to study prokaryotes and viruses in a single experiment, with a single analysis.
Affiliation(s)
- Yishay Pinto
- Department of Genetics, Stanford University, Stanford, CA, USA
- Department of Medicine, Divisions of Hematology and Blood & Marrow Transplantation, Stanford University, Stanford, CA, USA
| | | | - Navami Jain
- Department of Genetics, Stanford University, Stanford, CA, USA
- Department of Medicine, Divisions of Hematology and Blood & Marrow Transplantation, Stanford University, Stanford, CA, USA
| | - Ami S Bhatt
- Department of Genetics, Stanford University, Stanford, CA, USA.
- Department of Medicine, Divisions of Hematology and Blood & Marrow Transplantation, Stanford University, Stanford, CA, USA.
|
17
|
Liu X, Liu Y, Liu J, Zhang H, Shan C, Guo Y, Gong X, Cui M, Li X, Tang M. Correlation between the gut microbiome and neurodegenerative diseases: a review of metagenomics evidence. Neural Regen Res 2024; 19:833-845. [PMID: 37843219 PMCID: PMC10664138 DOI: 10.4103/1673-5374.382223] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Received: 02/23/2023] [Revised: 04/19/2023] [Accepted: 06/17/2023] [Indexed: 10/17/2023] Open
Abstract
A growing body of evidence suggests that the gut microbiota contributes to the development of neurodegenerative diseases via the microbiota-gut-brain axis. As a contributing factor, microbiota dysbiosis always occurs in pathological changes of neurodegenerative diseases, such as Alzheimer's disease, Parkinson's disease, and amyotrophic lateral sclerosis. High-throughput sequencing technology has helped to reveal that the bidirectional communication between the central nervous system and the enteric nervous system is facilitated by the microbiota's diverse microorganisms, and for both neuroimmune and neuroendocrine systems. Here, we summarize the bioinformatics analysis and wet-biology validation for the gut metagenomics in neurodegenerative diseases, with an emphasis on multi-omics studies and the gut virome. The pathogen-associated signaling biomarkers for identifying brain disorders and potential therapeutic targets are also elucidated. Finally, we discuss the role of diet, prebiotics, probiotics, postbiotics and exercise interventions in remodeling the microbiome and reducing the symptoms of neurodegenerative diseases.
Affiliation(s)
- Xiaoyan Liu
- School of Life Sciences, Jiangsu University, Zhenjiang, Jiangsu Province, China
| | - Yi Liu
- School of Life Sciences, Jiangsu University, Zhenjiang, Jiangsu Province, China
- Institute of Animal Husbandry, Jiangsu Academy of Agricultural Sciences, Nanjing, Jiangsu Province, China
| | - Junlin Liu
- School of Life Sciences, Jiangsu University, Zhenjiang, Jiangsu Province, China
| | - Hantao Zhang
- School of Life Sciences, Jiangsu University, Zhenjiang, Jiangsu Province, China
| | - Chaofan Shan
- School of Life Sciences, Jiangsu University, Zhenjiang, Jiangsu Province, China
| | - Yinglu Guo
- School of Life Sciences, Jiangsu University, Zhenjiang, Jiangsu Province, China
| | - Xun Gong
- Department of Rheumatology & Immunology, Affiliated Hospital of Jiangsu University, Zhenjiang, Jiangsu Province, China
| | - Mengmeng Cui
- Department of Neurology, The Second Affiliated Hospital of Shandong First Medical University, Taian, Shandong Province, China
| | - Xiubin Li
- Department of Neurology, The Second Affiliated Hospital of Shandong First Medical University, Taian, Shandong Province, China
| | - Min Tang
- School of Life Sciences, Jiangsu University, Zhenjiang, Jiangsu Province, China
|
18
|
Chorlton SD. Ten common issues with reference sequence databases and how to mitigate them. Front Bioinform 2024; 4:1278228. [PMID: 38560517 PMCID: PMC10978663 DOI: 10.3389/fbinf.2024.1278228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Accepted: 03/05/2024] [Indexed: 04/04/2024] Open
Abstract
Metagenomic sequencing has revolutionized our understanding of microbiology. While metagenomic tools and approaches have been extensively evaluated and benchmarked, far less attention has been given to the reference sequence database used in metagenomic classification. Issues with reference sequence databases are pervasive. Database contamination is the most recognized issue in the literature; however, it remains relatively unmitigated in most analyses. Other common issues with reference sequence databases include taxonomic errors, inappropriate inclusion and exclusion criteria, and sequence content errors. This review covers ten common issues with reference sequence databases and the potential downstream consequences of these issues. Mitigation measures are discussed for each issue, including bioinformatic tools and database curation strategies. Together, these strategies present a path towards more accurate, reproducible and translatable metagenomic sequencing.
|
19
|
Diener C, Gibbons SM. Metagenomic estimation of dietary intake from human stool. bioRxiv 2024:2024.02.02.578701. [PMID: 38370672 PMCID: PMC10871216 DOI: 10.1101/2024.02.02.578701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
Dietary intake is tightly coupled to gut microbiota composition, human metabolism, and to the incidence of virtually all major chronic diseases. Dietary and nutrient intake are usually quantified using dietary questionnaires, which tend to focus on broad food categories, suffer from self-reporting biases, and require strong compliance from study participants. Here, we present MEDI (Metagenomic Estimation of Dietary Intake): a method for quantifying dietary intake using food-derived DNA in stool metagenomes. We show that food items can be accurately detected in metagenomic shotgun sequencing data, even when present at low abundances (>10 reads). Furthermore, we show how dietary intake, in terms of DNA abundance from specific organisms, can be converted into a detailed metabolic representation of nutrient intake. MEDI could identify the onset of solid food consumption in infants and it accurately predicted food questionnaire responses in an adult population. Additionally, we were able to identify specific dietary features associated with metabolic syndrome in a large clinical cohort, providing a proof-of-concept for detailed quantification of individual-specific dietary patterns without the need for questionnaires.
Affiliation(s)
- Christian Diener
- Diagnostic and Research Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Institute for Systems Biology, Seattle, WA, USA
| | - Sean M. Gibbons
- Institute for Systems Biology, Seattle, WA, USA
- Department of Bioengineering, University of Washington, Seattle, WA, USA
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- eScience Institute, University of Washington, Seattle, WA, USA
|
20
|
Yang C, Zhang Z, Huang Y, Xie X, Liao H, Xiao J, Veldsman WP, Yin K, Fang X, Zhang L. LRTK: a platform agnostic toolkit for linked-read analysis of both human genome and metagenome. Gigascience 2024; 13:giae028. [PMID: 38869148 PMCID: PMC11170215 DOI: 10.1093/gigascience/giae028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Revised: 03/15/2024] [Accepted: 05/09/2024] [Indexed: 06/14/2024] Open
Abstract
BACKGROUND: Linked-read sequencing technologies generate high-base quality short reads that contain extrapolative information on long-range DNA connectedness. These advantages of linked-read technologies are well known and have been demonstrated in many human genomic and metagenomic studies. However, existing linked-read analysis pipelines (e.g., Long Ranger) were primarily developed to process sequencing data from the human genome and are not suited for analyzing metagenomic sequencing data. Moreover, linked-read analysis pipelines are typically limited to 1 specific sequencing platform. FINDINGS: To address these limitations, we present the Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform agnostic processing of linked-read sequencing data from both human genome and metagenome. LRTK provides functions to perform linked-read simulation, barcode sequencing error correction, barcode-aware read alignment and metagenome assembly, reconstruction of long DNA fragments, taxonomic classification and quantification, and barcode-assisted genomic variant calling and phasing. LRTK has the ability to process multiple samples automatically and provides users with the option to generate reproducible reports during processing of raw sequencing data and at multiple checkpoints throughout downstream analysis. We applied LRTK on linked reads from simulation, mock community, and real datasets for both human genome and metagenome. We showcased LRTK's ability to generate comparative performance results from preceding benchmark studies and to report these results in publication-ready HTML document plots. CONCLUSIONS: LRTK provides comprehensive and flexible modules along with an easy-to-use Python-based workflow for processing linked-read sequencing datasets, thereby filling the current gap in the field caused by platform-centric genome-specific linked-read data analysis tools.
Affiliation(s)
- Chao Yang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
| | - Zhenmiao Zhang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
| | - Yufen Huang
- BGI Research, Shenzhen 518083, China
- BGI Genomics, Shenzhen 518083, China
| | | | - Herui Liao
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong SAR 999077, Hong Kong
| | - Jin Xiao
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
| | - Werner Pieter Veldsman
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
| | - Kejing Yin
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
| | - Xiaodong Fang
- BGI Genomics, Shenzhen 518083, China
- BGI Research, Sanya 572025, China
| | - Lu Zhang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
- Institute for Research and Continuing Education, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
|
21
|
Song L, Langmead B. Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification. bioRxiv 2023:2023.11.15.567129. [PMID: 38014029 PMCID: PMC10680779 DOI: 10.1101/2023.11.15.567129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the memory footprint by half compared to other FM-index-based approaches. Furthermore, the lossless compression and the unconstrained match length help Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels.
Affiliation(s)
- Li Song
- Department of Biomedical Data Science, Dartmouth College, Hanover, NH
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD
|
22
|
Fan J, Singh NP, Khan J, Pibiri GE, Patro R. Fulgor: A fast and compact k-mer index for large-scale matching and color queries. bioRxiv 2023:2023.05.09.539895. [PMID: 37214944 PMCID: PMC10197524 DOI: 10.1101/2023.05.09.539895] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2023]
Abstract
The problem of sequence identification or matching (determining the subset of references from a given collection that are likely to contain a query nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. The reference collection should therefore be pre-processed into an index for fast queries. This poses the threefold challenge of designing an index that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1+o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2-6× faster to construct.
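The notion of a k-mer "color" used in this abstract can be illustrated with a small Python sketch. It is not Fulgor's data structure: it uses plain dictionaries, with no unitigs, no order-preserving dictionary, and no compression, and all function names are hypothetical. It maps each k-mer to the set of references that contain it, stores each distinct color set only once, and answers a query by intersecting the colors of the query's k-mers.

```python
def kmers(seq, k):
    """Return the set of k-mers of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_color_index(references, k):
    """Return (kmer_to_color_id, color_sets) for a list of reference strings."""
    kmer_refs = {}
    for ref_id, seq in enumerate(references):
        for km in kmers(seq, k):
            kmer_refs.setdefault(km, set()).add(ref_id)
    color_ids = {}      # color set -> integer id
    color_sets = []     # integer id -> color set, each stored once
    kmer_to_color = {}
    for km, refs in kmer_refs.items():
        color = frozenset(refs)
        if color not in color_ids:
            color_ids[color] = len(color_sets)
            color_sets.append(color)
        kmer_to_color[km] = color_ids[color]
    return kmer_to_color, color_sets


def query(seq, kmer_to_color, color_sets, k):
    """Return references containing every k-mer of seq (exact intersection)."""
    result = None
    for km in kmers(seq, k):
        cid = kmer_to_color.get(km)
        if cid is None:
            return frozenset()
        result = color_sets[cid] if result is None else result & color_sets[cid]
    return result if result is not None else frozenset()


references = ["ACGTACGT", "ACGTTTTT", "GGGGACGT"]
index, colors = build_color_index(references, k=4)
```

In this toy collection 11 distinct k-mers share only 4 distinct color sets. That redundancy is what the abstract exploits at scale by storing one inverted list per unitig rather than one per k-mer.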
Affiliation(s)
- Jason Fan
- Department of Computer Science, University of Maryland, College Park, MD 20440, USA
- Noor Pratap Singh
- Department of Computer Science, University of Maryland, College Park, MD 20440, USA
- Jamshed Khan
- Department of Computer Science, University of Maryland, College Park, MD 20440, USA
- Rob Patro
- Department of Computer Science, University of Maryland, College Park, MD 20440, USA
|
23
|
Yorki S, Shea T, Cuomo CA, Walker BJ, LaRocque RC, Manson AL, Earl AM, Worby CJ. Comparison of long- and short-read metagenomic assembly for low-abundance species and resistance genes. Brief Bioinform 2023; 24:bbad050. [PMID: 36804804 PMCID: PMC10025444 DOI: 10.1093/bib/bbad050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2022] [Revised: 01/13/2023] [Accepted: 01/26/2023] [Indexed: 02/23/2023] Open
Abstract
Recent technological and computational advances have made metagenomic assembly a viable approach to achieving high-resolution views of complex microbial communities. In previous benchmarking, short-read (SR) metagenomic assemblers had the highest accuracy, long-read (LR) assemblers generated the most contiguous sequences and hybrid (HY) assemblers balanced length and accuracy. However, no assessments have specifically compared the performance of these assemblers on low-abundance species, which include clinically relevant organisms in the gut. We generated semi-synthetic LR and SR datasets by spiking small and increasing amounts of Escherichia coli isolate reads into fecal metagenomes and, using different assemblers, examined E. coli contigs and the presence of antibiotic resistance genes (ARGs). For ARG assembly, although SR assemblers recovered more ARGs with high accuracy, even at low coverages, LR assemblies allowed for the placement of ARGs within longer, E. coli-specific contigs, thus pinpointing their taxonomic origin. HY assemblies identified resistance genes with high accuracy and had lower contiguity than LR assemblies. Each assembler type's strengths were maintained even when our isolate was spiked in with a competing strain, which fragmented and reduced the accuracy of all assemblies. For strain characterization and determining gene context, LR assembly is optimal, while for base-accurate gene identification, SR assemblers outperform other options. HY assembly offers contiguity and base accuracy, but requires generating data on multiple platforms, and may suffer high misassembly rates when strain diversity exists. Our results highlight the trade-offs associated with each approach for recovering low-abundance taxa, and that the optimal approach is goal-dependent.
Affiliation(s)
- Sosie Yorki
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Terrance Shea
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Christina A Cuomo
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Bruce J Walker
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Applied Invention, LLC, Cambridge, MA, USA
- Regina C LaRocque
- Division of Infectious Diseases, Massachusetts General Hospital, Boston, MA, USA
- Abigail L Manson
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Ashlee M Earl
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Colin J Worby
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
|