1. Milhaven M, Garg A, Versoza CJ, Pfeifer SP. Quantifying the effects of computational filter criteria on the accurate identification of de novo mutations at varying levels of sequencing coverage. Heredity (Edinb) 2025; 134:273-279. [PMID: 40082647 PMCID: PMC12056167 DOI: 10.1038/s41437-025-00754-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2023] [Revised: 02/24/2025] [Accepted: 02/24/2025] [Indexed: 03/16/2025]
Abstract
The rate of spontaneous (de novo) germline mutation is a key parameter in evolutionary biology, impacting genetic diversity and contributing to the evolution of populations and species. Mutation rates themselves evolve over time but the mechanisms underlying the mutation rate variation observed across the Tree of Life remain largely to be elucidated. In recent years, whole genome sequencing has enabled the estimation of mutation rates for several organisms. However, due to a lack of community standards, many previous studies differ both empirically - most notably, in the depth of sequencing used to reliably identify de novo mutations - and computationally - utilizing different computational pipelines to detect germline mutations as well as different analysis strategies to mitigate technical artifacts - rendering comparisons between studies challenging. Using a pedigree of Western chimpanzees as an illustrative example, we here quantify the effects of commonly utilized quality metrics to reliably identify de novo mutations at different levels of sequencing coverage. We demonstrate that datasets with a mean depth of ≤ 30X are ill-suited for the detection of de novo mutations due to high false positive rates that can only be partially mitigated by computational filter criteria. In contrast, higher coverage datasets enable a comprehensive identification of de novo mutations at low false positive rates, with minimal benefits beyond a sequencing coverage of 60X, suggesting that future work should favor breadth (by sequencing additional individuals) over depth. Importantly, the simulation and analysis framework described here provides conceptual guidelines that will allow researchers to take study design and species-specific resources into account when determining computational filtering strategies for their organism of interest.
Affiliation(s)
- Mark Milhaven
  - School of Life Sciences, Arizona State University, Tempe, AZ, 85281, USA
  - Center for Evolution and Medicine, Arizona State University, Tempe, AZ, 85281, USA
- Aman Garg
  - School of Life Sciences, Arizona State University, Tempe, AZ, 85281, USA
- Cyril J Versoza
  - School of Life Sciences, Arizona State University, Tempe, AZ, 85281, USA
  - Center for Evolution and Medicine, Arizona State University, Tempe, AZ, 85281, USA
- Susanne P Pfeifer
  - School of Life Sciences, Arizona State University, Tempe, AZ, 85281, USA
  - Center for Evolution and Medicine, Arizona State University, Tempe, AZ, 85281, USA
2. Ramos Lopez D, Flores FJ, Espindola AS. MeStanG-Resource for High-Throughput Sequencing Standard Data Sets Generation for Bioinformatic Methods Evaluation and Validation. BIOLOGY 2025; 14:69. [PMID: 39857299 PMCID: PMC11762867 DOI: 10.3390/biology14010069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2024] [Revised: 01/10/2025] [Accepted: 01/11/2025] [Indexed: 01/27/2025]
Abstract
Metagenomics analysis has enabled the measurement of microbiome diversity in environmental samples without prior targeted enrichment. Functional and phylogenetic studies based on microbial diversity retrieved using HTS platforms have advanced from detecting known organisms and discovering unknown species to applications in disease diagnostics. Robust validation processes are essential for test reliability, requiring standard samples and databases derived from real samples and in silico generated artificial controls. We propose MeStanG as a resource for generating HTS Nanopore data sets to evaluate present and emerging bioinformatics pipelines. MeStanG allows samples to be designed with user-defined organism abundances expressed as number of reads, reference sequences, and predetermined or custom errors by sequencing profiles. The simulator pipeline was evaluated by analyzing its output mock metagenomic samples containing known read abundances using read mapping, genome assembly, and taxonomic classification in three scenarios: a bacterial community composed of nine different organisms, samples resembling pathogen-infected wheat plants, and a viral pathogen serial dilution sampling. The evaluation consistently reported the same organisms and read abundances as specified in the mock metagenomic sample design. Based on this performance and its novel capacity to generate an exact number of reads, MeStanG can be used by scientists to develop mock metagenomic samples (artificial HTS data sets) to assess the diagnostic performance metrics of bioinformatic pipelines, allowing the user to choose predetermined or customized models for research and training.
Affiliation(s)
- Daniel Ramos Lopez
  - Institute for Biosecurity and Microbial Forensics (IBMF), Oklahoma State University, Stillwater, OK 74078, USA
  - Department of Entomology and Plant Pathology, Oklahoma State University, Stillwater, OK 74078, USA
- Francisco J. Flores
  - Departamento de Ciencias de la Vida y la Agricultura, Universidad de las Fuerzas Armadas-ESPE, Sangolquí 171103, Ecuador
  - Centro de Investigación de Alimentos, CIAL, Facultad de Ciencias de la Ingeniería e Industrias, Universidad UTE, Quito 170527, Ecuador
- Andres S. Espindola
  - Institute for Biosecurity and Microbial Forensics (IBMF), Oklahoma State University, Stillwater, OK 74078, USA
  - Department of Entomology and Plant Pathology, Oklahoma State University, Stillwater, OK 74078, USA
3. Chaabane F, Pillonel T, Bertelli C. MeSS and assembly_finder: a toolkit for in silico metagenomic sample generation. Bioinformatics 2024; 41:btae760. [PMID: 39739308 PMCID: PMC11755095 DOI: 10.1093/bioinformatics/btae760] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2024] [Revised: 11/17/2024] [Accepted: 12/30/2024] [Indexed: 01/02/2025]
Abstract
SUMMARY: The intrinsic complexity of the microbiota combined with technical variability renders shotgun metagenomics challenging to analyze for routine clinical or research applications. In silico data generation offers a controlled environment in which to, for example, benchmark bioinformatics tools, optimize study design and statistical power, or validate targeted applications. Here, we propose assembly_finder and the Metagenomic Sequence Simulator (MeSS), two easy-to-use Bioconda packages, as part of a benchmarking toolkit to download genomes and simulate shotgun metagenomics samples, respectively. Outperforming existing tools in speed while requiring less memory, MeSS reproducibly generates accurate complex communities based on a list of taxonomic ranks and their abundance.
AVAILABILITY AND IMPLEMENTATION: All code is released under the MIT License and is available at https://github.com/metagenlab/MeSS and https://github.com/metagenlab/assembly_finder.
Affiliation(s)
- Farid Chaabane
  - Institute of Microbiology, Lausanne University Hospital and University of Lausanne, Lausanne, 1011, Switzerland
- Trestan Pillonel
  - Institute of Microbiology, Lausanne University Hospital and University of Lausanne, Lausanne, 1011, Switzerland
- Claire Bertelli
  - Institute of Microbiology, Lausanne University Hospital and University of Lausanne, Lausanne, 1011, Switzerland
4. Van Uffelen A, Posadas A, Roosens NHC, Marchal K, De Keersmaecker SCJ, Vanneste K. Benchmarking bacterial taxonomic classification using nanopore metagenomics data of several mock communities. Sci Data 2024; 11:864. [PMID: 39127718 PMCID: PMC11316826 DOI: 10.1038/s41597-024-03672-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2024] [Accepted: 07/22/2024] [Indexed: 08/12/2024]
Abstract
Taxonomic classification is crucial in identifying organisms within diverse microbial communities when using metagenomics shotgun sequencing. While second-generation Illumina sequencing still dominates, third-generation nanopore sequencing promises improved classification through longer reads. However, extensive benchmarking studies on nanopore data are lacking. We systematically evaluated performance of bacterial taxonomic classification for metagenomics nanopore sequencing data for several commonly used classifiers, using standardized reference sequence databases, on the largest collection of publicly available data for defined mock communities thus far (nine samples), representing different research domains and application scopes. Our results categorize classifiers into three categories: low precision/high recall; medium precision/medium recall, and high precision/medium recall. Most fall into the first group, although precision can be improved without excessively penalizing recall with suitable abundance filtering. No definitive 'best' classifier emerges, and classifier selection depends on application scope and practical requirements. Although few classifiers designed for long reads exist, they generally exhibit better performance. Our comprehensive benchmarking provides concrete recommendations, supported by publicly available code for reassessment and fine-tuning by other scientists.
Affiliation(s)
- Alexander Van Uffelen
  - Transversal activities in Applied Genomics, Sciensano, Brussels, Belgium
  - Department of Information Technology, Internet Technology and Data Science Lab (IDLab), Interuniversity Microelectronics Centre (IMEC), Ghent University, Ghent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Andrés Posadas
  - Transversal activities in Applied Genomics, Sciensano, Brussels, Belgium
  - Department of Information Technology, Internet Technology and Data Science Lab (IDLab), Interuniversity Microelectronics Centre (IMEC), Ghent University, Ghent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Nancy H C Roosens
  - Transversal activities in Applied Genomics, Sciensano, Brussels, Belgium
- Kathleen Marchal
  - Department of Information Technology, Internet Technology and Data Science Lab (IDLab), Interuniversity Microelectronics Centre (IMEC), Ghent University, Ghent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
  - Department of Genetics, University of Pretoria, Pretoria, South Africa
- Kevin Vanneste
  - Transversal activities in Applied Genomics, Sciensano, Brussels, Belgium
5. Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Affiliation(s)
- Thomas G Brooks
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Nicholas F Lahens
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Antonijo Mrčela
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Gregory R Grant
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
6. Choudalakis M, Bashtrykov P, Jeltsch A. RepEnTools: an automated repeat enrichment analysis package for ChIP-seq data reveals hUHRF1 Tandem-Tudor domain enrichment in young repeats. Mob DNA 2024; 15:6. [PMID: 38570859 PMCID: PMC10988844 DOI: 10.1186/s13100-024-00315-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Accepted: 03/05/2024] [Indexed: 04/05/2024]
Abstract
BACKGROUND: Repeat elements (REs) play important roles for cell function in health and disease. However, RE enrichment analysis in short-read high-throughput sequencing (HTS) data, such as ChIP-seq, is a challenging task.
RESULTS: Here, we present RepEnTools, a software package for genome-wide RE enrichment analysis of ChIP-seq and similar chromatin pulldown experiments. Our analysis package bundles together various software with carefully chosen and validated settings to provide a complete solution for RE analysis, starting from raw input files to tabular and graphical outputs. RepEnTools implementations are easily accessible even with minimal IT skills (Galaxy/UNIX). To demonstrate the performance of RepEnTools, we analysed chromatin pulldown data by the human UHRF1 TTD protein domain and discovered enrichment of TTD binding on young primate and hominid specific polymorphic repeats (SVA, L1PA1/L1HS) overlapping known enhancers and decorated with H3K4me1-K9me2/3 modifications. We corroborated these new bioinformatic findings with experimental data by qPCR assays using newly developed primate and hominid specific qPCR assays which complement similar research tools. Finally, we analysed mouse UHRF1 ChIP-seq data with RepEnTools and showed that the endogenous mUHRF1 protein colocalizes with H3K4me1-H3K9me3 on promoters of REs which were silenced by UHRF1. These new data suggest a functional role for UHRF1 in silencing of REs that is mediated by TTD binding to the H3K4me1-K9me3 double mark and conserved in two mammalian species.
CONCLUSIONS: RepEnTools improves the previously available programmes for RE enrichment analysis in chromatin pulldown studies by leveraging new tools, enhancing accessibility and adding some key functions. RepEnTools can analyse RE enrichment rapidly, efficiently, and accurately, providing the community with an up-to-date, reliable and accessible tool for this important type of analysis.
Affiliation(s)
- Michel Choudalakis
  - Department of Biochemistry, Institute of Biochemistry and Technical Biochemistry, University of Stuttgart, Allmandring 31, 70569, Stuttgart, Germany
- Pavel Bashtrykov
  - Department of Biochemistry, Institute of Biochemistry and Technical Biochemistry, University of Stuttgart, Allmandring 31, 70569, Stuttgart, Germany
- Albert Jeltsch
  - Department of Biochemistry, Institute of Biochemistry and Technical Biochemistry, University of Stuttgart, Allmandring 31, 70569, Stuttgart, Germany
7. Hall MB, Coin LJM. Pangenome databases improve host removal and mycobacteria classification from clinical metagenomic data. Gigascience 2024; 13:giae010. [PMID: 38573185 PMCID: PMC10993716 DOI: 10.1093/gigascience/giae010] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/18/2023] [Revised: 01/10/2024] [Accepted: 02/27/2024] [Indexed: 04/05/2024]
Abstract
BACKGROUND: Culture-free real-time sequencing of clinical metagenomic samples promises both rapid pathogen detection and antimicrobial resistance profiling. However, this approach introduces the risk of patient DNA leakage. To mitigate this risk, we need near-comprehensive removal of human DNA sequences at the point of sequencing, typically involving the use of resource-constrained devices. Existing benchmarks have largely focused on the use of standardized databases and largely ignored the computational requirements of depletion pipelines as well as the impact of human genome diversity.
RESULTS: We benchmarked host removal pipelines on simulated and artificial real Illumina and Nanopore metagenomic samples. We found that construction of a custom kraken database containing diverse human genomes results in the best balance of accuracy and computational resource usage. In addition, we benchmarked pipelines using kraken and minimap2 for taxonomic classification of Mycobacterium reads using standard and custom databases. With a database representative of the Mycobacterium genus, both tools obtained improved specificity and sensitivity, compared to the standard databases for classification of Mycobacterium tuberculosis. Computational efficiency of these custom databases was superior to most standard approaches, allowing them to be executed on a laptop device.
CONCLUSIONS: Customized pangenome databases provide the best balance of accuracy and computational efficiency when compared to standard databases for the task of human read removal and M. tuberculosis read classification from metagenomic samples. Such databases allow for execution on a laptop, without sacrificing accuracy, an especially important consideration in low-resource settings. We make all customized databases and pipelines freely available.
Affiliation(s)
- Michael B Hall
  - Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, 3000 Victoria, Australia
- Lachlan J M Coin
  - Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, 3000 Victoria, Australia