1
|
Pham DT, Phan V. MetaBIDx: a new computational approach to bacteria identification in microbiomes. Microbiome Research Reports 2024; 3:25. [PMID: 38841411 PMCID: PMC11149084 DOI: 10.20517/mrr.2024.01] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2024] [Revised: 03/04/2024] [Accepted: 03/25/2024] [Indexed: 06/07/2024]
Abstract
Objectives: This study introduces MetaBIDx, a computational method designed to enhance species prediction in metagenomic environments. The method addresses the challenge of accurate species identification in complex microbiomes, which is due to the large number of generated reads and the ever-expanding number of bacterial genomes. Bacterial identification is essential for disease diagnosis and tracing outbreaks associated with microbial infections. Methods: MetaBIDx utilizes a modified Bloom filter for efficient indexing of reference genomes and incorporates a novel strategy for reducing false positives by clustering species based on their genomic coverages by identified reads. The approach was evaluated and compared with several well-established tools across various datasets. Precision, recall, and F1-score were used to quantify the accuracy of species prediction. Results: MetaBIDx demonstrated superior performance compared to other tools, especially in terms of precision and F1-score. The application of clustering based on approximate coverages significantly improved precision in species identification, effectively minimizing false positives. We further demonstrated that other methods can also benefit from our approach to removing false positives by clustering species based on approximate coverages. Conclusion: With a novel approach to reducing false positives and the effective use of a modified Bloom filter to index species, MetaBIDx represents an advancement in metagenomic analysis. The findings suggest that the proposed approach could also benefit other metagenomic tools, indicating its potential for broader application in the field. The study lays the groundwork for future improvements in computational efficiency and the expansion of microbial databases.
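A minimal sketch of two ideas named in the abstract above, under stated assumptions: k-mers from reference genomes are indexed in a Bloom filter, and species predictions are scored with precision, recall, and F1. This is a plain textbook Bloom filter, not MetaBIDx's modified variant, and the names `KmerBloomFilter`, `kmers`, and `precision_recall_f1`, as well as the filter sizes, are illustrative choices that do not come from the paper.

```python
import hashlib


class KmerBloomFilter:
    """Plain Bloom filter over k-mers (an assumption; not MetaBIDx's modified filter)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer: str):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, kmer: str) -> None:
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer: str) -> bool:
        # True for every added k-mer; may also be True for k-mers never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer))


def kmers(seq: str, k: int):
    """Yield all overlapping substrings of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def precision_recall_f1(predicted: set, truth: set):
    """Score a predicted species set against the true species set."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

A Bloom filter never misses a k-mer that was added but can report one that was not, so membership answers are approximate by design.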
Affiliation(s)
- Vinhthuy Phan
  - Department of Computer Science, University of Memphis, Memphis, TN 38152, USA
|
2
|
PathoLive—Real-Time Pathogen Identification from Metagenomic Illumina Datasets. Life (Basel) 2022; 12:life12091345. [PMID: 36143382 PMCID: PMC9505849 DOI: 10.3390/life12091345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2022] [Revised: 08/24/2022] [Accepted: 08/24/2022] [Indexed: 11/18/2022]
Abstract
Over the past years, NGS has become a crucial workhorse for open-view pathogen diagnostics. Yet, long turnaround times result from using massively parallel high-throughput technologies as the analysis can only be performed after sequencing has finished. The interpretation of results can further be challenged by contaminations, clinically irrelevant sequences, and the sheer amount and complexity of the data. We implemented PathoLive, a real-time diagnostics pipeline for the detection of pathogens from clinical samples hours before sequencing has finished. Based on real-time alignment with HiLive2, mappings are scored with respect to common contaminations, low-entropy areas, and sequences of widespread, non-pathogenic organisms. The results are visualized using an interactive taxonomic tree that provides an easily interpretable overview of the relevance of hits. For a human plasma sample that was spiked in vitro with six pathogenic viruses, all agents were clearly detected after only 40 of 200 sequencing cycles. For a real-world sample from Sudan, the results correctly indicated the presence of Crimean-Congo hemorrhagic fever virus. In a second real-world dataset from the 2019 SARS-CoV-2 outbreak in Wuhan, we found the presence of a SARS coronavirus as the most relevant hit without the novel virus reference genome being included in the database. For all samples, clinically irrelevant hits were correctly de-emphasized. Our approach is valuable for obtaining fast and accurate NGS-based pathogen identifications and for correctly prioritizing and visualizing them based on their clinical significance. PathoLive is open source and available on GitLab and BioConda.
|
3
|
Aubert J, Schbath S, Robin S. Model‐based biclustering for overdispersed count data with application in microbial ecology. Methods Ecol Evol 2021. [DOI: 10.1111/2041-210x.13582] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022]
Affiliation(s)
- Julie Aubert
  - Université Paris‐Saclay, AgroParisTech, INRAE, MIA‐Paris, Paris, France
- Stéphane Robin
  - Université Paris‐Saclay, AgroParisTech, INRAE, MIA‐Paris, Paris, France
|
4
|
Sharma P, Tripathi S, Chandra R. Metagenomic analysis for profiling of microbial communities and tolerance in metal-polluted pulp and paper industry wastewater. Bioresource Technology 2021; 324:124681. [PMID: 33454444 DOI: 10.1016/j.biortech.2021.124681] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 12/03/2020] [Revised: 01/02/2021] [Accepted: 01/04/2021] [Indexed: 06/12/2023]
Abstract
This work aimed to study the profile and efficiency of microbial communities and their abundance in pulp and paper industry wastewater, which contained toxic metals, high biological oxygen demand, chemical oxygen demand, and ion content. Sequence alignment of the 16S rRNA V3-V4 variable region from the Illumina MiSeq platform revealed 25,356 operational taxonomic units (OTUs) derived from the wastewater sample. The major phyla identified in the wastewater were Proteobacteria, Bacteroidetes, Firmicutes, Chloroflexi, Actinobacteria, Spirochetes, Patesibacteria, Acidobacteria, and others including unknown microbes. The study showed the function of microbial communities essential for the oxidation and detoxification of complex contaminants and for the design of effective remediation techniques for the re-use of polluted wastewater. The findings demonstrated the ability of different classes of microbes to adapt and survive in metal-polluted wastewater irrespective of their relative distribution, and suggest that further attention should be given to their use in bioremediation.
Affiliation(s)
- Pooja Sharma
  - Department of Environmental Microbiology, School for Environmental Sciences, Babasaheb Bhimrao Ambedkar University (A Central University), Lucknow 226 025, Uttar Pradesh, India
- Sonam Tripathi
  - Department of Environmental Microbiology, School for Environmental Sciences, Babasaheb Bhimrao Ambedkar University (A Central University), Lucknow 226 025, Uttar Pradesh, India
- Ram Chandra
  - Department of Environmental Microbiology, School for Environmental Sciences, Babasaheb Bhimrao Ambedkar University (A Central University), Lucknow 226 025, Uttar Pradesh, India
|
5
|
Kirzhner V, Toledano-Kitai D, Volkovich Z. Evaluating the number of different genomes in a metagenome by means of the compositional spectra approach. PLoS One 2020; 15:e0237205. [PMID: 33156862 PMCID: PMC7647110 DOI: 10.1371/journal.pone.0237205] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/19/2020] [Accepted: 10/22/2020] [Indexed: 01/02/2023]
Abstract
Determination of metagenome composition is still one of the most interesting problems of bioinformatics. It involves a wide range of mathematical methods, from probabilistic models of combinatorics to cluster analysis and pattern recognition techniques. The successful advance of rapid sequencing methods and fast and precise metagenome analysis will increase the diagnostic value of healthy or pathological human metagenomes. The article presents the theoretical foundations of the algorithm for calculating the number of different genomes in the medium under study. The approach is based on analysis of the compositional spectra of subsequently sequenced samples of the medium. Its essential feature is using random fluctuations in the bacteria number in different samples of the same metagenome. The possibility of effective implementation of the algorithm in the presence of data errors is also discussed. In the work, the algorithm of a metagenome evaluation is described, including the estimation of the genome number and the identification of the genomes with known compositional spectra. It should be emphasized that evaluating the genome number in a metagenome can be always helpful, regardless of the metagenome separation techniques, such as clustering the sequencing results or marker analysis.
Affiliation(s)
- Valery Kirzhner
  - Institute of Evolution, University of Haifa, Haifa, Israel
- Dvora Toledano-Kitai
  - Software Engineering Department, ORT Braude College of Engineering, Karmiel, Israel
- Zeev Volkovich
  - Software Engineering Department, ORT Braude College of Engineering, Karmiel, Israel
|
6
|
Bartoszewicz JM, Seidel A, Rentzsch R, Renard BY. DeePaC: predicting pathogenic potential of novel DNA with reverse-complement neural networks. Bioinformatics 2020; 36:81-89. [PMID: 31298694 DOI: 10.1093/bioinformatics/btz541] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 04/09/2019] [Revised: 06/22/2019] [Accepted: 07/10/2019] [Indexed: 12/31/2022]
Abstract
MOTIVATION We expect novel pathogens to arise due to their fast-paced evolution, and new species to be discovered thanks to advances in DNA sequencing and metagenomics. Moreover, recent developments in synthetic biology raise concerns that some strains of bacteria could be modified for malicious purposes. Traditional approaches to open-view pathogen detection depend on databases of known organisms, which limits their performance on unknown, unrecognized and unmapped sequences. In contrast, machine learning methods can infer pathogenic phenotypes from single NGS reads, even though the biological context is unavailable. RESULTS We present DeePaC, a Deep Learning Approach to Pathogenicity Classification. It includes a flexible framework allowing easy evaluation of neural architectures with reverse-complement parameter sharing. We show that convolutional neural networks and LSTMs outperform the state-of-the-art based on both sequence homology and machine learning. Combining a deep learning approach with integrating the predictions for both mates in a read pair results in cutting the error rate almost in half in comparison to the previous state-of-the-art. AVAILABILITY AND IMPLEMENTATION The code and the models are available at: https://gitlab.com/rki_bioinformatics/DeePaC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
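To illustrate why reverse-complement handling matters for single-read predictions, here is a toy sketch. It does not reproduce DeePaC's reverse-complement parameter sharing inside a network; it swaps in a simpler technique, averaging a scorer's output over a read and its reverse complement, which yields the same strand-independent behaviour. The 3-mer weights and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


# Hypothetical 3-mer weights standing in for a trained model.
TOY_WEIGHTS = {"AAG": 1.0, "CGA": -0.5}


def toy_score(seq: str) -> float:
    """Strand-sensitive toy scorer: sum of weights of the 3-mers in the read."""
    return sum(TOY_WEIGHTS.get(seq[i:i + 3], 0.0) for i in range(len(seq) - 2))


def strand_invariant(score_fn):
    """Wrap a scorer so a read and its reverse complement get the same score."""
    def wrapped(seq: str) -> float:
        return 0.5 * (score_fn(seq) + score_fn(reverse_complement(seq)))
    return wrapped


invariant_score = strand_invariant(toy_score)
```

With this wrapper, `invariant_score(read)` equals `invariant_score(reverse_complement(read))` for any read, whereas `toy_score` alone generally differs between the two strands.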
Affiliation(s)
- Jakub M Bartoszewicz
  - Bioinformatics Unit (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
  - Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
- Anja Seidel
  - Bioinformatics Unit (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
  - Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
- Robert Rentzsch
  - Bioinformatics Unit (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
- Bernhard Y Renard
  - Bioinformatics Unit (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
|
7
|
Liu Y, Bible PW, Zou B, Liang Q, Dong C, Wen X, Li Y, Ge X, Li X, Deng X, Ma R, Guo S, Liang J, Chen T, Pan W, Liu L, Chen W, Wang X, Wei L. CSMD: a computational subtraction-based microbiome discovery pipeline for species-level characterization of clinical metagenomic samples. Bioinformatics 2020; 36:1577-1583. [PMID: 31626280 DOI: 10.1093/bioinformatics/btz790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2019] [Revised: 09/22/2019] [Accepted: 10/16/2019] [Indexed: 11/12/2022]
Abstract
MOTIVATION Microbiome analyses of clinical samples with low microbial biomass are challenging because of the very small quantities of microbial DNA relative to the human host, ubiquitous contaminating DNA in sequencing experiments and the large and rapidly growing microbial reference databases. RESULTS We present computational subtraction-based microbiome discovery (CSMD), a bioinformatics pipeline specifically developed to generate accurate species-level microbiome profiles for clinical samples with low microbial loads. CSMD applies strategies for the maximal elimination of host sequences with minimal loss of microbial signal and effectively detects microorganisms present in the sample with minimal false positives using a stepwise convergent solution. CSMD was benchmarked in a comparative evaluation with other classic tools on previously published well-characterized datasets. It showed higher sensitivity and specificity in host sequence removal and higher specificity in microbial identification, which led to more accurate abundance estimation. All these features are integrated into a free and easy-to-use tool. Additionally, CSMD applied to cell-free plasma DNA showed that microbial diversity within these samples is substantially broader than previously believed. AVAILABILITY AND IMPLEMENTATION CSMD is freely available at https://github.com/liuyu8721/csmd. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yu Liu
  - Department of Statistical Science, School of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China; State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Paul W Bible
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China; College of Arts and Sciences, Marian University, Indianapolis, IN 46222, USA
- Bin Zou
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Qiaoxing Liang
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Cong Dong
  - College of Chemistry, Sun Yat-Sen University, Guangzhou 510275, China
- Xiaofeng Wen
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Yan Li
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Xiaofei Ge
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Xifang Li
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Xiuli Deng
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Rong Ma
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Shixin Guo
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Juanran Liang
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Tingting Chen
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Wenliang Pan
  - Department of Statistical Science, School of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China
- Lixin Liu
  - College of Chemistry, Sun Yat-Sen University, Guangzhou 510275, China
- Wei Chen
  - Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA; Division of Pulmonary Medicine, Department of Pediatrics, Children's Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, PA 15224, USA
- Xueqin Wang
  - Department of Statistical Science, School of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China; Southern China Research Center of Statistical Science, Sun Yat-Sen University, Guangzhou 510275, China
- Lai Wei
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
|
8
|
Seiler E, Trappe K, Renard BY. Where did you come from, where did you go: Refining metagenomic analysis tools for horizontal gene transfer characterisation. PLoS Comput Biol 2019; 15:e1007208. [PMID: 31335917 PMCID: PMC6677323 DOI: 10.1371/journal.pcbi.1007208] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 01/11/2019] [Revised: 08/02/2019] [Accepted: 06/24/2019] [Indexed: 12/22/2022]
Abstract
Horizontal gene transfer (HGT) has changed the way we regard evolution. Instead of waiting for the next generation to establish new traits, bacteria in particular are able to take a shortcut via HGT that enables them to pass on genes from one individual to another, even across species boundaries. The tool Daisy offers the first HGT detection approach based on read mapping that provides complementary evidence compared to existing methods. However, Daisy relies on the acceptor and donor organisms involved in the HGT being known. We introduce DaisyGPS, a mapping-based pipeline that is able to identify acceptor and donor reference candidates of an HGT event based on sequencing reads. Acceptor and donor identification is akin to species identification in metagenomic samples based on sequencing reads, a problem addressed by metagenomic profiling tools. However, acceptor and donor references have certain properties such that these methods cannot be directly applied. DaisyGPS uses MicrobeGPS, a metagenomic profiling tool tailored towards estimating the genomic distance between organisms in the sample and the reference database. We enhance the underlying scoring system of MicrobeGPS to account for the sequence patterns in terms of mapping coverage of an acceptor and donor involved in an HGT event, and report a ranked list of reference candidates. These candidates can then be further evaluated by tools like Daisy to establish HGT regions. We successfully validated our approach on both simulated and real data, and show its benefits in an investigation of an outbreak involving Methicillin-resistant Staphylococcus aureus data.
Affiliation(s)
- Enrico Seiler
  - Bioinformatics Unit (MF1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin, Germany
  - Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, and Algorithmic Bioinformatics, Institute for Bioinformatics, Freie Universität Berlin, Berlin, Germany
- Kathrin Trappe
  - Bioinformatics Unit (MF1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin, Germany
- Bernhard Y. Renard
  - Bioinformatics Unit (MF1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin, Germany
|
9
|
Phylogenetic tree and Submission of Staphylococcus aureus Isolate from Skin Infection. Journal of Pure and Applied Microbiology 2018. [DOI: 10.22207/jpam.12.4.59] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/22/2023]
|
10
|
Tran Q, Pham DT, Phan V. Using 16S rRNA gene as marker to detect unknown bacteria in microbial communities. BMC Bioinformatics 2017; 18:499. [PMID: 29297282 PMCID: PMC5751639 DOI: 10.1186/s12859-017-1901-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/10/2022]
Abstract
BACKGROUND Quantification and identification of microbial genomes based on next-generation sequencing data is a challenging problem in metagenomics. Although current methods have mostly focused on analyzing bacteria whose genomes have been sequenced, such analyses are complicated by the presence of unknown bacteria or bacteria whose genomes have not been sequenced. RESULTS We propose a method for detecting unknown bacteria in environmental samples. Our approach is unique in its utilization of short reads only from 16S rRNA genes, not from entire genomes. We show that short reads from 16S rRNA genes retain sufficient information for detecting unknown bacteria in oral microbial communities. CONCLUSION In our experimentation with bacterial genomes from the Human Oral Microbiome Database, we found that this method made accurate and robust predictions at different read coverages and percentages of unknown bacteria. Advantages of this approach include not only a reduction in experimental and computational costs but also a potentially high accuracy across environmental samples due to the strong conservation of the 16S rRNA gene.
Affiliation(s)
- Quang Tran
  - Department of Computer Science, University of Memphis, Memphis, 38152, TN, USA
- Diem-Trang Pham
  - Department of Computer Science, University of Memphis, Memphis, 38152, TN, USA
- Vinhthuy Phan
  - Department of Computer Science, University of Memphis, Memphis, 38152, TN, USA
|
11
|
Olson ND, Zook JM, Morrow JB, Lin NJ. Challenging a bioinformatic tool's ability to detect microbial contaminants using in silico whole genome sequencing data. PeerJ 2017; 5:e3729. [PMID: 28924496 PMCID: PMC5600177 DOI: 10.7717/peerj.3729] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 04/03/2017] [Accepted: 08/02/2017] [Indexed: 12/20/2022]
Abstract
High sensitivity methods such as next generation sequencing and polymerase chain reaction (PCR) are adversely impacted by organismal and DNA contaminants. Current methods for detecting contaminants in microbial materials (genomic DNA and cultures) are not sensitive enough and require either a known or culturable contaminant. Whole genome sequencing (WGS) is a promising approach for detecting contaminants due to its sensitivity and lack of need for a priori assumptions about the contaminant. Prior to applying WGS, we must first understand its limitations for detecting contaminants and potential for false positives. Herein we demonstrate and characterize a WGS-based approach to detect organismal contaminants using an existing metagenomic taxonomic classification algorithm. Simulated WGS datasets from ten genera as individuals and binary mixtures of eight organisms at varying ratios were analyzed to evaluate the role of contaminant concentration and taxonomy on detection. For the individual genomes the false positive contaminants reported depended on the genus, with Staphylococcus, Escherichia, and Shigella having the highest proportion of false positives. For nearly all binary mixtures the contaminant was detected in the in-silico datasets at the equivalent of 1 in 1,000 cells, though F. tularensis was not detected in any of the simulated contaminant mixtures and Y. pestis was only detected at the equivalent of one in 10 cells. Once a WGS method for detecting contaminants is characterized, it can be applied to evaluate microbial material purity, in efforts to ensure that contaminants are characterized in microbial materials used to validate pathogen detection assays, generate genome assemblies for database submission, and benchmark sequencing methods.
Affiliation(s)
- Nathan D Olson
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, United States of America
- Justin M Zook
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, United States of America
- Jayne B Morrow
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, United States of America
- Nancy J Lin
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, United States of America
|
12
|
Dadi TH, Renard BY, Wieler LH, Semmler T, Reinert K. SLIMM: species level identification of microorganisms from metagenomes. PeerJ 2017; 5:e3138. [PMID: 28367376 PMCID: PMC5372838 DOI: 10.7717/peerj.3138] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 10/07/2016] [Accepted: 03/02/2017] [Indexed: 12/21/2022]
Abstract
Identification and quantification of microorganisms is a significant step in studying the alpha and beta diversities within and between microbial communities, respectively. Both identification and quantification of a given microbial community can be carried out using whole genome shotgun sequences with less bias than when using 16S-rDNA sequences. However, shared regions of DNA among reference genomes and taxonomic units pose a significant challenge in assigning reads correctly to their true origins. The existing microbial community profiling tools commonly deal with this problem by either preparing signature-based unique references or assigning an ambiguous read to its least common ancestor in a taxonomic tree. The former method is limited to making use of the reads which can be mapped to the curated regions, while the latter suffers from the lack of uniquely mapped reads at lower (more specific) taxonomic ranks. Moreover, even if the tools exhibit good performance in calling the organisms present in a sample, there is still room for improvement in determining the correct relative abundance of the organisms. We present a new method, Species Level Identification of Microorganisms from Metagenomes (SLIMM), which addresses the above issues by using coverage information of reference genomes to remove unlikely genomes from the analysis and subsequently gain more uniquely mapped reads to assign at lower ranks of a taxonomic tree. SLIMM is based on a few seemingly easy steps which, when combined, create a tool that outperforms state-of-the-art tools in run-time and memory usage while being on par or better in computing quantitative and qualitative information at species-level.
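The coverage-based filtering idea in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not SLIMM's actual procedure: the abstract does not give its coverage criterion, so the breadth-of-coverage measure, the `min_fraction` threshold, and both function names are hypothetical.

```python
def covered_fraction(read_starts, read_len, genome_len):
    """Fraction of genome positions covered by at least one mapped read."""
    covered = bytearray(genome_len)  # one flag per genome position
    for start in read_starts:
        begin = max(start, 0)
        end = min(start + read_len, genome_len)
        for pos in range(begin, end):
            covered[pos] = 1
    return sum(covered) / genome_len


def keep_well_covered(mappings, genome_lengths, read_len, min_fraction):
    """Keep only genomes whose breadth of coverage reaches min_fraction.

    mappings: dict mapping genome name -> list of read start positions.
    genome_lengths: dict mapping genome name -> genome length.
    Returns a dict mapping each kept genome to its covered fraction.
    """
    kept = {}
    for genome, starts in mappings.items():
        fraction = covered_fraction(starts, read_len, genome_lengths[genome])
        if fraction >= min_fraction:
            kept[genome] = fraction
    return kept
```

A genome hit by a few reads piled into one shared region gets a low covered fraction and is dropped, while a genome with reads spread along its length is kept.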
Affiliation(s)
- Temesgen Hailemariam Dadi
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany; International Max Planck Research School for Computational Biology and Scientific Computing (IMPRS-CBSC), Berlin, Germany; Department of Veterinary Medicine, Freie Universität Berlin, Berlin, Germany
- Torsten Semmler
  - Department of Veterinary Medicine, Freie Universität Berlin, Berlin, Germany; Robert Koch Institute, Berlin, Germany
- Knut Reinert
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany; Max Planck Institute for Molecular Genetics, Berlin, Germany
|
13
|
PaPrBaG: A machine learning approach for the detection of novel pathogens from NGS data. Sci Rep 2017; 7:39194. [PMID: 28051068 PMCID: PMC5209729 DOI: 10.1038/srep39194] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Received: 07/08/2016] [Accepted: 11/18/2016] [Indexed: 12/20/2022]
Abstract
The reliable detection of novel bacterial pathogens from next-generation sequencing data is a key challenge for microbial diagnostics. Current computational tools usually rely on sequence similarity and often fail to detect novel species when closely related genomes are unavailable or missing from the reference database. Here we present the machine learning based approach PaPrBaG (Pathogenicity Prediction for Bacterial Genomes). PaPrBaG overcomes genetic divergence by training on a wide range of species with known pathogenicity phenotype. To that end we compiled a comprehensive list of pathogenic and non-pathogenic bacteria with human host, using various genome metadata in conjunction with a rule-based protocol. A detailed comparative study reveals that PaPrBaG has several advantages over sequence similarity approaches. Most importantly, it always provides a prediction whereas other approaches discard a large number of sequencing reads with low similarity to currently known reference genomes. Furthermore, PaPrBaG remains reliable even at very low genomic coverages. Combining PaPrBaG with existing approaches further improves prediction results.
|
14
|
Trappe K, Marschall T, Renard BY. Detecting horizontal gene transfer by mapping sequencing reads across species boundaries. Bioinformatics 2016; 32:i595-i604. [DOI: 10.1093/bioinformatics/btw423] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Indexed: 02/06/2023]
|
15
|
Hilton SK, Castro-Nallar E, Pérez-Losada M, Toma I, McCaffrey TA, Hoffman EP, Siegel MO, Simon GL, Johnson WE, Crandall KA. Metataxonomic and Metagenomic Approaches vs. Culture-Based Techniques for Clinical Pathology. Front Microbiol 2016; 7:484. [PMID: 27092134 PMCID: PMC4823605 DOI: 10.3389/fmicb.2016.00484] [Citation(s) in RCA: 76] [Impact Index Per Article: 8.4] [Received: 01/04/2016] [Accepted: 03/22/2016] [Indexed: 12/12/2022]
Abstract
Diagnoses that are both timely and accurate are critically important for patients with life-threatening or drug resistant infections. Technological improvements in High-Throughput Sequencing (HTS) have led to its use in pathogen detection and its application in clinical diagnoses of infectious diseases. The present study compares two HTS methods, 16S rRNA marker gene sequencing (metataxonomics) and whole metagenomic shotgun sequencing (metagenomics), in their respective abilities to match the same diagnosis as traditional culture methods (culture inference) for patients with ventilator associated pneumonia (VAP). The metagenomic analysis was able to produce the same diagnosis as culture methods at the species-level for five of the six samples, while the metataxonomic analysis was only able to produce results with the same species-level identification as culture for two of the six samples. These results indicate that metagenomic analyses have the accuracy needed for a clinical diagnostic tool, but full integration in diagnostic protocols is contingent on technological improvements to decrease turnaround time and lower costs.
Affiliation(s)
- Sarah K Hilton
  - Computational Biology Institute, The George Washington University, Ashburn, VA, USA
- Eduardo Castro-Nallar
  - Computational Biology Institute, The George Washington University, Ashburn, VA, USA; Facultad de Ciencias Biológicas, Center for Bioinformatics and Integrative Biology, Universidad Andres Bello, Santiago, Chile
- Marcos Pérez-Losada
  - Computational Biology Institute, The George Washington University, Ashburn, VA, USA; Centro de Investigação em Biodiversidade e Recursos Genéticos (CIBIO-InBIO), Vairão, Portugal; Children's National Medical Research Center, Washington DC, USA
- Ian Toma
  - Division of Genomic Medicine, Department of Medicine, The George Washington University School of Medicine and Health Sciences, Washington DC, USA
- Timothy A McCaffrey
  - Division of Genomic Medicine, Department of Medicine, Department of Microbiology, Immunology, and Tropical Medicine, The George Washington University School of Medicine and Health Sciences, Washington DC, USA
- Eric P Hoffman
  - Children's National Medical Research Center, Washington DC, USA
- Marc O Siegel
  - Division of Infectious Diseases, Department of Medicine, School of Medicine and Health Sciences, The George Washington University, Washington DC, USA
- Gary L Simon
  - Division of Infectious Diseases, Department of Medicine, School of Medicine and Health Sciences, The George Washington University, Washington DC, USA
- W Evan Johnson
  - Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA
- Keith A Crandall
  - Computational Biology Institute, The George Washington University, Ashburn, VA, USA
|
16
|
Piro VC, Lindner MS, Renard BY. DUDes: a top-down taxonomic profiler for metagenomics. Bioinformatics 2016; 32:2272-80. [PMID: 27153591 DOI: 10.1093/bioinformatics/btw150] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 11/24/2015] [Accepted: 03/11/2016] [Indexed: 12/19/2022]
Abstract
Motivation: Species identification and quantification are common tasks in metagenomics and pathogen detection studies. The most recent techniques are built on mapping the sequenced reads against a reference database (e.g., whole genomes, marker genes, proteins) followed by application-dependent analysis steps. Although these methods have proven useful in many scenarios, there is still room for improvement in species- and strain-level detection, mainly for low-abundance organisms. Results: We propose a new method: DUDes, a reference-based taxonomic profiler that introduces a novel top-down approach to analyze metagenomic next-generation sequencing (NGS) samples. Rather than predicting an organism's presence in the sample based only on relative abundances, DUDes first identifies possible candidates by comparing the strength of the read mapping in each node of the taxonomic tree in an iterative manner. Instead of using the lowest common ancestor, we propose a new approach: the deepest uncommon descendent. We showed in experiments that DUDes works for single and multiple organisms and can identify low-abundance taxonomic groups with high precision. Availability and implementation: DUDes is open source and is available at http://sf.net/p/dudes. Supplementary information: Supplementary data are available at Bioinformatics online. Contact: renardB@rki.de.
Affiliation(s)
- Vitor C Piro
  - Research Group Bioinformatics (NG4), Robert Koch Institute, Nordufer 20, Berlin 13353, Germany; CAPES Foundation, Ministry of Education of Brazil, Brasília - DF, 70040-020, Brazil
- Martin S Lindner
  - Research Group Bioinformatics (NG4), Robert Koch Institute, Nordufer 20, Berlin 13353, Germany; 4-Antibody AG, Hochberger Strasse 60C, Basel 4057, Switzerland
- Bernhard Y Renard
  - Research Group Bioinformatics (NG4), Robert Koch Institute, Nordufer 20, Berlin 13353, Germany
|