1
Fautt C, Couradeau E, Hockett KL. Naïve Bayes Classifiers and accompanying dataset for Pseudomonas syringae isolate characterization. Sci Data 2024; 11:178. [PMID: 38326362 PMCID: PMC10850129 DOI: 10.1038/s41597-024-03003-x] [Received: 03/31/2023] [Accepted: 01/26/2024] [Indexed: 02/09/2024] Open access
Abstract
The Pseudomonas syringae species complex (PSSC) is a diverse group of plant pathogens with a collective host range encompassing almost every food crop grown today. As a threat to global food security, rapid detection and characterization of epidemic and emerging pathogenic lineages is essential. However, phylogenetic identification is often complicated by an unclarified and ever-changing taxonomy, making practical use of available databases and the proper training of classifiers difficult. As such, while amplicon sequencing is a common method for routine identification of PSSC isolates, there is no efficient method for accurate classification based on these data. Here we present a suite of five Naïve Bayes classifiers for PCR primer sets widely used for PSSC identification, trained on in-silico amplicon data from 2,161 published PSSC genomes using the life identification number (LIN) hierarchical clustering algorithm in place of traditional Linnaean taxonomy. Additionally, we include a dataset for translating classification results back into traditional taxonomic nomenclature (i.e. species, phylogroup, pathovar), and for predicting virulence factor repertoires.
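To make the idea of a Naïve Bayes classifier for amplicon sequences concrete, the following is a minimal sketch of a multinomial Naïve Bayes model over k-mer counts. It is not the authors' classifiers: the class name `KmerNaiveBayes`, the k-mer length, the smoothing value, and the toy sequences and group labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (k is an illustrative choice)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerNaiveBayes:
    """Hypothetical multinomial Naive Bayes over k-mer counts with add-alpha smoothing."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k
        self.alpha = alpha
        self.kmer_counts = defaultdict(Counter)  # label -> k-mer counts
        self.label_counts = Counter()            # label -> number of training sequences
        self.vocabulary = set()                  # all k-mers seen in training

    def fit(self, sequences, labels):
        for seq, label in zip(sequences, labels):
            words = kmers(seq, self.k)
            self.kmer_counts[label].update(words)
            self.label_counts[label] += 1
            self.vocabulary.update(words)
        return self

    def log_scores(self, seq):
        """Return log prior + summed log likelihoods for every label."""
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        # k-mers never seen in training carry no information and are skipped.
        words = [w for w in kmers(seq, self.k) if w in self.vocabulary]
        scores = {}
        for label, n in self.label_counts.items():
            counts = self.kmer_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n / total)
            for word in words:
                score += math.log((counts[word] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, seq):
        scores = self.log_scores(seq)
        return max(scores, key=scores.get)


# Toy training data; the sequences and the labels "groupA"/"groupB" are made up.
model = KmerNaiveBayes(k=3).fit(
    ["ACGTACGTACGT", "ACGTACGAACGT", "TTGGCCTTGGCC", "TTGGCCTTGACC"],
    ["groupA", "groupA", "groupB", "groupB"],
)
print(model.predict("ACGTACGT"))
```

In practice the labels would be whatever grouping the classifier is trained against (LIN-derived groups in the abstract above), and the training sequences would be the in-silico amplicons for a given primer set.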
Affiliation(s)
- Chad Fautt
  - Department of Plant Pathology and Environmental Microbiology, Pennsylvania State University, University Park, Pennsylvania, USA.
  - Department of Ecosystem Science and Management, Pennsylvania State University, University Park, Pennsylvania, USA.
  - Intercollege Graduate Degree Program in Ecology, Pennsylvania State University, University Park, Pennsylvania, USA.
- Estelle Couradeau
  - Department of Ecosystem Science and Management, Pennsylvania State University, University Park, Pennsylvania, USA.
  - Intercollege Graduate Degree Program in Ecology, Pennsylvania State University, University Park, Pennsylvania, USA.
- Kevin L Hockett
  - Department of Plant Pathology and Environmental Microbiology, Pennsylvania State University, University Park, Pennsylvania, USA.
  - Intercollege Graduate Degree Program in Ecology, Pennsylvania State University, University Park, Pennsylvania, USA.
2
Xu CCY, Lemoine J, Albert A, Whirter ÉM, Barrett RDH. Community assembly of the human piercing microbiome. Proc Biol Sci 2023; 290:20231174. [PMID: 38018103 PMCID: PMC10685111 DOI: 10.1098/rspb.2023.1174] [Received: 05/30/2023] [Accepted: 11/03/2023] [Indexed: 11/30/2023] Open access
Abstract
Predicting how biological communities respond to disturbance requires understanding the forces that govern their assembly. We propose using human skin piercings as a model system for studying community assembly after rapid environmental change. Local skin sterilization provides a 'clean slate' within the novel ecological niche created by the piercing. Stochastic assembly processes can dominate skin microbiomes due to the influence of environmental exposure on local dispersal, but deterministic processes might play a greater role within occluded skin piercings if piercing habitats impose strong selection pressures on colonizing species. Here we explore the human ear-piercing microbiome and demonstrate that community assembly is predominantly stochastic but becomes significantly more deterministic with time, producing increasingly diverse and ecologically complex communities. We also observed changes in two dominant and medically relevant antagonists (Cutibacterium acnes and Staphylococcus epidermidis), consistent with competitive exclusion induced by a transition from sebaceous to moist environments. By exploiting this common yet uniquely human practice, we show that skin piercings are not just culturally significant but also represent ecosystem engineering on the human body. The novel habitats and communities that skin piercings produce may provide general insights into biological responses to environmental disturbances with implications for both ecosystem and human health.
Affiliation(s)
- Charles C. Y. Xu
  - Redpath Museum, McGill University, 859 Sherbrooke Street West, Montreal, Quebec, Canada H3A 0C4
  - Department of Biology, McGill University, Montreal, Quebec, Canada H3A 1B1
- Juliette Lemoine
  - Redpath Museum, McGill University, 859 Sherbrooke Street West, Montreal, Quebec, Canada H3A 0C4
  - Department of Biology, McGill University, Montreal, Quebec, Canada H3A 1B1
  - Department of Ecology and Evolution, University of Lausanne, Lausanne 1015, Switzerland
- Avery Albert
  - Redpath Museum, McGill University, 859 Sherbrooke Street West, Montreal, Quebec, Canada H3A 0C4
  - Department of Natural Resource Sciences, McGill University, Sainte-Anne-de-Bellevue, Quebec, Canada H9X 3V9
  - Trottier Space Institute, McGill University, Montreal, Quebec, Canada H3A 2A7
- Rowan D. H. Barrett
  - Redpath Museum, McGill University, 859 Sherbrooke Street West, Montreal, Quebec, Canada H3A 0C4
  - Department of Biology, McGill University, Montreal, Quebec, Canada H3A 1B1
3
Liu G, Li T, Zhu X, Zhang X, Wang J. An independent evaluation in a CRC patient cohort of microbiome 16S rRNA sequence analysis methods: OTU clustering, DADA2, and Deblur. Front Microbiol 2023; 14:1178744. [PMID: 37560524 PMCID: PMC10408458 DOI: 10.3389/fmicb.2023.1178744] [Received: 03/03/2023] [Accepted: 06/14/2023] [Indexed: 08/11/2023] Open access
Abstract
16S rRNA is the universal gene of microbes, and it is often used as a target gene to obtain profiles of microbial communities via next-generation sequencing (NGS) technology. Traditionally, sequences are clustered into operational taxonomic units (OTUs) at a 97% threshold based on the taxonomic standard using 16S rRNA, and methods for the reduction of sequencing errors are bypassed, which may lead to false classification units. Several denoising algorithms have been published to solve this problem, such as DADA2 and Deblur, which can correct sequencing errors at single-nucleotide resolution by generating amplicon sequence variants (ASVs). As high-resolution ASVs are becoming more popular than OTUs and only one analysis method is usually selected in a particular study, there is a need for a thorough comparison of OTU clustering and denoising pipelines. In this study, three of the most widely used 16S rRNA methods (two denoising algorithms, DADA2 and Deblur, along with de novo OTU clustering) were thoroughly compared using 16S rRNA amplification sequencing data generated from 358 clinical stool samples from the Colorectal Cancer (CRC) Screening Cohort. Our findings indicated that all approaches led to similar taxonomic profiles (with P > 0.05 in PERMANOVA and P < 0.001 in the Mantel test), although the number of ASVs/OTUs and the alpha-diversity indices varied considerably. Despite considerable differences in disease-related markers identified, disease-related analysis showed that all methods could result in similar conclusions. Fusobacterium, Streptococcus, Peptostreptococcus, Parvimonas, Gemella, and Haemophilus were identified by all three methods as enriched in the CRC group, while Roseburia, Faecalibacterium, Butyricicoccus, and Blautia were identified by all three methods as enriched in the healthy group.
In addition, disease-diagnostic models generated using machine learning algorithms based on the data from these different methods all achieved good diagnostic efficiency (AUC: 0.87-0.89), with the model based on DADA2 producing the highest AUC (0.8944 and 0.8907 in the training set and test set, respectively). However, there was no significant difference in performance between the models (P > 0.05). In conclusion, this study demonstrates that DADA2, Deblur, and de novo OTU clustering display similar power levels in taxa assignment and can produce similar conclusions in the case of the CRC cohort.
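The 97% OTU threshold mentioned in this abstract can be illustrated with a small greedy clustering sketch. This is not the pipeline evaluated in the study: it assumes pre-aligned sequences of equal length and uses the fraction of matching positions as a stand-in for alignment-based identity, and the function names `identity` and `greedy_otu_clusters` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions; assumes pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_otu_clusters(sequences, threshold=0.97):
    """Assign each sequence to the first centroid it matches at or above the threshold.

    A sequence that matches no existing centroid becomes the centroid of a new cluster.
    """
    centroids = []
    clusters = []
    for seq in sequences:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                clusters[i].append(seq)
                break
        else:
            centroids.append(seq)
            clusters.append([seq])
    return clusters


# Toy 100-nt reads: one near-identical variant (99% identity) and one distant read.
base = "ACGT" * 25
variant = "T" + base[1:]
distant = "TTTT" * 25
clusters = greedy_otu_clusters([base, variant, distant])
print(len(clusters))
```

Denoising methods such as DADA2 and Deblur would instead keep `base` and `variant` apart if both are judged to be real sequences, which is the single-nucleotide resolution the abstract refers to.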
Affiliation(s)
- Guang Liu
  - School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
  - Guangdong Hongyuan Pukong Medical Technology Co., Ltd., Guangzhou, China
- Tong Li
  - School of Bioscience and Bioengineering, South China University of Technology, Guangzhou, China
- Xiaoyan Zhu
  - School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- Xuanping Zhang
  - School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- Jiayin Wang
  - School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
4
Parente E, Zotta T, Giavalisco M, Ricciardi A. Metataxonomic insights in the distribution of Lactobacillaceae in foods and food environments. Int J Food Microbiol 2023; 391-393:110124. [PMID: 36841075 DOI: 10.1016/j.ijfoodmicro.2023.110124] [Received: 09/18/2022] [Revised: 01/09/2023] [Accepted: 02/05/2023] [Indexed: 02/23/2023]
Abstract
Members of the family Lactobacillaceae, which now includes species formerly belonging to the genera Lactobacillus and Pediococcus, but also Leuconostocaceae, are of foremost importance in food fermentations and spoilage, but also as components of animal and human microbiota and as potentially pathogenic microorganisms. Knowledge of the ecological distribution of a given species and genus is important, among other things, for the inclusion in lists of microorganisms with a Qualified Presumption of Safety or with beneficial use. The objective of this work is to use the data in the FoodMicrobionet database to obtain quantitative insights (in terms of both abundance and prevalence) on the distribution of these bacteria in foods and food environments. We first explored the reliability of taxonomic assignments using the SILVA v138.1 reference database with full-length and partial sequences of the 16S rRNA gene for type strain sequences. Full-length 16S rRNA gene sequences allow a reasonably good classification at the genus and species level in phylogenetic trees but shorter sequences (V1-V3, V3-V4, V4) perform much worse, with type strains of many species sharing identical V4 and V3-V4 sequences. Taxonomic assignment at the genus level of 16S rRNA gene sequences with the SILVA v138.1 reference database can be done for almost all genera of the family Lactobacillaceae with a high degree of confidence for full-length sequences, and with a satisfactory level of accuracy for the V1-V3 regions. Results for the V3-V4 and V4 regions are still acceptable but significantly worse. Taxonomic assignment at the species level for sequences of the V1-V3, V3-V4, and V4 regions of the 16S rRNA gene of members of the family Lactobacillaceae is hardly possible and, even for full-length sequences, only 49.9% of the type strain sequences can be unambiguously assigned to species.
We then used the FoodMicrobionet database to evaluate the prevalence and abundance of Lactobacillaceae in food samples and in food related environments. Generalist and specialist genera were clearly evident. The ecological distribution of several genera was confirmed and insights on the distribution and potential origin of rare genera (Dellaglioa, Holzapfelia, Schleiferilactobacillus) were obtained. We also found that combining Amplicon Sequence Variants from different studies is indeed possible, but provides little additional information, even when strict criteria are used for the filtering of sequences.
5
Abstract
BACKGROUND: An appropriate sample size is essential for obtaining a precise and reliable outcome of a study. In machine learning (ML), studies with inadequate samples suffer from overfitting of data and have a lower probability of producing true effects, while the increment in sample size increases the accuracy of prediction but may not cause a significant change after a certain sample size. Existing statistical approaches using standardized mean difference, effect size, and statistical power for determining sample size are potentially biased due to miscalculations or lack of experimental details. This study aims to design criteria for evaluating sample size in ML studies. We examined the average and grand effect sizes and the performance of five ML methods using simulated datasets and three real datasets to derive the criteria for sample size. We systematically increased the sample size, starting from 16, by random sampling and examined the impact of sample size on classifiers' performance and both effect sizes. Tenfold cross-validation was used to quantify the accuracy. RESULTS: The results demonstrate that the effect sizes and the classification accuracies increase while the variances in effect sizes shrink with the increment of samples when the datasets have a good discriminative power between two classes. By contrast, indeterminate datasets had poor effect sizes and classification accuracies, which did not improve with increasing sample size in either simulated or real datasets. A good dataset exhibited a significant difference in average and grand effect sizes. We derived two criteria based on the above findings to assess a decided sample size by combining the effect size and the ML accuracy. The sample size is considered suitable when it has appropriate effect sizes (≥ 0.5) and ML accuracy (≥ 80%).
Beyond an appropriate sample size, adding samples brings little benefit because it does not significantly change the effect size or accuracy, so stopping at that size gives a good cost-benefit ratio. CONCLUSION: We believe that these practical criteria can be used as a reference for both the authors and editors to evaluate whether the selected sample size is adequate for a study.
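The two thresholds stated in this abstract (effect size ≥ 0.5 and ML accuracy ≥ 80%) can be combined into a simple check. The sketch below is an illustration only: it uses Cohen's d with a pooled standard deviation as a stand-in for the "average" and "grand" effect sizes, whose exact definitions are not given here, and it takes the cross-validated accuracy as an input rather than computing it. The function names and toy values are assumptions.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference between two samples, using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return abs(mean(a) - mean(b)) / math.sqrt(pooled_var)


def sample_size_is_suitable(effect_size, accuracy, min_effect=0.5, min_accuracy=0.80):
    """Apply both thresholds from the abstract; both must hold."""
    return effect_size >= min_effect and accuracy >= min_accuracy


# Toy feature values for two classes (made-up numbers).
class_a = [1.0, 2.0, 3.0, 4.0, 5.0]
class_b = [3.0, 4.0, 5.0, 6.0, 7.0]
d = cohens_d(class_a, class_b)

# 0.85 is a placeholder for an accuracy obtained from tenfold cross-validation.
print(round(d, 3), sample_size_is_suitable(d, accuracy=0.85))
```

Repeating the check while growing the sample, as the study does from 16 samples upward, would show where both quantities stop changing appreciably.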
Affiliation(s)
- Daniyal Rajput
  - Institute of Cognitive Neuroscience, National Central University, Zhongda Rd, No. 300, Zhongli District, Taoyuan City, 320317, Taiwan, ROC
  - Taiwan International Graduate Program in Interdisciplinary Neuroscience, National Central University and Academia Sinica, Taipei, Taiwan, ROC
- Wei-Jen Wang
  - Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan, ROC
- Chun-Chuan Chen
  - Institute of Cognitive Neuroscience, National Central University, Zhongda Rd, No. 300, Zhongli District, Taoyuan City, 320317, Taiwan, ROC
  - Department of Biomedical Sciences and Engineering, National Central University, Taoyuan, Taiwan, ROC
6
Ultsch A, Lötsch J. Robust Classification Using Posterior Probability Threshold Computation Followed by Voronoi Cell Based Class Assignment Circumventing Pitfalls of Bayesian Analysis of Biomedical Data. Int J Mol Sci 2022; 23:ijms232214081. [PMID: 36430580 PMCID: PMC9693220 DOI: 10.3390/ijms232214081] [Received: 10/24/2022] [Revised: 11/09/2022] [Accepted: 11/11/2022] [Indexed: 11/17/2022] Open access
Abstract
Bayesian inference is ubiquitous in science and widely used in biomedical research such as cell sorting or “omics” approaches, as well as in machine learning (ML), artificial neural networks, and “big data” applications. However, the calculation is not robust in regions of low evidence. In cases where one group has a lower mean but a higher variance than another group, new cases with larger values are implausibly assigned to the group with typically smaller values. An approach for a robust extension of Bayesian inference is proposed that proceeds in two main steps starting from the Bayesian posterior probabilities. First, cases with low evidence are labeled as “uncertain” class membership. The boundary for low probabilities of class assignment (threshold ε) is calculated using a computed ABC analysis as a data-based technique for item categorization. This leaves a number of cases with uncertain classification (p < ε). Second, cases with uncertain class membership are relabeled based on the distance to neighboring classified cases based on Voronoi cells. The approach is demonstrated on biomedical data typically analyzed with Bayesian statistics, such as flow cytometric data sets or biomarkers used in medical diagnostics, where it increased the class assignment accuracy by 1−10% depending on the data set. The proposed extension of the Bayesian inference of class membership can be used to obtain robust and plausible class assignments even for data at the extremes of the distribution and/or for which evidence is weak.
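The pitfall described in this abstract, where a large value is assigned to the group with the lower mean because that group has the higher variance, can be reproduced with two Gaussian classes. The sketch below only demonstrates the pitfall and the low evidence at the extreme value; it does not implement the proposed ABC-analysis threshold or the Voronoi-based relabeling. The means, standard deviations, and equal priors are made-up assumptions.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Hypothetical groups: "low" has the smaller mean but the larger spread.
LOW = (0.0, 3.0)   # (mean, standard deviation)
HIGH = (2.0, 1.0)
PRIOR_LOW = 0.5


def evidence(x):
    """Marginal density p(x) under the two-class mixture."""
    return PRIOR_LOW * normal_pdf(x, *LOW) + (1.0 - PRIOR_LOW) * normal_pdf(x, *HIGH)


def posterior_low(x):
    """Posterior probability that x belongs to the low-mean group (Bayes' theorem)."""
    return PRIOR_LOW * normal_pdf(x, *LOW) / evidence(x)


for x in (2.0, 10.0):
    print(x, round(posterior_low(x), 3), evidence(x))
```

At x = 2 the value is assigned to the high-mean group, as expected, but at x = 10 the posterior for the low-mean group is close to 1 even though the evidence p(x) there is tiny, which is the implausible assignment the abstract describes.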
Affiliation(s)
- Alfred Ultsch
  - DataBionics Research Group, University of Marburg, Hans-Meerwein-Straße 22, 35032 Marburg, Germany
- Jörn Lötsch
  - Institute of Clinical Pharmacology, Goethe-University, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany
  - Fraunhofer Institute for Translational Medicine and Pharmacology ITMP, Theodor-Stern-Kai 7, 60596 Frankfurt am Main, Germany
7
Sorbie A, Delgado Jiménez R, Benakis C. Increasing transparency and reproducibility in stroke-microbiota research: A toolbox for microbiota analysis. iScience 2022; 25:103998. [PMID: 35310944 PMCID: PMC8931359 DOI: 10.1016/j.isci.2022.103998] [Received: 11/11/2021] [Revised: 01/18/2022] [Accepted: 02/24/2022] [Indexed: 12/29/2022] Open access
Abstract
Homeostasis of gut microbiota is crucial in maintaining human health. Alterations, or “dysbiosis,” are increasingly implicated in human diseases, such as cancer, inflammatory bowel diseases, and, more recently, neurological disorders. In ischemic stroke patients, gut microbial profiles are markedly different compared to healthy controls, whereas manipulation of microbiota in animal models of stroke modulates outcome, further implicating microbiota in stroke pathobiology. Despite this, evidence for the involvement of specific microbes or microbial products, as well as microbial signatures, has yet to be identified, likely owing to differences in methodology, data analysis, and confounding variables between different studies. Here, we provide a set of guidelines to enable researchers to conduct high-quality, reproducible, and transparent microbiota studies, focusing on 16S rRNA sequencing in the emerging subfield of the stroke-microbiota. In doing so, we aim to facilitate novel and reproducible associations between the microbiota and brain diseases, including stroke, and translation into clinical practice.
- Guidelines for reproducible stroke-microbiota research in patients and animal models
- Current best practices for 16S rRNA profiling and analysis
- Easy-to-use, freely available bioinformatics pipeline for gut microbiota analysis
8
Abstract
The study focuses on the analysis of biological data containing information on the number of genome sequences of intestinal microbiome bacteria before and after antibiotic use. The data have high dimensionality (bacterial taxa) and a small number of records, which is typical of bioinformatics data. Classification models induced on data sets like this usually are not stable and the accuracy metrics have high variance. The aim of the study is to create a preprocessing workflow and a classification model that can perform the most accurate classification of the microbiome into groups before and after the use of antibiotics and lessen the variability of accuracy measures of the classifier. To evaluate the accuracy of the model, measures of the area under the ROC curve and the overall accuracy of the classifier were used. In the experiments, the authors examined how classification results were affected by feature selection and increased size of the data set.
Affiliation(s)
- Jana Busa
  - Riga Technical University, Riga, Latvia