1
Konopka T, Ng S, Smedley D. Diffusion enables integration of heterogeneous data and user-driven learning in a desktop knowledge-base. PLoS Comput Biol 2021; 17:e1009283. [PMID: 34379637 PMCID: PMC8382188 DOI: 10.1371/journal.pcbi.1009283] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/24/2020] [Revised: 08/23/2021] [Accepted: 07/16/2021] [Indexed: 11/20/2022] Open
Abstract
Integrating reference datasets (e.g. from high-throughput experiments) with unstructured and manually-assembled information (e.g. notes or comments from individual researchers) has the potential to tailor bioinformatic analyses to specific needs and to lead to new insights. However, developing bespoke analysis pipelines from scratch is time-consuming, and general tools for exploring such heterogeneous data are not available. We argue that by treating all data as text, a knowledge-base can accommodate a range of bioinformatic data types and applications. We show that a database coupled to nearest-neighbor algorithms can address common tasks such as gene-set analysis as well as specific tasks such as ontology translation. We further show that a mathematical transformation motivated by diffusion can be effective for exploration across heterogeneous datasets. Diffusion enables the knowledge-base to begin with a sparse query, impute more features, and find matches that would otherwise remain hidden. This can be used, for example, to map multi-modal queries consisting of gene symbols and phenotypes to descriptions of diseases. Diffusion also enables user-driven learning: when the knowledge-base cannot provide satisfactory search results in the first instance, users can improve the results in real-time by adding domain-specific knowledge. User-driven learning has implications for data management, integration, and curation.
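The abstract describes diffusion only in outline: a sparse query is expanded with imputed features before matching. The following is a minimal sketch of that idea and not the authors' transformation. The function name `diffuse`, the `related` map of feature affinities, and the `strength` parameter are all hypothetical, introduced for illustration.

```python
def diffuse(query, related, strength=0.5):
    """Spread weight from query features to related features (illustrative only).

    query:   dict mapping feature -> weight (the sparse query)
    related: dict mapping feature -> {neighbor: affinity}; hypothetical structure
    """
    out = dict(query)  # keep the original query features unchanged
    for feature, weight in query.items():
        for neighbor, affinity in related.get(feature, {}).items():
            # Each neighbor receives a share of the source feature's weight.
            out[neighbor] = out.get(neighbor, 0.0) + strength * weight * affinity
    return out


# Toy example: a query with one gene feature gains an imputed phenotype feature.
related = {"gene_a": {"phenotype_x": 0.8}}
expanded = diffuse({"gene_a": 1.0}, related)
```

After this step the expanded query can match records that mention `phenotype_x` but not `gene_a`, which is the kind of otherwise hidden match the abstract refers to.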
Affiliation(s)
- Tomasz Konopka
  - William Harvey Research Institute, Queen Mary University of London, London, United Kingdom
- Sandra Ng
  - William Harvey Research Institute, Queen Mary University of London, London, United Kingdom
- Damian Smedley
  - William Harvey Research Institute, Queen Mary University of London, London, United Kingdom
2
Abstract
We present a bipartite graph-based approach to calculate drug pairwise similarity for identifying potential new indications of approved drugs. Both chemical and molecular features were used in the drug similarity calculation. In this paper, we first extracted drug chemical structures and drug-target interactions. Second, we computed chemical structure similarity and drug-target profile similarity. Further, we constructed a bipartite graph model with known relationships between drugs and their target proteins. Finally, we computed a weighted sum of drug structure similarity and target profile similarity to derive drug pairwise similarity, so that potential indications of a drug can be predicted from its similar drugs. In addition, we summarize alternative strategies and variations after each section of the overall analysis.
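The final step, a weighted sum of two similarities, can be sketched as follows. The abstract does not say which similarity measures or weights were used, so the Tanimoto coefficient, the set-based fingerprints, the `weight` parameter, and all drug and target names below are assumptions made for illustration.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two feature sets."""
    union = a | b
    if not union:
        return 0.0  # two empty sets: define similarity as 0
    return len(a & b) / len(union)


def drug_similarity(bits_a, bits_b, targets_a, targets_b, weight=0.5):
    """Weighted sum of structure similarity and target-profile similarity.

    weight is the (assumed) share given to chemical structure; the remainder
    goes to the target profile.
    """
    structure_sim = tanimoto(bits_a, bits_b)
    target_sim = tanimoto(targets_a, targets_b)
    return weight * structure_sim + (1.0 - weight) * target_sim


# Toy data: fingerprint bit positions and target protein identifiers.
drug_a = {"bits": {1, 4, 7, 9}, "targets": {"P1", "P2"}}
drug_b = {"bits": {1, 4, 8, 9}, "targets": {"P2", "P3"}}

score = drug_similarity(drug_a["bits"], drug_b["bits"],
                        drug_a["targets"], drug_b["targets"])
```

Here the structure similarity is 3/5 and the target-profile similarity is 1/3, so equal weighting gives a combined score of about 0.47. Ranking all approved drugs by this score against a query drug would then suggest candidates whose known indications might transfer.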
3
Ames NJ, Barb JJ, Ranucci A, Kim H, Mudra SE, Cashion AK, Townsley DM, Childs R, Paster BJ, Faller LL, Wallen GR. The oral microbiome of patients undergoing treatment for severe aplastic anemia: a pilot study. Ann Hematol 2019; 98:1351-1365. [PMID: 30919073 DOI: 10.1007/s00277-019-03599-w] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 05/01/2018] [Accepted: 01/07/2019] [Indexed: 12/11/2022]
Abstract
The microbiome, an intriguing component of the human body, composed of trillions of microorganisms, has prompted scientific exploration to identify and understand its function and role in health and disease. As associations between microbiome composition, disease, and symptoms accumulate, the future of medicine hinges upon a comprehensive knowledge of these microorganisms for patient care. The oral microbiome may provide valuable and efficient insight for predicting future changes in disease status, infection, or treatment course. The main aim of this pilot study was to characterize the oral microbiome in patients with severe aplastic anemia (SAA) during their therapeutic course. SAA is a hematologic disease characterized by bone marrow failure which, if untreated, is fatal. Treatment includes either hematopoietic stem cell transplantation (HSCT) or immunosuppressive therapy (IST). In this study, we examined the oral microbiome composition of 24 patients admitted to the National Institutes of Health (NIH) Clinical Center for experimental SAA treatment. Tongue brushings were collected to assess the effects of treatment on the oral microbiome. Twenty patients received standard IST (equine antithymocyte globulin and cyclosporine) plus eltrombopag. Four patients underwent HSCT. Oral specimens were obtained at three time points during treatment and clinical follow-up. Using a novel approach to 16S rRNA gene sequence analysis encompassing seven hypervariable regions, results demonstrated a predictable decrease in microbial diversity over time among the transplant patients. Linear discriminant analysis effect size (LEfSe) reported a total of 14 statistically significant taxa (p < 0.05) across time points in the HSCT patients. One-way plots of relative abundance for two bacterial species (Haemophilus parainfluenzae and Rothia mucilaginosa) in the HSCT group show the differences in abundance between time points. Only one bacterial species (Prevotella histicola) was noted in the IST group with a p value of 0.065. The patients receiving immunosuppressive therapy did not exhibit a clear change in diversity over time; however, patient-specific changes were noted. In addition, we compared our findings to tongue dorsum samples from healthy participants in the Human Microbiome Project (HMP) database and found that, among HSCT patients, approximately 35% of bacterial identifiers (N = 229) were unique to this study population and were not present in tongue dorsum specimens obtained from the HMP. Among IST-treated patients, 45% (N = 351) were unique to these patients and not identified by the HMP. Although antibiotic use likely influenced bacterial composition and diversity, some literature suggests a decreased impact of antimicrobials on the oral microbiome as compared to their effect on the gut microbiome. Future studies with larger sample sizes that focus on the oral microbiome and the effects of antibiotics in an immunosuppressed patient population may help establish these potential associations.
Affiliation(s)
- N J Ames
  - Clinical Center Nursing Department, National Institutes of Health, Bethesda, MD, USA
- J J Barb
  - Mathematical and Statistical Computing Lab, Center for Information Technology, National Institutes of Health, Bethesda, MD, USA
- A Ranucci
  - Clinical Center Nursing Department, National Institutes of Health, Bethesda, MD, USA
  - Tulane University School of Medicine, New Orleans, LA, USA
- H Kim
  - National Institute of Nursing Research, National Institutes of Health, Bethesda, MD, USA
- S E Mudra
  - Clinical Center Nursing Department, National Institutes of Health, Bethesda, MD, USA
  - University of Louisville School of Medicine, Louisville, KY, USA
- A K Cashion
  - National Institute of Nursing Research, National Institutes of Health, Bethesda, MD, USA
- D M Townsley
  - National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA
- R Childs
  - National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA
- B J Paster
  - Forsyth Institute, Cambridge, MA, USA
  - Harvard School of Dental Medicine, Boston, MA, USA
- L L Faller
  - Forsyth Institute, Cambridge, MA, USA
  - Ginkgo Bioworks, Boston, MA, USA
- G R Wallen
  - Clinical Center Nursing Department, National Institutes of Health, Bethesda, MD, USA
4
Adamberg K, Kolk K, Jaagura M, Vilu R, Adamberg S. The composition and metabolism of faecal microbiota is specifically modulated by different dietary polysaccharides and mucin: an isothermal microcalorimetry study. Benef Microbes 2018; 9:21-34. [DOI: 10.3920/bm2016.0198] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 02/06/2023]
Abstract
The metabolic activity of colon microbiota is specifically affected by fibres with various monomer compositions, degree of polymerisation and branching. The supply of a variety of dietary fibres assures the diversity of gut microbial communities considered important for the well-being of the host. The aim of this study was to compare the impact of different oligo- and polysaccharides (galacto- and fructooligosaccharides, resistant starch, levan, inulin, arabinogalactan, xylan, pectin and chitin), and a glycoprotein mucin on the growth and metabolism of faecal microbiota in vitro by using isothermal microcalorimetry (IMC). Faecal samples from healthy donors were incubated in a phosphate-buffered defined medium with or without supplementation of a single substrate. The generation of heat was followed on-line, and microbiota composition (V3-V4 region of the 16S rRNA using Illumina MiSeq v2) and concentrations of metabolites (HPLC) were determined at the end of growth. The multiauxic power-time curves obtained were substrate-specific. More than 70% of all substrates except chitin were fermented by faecal microbiota with total heat generation of up to 8 J/ml. The final metabolite patterns were in accordance with the microbiota changes. For arabinogalactan, xylan and levan, the fibre-affected distribution of bacterial taxa showed clear similarities (e.g. increase of Bacteroides ovatus and decrease of Bifidobacterium adolescentis). The formation of propionic acid, an important colon metabolite, was enhanced by arabinogalactan, xylan and mucin but not by galacto- and fructooligosaccharides or inulin. Mucin fermentation resulted in acetate, propionate and butyrate production in ratios previously observed for faecal samples, indicating that mucins may serve as major substrates for the colon microbial population. IMC combined with analytical methods was shown to be an effective method for screening the impact of specific dietary fibres on functional changes in faecal microbiota.
Affiliation(s)
- K. Adamberg
  - Department of Chemistry and Biotechnology, Tallinn University of Technology, Akadeemia tee 15, 12618 Tallinn, Estonia
  - Competence Center of Food and Fermentation Technologies, Akadeemia tee 15A, 12618 Tallinn, Estonia
- K. Kolk
  - Department of Chemistry and Biotechnology, Tallinn University of Technology, Akadeemia tee 15, 12618 Tallinn, Estonia
  - Competence Center of Food and Fermentation Technologies, Akadeemia tee 15A, 12618 Tallinn, Estonia
- M. Jaagura
  - Department of Chemistry and Biotechnology, Tallinn University of Technology, Akadeemia tee 15, 12618 Tallinn, Estonia
  - Competence Center of Food and Fermentation Technologies, Akadeemia tee 15A, 12618 Tallinn, Estonia
- R. Vilu
  - Department of Chemistry and Biotechnology, Tallinn University of Technology, Akadeemia tee 15, 12618 Tallinn, Estonia
  - Competence Center of Food and Fermentation Technologies, Akadeemia tee 15A, 12618 Tallinn, Estonia
- S. Adamberg
  - Department of Chemistry and Biotechnology, Tallinn University of Technology, Akadeemia tee 15, 12618 Tallinn, Estonia
5
Jünemann S, Kleinbölting N, Jaenicke S, Henke C, Hassa J, Nelkner J, Stolze Y, Albaum SP, Schlüter A, Goesmann A, Sczyrba A, Stoye J. Bioinformatics for NGS-based metagenomics and the application to biogas research. J Biotechnol 2017; 261:10-23. [PMID: 28823476 DOI: 10.1016/j.jbiotec.2017.08.012] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Received: 03/20/2017] [Revised: 08/08/2017] [Accepted: 08/09/2017] [Indexed: 12/19/2022]
Abstract
Metagenomics has proven to be one of the most important research fields for microbial ecology during the last decade. Starting from 16S rRNA marker gene analysis for the characterization of community compositions to whole metagenome shotgun sequencing which additionally allows for functional analysis, metagenomics has been applied in a wide spectrum of research areas. The cost reduction paired with the increase in the amount of data due to the advent of next-generation sequencing led to a rapidly growing demand for bioinformatic software in metagenomics. By now, a large number of tools that can be used to analyze metagenomic datasets have been developed. The Bielefeld-Gießen center for microbial bioinformatics, as part of the German Network for Bioinformatics Infrastructure, bundles and imparts expert knowledge in the analysis of metagenomic datasets, especially in research on microbial communities involved in anaerobic digestion residing in biogas reactors. In this review, we give an overview of the field of metagenomics and introduce important bioinformatic tools and possible workflows, accompanied by application examples of biogas surveys successfully conducted at the Center for Biotechnology of Bielefeld University.
Affiliation(s)
- Sebastian Jünemann
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
  - Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Nils Kleinbölting
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Sebastian Jaenicke
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
  - Bioinformatics and Systems Biology, Justus-Liebig-Universität, Gießen, Germany
- Christian Henke
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Julia Hassa
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Johanna Nelkner
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Yvonne Stolze
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Stefan P Albaum
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Andreas Schlüter
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Alexander Goesmann
  - Bioinformatics and Systems Biology, Justus-Liebig-Universität, Gießen, Germany
- Alexander Sczyrba
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
  - Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Jens Stoye
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
  - Faculty of Technology, Bielefeld University, Bielefeld, Germany
6
Bogachev MI, Markelov OA, Kayumov AR, Bunde A. Superstatistical model of bacterial DNA architecture. Sci Rep 2017; 7:43034. [PMID: 28225058 PMCID: PMC5320525 DOI: 10.1038/srep43034] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 09/08/2016] [Accepted: 01/18/2017] [Indexed: 12/15/2022] Open
Abstract
Understanding the physical principles that govern the complex DNA structural organization as well as its mechanical and thermodynamical properties is essential for the advancement in both life sciences and genetic engineering. Recently we have discovered that the complex DNA organization is explicitly reflected in the arrangement of nucleotides depicted by the universal power law tailed internucleotide interval distribution that is valid for complete genomes of various prokaryotic and eukaryotic organisms. Here we suggest a superstatistical model that represents a long DNA molecule by a series of consecutive ~150 bp DNA segments with the alternation of the local nucleotide composition between segments exhibiting long-range correlations. We show that the superstatistical model and the corresponding DNA generation algorithm explicitly reproduce the laws governing the empirical nucleotide arrangement properties of the DNA sequences for various global GC contents and optimal living temperatures. Finally, we discuss the relevance of our model in terms of the DNA mechanical properties. As an outlook, we focus on finding the DNA sequences that encode a given protein while simultaneously reproducing the nucleotide arrangement laws observed from empirical genomes, that may be of interest in the optimization of genetic engineering of long DNA molecules.
Affiliation(s)
- Mikhail I. Bogachev
  - Biomedical Engineering Research Centre, St. Petersburg Electrotechnical University, St. Petersburg, 197376, Russia
  - Molecular Genetics of Microorganisms Lab, Institute of Fundamental Medicine and Biology, Kazan (Volga Region) Federal University, Kazan, Tatarstan, 420008, Russia
- Oleg A. Markelov
  - Biomedical Engineering Research Centre, St. Petersburg Electrotechnical University, St. Petersburg, 197376, Russia
- Airat R. Kayumov
  - Molecular Genetics of Microorganisms Lab, Institute of Fundamental Medicine and Biology, Kazan (Volga Region) Federal University, Kazan, Tatarstan, 420008, Russia
- Armin Bunde
  - Institut für Theoretische Physik, Justus-Liebig-Universität Giessen, 35392 Giessen, Germany
7
Ferretti P, Farina S, Cristofolini M, Girolomoni G, Tett A, Segata N. Experimental metagenomics and ribosomal profiling of the human skin microbiome. Exp Dermatol 2017; 26:211-219. [DOI: 10.1111/exd.13210] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Accepted: 09/06/2016] [Indexed: 02/06/2023]
Affiliation(s)
- Pamela Ferretti
  - Centre for Integrative Biology; University of Trento; Trento Italy
- Giampiero Girolomoni
  - Section of Dermatology; Department of Medicine; University of Verona; Verona Italy
- Adrian Tett
  - Centre for Integrative Biology; University of Trento; Trento Italy
- Nicola Segata
  - Centre for Integrative Biology; University of Trento; Trento Italy
8
La Rosa M, Fiannaca A, Rizzo R, Urso A. Probabilistic topic modeling for the analysis and classification of genomic sequences. BMC Bioinformatics 2015; 16 Suppl 6:S2. [PMID: 25916734 PMCID: PMC4416183 DOI: 10.1186/1471-2105-16-s6-s2] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Indexed: 11/25/2022] Open
Abstract
Background: Studies on genomic sequences for classification and taxonomic identification have a leading role in the biomedical field and in the analysis of biodiversity. These studies focus on the so-called barcode genes, representing a well-defined region of the whole genome. Recently, alignment-free techniques have been gaining importance because they are able to overcome the drawbacks of sequence alignment techniques. In this paper a new alignment-free method for DNA sequence clustering and classification is proposed. The method is based on k-mer representation and text mining techniques. Methods: The presented method is based on Probabilistic Topic Modeling, a statistical technique originally proposed for text documents. Probabilistic topic models are able to find in a document corpus the topics (recurrent themes) characterizing classes of documents. This technique, applied to DNA sequences representing the documents, exploits the frequency of fixed-length k-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences. Results and conclusions: We performed classification of over 7000 16S DNA barcode sequences taken from the Ribosomal Database Project (RDP) repository, training probabilistic topic models. The proposed method is compared to the RDP tool and the Support Vector Machine (SVM) classification algorithm in an extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). Our method reaches very similar results to the RDP classifier and SVM for complete sequences. The most interesting results are obtained when short sequence snippets are considered. In these conditions the proposed method outperforms RDP and SVM with ultra-short sequences, and it exhibits a smooth decrease of performance, at every taxonomic level, when the sequence length is decreased.
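The step that lets a text-mining method operate on DNA is the conversion of each sequence into a "document" of fixed-length k-mer "words". The sketch below shows only that conversion; the function names and the choice k = 3 are assumptions for illustration, and the topic-model training itself (LDA) is not shown.

```python
from collections import Counter


def kmer_words(sequence: str, k: int = 3) -> list:
    """Split a DNA sequence into overlapping k-mers, treated as words."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Bag-of-words view of a sequence: k-mer -> number of occurrences."""
    return Counter(kmer_words(sequence, k))


words = kmer_words("ACGTACGT")
counts = kmer_counts("ACGTACGT")
```

For the toy sequence `ACGTACGT` this yields six overlapping 3-mers, with `ACG` and `CGT` each counted twice. Count vectors of this kind, one per sequence, are the usual input format for a topic-model implementation.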
9
Wall K, Cornell J, Bizzoco RW, Kelley ST. Biodiversity hot spot on a hot spot: novel extremophile diversity in Hawaiian fumaroles. Microbiologyopen 2015; 4:267-281. [PMID: 25565172 PMCID: PMC4398508 DOI: 10.1002/mbo3.236] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 08/11/2014] [Revised: 11/18/2014] [Accepted: 11/24/2014] [Indexed: 02/01/2023] Open
Abstract
Fumaroles (steam vents) are the most common, yet least understood, microbial habitat in terrestrial geothermal settings. Although fumaroles were long believed too extreme for life, recent advances in sample collection and DNA extraction methods have shown that fumarole deposits and subsurface waters harbor a considerable diversity of viable microbes. In this study, we applied culture-independent molecular methods to explore fumarole deposit microbial assemblages in 15 different fumaroles in four geographic locations on the Big Island of Hawai'i. Just over half of the vents yielded sufficient high-quality DNA for the construction of 16S ribosomal RNA gene sequence clone libraries. The bacterial clone libraries contained sequences belonging to 11 recognized bacterial divisions and seven other division-level phylogenetic groups. Archaeal sequences were less numerous, but similarly diverse. The taxonomic composition among fumarole deposits was highly heterogeneous. Phylogenetic analysis found cloned fumarole sequences were related to microbes identified from a broad array of globally distributed ecotypes, including hot springs, terrestrial soils, and industrial waste sites. Our results suggest that fumarole deposits function as an “extremophile collector” and may be a hot spot of novel extremophile biodiversity.
Affiliation(s)
- Kate Wall
  - Department of Biology, San Diego State University, 5500 Campanile Drive, San Diego, California
- Jennifer Cornell
  - Department of Biology, San Diego State University, 5500 Campanile Drive, San Diego, California
- Richard W Bizzoco
  - Department of Biology, San Diego State University, 5500 Campanile Drive, San Diego, California
- Scott T Kelley
  - Department of Biology, San Diego State University, 5500 Campanile Drive, San Diego, California
10
Evaluation of whole genome sequencing for outbreak detection of Salmonella enterica. PLoS One 2014; 9:e87991. [PMID: 24505344 PMCID: PMC3913712 DOI: 10.1371/journal.pone.0087991] [Citation(s) in RCA: 186] [Impact Index Per Article: 18.6] [Received: 10/21/2013] [Accepted: 01/02/2014] [Indexed: 11/19/2022] Open
Abstract
Salmonella enterica is a common cause of minor and large foodborne outbreaks. To achieve successful and nearly ‘real-time’ monitoring and identification of outbreaks, reliable sub-typing is essential. Whole genome sequencing (WGS) shows great promise for use as a routine epidemiological typing tool. Here we evaluate WGS for typing of S. Typhimurium, including different approaches for analyzing and comparing the data. A collection of 34 S. Typhimurium isolates was sequenced. This consisted of 18 isolates from six outbreaks and 16 epidemiologically unrelated background strains. In addition, 8 S. Enteritidis and 5 S. Derby were also sequenced and used for comparison. A number of different bioinformatics approaches were applied to the data, including pan-genome tree, k-mer tree, nucleotide difference tree and SNP tree. The outcome of each approach was evaluated in relation to the association of the isolates to specific outbreaks. The pan-genome tree clustered 65% of the S. Typhimurium isolates according to the pre-defined epidemiology, the k-mer tree 88%, the nucleotide difference tree 100% and the SNP tree 100% of the strains within S. Typhimurium. The outcomes of the four phylogenetic analyses were also compared to PFGE, revealing that WGS typing achieved greater performance than the traditional method. In conclusion, for S. Typhimurium, SNP analysis and the nucleotide difference approach of WGS data seem to be the superior methods for epidemiological typing compared to other phylogenetic analytic approaches that may be used on WGS. These approaches were also superior to the more classical typing method, PFGE. Our study also indicates that WGS alone is insufficient to determine whether strains are related or unrelated to outbreaks. This still requires the combination of epidemiological data and whole genome sequencing results.
11
Argimón S, Konganti K, Chen H, Alekseyenko AV, Brown S, Caufield PW. Comparative genomics of oral isolates of Streptococcus mutans by in silico genome subtraction does not reveal accessory DNA associated with severe early childhood caries. Infect Genet Evol 2013; 21:269-78. [PMID: 24291226 DOI: 10.1016/j.meegid.2013.11.003] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 07/17/2013] [Revised: 11/07/2013] [Accepted: 11/08/2013] [Indexed: 11/29/2022]
Abstract
Comparative genomics is a popular method for the identification of microbial virulence determinants, especially since the sequencing of a large number of whole bacterial genomes from pathogenic and non-pathogenic strains has become relatively inexpensive. The bioinformatics pipelines for comparative genomics usually include gene prediction and annotation and can require significant computer power. To circumvent this, we developed a rapid method for genome-scale in silico subtractive hybridization, based on blastn and independent of feature identification and annotation. Whole genome comparisons by in silico genome subtraction were performed to identify genetic loci specific to Streptococcus mutans strains associated with severe early childhood caries (S-ECC), compared to strains isolated from caries-free (CF) children. The genome similarity of the 20 S. mutans strains included in this study, calculated by Simrank k-mer sharing, ranged from 79.5% to 90.9%, confirming this is a genetically heterogeneous group of strains. We identified strain-specific genetic elements in 19 strains, with sizes ranging from 200 to 39 kb. These elements contained protein-coding regions with functions mostly associated with mobile DNA. We did not, however, identify any genetic loci consistently associated with dental caries, i.e., shared by all the S-ECC strains and absent in the CF strains. Conversely, we did not identify any genetic loci specific to the healthy group. Comparison of previously published genomes from pathogenic and carriage strains of Neisseria meningitidis with our in silico genome subtraction yielded the same set of genes specific to the pathogenic strains, thus validating our method. Our results suggest that S. mutans strains derived from caries-active or caries-free dentitions cannot be differentiated based on the presence or absence of specific genetic elements. Our in silico genome subtraction method is available as the Microbial Genome Comparison (MGC) tool, with a user-friendly JAVA graphical interface.
Affiliation(s)
- Silvia Argimón
  - New York University College of Dentistry, Department of Cariology and Comprehensive Care, 345 East 24th St, New York, NY 10010, USA
- Kranti Konganti
  - Center for Health Informatics and Bioinformatics, New York University School of Medicine, 227 East 30th St, New York, NY 10016, USA
- Hao Chen
  - Center for Health Informatics and Bioinformatics, New York University School of Medicine, 227 East 30th St, New York, NY 10016, USA
- Alexander V Alekseyenko
  - Center for Health Informatics and Bioinformatics, New York University School of Medicine, 227 East 30th St, New York, NY 10016, USA
- Stuart Brown
  - Center for Health Informatics and Bioinformatics, New York University School of Medicine, 227 East 30th St, New York, NY 10016, USA
- Page W Caufield
  - New York University College of Dentistry, Department of Cariology and Comprehensive Care, 345 East 24th St, New York, NY 10010, USA
12
Hermann-Bank ML, Skovgaard K, Stockmarr A, Larsen N, Mølbak L. The Gut Microbiotassay: a high-throughput qPCR approach combinable with next generation sequencing to study gut microbial diversity. BMC Genomics 2013; 14:788. [PMID: 24225361 PMCID: PMC3879714 DOI: 10.1186/1471-2164-14-788] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.3] [Received: 01/07/2013] [Accepted: 10/14/2013] [Indexed: 12/12/2022] Open
Abstract
Background: The intestinal microbiota is a complex and diverse ecosystem that plays a significant role in maintaining the health and well-being of the mammalian host. During the last decade, focus has increased on the importance of intestinal bacteria. Several molecular methods can be applied to describe the composition of the microbiota. This study used a new approach, the Gut Microbiotassay: an assembly of 24 primer sets targeting the main phyla and taxonomically related subgroups of the intestinal microbiota, to be used with the high-throughput qPCR chip ‘Access Array 48.48’ (AA48.48; Fluidigm®) followed by next generation sequencing. Primers were designed if necessary and all primer sets were screened against DNA extracted from pure cultures of 15 representative bacterial species. Subsequently the setup was tested on DNA extracted from small and large intestinal content from piglets with and without diarrhoea. The PCR amplicons from the 2304 reaction chambers were harvested from the AA48.48, purified, and sequenced using 454-technology. Results: The Gut Microbiotassay was able to detect significant differences in the quantity and composition of the microbiota according to gut sections and diarrhoeic status. 454-sequencing confirmed the specificity of the primer sets. Diarrhoea was associated with a reduced number of members from the genus Streptococcus, and in particular S. alactolyticus. Conclusion: The Gut Microbiotassay provides fast and affordable high-throughput quantification of the bacterial composition in many samples and enables further descriptive taxonomic information if combined with 454-sequencing.
Affiliation(s)
- Lars Mølbak
  - Present address: Chr. Hansen, Bøge Allé 10, 2970 Hørsholm, Denmark
13
Nagar A, Hahsler M. Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment. BMC Bioinformatics 2013; 14 Suppl 11:S2. [PMID: 24564200 PMCID: PMC3846703 DOI: 10.1186/1471-2105-14-s11-s2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/16/2022] Open
Abstract
Background Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive alignment, whose run time grows with the square of the number and the length of the aligned sequences and requires significant computational resources. In this work, we present a method to efficiently discover regions of high similarity across multiple sequences without performing expensive sequence alignment. The method is based on approximating edit distance between segments of sequences using p-mer frequency counts. Then, efficient high-throughput data stream clustering is used to group highly similar segments into so-called quasi-alignments. Quasi-alignments have numerous applications such as identifying species and their taxonomic class from sequences, comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences. Results In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the proposed method scales well for large data sets with a run time that grows only linearly with the number and length of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly. Conclusion Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across multiple sequences. Since the run time is linear and the sequences are converted into a compact clustering model, we are able to identify conserved regions fast or even interactively using a standard PC. Our method has many potential applications such as finding characteristic signature sequences for families of organisms and studying conserved and variable regions in, for example, 16S rRNA.
|
14
|
Mizrahi-Man O, Davenport ER, Gilad Y. Taxonomic classification of bacterial 16S rRNA genes using short sequencing reads: evaluation of effective study designs. PLoS One 2013; 8:e53608. [PMID: 23308262 PMCID: PMC3538547 DOI: 10.1371/journal.pone.0053608] [Citation(s) in RCA: 198] [Impact Index Per Article: 18.0] [Received: 08/14/2012] [Accepted: 12/03/2012] [Indexed: 01/12/2023] Open
Abstract
Massively parallel high-throughput sequencing technologies allow us to interrogate the microbial composition of biological samples at unprecedented resolution. The typical approach is to perform high-throughput sequencing of 16S rRNA genes, which are then taxonomically classified based on similarity to known sequences in existing databases. Current technologies cause a predicament though, because although they enable deep coverage of samples, they are limited in the length of sequence they can produce. As a result, high-throughput studies of microbial communities often do not sequence the entire 16S rRNA gene. The challenge is to obtain reliable representation of bacterial communities through taxonomic classification of short 16S rRNA gene sequences. In this study we explored properties of different study designs and developed specific recommendations for effective use of short-read sequencing technologies for the purpose of interrogating bacterial communities, with a focus on classification using naïve Bayesian classifiers. To assess precision and coverage of each design, we used a collection of ∼8,500 manually curated 16S rRNA gene sequences from cultured bacteria and a set of over one million bacterial 16S rRNA gene sequences retrieved from environmental samples, respectively. We also tested different configurations of taxonomic classification approaches using short read sequencing data, and provide recommendations for optimal choice of the relevant parameters. We conclude that with a judicious selection of the sequenced region and the corresponding choice of a suitable training set for taxonomic classification, it is possible to explore bacterial communities at great depth using current technologies, with only a minimal loss of taxonomic resolution.
MESH Headings
- Bayes Theorem
- DNA Barcoding, Taxonomic/methods
- DNA Barcoding, Taxonomic/statistics & numerical data
- Genes, Bacterial
- Genes, rRNA
- High-Throughput Nucleotide Sequencing
- Metagenome
- Microbial Consortia/genetics
- Phylogeny
- RNA, Ribosomal, 16S/classification
- RNA, Ribosomal, 16S/genetics
- Research Design
- Sequence Analysis, DNA/methods
- Sequence Analysis, DNA/statistics & numerical data
Affiliation(s)
- Orna Mizrahi-Man
- Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Emily R. Davenport
- Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Yoav Gilad
- Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
|
15
|
Werner JJ, Koren O, Hugenholtz P, DeSantis TZ, Walters WA, Caporaso JG, Angenent LT, Knight R, Ley RE. Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys. The ISME Journal 2012; 6:94-103. [PMID: 21716311 PMCID: PMC3217155 DOI: 10.1038/ismej.2011.82] [Citation(s) in RCA: 358] [Impact Index Per Article: 29.8] [Received: 03/29/2011] [Revised: 05/10/2011] [Accepted: 05/12/2011] [Indexed: 01/10/2023]
Abstract
Taxonomic classification of the thousands-millions of 16S rRNA gene sequences generated in microbiome studies is often achieved using a naïve Bayesian classifier (for example, the Ribosomal Database Project II (RDP) classifier), due to favorable trade-offs among automation, speed and accuracy. The resulting classification depends on the reference sequences and taxonomic hierarchy used to train the model; although the influence of primer sets and classification algorithms have been explored in detail, the influence of training set has not been characterized. We compared classification results obtained using three different publicly available databases as training sets, applied to five different bacterial 16S rRNA gene pyrosequencing data sets generated (from human body, mouse gut, python gut, soil and anaerobic digester samples). We observed numerous advantages to using the largest, most diverse training set available, that we constructed from the Greengenes (GG) bacterial/archaeal 16S rRNA gene sequence database and the latest GG taxonomy. Phylogenetic clusters of previously unclassified experimental sequences were identified with notable improvements (for example, 50% reduction in reads unclassified at the phylum level in mouse gut, soil and anaerobic digester samples), especially for phylotypes belonging to specific phyla (Tenericutes, Chloroflexi, Synergistetes and Candidate phyla TM6, TM7). Trimming the reference sequences to the primer region resulted in systematic improvements in classification depth, and greatest gains at higher confidence thresholds. Phylotypes unclassified at the genus level represented a greater proportion of the total community variation than classified operational taxonomic units in mouse gut and anaerobic digester samples, underscoring the need for greater diversity in existing reference databases.
Affiliation(s)
- Jeffrey J Werner
- Department of Biological and Environmental Engineering, Cornell University, Ithaca, NY, USA
- Omry Koren
- Department of Microbiology, Cornell University, Ithaca, NY, USA
- Philip Hugenholtz
- Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
- Todd Z DeSantis
- Center for Environmental Biotechnology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- William A Walters
- Department of Biochemistry and Chemistry, University of Colorado, Boulder, CO, USA
- J Gregory Caporaso
- Department of Biochemistry and Chemistry, University of Colorado, Boulder, CO, USA
- Largus T Angenent
- Department of Biological and Environmental Engineering, Cornell University, Ithaca, NY, USA
- Rob Knight
- Department of Biochemistry and Chemistry, University of Colorado, Boulder, CO, USA
- Howard Hughes Medical Institute, University of Colorado, Boulder, CO, USA
- Ruth E Ley
- Department of Microbiology, Cornell University, Ithaca, NY, USA
|