1. Nalbantoglu OU, Sayood K. MIMOSA: Algorithms for Microbial Profiling. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:2023-2034. [PMID: 29994027 DOI: 10.1109/tcbb.2018.2830324] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/08/2023]
Abstract
A significant goal of the study of metagenomes obtained from an environment is to find the microbial diversity and the abundance of each organism in the community. Phylotyping and binning methods that address this problem generally operate either by using marker sequences or by classifying each genome fragment individually. However, these approaches might not use all the information contained in the metagenome. We propose an approach based on a Multiple Input Multiple Output (MIMO) communication system model. Results from two different implementations of this approach, one using DNA-DNA hybridization simulations and one using short read mapping, are evaluated using simulated and actual metagenomes and compared with other methods of phylotyping. The proposed approaches generally performed better under different scenarios, including pathogen detection tasks of community complexity and low and high sequencing coverage, while being highly computationally efficient. The resulting framework can be integrated into metagenome analysis pipelines for phylogenetic diversity estimation. The approach is modular, so techniques other than hybridization simulations and short read mapping may be integrated. We have observed that the method provides accurate estimates even for low coverage samples. Therefore, the proposed strategy could enable the exploration of biodiversity with limited resources.
2. Comprehensive simulation of metagenomic sequencing data with non-uniform sampling distribution. Quantitative Biology 2018. [DOI: 10.1007/s40484-018-0142-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/28/2022]
3. Fuks G, Elgart M, Amir A, Zeisel A, Turnbaugh PJ, Soen Y, Shental N. Combining 16S rRNA gene variable regions enables high-resolution microbial community profiling. Microbiome 2018; 6:17. [PMID: 29373999 PMCID: PMC5787238 DOI: 10.1186/s40168-017-0396-x] [Citation(s) in RCA: 115] [Impact Index Per Article: 19.2] [Received: 11/01/2016] [Accepted: 12/25/2017] [Indexed: 05/02/2023]
Abstract
BACKGROUND Most of our knowledge about the remarkable microbial diversity on Earth comes from sequencing the 16S rRNA gene. The use of next-generation sequencing methods has increased sample number and sequencing depth, but the read length of the most widely used sequencing platforms today is quite short, requiring the researcher to choose a subset of the gene to sequence (typically 16-33% of the total length). Thus, many bacteria may share the same amplified region, and the resolution of profiling is inherently limited. Platforms that offer ultra-long read lengths, whole genome shotgun sequencing approaches, and computational frameworks formerly suggested by us and by others all allow different ways to circumvent this problem yet suffer various shortcomings. There is a need for a simple and low-cost 16S rRNA gene-based profiling approach that harnesses the short read length to provide a much larger coverage of the gene to allow for high resolution, even in harsh conditions of low bacterial biomass and fragmented DNA. RESULTS This manuscript suggests Short MUltiple Regions Framework (SMURF), a method to combine sequencing results from different PCR-amplified regions to provide one coherent profiling. The de facto amplicon length is the total length of all amplified regions, thus providing much higher resolution compared to current techniques. Computationally, the method solves a convex optimization problem that allows extremely fast reconstruction and requires only moderate memory. We demonstrate the increase in resolution by in silico simulations and by profiling two mock mixtures and real-world biological samples. Reanalyzing a mock mixture from the Human Microbiome Project achieved about twofold improvement in resolution when combining two independent regions. Using a custom set of six primer pairs spanning about 1200 bp (80%) of the 16S rRNA gene, we were able to achieve ~100-fold improvement in resolution compared to a single region, over a mock mixture of common human gut bacterial isolates. Finally, the profiling of a Drosophila melanogaster microbiome using the set of six primer pairs provided a ~100-fold increase in resolution, thus enabling efficient downstream analysis. CONCLUSIONS SMURF enables the identification of near full-length 16S rRNA gene sequences in microbial communities, with resolution superior to current techniques. It may be applied to standard sample preparation protocols with very few modifications. SMURF also paves the way to high-resolution profiling of low-biomass and fragmented DNA, e.g., in the case of formalin-fixed and paraffin-embedded samples, fossil-derived DNA, or DNA exposed to other degrading conditions. The approach is not restricted to combining amplicons of the 16S rRNA gene and may be applied to any set of amplicons, e.g., in multilocus sequence typing (MLST).
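The reconstruction step mentioned in this abstract can be pictured as estimating the mixture proportions of reference sequences from observed reads. The following is a minimal sketch under stated assumptions, not SMURF's actual solver: it assumes a hypothetical, precomputed matrix `match_prob[r][j]` (probability of read `r` given reference `j`) and uses a plain expectation-maximization iteration, which is one common way to maximize this kind of concave mixture log-likelihood. All function and variable names are illustrative.

```python
def estimate_proportions(match_prob, n_iter=200):
    """Estimate mixture proportions of references by EM.

    match_prob[r][j] is a hypothetical, precomputed probability of
    observing read r if it came from reference j (an assumption of this
    sketch; the abstract does not say how such values are obtained).
    """
    n_refs = len(match_prob[0])
    x = [1.0 / n_refs] * n_refs  # start from a uniform guess
    for _ in range(n_iter):
        totals = [0.0] * n_refs
        for row in match_prob:
            # E-step: posterior responsibility of each reference for this read
            weighted = [p * xj for p, xj in zip(row, x)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # read matches no reference; skip it
            for j, w in enumerate(weighted):
                totals[j] += w / norm
        # M-step: new proportions are the normalized responsibilities
        s = sum(totals)
        x = [t / s for t in totals]
    return x


# Toy input: references 0 and 1 look identical on the first two reads
# (a shared region), while the third read matches only reference 0 and
# the fourth matches only reference 2.
reads = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
proportions = estimate_proportions(reads)
```

In this toy input the third read plays the role of a second amplified region: it is the only evidence that separates reference 0 from reference 1, and with it the estimate moves toward roughly 0.75, 0, and 0.25 for the three references.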
Affiliation(s)
- Garold Fuks
  - Departments of Physics of Complex Systems, Weizmann Institute of Science, 7610001 Rehovot, Israel
- Michael Elgart
  - Department of Biomolecular Sciences, Weizmann Institute of Science, 7610001 Rehovot, Israel
- Amnon Amir
  - Department of Pediatrics, University of California San Diego, La Jolla, CA, 92093 USA
- Amit Zeisel
  - Division of Molecular Neurobiology, Department of Medical Biochemistry and Biophysics, 10 Karolinska Institutet, S-171 77 Stockholm, Sweden
- Peter J. Turnbaugh
  - Department of Microbiology and Immunology, University of California San Francisco, San Francisco, CA 94143 USA
- Yoav Soen
  - Department of Biomolecular Sciences, Weizmann Institute of Science, 7610001 Rehovot, Israel
- Noam Shental
  - Department of Computer Science, The Open University of Israel, 43107 Ra’anana, Israel
4. Tran Q, Pham DT, Phan V. Using 16S rRNA gene as marker to detect unknown bacteria in microbial communities. BMC Bioinformatics 2017; 18:499. [PMID: 29297282 PMCID: PMC5751639 DOI: 10.1186/s12859-017-1901-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022]
Abstract
BACKGROUND Quantification and identification of microbial genomes based on next-generation sequencing data is a challenging problem in metagenomics. Although current methods have mostly focused on analyzing bacteria whose genomes have been sequenced, such analyses are complicated by the presence of unknown bacteria or bacteria whose genomes have not been sequenced. RESULTS We propose a method for detecting unknown bacteria in environmental samples. Our approach is unique in its utilization of short reads only from 16S rRNA genes, not from entire genomes. We show that short reads from 16S rRNA genes retain sufficient information for detecting unknown bacteria in oral microbial communities. CONCLUSION In our experimentation with bacterial genomes from the Human Oral Microbiome Database, we found that this method made accurate and robust predictions at different read coverages and percentages of unknown bacteria. Advantages of this approach include not only a reduction in experimental and computational costs but also a potentially high accuracy across environmental samples due to the strong conservation of the 16S rRNA gene.
Affiliation(s)
- Quang Tran
  - Department of Computer Science, University of Memphis, Memphis, 38152, TN, USA
- Diem-Trang Pham
  - Department of Computer Science, University of Memphis, Memphis, 38152, TN, USA
- Vinhthuy Phan
  - Department of Computer Science, University of Memphis, Memphis, 38152, TN, USA
5. Pham DT, Gao S, Phan V. An accurate and fast alignment-free method for profiling microbial communities. J Bioinform Comput Biol 2017; 15:1740001. [PMID: 28345370 DOI: 10.1142/s0219720017400017] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
Abstract
Determining abundances of microbial genomes in metagenomic samples is an important problem in analyzing metagenomic data. Although homology-based methods are popular, they have been shown to be computationally expensive due to the alignment of tens of millions of reads from metagenomic samples to reference genomes of hundreds to thousands of environmental microbial species. We introduce an efficient alignment-free approach to estimate abundances of microbial genomes in metagenomic samples. The approach is based on solving linear and quadratic programs, which are represented by genome-specific markers (GSM). We compared our method against popular alignment-free and homology-based methods. Without contamination, our method was more accurate than other alignment-free methods while being much faster than a homology-based method. In more realistic settings where samples were contaminated with human DNA, our method was the most accurate method in predicting abundance at varying levels of contamination. We achieve higher accuracy than both alignment-free and homology-based methods.
Affiliation(s)
- Diem-Trang Pham
  - Department of Computer Science, The University of Memphis, Memphis, TN 38152, USA
- Shanshan Gao
  - Department of Computer Science, The University of Memphis, Memphis, TN 38152, USA
- Vinhthuy Phan
  - Department of Computer Science, The University of Memphis, Memphis, TN 38152, USA
6. Ferretti P, Farina S, Cristofolini M, Girolomoni G, Tett A, Segata N. Experimental metagenomics and ribosomal profiling of the human skin microbiome. Exp Dermatol 2017; 26:211-219. [DOI: 10.1111/exd.13210] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Accepted: 09/06/2016] [Indexed: 02/06/2023]
Affiliation(s)
- Pamela Ferretti
  - Centre for Integrative Biology; University of Trento; Trento Italy
- Giampiero Girolomoni
  - Section of Dermatology; Department of Medicine; University of Verona; Verona Italy
- Adrian Tett
  - Centre for Integrative Biology; University of Trento; Trento Italy
- Nicola Segata
  - Centre for Integrative Biology; University of Trento; Trento Italy
7. Gregor I, Dröge J, Schirmer M, Quince C, McHardy AC. PhyloPythiaS+: a self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes. PeerJ 2016; 4:e1603. [PMID: 26870609 PMCID: PMC4748697 DOI: 10.7717/peerj.1603] [Citation(s) in RCA: 67] [Impact Index Per Article: 8.4] [Received: 10/14/2015] [Accepted: 12/24/2015] [Indexed: 12/21/2022]
Abstract
Background. Metagenomics is an approach for characterizing environmental microbial communities in situ; it allows their functional and taxonomic characterization and the recovery of sequences from uncultured taxa. This is often achieved by a combination of sequence assembly and binning, where sequences are grouped into ‘bins’ representing taxa of the underlying microbial community. Assignment to low-ranking taxonomic bins is an important challenge for binning methods, as is scalability to Gb-sized datasets generated with deep sequencing techniques. One of the best available methods for species bins recovery from deep-branching phyla is the expert-trained PhyloPythiaS package, where a human expert decides on the taxa to incorporate in the model and identifies ‘training’ sequences based on marker genes directly from the sample. Due to the manual effort involved, this approach does not scale to multiple metagenome samples and requires substantial expertise, which researchers who are new to the area do not have. Results. We have developed PhyloPythiaS+, a successor to our PhyloPythia(S) software. The new (+) component performs the work previously done by the human expert. PhyloPythiaS+ also includes a new k-mer counting algorithm, which accelerated the simultaneous counting of 4–6-mers used for taxonomic binning 100-fold and reduced the overall execution time of the software by a factor of three. Our software allows the analysis of Gb-sized metagenomes with inexpensive hardware and the recovery of species- or genus-level bins with low error rates in a fully automated fashion. PhyloPythiaS+ was compared to MEGAN, taxator-tk, Kraken and the generic PhyloPythiaS model. The results showed that PhyloPythiaS+ performs especially well for samples originating from novel environments in comparison to the other methods. Availability. PhyloPythiaS+ in a virtual machine is available for installation under Windows, Unix systems or OS X on: https://github.com/algbioi/ppsp/wiki.
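To make "simultaneous counting of 4–6-mers" concrete, here is a naive illustrative sketch. It is not the accelerated algorithm from PhyloPythiaS+, which the abstract does not describe; it only shows what counting several k-mer lengths in a single pass over a sequence means. The function name and parameters are hypothetical.

```python
from collections import Counter


def count_kmers(sequence, k_min=4, k_max=6):
    """Count all k-mers for k in [k_min, k_max] in one pass over the sequence.

    Windows containing characters other than A, C, G, T are skipped.
    """
    sequence = sequence.upper()
    counts = {k: Counter() for k in range(k_min, k_max + 1)}
    valid = set("ACGT")
    for start in range(len(sequence) - k_min + 1):
        for k in range(k_min, k_max + 1):
            kmer = sequence[start:start + k]
            if len(kmer) < k:
                break  # ran off the end; longer k will not fit either
            if not valid.issuperset(kmer):
                break  # an ambiguous base here also spoils longer windows
            counts[k][kmer] += 1
    return counts


counts = count_kmers("ACGTACGTAC")
```

Each start position is visited once and extended from the shortest to the longest k, so an ambiguous base or the end of the sequence stops all longer windows at that position early.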
Affiliation(s)
- Ivan Gregor
  - Max-Planck Research Group for Computational Genomics and Epidemiology, Max-Planck Institute for Informatics, Saarbrücken, Germany; Department of Algorithmic Bioinformatics, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany; Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Johannes Dröge
  - Max-Planck Research Group for Computational Genomics and Epidemiology, Max-Planck Institute for Informatics, Saarbrücken, Germany; Department of Algorithmic Bioinformatics, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany; Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Melanie Schirmer
  - The Broad Institute of MIT and Harvard, Cambridge, MA, United States
- Alice C McHardy
  - Max-Planck Research Group for Computational Genomics and Epidemiology, Max-Planck Institute for Informatics, Saarbrücken, Germany; Department of Algorithmic Bioinformatics, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany; Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
8. Koslicki D, Chatterjee S, Shahrivar D, Walker AW, Francis SC, Fraser LJ, Vehkaperä M, Lan Y, Corander J. ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition. PLoS One 2015; 10:e0140644. [PMID: 26496191 PMCID: PMC4619776 DOI: 10.1371/journal.pone.0140644] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/20/2015] [Accepted: 09/28/2015] [Indexed: 11/17/2022]
Abstract
Motivation Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. Results There has been a recent surge of interest in using compressed sensing inspired and convex-optimization based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets with reasonable computational cost to provide several vectors of first order statistics instead of only single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. Availability An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
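The aggregation idea in this abstract can be sketched as follows: reads are clustered with K-means, and each cluster contributes its own k-mer frequency vector and weight instead of one global summary. This is a minimal sketch under assumptions, not the ARK implementation: it assumes that clustering is done on per-read k-mer profiles, uses a very small k, and omits the downstream estimation step. All names are hypothetical.

```python
import random
from itertools import product


def kmer_profile(read, k=2):
    """Return the normalized k-mer frequency vector of one read."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vec = [0.0] * len(index)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:
            vec[j] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec


def kmeans(points, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            labels[i] = dists.index(min(dists))
        # Update step: move each center to the mean of its members
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def aggregate_reads(reads, n_clusters=2, k=2):
    """Summarize reads as one (weight, mean k-mer profile) pair per cluster."""
    profiles = [kmer_profile(r, k) for r in reads]
    labels = kmeans(profiles, n_clusters)
    summaries = []
    for c in range(n_clusters):
        members = [p for p, lab in zip(profiles, labels) if lab == c]
        if members:
            mean = [sum(col) / len(members) for col in zip(*members)]
            summaries.append((len(members) / len(reads), mean))
    return summaries


reads = ["AAAAAAAAAA", "AAAAAAAAAT", "GCGCGCGCGC", "GCGCGCGCGG", "CGCGCGCGCG"]
summaries = aggregate_reads(reads)
```

Each `(weight, profile)` pair could then be passed separately to whichever composition estimator is in use and the results combined by weight, which is the role the abstract assigns to the step after clustering.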
Affiliation(s)
- David Koslicki
  - Dept of Mathematics, Oregon State University, Corvallis, United States of America
- Saikat Chatterjee
  - Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden
- Damon Shahrivar
  - Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden
- Alan W Walker
  - Microbiology Group, Rowett Institute of Nutrition and Health, University of Aberdeen, Aberdeen, United Kingdom
- Suzanna C Francis
  - MRC Tropical Epidemiology Group, London School of Hygiene and Tropical Medicine, London, United Kingdom
- Louise J Fraser
  - Illumina Cambridge Ltd., Chesterford Research Park, Essex, United Kingdom
- Mikko Vehkaperä
  - Dept of Electronic and Electrical Engineering, University of Sheffield, Sheffield, United Kingdom
- Yueheng Lan
  - Dept of Physics, Tsinghua University, Beijing, China
- Jukka Corander
  - Dept of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
9. Zepeda Mendoza ML, Sicheritz-Pontén T, Gilbert MTP. Environmental genes and genomes: understanding the differences and challenges in the approaches and software for their analyses. Brief Bioinform 2015; 16:745-58. [PMID: 25673291 PMCID: PMC4570204 DOI: 10.1093/bib/bbv001] [Citation(s) in RCA: 49] [Impact Index Per Article: 5.4] [Received: 11/07/2014] [Revised: 12/16/2014] [Indexed: 01/19/2023]
Abstract
DNA-based taxonomic and functional profiling is widely used for the characterization of organismal communities across a rapidly increasing array of research areas that include the role of microbiomes in health and disease, biomonitoring, and estimation of both microbial and metazoan species richness. Two principal approaches are currently used to assign taxonomy to DNA sequences: DNA metabarcoding and metagenomics. When initially developed, each of these approaches mandated their own particular methods for data analysis; however, with the development of high-throughput sequencing (HTS) techniques they have begun to share many aspects in data set generation and processing. In this review we aim to define the current characteristics, goals and boundaries of each field, and describe the different software used for their analysis. We argue that an appreciation of the potential and limitations of each method can help underscore the improvements required by each field so as to better exploit the richness of current HTS-based data sets.
10. McNair K, Edwards RA. GenomePeek-an online tool for prokaryotic genome and metagenome analysis. PeerJ 2015; 3:e1025. [PMID: 26157610 PMCID: PMC4476108 DOI: 10.7717/peerj.1025] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 10/04/2014] [Accepted: 05/25/2015] [Indexed: 12/23/2022]
Abstract
As more and more prokaryotic sequencing takes place, a method to quickly and accurately analyze this data is needed. Previous tools are mainly designed for metagenomic analysis and have limitations, such as long runtimes and significant false positive error rates. The online tool GenomePeek (edwards.sdsu.edu/GenomePeek) was developed to analyze both single genome and metagenome sequencing files, quickly and with low error rates. GenomePeek uses a sequence assembly approach in which reads matching a set of conserved genes are extracted, assembled and then aligned against the highly specific reference database. GenomePeek was found to be faster than traditional approaches while still keeping error rates low, as well as offering unique data visualization options.
Affiliation(s)
- Katelyn McNair
  - Department of Computer Science, San Diego State University, San Diego, CA, USA; Department of Biology, San Diego State University, San Diego, CA, USA
- Robert A Edwards
  - Department of Computer Science, San Diego State University, San Diego, CA, USA; Department of Biology, San Diego State University, San Diego, CA, USA; Computational Sciences Research Center, San Diego State University, San Diego, CA, USA; Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
11. Lam KN, Charles TC. Strong spurious transcription likely contributes to DNA insert bias in typical metagenomic clone libraries. Microbiome 2015; 3:22. [PMID: 26056565 PMCID: PMC4459075 DOI: 10.1186/s40168-015-0086-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 12/09/2014] [Accepted: 05/01/2015] [Indexed: 05/24/2023]
Abstract
BACKGROUND Clone libraries provide researchers with a powerful resource to study nucleic acid from diverse sources. Metagenomic clone libraries in particular have aided in studies of microbial biodiversity and function, and allowed the mining of novel enzymes. Libraries are often constructed by cloning large inserts into cosmid or fosmid vectors. Recently, there have been reports of GC bias in fosmid metagenomic libraries, and it was speculated to be a result of fragmentation and loss of AT-rich sequences during cloning. However, evidence in the literature suggests that transcriptional activity or gene product toxicity may play a role. RESULTS To explore possible mechanisms responsible for sequence bias in clone libraries, we constructed a cosmid library from a human microbiome sample and sequenced DNA from different steps during library construction: crude extract DNA, size-selected DNA, and cosmid library DNA. We confirmed a GC bias in the final cosmid library, and we provide evidence that the bias is not due to fragmentation and loss of AT-rich sequences but is likely occurring after DNA is introduced into Escherichia coli. To investigate the influence of strong constitutive transcription, we searched the sequence data for promoters and found that rpoD/σ(70) promoter sequences were underrepresented in the cosmid library. Furthermore, when we examined the genomes of taxa that were differentially abundant in the cosmid library relative to the original sample, we found the bias to be more correlated with the number of rpoD/σ(70) consensus sequences in the genome than with simple GC content. CONCLUSIONS The GC bias of metagenomic libraries does not appear to be due to DNA fragmentation. Rather, analysis of promoter sequences provides support for the hypothesis that strong constitutive transcription from sequences recognized as rpoD/σ(70) consensus-like in E. coli may lead to instability, causing loss of the plasmid or loss of the insert DNA that gives rise to the transcription. Despite widespread use of E. coli to propagate foreign DNA in metagenomic libraries, the effects of in vivo transcriptional activity on clone stability are not well understood. Further work is required to tease apart the effects of transcription from those of gene product toxicity.
Affiliation(s)
- Kathy N. Lam
  - Department of Biology, University of Waterloo, Waterloo, ON Canada
12. Lindner MS, Renard BY. Metagenomic profiling of known and unknown microbes with microbeGPS. PLoS One 2015; 10:e0117711. [PMID: 25643362 PMCID: PMC4314203 DOI: 10.1371/journal.pone.0117711] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 10/06/2014] [Accepted: 12/29/2014] [Indexed: 11/19/2022]
Abstract
Microbial community profiling identifies and quantifies organisms in metagenomic sequencing data using either reference based or unsupervised approaches. However, current reference based profiling methods only report the presence and abundance of single reference genomes that are available in databases. Since only a small fraction of environmental genomes is represented in genomic databases, these approaches entail the risk of false identifications and often suggest a higher precision than justified by the data. Therefore, we developed MicrobeGPS, a novel metagenomic profiling approach that overcomes these limitations. MicrobeGPS is the first method that identifies microbiota in the sample and estimates their genomic distances to known reference genomes. With this strategy, MicrobeGPS identifies organisms down to the strain level and highlights possibly inaccurate identifications when the correct reference genome is missing. We demonstrate on three metagenomic datasets with different origin that our approach successfully avoids misleading interpretation of results and additionally provides more accurate results than current profiling methods. Our results indicate that MicrobeGPS can enable reference based taxonomic profiling of complex and less characterized microbial communities. MicrobeGPS is open source and available from https://sourceforge.net/projects/microbegps/ as source code and binary distribution for Windows and Linux operating systems.
Affiliation(s)
- Martin S. Lindner
  - Research Group Bioinformatics (NG4), Robert Koch Institute, Berlin, Germany
- Bernhard Y. Renard
  - Research Group Bioinformatics (NG4), Robert Koch Institute, Berlin, Germany
13. AKE - the Accelerated k-mer Exploration web-tool for rapid taxonomic classification and visualization. BMC Bioinformatics 2014; 15:384. [PMID: 25495116 PMCID: PMC4307196 DOI: 10.1186/s12859-014-0384-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 07/17/2014] [Accepted: 11/12/2014] [Indexed: 11/22/2022]
Abstract
Background With the advent of low-cost, fast sequencing technologies, metagenomic analyses are made possible. The large data volumes gathered by these techniques and the unpredictable diversity captured in them are still, however, a challenge for computational biology. Results In this paper we address the problem of rapid taxonomic assignment with small and adaptive data models (< 5 MB) and present the accelerated k-mer explorer (AKE). Acceleration in AKE’s taxonomic assignments is achieved by a special machine learning architecture, which is well suited to model data collections that are intrinsically hierarchical. We report reasonably good classification accuracy for ranks down to order, observed in a study on real-world data (Acid Mine Drainage, Cow Rumen). Conclusion We show that the execution time of this approach is orders of magnitude shorter than that of competing approaches and that accuracy is comparable. The tool is presented to the public as a web application (url: https://ani.cebitec.uni-bielefeld.de/ake/, username: bmc, password: bmcbioinfo).
14. Faisal A, Peltonen J, Georgii E, Rung J, Kaski S. Toward computational cumulative biology by combining models of biological datasets. PLoS One 2014; 9:e113053. [PMID: 25427176 PMCID: PMC4245117 DOI: 10.1371/journal.pone.0113053] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 07/04/2014] [Accepted: 10/17/2014] [Indexed: 11/21/2022]
Abstract
A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations—for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.
Affiliation(s)
- Ali Faisal
  - Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland
- Jaakko Peltonen
  - Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland
- Elisabeth Georgii
  - Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland
- Johan Rung
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, United Kingdom
- Samuel Kaski
  - Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
15.
Abstract
Traditionally, microbial genome sequencing has been restricted to the small number of species that can be grown in pure culture. The progressive development of culture-independent methods over the last 15 years now allows researchers to sequence microbial communities directly from environmental samples. This approach is commonly referred to as "metagenomics" or "community genomics". However, the term metagenomics is applied liberally in the literature to describe any culture-independent analysis of microbial communities. Here, we define metagenomics as shotgun ("random") sequencing of the genomic DNA of a sample taken directly from the environment. The metagenome can be thought of as a sampling of the collective genome of the microbial community. We outline the considerations and analyses that should be undertaken to ensure the success of a metagenomic sequencing project, including the choice of sequencing platform and methods for assembly, binning, annotation, and comparative analysis.
Affiliation(s)
- Lauren Bragg
- Advanced Water Management Centre, The University of Queensland, St. Lucia, QLD, Australia
|
16
|
Seok HS, Hong W, Kim J. Estimating the composition of species in metagenomes by clustering of next-generation read sequences. Methods 2014; 69:213-9. [PMID: 25072168 DOI: 10.1016/j.ymeth.2014.07.009] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 03/20/2014] [Revised: 07/16/2014] [Accepted: 07/18/2014] [Indexed: 11/26/2022] Open
Abstract
Faster and cheaper sequencing technologies, together with the ability to sequence uncultured microbes collected from any environment, present us with an opportunity to distill meaningful information from the millions of new genomic sequences obtained from environmental samples, collectively called a metagenome. In contrast to data from conventionally cultured microbes, however, metagenomic data is extremely heterogeneous and noisy. Therefore, separating the sets of sequenced genomic fragments that belong to different microbes is essential for successful assembly of microbial genomes. In this paper, we present a novel clustering method for a given metagenomic dataset. A metagenomic dataset has some distinguishing features because (i) similar sequence patterns may exist in different species and (ii) each species has a different number of individuals in the dataset. Our method overcomes these obstacles by using the Gaussian mixture model and analysis of mixture profiles, and by taking advantage of genomic signatures extracted from the metagenomic dataset. Unlike conventional clustering methods, where clusters are discovered through global similarities of data instances, our method builds clusters by combining the data instances that share local similarities captured by mixture analysis. By considering shared mixture components, our method is able to create clusters of genomic sequences even though they are globally distinct from each other. We applied our method to an artificial metagenomic dataset comprising 47 million simulated reads from 25 real microbial genomes, and analyzed the resulting clusters in terms of the number of clusters, the number of participating species, and the dominant species in each cluster. Even though our approach cannot address all challenges in the field of metagenome sequence clustering, we believe that our method can contribute to taking a step forward toward these goals.
Affiliation(s)
- Ho-Sik Seok
- Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea.
- Woonyoung Hong
- Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea.
- Jaebum Kim
- Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea.
|
17
|
An L, Pookhao N, Jiang H, Xu J. Statistical approach of functional profiling for a microbial community. PLoS One 2014; 9:e106588. [PMID: 25198674 PMCID: PMC4157783 DOI: 10.1371/journal.pone.0106588] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/08/2014] [Accepted: 07/31/2014] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Metagenomics is a relatively new but fast growing field within environmental biology and medical sciences. It enables researchers to understand the diversity of microbes, their functions, cooperation, and evolution in a particular ecosystem. Traditional methods in genomics and microbiology are not efficient in capturing the structure of the microbial community in an environment. Nowadays, high-throughput next-generation sequencing technologies are powerfully driving the metagenomic studies. However, there is an urgent need to develop efficient statistical methods and computational algorithms to rapidly analyze the massive metagenomic short sequencing data and to accurately detect the features/functions present in the microbial community. Although several issues about functions of metagenomes at pathways or subsystems level have been investigated, there is a lack of studies focusing on functional analysis at a low level of a hierarchical functional tree, such as SEED subsystem tree. RESULTS A two-step statistical procedure (metaFunction) is proposed to detect all possible functional roles at the low level from a metagenomic sample/community. In the first step a statistical mixture model is proposed at the base of gene codons to estimate the abundances for the candidate functional roles, with sequencing error being considered. As a gene could be involved in multiple biological processes the functional assignment is therefore adjusted by utilizing an error distribution in the second step. The performance of the proposed procedure is evaluated through comprehensive simulation studies. Compared with other existing methods in metagenomic functional analysis the new approach is more accurate in assigning reads to functional roles, and therefore at more general levels. The method is also employed to analyze two real data sets. CONCLUSIONS metaFunction is a powerful tool in accurate profiling functions in a metagenomic sample.
Affiliation(s)
- Lingling An
- Department of Agricultural & Biosystems Engineering, University of Arizona, Tucson, Arizona, United States of America
- Interdisciplinary Programs in Statistics, University of Arizona, Tucson, Arizona, United States of America
- Nauromal Pookhao
- Department of Agricultural & Biosystems Engineering, University of Arizona, Tucson, Arizona, United States of America
- Hongmei Jiang
- Department of Statistics, Northwestern University, Evanston, Illinois, United States of America
- Jiannong Xu
- Department of Biology, New Mexico State University, Las Cruces, New Mexico, United States of America
|
18
|
Wood GR, Ryabov EV, Fannon JM, Moore JD, Evans DJ, Burroughs N. MosaicSolver: a tool for determining recombinants of viral genomes from pileup data. Nucleic Acids Res 2014; 42:e123. [PMID: 25120266 PMCID: PMC4176379 DOI: 10.1093/nar/gku524] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 02/04/2023] Open
Abstract
Viral recombination is a key evolutionary mechanism, aiding escape from host immunity, contributing to changes in tropism and possibly assisting transmission across species barriers. The ability to determine whether recombination has occurred and to locate associated specific recombination junctions is thus of major importance in understanding emerging diseases and pathogenesis. This paper describes a method for determining recombinant mosaics (and their proportions) originating from two parent genomes, using high-throughput sequence data. The method involves setting the problem geometrically and the use of appropriately constrained quadratic programming. Recombinants of the honeybee deformed wing virus and the Varroa destructor virus-1 are inferred to illustrate the method from both siRNAs and reads sampling the viral genome population (cDNA library); our results are confirmed experimentally. Matlab software (MosaicSolver) is available.
Affiliation(s)
- Graham R Wood
- Warwick Systems Biology Centre, Senate House, University of Warwick, Coventry, CV4 7AL, UK
- Eugene V Ryabov
- School of Life Sciences, University of Warwick, Coventry, CV4 7AL, UK
- Jessica M Fannon
- School of Life Sciences, University of Warwick, Coventry, CV4 7AL, UK
- Jonathan D Moore
- Warwick Systems Biology Centre, Senate House, University of Warwick, Coventry, CV4 7AL, UK
- David J Evans
- School of Life Sciences, University of Warwick, Coventry, CV4 7AL, UK
- Nigel Burroughs
- Warwick Systems Biology Centre, Senate House, University of Warwick, Coventry, CV4 7AL, UK
|
19
|
Exploring neighborhoods in the metagenome universe. Int J Mol Sci 2014; 15:12364-78. [PMID: 25026170 PMCID: PMC4139848 DOI: 10.3390/ijms150712364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2014] [Accepted: 06/25/2014] [Indexed: 11/16/2022] Open
Abstract
The variety of metagenomes in current databases provides a rapidly growing source of information for comparative studies. However, the quantity and quality of supplementary metadata is still lagging behind. It is therefore important to be able to identify related metagenomes by means of the available sequence data alone. We have studied efficient sequence-based methods for large-scale identification of similar metagenomes within a database retrieval context. In a broad comparison of different profiling methods we found that vector-based distance measures are well suited for the detection of metagenomic neighbors. Our evaluation on more than 1700 publicly available metagenomes indicates that for a query metagenome from a particular habitat on average nine out of ten nearest neighbors represent the same habitat category independent of the utilized profiling method or distance measure. While for well-defined labels a neighborhood accuracy of 100% can be achieved, in general the neighbor detection is severely affected by a natural overlap of manually annotated categories. In addition, we present results of a novel visualization method that is able to reflect the similarity of metagenomes in a 2D scatter plot. The visualization method shows a similarly high accuracy in the reduced space as compared with the high-dimensional profile space. Our study suggests that for inspection of metagenome neighborhoods the profiling methods and distance measures can be chosen to provide a convenient interpretation of results in terms of the underlying features. Furthermore, supplementary metadata of metagenome samples in the future needs to comply with readily available ontologies for fine-grained and standardized annotation. To make profile-based k-nearest-neighbor search and the 2D-visualization of the metagenome universe available to the research community, we included the proposed methods in our CoMet-Universe server for comparative metagenome analysis.
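The profile-based k-nearest-neighbor search described in this abstract can be illustrated with a minimal sketch. It assumes each metagenome has already been summarized as a fixed-length profile vector and uses Euclidean distance as one possible vector-based measure. The function names, toy profiles and habitat labels are hypothetical and are not taken from the CoMet-Universe server.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length profile vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbors(query, database, k=3):
    """Return the habitat labels of the k profiles closest to `query`.

    `database` is a list of (label, profile) pairs (hypothetical layout).
    """
    ranked = sorted(database, key=lambda item: euclidean(query, item[1]))
    return [label for label, _ in ranked[:k]]


def neighborhood_accuracy(labels, true_label):
    """Fraction of retrieved neighbors that share the query's habitat label."""
    return sum(label == true_label for label in labels) / len(labels)


# Toy profiles (made-up numbers, three features per metagenome).
database = [
    ("soil", [0.60, 0.30, 0.10]),
    ("soil", [0.55, 0.35, 0.10]),
    ("marine", [0.10, 0.20, 0.70]),
    ("marine", [0.15, 0.15, 0.70]),
    ("gut", [0.30, 0.60, 0.10]),
]
query = [0.58, 0.32, 0.10]  # assumed to come from a soil sample

neighbors = nearest_neighbors(query, database, k=3)
print(neighbors, neighborhood_accuracy(neighbors, "soil"))
```

With only two soil profiles in the toy database, the third neighbor necessarily comes from another habitat, which loosely mirrors how overlapping or sparse categories limit the neighborhood accuracy.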
|
20
|
Silva GGZ, Cuevas DA, Dutilh BE, Edwards RA. FOCUS: an alignment-free model to identify organisms in metagenomes using non-negative least squares. PeerJ 2014; 2:e425. [PMID: 24949242 PMCID: PMC4060023 DOI: 10.7717/peerj.425] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Received: 03/19/2014] [Accepted: 05/21/2014] [Indexed: 12/11/2022] Open
Abstract
One of the major goals in metagenomics is to identify the organisms present in a microbial community from unannotated shotgun sequencing reads. Taxonomic profiling has valuable applications in biological and medical research, including disease diagnostics. Most currently available approaches do not scale well with increasing data volumes, which is important because both the number and lengths of the reads provided by sequencing platforms keep increasing. Here we introduce FOCUS, an agile composition-based approach using non-negative least squares (NNLS) to report the organisms present in metagenomic samples and profile their abundances. FOCUS was tested with simulated and real metagenomes, and the results show that our approach accurately predicts the organisms present in microbial communities. FOCUS was implemented in Python. The source code and web server are freely available at http://edwards.sdsu.edu/FOCUS.
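To make the NNLS idea concrete, the following is a minimal sketch and not the FOCUS implementation. It assumes that "composition" means normalized k-mer frequencies, models a sample profile as a non-negative mixture of reference profiles, and solves for the mixture weights with a simple projected-gradient loop, a solver swapped in here only to keep the example dependency-free. The function names and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Normalized k-mer frequency vector of a DNA string (toy composition profile)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total for m in kmers]


def nnls_projected_gradient(columns, b, iters=2000):
    """Minimize 0.5 * ||A x - b||^2 subject to x >= 0.

    `columns` holds the columns of A (one reference profile each).
    """
    n = len(columns)
    # trace(A^T A) bounds the largest eigenvalue of A^T A, so 1/trace is a safe step.
    step = 1.0 / sum(v * v for col in columns for v in col)
    x = [0.0] * n
    for _ in range(iters):
        residual = [sum(columns[j][i] * x[j] for j in range(n)) - b[i]
                    for i in range(len(b))]
        grad = [sum(c * r for c, r in zip(col, residual)) for col in columns]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, xj - step * g) for xj, g in zip(x, grad)]
    return x


# Two made-up reference "genomes" with very different composition.
ref_a = "AAAAATAAATTAAATAATAAAA"
ref_b = "GCGCGGCCGCGGGCCGCGCC"
pa, pb = kmer_profile(ref_a), kmer_profile(ref_b)

# Synthetic sample profile: 70% organism A, 30% organism B.
sample = [0.7 * a + 0.3 * b for a, b in zip(pa, pb)]
weights = nnls_projected_gradient([pa, pb], sample)
print([round(w, 3) for w in weights])
```

The non-negativity constraint is what makes the weights interpretable as abundances. In practice a dedicated NNLS solver and a much larger reference set would be used.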
Affiliation(s)
- Daniel A Cuevas
- Computational Science Research Center, San Diego State University, San Diego, CA, USA
- Bas E Dutilh
- Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Centre, GA, Nijmegen, The Netherlands; Department of Marine Biology, Institute of Biology, Federal University of Rio de Janeiro, Brazil
- Robert A Edwards
- Computational Science Research Center, San Diego State University, San Diego, CA, USA; Department of Computer Science, San Diego State University, San Diego, CA, USA; Department of Biology, San Diego State University, San Diego, CA, USA; Department of Marine Biology, Institute of Biology, Federal University of Rio de Janeiro, Brazil; Division of Mathematics and Computer Science, Argonne National Laboratory, Argonne, IL, USA
|
21
|
Chatterjee S, Koslicki D, Dong S, Innocenti N, Cheng L, Lan Y, Vehkaperä M, Skoglund M, Rasmussen LK, Aurell E, Corander J. SEK: sparsity exploiting k-mer-based estimation of bacterial community composition. Bioinformatics 2014; 30:2423-31. [PMID: 24812337 DOI: 10.1093/bioinformatics/btu320] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. As the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically time-consuming in a desktop computing environment. RESULTS Using sparsity enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task, and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm for a fast solution. Our approach offers a reasonably fast community composition estimation method, which is shown to be more robust to input data variation than a recently introduced related method. AVAILABILITY AND IMPLEMENTATION A platform-independent Matlab implementation of the method is freely available at http://www.ee.kth.se/ctsoftware; source code that does not require access to Matlab is currently being tested and will be made available later through the above Web site.
Affiliation(s)
- Saikat Chatterjee
- Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
- David Koslicki
- Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
- Siyuan Dong
- Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
- Nicolas Innocenti
- Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
- Lu Cheng
- Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
- Yueheng Lan
- Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
- Mikko Vehkaperä
- Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
- Mikael Skoglund
- Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
- Lars K Rasmussen
- Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
- Erik Aurell
- Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
- Jukka Corander
- Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
|
22
|
Hauser PM, Bernard T, Greub G, Jaton K, Pagni M, Hafen GM. Microbiota present in cystic fibrosis lungs as revealed by whole genome sequencing. PLoS One 2014; 9:e90934. [PMID: 24599149 PMCID: PMC3944733 DOI: 10.1371/journal.pone.0090934] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Received: 12/02/2013] [Accepted: 02/06/2014] [Indexed: 01/13/2023] Open
Abstract
Determination of the precise composition and variation of microbiota in cystic fibrosis lungs is crucial since chronic inflammation due to microorganisms leads to lung damage and ultimately, death. However, this constitutes a major technical challenge. Culturing of microorganisms does not provide a complete representation of a microbiota, even when using culturomics (high-throughput culture). So far, only PCR-based metagenomics have been investigated. However, these methods are biased towards certain microbial groups, and suffer from uncertain quantification of the different microbial domains. We have explored whole genome sequencing (WGS) using the Illumina high-throughput technology applied directly to DNA extracted from sputa obtained from two cystic fibrosis patients. To detect all microorganism groups, we used four procedures for DNA extraction, each with a different lysis protocol. We avoided biases due to whole DNA amplification thanks to the high efficiency of current Illumina technology. Phylogenomic classification of the reads by three different methods produced similar results. Our results suggest that WGS provides, in a single analysis, a better qualitative and quantitative assessment of microbiota compositions than cultures and PCRs. WGS identified a high quantity of Haemophilus spp. (patient 1) or Staphylococcus spp. plus Streptococcus spp. (patient 2) together with low amounts of anaerobic (Veillonella, Prevotella, Fusobacterium) and aerobic bacteria (Gemella, Moraxella, Granulicatella). WGS suggested that fungal members represented very low proportions of the microbiota, which were detected by cultures and PCRs because of their selectivity. The future increase of reads' sizes and decrease in cost should ensure the usefulness of WGS for the characterisation of microbiota.
Affiliation(s)
- Philippe M. Hauser
- Institute of Microbiology, Centre Hospitalier Universitaire Vaudois and University of Lausanne, Lausanne, Switzerland
- Thomas Bernard
- Vital-IT Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Gilbert Greub
- Institute of Microbiology, Centre Hospitalier Universitaire Vaudois and University of Lausanne, Lausanne, Switzerland
- Katia Jaton
- Institute of Microbiology, Centre Hospitalier Universitaire Vaudois and University of Lausanne, Lausanne, Switzerland
- Marco Pagni
- Vital-IT Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Gaudenz M. Hafen
- Department of Paediatrics, Pulmonary Unit, Centre Hospitalier Universitaire Vaudois and University of Lausanne, Lausanne, Switzerland
|
23
|
Johansen P, Vindeløv J, Arneborg N, Brockmann E. Development of quantitative PCR and metagenomics-based approaches for strain quantification of a defined mixed-strain starter culture. Syst Appl Microbiol 2014; 37:186-93. [PMID: 24582508 DOI: 10.1016/j.syapm.2013.12.006] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 07/05/2013] [Revised: 12/04/2013] [Accepted: 12/23/2013] [Indexed: 11/30/2022]
Abstract
Although the strain composition of mixed cultures may hugely affect production of various fermented foods, such as cheese, tools for investigating it have so far been limited. In this study, two new approaches for quantification of seven Lactococcus lactis subsp. cremoris strains (S1-S7) in a defined mixed-strain starter culture were developed and verified. By mapping NGS reads from 47 sequenced L. lactis strains to de novo assembly contigs of the seven strains, two strain-specific sequence regions (SEQ1 and SEQ2) were identified for each strain for qPCR primer design (A1 and A2). The qPCR assays amplified their strain-specific sequence region target efficiently. Additionally, high reproducibility was obtained in a validation sample containing equal amounts of the seven strains, and assay-to-assay coefficients of variation (CVs) for six (i.e. S1, S2, S4-S7) of the seven strains correlated to the inter-plate CVs. Hence, at least for six strains, the qPCR assay design approach was successful. The metagenomics-based approach quantified the seven strains based on average coverage of SEQ1 and SEQ2 by mapping sequencing reads from the validation sample to the strain-specific sequence regions. Average coverages of SEQ1 and SEQ2 in the metagenomics data showed CVs of ≤17.3% for six strains (i.e. S1-S4, S6, S7). Thus, the metagenomics-based quantification approach was considered successful for six strains, regardless of the strain-specific sequence region used. When comparing qPCR- and metagenomics-based quantifications of the validation sample, the identified strain-specific sequence regions were considered suitable and applicable for quantification at a strain level of defined mixed-strain starter cultures.
Affiliation(s)
- Pernille Johansen
- Department of Food Science, Faculty of Science, University of Copenhagen, Rolighedsvej 30, 1958 Frederiksberg C, Denmark
- Nils Arneborg
- Department of Food Science, Faculty of Science, University of Copenhagen, Rolighedsvej 30, 1958 Frederiksberg C, Denmark.
- Elke Brockmann
- Chr. Hansen A/S, Bøge Allé 10-12, 2970 Hørsholm, Denmark
|
24
|
Roberts A, Feng H, Pachter L. Fragment assignment in the cloud with eXpress-D. BMC Bioinformatics 2013; 14:358. [PMID: 24314033 PMCID: PMC3881492 DOI: 10.1186/1471-2105-14-358] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 09/13/2013] [Accepted: 11/18/2013] [Indexed: 11/21/2022] Open
Abstract
Background Probabilistic assignment of ambiguously mapped fragments produced by high-throughput sequencing experiments has been demonstrated to greatly improve accuracy in the analysis of RNA-Seq and ChIP-Seq, and is an essential step in many other sequence census experiments. A maximum likelihood method using the expectation-maximization (EM) algorithm for optimization is commonly used to solve this problem. However, batch EM-based approaches do not scale well with the size of sequencing datasets, which have been increasing dramatically over the past few years. Thus, current approaches to fragment assignment rely on heuristics or approximations for tractability. Results We present an implementation of a distributed EM solution to the fragment assignment problem using Spark, a data analytics framework that can scale by leveraging compute clusters within datacenters (“the cloud”). We demonstrate that our implementation easily scales to billions of sequenced fragments, while providing the exact maximum likelihood assignment of ambiguous fragments. The method is shown to be more accurate than the most widely used tools available, and it can be run in a constant amount of time when cluster resources are scaled linearly with the amount of input data. Conclusions The cloud offers one solution for the difficulties faced in the analysis of massive high-throughput sequencing data, which continue to grow rapidly. Researchers in bioinformatics must follow developments in distributed systems, such as new frameworks like Spark, for ways to port existing methods to the cloud and help them scale to the datasets of the future. Our software, eXpress-D, is freely available at: http://github.com/adarob/express-d.
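As an illustration of the batch EM scheme this abstract refers to, here is a minimal single-machine sketch. It is not the eXpress-D code and does not use Spark. It assumes a deliberately simplified model in which each fragment is equally compatible with every target it maps to, so target lengths and alignment likelihoods are ignored. The function name and the toy data are hypothetical.

```python
def em_fragment_assignment(compat, n_targets, iters=200):
    """Estimate relative target abundances from ambiguously mapped fragments.

    `compat` is a list with one entry per fragment: the list of target
    indices that fragment maps to (hypothetical input layout).
    """
    theta = [1.0 / n_targets] * n_targets  # start from uniform abundances
    for _ in range(iters):
        totals = [0.0] * n_targets
        # E-step: split each fragment among its compatible targets
        # in proportion to the current abundance estimates.
        for targets in compat:
            z = sum(theta[t] for t in targets)
            for t in targets:
                totals[t] += theta[t] / z
        # M-step: new abundances are the normalized expected counts.
        theta = [c / len(compat) for c in totals]
    return theta


# Toy data: 6 fragments unique to target 0, 2 unique to target 1,
# and 4 fragments that map ambiguously to both.
compat = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
theta = em_fragment_assignment(compat, n_targets=2)
print([round(t, 4) for t in theta])
```

In this toy case the ambiguous fragments end up split 3:1, following the ratio implied by the uniquely mapped ones. The E-step loop over fragments is the part that a distributed framework would parallelize.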
Affiliation(s)
- Lior Pachter
- Department of Computer Science, 387 Soda Hall, UC Berkeley, Berkeley, CA 94720, USA.
|
25
|
Koslicki D, Foucart S, Rosen G. Quikr: a method for rapid reconstruction of bacterial communities via compressive sensing. Bioinformatics 2013; 29:2096-102. [PMID: 23786768 DOI: 10.1093/bioinformatics/btt336] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Many metagenomic studies compare hundreds to thousands of environmental and health-related samples by extracting and sequencing their 16S rRNA amplicons and measuring their similarity using beta-diversity metrics. However, one of the first steps, classifying the operational taxonomic units within the sample, can be computationally time-consuming because most methods rely on computing the taxonomic assignment of each individual read out of tens to hundreds of thousands of reads. RESULTS: We introduce Quikr: a QUadratic, K-mer-based, Iterative, Reconstruction method, which computes a vector of taxonomic assignments and their proportions in the sample using an optimization technique motivated by the mathematical theory of compressive sensing. On both simulated and actual biological data, we demonstrate that Quikr typically has less error and is typically orders of magnitude faster than the most commonly used taxonomic assignment technique (the Ribosomal Database Project's Naïve Bayesian Classifier). Furthermore, the technique is shown to be unaffected by the presence of chimeras, thereby allowing for the circumvention of the time-intensive step of chimera filtering. AVAILABILITY: The Quikr computational package (in MATLAB, Octave, Python and C) for the Linux and Mac platforms is available at http://sourceforge.net/projects/quikr/.
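The general idea of recovering taxon proportions from a whole-sample k-mer profile, rather than classifying reads one by one, can be sketched as follows. This is a simplified stand-in and not Quikr's algorithm: it omits any sparsity-promoting term and solves a plain nonnegative least-squares problem with multiplicative updates. The function names, the toy reference sequences and the choice k = 2 are all assumptions made for illustration.

```python
from itertools import product


def kmer_freqs(seq, k=2):
    """Normalized k-mer frequency vector of seq over the ACGT alphabet."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return [counts[km] / total for km in kmers]


def nonneg_least_squares(columns, s, iters=500):
    """Approximately minimize ||A x - s||^2 subject to x >= 0.

    columns: list of reference k-mer vectors (the columns of A).
    s:       k-mer vector of the sample.
    Uses multiplicative updates, which keep x nonnegative throughout.
    """
    n = len(columns)
    x = [1.0 / n] * n
    # Precompute A^T s and the Gram matrix A^T A.
    b = [sum(c * v for c, v in zip(col, s)) for col in columns]
    gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns]
            for ci in columns]
    for _ in range(iters):
        gx = [sum(gram[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] * b[i] / gx[i] if gx[i] > 0 else 0.0 for i in range(n)]
    return x


# Two toy reference sequences and a sample profile mixed 70% / 30%.
refs = ["ACGTACGTACGTACGT", "AAAACCCCGGGGTTTT"]
A = [kmer_freqs(r) for r in refs]
sample = [0.7 * a + 0.3 * b for a, b in zip(A[0], A[1])]
proportions = nonneg_least_squares(A, sample)
print(proportions)
```

The cost of the solve depends on the number of reference sequences and distinct k-mers, not on the number of reads, which is why profile-based reconstruction can be much faster than per-read classification.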
Affiliation(s)
- David Koslicki
- Mathematical Biosciences Institute, The Ohio State University, Columbus, OH 43201, USA.
|
26
|
Klingenberg H, Aßhauer KP, Lingner T, Meinicke P. Protein signature-based estimation of metagenomic abundances including all domains of life and viruses. Bioinformatics 2013; 29:973-80. [PMID: 23418187 PMCID: PMC3624802 DOI: 10.1093/bioinformatics/btt077] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Indexed: 11/16/2022]
Abstract
Motivation: Metagenome analysis requires tools that can estimate the taxonomic abundances in anonymous sequence data over the whole range of biological entities. Because there is usually no prior knowledge about the data composition, not only all domains of life but also viruses have to be included in taxonomic profiling. Such a full-range approach, however, is difficult to realize owing to the limited coverage of available reference data. In particular, archaea and viruses are generally not well represented by current genome databases. Results: We introduce a novel approach to taxonomic profiling of metagenomes that is based on mixture model analysis of protein signatures. Our results on simulated and real data reveal the difficulties of the existing methods when measuring archaeal or viral abundances and show the overall good profiling performance of the protein-based mixture model. As an application example, we provide a large-scale analysis of data from the Human Microbiome Project. This demonstrates the utility of our method as a first instance profiling tool for a fast estimate of the community structure. Availability: http://gobics.de/TaxyPro. Contact: pmeinic@gwdg.de. Supplementary information: Supplementary Material is available at Bioinformatics online.
Affiliation(s)
- Heiner Klingenberg
- Department of Bioinformatics, Institute for Microbiology and Genetics, University of Göttingen, Göttingen, Germany
|
27
|
Treangen TJ, Koren S, Sommer DD, Liu B, Astrovskaya I, Ondov B, Darling AE, Phillippy AM, Pop M. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol 2013; 14:R2. [PMID: 23320958 PMCID: PMC4053804 DOI: 10.1186/gb-2013-14-1-r2] [Citation(s) in RCA: 154] [Impact Index Per Article: 14.0] [Received: 06/04/2012] [Accepted: 01/15/2013] [Indexed: 12/31/2022] Open
Abstract
We describe MetAMOS, an open source and modular metagenomic assembly and analysis pipeline. MetAMOS represents an important step towards fully automated metagenomic analysis, starting with next-generation sequencing reads and producing genomic scaffolds, open-reading frames and taxonomic or functional annotations. MetAMOS can aid in reducing assembly errors, commonly encountered when assembling metagenomic samples, and improves taxonomic assignment accuracy while also reducing computational cost. MetAMOS can be downloaded from: https://github.com/treangen/MetAMOS.
|
28
|
Roberts A, Pachter L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Methods 2013; 10:71-3. [PMID: 23160280 PMCID: PMC3880119 DOI: 10.1038/nmeth.2251] [Citation(s) in RCA: 676] [Impact Index Per Article: 61.5] [Received: 04/30/2012] [Accepted: 10/26/2012] [Indexed: 11/12/2022]
Abstract
We present eXpress, a software package for efficient probabilistic assignment of ambiguously mapping sequenced fragments. eXpress uses a streaming algorithm with linear run time and constant memory use. It can determine abundances of sequenced molecules in real time and can be applied to ChIP-seq, metagenomics and other large-scale sequencing data. We demonstrate its use on RNA-seq data and show that eXpress achieves greater efficiency than other quantification methods.
Affiliation(s)
- Adam Roberts
- Department of Computer Science, University of California, Berkeley, USA
- Lior Pachter
- Department of Computer Science, University of California, Berkeley, USA
- Department of Mathematics, University of California, Berkeley, USA
- Department of Molecular and Cell Biology, University of California, Berkeley, USA
|
29
|
Teeling H, Glöckner FO. Current opportunities and challenges in microbial metagenome analysis--a bioinformatic perspective. Brief Bioinform 2012; 13:728-42. [PMID: 22966151 PMCID: PMC3504927 DOI: 10.1093/bib/bbs039] [Citation(s) in RCA: 148] [Impact Index Per Article: 12.3] [Received: 03/30/2012] [Accepted: 06/09/2012] [Indexed: 12/21/2022] Open
Abstract
Metagenomics has become an indispensable tool for studying the diversity and metabolic potential of environmental microbes, whose bulk is as yet non-cultivable. Continual progress in next-generation sequencing allows for generating increasingly large metagenomes and studying multiple metagenomes over time or space. Recently, a new type of holistic ecosystem study has emerged that seeks to combine metagenomics with biodiversity, meta-expression and contextual data. Such 'ecosystems biology' approaches bear the potential to not only advance our understanding of environmental microbes to a new level but also impose challenges due to increasing data complexities, in particular with respect to bioinformatic post-processing. This mini review aims to address selected opportunities and challenges of modern metagenomics from a bioinformatics perspective and hopefully will serve as a useful resource for microbial ecologists and bioinformaticians alike.
|
30
|
Jiang H, An L, Lin SM, Feng G, Qiu Y. A statistical framework for accurate taxonomic assignment of metagenomic sequencing reads. PLoS One 2012; 7:e46450. [PMID: 23049702 PMCID: PMC3462201 DOI: 10.1371/journal.pone.0046450] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Received: 07/16/2012] [Accepted: 08/30/2012] [Indexed: 11/19/2022] Open
Abstract
The advent of next-generation sequencing technologies has greatly promoted the field of metagenomics which studies genetic material recovered directly from an environment. Characterization of genomic composition of a metagenomic sample is essential for understanding the structure of the microbial community. Multiple genomes contained in a metagenomic sample can be identified and quantitated through homology searches of sequence reads with known sequences catalogued in reference databases. Traditionally, reads with multiple genomic hits are assigned to non-specific or high ranks of the taxonomy tree, thereby impacting on accurate estimates of relative abundance of multiple genomes present in a sample. Instead of assigning reads one by one to the taxonomy tree as many existing methods do, we propose a statistical framework to model the identified candidate genomes to which sequence reads have hits. After obtaining the estimated proportion of reads generated by each genome, sequence reads are assigned to the candidate genomes and the taxonomy tree based on the estimated probability by taking into account both sequence alignment scores and estimated genome abundance. The proposed method is comprehensively tested on both simulated datasets and two real datasets. It assigns reads to the low taxonomic ranks very accurately. Our statistical approach of taxonomic assignment of metagenomic reads, TAMER, is implemented in R and available at http://faculty.wcas.northwestern.edu/hji403/MetaR.htm.
Affiliation(s)
- Hongmei Jiang
- Department of Statistics, Northwestern University, Evanston, Illinois, United States of America.
|
31
|
|
32
|
Bonilla-Rosso G, Eguiarte LE, Romero D, Travisano M, Souza V. Understanding microbial community diversity metrics derived from metagenomes: performance evaluation using simulated data sets. FEMS Microbiol Ecol 2012; 82:37-49. [PMID: 22554028 DOI: 10.1111/j.1574-6941.2012.01405.x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 11/21/2011] [Revised: 04/25/2012] [Accepted: 04/27/2012] [Indexed: 12/12/2022] Open
Abstract
Metagenomics holds the promise of greatly advancing the study of diversity in natural communities, but novel theoretical and methodological approaches must first be developed and adjusted for these data sets. We evaluated widely used macroecological metrics of taxonomic diversity on a simulated set of metagenomic samples, using phylogenetically meaningful protein-coding genes as ecological proxies. To our knowledge, this is the first approach of this kind to evaluate taxonomic diversity metrics derived from metagenomic data sets. We demonstrate that abundance matrices derived from protein-coding marker genes reproduce more faithfully the structure of the original community than those derived from SSU-rRNA gene. We also found that the most commonly used diversity metrics are biased estimators of community structure and differ significantly from their corresponding real parameters and that these biases are most likely caused by insufficient sampling and differences in community phylogenetic composition. Our results suggest that the ranking of samples using multidimensional metrics makes a good qualitative alternative for contrasting community structure and that these comparisons can be greatly improved with the incorporation of metrics for both community structure and phylogenetic diversity. These findings will help to achieve a standardized framework for community diversity comparisons derived from metagenomic data sets.
Affiliation(s)
- Germán Bonilla-Rosso
- Department of Ecología Evolutiva, Instituto de Ecología, Universidad Nacional Autónoma de México, México D.F, México
|
33
|
Porter TM, Golding GB. Factors that affect large subunit ribosomal DNA amplicon sequencing studies of fungal communities: classification method, primer choice, and error. PLoS One 2012; 7:e35749. [PMID: 22558215 PMCID: PMC3338786 DOI: 10.1371/journal.pone.0035749] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.8] [Received: 12/21/2011] [Accepted: 03/23/2012] [Indexed: 12/13/2022] Open
Abstract
Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and, to an increasing extent, also in amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, make primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve Bayesian classifier, NBC); and 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and NBC ran significantly faster than the other tested methods. All methods performed poorly with the shortest 50-100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys.
|
34
|
Baran Y, Halperin E. Joint analysis of multiple metagenomic samples. PLoS Comput Biol 2012; 8:e1002373. [PMID: 22359490 PMCID: PMC3280959 DOI: 10.1371/journal.pcbi.1002373] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 08/05/2011] [Accepted: 12/20/2011] [Indexed: 12/30/2022] Open
Abstract
The availability of metagenomic sequencing data, generated by sequencing DNA pooled from multiple microbes living jointly, has increased sharply in the last few years with developments in sequencing technology. Characterizing the contents of metagenomic samples is a challenging task, which has been extensively attempted by both supervised and unsupervised techniques, each with its own limitations. Common to practically all the methods is the processing of single samples only; when multiple samples are sequenced, each is analyzed separately and the results are combined. In this paper we propose to perform a combined analysis of a set of samples in order to obtain a better characterization of each of the samples, and provide two applications of this principle. First, we use an unsupervised probabilistic mixture model to infer hidden components shared across metagenomic samples. We incorporate the model in a novel framework for studying association of microbial sequence elements with phenotypes, analogous to the genome-wide association studies performed on human genomes: We demonstrate that stratification may result in false discoveries of such associations, and that the components inferred by the model can be used to correct for this stratification. Second, we propose a novel read clustering (also termed “binning”) algorithm which operates on multiple samples simultaneously, leveraging the assumption that the different samples contain the same microbial species, possibly in different proportions. We show that integrating information across multiple samples yields more precise binning on each of the samples. Moreover, for both applications we demonstrate that given a fixed depth of coverage, the average per-sample performance generally increases with the number of sequenced samples as long as the per-sample coverage is high enough.
Microorganisms are extremely abundant and diverse, and occupy almost every habitat on earth. Most of these habitats contain a complex mixture of many different microorganisms, and the characterization of these metagenomic mixtures, in terms of both taxonomy and function, is of great interest to science and medicine. Current sequencing technologies produce large numbers of short DNA reads copied from the genomes of a metagenomic sample, which can be used to obtain a high resolution characterization of such samples. However, the analysis of such data is complicated by the fact that one cannot tell which sequencing reads originated from the same genome. We show that the joint analysis of multiple metagenomic samples, which takes advantage of the fact that the samples share common microbial types, achieves better single-sample characterization compared to the current analysis methods that operate on single samples only. We demonstrate how this approach can be used to infer microbial components without the use of external sequence data, and to cluster sequencing reads according to their species of origin. In both cases we show that the joint analysis enhances the average single-sample performance, thus providing better sample characterization.
Affiliation(s)
- Yael Baran
- School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
- Eran Halperin
- School of Computer Science and the Department of Molecular Microbiology and Biotechnology, Tel-Aviv University, Tel-Aviv, Israel
- International Computer Science Institute, Berkeley, California, United States of America
|