Reference Citation Analysis

For: Meinicke P, Asshauer KP, Lingner T. Mixture models for analysis of the taxonomic composition of metagenomes. Bioinformatics 2011;27:1618-24. [PMID: 21546400 PMCID: PMC3106201 DOI: 10.1093/bioinformatics/btr266] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.4] [Indexed: 12/14/2022]

Cited by Other Article(s)

Nalbantoglu OU, Sayood K. MIMOSA: Algorithms for Microbial Profiling. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019;16:2023-2034. [PMID: 29994027 DOI: 10.1109/tcbb.2018.2830324] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/08/2023]

Comprehensive simulation of metagenomic sequencing data with non-uniform sampling distribution. Quantitative Biology 2018. [DOI: 10.1007/s40484-018-0142-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/28/2022]

Fuks G, Elgart M, Amir A, Zeisel A, Turnbaugh PJ, Soen Y, Shental N. Combining 16S rRNA gene variable regions enables high-resolution microbial community profiling. Microbiome 2018;6:17. [PMID: 29373999 PMCID: PMC5787238 DOI: 10.1186/s40168-017-0396-x] [Citation(s) in RCA: 115] [Impact Index Per Article: 19.2] [Received: 11/01/2016] [Accepted: 12/25/2017] [Indexed: 05/02/2023]

Abstract

BACKGROUND

Most of our knowledge about the remarkable microbial diversity on Earth comes from sequencing the 16S rRNA gene. The use of next-generation sequencing methods has increased sample number and sequencing depth, but the read length of the most widely used sequencing platforms today is quite short, requiring the researcher to choose a subset of the gene to sequence (typically 16-33% of the total length). Thus, many bacteria may share the same amplified region, and the resolution of profiling is inherently limited. Platforms that offer ultra-long read lengths, whole genome shotgun sequencing approaches, and computational frameworks formerly suggested by us and by others all allow different ways to circumvent this problem yet suffer various shortcomings. There is a need for a simple and low-cost 16S rRNA gene-based profiling approach that harnesses the short read length to provide a much larger coverage of the gene to allow for high resolution, even in harsh conditions of low bacterial biomass and fragmented DNA.

RESULTS

This manuscript suggests Short MUltiple Regions Framework (SMURF), a method to combine sequencing results from different PCR-amplified regions to provide one coherent profiling. The de facto amplicon length is the total length of all amplified regions, thus providing much higher resolution compared to current techniques. Computationally, the method solves a convex optimization problem that allows extremely fast reconstruction and requires only moderate memory. We demonstrate the increase in resolution by in silico simulations and by profiling two mock mixtures and real-world biological samples. Reanalyzing a mock mixture from the Human Microbiome Project achieved about twofold improvement in resolution when combining two independent regions. Using a custom set of six primer pairs spanning about 1200 bp (80%) of the 16S rRNA gene, we were able to achieve ~100-fold improvement in resolution compared to a single region, over a mock mixture of common human gut bacterial isolates. Finally, the profiling of a Drosophila melanogaster microbiome using the set of six primer pairs provided a ~100-fold increase in resolution, thus enabling efficient downstream analysis.

CONCLUSIONS

SMURF enables the identification of near full-length 16S rRNA gene sequences in microbial communities, with resolution superior to that of current techniques. It may be applied to standard sample preparation protocols with very few modifications. SMURF also paves the way to high-resolution profiling of low-biomass and fragmented DNA, e.g., in the case of formalin-fixed and paraffin-embedded samples, fossil-derived DNA, or DNA exposed to other degrading conditions. The approach is not restricted to combining amplicons of the 16S rRNA gene and may be applied to any set of amplicons, e.g., in multilocus sequence typing (MLST).

Tran Q, Pham DT, Phan V. Using 16S rRNA gene as marker to detect unknown bacteria in microbial communities. BMC Bioinformatics 2017;18:499. [PMID: 29297282 PMCID: PMC5751639 DOI: 10.1186/s12859-017-1901-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022]

Pham DT, Gao S, Phan V. An accurate and fast alignment-free method for profiling microbial communities. J Bioinform Comput Biol 2017;15:1740001. [PMID: 28345370 DOI: 10.1142/s0219720017400017] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]

Ferretti P, Farina S, Cristofolini M, Girolomoni G, Tett A, Segata N. Experimental metagenomics and ribosomal profiling of the human skin microbiome. Exp Dermatol 2017;26:211-219. [DOI: 10.1111/exd.13210] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Accepted: 09/06/2016] [Indexed: 02/06/2023]

Gregor I, Dröge J, Schirmer M, Quince C, McHardy AC. PhyloPythiaS+: a self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes. PeerJ 2016;4:e1603. [PMID: 26870609 PMCID: PMC4748697 DOI: 10.7717/peerj.1603] [Citation(s) in RCA: 67] [Impact Index Per Article: 8.4] [Received: 10/14/2015] [Accepted: 12/24/2015] [Indexed: 12/21/2022]

Abstract

Background. Metagenomics is an approach for characterizing environmental microbial communities in situ; it allows their functional and taxonomic characterization and the recovery of sequences from uncultured taxa. This is often achieved by a combination of sequence assembly and binning, where sequences are grouped into ‘bins’ representing taxa of the underlying microbial community. Assignment to low-ranking taxonomic bins is an important challenge for binning methods, as is scalability to Gb-sized datasets generated with deep sequencing techniques. One of the best available methods for species bins recovery from deep-branching phyla is the expert-trained PhyloPythiaS package, where a human expert decides on the taxa to incorporate in the model and identifies ‘training’ sequences based on marker genes directly from the sample. Due to the manual effort involved, this approach does not scale to multiple metagenome samples and requires substantial expertise, which researchers who are new to the area do not have.

Results. We have developed PhyloPythiaS+, a successor to our PhyloPythia(S) software. The new (+) component performs the work previously done by the human expert. PhyloPythiaS+ also includes a new k-mer counting algorithm, which accelerated the simultaneous counting of 4–6-mers used for taxonomic binning 100-fold and reduced the overall execution time of the software by a factor of three. Our software allows Gb-sized metagenomes to be analyzed with inexpensive hardware, and species- or genus-level bins to be recovered with low error rates in a fully automated fashion. PhyloPythiaS+ was compared to MEGAN, taxator-tk, Kraken and the generic PhyloPythiaS model. The results showed that PhyloPythiaS+ performs especially well for samples originating from novel environments in comparison to the other methods.

Availability. PhyloPythiaS+ in a virtual machine is available for installation under Windows, Unix systems or OS X at: https://github.com/algbioi/ppsp/wiki.
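
The abstract above mentions counting 4–6-mers simultaneously but gives no algorithmic detail. As a rough illustration of the general idea only (this is not the PhyloPythiaS+ algorithm, and the function name `count_kmers` and its parameters are hypothetical), a single pass over a sequence can emit every k-mer of each length that ends at the current position:

```python
from collections import Counter


def count_kmers(seq, k_min=4, k_max=6):
    """Count all k-mers with k_min <= k <= k_max in one pass over seq.

    Returns a dict {k: Counter of k-mer string -> count}. Windows that
    contain any character other than A, C, G or T are skipped.
    """
    seq = seq.upper()
    counts = {k: Counter() for k in range(k_min, k_max + 1)}
    valid_run = 0  # consecutive valid bases ending at the current position
    for i, base in enumerate(seq):
        if base not in "ACGT":
            valid_run = 0  # an ambiguous base invalidates overlapping windows
            continue
        valid_run += 1
        # Emit each k-mer that ends at position i and fits in the valid run.
        for k in range(k_min, min(k_max, valid_run) + 1):
            counts[k][seq[i - k + 1 : i + 1]] += 1
    return counts


if __name__ == "__main__":
    result = count_kmers("ACGTACGTNACGT")
    for k, counter in sorted(result.items()):
        print(k, dict(counter))
```

A production implementation would more likely encode k-mers as integers and use array counters; strings are used here only to keep the sketch readable.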

Koslicki D, Chatterjee S, Shahrivar D, Walker AW, Francis SC, Fraser LJ, Vehkaperä M, Lan Y, Corander J. ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition. PLoS One 2015;10:e0140644. [PMID: 26496191 PMCID: PMC4619776 DOI: 10.1371/journal.pone.0140644] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/20/2015] [Accepted: 09/28/2015] [Indexed: 11/17/2022]

Abstract

Motivation

Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging.

Results

There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets with reasonable computational cost to provide several vectors of first-order statistics instead of only a single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.

Availability

An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
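
The aggregation step described in the Results above (K-means over reads, yielding several k-mer frequency vectors instead of one) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the ARK implementation: the function names, the choice of k = 2, the squared Euclidean distance and the random initialization are all assumptions, and the downstream estimation against a reference database is not shown.

```python
import random
from itertools import product


def kmer_frequencies(read, k=2):
    """Return the normalized k-mer frequency vector of a read (A/C/G/T only)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vec = [0.0] * len(index)
    for i in range(len(read) - k + 1):
        j = index.get(read[i : i + k])
        if j is not None:  # ignore k-mers containing ambiguous bases
            vec[j] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec


def kmeans(vectors, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, n_clusters)]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(n_clusters),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])),
            )
        # Update step: each center becomes the mean of its members.
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def aggregate_reads(reads, n_clusters, k=2):
    """Summarize reads as per-cluster mean k-mer vectors plus cluster weights."""
    vectors = [kmer_frequencies(r, k) for r in reads]
    labels, centers = kmeans(vectors, n_clusters)
    weights = [labels.count(c) / len(reads) for c in range(n_clusters)]
    return centers, weights


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "GGGGCCCC", "GGGCCCCC"]
    centers, weights = aggregate_reads(reads, n_clusters=2)
    print(weights)
```

Each returned center is one "first-order statistic" vector for a subset of reads, and the weights record the fraction of reads in each subset, which a downstream estimator could use when combining per-cluster results.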

Zepeda Mendoza ML, Sicheritz-Pontén T, Gilbert MTP. Environmental genes and genomes: understanding the differences and challenges in the approaches and software for their analyses. Brief Bioinform 2015;16:745-58. [PMID: 25673291 PMCID: PMC4570204 DOI: 10.1093/bib/bbv001] [Citation(s) in RCA: 49] [Impact Index Per Article: 5.4] [Received: 11/07/2014] [Revised: 12/16/2014] [Indexed: 01/19/2023]

McNair K, Edwards RA. GenomePeek-an online tool for prokaryotic genome and metagenome analysis. PeerJ 2015;3:e1025. [PMID: 26157610 PMCID: PMC4476108 DOI: 10.7717/peerj.1025] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 10/04/2014] [Accepted: 05/25/2015] [Indexed: 12/23/2022]

Lam KN, Charles TC. Strong spurious transcription likely contributes to DNA insert bias in typical metagenomic clone libraries. Microbiome 2015;3:22. [PMID: 26056565 PMCID: PMC4459075 DOI: 10.1186/s40168-015-0086-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 12/09/2014] [Accepted: 05/01/2015] [Indexed: 05/24/2023]

Abstract

BACKGROUND

Clone libraries provide researchers with a powerful resource to study nucleic acid from diverse sources. Metagenomic clone libraries in particular have aided in studies of microbial biodiversity and function, and allowed the mining of novel enzymes. Libraries are often constructed by cloning large inserts into cosmid or fosmid vectors. Recently, there have been reports of GC bias in fosmid metagenomic libraries, and it was speculated to be a result of fragmentation and loss of AT-rich sequences during cloning. However, evidence in the literature suggests that transcriptional activity or gene product toxicity may play a role.

RESULTS

To explore possible mechanisms responsible for sequence bias in clone libraries, we constructed a cosmid library from a human microbiome sample and sequenced DNA from different steps during library construction: crude extract DNA, size-selected DNA, and cosmid library DNA. We confirmed a GC bias in the final cosmid library, and we provide evidence that the bias is not due to fragmentation and loss of AT-rich sequences but is likely occurring after DNA is introduced into Escherichia coli. To investigate the influence of strong constitutive transcription, we searched the sequence data for promoters and found that rpoD/σ(70) promoter sequences were underrepresented in the cosmid library. Furthermore, when we examined the genomes of taxa that were differentially abundant in the cosmid library relative to the original sample, we found the bias to be more correlated with the number of rpoD/σ(70) consensus sequences in the genome than with simple GC content.

CONCLUSIONS

The GC bias of metagenomic libraries does not appear to be due to DNA fragmentation. Rather, analysis of promoter sequences provides support for the hypothesis that strong constitutive transcription from sequences recognized as rpoD/σ(70) consensus-like in E. coli may lead to instability, causing loss of the plasmid or loss of the insert DNA that gives rise to the transcription. Despite widespread use of E. coli to propagate foreign DNA in metagenomic libraries, the effects of in vivo transcriptional activity on clone stability are not well understood. Further work is required to tease apart the effects of transcription from those of gene product toxicity.

Lindner MS, Renard BY. Metagenomic profiling of known and unknown microbes with microbeGPS. PLoS One 2015;10:e0117711. [PMID: 25643362 PMCID: PMC4314203 DOI: 10.1371/journal.pone.0117711] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 10/06/2014] [Accepted: 12/29/2014] [Indexed: 11/19/2022]

AKE - the Accelerated k-mer Exploration web-tool for rapid taxonomic classification and visualization. BMC Bioinformatics 2014;15:384. [PMID: 25495116 PMCID: PMC4307196 DOI: 10.1186/s12859-014-0384-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 07/17/2014] [Accepted: 11/12/2014] [Indexed: 11/22/2022]

Faisal A, Peltonen J, Georgii E, Rung J, Kaski S. Toward computational cumulative biology by combining models of biological datasets. PLoS One 2014;9:e113053. [PMID: 25427176 PMCID: PMC4245117 DOI: 10.1371/journal.pone.0113053] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 07/04/2014] [Accepted: 10/17/2014] [Indexed: 11/21/2022]

Bragg L, Tyson GW. Metagenomics using next-generation sequencing. Methods Mol Biol 2014;1096:183-201. [PMID: 24515370 DOI: 10.1007/978-1-62703-712-9_15] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.2] [Indexed: 12/21/2022]

Seok HS, Hong W, Kim J. Estimating the composition of species in metagenomes by clustering of next-generation read sequences. Methods 2014;69:213-9. [PMID: 25072168 DOI: 10.1016/j.ymeth.2014.07.009] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 03/20/2014] [Revised: 07/16/2014] [Accepted: 07/18/2014] [Indexed: 11/26/2022]

Abstract

Faster and cheaper sequencing technologies, together with the ability to sequence uncultured microbes collected from any environment, present an opportunity to distill meaningful information from the millions of new genomic sequences obtained from environmental samples, collectively called the metagenome. In contrast to data from conventionally cultured microbes, however, metagenomic data are extremely heterogeneous and noisy. Therefore, separating the sets of sequenced genomic fragments that belong to different microbes is essential for successful assembly of microbial genomes. In this paper, we present a novel clustering method for a given metagenomic dataset. A metagenomic dataset has some distinguishing features because (i) similar sequence patterns may exist in different species and (ii) each species has a different number of individuals in the dataset. Our method overcomes these obstacles by using the Gaussian mixture model and analysis of mixture profiles, and by taking advantage of genomic signatures extracted from the metagenomic dataset. Unlike conventional clustering methods, where clusters are discovered through global similarities of data instances, our method builds clusters by combining the data instances sharing local similarities captured by mixture analysis. By considering shared mixture components, our method is able to create clusters of genomic sequences even though they are globally distinct from each other. We applied our method to an artificial metagenomic dataset comprising 47 million simulated reads from 25 real microbial genomes, and analyzed the resulting clusters in terms of the number of clusters, the number of participating species and the dominant species in each cluster. Even though our approach cannot address all challenges in the field of metagenome sequence clustering, we believe that our method can contribute a step forward toward these goals.

An L, Pookhao N, Jiang H, Xu J. Statistical approach of functional profiling for a microbial community. PLoS One 2014;9:e106588. [PMID: 25198674 PMCID: PMC4157783 DOI: 10.1371/journal.pone.0106588] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/08/2014] [Accepted: 07/31/2014] [Indexed: 12/21/2022]

Abstract

BACKGROUND

Metagenomics is a relatively new but fast-growing field within environmental biology and medical sciences. It enables researchers to understand the diversity of microbes, their functions, cooperation, and evolution in a particular ecosystem. Traditional methods in genomics and microbiology are not efficient in capturing the structure of the microbial community in an environment. Nowadays, high-throughput next-generation sequencing technologies are powerfully driving metagenomic studies. However, there is an urgent need to develop efficient statistical methods and computational algorithms to rapidly analyze the massive metagenomic short sequencing data and to accurately detect the features/functions present in the microbial community. Although several issues about functions of metagenomes at the pathway or subsystem level have been investigated, there is a lack of studies focusing on functional analysis at a low level of a hierarchical functional tree, such as the SEED subsystem tree.

RESULTS

A two-step statistical procedure (metaFunction) is proposed to detect all possible functional roles at the low level from a metagenomic sample/community. In the first step, a statistical mixture model is proposed at the base of gene codons to estimate the abundances for the candidate functional roles, with sequencing error being considered. Because a gene could be involved in multiple biological processes, the functional assignment is adjusted in the second step by utilizing an error distribution. The performance of the proposed procedure is evaluated through comprehensive simulation studies. Compared with other existing methods in metagenomic functional analysis, the new approach is more accurate in assigning reads to functional roles, and therefore at more general levels. The method is also employed to analyze two real data sets.

CONCLUSIONS

metaFunction is a powerful tool for accurately profiling functions in a metagenomic sample.

Wood GR, Ryabov EV, Fannon JM, Moore JD, Evans DJ, Burroughs N. MosaicSolver: a tool for determining recombinants of viral genomes from pileup data. Nucleic Acids Res 2014;42:e123. [PMID: 25120266 PMCID: PMC4176379 DOI: 10.1093/nar/gku524] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 02/04/2023]

Exploring neighborhoods in the metagenome universe. Int J Mol Sci 2014;15:12364-78. [PMID: 25026170 PMCID: PMC4139848 DOI: 10.3390/ijms150712364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2014] [Accepted: 06/25/2014] [Indexed: 11/16/2022]

Abstract

The variety of metagenomes in current databases provides a rapidly growing source of information for comparative studies. However, the quantity and quality of supplementary metadata is still lagging behind. It is therefore important to be able to identify related metagenomes by means of the available sequence data alone. We have studied efficient sequence-based methods for large-scale identification of similar metagenomes within a database retrieval context. In a broad comparison of different profiling methods, we found that vector-based distance measures are well suited for the detection of metagenomic neighbors. Our evaluation on more than 1700 publicly available metagenomes indicates that for a query metagenome from a particular habitat, on average nine out of ten nearest neighbors represent the same habitat category, independent of the utilized profiling method or distance measure. While for well-defined labels a neighborhood accuracy of 100% can be achieved, in general the neighbor detection is severely affected by a natural overlap of manually annotated categories. In addition, we present results of a novel visualization method that is able to reflect the similarity of metagenomes in a 2D scatter plot. The visualization method shows a similarly high accuracy in the reduced space as compared with the high-dimensional profile space. Our study suggests that for inspection of metagenome neighborhoods, the profiling methods and distance measures can be chosen to provide a convenient interpretation of results in terms of the underlying features. Furthermore, supplementary metadata of metagenome samples in the future needs to comply with readily available ontologies for fine-grained and standardized annotation. To make profile-based k-nearest-neighbor search and the 2D visualization of the metagenome universe available to the research community, we included the proposed methods in our CoMet-Universe server for comparative metagenome analysis.
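
The profile-based k-nearest-neighbor search and the neighborhood accuracy mentioned in the abstract above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the cosine distance is just one possible vector-based measure, the function names are hypothetical, and the toy profiles and habitat labels are invented rather than taken from the study.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length profile vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def nearest_neighbors(query, profiles, k):
    """Return the names of the k profiles closest to the query vector."""
    ranked = sorted(profiles, key=lambda name: cosine_distance(query, profiles[name]))
    return ranked[:k]


def neighborhood_accuracy(profiles, labels, k):
    """Average fraction of each profile's k nearest neighbors sharing its label."""
    scores = []
    for name, vec in profiles.items():
        # Leave the query itself out of the candidate set.
        others = {n: p for n, p in profiles.items() if n != name}
        hits = nearest_neighbors(vec, others, k)
        scores.append(sum(labels[h] == labels[name] for h in hits) / k)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy profiles (hypothetical relative abundances of three features).
    profiles = {
        "gut1": [0.8, 0.1, 0.1],
        "gut2": [0.7, 0.2, 0.1],
        "soil1": [0.1, 0.1, 0.8],
        "soil2": [0.1, 0.2, 0.7],
    }
    labels = {"gut1": "gut", "gut2": "gut", "soil1": "soil", "soil2": "soil"}
    print(nearest_neighbors(profiles["gut1"], profiles, k=2))
    print(neighborhood_accuracy(profiles, labels, k=1))
```

Swapping `cosine_distance` for another measure changes only the ranking key, which matches the abstract's point that the measure can be chosen for interpretability.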

Silva GGZ, Cuevas DA, Dutilh BE, Edwards RA. FOCUS: an alignment-free model to identify organisms in metagenomes using non-negative least squares. PeerJ 2014;2:e425. [PMID: 24949242 PMCID: PMC4060023 DOI: 10.7717/peerj.425] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Received: 03/19/2014] [Accepted: 05/21/2014] [Indexed: 12/11/2022]

Chatterjee S, Koslicki D, Dong S, Innocenti N, Cheng L, Lan Y, Vehkaperä M, Skoglund M, Rasmussen LK, Aurell E, Corander J. SEK: sparsity exploiting k-mer-based estimation of bacterial community composition. Bioinformatics 2014;30:2423-31. [PMID: 24812337 DOI: 10.1093/bioinformatics/btu320] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 11/14/2022]

Affiliation(s)

Saikat Chatterjee Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
David Koslicki Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
Siyuan Dong Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
Nicolas Innocenti Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
Lu Cheng Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
Yueheng Lan Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
Mikko Vehkaperä Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
Mikael Skoglund Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
Lars K Rasmussen Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
Erik Aurell Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland
Jukka Corander Department of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics, Oregon State University, Corvallis, OR, USA, Systems Biology program, KTH Royal Institute of Technology, Sweden, Aalto University, Esbo, Finland, Department of Computational Biology, KTH Royal Institute of Technology, Stockholm, Sweden, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland, Department of Physics, Tsinghua University, Beijing, China, Department of Signal Processing and Department of Information and Computer Science, Aalto University, Esbo, Finland


Hauser PM, Bernard T, Greub G, Jaton K, Pagni M, Hafen GM. Microbiota present in cystic fibrosis lungs as revealed by whole genome sequencing. PLoS One 2014;9:e90934. [PMID: 24599149 PMCID: PMC3944733 DOI: 10.1371/journal.pone.0090934] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Received: 12/02/2013] [Accepted: 02/06/2014] [Indexed: 01/13/2023] Open

Johansen P, Vindeløv J, Arneborg N, Brockmann E. Development of quantitative PCR and metagenomics-based approaches for strain quantification of a defined mixed-strain starter culture. Syst Appl Microbiol 2014;37:186-93. [PMID: 24582508 DOI: 10.1016/j.syapm.2013.12.006] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 07/05/2013] [Revised: 12/04/2013] [Accepted: 12/23/2013] [Indexed: 11/30/2022]

Roberts A, Feng H, Pachter L. Fragment assignment in the cloud with eXpress-D. BMC Bioinformatics 2013;14:358. [PMID: 24314033 PMCID: PMC3881492 DOI: 10.1186/1471-2105-14-358] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 09/13/2013] [Accepted: 11/18/2013] [Indexed: 11/21/2022] Open

Koslicki D, Foucart S, Rosen G. Quikr: a method for rapid reconstruction of bacterial communities via compressive sensing. Bioinformatics 2013;29:2096-102. [PMID: 23786768 DOI: 10.1093/bioinformatics/btt336] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Indexed: 11/12/2022]

Klingenberg H, Aßhauer KP, Lingner T, Meinicke P. Protein signature-based estimation of metagenomic abundances including all domains of life and viruses. Bioinformatics 2013;29:973-80. [PMID: 23418187 PMCID: PMC3624802 DOI: 10.1093/bioinformatics/btt077] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Indexed: 11/16/2022]

Treangen TJ, Koren S, Sommer DD, Liu B, Astrovskaya I, Ondov B, Darling AE, Phillippy AM, Pop M. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol 2013;14:R2. [PMID: 23320958 PMCID: PMC4053804 DOI: 10.1186/gb-2013-14-1-r2] [Citation(s) in RCA: 154] [Impact Index Per Article: 14.0] [Received: 06/04/2012] [Accepted: 01/15/2013] [Indexed: 12/31/2022] Open

Roberts A, Pachter L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Methods 2013;10:71-3. [PMID: 23160280 PMCID: PMC3880119 DOI: 10.1038/nmeth.2251] [Citation(s) in RCA: 676] [Impact Index Per Article: 61.5] [Received: 04/30/2012] [Accepted: 10/26/2012] [Indexed: 11/12/2022]

Teeling H, Glöckner FO. Current opportunities and challenges in microbial metagenome analysis--a bioinformatic perspective. Brief Bioinform 2012;13:728-42. [PMID: 22966151 PMCID: PMC3504927 DOI: 10.1093/bib/bbs039] [Citation(s) in RCA: 148] [Impact Index Per Article: 12.3] [Received: 03/30/2012] [Accepted: 06/09/2012] [Indexed: 12/21/2022] Open

Jiang H, An L, Lin SM, Feng G, Qiu Y. A statistical framework for accurate taxonomic assignment of metagenomic sequencing reads. PLoS One 2012;7:e46450. [PMID: 23049702 PMCID: PMC3462201 DOI: 10.1371/journal.pone.0046450] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Received: 07/16/2012] [Accepted: 08/30/2012] [Indexed: 11/19/2022] Open

Haft DH, Tovchigrechko A. High-speed microbial community profiling. Nat Methods 2012;9:793-4. [PMID: 22688412 DOI: 10.1038/nmeth.2080] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Indexed: 11/09/2022]

Bonilla-Rosso G, Eguiarte LE, Romero D, Travisano M, Souza V. Understanding microbial community diversity metrics derived from metagenomes: performance evaluation using simulated data sets. FEMS Microbiol Ecol 2012;82:37-49. [PMID: 22554028 DOI: 10.1111/j.1574-6941.2012.01405.x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 11/21/2011] [Revised: 04/25/2012] [Accepted: 04/27/2012] [Indexed: 12/12/2022] Open

Porter TM, Golding GB. Factors that affect large subunit ribosomal DNA amplicon sequencing studies of fungal communities: classification method, primer choice, and error. PLoS One 2012;7:e35749. [PMID: 22558215 PMCID: PMC3338786 DOI: 10.1371/journal.pone.0035749] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.8] [Received: 12/21/2011] [Accepted: 03/23/2012] [Indexed: 12/13/2022] Open

Baran Y, Halperin E. Joint analysis of multiple metagenomic samples. PLoS Comput Biol 2012;8:e1002373. [PMID: 22359490 PMCID: PMC3280959 DOI: 10.1371/journal.pcbi.1002373] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 08/05/2011] [Accepted: 12/20/2011] [Indexed: 12/30/2022] Open

Abstract

The availability of metagenomic sequencing data, generated by sequencing DNA pooled from multiple microbes living jointly, has increased sharply in the last few years with developments in sequencing technology. Characterizing the contents of metagenomic samples is a challenging task, which has been extensively attempted by both supervised and unsupervised techniques, each with its own limitations. Common to practically all the methods is the processing of single samples only; when multiple samples are sequenced, each is analyzed separately and the results are combined. In this paper we propose to perform a combined analysis of a set of samples in order to obtain a better characterization of each of the samples, and provide two applications of this principle. First, we use an unsupervised probabilistic mixture model to infer hidden components shared across metagenomic samples. We incorporate the model in a novel framework for studying association of microbial sequence elements with phenotypes, analogous to the genome-wide association studies performed on human genomes: We demonstrate that stratification may result in false discoveries of such associations, and that the components inferred by the model can be used to correct for this stratification. Second, we propose a novel read clustering (also termed “binning”) algorithm which operates on multiple samples simultaneously, leveraging the assumption that the different samples contain the same microbial species, possibly in different proportions. We show that integrating information across multiple samples yields more precise binning on each of the samples. Moreover, for both applications we demonstrate that given a fixed depth of coverage, the average per-sample performance generally increases with the number of sequenced samples as long as the per-sample coverage is high enough.

Microorganisms are extremely abundant and diverse, and occupy almost every habitat on earth. Most of these habitats contain a complex mixture of many different microorganisms, and the characterization of these metagenomic mixtures, in terms of both taxonomy and function, is of great interest to science and medicine. Current sequencing technologies produce large numbers of short DNA reads copied from the genomes of a metagenomic sample, which can be used to obtain a high resolution characterization of such samples. However, the analysis of such data is complicated by the fact that one cannot tell which sequencing reads originated from the same genome. We show that the joint analysis of multiple metagenomic samples, which takes advantage of the fact that the samples share common microbial types, achieves better single-sample characterization compared to the current analysis methods that operate on single samples only. We demonstrate how this approach can be used to infer microbial components without the use of external sequence data, and to cluster sequencing reads according to their species of origin. In both cases we show that the joint analysis enhances the average single-sample performance, thus providing better sample characterization.
