1
|
Challenges and opportunities in sharing microbiome data and analyses. Nat Microbiol 2023; 8:1960-1970. [PMID: 37783751 DOI: 10.1038/s41564-023-01484-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2021] [Accepted: 08/28/2023] [Indexed: 10/04/2023]
Abstract
Microbiome data, metadata and analytical workflows have become 'big' in terms of volume and complexity. Although the infrastructure and technologies to share data have been established, the interdisciplinary and multi-omic nature of the field can make resources difficult to identify and use. Following best practices for data deposition requires substantial effort, with sometimes little obvious reward. Gaps remain where microbiome-specific resources for data sharing or reproducibility do not yet exist. We outline available best practices, challenges to their adoption and opportunities in data sharing in microbiome research. We showcase examples of best practices and advocate for their enforcement and incentivization for data sharing. This includes recognition of data curation and sharing endeavours by individuals, institutions, journals and funders. Opportunities for progress include enabling microbiome-specific databases to incorporate future methods for data analysis, integration and reuse.
Collapse
|
2
|
Questioning the fetal microbiome illustrates pitfalls of low-biomass microbial studies. Nature 2023; 613:639-649. [PMID: 36697862 DOI: 10.1038/s41586-022-05546-8] [Citation(s) in RCA: 94] [Impact Index Per Article: 94.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2021] [Accepted: 11/09/2022] [Indexed: 01/26/2023]
Abstract
Whether the human fetus and the prenatal intrauterine environment (amniotic fluid and placenta) are stably colonized by microbial communities in a healthy pregnancy remains a subject of debate. Here we evaluate recent studies that characterized microbial populations in human fetuses from the perspectives of reproductive biology, microbial ecology, bioinformatics, immunology, clinical microbiology and gnotobiology, and assess possible mechanisms by which the fetus might interact with microorganisms. Our analysis indicates that the detected microbial signals are likely the result of contamination during the clinical procedures to obtain fetal samples or during DNA extraction and DNA sequencing. Furthermore, the existence of live and replicating microbial populations in healthy fetal tissues is not compatible with fundamental concepts of immunology, clinical microbiology and the derivation of germ-free mammals. These conclusions are important to our understanding of human immune development and illustrate common pitfalls in the microbial analyses of many other low-biomass environments. The pursuit of a fetal microbiome serves as a cautionary example of the challenges of sequence-based microbiome studies when biomass is low or absent, and emphasizes the need for a trans-disciplinary approach that goes beyond contamination controls by also incorporating biological, ecological and mechanistic concepts.
Collapse
|
3
|
Metadata harmonization-Standards are the key for a better usage of omics data for integrative microbiome analysis. ENVIRONMENTAL MICROBIOME 2022; 17:33. [PMID: 35751093 PMCID: PMC9233336 DOI: 10.1186/s40793-022-00425-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/25/2022] [Accepted: 05/29/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND Tremendous amounts of data generated from microbiome research studies during the last decades require not only standards for sampling and preparation of omics data but also clear concepts of how the metadata is prepared to ensure re-use for integrative and interdisciplinary microbiome analysis. RESULTS In this Commentary, we present our views on the key issues related to the current system for metadata submission in omics research, and propose the development of a global metadata system. Such a system should be easy to use, clearly structured in a hierarchical way, and should be compatible with all existing microbiome data repositories, following common standards for minimal required information and common ontology. Although minimum metadata requirements are essential for microbiome datasets, the immense technological progress requires a flexible system, which will have to be constantly improved and re-thought. While FAIR principles (Findable, Accessible, Interoperable, and Reusable) are already considered, international legal issues on genetic resource and sequence sharing provided by the Convention on Biological Diversity need more awareness and engagement of the scientific community. CONCLUSIONS The suggested approach for metadata entries would strongly improve retrieving and re-using data as demonstrated in several representative use cases. These integrative analyses, in turn, would further advance the potential of microbiome research for novel scientific discoveries and the development of microbiome-derived products.
Collapse
|
4
|
Critical Assessment of Metagenome Interpretation: the second round of challenges. Nat Methods 2022; 19:429-440. [PMID: 35396482 PMCID: PMC9007738 DOI: 10.1038/s41592-022-01431-4] [Citation(s) in RCA: 89] [Impact Index Per Article: 44.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2021] [Accepted: 02/14/2022] [Indexed: 12/20/2022]
Abstract
Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains still were challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses. This study presents the results of the second round of the Critical Assessment of Metagenome Interpretation challenges (CAMI II), which is a community-driven effort for comprehensively benchmarking tools for metagenomics data analysis.
Collapse
|
5
|
A roadmap for the generation of benchmarking resources for antimicrobial resistance detection using next generation sequencing. F1000Res 2022; 10:80. [PMID: 35847383 PMCID: PMC9243550 DOI: 10.12688/f1000research.39214.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 03/10/2022] [Indexed: 11/20/2022] Open
Abstract
Next Generation Sequencing technologies significantly impact the field of Antimicrobial Resistance (AMR) detection and monitoring, with immediate uses in diagnosis and risk assessment. For this application and in general, considerable challenges remain in demonstrating sufficient trust to act upon the meaningful information produced from raw data, partly because of the reliance on bioinformatics pipelines, which can produce different results and therefore lead to different interpretations. With the constant evolution of the field, it is difficult to identify, harmonise and recommend specific methods for large-scale implementations over time. In this article, we propose to address this challenge through establishing a transparent, performance-based, evaluation approach to provide flexibility in the bioinformatics tools of choice, while demonstrating proficiency in meeting common performance standards. The approach is two-fold: first, a community-driven effort to establish and maintain “live” (dynamic) benchmarking platforms to provide relevant performance metrics, based on different use-cases, that would evolve together with the AMR field; second, agreed and defined datasets to allow the pipelines’ implementation, validation, and quality-control over time. Following previous discussions on the main challenges linked to this approach, we provide concrete recommendations and future steps, related to different aspects of the design of benchmarks, such as the selection and the characteristics of the datasets (quality, choice of pathogens and resistances, etc.), the evaluation criteria of the pipelines, and the way these resources should be deployed in the community.
Collapse
|
6
|
Abstract
AbstractWith viral infections, multiple related viral strains are often present due to coinfection or within-host evolution. We describe Haploflow, a deBruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assess Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. We show Haploflow reconstructs viral strain genomes from patient HCMV samples and SARS-CoV-2 wastewater samples identical to clinical isolates.
Collapse
|
7
|
A roadmap for the generation of benchmarking resources for antimicrobial resistance detection using next generation sequencing. F1000Res 2021; 10:80. [DOI: 10.12688/f1000research.39214.1] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 02/02/2021] [Indexed: 01/12/2023] Open
Abstract
Next Generation Sequencing technologies significantly impact the field of Antimicrobial Resistance (AMR) detection and monitoring, with immediate uses in diagnosis and risk assessment. For this application and in general, considerable challenges remain in demonstrating sufficient trust to act upon the meaningful information produced from raw data, partly because of the reliance on bioinformatics pipelines, which can produce different results and therefore lead to different interpretations. With the constant evolution of the field, it is difficult to identify, harmonise and recommend specific methods for large-scale implementations over time. In this article, we propose to address this challenge through establishing a transparent, performance-based, evaluation approach to provide flexibility in the bioinformatics tools of choice, while demonstrating proficiency in meeting common performance standards. The approach is two-fold: first, a community-driven effort to establish and maintain “live” (dynamic) benchmarking platforms to provide relevant performance metrics, based on different use-cases, that would evolve together with the AMR field; second, agreed and defined datasets to allow the pipelines’ implementation, validation, and quality-control over time. Following previous discussions on the main challenges linked to this approach, we provide concrete recommendations and future steps, related to different aspects of the design of benchmarks, such as the selection and the characteristics of the datasets (quality, choice of pathogens and resistances, etc.), the evaluation criteria of the pipelines, and the way these resources should be deployed in the community.
Collapse
|
8
|
"Candidatus Paraporphyromonas polyenzymogenes" encodes multi-modular cellulases linked to the type IX secretion system. MICROBIOME 2018; 6:44. [PMID: 29490697 PMCID: PMC5831590 DOI: 10.1186/s40168-018-0421-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/15/2017] [Accepted: 02/07/2018] [Indexed: 05/07/2023]
Abstract
BACKGROUND In nature, obligate herbivorous ruminants have a close symbiotic relationship with their gastrointestinal microbiome, which proficiently deconstructs plant biomass. Despite decades of research, lignocellulose degradation in the rumen has thus far been attributed to a limited number of culturable microorganisms. Here, we combine meta-omics and enzymology to identify and describe a novel Bacteroidetes family ("Candidatus MH11") composed entirely of uncultivated strains that are predominant in ruminants and only distantly related to previously characterized taxa. RESULTS The first metabolic reconstruction of Ca. MH11-affiliated genome bins, with a particular focus on the provisionally named "Candidatus Paraporphyromonas polyenzymogenes", illustrated their capacity to degrade various lignocellulosic substrates via comprehensive inventories of singular and multi-modular carbohydrate active enzymes (CAZymes). Closer examination revealed an absence of archetypical polysaccharide utilization loci found in human gut microbiota. Instead, we identified many multi-modular CAZymes putatively secreted via the Bacteroidetes-specific type IX secretion system (T9SS). This included cellulases with two or more catalytic domains, which are modular arrangements that are unique to Bacteroidetes species studied to date. Core metabolic proteins from Ca. P. polyenzymogenes were detected in metaproteomic data and were enriched in rumen-incubated plant biomass, indicating that active saccharification and fermentation of complex carbohydrates could be assigned to members of this novel family. Biochemical analysis of selected Ca. P. polyenzymogenes CAZymes further iterated the cellulolytic activity of this hitherto uncultured bacterium towards linear polymers, such as amorphous and crystalline cellulose as well as mixed linkage β-glucans. CONCLUSION We propose that Ca. P. polyenzymogene genotypes and other Ca. MH11 members actively degrade plant biomass in the rumen of cows, sheep and most likely other ruminants, utilizing singular and multi-domain catalytic CAZymes secreted through the T9SS. The discovery of a prominent role of multi-modular cellulases in the Gram-negative Bacteroidetes, together with similar findings for Gram-positive cellulosomal bacteria (Ruminococcus flavefaciens) and anaerobic fungi (Orpinomyces sp.), suggests that complex enzymes are essential and have evolved within all major cellulolytic dominions inherent to the rumen.
Collapse
|
9
|
Abstract
MOTIVATION Gene assembly is an important step in functional analysis of shotgun metagenomic data. Nonetheless, strain aware assembly remains a challenging task, as current assembly tools often fail to distinguish among strain variants or require closely related reference genomes of the studied species to be available. RESULTS We have developed Snowball, a novel strain aware gene assembler for shotgun metagenomic data that does not require closely related reference genomes to be available. It uses profile hidden Markov models (HMMs) of gene domains of interest to guide the assembly. Our assembler performs gene assembly of individual gene domains based on read overlaps and error correction using read quality scores at the same time, which results in very low per-base error rates. AVAILABILITY AND IMPLEMENTATION The software runs on a user-defined number of processor cores in parallel, runs on a standard laptop and is available under the GPL 3.0 license for installation under Linux or OS X at https://github.com/hzi-bifo/snowball CONTACT AMC14@helmholtz-hzi.de,a.schoenhuth@cwi.nl SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
|
10
|
Abstract
Gene finding is the process of identifying genome sequence regions representing stretches of DNA that encode biologically active products, such as proteins or functional noncoding RNAs. As this is usually the first step in the analysis of any novel genomic sequence or resequenced sample of well-known organisms, it is a very important issue, as all downstream analyses depend on the results. This chapter describes the biological basis for gene finding, and the programs and computational approaches that are available for the automated identification of protein-coding genes. For bacterial, archaeal, and eukaryotic genomes, as well as for multi-species sequence data originating from environmental community studies, the state of the art in automated gene finding is described.
Collapse
|
11
|
Improved metagenome assemblies and taxonomic binning using long-read circular consensus sequence data. Sci Rep 2016; 6:25373. [PMID: 27156482 PMCID: PMC4860591 DOI: 10.1038/srep25373] [Citation(s) in RCA: 109] [Impact Index Per Article: 13.6] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2015] [Accepted: 04/12/2016] [Indexed: 01/22/2023] Open
Abstract
DNA assembly is a core methodological step in metagenomic pipelines used to study the structure and function within microbial communities. Here we investigate the utility of Pacific Biosciences long and high accuracy circular consensus sequencing (CCS) reads for metagenomic projects. We compared the application and performance of both PacBio CCS and Illumina HiSeq data with assembly and taxonomic binning algorithms using metagenomic samples representing a complex microbial community. Eight SMRT cells produced approximately 94 Mb of CCS reads from a biogas reactor microbiome sample that averaged 1319 nt in length and 99.7% accuracy. CCS data assembly generated a comparative number of large contigs greater than 1 kb, to those assembled from a ~190x larger HiSeq dataset (~18 Gb) produced from the same sample (i.e approximately 62% of total contigs). Hybrid assemblies using PacBio CCS and HiSeq contigs produced improvements in assembly statistics, including an increase in the average contig length and number of large contigs. The incorporation of CCS data produced significant enhancements in taxonomic binning and genome reconstruction of two dominant phylotypes, which assembled and binned poorly using HiSeq data alone. Collectively these results illustrate the value of PacBio CCS reads in certain metagenomics applications.
Collapse
|
12
|
Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016; 17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 175] [Impact Index Per Article: 21.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022] Open
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
Collapse
|
13
|
|
14
|
Taxator-tk: precise taxonomic assignment of metagenomes by fast approximation of evolutionary neighborhoods. Bioinformatics 2015; 31:817-24. [PMID: 25388150 PMCID: PMC4380030 DOI: 10.1093/bioinformatics/btu745] [Citation(s) in RCA: 91] [Impact Index Per Article: 10.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2014] [Revised: 11/04/2014] [Accepted: 11/05/2014] [Indexed: 01/17/2023] Open
Abstract
MOTIVATION Metagenomics characterizes microbial communities by random shotgun sequencing of DNA isolated directly from an environment of interest. An essential step in computational metagenome analysis is taxonomic sequence assignment, which allows identifying the sequenced community members and reconstructing taxonomic bins with sequence data for the individual taxa. For the massive datasets generated by next-generation sequencing technologies, this cannot be performed with de-novo phylogenetic inference methods. We describe an algorithm and the accompanying software, taxator-tk, which performs taxonomic sequence assignment by fast approximate determination of evolutionary neighbors from sequence similarities. RESULTS Taxator-tk was precise in its taxonomic assignment across all ranks and taxa for a range of evolutionary distances and for short as well as for long sequences. In addition to the taxonomic binning of metagenomes, it is well suited for profiling microbial communities from metagenome samples because it identifies bacterial, archaeal and eukaryotic community members without being affected by varying primer binding strengths, as in marker gene amplification, or copy number variations of marker genes across different taxa. Taxator-tk has an efficient, parallelized implementation that allows the assignment of 6 Gb of sequence data per day on a standard multiprocessor system with 10 CPU cores and microbial RefSeq as the genomic reference data.
Collapse
|
15
|
Inference of genotype-phenotype relationships in the antigenic evolution of human influenza A (H3N2) viruses. PLoS Comput Biol 2012; 8:e1002492. [PMID: 22532796 PMCID: PMC3330098 DOI: 10.1371/journal.pcbi.1002492] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2011] [Accepted: 03/09/2012] [Indexed: 01/05/2023] Open
Abstract
Distinguishing mutations that determine an organism's phenotype from (near-) neutral ‘hitchhikers’ is a fundamental challenge in genome research, and is relevant for numerous medical and biotechnological applications. For human influenza viruses, recognizing changes in the antigenic phenotype and a strains' capability to evade pre-existing host immunity is important for the production of efficient vaccines. We have developed a method for inferring ‘antigenic trees’ for the major viral surface protein hemagglutinin. In the antigenic tree, antigenic weights are assigned to all tree branches, which allows us to resolve the antigenic impact of the associated amino acid changes. Our technique predicted antigenic distances with comparable accuracy to antigenic cartography. Additionally, it identified both known and novel sites, and amino acid changes with antigenic impact in the evolution of influenza A (H3N2) viruses from 1968 to 2003. The technique can also be applied for inference of ‘phenotype trees’ and genotype–phenotype relationships from other types of pairwise phenotype distances. The molecular evolution of any organism is described by changes in the genotype resulting from genetic drift or selection to maintain or establish fitness under the given environmental conditions. Identification of phenotype-defining changes and their distinction from (near-) neutral (‘hitchhikers’) ones is a fundamental challenge in genome research. The standard approach involves time- and cost-intensive mutation experiments, which are typically low throughput, due to their experimental nature. We have developed a computational method for the inference of phenotypic impact of genotypic changes that is applicable to any system, within or across species, where homologous genetic sequences and associated pairwise phenotype distances are available. We demonstrate the accuracy of our method by application to the human influenza A (H3N2) virus. This exemplary system is of particular interest, as recognizing changes in the antigenic phenotype and a viral strains' capability to evade pre-existing host immunity is important for the production of efficient vaccines. We accurately identified known sites and amino acid changes with antigenic impact over 35 years of evolution, and provide further details on individual antigenically relevant changes in the evolution of influenza A (H3N2) viruses.
Collapse
|
16
|
The impact of seasonal and year-round transmission regimes on the evolution of influenza A virus. Proc Biol Sci 2010; 278:2249-56. [PMID: 21177678 DOI: 10.1098/rspb.2010.2191] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Punctuated antigenic change is believed to be a key element in the evolution of influenza A; clusters of antigenically similar strains predominate worldwide for several years until an antigenically distant mutant emerges and instigates a selective sweep. It is thought that a region of East-Southeast Asia with year-round transmission acts as a source of antigenic diversity for influenza A and seasonal epidemics in temperate regions make little contribution to antigenic evolution. We use a mathematical model to examine how different transmission regimes affect the evolutionary dynamics of influenza over the lifespan of an antigenic cluster. Our model indicates that, in non-seasonal regions, mutants that cause significant outbreaks appear before the peak of the wild-type epidemic. A relatively large proportion of these mutants spread globally. In seasonal regions, mutants that cause significant local outbreaks appear each year before the seasonal peak of the wild-type epidemic, but only a small proportion spread globally. The potential for global spread is strongly influenced by the intensity of non-seasonal circulation and coupling between non-seasonal and seasonal regions. Results are similar if mutations are neutral, or confer a weak to moderate antigenic advantage. However, there is a threshold antigenic advantage, depending on the non-seasonal transmission intensity, beyond which mutants can escape herd immunity in the non-seasonal region and there is a global explosion in diversity. We conclude that non-seasonal transmission regions are fundamental to the generation and maintenance of influenza diversity owing to their epidemiology. More extensive sampling of viral diversity in such regions could facilitate earlier identification of antigenically novel strains and extend the critical window for vaccine development.
Collapse
|
17
|
Abstract
Phylodynamic techniques combine epidemiological and genetic information to analyze the evolutionary and spatiotemporal dynamics of rapidly evolving pathogens, such as influenza A or human immunodeficiency viruses. We introduce ‘allele dynamics plots’ (AD plots) as a method for visualizing the evolutionary dynamics of a gene in a population. Using AD plots, we propose how to identify the alleles that are likely to be subject to directional selection. We analyze the method’s merits with a detailed study of the evolutionary dynamics of seasonal influenza A viruses. AD plots for the major surface protein of seasonal influenza A (H3N2) and the 2009 swine-origin influenza A (H1N1) viruses show the succession of substitutions that became fixed in the evolution of the two viral populations. They also allow the early identification of those viral strains that later rise to predominance, which is important for the problem of vaccine strain selection. In summary, we describe a technique that reveals the evolutionary dynamics of a rapidly evolving population and allows us to identify alleles and associated genetic changes that might be under directional selection. The method can be applied for the study of influenza A viruses and other rapidly evolving species or viruses.
Collapse
|
18
|
Abstract
Influenza A virus causes annual epidemics and occasional pandemics of short-term respiratory infections associated with considerable morbidity and mortality. The pandemics occur when new human-transmissible viruses that have the major surface protein of influenza A viruses from other host species are introduced into the human population. Between such rare events, the evolution of influenza is shaped by antigenic drift: the accumulation of mutations that result in changes in exposed regions of the viral surface proteins. Antigenic drift makes the virus less susceptible to immediate neutralization by the immune system in individuals who have had a previous influenza infection or vaccination. A biannual reevaluation of the vaccine composition is essential to maintain its effectiveness due to this immune escape. The study of influenza genomes is key to this endeavor, increasing our understanding of antigenic drift and enhancing the accuracy of vaccine strain selection. Recent large-scale genome sequencing and antigenic typing has considerably improved our understanding of influenza evolution: epidemics around the globe are seeded from a reservoir in East-Southeast Asia with year-round prevalence of influenza viruses; antigenically similar strains predominate in epidemics worldwide for several years before being replaced by a new antigenic cluster of strains. Future in-depth studies of the influenza reservoir, along with large-scale data mining of genomic resources and the integration of epidemiological, genomic, and antigenic data, should enhance our understanding of antigenic drift and improve the detection and control of antigenically novel emerging strains.
Collapse
|
19
|
Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 2007; 4:63-72. [PMID: 17179938 DOI: 10.1038/nmeth976] [Citation(s) in RCA: 381] [Impact Index Per Article: 22.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2006] [Accepted: 10/13/2006] [Indexed: 11/08/2022]
Abstract
Metagenome studies have retrieved vast amounts of sequence data from a variety of environments leading to new discoveries and insights into the uncultured microbial world. Except for very simple communities, the encountered diversity has made fragment assembly and the subsequent analysis a challenging problem. A taxonomic characterization of metagenomic fragments is required for a deeper understanding of shotgun-sequenced microbial communities, but success has mostly been limited to sequences containing phylogenetic marker genes. Here we present PhyloPythia, a composition-based classifier that combines higher-level generic clades from a set of 340 completed genomes with sample-derived population models. Extensive analyses on synthetic and real metagenome data sets showed that PhyloPythia allows the accurate classification of most sequence fragments across all considered taxonomic ranks, even for unknown organisms. The method requires no more than 100 kb of training sequence for the creation of accurate models of sample-specific populations and can assign fragments >or=1 kb with high specificity.
Collapse
|
20
|
Abstract
Metagenome studies have retrieved vast amounts of sequence data from a variety of environments leading to new discoveries and insights into the uncultured microbial world. Except for very simple communities, the encountered diversity has made fragment assembly and the subsequent analysis a challenging problem. A taxonomic characterization of metagenomic fragments is required for a deeper understanding of shotgun-sequenced microbial communities, but success has mostly been limited to sequences containing phylogenetic marker genes. Here we present PhyloPythia, a composition-based classifier that combines higher-level generic clades from a set of 340 completed genomes with sample-derived population models. Extensive analyses on synthetic and real metagenome data sets showed that PhyloPythia allows the accurate classification of most sequence fragments across all considered taxonomic ranks, even for unknown organisms. The method requires no more than 100 kb of training sequence for the creation of accurate models of sample-specific populations and can assign fragments >or=1 kb with high specificity.
Collapse
|