1
SPIRE: a Searchable, Planetary-scale mIcrobiome REsource. Nucleic Acids Res 2024; 52:D777-D783. [PMID: 37897342 PMCID: PMC10767986 DOI: 10.1093/nar/gkad943] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Revised: 10/01/2023] [Accepted: 10/11/2023] [Indexed: 10/30/2023] Open
Abstract
Meta'omic data on microbial diversity and function accrue exponentially in public repositories, but derived information is often siloed according to data type, study or sampled microbial environment. Here we present SPIRE, a Searchable Planetary-scale mIcrobiome REsource that integrates various consistently processed metagenome-derived microbial data modalities across habitats, geography and phylogeny. SPIRE encompasses 99 146 metagenomic samples from 739 studies covering a wide array of microbial environments and augmented with manually-curated contextual data. Across a total metagenomic assembly of 16 Tbp, SPIRE comprises 35 billion predicted protein sequences and 1.16 million newly constructed metagenome-assembled genomes (MAGs) of medium or high quality. Beyond mapping to the high-quality genome reference provided by proGenomes3 (http://progenomes.embl.de), these novel MAGs form 92 134 novel species-level clusters, the majority of which are unclassified at species level using current tools. SPIRE enables taxonomic profiling of these species clusters via an updated, custom mOTUs database (https://motu-tool.org/) and includes several layers of functional annotation, as well as crosslinks to several (micro-)biological databases. The resource is accessible, searchable and browsable via http://spire.embl.de.
2
Challenges and opportunities in sharing microbiome data and analyses. Nat Microbiol 2023; 8:1960-1970. [PMID: 37783751 DOI: 10.1038/s41564-023-01484-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/16/2021] [Accepted: 08/28/2023] [Indexed: 10/04/2023]
Abstract
Microbiome data, metadata and analytical workflows have become 'big' in terms of volume and complexity. Although the infrastructure and technologies to share data have been established, the interdisciplinary and multi-omic nature of the field can make resources difficult to identify and use. Following best practices for data deposition requires substantial effort, with sometimes little obvious reward. Gaps remain where microbiome-specific resources for data sharing or reproducibility do not yet exist. We outline available best practices, challenges to their adoption and opportunities in data sharing in microbiome research. We showcase examples of best practices and advocate for their enforcement and incentivization for data sharing. This includes recognition of data curation and sharing endeavours by individuals, institutions, journals and funders. Opportunities for progress include enabling microbiome-specific databases to incorporate future methods for data analysis, integration and reuse.
3
VIRify: An integrated detection, annotation and taxonomic classification pipeline using virus-specific protein profile hidden Markov models. PLoS Comput Biol 2023; 19:e1011422. [PMID: 37639475 PMCID: PMC10491390 DOI: 10.1371/journal.pcbi.1011422] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/24/2022] [Revised: 09/08/2023] [Accepted: 08/09/2023] [Indexed: 08/31/2023] Open
Abstract
The study of viral communities has revealed the enormous diversity and impact these biological entities have on various ecosystems. These observations have sparked widespread interest in developing computational strategies that support the comprehensive characterisation of viral communities based on sequencing data. Here we introduce VIRify, a new computational pipeline designed to provide a user-friendly and accurate functional and taxonomic characterisation of viral communities. VIRify identifies viral contigs and prophages from metagenomic assemblies and annotates them using a collection of viral profile hidden Markov models (HMMs). These include our manually-curated profile HMMs, which serve as specific taxonomic markers for a wide range of prokaryotic and eukaryotic viral taxa and are thus used to reliably classify viral contigs. We tested VIRify on assemblies from two microbial mock communities, a large metagenomics study, and a collection of publicly available viral genomic sequences from the human gut. The results showed that VIRify could identify sequences from both prokaryotic and eukaryotic viruses, and provided taxonomic classifications from the genus to the family rank with an average accuracy of 86.6%. In addition, VIRify allowed the detection and taxonomic classification of a range of prokaryotic and eukaryotic viruses present in 243 marine metagenomic assemblies. Finally, the use of VIRify led to a large expansion in the number of taxonomically classified human gut viral sequences and the improvement of outdated and shallow taxonomic classifications. Overall, we demonstrate that VIRify is a novel and powerful resource that offers an enhanced capability to detect a broad range of viral contigs and taxonomically classify them.
4
MGnify Genomes: A Resource for Biome-specific Microbial Genome Catalogues. J Mol Biol 2023; 435:168016. [PMID: 36806692 PMCID: PMC10318097 DOI: 10.1016/j.jmb.2023.168016] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 12/14/2022] [Revised: 02/07/2023] [Accepted: 02/12/2023] [Indexed: 02/18/2023]
Abstract
An increasingly common output arising from the analysis of shotgun metagenomic datasets is the generation of metagenome-assembled genomes (MAGs), with tens of thousands of MAGs now described in the literature. However, the discovery and comparison of these MAG collections is hampered by the lack of uniformity in their generation, annotation and storage. To address this, we have developed MGnify Genomes, a growing collection of biome-specific non-redundant microbial genome catalogues generated using MAGs and publicly available isolate genomes. Genomes within a biome-specific catalogue are organised into species clusters. For species that contain multiple conspecific genomes, the highest quality genome is selected as the representative, always prioritising an isolate genome over a MAG. The species representative sequences and annotations can be visualised on the MGnify website and the full catalogue and associated analysis outputs can be downloaded from MGnify servers. A suite of online search tools is provided allowing users to compare their own sequences, ranging from a gene to sets of genomes, against the catalogues. Seven biomes are available currently, comprising over 300,000 genomes that represent 11,048 non-redundant species, and include 36 taxonomic classes not currently represented by cultured genomes. MGnify Genomes is available at https://www.ebi.ac.uk/metagenomics/browse/genomes/.
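The representative-selection rule described in this abstract (highest-quality genome per species cluster, with isolates always preferred over MAGs) can be sketched as follows. This is a minimal illustration, not the MGnify implementation: the `Genome` record, the accession strings and the quality score (completeness minus five times contamination) are all assumptions introduced here, since the abstract does not say how quality is measured.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Genome:
    """Hypothetical genome record; field names are illustrative only."""
    accession: str
    is_isolate: bool      # True for an isolate genome, False for a MAG
    completeness: float   # percent
    contamination: float  # percent


def quality_score(genome: Genome) -> float:
    # Assumed scoring rule (completeness - 5 * contamination).
    # The abstract does not define "quality"; this is a stand-in.
    return genome.completeness - 5.0 * genome.contamination


def pick_representative(cluster: Sequence[Genome]) -> Genome:
    """Return the cluster representative: isolates first, then best score."""
    if not cluster:
        raise ValueError("species cluster is empty")
    # Tuples compare element by element, so any isolate (True) outranks
    # any MAG (False) before the quality score is considered.
    return max(cluster, key=lambda g: (g.is_isolate, quality_score(g)))


example_cluster = [
    Genome("MAG_A", False, 99.0, 0.5),  # score 96.5, but a MAG
    Genome("ISO_B", True, 95.0, 1.0),   # score 90.0
    Genome("ISO_C", True, 97.0, 0.2),   # score 96.0
]
representative = pick_representative(example_cluster)
```

In this made-up cluster the MAG has the highest score, yet an isolate is still chosen, which is the behaviour the abstract describes as "always prioritising an isolate genome over a MAG".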
5
Biochemical characterisation of Cytochrome P450 oxidoreductase from the cattle tick, Rhipicephalus microplus, highlights potential new acaricide target. Ticks Tick Borne Dis 2023; 14:102148. [PMID: 36905815 DOI: 10.1016/j.ttbdis.2023.102148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2022] [Revised: 02/23/2023] [Accepted: 02/23/2023] [Indexed: 03/12/2023]
Abstract
Management of the cattle tick, Rhipicephalus microplus, presents a challenge because some populations of this cosmopolitan and economically important ectoparasite are resistant to multiple classes of acaricides. Cytochrome P450 oxidoreductase (CPR) is part of the cytochrome P450 (CYP450) monooxygenases that are involved in metabolic resistance by their ability to detoxify acaricides. Inhibiting CPR, the sole redox partner that transfers electrons to CYP450s, could overcome this type of metabolic resistance. This report represents the biochemical characterisation of a CPR from ticks. Recombinant CPR of R. microplus (RmCPR), minus its N-terminal transmembrane domain, was produced in a bacterial expression system and subjected to biochemical analyses. RmCPR displayed a characteristic dual flavin oxidoreductase spectrum. Incubation with nicotinamide adenine dinucleotide phosphate (NADPH) led to an increase in absorbance between 500 and 600 nm with a corresponding appearance of a peak absorbance at 340-350 nm, indicating functional transfer of electrons between NADPH and the bound flavin cofactors. Using the pseudoredox partner, kinetic parameters for both cytochrome c and NADPH binding were calculated as 26.6 ± 11.4 µM and 7.03 ± 1.8 µM, respectively. The turnover, kcat, for RmCPR for cytochrome c was calculated as 0.08 s⁻¹, which is significantly lower than the CPR homologues of other species. IC50 (half maximal inhibitory concentration) values obtained for the adenosine analogues 2',5'-ADP, 2'-AMP, NADP+ and the reductase inhibitor diphenyliodonium were 140, 82.2, 24.5, and 75.3 µM, respectively. Biochemically, RmCPR resembles CPRs of hematophagous arthropods more so than mammalian CPRs. These findings highlight the potential of RmCPR as a target for the rational design of safer and potent acaricides against R. microplus.
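To show how kinetic parameters of this kind are typically used, the sketch below evaluates the Michaelis-Menten rate law, v/[E] = kcat·[S]/(Km + [S]). It assumes that the 26.6 µM value reported for cytochrome c is a Michaelis constant (Km); the abstract only calls it a "kinetic parameter", so that reading is an assumption, and the function name is illustrative.

```python
def turnover_rate(substrate_um: float, km_um: float, kcat_per_s: float) -> float:
    """Per-enzyme reaction rate (s^-1) under Michaelis-Menten kinetics.

    substrate_um: substrate concentration in micromolar
    km_um:        Michaelis constant in micromolar (assumed interpretation)
    kcat_per_s:   turnover number in s^-1
    """
    if substrate_um < 0:
        raise ValueError("substrate concentration must be non-negative")
    if km_um <= 0:
        raise ValueError("Km must be positive")
    return kcat_per_s * substrate_um / (km_um + substrate_um)


# Values taken from the abstract; treating 26.6 uM as Km is an assumption.
KM_CYT_C_UM = 26.6
KCAT_PER_S = 0.08

# At [S] = Km the rate is half of kcat by definition of Km.
rate_at_km = turnover_rate(KM_CYT_C_UM, KM_CYT_C_UM, KCAT_PER_S)
```

Under that assumption, the rate at a cytochrome c concentration equal to Km is half the reported kcat, and it approaches kcat only at substrate concentrations far above Km.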
6
Staphylococcal diversity in atopic dermatitis from an individual to a global scale. Cell Host Microbe 2023; 31:578-592.e6. [PMID: 37054678 PMCID: PMC10151067 DOI: 10.1016/j.chom.2023.03.010] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 10/17/2022] [Revised: 12/08/2022] [Accepted: 03/10/2023] [Indexed: 04/15/2023]
Abstract
Atopic dermatitis (AD) is a multifactorial, chronic relapsing disease associated with genetic and environmental factors. Among skin microbes, Staphylococcus aureus and Staphylococcus epidermidis are associated with AD, but how genetic variability and staphylococcal strains shape the disease remains unclear. We investigated the skin microbiome of an AD cohort (n = 54) as part of a prospective natural history study using shotgun metagenomic and whole genome sequencing, which we analyzed alongside publicly available data (n = 473). AD status and global geographical regions exhibited associations with strains and genomic loci of S. aureus and S. epidermidis. In addition, antibiotic prescribing patterns and within-household transmission between siblings shaped colonizing strains. Comparative genomics determined that S. aureus AD strains were enriched in virulence factors, whereas S. epidermidis AD strains varied in genes involved in interspecies interactions and metabolism. In both species, staphylococcal interspecies genetic transfer shaped gene content. These findings reflect the staphylococcal genomic diversity and dynamics associated with AD.
7
Exploring microbial functional biodiversity at the protein family level-From metagenomic sequence reads to annotated protein clusters. Frontiers in Bioinformatics 2023; 3:1157956. [PMID: 36959975 PMCID: PMC10029925 DOI: 10.3389/fbinf.2023.1157956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2023] [Accepted: 02/21/2023] [Indexed: 03/06/2023] Open
Abstract
Metagenomics has enabled accessing the genetic repertoire of natural microbial communities. Metagenome shotgun sequencing has become the method of choice for studying and classifying microorganisms from various environments. To this end, several methods have been developed to process and analyze the sequence data from raw reads to end-products such as predicted protein sequences or families. In this article, we provide a thorough review to simplify such processes and discuss the alternative methodologies that can be followed in order to explore biodiversity at the protein family level. We provide details for analysis tools and we comment on their scalability as well as their advantages and disadvantages. Finally, we report the available data repositories and recommend various approaches for protein family annotation related to phylogenetic distribution, structure prediction and metadata enrichment.
8
metaGOflow: a workflow for the analysis of marine Genomic Observatories shotgun metagenomics data. Gigascience 2023; 12:giad078. [PMID: 37850871 PMCID: PMC10583283 DOI: 10.1093/gigascience/giad078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2023] [Revised: 06/30/2023] [Accepted: 09/11/2023] [Indexed: 10/19/2023] Open
Abstract
BACKGROUND Genomic Observatories (GOs) are sites of long-term scientific study that undertake regular assessments of the genomic biodiversity. The European Marine Omics Biodiversity Observation Network (EMO BON) is a network of GOs that conduct regular biological community samplings to generate environmental and metagenomic data of microbial communities from designated marine stations around Europe. The development of an effective workflow is essential for the analysis of the EMO BON metagenomic data in a timely and reproducible manner. FINDINGS Based on the established MGnify resource, we developed metaGOflow. metaGOflow supports the fast inference of taxonomic profiles from GO-derived data based on ribosomal RNA genes and their functional annotation using the raw reads. Thanks to the Research Object Crate packaging, relevant metadata about the sample under study, and the details of the bioinformatics analysis it has been subjected to, are inherited to the data product while its modular implementation allows running the workflow partially. The analysis of 2 EMO BON samples and 1 Tara Oceans sample was performed as a use case. CONCLUSIONS metaGOflow is an efficient and robust workflow that scales to the needs of projects producing big metagenomic data such as EMO BON. It highlights how containerization technologies along with modern workflow languages and metadata package approaches can support the needs of researchers when dealing with ever-increasing volumes of biological data. Despite being initially oriented to address the needs of EMO BON, metaGOflow is a flexible and easy-to-use workflow that can be broadly used for one-sample-at-a-time analysis of shotgun metagenomics data.
9
Screening of global microbiomes implies ecological boundaries impacting the distribution and dissemination of clinically relevant antimicrobial resistance genes. Commun Biol 2022; 5:1217. [DOI: 10.1038/s42003-022-04187-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2022] [Accepted: 10/28/2022] [Indexed: 11/19/2022] Open
Abstract
Understanding the myriad pathways by which antimicrobial-resistance genes (ARGs) spread across biomes is necessary to counteract the global menace of antimicrobial resistance. We screened 17939 assembled metagenomic samples covering 21 biomes, differing in sequencing quality and depth, unevenly across 46 countries, 6 continents, and 14 years (2005-2019) for clinically crucial ARGs, mobile colistin resistance (mcr), carbapenem resistance (CR), and (extended-spectrum) beta-lactamase (ESBL and BL) genes. These ARGs were most frequent in human gut, oral and skin biomes, followed by anthropogenic (wastewater, bioreactor, compost, food), and natural biomes (freshwater, marine, sediment). Mcr-9 was the most prevalent mcr gene, spatially and temporally; blaOXA-233 and blaTEM-1 were the most prevalent CR and BL/ESBL genes, but blaGES-2 and blaTEM-116 showed the widest distribution. Redundancy analysis and Bayesian analysis showed ARG distribution was non-random and best-explained by potential host genera and biomes, followed by collection year, anthropogenic factors and collection countries. Preferential ARG occurrence, and potential transmission, between characteristically similar biomes indicate strong ecological boundaries. Our results provide a high-resolution global map of ARG distribution and importantly, identify checkpoint biomes wherein interventions aimed at disrupting ARGs dissemination are likely to be most effective in reducing dissemination and in the long term, the ARG global burden.
10
A machine learning framework for discovery and enrichment of metagenomics metadata from open access publications. Gigascience 2022; 11:6661050. [PMID: 35950838 PMCID: PMC9366992 DOI: 10.1093/gigascience/giac077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2022] [Revised: 06/13/2022] [Accepted: 07/12/2022] [Indexed: 11/17/2022] Open
Abstract
Metagenomics is a culture-independent method for studying the microbes inhabiting a particular environment. Comparing the composition of samples (functionally/taxonomically), either from a longitudinal study or cross-sectional studies, can provide clues into how the microbiota has adapted to the environment. However, a recurring challenge, especially when comparing results between independent studies, is that key metadata about the sample and molecular methods used to extract and sequence the genetic material are often missing from sequence records, making it difficult to account for confounding factors. Nevertheless, these missing metadata may be found in the narrative of publications describing the research. Here, we describe a machine learning framework that automatically extracts essential metadata for a wide range of metagenomics studies from the literature contained in Europe PMC. This framework has enabled the extraction of metadata from 114,099 publications in Europe PMC, including 19,900 publications describing metagenomics studies in European Nucleotide Archive (ENA) and MGnify. Using this framework, a new metagenomics annotations pipeline was developed and integrated into Europe PMC to regularly enrich up-to-date ENA and MGnify metagenomics studies with metadata extracted from research articles. These metadata are now available for researchers to explore and retrieve in the MGnify and Europe PMC websites, as well as Europe PMC annotations API.
11
Unifying the known and unknown microbial coding sequence space. eLife 2022; 11:67667. [PMID: 35356891 PMCID: PMC9132574 DOI: 10.7554/elife.67667] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 02/18/2021] [Accepted: 03/30/2022] [Indexed: 12/02/2022] Open
Abstract
Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40–60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction into analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS and a demonstration on how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal genomes, we quantify the extent of the unknown fraction, its diversity, and its relevance across multiple organisms and environments. The unknown sequence space is exceptionally diverse, phylogenetically more conserved than the known fraction and predominantly taxonomically restricted at the species level. From the 71 M genes identified to be of unknown function, we compiled a collection of 283,874 lineage-specific genes of unknown function for Cand. Patescibacteria (also known as Candidate Phyla Radiation, CPR), which provides a significant resource to expand our understanding of their unusual biology. Finally, by identifying a target gene of unknown function for antibiotic resistance, we demonstrate how we can enable the generation of hypotheses that can be used to augment experimental data.
It is estimated that scientists do not know what half of microbial genes actually do. When these genes are discovered in microorganisms grown in the lab or found in environmental samples, it is not possible to identify what their roles are. Many of these genes are excluded from further analyses for these reasons, meaning that the study of microbial genes tends to be limited to genes that have already been described. These limitations hinder research into microbiology, because information from newly discovered genes cannot be integrated to better understand how these organisms work.
Experiments to understand what role these genes have in the microorganisms are labor-intensive, so new analytical strategies are needed. To do this, Vanni et al. developed a new framework to categorize genes with unknown roles, and a computational workflow to integrate them into traditional analyses. When this approach was applied to over 400 million microbial genes (both with known and unknown roles), it showed that the share of genes with unknown functions is only about 30 per cent, smaller than previously thought. The analysis also showed that these genes are very diverse, revealing a huge space for future research and potential applications. Combining their approach with experimental data, Vanni et al. were able to identify a gene with a previously unknown purpose that could be involved in antibiotic resistance. This system could be useful for other scientists studying microorganisms to get a more complete view of microbial systems. In future, it may also be used to analyze the genetics of other organisms, such as plants and animals.
12
A mouse model of occult intestinal colonization demonstrating antibiotic-induced outgrowth of carbapenem-resistant Enterobacteriaceae. Microbiome 2022; 10:43. [PMID: 35272717 PMCID: PMC8908617 DOI: 10.1186/s40168-021-01207-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 10/23/2021] [Accepted: 12/06/2021] [Indexed: 05/29/2023]
Abstract
BACKGROUND The human intestinal microbiome is a complex community that contributes to host health and disease. In addition to normal microbiota, pathogens like carbapenem-resistant Enterobacteriaceae may be asymptomatically present. When these bacteria are present at very low levels, they are often undetectable in hospital surveillance cultures, known as occult or subclinical colonization. Through the receipt of antibiotics, these subclinical pathogens can increase to sufficiently high levels to become detectable, in a process called outgrowth. However, little is known about the interaction between gut microbiota and Enterobacteriaceae during occult colonization and outgrowth. RESULTS We developed a clinically relevant mouse model for studying occult colonization. Conventional wild-type mice without antibiotic pre-treatment were exposed to Klebsiella pneumoniae but rapidly tested negative for colonization. This occult colonization was found to perturb the microbiome as detected by both 16S rRNA amplicon and shotgun metagenomic sequencing. Outgrowth of occult K. pneumoniae was induced either by a four-antibiotic cocktail or by individual receipt of ampicillin, vancomycin, or azithromycin, which all reduced overall microbial diversity. Notably, vancomycin was shown to trigger K. pneumoniae outgrowth in only a subset of exposed animals (outgrowth-susceptible). To identify factors that underlie outgrowth susceptibility, we analyzed microbiome-encoded gene functions and were able to classify outgrowth-susceptible microbiomes using pathways associated with mRNA stability. Lastly, an evolutionary approach illuminated the importance of xylose metabolism in K. pneumoniae colonization, supporting xylose abundance as a second susceptibility indicator. We showed that our model is generalizable to other pathogens, including carbapenem-resistant Escherichia coli and Enterobacter cloacae. 
CONCLUSIONS Our modeling of occult colonization and outgrowth could help the development of strategies to mitigate the risk of subsequent infection and transmission in medical facilities and the wider community. This study suggests that microbiota mRNA and small-molecule metabolites may be used to predict outgrowth-susceptibility.
13
Integrating cultivation and metagenomics for a multi-kingdom view of skin microbiome diversity and functions. Nat Microbiol 2022; 7:169-179. [PMID: 34952941 PMCID: PMC8732310 DOI: 10.1038/s41564-021-01011-w] [Citation(s) in RCA: 49] [Impact Index Per Article: 24.5] [Received: 10/16/2020] [Accepted: 10/28/2021] [Indexed: 12/23/2022]
Abstract
Human skin functions as a physical barrier to foreign pathogen invasion and houses numerous commensals. Shifts in the human skin microbiome have been associated with conditions ranging from acne to atopic dermatitis. Previous metagenomic investigations into the role of the skin microbiome in health or disease have found that much of the sequenced data do not match reference genomes, making it difficult to interpret metagenomic datasets. We combined bacterial cultivation and metagenomic sequencing to assemble the Skin Microbial Genome Collection (SMGC), which comprises 622 prokaryotic species derived from 7,535 metagenome-assembled genomes and 251 isolate genomes. The metagenomic datasets that we generated were combined with publicly available skin metagenomic datasets to identify members and functions of the human skin microbiome. The SMGC collection includes 174 newly identified bacterial species and 12 newly identified bacterial genera, including the abundant genus 'Candidatus Pellibacterium', which has been newly associated with the skin. The SMGC increases the characterized set of known skin bacteria by 26%. We validated the SMGC metagenome-assembled genomes by comparing them with sequenced isolates obtained from the same samples. We also recovered 12 eukaryotic species and assembled thousands of viral sequences, including newly identified clades of jumbo phages. The SMGC enables classification of a median of 85% of skin metagenomic sequences and provides a comprehensive view of skin microbiome diversity, derived primarily from samples obtained in North America.
14
Abstract
The human gut microbiome plays an important role in health, but its archaeal diversity remains largely unexplored. In the present study, we report the analysis of 1,167 nonredundant archaeal genomes (608 high-quality genomes) recovered from human gastrointestinal tract, sampled across 24 countries and rural and urban populations. We identified previously undescribed taxa including 3 genera, 15 species and 52 strains. Based on distinct genomic features, we justify the split of the Methanobrevibacter smithii clade into two separate species, with one represented by the previously undescribed 'Candidatus Methanobrevibacter intestini'. Patterns derived from 28,581 protein clusters showed significant associations with sociodemographic characteristics such as age groups and lifestyle. We additionally show that archaea are characterized by specific genomic and functional adaptations to the host and carry a complex virome. Our work expands our current understanding of the human archaeome and provides a large genome catalogue for future analyses to decipher its impact on human physiology.
15
Metagenomics approach for Polymyxa betae genome assembly enables comparative analysis towards deciphering the intracellular parasitic lifestyle of the plasmodiophorids. Genomics 2021; 114:9-22. [PMID: 34798282 DOI: 10.1016/j.ygeno.2021.11.018] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/29/2021] [Revised: 06/24/2021] [Accepted: 11/10/2021] [Indexed: 12/28/2022]
Abstract
Genomic knowledge of the tree of life is biased to specific groups of organisms. For example, only six full genomes are currently available in the Rhizaria clade. Here, we have applied metagenomic techniques enabling the assembly of the genome of Polymyxa betae (Rhizaria, Plasmodiophorida) RES F41 isolate from unpurified zoospore holobiont and comparison with the A26-41 isolate. Furthermore, the first P. betae mitochondrial genome was assembled. The two P. betae nuclear genomes were highly similar, each with just ~10.2 k predicted protein coding genes, ~3% of which were unique to each isolate. Extending genomic comparisons revealed a greater overlap with Spongospora subterranea than with Plasmodiophora brassicae, including orthologs of the mammalian cation channel sperm-associated proteins, raising some intriguing questions about zoospore physiology. This work validates our metagenomics pipeline for eukaryote genome assembly from unpurified samples and enriches plasmodiophorid genomics, providing the first full annotation of the P. betae genome.
16
Ensembl Genomes 2022: an expanding genome resource for non-vertebrates. Nucleic Acids Res 2021; 50:D996-D1003. [PMID: 34791415 PMCID: PMC8728113 DOI: 10.1093/nar/gkab1007] [Citation(s) in RCA: 94] [Impact Index Per Article: 31.3] [Received: 09/21/2021] [Revised: 10/07/2021] [Accepted: 11/10/2021] [Indexed: 11/28/2022] Open
Abstract
Ensembl Genomes (https://www.ensemblgenomes.org) provides access to non-vertebrate genomes and analysis complementing vertebrate resources developed by the Ensembl project (https://www.ensembl.org). The two resources collectively present genome annotation through a consistent set of interfaces spanning the tree of life presenting genome sequence, annotation, variation, transcriptomic data and comparative analysis. Here, we present our largest increase in plant, metazoan and fungal genomes since the project's inception creating one of the world's most comprehensive genomic resources and describe our efforts to reduce genome redundancy in our Bacteria portal. We detail our new efforts in gene annotation, our emerging support for pangenome analysis, our efforts to accelerate data dissemination through the Ensembl Rapid Release resource and our new AlphaFold visualization. Finally, we present details of our future plans including updates on our integration with Ensembl, and how we plan to improve our support for the microbial research community. Software and data are made available without restriction via our website, online tools platform and programmatic interfaces (available under an Apache 2.0 license). Data updates are synchronised with Ensembl's release cycle.
|
17
|
Abstract
Non-coding RNAs are essential for all life and carry out a wide range of functions. Information about these molecules is distributed across dozens of specialized resources. RNAcentral is a database of non-coding RNA sequences that provides a unified access point to non-coding RNA annotations from >40 member databases and helps provide insight into the function of these RNAs. This article describes different ways of accessing the data, including searching the website and retrieving the data programmatically over web APIs and a public database. We also demonstrate an example Galaxy workflow for using RNAcentral for RNA-seq differential expression analysis. RNAcentral is available at https://rnacentral.org. © 2020 The Authors.
Basic Protocol 1: Viewing RNAcentral sequence reports
Basic Protocol 2: Using RNAcentral text search to explore ncRNA sequences
Basic Protocol 3: Using RNAcentral sequence search
Basic Protocol 4: Using RNAcentral FTP archive
Support Protocol 1: Using web APIs for programmatic data access
Support Protocol 2: Using public Postgres database to export large datasets
Support Protocol 3: Analyze non-coding RNA in RNA-seq datasets using RNAcentral and Galaxy
|
18
|
Erratum to: Predicted Input of Uncultured Fungal Symbionts to a Lichen Symbiosis from Metagenome-Assembled Genomes. Genome Biol Evol 2021; 13:6312529. [PMID: 34196707 PMCID: PMC8247553 DOI: 10.1093/gbe/evab129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022] Open
|
19
|
R2DT is a framework for predicting and visualising RNA secondary structure using templates. Nat Commun 2021; 12:3494. [PMID: 34108470 PMCID: PMC8190129 DOI: 10.1038/s41467-021-23555-5] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 09/28/2020] [Accepted: 05/04/2021] [Indexed: 02/05/2023] Open
Abstract
Non-coding RNAs (ncRNA) are essential for all life, and their functions often depend on their secondary (2D) and tertiary structure. Despite the abundance of software for the visualisation of ncRNAs, few automatically generate consistent and recognisable 2D layouts, which makes it challenging for users to construct, compare and analyse structures. Here, we present R2DT, a method for predicting and visualising a wide range of RNA structures in standardised layouts. R2DT is based on a library of 3,647 templates representing the majority of known structured RNAs. R2DT has been applied to ncRNA sequences from the RNAcentral database and produced >13 million diagrams, creating the world's largest RNA 2D structure dataset. The software is amenable to community expansion and is freely available at https://github.com/rnacentral/R2DT; a web server is available at https://rnacentral.org/r2dt.
|
20
|
Computational strategies to combat COVID-19: useful tools to accelerate SARS-CoV-2 and coronavirus research. Brief Bioinform 2021; 22:642-663. [PMID: 33147627 PMCID: PMC7665365 DOI: 10.1093/bib/bbaa232] [Citation(s) in RCA: 81] [Impact Index Per Article: 27.0] [Received: 05/27/2020] [Revised: 07/28/2020] [Accepted: 08/26/2020] [Indexed: 12/16/2022] Open
Abstract
SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to get insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@unj-jena.de.
|
21
|
Predicted Input of Uncultured Fungal Symbionts to a Lichen Symbiosis from Metagenome-Assembled Genomes. Genome Biol Evol 2021; 13:6163286. [PMID: 33693712 PMCID: PMC8355462 DOI: 10.1093/gbe/evab047] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Accepted: 03/03/2021] [Indexed: 12/15/2022] Open
Abstract
Basidiomycete yeasts have recently been reported as stably associated secondary fungal symbionts of many lichens, but their role in the symbiosis remains unknown. Attempts to sequence their genomes have been hampered both by the inability to culture them and their low abundance in the lichen thallus alongside two dominant eukaryotes (an ascomycete fungus and chlorophyte alga). Using the lichen Alectoria sarmentosa, we selectively dissolved the cortex layer in which secondary fungal symbionts are embedded to enrich yeast cell abundance and sequenced DNA from the resulting slurries as well as bulk lichen thallus. In addition to yielding a near-complete genome of the filamentous ascomycete using both methods, metagenomes from cortex slurries yielded a 36- to 84-fold increase in coverage and near-complete genomes for two basidiomycete species, members of the classes Cystobasidiomycetes and Tremellomycetes. The ascomycete possesses the largest gene repertoire of the three. It is enriched in proteases often associated with pathogenicity and harbors the majority of predicted secondary metabolite clusters. The basidiomycete genomes possess ∼35% fewer predicted genes than the ascomycete and have reduced secretomes even compared with close relatives, while exhibiting signs of nutrient limitation and scavenging. Furthermore, both basidiomycetes are enriched in genes coding for enzymes producing secreted acidic polysaccharides, representing a potential contribution to the shared extracellular matrix. All three fungi retain genes involved in dimorphic switching, despite the ascomycete not being known to possess a yeast stage. The basidiomycete genomes are an important new resource for exploration of lifestyle and function in fungal–fungal interactions in lichen symbioses.
|
22
|
Massive expansion of human gut bacteriophage diversity. Cell 2021; 184:1098-1109.e9. [PMID: 33606979 PMCID: PMC7895897 DOI: 10.1016/j.cell.2021.01.029] [Citation(s) in RCA: 241] [Impact Index Per Article: 80.3] [Received: 08/18/2020] [Revised: 11/02/2020] [Accepted: 01/19/2021] [Indexed: 12/25/2022]
Abstract
Bacteriophages drive evolutionary change in bacterial communities by creating gene flow networks that fuel ecological adaptions. However, the extent of viral diversity and its prevalence in the human gut remains largely unknown. Here, we introduce the Gut Phage Database, a collection of ∼142,000 non-redundant viral genomes (>10 kb) obtained by mining a dataset of 28,060 globally distributed human gut metagenomes and 2,898 reference genomes of cultured gut bacteria. Host assignment revealed that viral diversity is highest in the Firmicutes phyla and that ∼36% of viral clusters (VCs) are not restricted to a single species, creating gene flow networks across phylogenetically distinct bacterial species. Epidemiological analysis uncovered 280 globally distributed VCs found in at least 5 continents and a highly prevalent phage clade with features reminiscent of p-crAssphage. This high-quality, large-scale catalog of phage genomes will improve future virome studies and enable ecological and evolutionary analysis of human gut bacteriophages.
|
23
|
The InterPro protein families and domains database: 20 years on. Nucleic Acids Res 2021; 49:D344-D354. [PMID: 33156333 PMCID: PMC7778928 DOI: 10.1093/nar/gkaa977] [Citation(s) in RCA: 1048] [Impact Index Per Article: 349.3] [Received: 09/14/2020] [Revised: 10/08/2020] [Accepted: 10/23/2020] [Indexed: 01/22/2023] Open
Abstract
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.
|
24
|
Pfam: The protein families database in 2021. Nucleic Acids Res 2021; 49:D412-D419. [PMID: 33125078 PMCID: PMC7779014 DOI: 10.1093/nar/gkaa913] [Citation(s) in RCA: 2352] [Impact Index Per Article: 784.0] [Received: 09/11/2020] [Revised: 10/01/2020] [Accepted: 10/06/2020] [Indexed: 12/19/2022] Open
Abstract
The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the SARS-CoV-2 proteome, and built new entries for regions that were not covered by Pfam. We have reintroduced Pfam-B which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on a clustering by the MMseqs2 software. We have compared all of the regions in the RepeatsDB to those in Pfam and have started to use the results to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.
|
25
|
Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res 2021; 49:D192-D200. [PMID: 33211869 PMCID: PMC7779021 DOI: 10.1093/nar/gkaa1047] [Citation(s) in RCA: 368] [Impact Index Per Article: 122.7] [Received: 09/15/2020] [Revised: 10/14/2020] [Accepted: 10/21/2020] [Indexed: 12/15/2022] Open
Abstract
Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.
|
27
|
RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic Acids Res 2021; 49:D212-D220. [PMID: 33106848 PMCID: PMC7779037 DOI: 10.1093/nar/gkaa921] [Citation(s) in RCA: 124] [Impact Index Per Article: 41.3] [Received: 09/15/2020] [Accepted: 10/05/2020] [Indexed: 12/16/2022] Open
Abstract
RNAcentral is a comprehensive database of non-coding RNA (ncRNA) sequences that provides a single access point to 44 RNA resources and >18 million ncRNA sequences from a wide range of organisms and RNA types. RNAcentral now also includes secondary (2D) structure information for >13 million sequences, making RNAcentral the world's largest RNA 2D structure database. The 2D diagrams are displayed using R2DT, a new 2D structure visualization method that uses consistent, reproducible and recognizable layouts for related RNAs. The sequence similarity search has been updated with a faster interface featuring facets for filtering search results by RNA type, organism, source database or any keyword. This sequence search tool is available as a reusable web component, and has been integrated into several RNAcentral member databases, including Rfam, miRBase and snoDB. To allow for a more fine-grained assignment of RNA types and subtypes, all RNAcentral sequences have been annotated with Sequence Ontology terms. The RNAcentral database continues to grow and provide a central data resource for the RNA community. RNAcentral is freely available at https://rnacentral.org.
|
28
|
Dermanyssus gallinae and chicken egg production: impact, management, and a predicted compatibility matrix for integrated approaches. Experimental & Applied Acarology 2020; 82:441-453. [PMID: 33205360 DOI: 10.1007/s10493-020-00558-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 06/03/2020] [Accepted: 10/06/2020] [Indexed: 06/11/2023]
Abstract
The poultry red mite, Dermanyssus gallinae, is a worldwide threat to egg production and animal and human welfare. This mite is also a potential vector for several significant diseases. EU regulation that forbids the use of conventional cages for egg-laying hens may favour the growth of D. gallinae, a species known to thrive in more complex housing systems. Current control measures emphasize the use of chemical acaricides, which may have limited efficacy on D. gallinae considering its temporary blood-feeding behaviour. In integrated pest management (IPM), two or more compatible measures targeting physical, environmental, and/or biological aspects could be judiciously combined to enhance the effectiveness against D. gallinae infestation. To inform current and future IPM for D. gallinae, a compatibility matrix is proposed to guide the selection of control measures for field application.
|
29
|
Estimating the quality of eukaryotic genomes recovered from metagenomic analysis with EukCC. Genome Biol 2020; 21:244. [PMID: 32912302 PMCID: PMC7488429 DOI: 10.1186/s13059-020-02155-4] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 02/03/2020] [Accepted: 08/24/2020] [Indexed: 12/23/2022] Open
Abstract
Microbial eukaryotes constitute a significant fraction of biodiversity and have recently gained more attention, but the recovery of high-quality metagenomic assembled eukaryotic genomes is limited by the current availability of tools. To help address this, we have developed EukCC, a tool for estimating the quality of eukaryotic genomes based on the automated dynamic selection of single copy marker gene sets. We demonstrate that our method outperforms current genome quality estimators, particularly for estimating contamination, and have applied EukCC to datasets derived from two different environments to enable the identification of novel eukaryote genomes, including one from the human skin.
|
30
|
Phylogenomics of expanding uncultured environmental Tenericutes provides insights into their pathogenicity and evolutionary relationship with Bacilli. BMC Genomics 2020; 21:408. [PMID: 32552739 PMCID: PMC7301438 DOI: 10.1186/s12864-020-06807-4] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 03/31/2020] [Accepted: 06/05/2020] [Indexed: 12/28/2022] Open
Abstract
Background: The metabolic capacity, stress response and evolution of uncultured environmental Tenericutes have remained elusive, since previous studies have been largely focused on pathogenic species. In this study, we expanded analyses on Tenericutes lineages that inhabit various environments using a collection of 840 genomes. Results: Several environmental lineages were discovered inhabiting the human gut, ground water, bioreactors and hypersaline lake and spanning the Haloplasmatales and Mycoplasmatales orders. A phylogenomics analysis of Bacilli and Tenericutes genomes revealed that some uncultured Tenericutes are affiliated with novel clades in Bacilli, such as RF39, RFN20 and ML615. Erysipelotrichales and two major gut lineages, RF39 and RFN20, were found to be neighboring clades of Mycoplasmatales. We detected habitat-specific functional patterns between the pathogenic, gut and the environmental Tenericutes, where genes involved in carbohydrate storage, carbon fixation, mutation repair, environmental response and amino acid cleavage are overrepresented in the genomes of environmental lineages, perhaps as a result of environmental adaptation. We hypothesize that the two major gut lineages, namely RF39 and RFN20, are probably acetate and hydrogen producers. Furthermore, deteriorating capacity of bactoprenol synthesis for cell wall peptidoglycan precursors secretion is a potential adaptive strategy employed by these lineages in response to the gut environment. Conclusions: This study uncovers the characteristic functions of environmental Tenericutes and their relationships with Bacilli, which sheds new light onto the pathogenicity and evolutionary processes of Mycoplasmatales.
|
31
|
Genome3D: integrating a collaborative data pipeline to expand the depth and breadth of consensus protein structure annotation. Nucleic Acids Res 2020; 48:D314-D319. [PMID: 31733063 PMCID: PMC7139969 DOI: 10.1093/nar/gkz967] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/23/2019] [Revised: 10/09/2019] [Accepted: 11/07/2019] [Indexed: 12/20/2022] Open
Abstract
Genome3D (https://www.genome3d.eu) is a freely available resource that provides consensus structural annotations for representative protein sequences taken from a selection of model organisms. Since the last NAR update in 2015, the method of data submission has been overhauled, with annotations now being 'pushed' to the database via an API. As a result, contributing groups are now able to manage their own structural annotations, making the resource more flexible and maintainable. The new submission protocol brings a number of additional benefits, including providing instant validation of data and avoiding the requirement to synchronise releases between resources. It also makes it possible to implement the submission of these structural annotations as an automated part of existing internal workflows. In turn, these improvements facilitate Genome3D being opened up to new prediction algorithms and groups. For the latest release of Genome3D (v2.1), the underlying dataset of sequences used as prediction targets has been updated using the latest reference proteomes available in UniProtKB. A number of new reference proteomes of particular interest to the wider scientific community have also been added: cow, pig, wheat and Mycobacterium tuberculosis. These additions, along with improvements to the underlying predictions from contributing resources, have ensured that the number of annotations in Genome3D has nearly doubled since the last NAR update article. The new API has also been used to facilitate the dissemination of Genome3D data into InterPro, thereby widening the visibility of both the annotation data and annotation algorithms.
|
32
|
The Pfam protein families database in 2019. Nucleic Acids Res 2020; 47:D427-D432. [PMID: 30357350 PMCID: PMC6324024 DOI: 10.1093/nar/gky995] [Citation(s) in RCA: 2821] [Impact Index Per Article: 705.3] [Received: 09/28/2018] [Accepted: 10/09/2018] [Indexed: 12/11/2022] Open
Abstract
The last few years have witnessed significant changes in Pfam (https://pfam.xfam.org). The number of families has grown substantially to a total of 17,929 in release 32.0. New additions have been coupled with efforts to improve existing families, including refinement of domain boundaries, their classification into Pfam clans, as well as their functional annotation. We recently began to collaborate with the RepeatsDB resource to improve the definition of tandem repeat families within Pfam. We carried out a significant comparison to the structural classification database, namely the Evolutionary Classification of Protein Domains (ECOD) that led to the creation of 825 new families based on their set of uncharacterized families (EUFs). Furthermore, we also connected Pfam entries to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms. Since Pfam has many community contributors, we recently enabled the linking between authorship of all Pfam entries with the corresponding authors’ ORCID identifiers. This effectively permits authors to claim credit for their Pfam curation and link them to their ORCID record.
|
33
|
MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res 2020; 48:D570-D578. [PMID: 31696235 PMCID: PMC7145632 DOI: 10.1093/nar/gkz1035] [Citation(s) in RCA: 170] [Impact Index Per Article: 42.5] [Received: 09/30/2019] [Accepted: 10/23/2019] [Indexed: 12/16/2022] Open
Abstract
MGnify (http://www.ebi.ac.uk/metagenomics) provides a free to use platform for the assembly, analysis and archiving of microbiome data derived from sequencing microbial populations that are present in particular environments. Over the past 2 years, MGnify (formerly EBI Metagenomics) has more than doubled the number of publicly available analysed datasets held within the resource. Recently, an updated approach to data analysis has been unveiled (version 5.0), replacing the previous single pipeline with multiple analysis pipelines that are tailored according to the input data, and that are formally described using the Common Workflow Language, enabling greater provenance, reusability, and reproducibility. MGnify's new analysis pipelines offer additional approaches for taxonomic assertions based on ribosomal internal transcribed spacer regions (ITS1/2) and expanded protein functional annotations. Biochemical pathways and systems predictions have also been added for assembled contigs. MGnify's growing focus on the assembly of metagenomic data has also seen the number of datasets it has assembled and analysed increase six-fold. The non-redundant protein database constructed from the proteins encoded by these assemblies now exceeds 1 billion sequences. Meanwhile, a newly developed contig viewer provides fine-grained visualisation of the assembled contigs and their enriched annotations.
|
34
|
Microbial composition of Kombucha determined using amplicon sequencing and shotgun metagenomics. J Food Sci 2020; 85:455-464. [PMID: 31957879 PMCID: PMC7027524 DOI: 10.1111/1750-3841.14992] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Received: 08/30/2019] [Revised: 11/12/2019] [Accepted: 11/13/2019] [Indexed: 01/26/2023]
Abstract
Kombucha, a fermented tea generated from the co-culture of yeasts and bacteria, has gained worldwide popularity in recent years due to its potential benefits to human health. As a result, many studies have attempted to characterize both its biochemical properties and microbial composition. Here, we have applied a combination of whole metagenome sequencing (WMS) and amplicon (16S rRNA and Internal Transcribed Spacer 1 [ITS1]) sequencing to investigate the microbial communities of homemade Kombucha fermentations from day 3 to day 15. We identified the dominant bacterial genus as Komagataeibacter and the dominant fungal genus as Zygosaccharomyces in all samples at all time points. Furthermore, we recovered three near-complete Komagataeibacter genomes and one Zygosaccharomyces bailii genome and then predicted their functional properties. We also determined the broad taxonomic and functional profile of plasmids found within the Kombucha microbial communities. Overall, this study provides a detailed description of the taxonomic and functional systems of the Kombucha microbial community. Based on this, we conjecture that functional complementarity enables metabolic cross-talk between Komagataeibacter species and Z. bailii, which helps establish and sustain a relatively low-diversity ecosystem in Kombucha.
|
35
|
A Vaccinology Approach to the Identification and Characterization of Dermanyssus gallinae Candidate Protective Antigens for the Control of Poultry Red Mite Infestations. Vaccines (Basel) 2019; 7:vaccines7040190. [PMID: 31756972 PMCID: PMC6963798 DOI: 10.3390/vaccines7040190] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 10/09/2019] [Revised: 11/07/2019] [Accepted: 11/15/2019] [Indexed: 11/16/2022] Open
Abstract
The poultry red mite (PRM), Dermanyssus gallinae, is a hematophagous ectoparasite considered the major pest in the egg-laying industry. Its pesticide-based control is only partially successful and requires the development of new control interventions such as vaccines. In this study, we follow a vaccinology approach to identify PRM candidate protective antigens. Based on proteomic data from fed and unfed nymph and adult mites, we selected a novel PRM protein, calumenin (Deg-CALU), which was tested as a vaccine candidate in an on-hen trial. Rhipicephalus microplus Subolesin (Rhm-SUB) was chosen as a positive control. Deg-CALU and Rhm-SUB reduced mite oviposition by 35% and 44%, respectively. These results support Deg-CALU and Rhm-SUB as candidate protective antigens for PRM control.
|
36
|
TreeGrafter: phylogenetic tree-based annotation of proteins with Gene Ontology terms and other annotations. Bioinformatics 2019; 35:518-520. [PMID: 30032202 PMCID: PMC6361231 DOI: 10.1093/bioinformatics/bty625] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 03/16/2018] [Accepted: 07/18/2018] [Indexed: 11/13/2022] Open
Abstract
Summary: TreeGrafter is a new software tool for annotating protein sequences using pre-annotated phylogenetic trees. Currently, the tool provides annotations to Gene Ontology (GO) terms, and PANTHER family and subfamily. The approach is generalizable to any annotations that have been made to internal nodes of a reference phylogenetic tree. TreeGrafter takes each input query protein sequence, finds the best matching homologous family in a library of pre-calculated, pre-annotated gene trees, and then grafts it to the best location in the tree. It then annotates the sequence by propagating annotations from ancestral nodes in the reference tree. We show that TreeGrafter outperforms subfamily HMM scoring for correctly assigning subfamily membership, and that it produces highly specific annotations of GO terms based on annotated reference phylogenetic trees. This method will be further integrated into InterProScan, enabling an even broader user community. Availability and implementation: TreeGrafter is freely available on the web at https://github.com/pantherdb/TreeGrafter, including as a Docker image. Supplementary information: Supplementary data are available at Bioinformatics online.
|
37
|
3DPatch: fast 3D structure visualization with residue conservation. Bioinformatics 2019; 35:332-334. [PMID: 29931189 PMCID: PMC6330005 DOI: 10.1093/bioinformatics/bty464] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/20/2018] [Accepted: 06/08/2018] [Indexed: 11/14/2022] Open
Abstract
Summary: Amino acid residues showing above background levels of conservation are often indicative of functionally significant regions within a protein. Understanding how the sequence conservation profile relates in space requires projection onto a protein structure, a potentially time-consuming process. 3DPatch is a web application that streamlines this task by automatically generating multiple sequence alignments (where appropriate) and finding structural homologs, presenting the user with a choice of structures matching their query, annotated with residue conservation scores in a matter of seconds. Availability and implementation: 3DPatch is written in JavaScript and is freely available at http://www.skylign.org/3DPatch/. Mozilla Firefox, Google Chrome, and Safari web browsers are supported. Source code is available under MIT license at https://github.com/davidjakubec/3DPatch. Supplementary information: Supplementary data are available at Bioinformatics online.
|
38
|
EBI Metagenomics in 2017: enriching the analysis of microbial communities, from sequence reads to assemblies. Nucleic Acids Res 2019; 46:D726-D735. [PMID: 29069476 PMCID: PMC5753268 DOI: 10.1093/nar/gkx967] [Citation(s) in RCA: 121] [Impact Index Per Article: 24.2] [Received: 09/19/2017] [Accepted: 10/12/2017] [Indexed: 01/16/2023] Open
Abstract
EBI metagenomics (http://www.ebi.ac.uk/metagenomics) provides a free to use platform for the analysis and archiving of sequence data derived from the microbial populations found in a particular environment. Over the past two years, EBI metagenomics has increased the number of datasets analysed 10-fold. In addition to increased throughput, the underlying analysis pipeline has been overhauled to include both new or updated tools and reference databases. Of particular note is a new workflow for taxonomic assignments that has been extended to include assignments based on both the large and small subunit RNA marker genes and to encompass all cellular micro-organisms. We also describe the addition of metagenomic assembly as a new analysis service. Our pilot studies have produced over 2400 assemblies from datasets in the public domain. From these assemblies, we have produced a searchable, non-redundant protein database of over 50 million sequences. To provide improved access to the data stored within the resource, we have developed a programmatic interface that provides access to the analysis results and associated sample metadata. Finally, we have integrated the results of a series of statistical analyses that provide estimations of diversity and sample comparisons.
|
39
|
The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database. Nucleic Acids Res 2019; 46:D624-D632. [PMID: 29145643 PMCID: PMC5753285 DOI: 10.1093/nar/gkx1134] [Citation(s) in RCA: 928] [Impact Index Per Article: 185.6] [Received: 09/28/2017] [Accepted: 10/30/2017] [Indexed: 12/15/2022] Open
Abstract
The MEROPS database (http://www.ebi.ac.uk/merops/) is an integrated source of information about peptidases, their substrates and inhibitors. The hierarchical classification is: protein-species, family, clan, with an identifier at each level. The MEROPS website moved to the EMBL-EBI in 2017, requiring refactoring of the code-base and services provided. The interface to sequence searching has changed and the MEROPS protein sequence libraries can be searched at the EMBL-EBI with HMMER, FastA and BLASTP. Cross-references have been established between MEROPS and the PANTHER database at both the family and protein-species level, which will help to improve curation and coverage between the resources. Because of the increasing size of the MEROPS sequence collection, in future only sequences of characterized proteins, and from completely sequenced genomes of organisms of evolutionary, medical or commercial significance will be added. As an example, peptidase homologues in four proteomes from the Asgard superphylum of Archaea have been identified and compared to other archaean, bacterial and eukaryote proteomes. This has given insights into the origins and evolution of peptidase families, including an expansion in the number of proteasome components in Asgard archaeotes and as organisms increase in complexity. Novel structures for proteasome complexes in archaea are postulated.
|
40
|
Metaproteomics characterization of the alphaproteobacteria microbiome in different developmental and feeding stages of the poultry red mite Dermanyssus gallinae (De Geer, 1778). Avian Pathol 2019; 48:S52-S59. [PMID: 31267762 DOI: 10.1080/03079457.2019.1635679] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 10/26/2022]
Abstract
The poultry red mite (PRM), Dermanyssus gallinae (De Geer, 1778), is a worldwide distributed ectoparasite and considered a major pest affecting the laying hen industry in Europe. Based on available information in other ectoparasites, the mite microbiome might participate in several biological processes and the acquisition, maintenance and transmission of pathogens. However, little is known about the role of PRM as a mechanical carrier or a biological vector in the transmission of pathogenic bacteria. Herein, we used a metaproteomics approach to characterize the alphaproteobacteria in the microbiota of PRM, and variations in its profile with ectoparasite development (nymphs vs. adults) and feeding (unfed vs. fed). The results showed that the bacterial community associated with D. gallinae was mainly composed of environmental and commensal bacteria. Putative symbiotic bacteria of the genera Wolbachia, C. Tokpelaia and Sphingomonas were identified, together with potential pathogenic bacteria of the genera Inquilinus, Neorickettsia and Roseomonas. Significant differences in the composition of alphaproteobacterial microbiota were associated with mite development and feeding, suggesting that bacteria have functional implications in metabolic pathways associated with blood feeding. These results support the use of metaproteomics for the characterization of alphaproteobacteria associated with the D. gallinae microbiota that could provide relevant information for the understanding of mite-host interactions and the development of potential control interventions. Research highlights Metaproteomics is a valid approach for microbiome characterization in ectoparasites. Alphaproteobacteria putative bacterial symbionts were identified in D. gallinae. Mite development and feeding were related to variations in bacterial community. Potentially pathogenic bacteria were identified in mite microbiota.
|
41
|
Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res 2019; 46:D335-D342. [PMID: 29112718 PMCID: PMC5753348 DOI: 10.1093/nar/gkx1038] [Citation(s) in RCA: 584] [Impact Index Per Article: 116.8] [Received: 09/15/2017] [Accepted: 10/19/2017] [Indexed: 11/13/2022] Open
Abstract
The Rfam database is a collection of RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. In this paper we introduce Rfam release 13.0, which switches to a new genome-centric approach that annotates a non-redundant set of reference genomes with RNA families. We describe new web interface features including faceted text search and R-scape secondary structure visualizations. We discuss a new literature curation workflow and a pipeline for building families based on RNAcentral. There are 236 new families in release 13.0, bringing the total number of families to 2687. The Rfam website is http://rfam.org.
|
42
|
The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Res 2019; 47:W636-W641. [PMID: 30976793 PMCID: PMC6602479 DOI: 10.1093/nar/gkz268] [Citation(s) in RCA: 2879] [Impact Index Per Article: 575.8] [Received: 01/31/2019] [Revised: 03/22/2019] [Accepted: 04/03/2019] [Indexed: 02/07/2023] Open
Abstract
The EMBL-EBI provides free access to popular bioinformatics sequence analysis applications as well as to a full-featured text search engine with powerful cross-referencing and data retrieval capabilities. Access to these services is provided via user-friendly web interfaces and via established RESTful and SOAP Web Services APIs (https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/EMBL-EBI+Web+Services+APIs+-+Data+Retrieval). Both systems have been developed with the same core principles that allow them to integrate an ever-increasing volume of biological data, making them an integral part of many popular data resources provided at the EMBL-EBI. Here, we describe the latest improvements made to the frameworks which enhance the interconnectivity between public EMBL-EBI resources and ultimately enhance biological data discoverability, accessibility, interoperability and reusability.
|
43
|
Microbial community drivers of PK/NRP gene diversity in selected global soils. MICROBIOME 2019; 7:78. [PMID: 31118083 PMCID: PMC6532259 DOI: 10.1186/s40168-019-0692-8] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 11/26/2018] [Accepted: 05/08/2019] [Indexed: 06/09/2023]
Abstract
BACKGROUND The emergence of antibiotic-resistant pathogens has created an urgent need for novel antimicrobial treatments. Advances in next-generation sequencing have opened new frontiers for discovery programmes for natural products allowing the exploitation of a larger fraction of the microbial community. Polyketide (PK) and non-ribosomal peptide (NRP) natural products have been reported to be related to compounds with antimicrobial and anticancer activities. We report here a new culture-independent approach to explore bacterial biosynthetic diversity and determine bacterial phyla in the microbial community associated with PK and NRP diversity in selected soils. RESULTS Through amplicon sequencing, we explored the microbial diversity (16S rRNA gene) of 13 soils from Antarctica, Africa, Europe and a Caribbean island and correlated this with the amplicon diversity of the adenylation (A) and ketosynthase (KS) domains within functional genes coding for non-ribosomal peptide synthetases (NRPSs) and polyketide synthases (PKSs), which are involved in the production of NRP and PK, respectively. Mantel and Procrustes correlation analyses with microbial taxonomic data identified not only the well-studied phyla Actinobacteria and Proteobacteria, but also, interestingly, the less biotechnologically exploited phyla Verrucomicrobia and Bacteroidetes, as potential sources harbouring diverse A and KS domains. Some soils, notably that from Antarctica, provided evidence of endemic diversity, whilst others, such as those from Europe, clustered together. In particular, the majority of the domain reads from Antarctica remained unmatched to known sequences suggesting they could encode enzymes for potentially novel PK and NRP.
CONCLUSIONS The approach presented here highlights potential sources of metabolic novelty in the environment which will be a useful precursor to metagenomic biosynthetic gene cluster mining for PKs and NRPs which could provide leads for new antimicrobial metabolites.
|
45
|
Microbial abundance, activity and population genomic profiling with mOTUs2. Nat Commun 2019; 10:1014. [PMID: 30833550 PMCID: PMC6399450 DOI: 10.1038/s41467-019-08844-4] [Citation(s) in RCA: 195] [Impact Index Per Article: 39.0] [Received: 06/07/2018] [Accepted: 02/02/2019] [Indexed: 12/21/2022] Open
Abstract
Metagenomic sequencing has greatly improved our ability to profile the composition of environmental and host-associated microbial communities. However, the dependency of most methods on reference genomes, which are currently unavailable for a substantial fraction of microbial species, introduces estimation biases. We present an updated and functionally extended tool based on universal (i.e., reference-independent), phylogenetic marker gene (MG)-based operational taxonomic units (mOTUs) enabling the profiling of >7700 microbial species. As more than 30% of them could not previously be quantified at this taxonomic resolution, relative abundance estimates based on mOTUs are more accurate compared to other methods. As a new feature, we show that mOTUs, which are based on essential housekeeping genes, are demonstrably well-suited for quantification of basal transcriptional activity of community members. Furthermore, single nucleotide variation profiles estimated using mOTUs reflect those from whole genomes, which allows for comparing microbial strain populations (e.g., across different human body sites).
|
46
|
Abstract
The composition of the human gut microbiota is linked to health and disease, but knowledge of individual microbial species is needed to decipher their biological roles. Despite extensive culturing and sequencing efforts, the complete bacterial repertoire of the human gut microbiota remains undefined. Here we identify 1,952 uncultured candidate bacterial species by reconstructing 92,143 metagenome-assembled genomes from 11,850 human gut microbiomes. These uncultured genomes substantially expand the known species repertoire of the collective human gut microbiota, with a 281% increase in phylogenetic diversity. Although the newly identified species are less prevalent in well-studied populations compared to reference isolate genomes, they improve classification of understudied African and South American samples by more than 200%. These candidate species encode hundreds of newly identified biosynthetic gene clusters and possess a distinctive functional capacity that might explain their elusive nature. Our work expands the known diversity of uncultured gut bacteria, which provides unprecedented resolution for taxonomic and functional characterization of the intestinal microbiota. The known species repertoire of the collective human gut microbiota is substantially expanded with the discovery of 1,952 uncultured bacterial species that greatly improve classification of understudied African and South American samples.
|
47
|
Abstract
Publication interests should not limit access to public data
|
48
|
Genome properties in 2019: a new companion database to InterPro for the inference of complete functional attributes. Nucleic Acids Res 2019; 47:D564-D572. [PMID: 30364992 PMCID: PMC6323913 DOI: 10.1093/nar/gky1013] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 09/15/2018] [Revised: 10/09/2018] [Accepted: 10/10/2018] [Indexed: 11/14/2022] Open
Abstract
Automatic annotation of protein function is routinely applied to newly sequenced genomes. While this provides a fine-grained view of an organism's functional protein repertoire, proteins more commonly function in a coordinated manner, such as in pathways or multimeric complexes. Genome Properties (GPs) define such functional entities as a series of steps, originally described by either TIGRFAMs or Pfam entries. To increase the scope of coverage, we have migrated GPs to function as a companion resource utilizing InterPro entries. Having introduced GPs-specific versioned releases, we provide software and data via a GitHub repository, and have developed a new web interface to GPs (available at https://www.ebi.ac.uk/interpro/genomeproperties). In addition to exploring each of the 1286 GPs, the website contains GPs pre-calculated for a representative set of proteomes; these results can be used to profile GPs phylogenetically via an interactive viewer. Users can upload novel data to the viewer for comparison with the pre-calculated results. Over the last year, we have added ∼700 new GPs, increasing the coverage of eukaryotic systems, as well as increasing general coverage through automatic generation of GPs from related resources. All data are freely available via the website and the GitHub repository.
|
49
|
InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res 2019; 47:D351-D360. [PMID: 30398656 PMCID: PMC6323941 DOI: 10.1093/nar/gky1100] [Citation(s) in RCA: 966] [Impact Index Per Article: 193.2] [Received: 09/27/2018] [Revised: 10/19/2018] [Accepted: 10/22/2018] [Indexed: 12/15/2022] Open
Abstract
The InterPro database (http://www.ebi.ac.uk/interpro/) classifies protein sequences into families and predicts the presence of functionally important domains and sites. Here, we report recent developments with InterPro (version 70.0) and its associated software, including an 18% growth in the size of the database in terms of new InterPro entries, updates to content, the inclusion of an additional entry type, refined modelling of discontinuous domains, and the development of a new programmatic interface and website. These developments extend and enrich the information provided by InterPro, and provide greater flexibility in terms of data access. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB, and discuss how our evaluation of residue coverage may help guide future curation activities.
|
50
|
Identifying accurate metagenome and amplicon software via a meta-analysis of sequence to taxonomy benchmarking studies. PeerJ 2019; 7:e6160. [PMID: 30631651 PMCID: PMC6322486 DOI: 10.7717/peerj.6160] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 11/08/2017] [Accepted: 11/14/2018] [Indexed: 01/26/2023] Open
Abstract
Metagenomic and meta-barcode DNA sequencing has rapidly become a widely-used technique for investigating a range of questions, particularly related to health and environmental monitoring. There has also been a proliferation of bioinformatic tools for analysing metagenomic and amplicon datasets, which makes selecting adequate tools a significant challenge. A number of benchmark studies have been undertaken; however, these can present conflicting results. In order to address this issue we have applied a robust Z-score ranking procedure and a network meta-analysis method to identify software tools that are consistently accurate for mapping DNA sequences to taxonomic hierarchies. Based upon these results we have identified some tools and computational strategies that produce robust predictions.
|