1
Vakirlis N, Kupczok A. Large-scale investigation of species-specific orphan genes in the human gut microbiome elucidates their evolutionary origins. Genome Res 2024; 34:888-903. [PMID: 38977308 PMCID: PMC11293555 DOI: 10.1101/gr.278977.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2024] [Accepted: 06/12/2024] [Indexed: 07/10/2024]
Abstract
Species-specific genes, also known as orphans, are ubiquitous across life's domains. In prokaryotes, species-specific orphan genes (SSOGs) are mostly thought to originate in external elements such as viruses followed by horizontal gene transfer, whereas the scenario of native origination, through rapid divergence or de novo, is mostly dismissed. However, quantitative evidence supporting either scenario is lacking. Here, we systematically analyzed genomes from 4644 human gut microbiome species and identified more than 600,000 unique SSOGs, representing an average of 2.6% of a given species' pangenome. These sequences are mostly rare within each species yet show signs of purifying selection. Overall, SSOGs use optimal codons less frequently, and their proteins are more disordered than those of conserved genes (i.e., non-SSOGs). Importantly, across species, the GC content of SSOGs closely matches that of conserved ones. In contrast, the ∼5% of SSOGs that share similarity to known viral sequences have distinct characteristics, including lower GC content. Thus, SSOGs with similarity to viruses differ from the remaining SSOGs, contrasting an external origination scenario for most of them. By examining the orthologous genomic region in closely related species, we show that a small subset of SSOGs likely evolved natively de novo and find that these genes also differ in their properties from the remaining SSOGs. Our results challenge the notion that external elements are the dominant source of prokaryotic genetic novelty and will enable future studies into the biological role and relevance of species-specific genes in the human gut.
Affiliation(s)
- Nikolaos Vakirlis
  - Institute For Fundamental Biomedical Research, B.S.R.C. "Alexander Fleming," Vari 166 72, Greece
  - Institute for General Microbiology, Kiel University, 24118 Kiel, Germany
- Anne Kupczok
  - Bioinformatics Group, Wageningen University, 6700 PB Wageningen, The Netherlands
2
Elisée E, Ducrot L, Méheust R, Bastard K, Fossey-Jouenne A, Grogan G, Pelletier E, Petit JL, Stam M, de Berardinis V, Zaparucha A, Vallenet D, Vergne-Vaxelaire C. A refined picture of the native amine dehydrogenase family revealed by extensive biodiversity screening. Nat Commun 2024; 15:4933. [PMID: 38858403 PMCID: PMC11164908 DOI: 10.1038/s41467-024-49009-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Accepted: 05/20/2024] [Indexed: 06/12/2024] Open
Abstract
Native amine dehydrogenases offer sustainable access to chiral amines, so the search for scaffolds capable of converting more diverse carbonyl compounds is required to reach the full potential of this alternative to conventional synthetic reductive aminations. Here we report a multidisciplinary strategy combining bioinformatics, chemoinformatics and biocatalysis to extensively screen billions of sequences in silico and to efficiently find native amine dehydrogenases features using computational approaches. In this way, we achieve a comprehensive overview of the initial native amine dehydrogenase family, extending it from 2,011 to 17,959 sequences, and identify native amine dehydrogenases with non-reported substrate spectra, including hindered carbonyls and ethyl ketones, and accepting methylamine and cyclopropylamine as amine donor. We also present preliminary model-based structural information to inform the design of potential (R)-selective amine dehydrogenases, as native amine dehydrogenases are mostly (S)-selective. This integrated strategy paves the way for expanding the resource of other enzyme families and in highlighting enzymes with original features.
Affiliation(s)
- Eddy Elisée
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057, Evry, France
- Laurine Ducrot
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057, Evry, France
- Raphaël Méheust
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057, Evry, France
- Karine Bastard
  - School of Pharmacy, Faculty of Medicine and Health, University of Sydney, Sydney, NSW, 2006, Australia
- Aurélie Fossey-Jouenne
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057, Evry, France
- Gideon Grogan
  - York Structural Biology Laboratory, Department of Chemistry, University of York, Heslington, York, YO10 5DD, UK
- Eric Pelletier
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057, Evry, France
- Jean-Louis Petit
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057, Evry, France
- Mark Stam
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057, Evry, France
- Véronique de Berardinis
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057, Evry, France
- Anne Zaparucha
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057, Evry, France
- David Vallenet
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057, Evry, France
- Carine Vergne-Vaxelaire
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057, Evry, France
3
Llinares-López F, Berthet Q, Blondel M, Teboul O, Vert JP. Deep embedding and alignment of protein sequences. Nat Methods 2023; 20:104-111. [PMID: 36522501 DOI: 10.1038/s41592-022-01700-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 11/30/2021] [Accepted: 10/24/2022] [Indexed: 12/23/2022]
Abstract
Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. Once trained, we show that DEDAL improves by up to two- or threefold the alignment correctness over existing methods on remote homologs and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.
4
Escudeiro P, Henry CS, Dias RP. Functional characterization of prokaryotic dark matter: the road so far and what lies ahead. Current Research in Microbial Sciences 2022; 3:100159. [PMID: 36561390 PMCID: PMC9764257 DOI: 10.1016/j.crmicr.2022.100159] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/11/2022] [Revised: 07/18/2022] [Accepted: 08/05/2022] [Indexed: 12/25/2022] Open
Abstract
Eight-hundred thousand to one trillion prokaryotic species may inhabit our planet. Yet, fewer than two-hundred thousand prokaryotic species have been described. This uncharted fraction of microbial diversity, and its undisclosed coding potential, is known as the "microbial dark matter" (MDM). Next-generation sequencing has allowed to collect a massive amount of genome sequence data, leading to unprecedented advances in the field of genomics. Still, harnessing new functional information from the genomes of uncultured prokaryotes is often limited by standard classification methods. These methods often rely on sequence similarity searches against reference genomes from cultured species. This hinders the discovery of unique genetic elements that are missing from the cultivated realm. It also contributes to the accumulation of prokaryotic gene products of unknown function among public sequence data repositories, highlighting the need for new approaches for sequencing data analysis and classification. Increasing evidence indicates that these proteins of unknown function might be a treasure trove of biotechnological potential. Here, we outline the challenges, opportunities, and the potential hidden within the functional dark matter (FDM) of prokaryotes. We also discuss the pitfalls surrounding molecular and computational approaches currently used to probe these uncharted waters, and discuss future opportunities for research and applications.
Affiliation(s)
- Pedro Escudeiro
  - BioISI - Instituto de Biosistemas e Ciências Integrativas, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
- Christopher S. Henry
  - Argonne National Laboratory, Lemont, Illinois, USA
  - University of Chicago, Chicago, Illinois, USA
- Ricardo P.M. Dias
  - BioISI - Instituto de Biosistemas e Ciências Integrativas, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
  - iXLab - Innovation for National Biological Resilience, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
5
Vanni C, Schechter MS, Acinas SG, Barberán A, Buttigieg PL, Casamayor EO, Delmont TO, Duarte CM, Eren AM, Finn RD, Kottmann R, Mitchell A, Sánchez P, Siren K, Steinegger M, Gloeckner FO, Fernàndez-Guerra A. Unifying the known and unknown microbial coding sequence space. eLife 2022; 11:e67667. [PMID: 35356891 PMCID: PMC9132574 DOI: 10.7554/elife.67667] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 02/18/2021] [Accepted: 03/30/2022] [Indexed: 12/02/2022] Open
Abstract
Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40-60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction into analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS and a demonstration on how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal genomes, we quantify the extent of the unknown fraction, its diversity, and its relevance across multiple organisms and environments. The unknown sequence space is exceptionally diverse, phylogenetically more conserved than the known fraction and predominantly taxonomically restricted at the species level. From the 71 M genes identified to be of unknown function, we compiled a collection of 283,874 lineage-specific genes of unknown function for Cand. Patescibacteria (also known as Candidate Phyla Radiation, CPR), which provides a significant resource to expand our understanding of their unusual biology. Finally, by identifying a target gene of unknown function for antibiotic resistance, we demonstrate how we can enable the generation of hypotheses that can be used to augment experimental data.
Affiliation(s)
- Chiara Vanni
  - Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine Microbiology, Bremen, Germany
  - Jacobs University Bremen, Bremen, Germany
- Matthew S Schechter
  - Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine Microbiology, Bremen, Germany
  - Department of Medicine, University of Chicago, Chicago, United States
- Silvia G Acinas
  - Department of Marine Biology and Oceanography, Institut de Ciències del Mar (CSIC), Barcelona, Spain
- Albert Barberán
  - Department of Environmental Science, University of Arizona, Tucson, United States
- Pier Luigi Buttigieg
  - Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Alfred Wegener Institute, Bremerhaven, Germany
- Emilio O Casamayor
  - Center for Advanced Studies of Blanes CEAB-CSIC, Spanish Council for Research, Blanes, Spain
- Tom O Delmont
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, Evry, France
- Carlos M Duarte
  - Red Sea Research Centre and Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- A Murat Eren
  - Department of Medicine, University of Chicago, Chicago, United States
  - Josephine Bay Paul Center, Marine Biological Laboratory, Woods Hole, United States
- Robert D Finn
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
- Renzo Kottmann
  - Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine Microbiology, Bremen, Germany
- Alex Mitchell
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
- Pablo Sánchez
  - Department of Marine Biology and Oceanography, Institut de Ciències del Mar (CSIC), Barcelona, Spain
- Kimmo Siren
  - Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
- Martin Steinegger
  - School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
  - Institute of Molecular Biology and Genetics, Seoul National University, Seoul, Republic of Korea
- Frank Oliver Gloeckner
  - Jacobs University Bremen, Bremen, Germany
  - University of Bremen and Life Sciences and Chemistry, Bremen, Germany
  - Computing Center, Helmholtz Center for Polar and Marine Research, Bremerhaven, Germany
- Antonio Fernàndez-Guerra
  - Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine Microbiology, Bremen, Germany
  - Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
6
Functional Characterisation of Bile Metagenome: Study of Metagenomic Dark Matter. Microorganisms 2021; 9:microorganisms9112201. [PMID: 34835325 PMCID: PMC8621414 DOI: 10.3390/microorganisms9112201] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/11/2021] [Revised: 10/01/2021] [Accepted: 10/11/2021] [Indexed: 11/16/2022] Open
Abstract
Gallbladder metagenome involves a wide range of unidentified sequences comprising the so-called metagenomic dark matter. Therefore, this study aimed to characterise three gallbladder metagenomes and a fosmid library with an emphasis on metagenomic dark matter fraction. For this purpose, a novel data analysis strategy based on the combination of remote homology and molecular modelling has been proposed. According to the results obtained, several protein functional domains were annotated in the metagenomic dark matter fraction including acetyltransferases, outer membrane transporter proteins, membrane assembly factors, DNA repair and recombination proteins and response regulator phosphatases. In addition, one deacetylase involved in mycothiol biosynthesis was found in the metagenomic dark matter fraction of the fosmid library. This enzyme may exert a protective effect in Actinobacteria against bile components exposure, in agreement with the presence of multiple antibiotic and multidrug resistance genes. Potential mechanisms of action of this novel deacetylase were elucidated by molecular simulations, highlighting the role of histidine and aspartic acid residues. Computational pipelines presented in this work may be of special interest to discover novel microbial enzymes which had not been previously characterised.
7
Lobb B, Tremblay BJM, Moreno-Hagelsieb G, Doxey AC. PathFams: statistical detection of pathogen-associated protein domains. BMC Genomics 2021; 22:663. [PMID: 34521345 PMCID: PMC8442362 DOI: 10.1186/s12864-021-07982-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2021] [Accepted: 09/01/2021] [Indexed: 11/10/2022] Open
Abstract
Background: A substantial fraction of genes identified within bacterial genomes encode proteins of unknown function. Identifying which of these proteins represent potential virulence factors, and mapping their key virulence determinants, is a challenging but important goal. Results: To facilitate virulence factor discovery, we performed a comprehensive analysis of 17,929 protein domain families within the Pfam database, and scored them based on their overrepresentation in pathogenic versus non-pathogenic species, taxonomic distribution, relative abundance in metagenomic datasets, and other factors. Conclusions: We identify pathogen-associated domain families, candidate virulence factors in the human gut, and eukaryotic-like mimicry domains with likely roles in virulence. Furthermore, we provide an interactive database called PathFams to allow users to explore pathogen-associated domains as well as identify pathogen-associated domains and domain architectures in user-uploaded sequences of interest. PathFams is freely available at https://pathfams.uwaterloo.ca.
Affiliation(s)
- Briallen Lobb
  - Department of Biology, University of Waterloo, Waterloo, Ontario, Canada
- Andrew C Doxey
  - Department of Biology, University of Waterloo, Waterloo, Ontario, Canada
8
Castro-Severyn J, Pardo-Esté C, Mendez KN, Fortt J, Marquez S, Molina F, Castro-Nallar E, Remonsellez F, Saavedra CP. Living to the High Extreme: Unraveling the Composition, Structure, and Functional Insights of Bacterial Communities Thriving in the Arsenic-Rich Salar de Huasco Altiplanic Ecosystem. Microbiol Spectr 2021; 9:e0044421. [PMID: 34190603 PMCID: PMC8552739 DOI: 10.1128/spectrum.00444-21] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/02/2021] [Accepted: 06/07/2021] [Indexed: 01/03/2023] Open
Abstract
Microbial communities inhabiting extreme environments such as Salar de Huasco (SH) in northern Chile are adapted to thrive while exposed to several abiotic pressures and the presence of toxic elements such as arsenic (As). Hence, we aimed to uncover the role of As in shaping bacterial composition, structure, and functional potential in five different sites in this altiplanic wetland using a shotgun metagenomic approach. The sites exhibit wide gradients of As (9 to 321 mg/kg), and our results showed highly diverse communities and a clear dominance exerted by the Proteobacteria and Bacteroidetes phyla. Functional potential analyses show broadly convergent patterns, contrasting with their great taxonomic variability. As-related metabolism, as well as other functional categories such as those related to the CH4 and S cycles, differs among the five communities. Particularly, we found that the distribution and abundance of As-related genes increase as the As concentration rises. Approximately 75% of the detected genes for As metabolism belong to expulsion mechanisms; arsJ and arsP pumps are related to sites with higher As concentrations and are present almost exclusively in Proteobacteria. Furthermore, taxonomic diversity and functional potential are reflected in the 12 reconstructed high-quality metagenome assembled genomes (MAGs) belonging to the Bacteroidetes (5), Proteobacteria (5), Cyanobacteria (1), and Gemmatimonadetes (1) phyla. We conclude that SH microbial communities are diverse and possess a broad genetic repertoire to thrive under extreme conditions, including increasing concentrations of highly toxic As. Finally, this environment represents a reservoir of unknown and undescribed microorganisms, with great metabolic versatility, which needs further study.
IMPORTANCE: As microbial communities inhabiting extreme environments are fundamental for maintaining ecosystems, many studies concerning composition, functionality, and interactions have been carried out. However, much is still unknown. Here, we sampled microbial communities in the Salar de Huasco, an extreme environment subjected to several abiotic stresses (high UV radiation, salinity and arsenic; low pressure and temperatures). We found that although microbes are taxonomically diverse, functional potential seems to have an important degree of convergence, suggesting high levels of adaptation. Particularly, arsenic metabolism showed differences associated with increasing concentrations of the metalloid throughout the area, and it effectively exerts a significant pressure over these organisms. Thus, the significance of this research is that we describe highly specialized communities thriving in little-explored environments subjected to several pressures, considered analogous of early Earth and other planets, that have the potential for unraveling technologies to face the repercussions of climate change in many areas of interest.
Affiliation(s)
- Juan Castro-Severyn
  - Laboratorio de Microbiología Aplicada y Extremófilos, Facultad de Ingeniería y Ciencias Geológicas, Universidad Católica del Norte, Antofagasta, Chile
- Coral Pardo-Esté
  - Laboratorio de Microbiología Aplicada y Extremófilos, Facultad de Ingeniería y Ciencias Geológicas, Universidad Católica del Norte, Antofagasta, Chile
  - Laboratorio de Microbiología Molecular, Facultad de Ciencias de la Vida, Universidad Andres Bello, Santiago, Chile
- Katterinne N. Mendez
  - Center for Bioinformatics and Integrative Biology, Facultad de Ciencias de la Vida, Universidad Andres Bello, Santiago, Chile
- Jonathan Fortt
  - Laboratorio de Microbiología Aplicada y Extremófilos, Facultad de Ingeniería y Ciencias Geológicas, Universidad Católica del Norte, Antofagasta, Chile
- Sebastian Marquez
  - Center for Bioinformatics and Integrative Biology, Facultad de Ciencias de la Vida, Universidad Andres Bello, Santiago, Chile
- Franck Molina
  - Sys2Diag, UMR9005 CNRS ALCEDIAG, Montpellier, France
- Eduardo Castro-Nallar
  - Center for Bioinformatics and Integrative Biology, Facultad de Ciencias de la Vida, Universidad Andres Bello, Santiago, Chile
- Francisco Remonsellez
  - Laboratorio de Microbiología Aplicada y Extremófilos, Facultad de Ingeniería y Ciencias Geológicas, Universidad Católica del Norte, Antofagasta, Chile
  - Centro de Investigación Tecnológica del Agua en el Desierto-CEITSAZA, Universidad Católica del Norte, Antofagasta, Chile
- Claudia P. Saavedra
  - Laboratorio de Microbiología Molecular, Facultad de Ciencias de la Vida, Universidad Andres Bello, Santiago, Chile
9
Lobb B, Tremblay BJM, Moreno-Hagelsieb G, Doxey AC. An assessment of genome annotation coverage across the bacterial tree of life. Microb Genom 2020; 6. [PMID: 32124724 PMCID: PMC7200070 DOI: 10.1099/mgen.0.000341] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Indexed: 12/13/2022] Open
Abstract
Although gene-finding in bacterial genomes is relatively straightforward, the automated assignment of gene function is still challenging, resulting in a vast quantity of hypothetical sequences of unknown function. But how prevalent are hypothetical sequences across bacteria, what proportion of genes in different bacterial genomes remain unannotated, and what factors affect annotation completeness? To address these questions, we surveyed over 27 000 bacterial genomes from the Genome Taxonomy Database, and measured genome annotation completeness as a function of annotation method, taxonomy, genome size, 'research bias' and publication date. Our analysis revealed that 52 and 79 % of the average bacterial proteome could be functionally annotated based on protein and domain-based homology searches, respectively. Annotation coverage using protein homology search varied significantly from as low as 14 % in some species to as high as 98 % in others. We found that taxonomy is a major factor influencing annotation completeness, with distinct trends observed across the microbial tree (e.g. the lowest level of completeness was found in the Patescibacteria lineage). Most lineages showed a significant association between genome size and annotation incompleteness, likely reflecting a greater degree of uncharacterized sequences in 'accessory' proteomes than in 'core' proteomes. Finally, research bias, as measured by publication volume, was also an important factor influencing genome annotation completeness, with early model organisms showing high completeness levels relative to other genomes in their own taxonomic lineages. Our work highlights the disparity in annotation coverage across the bacterial tree of life and emphasizes a need for more experimental characterization of accessory proteomes as well as understudied lineages.
Affiliation(s)
- Briallen Lobb
  - Department of Biology, University of Waterloo, 200 University Avenue West, Waterloo, ON N2L 3G1, Canada
- Gabriel Moreno-Hagelsieb
  - Department of Biology, Wilfrid Laurier University, 75 University Avenue West, Waterloo, ON, Canada
- Andrew C Doxey
  - Department of Biology, University of Waterloo, 200 University Avenue West, Waterloo, ON N2L 3G1, Canada
10
Structure-Based Deep Mining Reveals First-Time Annotations for 46 Percent of the Dark Annotation Space of the 9,671-Member Superproteome of the Nucleocytoplasmic Large DNA Viruses. J Virol 2020; 94:JVI.00854-20. [PMID: 32999026 DOI: 10.1128/jvi.00854-20] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/05/2020] [Accepted: 09/16/2020] [Indexed: 12/20/2022] Open
Abstract
We conducted an exhaustive search for three-dimensional structural homologs to the proteins of 20 key phylogenetically distinct nucleocytoplasmic DNA viruses (NCLDV). Structural matches covered 429 known protein domain superfamilies, with the most highly represented being ankyrin repeat, P-loop NTPase, F-box, protein kinase, and membrane occupation and recognition nexus (MORN) repeat. Domain superfamily diversity correlated with genome size, but a diversity of around 200 superfamilies appeared to correlate with an abrupt switch to paralogization. Extensive structural homology was found across the range of eukaryotic RNA polymerase II subunits and their associated basal transcription factors, with the coordinated gain and loss of clusters of subunits on a virus-by-virus basis. The total number of predicted endonucleases across the 20 NCLDV was nearly quadrupled from 36 to 132, covering much of the structural and functional diversity of endonucleases throughout the biosphere in DNA restriction, repair, and homing. Unexpected findings included capsid protein-transcription factor chimeras; endonuclease chimeras; enzymes for detoxification; antimicrobial peptides and toxin-antitoxin systems associated with symbiosis, immunity, and addiction; and novel proteins for membrane abscission and protein turnover.
IMPORTANCE: We extended the known annotation space for the NCLDV by 46%, revealing high-probability structural matches for fully 45% of the 9,671 query proteins and confirming up to 98% of existing annotations per virus. The most prevalent protein families included ankyrin repeat- and MORN repeat-containing proteins, many of which included an F-box, suggesting extensive host cell modulation among the NCLDV. Regression suggested a minimum requirement for around 36 protein structural superfamilies for a viable NCLDV, and beyond around 200 superfamilies, genome expansion by the acquisition of new functions was abruptly replaced by paralogization. We found homologs to herpesvirus surface glycoprotein gB in cytoplasmic viruses. This study provided the first prediction of an endonuclease in 10 of the 20 viruses examined; the first report in a virus of a phenolic acid decarboxylase, proteasomal subunit, or cysteine knot (defensin) protein; and the first report of a prokaryotic-type ribosomal protein in a eukaryotic virus.
11
Buongermino Pereira M, Österlund T, Eriksson KM, Backhaus T, Axelson-Fisk M, Kristiansson E. A comprehensive survey of integron-associated genes present in metagenomes. BMC Genomics 2020; 21:495. [PMID: 32689930 PMCID: PMC7370490 DOI: 10.1186/s12864-020-06830-5] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 10/03/2019] [Accepted: 06/15/2020] [Indexed: 12/19/2022] Open
Abstract
Background: Integrons are genomic elements that mediate horizontal gene transfer by inserting and removing genetic material using site-specific recombination. Integrons are commonly found in bacterial genomes, where they maintain a large and diverse set of genes that plays an important role in adaptation and evolution. Previous studies have started to characterize the wide range of biological functions present in integrons. However, the efforts have so far mainly been limited to genomes from cultivable bacteria and amplicons generated by PCR, thus targeting only a small part of the total integron diversity. Metagenomic data, generated by direct sequencing of environmental and clinical samples, provides a more holistic and unbiased analysis of integron-associated genes. However, the fragmented nature of metagenomic data has previously made such analysis highly challenging. Results: Here, we present a systematic survey of integron-associated genes in metagenomic data. The analysis was based on a newly developed computational method where integron-associated genes were identified by detecting their associated recombination sites. By processing contiguous sequences assembled from more than 10 terabases of metagenomic data, we were able to identify 13,397 unique integron-associated genes. Metagenomes from marine microbial communities had the highest occurrence of integron-associated genes with levels more than 100-fold higher than in the human microbiome. The identified genes had a large functional diversity spanning over several functional classes. Genes associated with defense mechanisms and mobility facilitators were most overrepresented and more than five times as common in integrons compared to other bacterial genes. As many as two thirds of the genes were found to encode proteins of unknown function. Less than 1% of the genes were associated with antibiotic resistance, of which several were novel, previously undescribed, resistance gene variants. Conclusions: Our results highlight the large functional diversity maintained by integrons present in unculturable bacteria and significantly expands the number of described integron-associated genes.
Affiliation(s)
- Mariana Buongermino Pereira
  - Department of Mathematical Sciences, Chalmers University of Technology, Gothenburg, Sweden; Centre for Antibiotic Resistance Research (CARe) at University of Gothenburg, Gothenburg, Sweden
- Tobias Österlund
  - Department of Mathematical Sciences, Chalmers University of Technology, Gothenburg, Sweden; Centre for Antibiotic Resistance Research (CARe) at University of Gothenburg, Gothenburg, Sweden
- K Martin Eriksson
  - Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden; Gothenburg Centre for Sustainable Development, Chalmers University of Technology, Gothenburg, Sweden
- Thomas Backhaus
  - Centre for Antibiotic Resistance Research (CARe) at University of Gothenburg, Gothenburg, Sweden; Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden
- Marina Axelson-Fisk
  - Department of Mathematical Sciences, Chalmers University of Technology, Gothenburg, Sweden
- Erik Kristiansson
  - Department of Mathematical Sciences, Chalmers University of Technology, Gothenburg, Sweden; Centre for Antibiotic Resistance Research (CARe) at University of Gothenburg, Gothenburg, Sweden
12
King CH, Desai H, Sylvetsky AC, LoTempio J, Ayanyan S, Carrie J, Crandall KA, Fochtman BC, Gasparyan L, Gulzar N, Howell P, Issa N, Krampis K, Mishra L, Morizono H, Pisegna JR, Rao S, Ren Y, Simonyan V, Smith K, VedBrat S, Yao MD, Mazumder R. Baseline human gut microbiota profile in healthy people and standard reporting template. PLoS One 2019; 14:e0206484. [PMID: 31509535 PMCID: PMC6738582 DOI: 10.1371/journal.pone.0206484] [Citation(s) in RCA: 105] [Impact Index Per Article: 21.0] [Received: 10/10/2018] [Accepted: 08/05/2019] [Indexed: 12/19/2022]
Abstract
A comprehensive knowledge of the types and ratios of microbes that inhabit the healthy human gut is necessary before any kind of pre-clinical or clinical study can be performed that attempts to alter the microbiome to treat a condition or improve therapy outcome. To address this need we present an innovative scalable comprehensive analysis workflow, a healthy human reference microbiome list and abundance profile (GutFeelingKB), and a novel Fecal Biome Population Report (FecalBiome) with clinical applicability. GutFeelingKB provides a list of 157 organisms (8 phyla, 18 classes, 23 orders, 38 families, 59 genera and 109 species) that forms the baseline biome and therefore can be used as healthy controls for studies related to dysbiosis. This list can be expanded to 863 organisms if closely related proteomes are considered. The incorporation of microbiome science into routine clinical practice necessitates a standard report for comparison of an individual’s microbiome to the growing knowledgebase of “normal” microbiome data. The FecalBiome and the underlying technology of GutFeelingKB address this need. The knowledgebase can be useful to regulatory agencies for the assessment of fecal transplant and other microbiome products, as it contains a list of organisms from healthy individuals. In addition to the list of organisms and their abundances, this study also generated a collection of assembled contiguous sequences (contigs) of metagenomic dark matter. In this study, metagenomic dark matter represents sequences that cannot be mapped to any known sequence but can be assembled into contigs of 10,000 nucleotides or longer. These sequences can be used to create primers to study potential novel organisms. All data is freely available from https://hive.biochemistry.gwu.edu/gfkb and NCBI’s Short Read Archive.
Affiliation(s)
- Charles H. King
  - The Department of Biochemistry & Molecular Medicine, School of Medicine and Health Sciences, George Washington University Medical Center, Washington, DC, United States of America
  - McCormick Genomic and Proteomic Center, George Washington University, Washington, DC, United States of America
- Hiral Desai
  - The Department of Biochemistry & Molecular Medicine, School of Medicine and Health Sciences, George Washington University Medical Center, Washington, DC, United States of America
- Allison C. Sylvetsky
  - The Department of Exercise and Nutrition Sciences, Milken Institute School of Public Health, George Washington University, Washington, DC, United States of America
- Jonathan LoTempio
  - The Institute for Biomedical Science, School of Medicine and Health Sciences, George Washington University, Washington, DC, United States of America
  - Center for Genetic Medicine, Children’s National Medical Center, George Washington University, Washington, DC, United States of America
- Shant Ayanyan
  - The Department of Biochemistry & Molecular Medicine, School of Medicine and Health Sciences, George Washington University Medical Center, Washington, DC, United States of America
- Jill Carrie
  - The Department of Biochemistry & Molecular Medicine, School of Medicine and Health Sciences, George Washington University Medical Center, Washington, DC, United States of America
- Keith A. Crandall
  - Computational Biology Institute and The Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, United States of America
- Brian C. Fochtman
  - The Department of Biochemistry & Molecular Medicine, School of Medicine and Health Sciences, George Washington University Medical Center, Washington, DC, United States of America
- Lusine Gasparyan
  - The Department of Biochemistry & Molecular Medicine, School of Medicine and Health Sciences, George Washington University Medical Center, Washington, DC, United States of America
- Naila Gulzar
  - The Department of Biochemistry & Molecular Medicine, School of Medicine and Health Sciences, George Washington University Medical Center, Washington, DC, United States of America
- Paul Howell
  - KamTek Inc, Frederick, Maryland, United States of America
- Najy Issa
  - The Department of Exercise and Nutrition Sciences, Milken Institute School of Public Health, George Washington University, Washington, DC, United States of America
- Konstantinos Krampis
  - Department of Biological Sciences, Hunter College, City University of New York, New York, New York, United States of America
- Lopa Mishra
  - Center for Translational Medicine, Department of Surgery, George Washington University, Washington, DC, United States of America
- Hiroki Morizono
  - Center for Genetic Medicine, Children’s National Medical Center, George Washington University, Washington, DC, United States of America
- Joseph R. Pisegna
  - Division of Gastroenterology and Hepatology VA Greater Los Angeles Healthcare System and Department of Medicine and Human Genetics, University of California, Los Angeles, Los Angeles, California, United States of America
- Shuyun Rao
  - Center for Translational Medicine, Department of Surgery, George Washington University, Washington, DC, United States of America
- Yao Ren
  - The Department of Biochemistry & Molecular Medicine, School of Medicine and Health Sciences, George Washington University Medical Center, Washington, DC, United States of America
- Vahan Simonyan
  - The Department of Biochemistry & Molecular Medicine, School of Medicine and Health Sciences, George Washington University Medical Center, Washington, DC, United States of America
- Krista Smith
  - The Department of Biochemistry & Molecular Medicine, School of Medicine and Health Sciences, George Washington University Medical Center, Washington, DC, United States of America
- Michael D. Yao
  - Washington DC VA Medical Center, Gastroenterology & Hepatology Section, Washington, DC, United States of America
  - Department of Medicine, School of Medicine and Health Sciences, George Washington University, Washington, DC, United States of America
- Raja Mazumder
  - The Department of Biochemistry & Molecular Medicine, School of Medicine and Health Sciences, George Washington University Medical Center, Washington, DC, United States of America
  - Department of Medicine, School of Medicine and Health Sciences, George Washington University, Washington, DC, United States of America
13
Abstract
The molecular evolution of virulence factors is a central theme in our understanding of bacterial pathogenesis and host-microbe interactions. Using bioinformatics and genome data mining, recent studies have shed light on the evolution of important virulence factor families and the mechanisms by which they have adapted and diversified in function. This perspective highlights three complementary approaches useful for studying the molecular evolution of virulence factors: identification and analysis of virulence factor homologs, detection of adaptations or functional shifts, and computational prediction of novel virulence factor families. Each of these research directions is associated with distinct questions, approaches, and challenges for future work. Moving forward, bioinformatics will continue to play a critical role in exploring the evolution of virulence factors, including those that target humans. By reconstructing past processes and events, we will be able to better interpret newly sequenced microbial genomes and detect future pathoadaptations.
14
Identification of new members of alkaliphilic lipases in archaea and metagenome database using reconstruction of ancestral sequences. 3 Biotech 2019; 9:165. [PMID: 30997302 DOI: 10.1007/s13205-019-1693-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 12/29/2018] [Accepted: 02/27/2019] [Indexed: 10/27/2022]
Abstract
The application of bioinformatics in lipase research has the potential to discover robust members from different genomic/metagenomic databases. In this study, we explored the diversity and distribution of alkaliphilic lipases in the archaea domain and metagenome data sets through a phylogenetic survey. The reconstructed ancestral sequence of alkaliphilic lipase was used to search for homologous alkaliphilic lipases among the archaea and metagenome public databases. Our investigation revealed a total of 21 unique sequences of new alkaliphilic lipases in the archaeal and environmental metagenomic protein databases that shared significant sequence similarity to the bacterial alkaliphilic lipases. Most of the identified new members of alkaliphilic lipases belong to class Haloarchaea. The list of homologs found also included one characterized lipase from the alkalohyperthermophilic Archaeoglobus fulgidus. All the newly identified alkaliphilic lipase members showed the conserved pentapeptide [X-His-Ser-X-Gly] motif, a key feature of the lipase family. Furthermore, detailed analysis of all these new sequences showed homology either with thermostable or alkalophilic lipases. The reconstructed ancestral sequence-based searches increased the sensitivity and efficacy of detecting remotely homologous sequences. We hypothesize that this study can enrich our current knowledge on lipases in designing more potential thermo-alkaliphilic lipases for industrial applications.
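As an illustration of the motif check mentioned in the abstract above, a minimal sketch of scanning protein sequences for the quoted pentapeptide X-His-Ser-X-Gly might look as follows. This is not the authors' pipeline: the function name `find_motif` and the toy sequences are hypothetical, and X is assumed to mean any single residue.

```python
import re

# Pentapeptide as quoted in the abstract: X-His-Ser-X-Gly (X assumed to be any residue).
# The lookahead lets overlapping occurrences be reported as well.
MOTIF = re.compile(r"(?=([A-Z]HS[A-Z]G))")


def find_motif(seq: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched pentapeptide) for each motif occurrence."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(seq.upper())]


# Hypothetical toy sequences for illustration only; these are not real lipases.
toy_sequences = {
    "toy_with_motif": "MKTAHSLGVV",
    "toy_without_motif": "MKTAGGLVVV",
}
hits = {name: find_motif(seq) for name, seq in toy_sequences.items()}
```

A pattern match of this kind only flags candidates; the homology searches described in the abstract rely on sequence similarity to the reconstructed ancestor, not on the motif alone.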
15
Calderon D, Peña L, Suarez A, Villamil C, Ramirez-Rojas A, Anzola JM, García-Betancur JC, Cepeda ML, Uribe D, Del Portillo P, Mongui A. Recovery and functional validation of hidden soil enzymes in metagenomic libraries. Microbiologyopen 2019; 8:e00572. [PMID: 30851083 PMCID: PMC6460280 DOI: 10.1002/mbo3.572] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 06/12/2017] [Revised: 11/01/2017] [Accepted: 11/09/2017] [Indexed: 11/10/2022]
Abstract
The vast microbial diversity on the planet represents an invaluable source for identifying novel activities with potential industrial and therapeutic application. In this regard, metagenomics has emerged as a group of strategies that have significantly facilitated the analysis of DNA from multiple environments and has expanded the limits of known microbial diversity. However, the functional characterization of enzymes, metabolites, and products encoded by diverse microbial genomes is limited by the inefficient heterologous expression of foreign genes. We have implemented a pipeline that combines NGS and Sanger sequencing as a way to identify fosmids within metagenomic libraries. This strategy facilitated the identification of putative proteins, subcloning of targeted genes and preliminary characterization of selected proteins. Overall, the in silico approach followed by the experimental validation allowed us to efficiently recover the activity of previously hidden enzymes derived from agricultural soil samples. Therefore, the methodology workflow described herein can be applied to recover activities encoded by environmental DNA from multiple sources.
Affiliation(s)
- Dayana Calderon
  - Molecular Biotechnology Research Group, Corporación CorpoGen, Bogotá, Colombia
- Luis Peña
  - Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Friedrich-Schiller Universität, Jena, Germany
- Angélica Suarez
  - Molecular Biotechnology Research Group, Corporación CorpoGen, Bogotá, Colombia
- Carolina Villamil
  - Molecular Biotechnology Research Group, Corporación CorpoGen, Bogotá, Colombia
- Adan Ramirez-Rojas
  - Molecular Biotechnology Research Group, Corporación CorpoGen, Bogotá, Colombia
- Juan M Anzola
  - Computational Biology, Corporación CorpoGen, Bogotá, Colombia
- Martha L Cepeda
  - Molecular Biotechnology Research Group, Corporación CorpoGen, Bogotá, Colombia
- Daniel Uribe
  - Biotechnology Institute, Universidad Nacional de Colombia, Bogotá, Colombia
- Alvaro Mongui
  - Molecular Biotechnology Research Group, Corporación CorpoGen, Bogotá, Colombia; Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
16
Orphan Genes Shared by Pathogenic Genomes Are More Associated with Bacterial Pathogenicity. mSystems 2019; 4:mSystems00290-18. [PMID: 30801025 PMCID: PMC6372840 DOI: 10.1128/msystems.00290-18] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 11/15/2018] [Accepted: 01/08/2019] [Indexed: 11/20/2022]
Abstract
Orphan genes (also known as ORFans [i.e., orphan open reading frames]) are new genes that enable an organism to adapt to its specific living environment. Our focus in this study is to compare ORFans between pathogens (P) and nonpathogens (NP) of the same genus. Using the pangenome idea, we have identified 130,169 ORFans in nine bacterial genera (505 genomes) and classified these ORFans into four groups: (i) SS-ORFans (P), which are only found in a single pathogenic genome; (ii) SS-ORFans (NP), which are only found in a single nonpathogenic genome; (iii) PS-ORFans (P), which are found in multiple pathogenic genomes; and (iv) NS-ORFans (NP), which are found in multiple nonpathogenic genomes. Within the same genus, pathogens do not always have more genes, more ORFans, or more pathogenicity-related genes (PRGs), including prophages, pathogenicity islands (PAIs), virulence factors (VFs), and horizontal gene transfers (HGTs), than nonpathogens. Interestingly, in pathogens of the nine genera, the percentages of PS-ORFans are consistently higher than those of SS-ORFans, which is not true in nonpathogens. Similarly, in pathogens of the nine genera, the percentages of PS-ORFans matching the four types of PRGs are also always higher than those of SS-ORFans, but this is not true in nonpathogens. All of these findings suggest the greater importance of PS-ORFans for bacterial pathogenicity.
IMPORTANCE Recent pangenome analyses of numerous bacterial species have suggested that each genome of a single species may have a significant fraction of its gene content unique or shared by a very few genomes (i.e., ORFans). We selected nine bacterial genera, each containing at least five pathogenic and five nonpathogenic genomes, to compare their ORFans in relation to pathogenicity-related genes. Pathogens in these genera are known to cause a number of common and devastating human diseases such as pneumonia, diphtheria, melioidosis, and tuberculosis. Thus, they are worthy of in-depth systems microbiology investigations, including the comparative study of ORFans between pathogens and nonpathogens. We provide direct evidence to suggest that ORFans shared by more pathogens are more associated with pathogenicity-related genes and thus are more important targets for development of new diagnostic markers or therapeutic drugs for bacterial infectious diseases.
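The four-way grouping of ORFans described in this abstract can be sketched as a small classification function. This is a hedged illustration rather than the authors' code: the function name, the genome identifiers, and the handling of genes found in both pathogenic and nonpathogenic genomes (returned as "mixed", a case the abstract does not define) are assumptions, and the input is assumed to be already restricted to ORFans of one genus.

```python
def classify_orfan(genomes_with_gene: set[str], pathogenic: set[str]) -> str:
    """Assign an ORFan to one of the four groups named in the abstract.

    genomes_with_gene: genomes (within one genus) that carry the ORFan.
    pathogenic: genomes of that genus labelled as pathogens.
    """
    if not genomes_with_gene:
        raise ValueError("an ORFan must occur in at least one genome")

    only_pathogens = genomes_with_gene <= pathogenic
    only_nonpathogens = genomes_with_gene.isdisjoint(pathogenic)

    if len(genomes_with_gene) == 1:
        # Found in a single genome: SS-ORFan, labelled by that genome's status.
        return "SS-ORFan (P)" if only_pathogens else "SS-ORFan (NP)"
    if only_pathogens:
        return "PS-ORFan (P)"   # multiple pathogenic genomes
    if only_nonpathogens:
        return "NS-ORFan (NP)"  # multiple nonpathogenic genomes
    # Assumption: genes present in both kinds of genome fall outside the four groups.
    return "mixed"


# Hypothetical genome identifiers for illustration.
pathogens = {"P1", "P2", "P3"}
examples = {
    "geneA": {"P1"},
    "geneB": {"N1"},
    "geneC": {"P1", "P3"},
    "geneD": {"N1", "N2"},
    "geneE": {"P2", "N1"},
}
groups = {gene: classify_orfan(genomes, pathogens) for gene, genomes in examples.items()}
```

Counting the resulting labels per genus would give the PS-ORFan and SS-ORFan percentages that the abstract compares between pathogens and nonpathogens.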
17
Vakirlis N, Hebert AS, Opulente DA, Achaz G, Hittinger CT, Fischer G, Coon JJ, Lafontaine I. A Molecular Portrait of De Novo Genes in Yeasts. Mol Biol Evol 2019; 35:631-645. [PMID: 29220506 DOI: 10.1093/molbev/msx315] [Citation(s) in RCA: 65] [Impact Index Per Article: 13.0] [Indexed: 12/13/2022]
Abstract
New genes, with novel protein functions, can evolve "from scratch" out of intergenic sequences. These de novo genes can integrate the cell's genetic network and drive important phenotypic innovations. Therefore, identifying de novo genes and understanding how the transition from noncoding to coding occurs are key problems in evolutionary biology. However, identifying de novo genes is a difficult task, hampered by the presence of remote homologs, fast evolving sequences and erroneously annotated protein coding genes. To overcome these limitations, we developed a procedure that handles the usual pitfalls in de novo gene identification and predicted the emergence of 703 de novo gene candidates in 15 yeast species from 2 genera whose phylogeny spans at least 100 million years of evolution. We validated 85 candidates by proteomic data, providing new translation evidence for 25 of them through mass spectrometry experiments. We also unambiguously identified the mutations that enabled the transition from noncoding to coding for 30 Saccharomyces de novo genes. We established that de novo gene origination is a widespread phenomenon in yeasts, only a few being ultimately maintained by selection. We also found that de novo genes preferentially emerge next to divergent promoters in GC-rich intergenic regions where the probability of finding a fortuitous and transcribed ORF is the highest. Finally, we found a more than 3-fold enrichment of de novo genes at recombination hot spots, which are GC-rich and nucleosome-free regions, suggesting that meiotic recombination contributes to de novo gene emergence in yeasts.
Affiliation(s)
- Nikolaos Vakirlis
  - Sorbonne Universités, UPMC Univ Paris 06, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative UMR7238, 75005 Paris, France
- Alex S Hebert
  - Genome Center of Wisconsin, University of Wisconsin-Madison, Madison, WI; DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI
- Dana A Opulente
  - Laboratory of Genetics, Genome Center of Wisconsin, J. F. Crow Institute for the Study of Evolution, Wisconsin Energy Institute, University of Wisconsin-Madison, Madison, WI
- Guillaume Achaz
  - Atelier de BioInformatique, ISyEB UMR7205 Muséum National d'Histoire Naturelle, Paris, France; SMILE Group, CIRB UMR7241, Collège de France, Paris, France
- Chris Todd Hittinger
  - DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI; Laboratory of Genetics, Genome Center of Wisconsin, J. F. Crow Institute for the Study of Evolution, Wisconsin Energy Institute, University of Wisconsin-Madison, Madison, WI
- Gilles Fischer
  - Sorbonne Universités, UPMC Univ Paris 06, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative UMR7238, 75005 Paris, France
- Joshua J Coon
  - Genome Center of Wisconsin, University of Wisconsin-Madison, Madison, WI; DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI; Department of Biomolecular Chemistry, University of Wisconsin-Madison, Madison, WI; Department of Chemistry, University of Wisconsin-Madison, Madison, WI; Morgridge Institute for Research, Madison, WI
- Ingrid Lafontaine
  - Atelier de BioInformatique, ISyEB UMR7205 Muséum National d'Histoire Naturelle, Paris, France; Sorbonne Universités, UPMC Univ Paris 06, CNRS, Institut de Biologie Physico-Chimique, Physiologie Membranaire et Moléculaire du Chloroplaste UMR7141, 75005 Paris, France
18
Watson AK, Lannes R, Pathmanathan JS, Méheust R, Karkar S, Colson P, Corel E, Lopez P, Bapteste E. The Methodology Behind Network Thinking: Graphs to Analyze Microbial Complexity and Evolution. Methods Mol Biol 2019; 1910:271-308. [PMID: 31278668 DOI: 10.1007/978-1-4939-9074-0_9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 06/09/2023]
Abstract
In the post genomic era, large and complex molecular datasets from genome and metagenome sequencing projects expand the limits of what is possible for bioinformatic analyses. Network-based methods are increasingly used to complement phylogenetic analysis in studies in molecular evolution, including comparative genomics, classification, and ecological studies. Using network methods, the vertical and horizontal relationships between all genes or genomes, whether they are from cellular chromosomes or mobile genetic elements, can be explored in a single expandable graph. In recent years, development of new methods for the construction and analysis of networks has helped to broaden the availability of these approaches from programmers to a diversity of users. This chapter introduces the different kinds of networks based on sequence similarity that are already available to tackle a wide range of biological questions, including sequence similarity networks, gene-sharing networks and bipartite graphs, and a guide for their construction and analyses.
Affiliation(s)
- Andrew K Watson
  - Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
- Romain Lannes
  - Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
- Jananan S Pathmanathan
  - Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
- Raphaël Méheust
  - Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
- Slim Karkar
  - Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
  - Department of Ecology, Evolution, and Natural Resources, School of Environmental and Biological Sciences, Rutgers, The State University of NJ, New Brunswick, NJ, USA
- Philippe Colson
  - Fondation Institut Hospitalo-Universitaire Méditerranée Infection, Pôle des Maladies Infectieuses et Tropicales Clinique et Biologique, Fédération de Bactériologie-Hygiène-Virologie, Centre Hospitalo-Universitaire Tione, Assistance Publique-Hôpitaux de Marseille, Marseille, France
  - Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes (URMITE) UM63, CNRS 7278, IRD 198, INSERM U1095, Aix-Marseille University, Marseille, France
- Eduardo Corel
  - Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
- Philippe Lopez
  - Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
- Eric Bapteste
  - Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
19
Bernard G, Pathmanathan JS, Lannes R, Lopez P, Bapteste E. Microbial Dark Matter Investigations: How Microbial Studies Transform Biological Knowledge and Empirically Sketch a Logic of Scientific Discovery. Genome Biol Evol 2018; 10:707-715. [PMID: 29420719 PMCID: PMC5830969 DOI: 10.1093/gbe/evy031] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Accepted: 02/05/2018] [Indexed: 02/07/2023]
Abstract
Microbes are the oldest and most widespread, phylogenetically and metabolically diverse life forms on Earth. However, they were discovered only 334 years ago, and their diversity began to be seriously investigated even later. For these reasons, microbial studies that unveil novel microbial lineages and processes affecting or involving microbes deeply (and repeatedly) transform knowledge in biology. Considering the quantitative prevalence of taxonomically and functionally unassigned sequences in environmental genomics data sets, and that of uncultured microbes on the planet, we propose that unraveling the microbial dark matter should be identified as a central priority for biologists. Based on former empirical findings of microbial studies, we sketch a logic of discovery with the potential to further highlight the microbial unknowns.
Affiliation(s)
- Guillaume Bernard
  - Sorbonne Universités, UPMC Université Paris 06, Institut de Biologie Paris-Seine (IBPS), France
- Jananan S Pathmanathan
  - Sorbonne Universités, UPMC Université Paris 06, Institut de Biologie Paris-Seine (IBPS), France
- Romain Lannes
  - Sorbonne Universités, UPMC Université Paris 06, Institut de Biologie Paris-Seine (IBPS), France
- Philippe Lopez
  - Sorbonne Universités, UPMC Université Paris 06, Institut de Biologie Paris-Seine (IBPS), France
- Eric Bapteste
  - Sorbonne Universités, UPMC Université Paris 06, Institut de Biologie Paris-Seine (IBPS), France
20
Discovering novel hydrolases from hot environments. Biotechnol Adv 2018; 36:2077-2100. [PMID: 30266344 DOI: 10.1016/j.biotechadv.2018.09.004] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 05/30/2018] [Revised: 09/21/2018] [Accepted: 09/24/2018] [Indexed: 12/12/2022]
Abstract
Novel hydrolases from hot and other extreme environments showing appropriate performance and/or novel functionalities and new approaches for their systematic screening are of great interest for developing new processes, for improving safety, health and environment issues. Existing processes could benefit as well from their properties. The workflow, based on the HotZyme project, describes a multitude of technologies and their integration from discovery to application, providing new tools for discovering, identifying and characterizing more novel thermostable hydrolases with desired functions from hot terrestrial and marine environments. To this end, hot springs worldwide were mined, resulting in hundreds of environmental samples and thousands of enrichment cultures growing on polymeric substrates of industrial interest. Using high-throughput sequencing and bioinformatics, 15 hot spring metagenomes, as well as several sequenced isolate genomes and transcriptomes were obtained. To facilitate the discovery of novel hydrolases, the annotation platform Anastasia and a whole-cell bioreporter-based functional screening method were developed. Sequence-based screening and functional screening together resulted in about 100 potentially new hydrolases of which more than a dozen have been characterized comprehensively from a biochemical and structural perspective. The characterized hydrolases include thermostable carboxylesterases, enol lactonases, quorum sensing lactonases, gluconolactonases, epoxide hydrolases, and cellulases. Apart from these novel thermostable hydrolases, the project generated an enormous amount of samples and data, thereby allowing the future discovery of even more novel enzymes.
21
Keshri V, Panda A, Levasseur A, Rolain JM, Pontarotti P, Raoult D. Phylogenomic Analysis of β-Lactamase in Archaea and Bacteria Enables the Identification of Putative New Members. Genome Biol Evol 2018; 10:1106-1114. [PMID: 29672703 PMCID: PMC5905574 DOI: 10.1093/gbe/evy028] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Accepted: 03/02/2018] [Indexed: 01/09/2023]
Abstract
β-lactamases are enzymes which are commonly produced by bacteria and which degrade the β-lactam ring of β-lactam antibiotics, namely penicillins, cephalosporins, carbapenems, and monobactams, and inactivate these antibiotics. We performed a rational and comprehensive investigation of β-lactamases in different biological databases. In this study, we constructed hidden Markov model profiles as well as the ancestral sequence of four classes of β-lactamases (A, B, C, and D), which were used to identify potential β-lactamases from environmental metagenomic (1206), human microbiome metagenomic (6417), human microbiome reference genome (1310), and NCBI's nonredundant databases (44101). Our analysis revealed the existence of putative β-lactamases in the metagenomic databases, which appeared to be similar to the four different molecular classes (A-D). This is the first report on the large-scale phylogenetic diversity of new members of β-lactamases, and our results revealed that metagenomic database dark-matter contains β-lactamase-like antibiotic resistance genes.
Affiliation(s)
- Vivek Keshri
  - Evolution Biologique et Modélisation, I2M, UMR-CNRS 7373, Aix-Marseille Université, France
  - IRD, APHM, MEPHI, IHU Méditerranée Infection, Aix-Marseille Université, France
- Arup Panda
  - Evolution Biologique et Modélisation, I2M, UMR-CNRS 7373, Aix-Marseille Université, France
- Anthony Levasseur
  - IRD, APHM, MEPHI, IHU Méditerranée Infection, Aix-Marseille Université, France
- Jean-Marc Rolain
  - IRD, APHM, MEPHI, IHU Méditerranée Infection, Aix-Marseille Université, France
- Pierre Pontarotti
  - Evolution Biologique et Modélisation, I2M, UMR-CNRS 7373, Aix-Marseille Université, France
  - CNRS, IRD, APHM, MEPHI, IHU Méditerranée Infection (Evolutionary Biology Team), Aix-Marseille Université, France
- Didier Raoult
  - IRD, APHM, MEPHI, IHU Méditerranée Infection, Aix-Marseille Université, France
| |
Collapse
|
22
|
Discovery of novel bacterial toxins by genomics and computational biology. Toxicon 2018; 147:2-12. [PMID: 29438679 DOI: 10.1016/j.toxicon.2018.02.002] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Received: 08/29/2017] [Revised: 12/23/2017] [Accepted: 02/07/2018] [Indexed: 12/13/2022]
Abstract
Hundreds of bacterial protein toxins are presently known. Traditionally, toxin identification begins with pathological studies of bacterial infectious disease. Following identification and cultivation of a bacterial pathogen, the protein toxin is purified from the culture medium and its pathogenic activity is studied using the methods of biochemistry and structural biology, cell biology, tissue and organ biology, and appropriate animal models, supplemented by bioimaging techniques. The ongoing and explosive development of high-throughput DNA sequencing and bioinformatic approaches has set in motion a revolution in many fields of biology, including microbiology. One consequence is that genes encoding novel bacterial toxins can be identified by bioinformatic and computational methods based on previous knowledge accumulated from studies of the biology and pathology of thousands of known bacterial protein toxins. Starting from the paradigmatic cases of diphtheria toxin and the tetanus and botulinum neurotoxins, this review discusses traditional experimental approaches as well as bioinformatics and genomics-driven approaches that facilitate the discovery of novel bacterial toxins. We discuss recent work on the identification of novel botulinum-like toxins from genera such as Weissella, Chryseobacterium, and Enterococcus, and the implications of these computationally identified toxins in the field. Finally, we discuss the promise of metagenomics in the discovery of novel toxins and their ecological niches, and present data suggesting the existence of uncharacterized, botulinum-like toxin genes in insect gut metagenomes.
|
23
|
Two fundamentally different classes of microbial genes. Nat Microbiol 2016; 2:16208. [PMID: 27819663 DOI: 10.1038/nmicrobiol.2016.208] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 06/02/2016] [Accepted: 09/20/2016] [Indexed: 01/15/2023]
Abstract
The evolution of bacterial and archaeal genomes is highly dynamic and involves extensive horizontal gene transfer and gene loss [1-4]. Furthermore, many microbial species appear to have open pangenomes, where each newly sequenced genome contains more than 10% ORFans, that is, genes without detectable homologues in other species [5,6]. Here, we report a quantitative analysis of microbial genome evolution by fitting the parameters of a simple, steady-state evolutionary model to the comparative genomic data on the gene content and gene order similarity between archaeal genomes. The results reveal two sharply distinct classes of microbial genes, one of which is characterized by effectively instantaneous gene replacement, and the other consists of genes with finite, distributed replacement rates. These findings imply a conservative estimate of the size of the prokaryotic genomic universe, which appears to consist of at least a billion distinct genes. Furthermore, the same distribution of constraints is shown to govern the evolution of gene complement and gene order, without the need to invoke long-range conservation or the selfish operon concept [7].
|
24
|
Lobb B, Doxey AC. Novel function discovery through sequence and structural data mining. Curr Opin Struct Biol 2016; 38:53-61. [DOI: 10.1016/j.sbi.2016.05.017] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 03/19/2016] [Revised: 05/17/2016] [Accepted: 05/24/2016] [Indexed: 01/30/2023]
|
25
|
Neuhaus K, Landstorfer R, Fellner L, Simon S, Schafferhans A, Goldberg T, Marx H, Ozoline ON, Rost B, Kuster B, Keim DA, Scherer S. Translatomics combined with transcriptomics and proteomics reveals novel functional, recently evolved orphan genes in Escherichia coli O157:H7 (EHEC). BMC Genomics 2016; 17:133. [PMID: 26911138 PMCID: PMC4765031 DOI: 10.1186/s12864-016-2456-1] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 05/27/2015] [Accepted: 02/09/2016] [Indexed: 12/30/2022] Open
Abstract
Background: Genomes of E. coli, including that of the human pathogen Escherichia coli O157:H7 (EHEC) EDL933, still harbor undetected protein-coding genes which, apparently, have escaped annotation due to their small size and non-essential function. To find such genes, global gene expression of EHEC EDL933 was examined, using strand-specific RNAseq (transcriptome), ribosomal footprinting (translatome) and mass spectrometry (proteome). Results: Using the above methods, 72 short, non-annotated protein-coding genes were detected. All of these showed signals in the ribosomal footprinting assay, indicating mRNA translation. Seven were verified by mass spectrometry. Fifty-seven genes are annotated in other Enterobacteriaceae, mainly as hypothetical genes; the remaining 15 genes constitute novel discoveries. In addition, protein structure and function were predicted computationally and compared between EHEC-encoded proteins and 100-times randomly shuffled proteins. Based on this comparison, 61 of the 72 novel proteins exhibit predicted structural and functional features similar to those of annotated proteins. Many of the novel genes show differential transcription across eleven diverse growth conditions, suggesting environmental regulation. Three genes were found to confer a phenotype in previous studies, e.g., decreased cattle colonization. Conclusions: These findings demonstrate that ribosomal footprinting can be used to detect novel protein-coding genes, contributing to the growing body of evidence that hypothetical genes are not annotation artifacts and opening an additional way to study their functionality. All 72 genes are taxonomically restricted and, therefore, appear to have evolved relatively recently de novo.
Affiliation(s)
- Klaus Neuhaus
  - Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany.
- Richard Landstorfer
  - Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany.
- Lea Fellner
  - Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany.
- Svenja Simon
  - Lehrstuhl für Datenanalyse und Visualisierung, Fachbereich Informatik und Informationswissenschaft, Universität Konstanz, Box 78, 78457, Konstanz, Germany.
- Andrea Schafferhans
  - Department of Informatics - Bioinformatics & TUM-IAS, Technische Universität München, Boltzmannstraße 3, 85748, Garching, Germany.
- Tatyana Goldberg
  - Department of Informatics - Bioinformatics & TUM-IAS, Technische Universität München, Boltzmannstraße 3, 85748, Garching, Germany.
- Harald Marx
  - Chair of Proteomics and Bioanalytics, Wissenschaftszentrum Weihenstephan, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354, Freising, Germany.
- Olga N Ozoline
  - Institute of Cell Biophysics, Russian Academy of Sciences, Moscow Region, 142290, Pushchino, Russia.
- Burkhard Rost
  - Department of Informatics - Bioinformatics & TUM-IAS, Technische Universität München, Boltzmannstraße 3, 85748, Garching, Germany.
- Bernhard Kuster
  - Chair of Proteomics and Bioanalytics, Wissenschaftszentrum Weihenstephan, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354, Freising, Germany.
  - Bavarian Center for Biomolecular Mass Spectrometry (BayBioMS), Technische Universität München, Gregor-Mendel-Str. 4, 85354, Freising, Germany.
- Daniel A Keim
  - Lehrstuhl für Datenanalyse und Visualisierung, Fachbereich Informatik und Informationswissenschaft, Universität Konstanz, Box 78, 78457, Konstanz, Germany.
- Siegfried Scherer
  - Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany.
|
26
|
Petrenko P, Lobb B, Kurtz DA, Neufeld JD, Doxey AC. MetAnnotate: function-specific taxonomic profiling and comparison of metagenomes. BMC Biol 2015; 13:92. [PMID: 26541816 PMCID: PMC4636000 DOI: 10.1186/s12915-015-0195-4] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Received: 07/06/2015] [Accepted: 10/02/2015] [Indexed: 11/13/2022] Open
Abstract
Background: Metagenomes provide access to the taxonomic composition and functional capabilities of microbial communities. Although metagenomic analysis methods exist for estimating overall community composition or metabolic potential, identifying specific taxa that encode specific functions or pathways of interest can be more challenging. Here we present MetAnnotate, which addresses the common question: “which organisms perform my function of interest within my metagenome(s) of interest?” MetAnnotate uses profile hidden Markov models to analyze shotgun metagenomes for genes and pathways of interest, classifies retrieved sequences either through a phylogenetic placement or best hit approach, and enables comparison of these profiles between metagenomes. Results: Based on a simulated metagenome dataset, the tool achieves high taxonomic classification accuracy for a broad range of genes, including both markers of community abundance and specific biological pathways. Lastly, we demonstrate MetAnnotate by analyzing for cobalamin (vitamin B12) synthesis genes across hundreds of aquatic metagenomes in a fraction of the time required by the commonly used Basic Local Alignment Search Tool top hit approach. Conclusions: MetAnnotate is multi-threaded and installable as a local web application or command-line tool on Linux systems. MetAnnotate is a useful framework for general and/or function-specific taxonomic profiling and comparison of metagenomes.
Affiliation(s)
- Pavel Petrenko
  - Department of Biology, University of Waterloo, 200 University Ave. West, Waterloo, ON, N2L 3G1, Canada
- Briallen Lobb
  - Department of Biology, University of Waterloo, 200 University Ave. West, Waterloo, ON, N2L 3G1, Canada
- Daniel A Kurtz
  - Department of Biology, University of Waterloo, 200 University Ave. West, Waterloo, ON, N2L 3G1, Canada
- Josh D Neufeld
  - Department of Biology, University of Waterloo, 200 University Ave. West, Waterloo, ON, N2L 3G1, Canada
- Andrew C Doxey
  - Department of Biology, University of Waterloo, 200 University Ave. West, Waterloo, ON, N2L 3G1, Canada
|