1
Pavlopoulos GA, Baltoumas FA, Liu S, Selvitopi O, Camargo AP, Nayfach S, Azad A, Roux S, Call L, Ivanova NN, Chen IM, Paez-Espino D, Karatzas E, Iliopoulos I, Konstantinidis K, Tiedje JM, Pett-Ridge J, Baker D, Visel A, Ouzounis CA, Ovchinnikov S, Buluç A, Kyrpides NC. Unraveling the functional dark matter through global metagenomics. Nature 2023; 622:594-602. [PMID: 37821698 PMCID: PMC10584684 DOI: 10.1038/s41586-023-06583-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 03/18/2022] [Accepted: 08/30/2023] [Indexed: 10/13/2023]
Abstract
Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities [1,2]. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database [3]. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.
Affiliation(s)
- Georgios A Pavlopoulos
  - Institute for Fundamental Biomedical Research, Biomedical Science Research Center Alexander Fleming, Vari, Greece
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens, Greece
- Fotis A Baltoumas
  - Institute for Fundamental Biomedical Research, Biomedical Science Research Center Alexander Fleming, Vari, Greece
- Sirui Liu
  - John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, USA
- Oguz Selvitopi
  - Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Antonio Pedro Camargo
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Stephen Nayfach
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Ariful Azad
  - Luddy School of Informatics, Computing and Engineering, Indiana University Bloomington, Bloomington, IN, USA
- Simon Roux
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Lee Call
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Natalia N Ivanova
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- I Min Chen
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- David Paez-Espino
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Evangelos Karatzas
  - Institute for Fundamental Biomedical Research, Biomedical Science Research Center Alexander Fleming, Vari, Greece
- Ioannis Iliopoulos
  - Department of Basic Sciences, School of Medicine, University of Crete, Heraklion, Greece
- James M Tiedje
  - Center for Microbial Ecology, Michigan State University, East Lansing, MI, USA
- Jennifer Pett-Ridge
  - Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, USA
- David Baker
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Axel Visel
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Christos A Ouzounis
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Biological Computation & Process Laboratory, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas, Thessalonica, Greece
  - Biological Computation & Computational Biology Group, Artificial Intelligence & Information Analysis Lab, School of Informatics, Aristotle University of Thessalonica, Thessalonica, Greece
- Sergey Ovchinnikov
  - John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, USA
- Aydin Buluç
  - Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
- Nikos C Kyrpides
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
2
Sajid S, Mashkoor M, Jørgensen MG, Christensen LP, Hansen PR, Franzyk H, Mirza O, Prabhala BK. The Y-ome Conundrum: Insights into Uncharacterized Genes and Approaches for Functional Annotation. Mol Cell Biochem 2023:10.1007/s11010-023-04827-8. [PMID: 37610616 DOI: 10.1007/s11010-023-04827-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Accepted: 08/09/2023] [Indexed: 08/24/2023]
Abstract
The ever-increasing availability of genome sequencing data has revealed a substantial number of uncharacterized genes without known functions across various organisms. The first comprehensive genome sequencing of E. coli K12 revealed that more than 50% of its open reading frames corresponded to transcripts with no known functions. The group of protein-coding genes without a functional description and/or a recognized pathway, beginning with the letter "Y", is classified as the "y-ome". Several efforts have been made to elucidate the functions of these genes and to recognize their role in biological processes. This review provides a brief update on various strategies employed when studying the y-ome, such as high-throughput experimental approaches, comparative omics, metabolic engineering, gene expression analysis, and data integration techniques. Additionally, we highlight recent advancements in functional annotation methods, including the use of machine learning, network analysis, and functional genomics approaches. Novel approaches are required to produce more precise functional annotations across the genome to reduce the number of genes with unknown functions.
Affiliation(s)
- Salvia Sajid
  - Department of Drug Design and Pharmacology, University of Copenhagen, Universitetsparken 2, 2100, Copenhagen Ø, Denmark
  - Department of Physics, Chemistry, and Pharmacy, University of Southern Denmark, Campusvej 55, 5230, Odense M, Denmark
- Maliha Mashkoor
  - Department of Surgery, Center for Surgical Sciences, Zealand University Hospital, Lykkebækvej 1, 4600, Køge, Denmark
- Mikkel Girke Jørgensen
  - Department of Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55, 5230, Odense M, Denmark
- Lars Porskjær Christensen
  - Department of Physics, Chemistry, and Pharmacy, University of Southern Denmark, Campusvej 55, 5230, Odense M, Denmark
- Paul Robert Hansen
  - Department of Drug Design and Pharmacology, University of Copenhagen, Universitetsparken 2, 2100, Copenhagen Ø, Denmark
- Henrik Franzyk
  - Department of Drug Design and Pharmacology, University of Copenhagen, Universitetsparken 2, 2100, Copenhagen Ø, Denmark
- Osman Mirza
  - Department of Drug Design and Pharmacology, University of Copenhagen, Universitetsparken 2, 2100, Copenhagen Ø, Denmark
- Bala Krishna Prabhala
  - Department of Physics, Chemistry, and Pharmacy, University of Southern Denmark, Campusvej 55, 5230, Odense M, Denmark
3
Plitt T, Faith JJ. Seminars in immunology special issue: Nutrition, microbiota and immunity The unexplored microbes in health and disease. Semin Immunol 2023; 66:101735. [PMID: 36857892 PMCID: PMC10049858 DOI: 10.1016/j.smim.2023.101735] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/04/2022] [Revised: 01/17/2023] [Accepted: 02/09/2023] [Indexed: 03/03/2023]
Abstract
Functional characterization of the microbiome's influence on host physiology has been dominated by a few characteristic example strains that have been studied in detail. However, the extensive development of methods for high-throughput bacterial isolation and culture over the past decade is enabling functional characterization of the broader microbiota that may impact human health. Characterizing the understudied majority of human microbes and expanding our functional understanding of the diversity of the gut microbiota could enable new insights into diseases with unknown etiology, provide disease-predictive microbiome signatures, and advance microbial therapeutics. We summarize high-throughput culture-dependent platforms for characterizing bacterial strain function and host-interactions. We elaborate on the importance of these technologies in facilitating mechanistic studies of previously unexplored microbes, highlight new opportunities for large-scale in vitro screens of host-relevant microbial functions, and discuss the potential translational applications for microbiome science.
Affiliation(s)
- Tamar Plitt
  - Precision Immunology Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Jeremiah J Faith
  - Precision Immunology Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
4
Bhosle A, Wang Y, Franzosa EA, Huttenhower C. Progress and opportunities in microbial community metabolomics. Curr Opin Microbiol 2022; 70:102195. [PMID: 36063685 DOI: 10.1016/j.mib.2022.102195] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 02/07/2022] [Revised: 07/20/2022] [Accepted: 07/21/2022] [Indexed: 01/25/2023]
Abstract
The metabolome lies at the interface of host-microbiome crosstalk. Previous work has established links between chemically diverse microbial metabolites and a myriad of host physiological processes and diseases. Coupled with scalable and cost-effective technologies, metabolomics is thus gaining popularity as a tool for characterization of microbial communities, particularly when combined with metagenomics as a window into microbiome function. A systematic interrogation of microbial community metabolomes can uncover key microbial compounds, metabolic capabilities of the microbiome, and also provide critical mechanistic insights into microbiome-linked host phenotypes. In this review, we discuss methods and accompanying resources that have been developed for these purposes. The accomplishments of these methods demonstrate that metabolomes can be used to functionally characterize microbial communities, and that microbial properties can be used to identify and investigate chemical compounds.
Affiliation(s)
- Amrisha Bhosle
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Ya Wang
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Eric A Franzosa
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Curtis Huttenhower
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
5
Cohen SE, Hashmi SM, Jones AAD, Lykourinou V, Ondrechen MJ, Sridhar S, van de Ven AL, Waters LS, Beuning PJ. Adapting Undergraduate Research to Remote Work to Increase Engagement. BIOPHYSICIST (ROCKVILLE, MD.) 2021; 2:28-32. [PMID: 36909739 PMCID: PMC10003819 DOI: 10.35459/tbp.2021.000199] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/03/2022]
Abstract
Demand for undergraduate research experiences typically outstrips the available laboratory positions, which could have been exacerbated during the remote work conditions imposed by the SARS-CoV-2/COVID-19 pandemic. This report presents a collection of examples of how undergraduates have been engaged in research under pandemic work restrictions. Examples include a range of projects related to fluid dynamics, cancer biology, nanomedicine, circadian clocks, metabolic disease, catalysis, and environmental remediation. Adaptations were made that included partnerships between remote and in-person research students and students taking on more data analysis and literature surveys, as well as data mining, computational, and informatics projects. In many cases, these projects engaged students who otherwise would have worked in traditional bench research, as some previously had. Several examples of beneficial experiences are reported, such as the additional time spent studying the literature, which gave students a heightened sense of project ownership, and more opportunities to integrate feedback into writing and research. Additionally, the more intentional and regular communication necessitated by remote work proved beneficial for all team members. Finally, online seminars and conferences have made participation possible for many more students, especially those at predominantly undergraduate institutions. Participants aim to adopt these beneficial practices in our research groups even after pandemic restrictions end.
Affiliation(s)
- Susan E Cohen
  - Department of Biological Sciences, California State University Los Angeles, Los Angeles, CA 90032, USA
- Sara M Hashmi
  - Department of Chemical Engineering, Northeastern University, Boston, MA 02115, USA
  - Department of Mechanical & Industrial Engineering, Northeastern University, Boston, MA 02115, USA
- A-Andrew D Jones
  - Department of Chemical Engineering, Northeastern University, Boston, MA 02115, USA
  - School of Public Policy and Urban Affairs, Northeastern University, Boston, MA 02115, USA
- Vasiliki Lykourinou
  - Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, USA
- Mary Jo Ondrechen
  - Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, USA
  - Center for Interdisciplinary Research on Complex Systems, Northeastern University, Boston, MA 02115, USA
- Srinivas Sridhar
  - Department of Chemical Engineering, Northeastern University, Boston, MA 02115, USA
  - Department of Physics, Northeastern University, Boston, MA 02115, USA
- Anne L van de Ven
  - Department of Physics, Northeastern University, Boston, MA 02115, USA
- Lauren S Waters
  - Department of Chemistry, University of Wisconsin Oshkosh, Oshkosh, WI 54901, USA
- Penny J Beuning
  - Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, USA
  - Center for Interdisciplinary Research on Complex Systems, Northeastern University, Boston, MA 02115, USA
6
Ghatak S, King ZA, Sastry A, Palsson BO. The y-ome defines the 35% of Escherichia coli genes that lack experimental evidence of function. Nucleic Acids Res 2019; 47:2446-2454. [PMID: 30698741 PMCID: PMC6412132 DOI: 10.1093/nar/gkz030] [Citation(s) in RCA: 73] [Impact Index Per Article: 14.6] [Received: 05/24/2018] [Revised: 12/07/2018] [Accepted: 01/26/2019] [Indexed: 01/22/2023]
Abstract
Experimental studies of Escherichia coli K-12 MG1655 often implicate poorly annotated genes in cellular phenotypes. However, we lack a systematic understanding of these genes. How many are there? What information is available for them? And what features do they share that could explain the gap in our understanding? Efforts to build predictive, whole-cell models of E. coli inevitably face this knowledge gap. We approached these questions systematically by assembling annotations from the knowledge bases EcoCyc, EcoGene, UniProt and RegulonDB. We identified the genes that lack experimental evidence of function (the ‘y-ome’) which include 1600 of 4623 unique genes (34.6%), of which 111 have absolutely no evidence of function. An additional 220 genes (4.7%) are pseudogenes or phantom genes. y-ome genes tend to have lower expression levels and are enriched in the termination region of the E. coli chromosome. Where evidence is available for y-ome genes, it most often points to them being membrane proteins and transporters. We resolve the misconception that a gene in E. coli whose primary name starts with ‘y’ is unannotated, and we discuss the value of the y-ome for systematic improvement of E. coli knowledge bases and its extension to other organisms.
Affiliation(s)
- Sankha Ghatak
  - Bioengineering Department, University of California, San Diego, La Jolla, CA 92093, USA
- Zachary A King
  - Bioengineering Department, University of California, San Diego, La Jolla, CA 92093, USA
- Anand Sastry
  - Bioengineering Department, University of California, San Diego, La Jolla, CA 92093, USA
- Bernhard O Palsson
  - Bioengineering Department, University of California, San Diego, La Jolla, CA 92093, USA
  - Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kemitorvet, Building 220, 2800 Kongens, Lyngby, Denmark
7
Reis RAG, Salvi F, Williams I, Gadda G. Kinetic Investigation of a Presumed Nitronate Monooxygenase from Pseudomonas aeruginosa PAO1 Establishes a New Class of NAD(P)H:Quinone Reductases. Biochemistry 2019; 58:2594-2607. [PMID: 31075192 DOI: 10.1021/acs.biochem.9b00207] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
Abstract
PA0660 from Pseudomonas aeruginosa PAO1 is currently classified as a hypothetical nitronate monooxygenase (NMO), but no evidence at the transcript or protein level has been presented. In this study, PA0660 was purified and its biochemical and kinetic properties were characterized. Absorption spectroscopy and mass spectrometry demonstrated a tightly, noncovalently bound FMN in the active site of the enzyme. Analytical ultracentrifugation showed that the enzyme exists as a dimer in solution. Despite its annotation, PA0660 did not exhibit nitronate monooxygenase activity. The enzyme could be reduced with NADPH or NADH with a marked preference for NADPH, as indicated by ∼30-fold larger kcat/Km and kred/Kd values. Turnover could be sustained with NAD(P)H and quinones, DCPIP, and to a lesser extent molecular oxygen. However, PA0660 did not turn over with methyl red, consistent with a lack of azoreductase activity. The enzyme turned over through a ping-pong bi-bi steady-state kinetic mechanism with NADPH and 1,4-benzoquinone showing a kcat value of 90 s⁻¹. The rate constant for flavin reduction with saturating NADPH was 360 s⁻¹, whereas that for flavin oxidation with 1,4-benzoquinone was 270 s⁻¹, consistent with both hydride transfers from the pyridine nucleotide to the flavin and from the flavin to 1,4-benzoquinone being partially rate-limiting for enzyme turnover. A BlastP search and a multiple-sequence alignment analysis of PA0660 highlighted the presence of six conserved motifs in >1000 open reading frames currently annotated as hypothetical NMOs. Our results suggest that PA0660 should be classified as an NAD(P)H:quinone reductase and serve as a paradigm enzyme for a new class of enzymes.
8
Ball J, Salvi F, Gadda G. Functional Annotation of a Presumed Nitronate Monoxygenase Reveals a New Class of NADH:Quinone Reductases. J Biol Chem 2016; 291:21160-21170. [PMID: 27502282 PMCID: PMC5076524 DOI: 10.1074/jbc.m116.739151] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 05/19/2016] [Indexed: 12/12/2022]
Abstract
The protein PA1024 from Pseudomonas aeruginosa PAO1 is currently classified as 2-nitropropane dioxygenase, the previous name for nitronate monooxygenase in the GenBank™ and PDB databases, but the enzyme was not kinetically characterized. In this study, PA1024 was purified to high levels, and the enzymatic activity was investigated by spectroscopic and polarographic techniques. Purified PA1024 did not exhibit nitronate monooxygenase activity; however, it displayed NADH:quinone reductase and a small NADH:oxidase activity. The enzyme preferred NADH to NADPH as a reducing substrate. PA1024 could reduce a broad spectrum of quinone substrates via a Ping Pong Bi Bi steady-state kinetic mechanism, generating the corresponding hydroquinones. The reductive half-reaction with NADH showed a kred value of 24 s⁻¹ and an apparent Kd value estimated in the low micromolar range. The enzyme was not able to reduce the azo dye methyl red, routinely used in the kinetic characterization of azoreductases. Finally, we revisited and modified the existing six conserved motifs of PA1024, which define a new class of NADH:quinone reductases and are present in more than 490 hypothetical proteins in the GenBank™, the vast majority of which are currently misannotated as nitronate monooxygenase.
Affiliation(s)
- Giovanni Gadda
  - From the Departments of Chemistry and Biology, Center for Biotechnology and Drug Design, Center for Diagnostics and Therapeutics, Georgia State University, Atlanta, Georgia 30302-3965
9
Chang YC, Hu Z, Rachlin J, Anton BP, Kasif S, Roberts RJ, Steffen M. COMBREX-DB: an experiment centered database of protein function: knowledge, predictions and knowledge gaps. Nucleic Acids Res 2015; 44:D330-5. [PMID: 26635392 PMCID: PMC4702925 DOI: 10.1093/nar/gkv1324] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 09/30/2015] [Accepted: 11/11/2015] [Indexed: 02/06/2023]
Abstract
The COMBREX database (COMBREX-DB; combrex.bu.edu) is an online repository of information related to (i) experimentally determined protein function, (ii) predicted protein function, (iii) relationships among proteins of unknown function and various types of experimental data, including molecular function, protein structure, and associated phenotypes. The database was created as part of the novel COMBREX (COMputational BRidges to EXperiments) effort aimed at accelerating the rate of gene function validation. It currently holds information on ∼3.3 million known and predicted proteins from over 1000 completely sequenced bacterial and archaeal genomes. The database also contains a prototype recommendation system for helping users identify those proteins whose experimental determination of function would be most informative for predicting function for other proteins within protein families. The emphasis on documenting experimental evidence for function predictions, and the prioritization of uncharacterized proteins for experimental testing distinguish COMBREX from other publicly available microbial genomics resources. This article describes updates to COMBREX-DB since an initial description in the 2011 NAR Database Issue.
Affiliation(s)
- Yi-Chien Chang
  - Bioinformatics Program, Boston University, Boston, MA 02215, USA
- Zhenjun Hu
  - Bioinformatics Program, Boston University, Boston, MA 02215, USA
- Simon Kasif
  - Bioinformatics Program, Boston University, Boston, MA 02215, USA
  - Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
- Martin Steffen
  - Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
  - Department of Pathology and Laboratory Medicine, Boston University School of Medicine, Boston, MA 02118, USA
10
Zomorrodi AR, Segrè D. Synthetic Ecology of Microbes: Mathematical Models and Applications. J Mol Biol 2015; 428:837-61. [PMID: 26522937 DOI: 10.1016/j.jmb.2015.10.019] [Citation(s) in RCA: 121] [Impact Index Per Article: 13.4] [Received: 06/23/2015] [Revised: 10/17/2015] [Accepted: 10/21/2015] [Indexed: 12/29/2022]
Abstract
As the indispensable role of natural microbial communities in many aspects of life on Earth is uncovered, the bottom-up engineering of synthetic microbial consortia with novel functions is becoming an attractive alternative to engineering single-species systems. Here, we summarize recent work on synthetic microbial communities with a particular emphasis on open challenges and opportunities in environmental sustainability and human health. We next provide a critical overview of mathematical approaches, ranging from phenomenological to mechanistic, to decipher the principles that govern the function, dynamics and evolution of microbial ecosystems. Finally, we present our outlook on key aspects of microbial ecosystems and synthetic ecology that require further developments, including the need for more efficient computational algorithms, a better integration of empirical methods and model-driven analysis, the importance of improving gene function annotation, and the value of a standardized library of well-characterized organisms to be used as building blocks of synthetic communities.
Affiliation(s)
- Daniel Segrè
  - Bioinformatics Program, Boston University, Boston, MA
  - Department of Biology, Boston University, Boston, MA
  - Department of Biomedical Engineering, Boston University, Boston, MA
11
Pfeiffer F, Oesterhelt D. A manual curation strategy to improve genome annotation: application to a set of haloarchael genomes. Life (Basel) 2015; 5:1427-44. [PMID: 26042526 PMCID: PMC4500146 DOI: 10.3390/life5021427] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Received: 04/02/2015] [Revised: 05/22/2015] [Accepted: 05/25/2015] [Indexed: 12/31/2022]
Abstract
Genome annotation errors are a persistent problem that impede research in the biosciences. A manual curation effort is described that attempts to produce high-quality genome annotations for a set of haloarchaeal genomes (Halobacterium salinarum and Hbt. hubeiense, Haloferax volcanii and Hfx. mediterranei, Natronomonas pharaonis and Nmn. moolapensis, Haloquadratum walsbyi strains HBSQ001 and C23, Natrialba magadii, Haloarcula marismortui and Har. hispanica, and Halohasta litchfieldiae). Genomes are checked for missing genes, start codon misassignments, and disrupted genes. Assignments of a specific function are preferably based on experimentally characterized homologs (Gold Standard Proteins). To avoid overannotation, which is a major source of database errors, we restrict annotation to only general function assignments when support for a specific substrate assignment is insufficient. This strategy results in annotations that are resistant to the plethora of errors that compromise public databases. Annotation consistency is rigorously validated for ortholog pairs from the genomes surveyed. The annotation is regularly crosschecked against the UniProt database to further improve annotations and increase the level of standardization. Enhanced genome annotations are submitted to public databases (EMBL/GenBank, UniProt), to the benefit of the scientific community. The enhanced annotations are also publically available via HaloLex.
Affiliation(s)
- Friedhelm Pfeiffer
  - Department of Membrane Biochemistry, Max-Planck-Institute of Biochemistry, Am Klopferspitz 18, Martinsried 82152, Germany
- Dieter Oesterhelt
  - Department of Membrane Biochemistry, Max-Planck-Institute of Biochemistry, Am Klopferspitz 18, Martinsried 82152, Germany
12
Abstract
Databases play an increasingly important role in biology. They archive, store, maintain, and share information on genes, genomes, expression data, protein sequences and structures, metabolites and reactions, interactions, and pathways. All these data are critically important to microbiologists. Furthermore, microbiology has its own databases that deal with model microorganisms, microbial diversity, physiology, and pathogenesis. Thousands of biological databases are currently available, and it becomes increasingly difficult to keep up with their development. The purpose of this minireview is to provide a brief survey of current databases that are of interest to microbiologists.
13
Sequencing and beyond: integrating molecular 'omics' for microbial community profiling. Nat Rev Microbiol 2015; 13:360-72. [PMID: 25915636 DOI: 10.1038/nrmicro3451] [Citation(s) in RCA: 398] [Impact Index Per Article: 44.2] [Indexed: 02/07/2023]
Abstract
High-throughput DNA sequencing has proven invaluable for investigating diverse environmental and host-associated microbial communities. In this Review, we discuss emerging strategies for microbial community analysis that complement and expand traditional metagenomic profiling. These include novel DNA sequencing strategies for identifying strain-level microbial variation and community temporal dynamics; measuring multiple 'omic' data types that better capture community functional activity, such as transcriptomics, proteomics and metabolomics; and combining multiple forms of omic data in an integrated framework. We highlight studies in which the 'multi-omics' approach has led to improved mechanistic models of microbial community structure and function.
14
Mills CL, Beuning PJ, Ondrechen MJ. Biochemical functional predictions for protein structures of unknown or uncertain function. Comput Struct Biotechnol J 2015; 13:182-91. [PMID: 25848497 PMCID: PMC4372640 DOI: 10.1016/j.csbj.2015.02.003] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.9] [Received: 12/03/2014] [Revised: 02/06/2015] [Accepted: 02/11/2015] [Indexed: 01/07/2023]
Abstract
With the exponential growth in the determination of protein sequences and structures via genome sequencing and structural genomics efforts, there is a growing need for reliable computational methods to determine the biochemical function of these proteins. This paper reviews the efforts to address the challenge of annotating the function at the molecular level of uncharacterized proteins. While sequence- and three-dimensional-structure-based methods for protein function prediction have been reviewed previously, the recent trends in local structure-based methods have received less attention. These local structure-based methods are the primary focus of this review. Computational methods have been developed to predict the residues important for catalysis and the local spatial arrangements of these residues can be used to identify protein function. In addition, the combination of different types of methods can help obtain more information and better predictions of function for proteins of unknown function. Global initiatives, including the Enzyme Function Initiative (EFI), COMputational BRidges to EXperiments (COMBREX), and the Critical Assessment of Function Annotation (CAFA), are evaluating and testing the different approaches to predicting the function of proteins of unknown function. These initiatives and global collaborations will increase the capability and reliability of methods to predict biochemical function computationally and will add substantial value to the current volume of structural genomics data by reducing the number of absent or inaccurate functional annotations.
Affiliation(s)
- Caitlyn L Mills
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, United States
- Penny J Beuning
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, United States
- Mary Jo Ondrechen
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, United States
15
Abstract
Mutant phenotypes provide strong clues to the functions of the underlying genes and could allow annotation of the millions of sequenced yet uncharacterized bacterial genes. However, it is not known how many genes have a phenotype under laboratory conditions, how many phenotypes are biologically interpretable for predicting gene function, and what experimental conditions are optimal to maximize the number of genes with a phenotype. To address these issues, we measured the mutant fitness of 1,586 genes of the ethanol-producing bacterium Zymomonas mobilis ZM4 across 492 diverse experiments and found statistically significant phenotypes for 89% of all assayed genes. Thus, in Z. mobilis, most genes have a functional consequence under laboratory conditions. We demonstrate that 41% of Z. mobilis genes have both a strong phenotype and a similar fitness pattern (cofitness) to another gene, and are therefore good candidates for functional annotation using mutant fitness. Among 502 poorly characterized Z. mobilis genes, we identified a significant cofitness relationship for 174. For 57 of these genes without a specific functional annotation, we found additional evidence to support the biological significance of these gene-gene associations, and in 33 instances, we were able to predict specific physiological or biochemical roles for the poorly characterized genes. Last, we identified a set of 79 diverse mutant fitness experiments in Z. mobilis that are nearly as biologically informative as the entire set of 492 experiments. Therefore, our work provides a blueprint for the functional annotation of diverse bacteria using mutant fitness.
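As a rough illustration of the cofitness idea described in this abstract, the sketch below scores pairs of genes by the Pearson correlation of their fitness values across experiments and reports the best-matching partner. The gene names, the fitness values, and the choice of Pearson correlation are assumptions made for illustration; the abstract does not state how cofitness was computed.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length profiles with at least 2 values")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(vx * vy)


def top_cofit_partner(gene, profiles):
    """Return (partner, score) for the gene whose profile correlates best with `gene`."""
    scores = {
        other: pearson(profiles[gene], profile)
        for other, profile in profiles.items()
        if other != gene
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical fitness values (one number per experiment); not data from the study.
profiles = {
    "geneA": [-2.0, 0.1, -1.5, 0.0],
    "geneB": [-1.8, 0.0, -1.4, 0.2],
    "geneC": [0.3, -1.0, 0.2, -0.9],
}
partner, score = top_cofit_partner("geneA", profiles)
print(partner, round(score, 3))
```

In this toy input, geneA and geneB lose fitness in the same experiments, so a poorly characterized geneA would be paired with geneB as a candidate for shared function.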
16
Salvi F, Agniswamy J, Yuan H, Vercammen K, Pelicaen R, Cornelis P, Spain JC, Weber IT, Gadda G. The combined structural and kinetic characterization of a bacterial nitronate monooxygenase from Pseudomonas aeruginosa PAO1 establishes NMO class I and II. J Biol Chem 2014; 289:23764-75. [PMID: 25002579 DOI: 10.1074/jbc.m114.577791] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Indexed: 01/25/2023]
Abstract
Nitronate monooxygenase (NMO) oxidizes the mitochondrial toxin propionate 3-nitronate (P3N) to malonate semialdehyde. The enzyme has been previously characterized biochemically in fungi, but no structural information is available. Based on amino acid similarity, 4,985 genes are annotated in GenBank™ as NMO. Of these, 4,424 (i.e. 89%) are bacterial genes, including several Pseudomonads that have been shown to use P3N as growth substrate. Here, we have cloned and expressed the gene pa4202 of Pseudomonas aeruginosa PAO1, purified the resulting protein, and characterized it. The enzyme is active on P3N and other alkyl nitronates, but cannot oxidize nitroalkanes. P3N is the best substrate at pH 7.5 and atmospheric oxygen with k_cat(app)/K_m(app) of 12 × 10^6 M^-1 s^-1, k_cat(app) of 1300 s^-1, and K_m(app) of 110 μM. Anaerobic reduction of the enzyme with P3N yields a flavosemiquinone, which is formed within 7.5 ms, consistent with this species being a catalytic intermediate. Absorption spectroscopy, mass spectrometry, and x-ray crystallography demonstrate a tightly, non-covalently bound FMN in the active site of the enzyme. Thus, PA4202 is the first NMO identified and characterized in bacteria. The x-ray crystal structure of the enzyme was solved at 1.44 Å, showing a TIM barrel-fold. Four motifs in common with the biochemically characterized NMO from Cyberlindnera saturnus are identified in the structure of bacterial NMO, defining Class I NMO, which includes bacterial, fungal, and two animal NMOs. Notably, the only other NMO from Neurospora crassa for which biochemical evidence is available lacks the four motifs, defining Class II NMO.
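The kinetic constants quoted in this abstract can be cross-checked with simple arithmetic: dividing the apparent turnover number by the apparent Michaelis constant, once the latter is converted from micromolar to molar, should land close to the reported catalytic efficiency. The sketch below uses only the three values given in the abstract; the variable names are illustrative.

```python
# Values as reported in the abstract.
k_cat = 1300.0        # apparent turnover number, s^-1
k_m_molar = 110e-6    # apparent Michaelis constant, 110 micromolar expressed in M
reported = 12e6       # reported catalytic efficiency, M^-1 s^-1

# Catalytic efficiency recomputed from the two individual constants.
efficiency = k_cat / k_m_molar

# Relative gap between the recomputed and the reported value.
relative_difference = abs(efficiency - reported) / reported

print(f"{efficiency:.3g} M^-1 s^-1, differs from reported by {relative_difference:.1%}")
```

The recomputed value is about 1.18 × 10^7 M^-1 s^-1, which rounds to the reported 12 × 10^6 M^-1 s^-1.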
Affiliation(s)
- Ken Vercammen
- the Department of Bioengineering Sciences, Vrije Universiteit Brussel, Belgium, and the Department of Structural Biology Brussels, VIB, Pleinlaan 2, 1050 Brussels, Belgium
- Rudy Pelicaen
- the Department of Bioengineering Sciences, Vrije Universiteit Brussel, Belgium, and the Department of Structural Biology Brussels, VIB, Pleinlaan 2, 1050 Brussels, Belgium
- Pierre Cornelis
- the Department of Bioengineering Sciences, Vrije Universiteit Brussel, Belgium, and the Department of Structural Biology Brussels, VIB, Pleinlaan 2, 1050 Brussels, Belgium
- Jim C Spain
- the School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia 30302
- Irene T Weber
- From the Departments of Chemistry, Biology, Center for Biotechnology and Drug Design,
- Giovanni Gadda
- From the Departments of Chemistry, Biology, Center for Biotechnology and Drug Design, Center for Diagnostics and Therapeutics, Georgia State University, Atlanta, Georgia 30302,
17
Galperin MY, Koonin EV. Comparative Genomics Approaches to Identifying Functionally Related Genes. Algorithms for Computational Biology 2014. [DOI: 10.1007/978-3-319-07953-0_1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/03/2023]
18
Abstract
The genomic revolution promises great advances in the search for useful biocatalysts. Function-based metagenomic approaches have identified several enzymes with properties that make them useful candidates for a variety of bioprocesses. As DNA sequencing costs continue to decline, the volume of genomic data, along with their corresponding predicted protein sequences, will continue to increase dramatically, necessitating new approaches to leverage this information for gene-based bioprospecting efforts. Additionally, as new functions are discovered and correlated with this sequence information, the knowledge of the often complex relationship between a protein's sequence and function will improve. This in turn will lead to better gene-based bioprospecting approaches and facilitate the tailoring of desired properties through protein engineering projects. In this chapter, we discuss a number of recent advances in bioprospecting within the context of the genomic age.
Affiliation(s)
- Michael A Hicks
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Kristala L J Prather
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; Synthetic Biology Engineering Research Center (SynBERC), Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
19
Hwang WC, Bakolitsa C, Punta M, Coggill PC, Bateman A, Axelrod HL, Rawlings ND, Sedova M, Peterson SN, Eberhardt RY, Aravind L, Pascual J, Godzik A. LUD, a new protein domain associated with lactate utilization. BMC Bioinformatics 2013; 14:341. [PMID: 24274019 PMCID: PMC3924224 DOI: 10.1186/1471-2105-14-341] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 07/03/2013] [Accepted: 11/19/2013] [Indexed: 11/24/2022]
Abstract
Background: A novel highly conserved protein domain, DUF162 [Pfam: PF02589], can be mapped to two proteins: LutB and LutC. Both proteins are encoded by a highly conserved LutABC operon, which has been implicated in lactate utilization in bacteria. Based on our analysis of its sequence, structure, and recent experimental evidence reported by other groups, we hereby redefine DUF162 as the LUD domain family. Results: JCSG solved the first crystal structure [PDB:2G40] from the LUD domain family: LutC protein, encoded by ORF DR_1909, of Deinococcus radiodurans. LutC shares features with domains in the functionally diverse ISOCOT superfamily. We have observed that the LUD domain has an increased abundance in the human gut microbiome. Conclusions: We propose a model for the substrate and cofactor binding and regulation in LUD domain. The significance of LUD-containing proteins in the human gut microbiome, and the implication of lactate metabolism in the radiation-resistance of Deinococcus radiodurans are discussed.
Affiliation(s)
- William C Hwang
- Joint Center for Structural Genomics, La Jolla, CA 92037, USA.
20
Martin AJM, Walsh I, Domenico TD, Mičetić I, Tosatto SCE. PANADA: protein association network annotation, determination and analysis. PLoS One 2013; 8:e78383. [PMID: 24265686 PMCID: PMC3827049 DOI: 10.1371/journal.pone.0078383] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 07/04/2013] [Accepted: 09/20/2013] [Indexed: 11/18/2022]
Abstract
Increasingly large numbers of proteins require methods for functional annotation. This is typically based on pairwise inference from the homology of either protein sequence or structure. Recently, similarity networks have been presented to leverage both the ability to visualize relationships between proteins and assess the transferability of functional inference. Here we present PANADA, a novel toolkit for the visualization and analysis of protein similarity networks in Cytoscape. Networks can be constructed based on pairwise sequence or structural alignments either on a set of proteins or, alternatively, by database search from a single sequence. The Panada web server, executable for download and examples and extensive help files are available at URL: http://protein.bio.unipd.it/panada/.
Affiliation(s)
- Ian Walsh
- Department of Biology, University of Padova, Padova, Italy
- Ivan Mičetić
- Department of Biology, University of Padova, Padova, Italy
21
Mazumdar V, Amar S, Segrè D. Metabolic proximity in the order of colonization of a microbial community. PLoS One 2013; 8:e77617. [PMID: 24204896 PMCID: PMC3813667 DOI: 10.1371/journal.pone.0077617] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Received: 06/08/2013] [Accepted: 09/03/2013] [Indexed: 01/25/2023]
Abstract
Microbial biofilms are often composed of multiple bacterial species that accumulate by adhering to a surface and to each other. Biofilms can be resistant to antibiotics and physical stresses, posing unresolved challenges in the fight against infectious diseases. It has been suggested that early colonizers of certain biofilms could cause local environmental changes, favoring the aggregation of subsequent organisms. Here we ask whether the enzyme content of different microbes in a well-characterized dental biofilm can be used to predict their order of colonization. We define a metabolic distance between different species, based on the overlap in their enzyme content. We next use this metric to quantify the average metabolic distance between neighboring organisms in the biofilm. We find that this distance is significantly smaller than the one observed for a random choice of prokaryotes, probably reflecting the environmental constraints on metabolic function of the community. More surprisingly, this metabolic metric is able to discriminate between observed and randomized orders of colonization of the biofilm, with the observed orders displaying smaller metabolic distance than randomized ones. By complementing these results with the analysis of individual vs. joint metabolic networks, we find that the tendency towards minimal metabolic distance may be counter-balanced by a propensity to pair organisms with maximal joint potential for synergistic interactions. The trade-off between these two tendencies may create a "sweet spot" of optimal inter-organism distance, with possible broad implications for our understanding of microbial community organization.
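To make the notion of a metabolic distance based on enzyme overlap more concrete, the sketch below treats each organism as a set of enzyme identifiers and uses the Jaccard distance between sets, then averages it over neighboring organisms in a colonization order. The Jaccard formula, the species labels, and the EC-style identifiers are assumptions for illustration; the abstract does not give the exact definition the authors used.

```python
def metabolic_distance(enzymes_a, enzymes_b):
    """Jaccard distance between two enzyme sets: 1 - |A & B| / |A | B|."""
    a, b = set(enzymes_a), set(enzymes_b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def mean_neighbor_distance(order, enzyme_sets):
    """Average distance between consecutive organisms in a colonization order."""
    pairs = list(zip(order, order[1:]))
    if not pairs:
        raise ValueError("order must contain at least two organisms")
    total = sum(metabolic_distance(enzyme_sets[x], enzyme_sets[y]) for x, y in pairs)
    return total / len(pairs)


# Hypothetical enzyme content per species; not data from the study.
enzyme_sets = {
    "sp1": {"1.1.1.1", "2.7.1.1", "4.1.2.13"},
    "sp2": {"1.1.1.1", "2.7.1.1", "5.3.1.9"},
    "sp3": {"5.3.1.9", "2.7.1.11", "3.1.3.11"},
}

observed = mean_neighbor_distance(["sp1", "sp2", "sp3"], enzyme_sets)
shuffled = mean_neighbor_distance(["sp1", "sp3", "sp2"], enzyme_sets)
print(observed, shuffled)
```

Comparing the score of an observed order against many shuffled orders, rather than the single shuffle shown here, is the kind of test the abstract describes.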
Affiliation(s)
- Varun Mazumdar
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Salomon Amar
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Center for Anti-Inflammatory Therapeutics; Boston University Goldman School of Dental Medicine, Boston, Massachusetts, United States of America
- Daniel Segrè
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Department of Biology and Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
22
Abstract
The 1952 observation of host-induced non-hereditary variation in bacteriophages by Salvador Luria and Mary Human led to the discovery in the 1960s of modifying enzymes that glucosylate hydroxymethylcytosine in T-even phages and of genes encoding corresponding host activities that restrict non-glucosylated phage DNA: rglA and rglB (restricts glucoseless phage). In the 1980s, appreciation of the biological scope of these activities was dramatically expanded with the demonstration that plant and animal DNA was also sensitive to restriction in cloning experiments. The rgl genes were renamed mcrA and mcrBC (modified cytosine restriction). The new class of modification-dependent restriction enzymes was named Type IV, as distinct from the familiar modification-blocked Types I–III. A third Escherichia coli enzyme, mrr (modified DNA rejection and restriction), recognizes both methylcytosine and methyladenine. In recent years, the universe of modification-dependent enzymes has expanded greatly. Technical advances allow use of Type IV enzymes to study epigenetic mechanisms in mammals and plants. Type IV enzymes recognize modified DNA with low sequence selectivity and have emerged many times independently during evolution. Here, we review biochemical and structural data on these proteins, the resurgent interest in Type IV enzymes as tools for epigenetic research and the evolutionary pressures on these systems.
Affiliation(s)
- Wil A M Loenen
- Leiden University Medical Center, P.O. Box 9600 2300RC Leiden, The Netherlands and New England Biolabs Inc., 240 County Road Ipswich, MA 01938-2723, USA
23
Anton BP, Chang YC, Brown P, Choi HP, Faller LL, Guleria J, Hu Z, Klitgord N, Levy-Moonshine A, Maksad A, Mazumdar V, McGettrick M, Osmani L, Pokrzywa R, Rachlin J, Swaminathan R, Allen B, Housman G, Monahan C, Rochussen K, Tao K, Bhagwat AS, Brenner SE, Columbus L, de Crécy-Lagard V, Ferguson D, Fomenkov A, Gadda G, Morgan RD, Osterman AL, Rodionov DA, Rodionova IA, Rudd KE, Söll D, Spain J, Xu SY, Bateman A, Blumenthal RM, Bollinger JM, Chang WS, Ferrer M, Friedberg I, Galperin MY, Gobeill J, Haft D, Hunt J, Karp P, Klimke W, Krebs C, Macelis D, Madupu R, Martin MJ, Miller JH, O'Donovan C, Palsson B, Ruch P, Setterdahl A, Sutton G, Tate J, Yakunin A, Tchigvintsev D, Plata G, Hu J, Greiner R, Horn D, Sjölander K, Salzberg SL, Vitkup D, Letovsky S, Segrè D, DeLisi C, Roberts RJ, Steffen M, Kasif S. The COMBREX project: design, methodology, and initial results. PLoS Biol 2013; 11:e1001638. [PMID: 24013487 PMCID: PMC3754883 DOI: 10.1371/journal.pbio.1001638] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Indexed: 11/18/2022]
Affiliation(s)
- Brian P. Anton
- New England Biolabs, Ipswich, Massachusetts, United States of America
- * E-mail: (BPA); (SK)
- Yi-Chien Chang
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Peter Brown
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Han-Pil Choi
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Lina L. Faller
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Jyotsna Guleria
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Zhenjun Hu
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Niels Klitgord
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Ami Levy-Moonshine
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Almaz Maksad
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Varun Mazumdar
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Mark McGettrick
- Diatom Software LLC, Holliston, Massachusetts, United States of America
- Lais Osmani
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Revonda Pokrzywa
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- John Rachlin
- Diatom Software LLC, Holliston, Massachusetts, United States of America
- Rajeswari Swaminathan
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Benjamin Allen
- Program for Evolutionary Dynamics, Harvard University, Cambridge, Massachusetts, United States of America
- Department of Mathematics, Emmanuel College, Boston, Massachusetts, United States of America
- Genevieve Housman
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Caitlin Monahan
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Krista Rochussen
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Kevin Tao
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Ashok S. Bhagwat
- Department of Chemistry, Wayne State University, Detroit, Michigan, United States of America
- Steven E. Brenner
- Department of Plant and Microbial Biology, University of California, Berkeley, California, United States of America
- Linda Columbus
- Department of Chemistry, University of Virginia, Charlottesville, Virginia, United States of America
- Valérie de Crécy-Lagard
- Department of Microbiology and Cell Science, University of Florida, Gainesville, Florida, United States of America
- Donald Ferguson
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
- Alexey Fomenkov
- New England Biolabs, Ipswich, Massachusetts, United States of America
- Giovanni Gadda
- Department of Chemistry, Georgia State University, Atlanta, Georgia, United States of America
- Richard D. Morgan
- New England Biolabs, Ipswich, Massachusetts, United States of America
- Andrei L. Osterman
- Bioinformatics and Systems Biology, Sanford Burnham Medical Research Institute, La Jolla, California, United States of America
- Dmitry A. Rodionov
- Bioinformatics and Systems Biology, Sanford Burnham Medical Research Institute, La Jolla, California, United States of America
- Irina A. Rodionova
- Bioinformatics and Systems Biology, Sanford Burnham Medical Research Institute, La Jolla, California, United States of America
- Kenneth E. Rudd
- Department of Biochemistry and Molecular Biology, University of Miami, Miami, Florida, United States of America
- Dieter Söll
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- James Spain
- School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America
- Shuang-yong Xu
- New England Biolabs, Ipswich, Massachusetts, United States of America
- Alex Bateman
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom
- Robert M. Blumenthal
- Department of Medical Microbiology and Immunology, and Program in Bioinformatics, University of Toledo, Toledo, Ohio, United States of America
- J. Martin Bollinger
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Woo-Suk Chang
- Department of Biology, University of Texas-Arlington, Arlington, Texas, United States of America
- Manuel Ferrer
- Spanish National Research Council (CSIC), Institute of Catalysis, Madrid, Spain
- Iddo Friedberg
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
- Michael Y. Galperin
- National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Julien Gobeill
- Department of Library and Information Sciences, University of Applied Sciences Western Switzerland, Geneva, Switzerland
- Bibliomics and Text Mining Group, Swiss Institute of Bioinformatics, Geneva, Switzerland
- Daniel Haft
- J. Craig Venter Institute, Rockville, Maryland, United States of America
- John Hunt
- Biological Sciences, Columbia University, New York, New York, United States of America
- Peter Karp
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Menlo Park, California, United States of America
- William Klimke
- National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Carsten Krebs
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Dana Macelis
- New England Biolabs, Ipswich, Massachusetts, United States of America
- Ramana Madupu
- J. Craig Venter Institute, Rockville, Maryland, United States of America
- Maria J. Martin
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom
- Jeffrey H. Miller
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, Los Angeles, California, United States of America
- Claire O'Donovan
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom
- Bernhard Palsson
- Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Patrick Ruch
- Department of Library and Information Sciences, University of Applied Sciences Western Switzerland, Geneva, Switzerland
- Bibliomics and Text Mining Group, Swiss Institute of Bioinformatics, Geneva, Switzerland
- Aaron Setterdahl
- Department of Chemistry, Indiana University Southeast, New Albany, Indiana, United States of America
- Granger Sutton
- J. Craig Venter Institute, Rockville, Maryland, United States of America
- John Tate
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom
- Alexander Yakunin
- Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario, Canada
- Dmitri Tchigvintsev
- Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario, Canada
- Germán Plata
- Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, United States of America
- Integrated Program in Cellular, Molecular, Structural, and Genetic Studies, Columbia University, New York, New York, United States of America
- Jie Hu
- Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, United States of America
- Russell Greiner
- Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
- David Horn
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Kimmen Sjölander
- Berkeley Phylogenomics Group, University of California, Berkeley, California, United States of America
- Steven L. Salzberg
- Departments of Medicine and Biostatistics, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, United States of America
- Dennis Vitkup
- Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, United States of America
- Stanley Letovsky
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Daniel Segrè
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Charles DeLisi
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Richard J. Roberts
- New England Biolabs, Ipswich, Massachusetts, United States of America
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Martin Steffen
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Simon Kasif
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- * E-mail: (BPA); (SK)
24
Choi HP, Juarez S, Ciordia S, Fernandez M, Bargiela R, Albar JP, Mazumdar V, Anton BP, Kasif S, Ferrer M, Steffen M. Biochemical Characterization of Hypothetical Proteins from Helicobacter pylori. PLoS One 2013; 8:e66605. [PMID: 23825549 PMCID: PMC3688963 DOI: 10.1371/journal.pone.0066605] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 01/17/2013] [Accepted: 05/08/2013] [Indexed: 12/16/2022]
Abstract
The functional characterization of Open Reading Frames (ORFs) from sequenced genomes remains a bottleneck in our effort to understand microbial biology. In particular, the functional characterization of proteins with only remote sequence homology to known proteins can be challenging, as there may be few clues to guide initial experiments. Affinity enrichment of proteins from cell lysates, and a global perspective of protein function as provided by COMBREX, affords an approach to this problem. We present here the biochemical analysis of six proteins from Helicobacter pylori ATCC 26695, a focus organism in COMBREX. Initial hypotheses were based upon affinity capture of proteins from total cellular lysate using derivatized nano-particles, and subsequent identification by mass spectrometry. Candidate genes encoding these proteins were cloned and expressed in Escherichia coli, and the recombinant proteins were purified and characterized biochemically and their biochemical parameters compared with the native ones. These proteins include a guanosine triphosphate (GTP) cyclohydrolase (HP0959), an ATPase (HP1079), an adenosine deaminase (HP0267), a phosphodiesterase (HP1042), an aminopeptidase (HP1037), and new substrates were characterized for a peptidoglycan deacetylase (HP0310). Generally, characterized enzymes were active at acidic to neutral pH (4.0–7.5) with temperature optima ranging from 35 to 55°C, although some exhibited outstanding characteristics.
Affiliation(s)
- Han-Pil Choi
- Dept of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Silvia Juarez
- Proteomic Facility, CNB-National Centre for Biotechnology, CSIC, Darwin 3, Madrid, Spain
- Sergio Ciordia
- Proteomic Facility, CNB-National Centre for Biotechnology, CSIC, Darwin 3, Madrid, Spain
- Marisol Fernandez
- Proteomic Facility, CNB-National Centre for Biotechnology, CSIC, Darwin 3, Madrid, Spain
- Rafael Bargiela
- Spanish National Research Council (CSIC), Institute of Catalysis, Madrid, Spain
- Juan P. Albar
- Proteomic Facility, CNB-National Centre for Biotechnology, CSIC, Darwin 3, Madrid, Spain
- Varun Mazumdar
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Brian P. Anton
- New England Biolabs, Ipswich, Massachusetts, United States of America
- Simon Kasif
- Dept of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Manuel Ferrer
- Spanish National Research Council (CSIC), Institute of Catalysis, Madrid, Spain
- * E-mail: (MS); (MF)
- Martin Steffen
- Dept of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Dept of Pathology and Laboratory Medicine, Boston University School of Medicine, Boston, Massachusetts, United States of America
- * E-mail: (MS); (MF)
25
Bell MJ, Gillespie CS, Swan D, Lord P. An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB. Bioinformatics 2013; 28:i562-i568. [PMID: 22962482 PMCID: PMC3436799 DOI: 10.1093/bioinformatics/bts372] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022]
Abstract
Motivation: Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually, however manual curation is costly, time consuming and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid un-annotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we look at addressing within this article. Specifically we investigate word reuse within bulk textual annotations and relate this to Zipf's Principle of Least Effort. We use the UniProt Knowledgebase (UniProtKB) as a case study to demonstrate this approach since it allows us to compare annotation change, both over time and between automated and manually curated annotations. Results: By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, which are consistent with existing studies of quality on free text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality. Availability: Source code is available at the authors' website: http://homepages.cs.ncl.ac.uk/m.j.bell1/annotation. Contact: phillip.lord@newcastle.ac.uk
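One simple way to picture the word-reuse analysis in this abstract is to rank words by frequency and estimate the slope of log(frequency) against log(rank); a slope near -1 corresponds to classic Zipf behavior. The sketch below is a simplified stand-in that uses an ordinary least-squares fit on whitespace-separated tokens. Both choices, and the sample text, are assumptions for illustration and are not taken from the authors' source code.

```python
from collections import Counter
from math import log


def zipf_slope(text):
    """Least-squares slope of log(frequency) versus log(rank) for the words in `text`."""
    counts = sorted(Counter(text.lower().split()).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [log(rank) for rank in range(1, len(counts) + 1)]
    ys = [log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical annotation text whose word counts (12, 6, 4, 3) follow 12 / rank.
sample = "protein " * 12 + "kinase " * 6 + "binding " * 4 + "domain " * 3
slope = zipf_slope(sample)
print(round(slope, 3))
```

Tracking how such a slope changes between database releases, or between manually and automatically curated entries, mirrors the comparisons the abstract describes.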
Affiliation(s)
- Michael J Bell
  - School of Computing Science, Newcastle University, Newcastle-Upon-Tyne, NE1 7RU, UK
|
26
|
Buttigieg PL, Hankeln W, Kostadinov I, Kottmann R, Yilmaz P, Duhaime MB, Glöckner FO. Ecogenomic perspectives on domains of unknown function: correlation-based exploration of marine metagenomes. PLoS One 2013; 8:e50869. [PMID: 23516388 PMCID: PMC3597751 DOI: 10.1371/journal.pone.0050869] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 07/22/2012] [Accepted: 10/24/2012] [Indexed: 11/19/2022]
Abstract
Background: The proportion of conserved DNA sequences with no clear function is steadily growing in bioinformatics databases. Studies of sequence and structural homology have indicated that many uncharacterized protein domain sequences are variants of functionally described domains. If these variants promote an organism's ecological fitness, they are likely to be conserved in the genome of its progeny and the population at large. The genetic composition of microbial communities in their native ecosystems is accessible through metagenomics. We hypothesize the co-variation of protein domain sequences across metagenomes from similar ecosystems will provide insights into their potential roles and aid further investigation. Methodology/Principal findings: We calculated the correlation of Pfam protein domain sequences across the Global Ocean Sampling metagenome collection, employing conservative detection and correlation thresholds to limit results to well-supported hits and associations. We then examined intercorrelations between domains of unknown function (DUFs) and domains involved in known metabolic pathways using network visualization and cluster-detection tools. We used a cautious “guilty-by-association” approach, referencing knowledge-level resources to identify and discuss associations that offer insight into DUF function. We observed numerous DUFs associated with photobiologically active domains and prevalent in the Cyanobacteria. Other clusters included DUFs associated with DNA maintenance and repair, inorganic nutrient metabolism, and sodium-translocating transport domains. We also observed a number of clusters reflecting known metabolic associations and cases that predicted functional reclassification of DUFs. Conclusion/Significance: Critically examining domain covariation across metagenomic datasets can grant new perspectives on the roles and associations of DUFs in an ecological setting. Targeted attempts at DUF characterization in the laboratory or in silico may draw from these insights, and opportunities to discover new associations and corroborate existing ones will arise as more large-scale metagenomic datasets emerge.
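The correlation-based exploration described in this abstract can be sketched as follows: compute a pairwise correlation between per-sample domain abundances and keep only pairs above a conservative cut-off. This is a minimal sketch under stated assumptions; the abstract does not give the correlation measure or the thresholds, so Pearson correlation, the 0.9 cut-off, the domain names and the counts are all hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length abundance vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(abundance, threshold=0.9):
    """Return (domain_a, domain_b, r) for pairs with r >= threshold."""
    pairs = []
    for a, b in combinations(sorted(abundance), 2):
        r = pearson(abundance[a], abundance[b])
        if r >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical per-sample domain counts across four metagenomes.
abundance = {
    "DUF_A": [1, 2, 3, 4],
    "Photosystem": [2, 4, 6, 8],
    "Transporter": [4, 3, 2, 1],
}
print(correlated_pairs(abundance))
```

Pairs that survive the cut-off form the edges of a co-variation network, which is where a "guilty-by-association" reading of a DUF's neighbours would start.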
Affiliation(s)
- Pier Luigi Buttigieg
  - Microbial Genomics and Bioinformatics Group, Max Planck Institute for Marine Microbiology, Bremen, Germany.
|
27
|
Phyletic profiling with cliques of orthologs is enhanced by signatures of paralogy relationships. PLoS Comput Biol 2013; 9:e1002852. [PMID: 23308060 PMCID: PMC3536626 DOI: 10.1371/journal.pcbi.1002852] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Received: 07/16/2012] [Accepted: 11/05/2012] [Indexed: 11/19/2022]
Abstract
New microbial genomes are sequenced at a high pace, allowing insight into the genetics of not only cultured microbes, but a wide range of metagenomic collections such as the human microbiome. To understand the deluge of genomic data we face, computational approaches for gene functional annotation are invaluable. We introduce a novel model for computational annotation that refines two established concepts: annotation based on homology and annotation based on phyletic profiling. The phyletic profiling-based model that includes both inferred orthologs and paralogs (homologs separated by a speciation and a duplication event, respectively) provides more annotations at the same average Precision than the model that includes only inferred orthologs. For experimental validation, we selected 38 poorly annotated Escherichia coli genes for which the model assigned one of three GO terms with high confidence: involvement in DNA repair, protein translation, or cell wall synthesis. Results of antibiotic stress survival assays on E. coli knockout mutants showed high agreement with our model's estimates of accuracy: out of 38 predictions obtained at the reported Precision of 60%, we confirmed 25 predictions, indicating that our confidence estimates can be used to make informed decisions on experimental validation. Our work will contribute to making experimental validation of computational predictions more approachable, both in cost and time. Our predictions for 998 prokaryotic genomes include ~400000 specific annotations with the estimated Precision of 90%, ~19000 of which are highly specific (e.g. "penicillin binding," "tRNA aminoacylation for protein translation," or "pathogenesis") and are freely available at http://gorbi.irb.hr/.
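The validation figures quoted in this abstract can be checked with simple arithmetic: 25 confirmed predictions out of 38 tested gives an observed precision of about 0.66, slightly above the reported 60% level. The sketch below only restates that calculation; the function and variable names are illustrative.

```python
def observed_precision(confirmed, tested):
    """Fraction of tested predictions that were experimentally confirmed."""
    return confirmed / tested


confirmed, tested = 25, 38   # counts quoted in the abstract
reported_precision = 0.60    # Precision level quoted in the abstract

observed = observed_precision(confirmed, tested)
expected_hits = reported_precision * tested  # confirmations expected at 60%

print(f"observed precision: {observed:.3f}")
print(f"hits expected at the reported precision: {expected_hits:.1f}")
```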
|
28
|
Abstract
EcoGene (http://ecogene.org) is a database and website devoted to continuously improving the structural and functional annotation of Escherichia coli K-12, one of the most well understood model organisms, represented by the MG1655(Seq) genome sequence and annotations. Major improvements to EcoGene in the past decade include (i) graphic presentations of genome map features; (ii) ability to design Boolean queries and Venn diagrams from EcoArray, EcoTopics or user-provided GeneSets; (iii) the genome-wide clone and deletion primer design tool, PrimerPairs; (iv) sequence searches using a customized EcoBLAST; (v) a Cross Reference table of synonymous gene and protein identifiers; (vi) proteome-wide indexing with GO terms; (vii) EcoTools access to >2000 complete bacterial genomes in EcoGene-RefSeq; (viii) establishment of a MySQL relational database; and (ix) use of web content management systems. The biomedical literature is surveyed daily to provide citation and gene function updates. As of September 2012, the review of 37 397 abstracts and articles led to creation of 98 425 PubMed-Gene links and 5415 PubMed-Topic links. Annotation updates to GenBank U00096 are transmitted from EcoGene to NCBI. Experimental verifications include confirmation of a CTG start codon, pseudogene restoration and quality assurance of the Keio strain collection.
Affiliation(s)
- Jindan Zhou
  - Department of Biochemistry and Molecular Biology, The Miller School of Medicine, University of Miami, Miami, FL 33143, USA
|
29
|
Wood DE, Lin H, Levy-Moonshine A, Swaminathan R, Chang YC, Anton BP, Osmani L, Steffen M, Kasif S, Salzberg SL. Thousands of missed genes found in bacterial genomes and their analysis with COMBREX. Biol Direct 2012; 7:37. [PMID: 23111013 PMCID: PMC3534567 DOI: 10.1186/1745-6150-7-37] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Received: 06/04/2012] [Accepted: 10/23/2012] [Indexed: 12/01/2022]
Abstract
Background: The dramatic reduction in the cost of sequencing has allowed many researchers to join in the effort of sequencing and annotating prokaryotic genomes. Annotation methods vary considerably and may fail to identify some genes. Here we draw attention to a large number of likely genes missing from annotations using common tools such as Glimmer and BLAST. Results: By analyzing 1,474 prokaryotic genome annotations in GenBank, we identify 13,602 likely missed genes that are homologs to non-hypothetical proteins, and 11,792 likely missed genes that are homologs only to hypothetical proteins, yet have supporting evidence of their protein-coding nature from COMBREX, a newly created gene function database. We also estimate the likelihood that each potential missing gene found is a genuine protein-coding gene using COMBREX. Conclusions: Our analysis of the causes of missed genes suggests that larger annotation centers tend to produce annotations with fewer missed genes than smaller centers, and many of the missed genes are short genes <300 bp. Over 1,000 of the likely missed genes could be associated with phenotype information available in COMBREX. 359 of these genes, found in pathogenic organisms, may be potential targets for pharmaceutical research. The newly identified genes are available on COMBREX’s website. Reviewers: This article was reviewed by Daniel Haft, Arcady Mushegian, and M. Pilar Francino (nominated by David Ardell).
Affiliation(s)
- Derrick E Wood
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA.
|
30
|
Murray IA, Clark TA, Morgan RD, Boitano M, Anton BP, Luong K, Fomenkov A, Turner SW, Korlach J, Roberts RJ. The methylomes of six bacteria. Nucleic Acids Res 2012; 40:11450-62. [PMID: 23034806 PMCID: PMC3526280 DOI: 10.1093/nar/gks891] [Citation(s) in RCA: 192] [Impact Index Per Article: 16.0] [Indexed: 11/13/2022]
Abstract
Six bacterial genomes, Geobacter metallireducens GS-15, Chromohalobacter salexigens, Vibrio breoganii 1C-10, Bacillus cereus ATCC 10987, Campylobacter jejuni subsp. jejuni 81-176 and C. jejuni NCTC 11168, all of which had previously been sequenced using other platforms, were re-sequenced using single-molecule, real-time (SMRT) sequencing specifically to analyze their methylomes. In every case a number of new N6-methyladenine (m6A) and N4-methylcytosine (m4C) methylation patterns were discovered and the DNA methyltransferases (MTases) responsible for those methylation patterns were assigned. In 15 cases, it was possible to match MTase genes with MTase recognition sequences without further sub-cloning. Two Type I restriction systems required sub-cloning to differentiate their recognition sequences, while four MTase genes that were not expressed in the native organism were sub-cloned to test for viability and recognition sequences. Two of these proved active. No attempt was made to detect 5-methylcytosine (m5C) recognition motifs from the SMRT® sequencing data because this modification produces weaker signals using current methods. However, all predicted m6A and m4C MTases were detected unambiguously. This study shows that the addition of SMRT sequencing to traditional sequencing approaches gives a wealth of useful functional information about a genome, showing not only which MTase genes are active but also revealing their recognition sequences.
Affiliation(s)
- Iain A Murray
  - New England Biolabs, 240 County Road, Ipswich, MA 01938, USA
|
31
|
Liu B, Faller LL, Klitgord N, Mazumdar V, Ghodsi M, Sommer DD, Gibbons TR, Treangen TJ, Chang YC, Li S, Stine OC, Hasturk H, Kasif S, Segrè D, Pop M, Amar S. Deep sequencing of the oral microbiome reveals signatures of periodontal disease. PLoS One 2012; 7:e37919. [PMID: 22675498 PMCID: PMC3366996 DOI: 10.1371/journal.pone.0037919] [Citation(s) in RCA: 263] [Impact Index Per Article: 21.9] [Received: 12/12/2011] [Accepted: 04/30/2012] [Indexed: 11/18/2022]
Abstract
The oral microbiome, the complex ecosystem of microbes inhabiting the human mouth, harbors several thousands of bacterial types. The proliferation of pathogenic bacteria within the mouth gives rise to periodontitis, an inflammatory disease known to also constitute a risk factor for cardiovascular disease. While much is known about individual species associated with pathogenesis, the system-level mechanisms underlying the transition from health to disease are still poorly understood. Through the sequencing of the 16S rRNA gene and of whole community DNA we provide a glimpse at the global genetic, metabolic, and ecological changes associated with periodontitis in 15 subgingival plaque samples, four from each of two periodontitis patients, and the remaining samples from three healthy individuals. We also demonstrate the power of whole-metagenome sequencing approaches in characterizing the genomes of key players in the oral microbiome, including an unculturable TM7 organism. We reveal the disease microbiome to be enriched in virulence factors, and adapted to a parasitic lifestyle that takes advantage of the disrupted host homeostasis. Furthermore, diseased samples share a common structure that was not found in completely healthy samples, suggesting that the disease state may occupy a narrow region within the space of possible configurations of the oral microbiome. Our pilot study demonstrates the power of high-throughput sequencing as a tool for understanding the role of the oral microbiome in periodontal disease. Despite a modest level of sequencing (~2 lanes Illumina 76 bp PE) and high human DNA contamination (up to ~90%) we were able to partially reconstruct several oral microbes and to preliminarily characterize some systems-level differences between the healthy and diseased oral microbiomes.
Affiliation(s)
- Bo Liu
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
  - Department of Computer Science, University of Maryland, College Park, Maryland, United States of America
- Lina L. Faller
  - Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Niels Klitgord
  - Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Varun Mazumdar
  - Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Mohammad Ghodsi
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
  - Department of Computer Science, University of Maryland, College Park, Maryland, United States of America
- Daniel D. Sommer
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- Theodore R. Gibbons
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
  - Biological Sciences Graduate Program, University of Maryland, College Park, Maryland, United States of America
- Todd J. Treangen
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
  - The McKusick-Nathans Institute for Genetic Medicine, The Johns Hopkins University School of Medicine, Baltimore, Maryland, United States of America
- Yi-Chien Chang
  - Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Shan Li
  - Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, Maryland, United States of America
- O. Colin Stine
  - Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, Maryland, United States of America
- Hatice Hasturk
  - The Forsyth Institute, Department of Periodontology, Cambridge, Massachusetts, United States of America
- Simon Kasif
  - Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
  - Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
  - Children’s Informatics Program, Harvard-Massachusetts Institute of Technology Division of Health Sciences and Technology, Boston, Massachusetts, United States of America
- Daniel Segrè
  - Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
  - Department of Biology, Boston University, Boston, Massachusetts, United States of America
  - Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Mihai Pop
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
  - Department of Computer Science, University of Maryland, College Park, Maryland, United States of America
  - Biological Sciences Graduate Program, University of Maryland, College Park, Maryland, United States of America
- Salomon Amar
  - Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
  - Center for Anti-Inflammatory Therapeutics; Boston University Goldman School of Dental Medicine, Boston, Massachusetts, United States of America
|
32
|
The CanOE strategy: integrating genomic and metabolic contexts across multiple prokaryote genomes to find candidate genes for orphan enzymes. PLoS Comput Biol 2012; 8:e1002540. [PMID: 22693442 PMCID: PMC3364942 DOI: 10.1371/journal.pcbi.1002540] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 08/23/2011] [Accepted: 04/01/2012] [Indexed: 12/17/2022]
Abstract
Of all biochemically characterized metabolic reactions formalized by the IUBMB, over one out of four have yet to be associated with a nucleic or protein sequence, i.e. are sequence-orphan enzymatic activities. Few bioinformatics annotation tools are able to propose candidate genes for such activities by exploiting context-dependent rather than sequence-dependent data, and none are readily accessible and propose result integration across multiple genomes. Here, we present CanOE (Candidate genes for Orphan Enzymes), a four-step bioinformatics strategy that proposes ranked candidate genes for sequence-orphan enzymatic activities (or orphan enzymes for short). The first step locates “genomic metabolons”, i.e. groups of co-localized genes coding proteins catalyzing reactions linked by shared metabolites, in one genome at a time. These metabolons can be particularly helpful for aiding bioanalysts to visualize relevant metabolic data. In the second step, they are used to generate candidate associations between un-annotated genes and gene-less reactions. The third step integrates these gene-reaction associations over several genomes using gene families, and summarizes the strength of family-reaction associations by several scores. In the final step, these scores are used to rank members of gene families which are proposed for metabolic reactions. These associations are of particular interest when the metabolic reaction is a sequence-orphan enzymatic activity. Our strategy found over 60,000 genomic metabolons in more than 1,000 prokaryote organisms from the MicroScope platform, generating candidate genes for many metabolic reactions, including more than 70 distinct orphan reactions. A computational validation of the approach is discussed. Finally, we present a case study on the anaerobic allantoin degradation pathway in Escherichia coli K-12.
The discovery of the various metabolic functions catalyzed by enzymes encoded by the genes from the exponentially increasing number of sequenced genomes is one of the main focuses of bioinformatics tools today. However, most of these tools rely on already identified enzyme-coding gene or protein sequence information to predict known enzymatic activities in new genomes. Therefore, they cannot be used to reveal metabolic activities without any corresponding sequenced genes, dubbed “sequence-orphan activities”. In such cases, the best approach is the bioanalysis of target genes by human expert curators, manually integrating so-called “context-based information” (such as gene co-localization on the genome, or the presence of incomplete metabolic pathways) to infer novel functions. Few bioinformatics tools exploit such information and render accessible results in an automated way. Here, we present “CanOE”, a strategy that uses contextual information to propose and rank Candidate genes for Orphan Enzymes in Bacteria and Archaea. Beyond the merit of extending our knowledge and comprehension of prokaryote metabolism, identifying coding genes for sequence-orphan activities opens new opportunities for functional annotation (homology-based transfer made accessible), drug design (new metabolic targets), synthetic biology (new building blocks) and biotechnology applications (new biocatalysts).
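The first CanOE step, locating "genomic metabolons", can be sketched as grouping neighbouring genes whose reactions share at least one metabolite. The sketch below is a deliberately simplified reading of that one sentence and not the published algorithm: the input layout, the `max_gap` parameter, the rule of comparing each gene only with the previous gene in the group, and the example genes are all assumptions made for illustration.

```python
def find_metabolons(genes, max_gap=1):
    """Group co-localized genes whose reactions share a metabolite.

    genes: list of (gene_id, position_index, metabolite_set) tuples,
           sorted by position on the chromosome (hypothetical layout).
    max_gap: largest allowed difference in position between neighbours.
    Returns a list of gene-id lists, one per group of two or more genes.
    """
    metabolons = []
    current = []
    for gene in genes:
        if current:
            previous = current[-1]
            close = gene[1] - previous[1] <= max_gap
            linked = bool(gene[2] & previous[2])
            if close and linked:
                current.append(gene)
                continue
            # The chain is broken: keep it only if it has several genes.
            if len(current) > 1:
                metabolons.append([g[0] for g in current])
        current = [gene]
    if len(current) > 1:
        metabolons.append([g[0] for g in current])
    return metabolons


# Hypothetical genes with generic metabolite labels.
genes = [
    ("g1", 0, {"A", "B"}),
    ("g2", 1, {"B", "C"}),  # adjacent to g1 and shares metabolite B
    ("g3", 2, {"X"}),       # adjacent but shares no metabolite
    ("g4", 5, {"X", "Y"}),  # shares X with g3 but is too far away
]
print(find_metabolons(genes))
```

In the later steps described in the abstract, groups like these would be used to pair un-annotated genes with gene-less reactions and to score those pairings across genomes.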
|
33
|
Smirnov SV, Sokolov PM, Kodera T, Sugiyama M, Hibi M, Shimizu S, Yokozeki K, Ogawa J. A novel family of bacterial dioxygenases that catalyse the hydroxylation of free L-amino acids. FEMS Microbiol Lett 2012; 331:97-104. [PMID: 22448874 DOI: 10.1111/j.1574-6968.2012.02558.x] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 12/28/2011] [Revised: 03/04/2012] [Accepted: 03/22/2012] [Indexed: 11/27/2022]
Abstract
L-isoleucine-4-hydroxylase (IDO) is a recently discovered member of the Pfam family PF10014 (the former DUF 2257 family) of uncharacterized conserved bacterial proteins. To uncover the range of biochemical activities carried out by PF10014 members, eight in silico-selected IDO homologues belonging to PF10014 were cloned and expressed in Escherichia coli. L-methionine, L-leucine, L-isoleucine and L-threonine were found to be converted by the investigated enzymes, producing L-methionine sulfoxide, 4-hydroxyleucine, 4-hydroxyisoleucine and 4-hydroxythreonine, respectively. An investigation of enzyme kinetics suggested the existence of a novel subfamily of bacterial dioxygenases within the PF10014 family for which free L-amino acids could be accepted as in vivo substrates. A hypothesis regarding the physiological significance of hydroxylated L-amino acids is also discussed.
|
34
|
Greene CS, Troyanskaya OG. Accurate evaluation and analysis of functional genomics data and methods. Ann N Y Acad Sci 2012; 1260:95-100. [PMID: 22268703 DOI: 10.1111/j.1749-6632.2011.06383.x] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 11/27/2022]
Abstract
The development of technology capable of inexpensively performing large-scale measurements of biological systems has generated a wealth of data. Integrative analysis of these data holds the promise of uncovering gene function, regulation, and, in the longer run, understanding complex disease. However, their analysis has proved very challenging, as it is difficult to quickly and effectively assess the relevance and accuracy of these data for individual biological questions. Here, we identify biases that present challenges for the assessment of functional genomics data and methods. We then discuss evaluation methods that, taken together, begin to address these issues. We also argue that the funding of systematic data-driven experiments and of high-quality curation efforts will further improve evaluation metrics so that they more accurately assess functional genomics data and methods. Such metrics will allow researchers in the field of functional genomics to continue to answer important biological questions in a data-driven manner.
Affiliation(s)
- Casey S Greene
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA.
|
35
|
Madupu R, Richter A, Dodson RJ, Brinkac L, Harkins D, Durkin S, Shrivastava S, Sutton G, Haft D. CharProtDB: a database of experimentally characterized protein annotations. Nucleic Acids Res 2011; 40:D237-41. [PMID: 22140108 PMCID: PMC3245046 DOI: 10.1093/nar/gkr1133] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Indexed: 11/18/2022]
Abstract
CharProtDB (http://www.jcvi.org/charprotdb/) is a curated database of biochemically characterized proteins. It provides a source of direct rather than transitive assignments of function, designed to support automated annotation pipelines. The initial data set in CharProtDB was collected through manual literature curation over the years by analysts at the J. Craig Venter Institute (JCVI) [formerly The Institute of Genomic Research (TIGR)] as part of their prokaryotic genome sequencing projects. The CharProtDB has been expanded by import of selected records from publicly available protein collections whose biocuration indicated direct rather than homology-based assignment of function. Annotations in CharProtDB include gene name, symbol and various controlled vocabulary terms, including Gene Ontology terms, Enzyme Commission number and TransportDB accession. Each annotation is referenced with the source; ideally a journal reference, or, if imported and lacking one, the original database source.
Affiliation(s)
- Ramana Madupu
  - J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA.
|
36
|
Deutschbauer A, Price MN, Wetmore KM, Shao W, Baumohl JK, Xu Z, Nguyen M, Tamse R, Davis RW, Arkin AP. Evidence-based annotation of gene function in Shewanella oneidensis MR-1 using genome-wide fitness profiling across 121 conditions. PLoS Genet 2011; 7:e1002385. [PMID: 22125499 PMCID: PMC3219624 DOI: 10.1371/journal.pgen.1002385] [Citation(s) in RCA: 90] [Impact Index Per Article: 6.9] [Received: 08/09/2011] [Accepted: 09/30/2011] [Indexed: 11/21/2022]
Abstract
Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron acceptors, stresses, and motility. We find that 2,350 genes have a pattern of fitness that is significantly different from random and 1,230 of these genes (37% of our total assayed genes) have enough signal to show strong biological correlations. We find that genes in all functional categories have phenotypes, including hundreds of hypotheticals, and that potentially redundant genes (over 50% amino acid identity to another gene in the genome) are also likely to have distinct phenotypes. Using fitness patterns, we were able to propose specific molecular functions for 40 genes or operons that lacked specific annotations or had incomplete annotations. In one example, we demonstrate that the previously hypothetical gene SO_3749 encodes a functional acetylornithine deacetylase, thus filling a missing step in S. oneidensis metabolism. Additionally, we demonstrate that the orphan histidine kinase SO_2742 and orphan response regulator SO_2648 form a signal transduction pathway that activates expression of acetyl-CoA synthase and is required for S. oneidensis to grow on acetate as a carbon source. Lastly, we demonstrate that gene expression and mutant fitness are poorly correlated and that mutant fitness generates more confident predictions of gene function than does gene expression. The approach described here can be applied generally to create large-scale gene-phenotype maps for evidence-based annotation of gene function in prokaryotes. 
Many computationally predicted gene annotations in bacteria are incomplete or wrong. Consequently, experimental methods to systematically determine gene function in bacteria are required. Here, we describe a genetic approach to meet this challenge. We constructed a large transposon mutant library in the metal-reducing bacterium Shewanella oneidensis MR-1 and profiled the fitness of this collection in more than 100 diverse experimental conditions. In addition to identifying a phenotype for more than 2,000 genes, we demonstrate that mutant fitness profiles can be used to assign “evidence-based” gene annotations for enzymes, signaling proteins, transporters, and transcription factors, a subset of which we verify experimentally.
Affiliation(s)
- Adam Deutschbauer
  - Physical Bioscience Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Morgan N. Price
  - Physical Bioscience Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Kelly M. Wetmore
  - Earth Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Wenjun Shao
  - Physical Bioscience Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Jason K. Baumohl
  - Physical Bioscience Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Zhuchen Xu
  - Department of Bioengineering, University of California Berkeley, Berkeley, California, United States of America
- Michelle Nguyen
  - Stanford Genome Technology Center, Department of Biochemistry, Stanford University, Stanford, California, United States of America
- Raquel Tamse
  - Stanford Genome Technology Center, Department of Biochemistry, Stanford University, Stanford, California, United States of America
- Ronald W. Davis
  - Stanford Genome Technology Center, Department of Biochemistry, Stanford University, Stanford, California, United States of America
- Adam P. Arkin
  - Physical Bioscience Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
  - Department of Bioengineering, University of California Berkeley, Berkeley, California, United States of America
|
37
|
Brown SD, Babbitt PC. Inference of functional properties from large-scale analysis of enzyme superfamilies. J Biol Chem 2011; 287:35-42. [PMID: 22069325 DOI: 10.1074/jbc.r111.283408] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Indexed: 01/08/2023]
Abstract
As increasingly large amounts of data from genome and other sequencing projects become available, new approaches are needed to determine the functions of the proteins these genes encode. We show how large-scale computational analysis can help to address this challenge by linking functional information to sequence and structural similarities using protein similarity networks. Network analyses using three functionally diverse enzyme superfamilies illustrate the use of these approaches for facile updating and comparison of available structures for a large superfamily, for creation of functional hypotheses for metagenomic sequences, and to summarize the limits of our functional knowledge about even well studied superfamilies.
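A protein similarity network of the kind mentioned in this abstract can be sketched as a graph in which proteins are nodes, pairs scoring above a cut-off are joined by an edge, and connected components are read as candidate functional groups. The sketch below assumes a generic similarity score where higher means more similar; the scores, protein labels, threshold and function name are hypothetical, and the abstract does not specify how the authors build or threshold their networks.

```python
def similarity_clusters(scores, threshold):
    """Cluster proteins by connected components of a thresholded network.

    scores: dict mapping (protein_a, protein_b) to a similarity score.
    threshold: minimum score for two proteins to be linked by an edge.
    Returns a sorted list of sorted protein-id lists.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)  # registers both proteins
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise similarity scores between five proteins.
scores = {
    ("P1", "P2"): 0.90,
    ("P2", "P3"): 0.80,
    ("P3", "P4"): 0.20,  # below the cut-off, so no edge is drawn
    ("P4", "P5"): 0.95,
}
print(similarity_clusters(scores, threshold=0.5))
```

Raising or lowering the threshold splits or merges clusters, which is why such networks are usually inspected at several cut-offs before drawing functional hypotheses.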
Collapse
Affiliation(s)
- Shoshana D Brown
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California, 94158-2330
| | - Patricia C Babbitt
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California, 94158-2330; Pharmaceutical Chemistry, School of Pharmacy; California Institute for Quantitative Biosciences, University of California, San Francisco, California 94158-2330.
|
38
|
Klimke W, O'Donovan C, White O, Brister JR, Clark K, Fedorov B, Mizrachi I, Pruitt KD, Tatusova T. Solving the Problem: Genome Annotation Standards before the Data Deluge. Stand Genomic Sci 2011; 5:168-93. [PMID: 22180819 PMCID: PMC3236044 DOI: 10.4056/sigs.2084864] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.8] [Indexed: 11/20/2022]
Abstract
The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.
|
39
|
Abstract
COMBREX (computational bridges to experimentation) is a project to engage the biological community in providing better functional annotation of genomes. In essence, the project involves the generation by computational biologists of a database of predicted functions for genes in bacterial genomes. Those genes for which no functional assignments have been proven experimentally are then open for bids by biochemists to test the predicted functions. High-priority genes are those for which no previous functional assignment has been made as well as those where uncharacterized examples are present in many genomes. A pilot project is running that focuses on bacterial and archaeal genomes.
|
40
|
Ecosystems biology of microbial metabolism. Curr Opin Biotechnol 2011; 22:541-6. [PMID: 21592777 DOI: 10.1016/j.copbio.2011.04.018] [Citation(s) in RCA: 89] [Impact Index Per Article: 6.8] [Received: 03/01/2011] [Revised: 03/31/2011] [Accepted: 04/20/2011] [Indexed: 11/22/2022]
Abstract
The metabolic capabilities of many environmentally and medically important microbes can be quantitatively explored using systems biology approaches to metabolic networks. Yet, as we learn more about the complex microbe-microbe and microbe-environment interactions in microbial communities, it is important to understand whether and how system-level approaches can be extended to the ecosystem level. Here we summarize recent work that addresses these challenges at multiple scales, starting from two-species natural and synthetic ecology models, up to biosphere-level approaches. Among the many fascinating open challenges in this field is whether the integration of high throughput sequencing methods and mathematical models will help us capture emerging principles of ecosystem-level metabolic organization and evolution.
|
41
|
Galperin MY, Cochrane GR. The 2011 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res 2011; 39:D1-6. [PMID: 21177655 PMCID: PMC3013748 DOI: 10.1093/nar/gkq1243] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.5] [Indexed: 01/09/2023]
Abstract
The current 18th Database Issue of Nucleic Acids Research features descriptions of 96 new and 83 updated online databases covering various areas of molecular biology. It includes two editorials, one that discusses COMBREX, a new exciting project aimed at figuring out the functions of the ‘conserved hypothetical’ proteins, and one concerning BioDBcore, a proposed description of the ‘minimal information about a biological database’. Papers from the members of the International Nucleotide Sequence Database collaboration (INSDC) describe each of the participating databases, DDBJ, ENA and GenBank, principles of data exchange within the collaboration, and the recently established Sequence Read Archive. A testament to the longevity of databases, this issue includes updates on the RNA modification database, Definition of Secondary Structure of Proteins (DSSP) and Homology-derived Secondary Structure of Proteins (HSSP) databases, which have not been featured here in >12 years. There is also a block of papers describing recent progress in protein structure databases, such as Protein DataBank (PDB), PDB in Europe (PDBe), CATH, SUPERFAMILY and others, as well as databases on protein structure modeling, protein–protein interactions and the organization of inter-protein contact sites. Other highlights include updates of the popular gene expression databases, GEO and ArrayExpress, several cancer gene databases and a detailed description of the UK PubMed Central project. The Nucleic Acids Research online Database Collection, available at: http://www.oxfordjournals.org/nar/database/a/, now lists 1330 carefully selected molecular biology databases. The full content of the Database Issue is freely available online at the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).
Affiliation(s)
- Michael Y Galperin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
|