1. Mazloom R, Pierce-Ward NT, Sharma P, Pritchard L, Brown CT, Vinatzer BA, Heath LS. LINgroups as a Robust Principled Approach to Compare and Integrate Multiple Bacterial Taxonomies. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:2304-2314. [PMID: 39374286 DOI: 10.1109/tcbb.2024.3475917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/09/2024]
Abstract
As a central organizing principle of biology, bacteria and archaea are classified into a hierarchical structure across taxonomic ranks from kingdom to subspecies. Traditionally, this organization was based on observable characteristics of form and chemistry but recently, bacterial taxonomy has been robustly quantified using comparisons of sequenced genomes, as exemplified in the Genome Taxonomy Database (GTDB). Such genome-based taxonomies resolve genomes down to genera and species and are useful in many contexts yet lack the flexibility and resolution of a fine-grained approach. The Life Identification Number (LIN) approach is a common, quantitative framework to tie existing (and future) bacterial taxonomies together, increase the resolution of genome-based discrimination of taxa, and extend taxonomic identification below the species level in a principled way. Utilizing LINgroup as an organizational concept helps resolve some of the confusion and unforeseen negative effects resulting from nomenclature changes of microorganisms that are closely related by overall genomic similarity (often due to genome-based reclassification). Our experimental results demonstrate the value of LINs and LINgroups in mapping between taxonomies, translating between different nomenclatures, and integrating them into a single taxonomic framework. They also reveal the robustness of LIN assignment to hyper-parameter changes when considering within-species taxonomic groups.
2. Jansen van Rensburg MJ, Berger DJ, Yassine I, Shaw D, Fohrmann A, Bray JE, Jolley KA, Maiden MCJ, Brueggemann AB. Development of the Pneumococcal Genome Library, a core genome multilocus sequence typing scheme, and a taxonomic life identification number barcoding system to investigate and define pneumococcal population structure. Microb Genom 2024; 10:001280. [PMID: 39137139 PMCID: PMC11321556 DOI: 10.1099/mgen.0.001280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2023] [Accepted: 07/17/2024] [Indexed: 08/15/2024] Open
Abstract
Investigating the genomic epidemiology of major bacterial pathogens is integral to understanding transmission, evolution, colonization, disease, antimicrobial resistance and vaccine impact. Furthermore, the recent accumulation of large numbers of whole genome sequences for many bacterial species enhances the development of robust genome-wide typing schemes to define the overall bacterial population structure and lineages within it. Using the previously published data, we developed the Pneumococcal Genome Library (PGL), a curated dataset of 30 976 genomes and contextual data for carriage and disease pneumococci recovered between 1916 and 2018 in 82 countries. We leveraged the size and diversity of the PGL to develop a core genome multilocus sequence typing (cgMLST) scheme comprised of 1222 loci. Finally, using multilevel single-linkage clustering, we stratified pneumococci into hierarchical clusters based on allelic similarity thresholds and defined these with a taxonomic life identification number (LIN) barcoding system. The PGL, cgMLST scheme and LIN barcodes represent a high-quality genomic resource and fine-scale clustering approaches for the analysis of pneumococcal populations, which support the genomic epidemiology and surveillance of this leading global pathogen.
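The multilevel single-linkage step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy allelic profiles, the threshold values, and the function names below are assumptions made for illustration, and the pairwise loop is a naive quadratic approach that would not scale to tens of thousands of genomes.

```python
from itertools import combinations


def allelic_mismatches(a, b):
    """Count loci where both profiles have a called allele and the alleles differ."""
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)


def single_linkage(profiles, threshold):
    """Cluster isolates so that any pair within `threshold` mismatches is linked."""
    names = list(profiles)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, v in combinations(names, 2):
        if allelic_mismatches(profiles[u], profiles[v]) <= threshold:
            parent[find(u)] = find(v)

    # Relabel clusters 0, 1, 2, ... in order of first appearance.
    labels, out = {}, {}
    for n in names:
        out[n] = labels.setdefault(find(n), len(labels))
    return out


def multilevel_clusters(profiles, thresholds):
    """Return one cluster label per threshold, from loosest to strictest."""
    levels = [single_linkage(profiles, t)
              for t in sorted(thresholds, reverse=True)]
    return {n: tuple(level[n] for level in levels) for n in profiles}


# Hypothetical six-locus profiles (a real cgMLST scheme has hundreds of loci or more).
profiles = {
    "iso1": [1, 1, 1, 1, 1, 1],
    "iso2": [1, 1, 1, 1, 1, 2],
    "iso3": [1, 1, 1, 2, 2, 2],
    "iso4": [3, 3, 3, 3, 3, 3],
}
codes = multilevel_clusters(profiles, thresholds=[3, 1, 0])
for name, code in codes.items():
    print(name, code)
```

Each tuple lists the isolate's cluster at successively stricter thresholds, so isolates that share a longer run of leading labels are more closely related. A production LIN barcode is assigned differently (codes are fixed once issued), which this sketch does not attempt to reproduce.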
Affiliation(s)
- Duncan J. Berger
  - Nuffield Department of Population Health, University of Oxford, Oxford, UK
- Iman Yassine
  - Nuffield Department of Population Health, University of Oxford, Oxford, UK
- David Shaw
  - Nuffield Department of Population Health, University of Oxford, Oxford, UK
- Andy Fohrmann
  - Nuffield Department of Population Health, University of Oxford, Oxford, UK
- James E. Bray
  - Department of Biology, University of Oxford, Oxford, UK
3. Krisna MA, Jolley KA, Monteith W, Boubour A, Hamers RL, Brueggemann AB, Harrison OB, Maiden MCJ. Development and implementation of a core genome multilocus sequence typing scheme for Haemophilus influenzae. Microb Genom 2024; 10:001281. [PMID: 39120932 PMCID: PMC11315579 DOI: 10.1099/mgen.0.001281] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Accepted: 07/18/2024] [Indexed: 08/10/2024] Open
Abstract
Haemophilus influenzae is part of the human nasopharyngeal microbiota and a pathogen causing invasive disease. The extensive genetic diversity observed in H. influenzae necessitates discriminatory analytical approaches to evaluate its population structure. This study developed a core genome multilocus sequence typing (cgMLST) scheme for H. influenzae using pangenome analysis tools and validated the cgMLST scheme using datasets consisting of complete reference genomes (N = 14) and high-quality draft H. influenzae genomes (N = 2297). The draft genome dataset was divided into a development dataset (N = 921) and a validation dataset (N = 1376). The development dataset was used to identify potential core genes, and the validation dataset was used to refine the final core gene list to ensure the reliability of the proposed cgMLST scheme. Functional classifications were made for all the resulting core genes. Phylogenetic analyses were performed using both allelic profiles and nucleotide sequence alignments of the core genome to test congruence, as assessed by Spearman's correlation and ordinary least square linear regression tests. Preliminary analyses using the development dataset identified 1067 core genes, which were refined to 1037 with the validation dataset. More than 70% of core genes were predicted to encode proteins essential for metabolism or genetic information processing. Phylogenetic and statistical analyses indicated that the core genome allelic profile accurately represented phylogenetic relatedness among the isolates (R² = 0.945). We used this cgMLST scheme to define a high-resolution population structure for H. influenzae, which enhances the genomic analysis of this clinically relevant human pathogen.
Affiliation(s)
- Made Ananda Krisna
  - Nuffield Department of Medicine, Centre for Tropical Medicine and Global Health, University of Oxford, Oxford, UK
  - Department of Biology, University of Oxford, Oxford, UK
  - Oxford University Clinical Research Unit Indonesia, Faculty of Medicine Universitas Indonesia, Jakarta, Indonesia
- William Monteith
  - Department of Biology, University of Oxford, Oxford, UK
  - Department of Biology and Biochemistry, University of Bath, Bath, UK
- Alexandra Boubour
  - Nuffield Department of Population Health, University of Oxford, Oxford, UK
- Raph L. Hamers
  - Nuffield Department of Medicine, Centre for Tropical Medicine and Global Health, University of Oxford, Oxford, UK
  - Oxford University Clinical Research Unit Indonesia, Faculty of Medicine Universitas Indonesia, Jakarta, Indonesia
- Odile B. Harrison
  - Department of Biology, University of Oxford, Oxford, UK
  - Nuffield Department of Population Health, University of Oxford, Oxford, UK
4. Hennart M, Guglielmini J, Bridel S, Maiden MCJ, Jolley KA, Criscuolo A, Brisse S. A dual barcoding approach to bacterial strain nomenclature: Genomic taxonomy of Klebsiella pneumoniae strains. Mol Biol Evol 2022; 39:6608353. [PMID: 35700230 PMCID: PMC9254007 DOI: 10.1093/molbev/msac135] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Indexed: 11/15/2022] Open
Abstract
Sublineages (SLs) within microbial species can differ widely in their ecology and pathogenicity, and their precise definition is important in basic research and for industrial or public health applications. Widely accepted strategies to define SLs are currently missing, which confuses communication in population biology and epidemiological surveillance. Here, we propose a broadly applicable genomic classification and nomenclature approach for bacterial strains, using the prominent public health threat Klebsiella pneumoniae as a model. Based on a 629-gene core genome multilocus sequence typing (cgMLST) scheme, we devised a dual barcoding system that combines multilevel single linkage (MLSL) clustering and life identification numbers (LINs). Phylogenetic and clustering analyses of >7,000 genome sequences captured population structure discontinuities, which were used to guide the definition of 10 infraspecific genetic dissimilarity thresholds. The widely used 7-gene multilocus sequence typing (MLST) nomenclature was mapped onto MLSL SLs (threshold: 190 allelic mismatches) and clonal group (threshold: 43) identifiers for backwards nomenclature compatibility. The taxonomy is publicly accessible through a community-curated platform (https://bigsdb.pasteur.fr/klebsiella), which also enables external users’ genomic sequences identification. The proposed strain taxonomy combines two phylogenetically informative barcode systems that provide full stability (LIN codes) and nomenclatural continuity with previous nomenclature (MLSL). This species-specific dual barcoding strategy for the genomic taxonomy of microbial strains is broadly applicable and should contribute to unify global and cross-sector collaborative knowledge on the emergence and microevolution of bacterial pathogens.
Affiliation(s)
- Melanie Hennart
  - Institut Pasteur, Université Paris Cité, Biodiversity and Epidemiology of Bacterial Pathogens, Paris, France
  - Sorbonne Université, Collège Doctoral, Paris, France
- Julien Guglielmini
  - Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, Paris, France
- Sébastien Bridel
  - Institut Pasteur, Université Paris Cité, Biodiversity and Epidemiology of Bacterial Pathogens, Paris, France
- Keith A. Jolley
  - Department of Zoology, University of Oxford, Oxford, United Kingdom
- Alexis Criscuolo
  - Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, Paris, France
5. Pritchard L, Brown CT, Harrington B, Heath LS, Pierce-Ward NT, Vinatzer BA. Could a Focus on the “Why” of Taxonomy Help Taxonomy Better Respond to the Needs of Science and Society? Front Microbiol 2022; 13:887310. [PMID: 35663905 PMCID: PMC9160990 DOI: 10.3389/fmicb.2022.887310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Accepted: 04/29/2022] [Indexed: 11/20/2022] Open
Abstract
Genomics has put prokaryotic rank-based taxonomy on a solid phylogenetic foundation. However, most taxonomic ranks were set long before the advent of DNA sequencing and genomics. In this concept paper, we thus ask the following question: should prokaryotic classification schemes besides the current phylum-to-species ranks be explored, developed, and incorporated into scientific discourse? Could such alternative schemes provide better solutions to the basic need of science and society for which taxonomy was developed, namely, precise and meaningful identification? A neutral genome-similarity based framework is then described that could allow alternative classification schemes to be explored, compared, and translated into each other without having to choose only one as the gold standard. Classification schemes could thus continue to evolve and be selected according to their benefits and based on how well they fulfill the need for prokaryotic identification.
Affiliation(s)
- Leighton Pritchard
  - Strathclyde Institute for Pharmacy and Biomedical Sciences (SIPBS), University of Strathclyde, Glasgow, United Kingdom
- C. Titus Brown
  - Department of Population Health and Reproduction, University of California, Davis, Davis, CA, United States
- Bailey Harrington
  - Strathclyde Institute for Pharmacy and Biomedical Sciences (SIPBS), University of Strathclyde, Glasgow, United Kingdom
- Lenwood S. Heath
  - Department of Computer Science, Virginia Tech, Blacksburg, VA, United States
- N. Tessa Pierce-Ward
  - Department of Population Health and Reproduction, University of California, Davis, Davis, CA, United States
- Boris A. Vinatzer
  - School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, VA, United States
  - *Correspondence: Boris A. Vinatzer,
6. Draft Genome Sequences of Four Streptomycin-Sensitive Erwinia amylovora Strains Isolated from Commercial Apple Orchards in Ohio. Microbiol Resour Announc 2021; 10:e0089321. [PMID: 34913716 PMCID: PMC8675263 DOI: 10.1128/mra.00893-21] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/21/2023] Open
Abstract
Erwinia amylovora is the causative agent of fire blight, a devastating disease of apples and pears worldwide. Here, we report draft genome sequences of four streptomycin-sensitive strains of E. amylovora that were isolated from diseased apple trees in Ohio.
7. Tian L, Mazloom R, Heath LS, Vinatzer BA. LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes. PeerJ 2021; 9:e10906. [PMID: 33828908 PMCID: PMC8000461 DOI: 10.7717/peerj.10906] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/24/2020] [Accepted: 01/14/2021] [Indexed: 01/21/2023] Open
Abstract
Background: Computing genomic similarity between strains is a prerequisite for genome-based prokaryotic classification and identification. Genomic similarity was first computed as Average Nucleotide Identity (ANI) values based on the alignment of genomic fragments. Since this is computationally expensive, faster and computationally cheaper alignment-free methods have been developed to estimate ANI. However, these methods do not reach the level of accuracy of alignment-based methods. Methods: Here we introduce LINflow, a computational pipeline that infers pairwise genomic similarity in a set of genomes. LINflow takes advantage of the speed of the alignment-free sourmash tool to identify the genome in a dataset that is most similar to a query genome and the precision of the alignment-based pyani software to precisely compute ANI between the query genome and the most similar genome identified by sourmash. This is repeated for each new genome that is added to a dataset. The sequentially computed ANI values are stored as Life Identification Numbers (LINs), which are then used to infer all other pairwise ANI values in the set. We tested LINflow on four sets, 484 genomes in total, and compared the needed time and the generated similarity matrices with other tools. Results: LINflow is up to 150 times faster than pyani and pairwise ANI values generated by LINflow are highly correlated with those computed by pyani. However, because LINflow infers most pairwise ANI values instead of computing them directly, ANI values occasionally depart from the ANI values computed by pyani. In conclusion, LINflow is a fast and memory-efficient pipeline to infer similarity among a large set of prokaryotic genomes. Its ability to quickly add new genome sequences to an already computed similarity matrix makes LINflow particularly useful for projects when new genome sequences need to be regularly added to an existing dataset.
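The idea of storing sequentially computed ANI values as LINs, and then reading approximate pairwise similarity back from them, can be sketched as follows. This is an illustration under stated assumptions, not LINflow's code: the five ANI thresholds, the function names, and the ANI values in the example are all hypothetical, and the sourmash and pyani steps are replaced by hard-coded inputs. The assignment rule used here (copy the closest genome's LIN up to the last threshold met, take the next free number at the following position, pad with zeros) is an assumption about how LINs are commonly described.

```python
# Hypothetical ANI cut-offs (%), one per LIN position, in ascending order.
THRESHOLDS = [70.0, 80.0, 90.0, 95.0, 99.0]


def shared_positions(ani, thresholds=THRESHOLDS):
    """Number of leading LIN positions two genomes share at a given ANI."""
    return sum(1 for t in thresholds if ani >= t)


def assign_lin(existing, best_match, ani, thresholds=THRESHOLDS):
    """Assign a LIN to a new genome from its ANI to the most similar genome."""
    n = len(thresholds)
    if not existing:
        return (0,) * n  # the first genome gets all zeros
    ref = existing[best_match]
    k = shared_positions(ani, thresholds)
    if k == n:
        return ref  # indistinguishable at every threshold
    prefix = ref[:k]
    # Next unused number at position k among genomes sharing the prefix.
    used = [lin[k] for lin in existing.values() if lin[:k] == prefix]
    return prefix + (max(used) + 1,) + (0,) * (n - k - 1)


def inferred_ani_band(lin_a, lin_b, thresholds=THRESHOLDS):
    """Rough (lower, upper) ANI band implied by the shared LIN prefix."""
    k = 0
    for x, y in zip(lin_a, lin_b):
        if x != y:
            break
        k += 1
    lower = thresholds[k - 1] if k > 0 else None
    upper = thresholds[k] if k < len(thresholds) else None
    return lower, upper


lins = {}
lins["g1"] = assign_lin(lins, None, None)
lins["g2"] = assign_lin(lins, "g1", 96.2)   # hypothetical ANI to g1
lins["g3"] = assign_lin(lins, "g1", 85.0)   # hypothetical ANI to g1
lins["g4"] = assign_lin(lins, "g3", 99.5)   # hypothetical ANI to g3
print(lins)
print(inferred_ani_band(lins["g2"], lins["g3"]))
```

Only one precise ANI is needed per added genome; every other pair is read off as a band from the shared prefix, which is consistent with the abstract's caveat that inferred values can occasionally depart from directly computed ones.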
Affiliation(s)
- Long Tian
  - School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, VA, USA
- Reza Mazloom
  - Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
- Lenwood S Heath
  - Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
- Boris A Vinatzer
  - School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, VA, USA
8. Tian L, Huang C, Mazloom R, Heath LS, Vinatzer BA. LINbase: a web server for genome-based identification of prokaryotes as members of crowdsourced taxa. Nucleic Acids Res 2020; 48:W529-W537. [PMID: 32232369 PMCID: PMC7319462 DOI: 10.1093/nar/gkaa190] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Received: 02/06/2020] [Revised: 03/04/2020] [Accepted: 03/16/2020] [Indexed: 02/07/2023] Open
Abstract
High throughput DNA sequencing in combination with efficient algorithms could provide the basis for a highly resolved, genome phylogeny-based and digital prokaryotic taxonomy. However, current taxonomic practice continues to rely on cumbersome journal publications for the description of new species, which still constitute the smallest taxonomic units. In response, we introduce LINbase, a web server that allows users to genomically circumscribe any group of prokaryotes with measurable DNA similarity and that uses the individual isolate as smallest unit. Since LINbase leverages the concept of Life Identification Numbers (LINs), which are codes assigned to individual genomes based on reciprocal average nucleotide identity, we refer to groups circumscribed in LINbase as LINgroups. Users can associate with each LINgroup a name, a short description, and a URL to a peer-reviewed publication. As soon as a LINgroup is circumscribed, any user can immediately identify query genomes as members and submit comments about the LINgroup. Most genomes currently in LINbase were imported from GenBank, but users can upload their own genome sequences as well. In conclusion, LINbase combines the resolution of LINs with the power of crowdsourcing in support of a highly resolved, genome phylogeny-based digital taxonomy. LINbase is available at http://www.LINbase.org.
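Identifying a query genome as a member of one or more LINgroups can be pictured as a prefix test on its LIN. The sketch below assumes that a LINgroup is circumscribed by a shared LIN prefix; the group names, prefixes, and function names are invented for illustration and do not reflect LINbase's actual data or API.

```python
def is_member(lin, prefix):
    """True if the LIN begins with the LINgroup's prefix."""
    return tuple(lin[:len(prefix)]) == tuple(prefix)


def identify(lin, lingroups):
    """Names of all LINgroups containing the LIN, broadest group first."""
    hits = [(len(prefix), name)
            for name, prefix in lingroups.items()
            if is_member(lin, prefix)]
    return [name for _, name in sorted(hits)]


# Hypothetical LINgroups, each circumscribed by a LIN prefix.
lingroups = {
    "species_A": (0, 0, 1),
    "pathovar_A1": (0, 0, 1, 0),
    "lineage_A1x": (0, 0, 1, 0, 2),
    "species_B": (0, 1),
}

query = (0, 0, 1, 0, 0)
print(identify(query, lingroups))
```

Because longer prefixes describe narrower groups, sorting hits by prefix length returns nested groups from the most inclusive to the most specific.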
Affiliation(s)
- Long Tian
  - School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, VA 24061, USA
- Chengjie Huang
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
- Reza Mazloom
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
- Lenwood S Heath
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
- Boris A Vinatzer
  - School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, VA 24061, USA
9. Koonin EV, Dolja VV, Krupovic M, Varsani A, Wolf YI, Yutin N, Zerbini FM, Kuhn JH. Global Organization and Proposed Megataxonomy of the Virus World. Microbiol Mol Biol Rev 2020; 84:e00061-19. [PMID: 32132243 PMCID: PMC7062200 DOI: 10.1128/mmbr.00061-19] [Citation(s) in RCA: 371] [Impact Index Per Article: 74.2] [Indexed: 12/25/2022] Open
Abstract
Viruses and mobile genetic elements are molecular parasites or symbionts that coevolve with nearly all forms of cellular life. The route of virus replication and protein expression is determined by the viral genome type. Comparison of these routes led to the classification of viruses into seven "Baltimore classes" (BCs) that define the major features of virus reproduction. However, recent phylogenomic studies identified multiple evolutionary connections among viruses within each of the BCs as well as between different classes. Due to the modular organization of virus genomes, these relationships defy simple representation as lines of descent but rather form complex networks. Phylogenetic analyses of virus hallmark genes combined with analyses of gene-sharing networks show that replication modules of five BCs (three classes of RNA viruses and two classes of reverse-transcribing viruses) evolved from a common ancestor that encoded an RNA-directed RNA polymerase or a reverse transcriptase. Bona fide viruses evolved from this ancestor on multiple, independent occasions via the recruitment of distinct cellular proteins as capsid subunits and other structural components of virions. The single-stranded DNA (ssDNA) viruses are a polyphyletic class, with different groups evolving by recombination between rolling-circle-replicating plasmids, which contributed the replication protein, and positive-sense RNA viruses, which contributed the capsid protein. The double-stranded DNA (dsDNA) viruses are distributed among several large monophyletic groups and arose via the combination of distinct structural modules with equally diverse replication modules. Phylogenomic analyses reveal the finer structure of evolutionary connections among RNA viruses and reverse-transcribing viruses, ssDNA viruses, and large subsets of dsDNA viruses. Taken together, these analyses allow us to outline the global organization of the virus world. 
Here, we describe the key aspects of this organization and propose a comprehensive hierarchical taxonomy of viruses.
Affiliation(s)
- Eugene V Koonin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Valerian V Dolja
  - Department of Botany and Plant Pathology, Oregon State University, Corvallis, Oregon, USA
- Mart Krupovic
  - Institut Pasteur, Archaeal Virology Unit, Department of Microbiology, Paris, France
- Arvind Varsani
  - The Biodesign Center for Fundamental and Applied Microbiomics, Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, Arizona, USA
  - Structural Biology Research Unit, Department of Clinical Laboratory Sciences, University of Cape Town, Observatory, Cape Town, South Africa
- Yuri I Wolf
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Natalya Yutin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- F Murilo Zerbini
  - Departamento de Fitopatologia/Bioagro, Universidade Federal de Viçosa, Viçosa, Minas Gerais, Brazil
- Jens H Kuhn
  - Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Frederick, Maryland, USA