1
Narechania A, Bobo D, DeSalle R, Mathema B, Kreiswirth B, Planet PJ. What Do We Gain When Tolerating Loss? The Information Bottleneck Wrings Out Recombination. Mol Biol Evol 2025; 42:msaf029. [PMID: 39899343 PMCID: PMC11890988 DOI: 10.1093/molbev/msaf029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Revised: 12/03/2024] [Accepted: 01/14/2025] [Indexed: 02/04/2025]
Abstract
Most microbes have the capacity to acquire genetic material from their environment. Recombination of foreign DNA yields genomes that are, at least in part, incongruent with the vertical history of their species. Dominant approaches for detecting these transfers are phylogenetic, requiring a painstaking series of analyses including alignment and tree reconstruction. But these methods do not scale. Here, we propose an unsupervised, alignment-free, and tree-free technique based on the sequential information bottleneck, an optimization procedure designed to extract some portion of relevant information from 1 random variable conditioned on another. In our case, this joint probability distribution tabulates occurrence counts of k-mers against their genomes of origin with the expectation that recombination will create a strong signal that unifies certain sets of co-occurring k-mers. We conceptualize the technique as a rate-distortion problem, measuring distortion in the relevance information as k-mers are compressed into clusters based on their co-occurrence in the source genomes. The result is fast, model-free, lossy compression of k-mers into learned groups of shared genome sequence, differentiating recombined elements from the vertically inherited core. We show that the technique yields a new recombination measure based purely on information, divorced from any biases and limitations inherent to alignment and phylogeny.
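As a rough illustration of the k-mer-by-genome table described in this abstract, the sketch below tabulates k-mer occurrence counts per genome, normalizes them into a joint distribution, and computes the mutual information between k-mers and genomes. It is not the authors' implementation: the function names are hypothetical, and the sequential information bottleneck clustering step itself is not shown.

```python
from collections import Counter
from math import log2


def kmer_counts(seq, k):
    """Count every overlapping k-mer in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def joint_distribution(genomes, k):
    """Normalize k-mer occurrence counts into p(kmer, genome).

    `genomes` maps a genome name to its sequence (hypothetical input format).
    """
    counts = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    total = sum(sum(c.values()) for c in counts.values())
    return {(kmer, name): n / total
            for name, c in counts.items()
            for kmer, n in c.items()}


def mutual_information(joint):
    """Mutual information I(K; G) in bits between k-mers and genomes."""
    p_kmer, p_genome = Counter(), Counter()
    for (kmer, name), p in joint.items():
        p_kmer[kmer] += p
        p_genome[name] += p
    return sum(p * log2(p / (p_kmer[kmer] * p_genome[name]))
               for (kmer, name), p in joint.items())
```

Under these assumptions, two genomes that share no k-mers carry one bit of information about genome of origin, while two identical genomes carry none. Compressing k-mers into clusters can only lower this quantity, which is the distortion the abstract refers to.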
Affiliation(s)
- Apurva Narechania
  - Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA
  - Section for Hologenomics, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Dean Bobo
  - Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA
  - Department of Ecology, Evolution, and Environmental Biology, Columbia University, New York, NY, USA
- Rob DeSalle
  - Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA
- Barun Mathema
  - Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY, USA
- Barry Kreiswirth
  - Center for Discovery and Innovation, Hackensack Meridian Health, Nutley, NJ, USA
- Paul J Planet
  - Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA
  - Division of Infectious Diseases, Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
2
Wittouck S, Eilers T, van Noort V, Lebeer S. SCARAP: scalable cross-species comparative genomics of prokaryotes. Bioinformatics 2024; 41:btae735. [PMID: 39661475 PMCID: PMC11681940 DOI: 10.1093/bioinformatics/btae735] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2024] [Revised: 10/31/2024] [Accepted: 12/10/2024] [Indexed: 12/13/2024]
Abstract
MOTIVATION Much of prokaryotic comparative genomics currently relies on two critical computational tasks: pangenome inference and core genome inference. Pangenome inference involves clustering genes from a set of genomes into gene families, enabling genome-wide association studies and evolutionary history analysis. The core genome represents gene families present in nearly all genomes and is required to infer a high-quality phylogeny. For species-level datasets, fast pangenome inference tools have been developed. However, tools applicable to more diverse datasets are currently slow and scale poorly. RESULTS Here, we introduce SCARAP, a program containing three modules for comparative genomics analyses: a fast and scalable pangenome inference module, a direct core genome inference module, and a module for subsampling representative genomes. When benchmarked against existing tools, the SCARAP pan module proved up to an order of magnitude faster with comparable accuracy. The core module was validated by comparing its result against a core genome extracted from a full pangenome. The sample module demonstrated the rapid sampling of genomes with decreasing novelty. Applied to a dataset of over 31 000 Lactobacillales genomes, SCARAP showcased its ability to derive a representative pangenome. Finally, we applied the novel concept of gene fixation frequency to this pangenome, showing that Lactobacillales genes that are prevalent but rarely fixate in species often encode bacteriophage functions. AVAILABILITY AND IMPLEMENTATION The SCARAP toolkit is publicly available at https://github.com/swittouck/scarap.
Affiliation(s)
- Stijn Wittouck
  - Lab of Applied Microbiology and Biotechnology, Department of Bioscience Engineering, University of Antwerp, Antwerpen 2020, Belgium
- Tom Eilers
  - Lab of Applied Microbiology and Biotechnology, Department of Bioscience Engineering, University of Antwerp, Antwerpen 2020, Belgium
- Vera van Noort
  - Faculty of Bioscience Engineering, KU Leuven, Leuven 3001, Belgium
  - Institute of Biology Leiden, Leiden University, Leiden 2333 BE, The Netherlands
- Sarah Lebeer
  - Lab of Applied Microbiology and Biotechnology, Department of Bioscience Engineering, University of Antwerp, Antwerpen 2020, Belgium
3
Moradigaravand D, Li L, Dechesne A, Nesme J, de la Cruz R, Ahmad H, Banzhaf M, Sørensen SJ, Smets BF, Kreft JU. Plasmid permissiveness of wastewater microbiomes can be predicted from 16S rRNA sequences by machine learning. Bioinformatics 2023; 39:btad400. [PMID: 37348862 PMCID: PMC10318386 DOI: 10.1093/bioinformatics/btad400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2022] [Revised: 06/13/2023] [Accepted: 06/21/2023] [Indexed: 06/24/2023]
Abstract
MOTIVATION Wastewater treatment plants (WWTPs) harbor a dense and diverse microbial community. They constantly receive antimicrobial residues and resistant strains, and therefore provide conditions for horizontal gene transfer (HGT) of antimicrobial resistance (AMR) determinants. This facilitates the transmission of clinically important genes between, e.g. enteric and environmental bacteria, and vice versa. Despite the clinical importance, tools for predicting HGT remain underdeveloped. RESULTS In this study, we examined to what extent water cycle microbial community composition, as inferred by partial 16S rRNA gene sequences, can predict plasmid permissiveness, i.e. the ability of cells to receive a plasmid through conjugation, based on data from standardized filter mating assays using fluorescent bio-reporter plasmids. We leveraged a range of machine learning models for predicting the permissiveness for each taxon in the community, representing the range of hosts a plasmid is able to transfer to, for three broad host-range resistance IncP plasmids (pKJK5, pB10, and RP4). Our results indicate that the predicted permissiveness from the best performing model (random forest) showed a moderate-to-strong average correlation of 0.49 for pB10 [95% confidence interval (CI): 0.44-0.55], 0.43 for pKJK5 (95% CI: 0.41-0.49), and 0.53 for RP4 (95% CI: 0.48-0.57) with the experimental permissiveness in the unseen test dataset. Predictive phylogenetic signals occurred despite the broad host-range nature of these plasmids. Our results provide a framework that contributes to the assessment of the risk of AMR pollution in wastewater systems. AVAILABILITY AND IMPLEMENTATION The predictive tool is available as an application at https://github.com/DaneshMoradigaravand/PlasmidPerm.
Affiliation(s)
- Danesh Moradigaravand
  - Laboratory of Infectious Disease Epidemiology, KAUST Smart-Health Initiative and Biological and Environmental Science and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Liguan Li
  - Department of Environmental Engineering, Technical University of Denmark, 2800 Kgs Lyngby, Denmark
  - Department of Civil Engineering, The University of Hong Kong, Hong Kong, China
- Arnaud Dechesne
  - Department of Environmental Engineering, Technical University of Denmark, 2800 Kgs Lyngby, Denmark
- Joseph Nesme
  - Department of Biology, University of Copenhagen, 2100 Copenhagen, Denmark
- Roberto de la Cruz
  - Center for Computational Biology, University of Birmingham, Birmingham, B15 2TT, United Kingdom
  - Institute of Microbiology and Infection, University of Birmingham, Birmingham, B15 2TT, United Kingdom
  - School of Biosciences, University of Birmingham, Birmingham, B15 2TT, United Kingdom
- Huda Ahmad
  - Laboratory of Infectious Disease Epidemiology, KAUST Smart-Health Initiative and Biological and Environmental Science and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
  - Center for Computational Biology, University of Birmingham, Birmingham, B15 2TT, United Kingdom
- Manuel Banzhaf
  - Institute of Microbiology and Infection, University of Birmingham, Birmingham, B15 2TT, United Kingdom
  - School of Biosciences, University of Birmingham, Birmingham, B15 2TT, United Kingdom
- Søren J Sørensen
  - Department of Biology, University of Copenhagen, 2100 Copenhagen, Denmark
- Barth F Smets
  - Department of Environmental Engineering, Technical University of Denmark, 2800 Kgs Lyngby, Denmark
- Jan-Ulrich Kreft
  - Center for Computational Biology, University of Birmingham, Birmingham, B15 2TT, United Kingdom
  - Institute of Microbiology and Infection, University of Birmingham, Birmingham, B15 2TT, United Kingdom
  - School of Biosciences, University of Birmingham, Birmingham, B15 2TT, United Kingdom
4
Shikov AE, Malovichko YV, Nizhnikov AA, Antonets KS. Current Methods for Recombination Detection in Bacteria. Int J Mol Sci 2022; 23:ijms23116257. [PMID: 35682936 PMCID: PMC9181119 DOI: 10.3390/ijms23116257] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 03/04/2022] [Revised: 05/30/2022] [Accepted: 05/30/2022] [Indexed: 02/05/2023]
Abstract
The role of genetic exchanges, i.e., homologous recombination (HR) and horizontal gene transfer (HGT), in bacteria cannot be overestimated, for it is a pivotal mechanism leading to their evolution and adaptation; thus, tracking the signs of recombination and HGT events is important for both fundamental and applied science. To date, dozens of bioinformatics tools for revealing recombination signals are available; however, their pros and cons as well as the spectra of solvable tasks have not yet been systematically reviewed. Moreover, there are two major groups of software. One aims to infer evidence of HR, while the other only deals with HGT. However, despite seemingly different goals, all the methods use similar algorithmic approaches, and the processes are interconnected in terms of genomic evolution, influencing each other. In this review, we propose a classification of novel instruments for both HR and HGT detection based on the genomic consequences of recombination. In this context, we summarize available methodologies, paying particular attention to the type of traceable events for which a certain program has been designed.
Affiliation(s)
- Anton E. Shikov
  - Laboratory for Proteomics of Supra-Organismal Systems, All-Russia Research Institute for Agricultural Microbiology (ARRIAM), 196608 St. Petersburg, Russia
  - Faculty of Biology, St. Petersburg State University (SPbSU), 199034 St. Petersburg, Russia
- Yury V. Malovichko
  - Laboratory for Proteomics of Supra-Organismal Systems, All-Russia Research Institute for Agricultural Microbiology (ARRIAM), 196608 St. Petersburg, Russia
  - Faculty of Biology, St. Petersburg State University (SPbSU), 199034 St. Petersburg, Russia
- Anton A. Nizhnikov
  - Laboratory for Proteomics of Supra-Organismal Systems, All-Russia Research Institute for Agricultural Microbiology (ARRIAM), 196608 St. Petersburg, Russia
  - Faculty of Biology, St. Petersburg State University (SPbSU), 199034 St. Petersburg, Russia
- Kirill S. Antonets
  - Laboratory for Proteomics of Supra-Organismal Systems, All-Russia Research Institute for Agricultural Microbiology (ARRIAM), 196608 St. Petersburg, Russia
  - Faculty of Biology, St. Petersburg State University (SPbSU), 199034 St. Petersburg, Russia
5
De Maio N, Boulton W, Weilguny L, Walker CR, Turakhia Y, Corbett-Detig R, Goldman N. phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets. PLoS Comput Biol 2022; 18:e1010056. [PMID: 35486906 PMCID: PMC9094560 DOI: 10.1371/journal.pcbi.1010056] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/24/2021] [Revised: 05/11/2022] [Accepted: 03/25/2022] [Indexed: 11/26/2022]
Abstract
Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, and are an essential component of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here, we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and it implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.
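To make the Gillespie approach mentioned in this abstract concrete, here is a minimal sketch of drawing mutation events along a single branch. It is not phastSim's code: `gillespie_events` is a hypothetical name, the per-site rates are assumed to stay fixed after each event, and the site is chosen with a plain weighted draw rather than the multi-layered search tree the abstract describes.

```python
import random


def gillespie_events(rates, branch_length, rng):
    """Draw (time, site) mutation events along one branch.

    rates[i] is the assumed, fixed mutation rate of site i. Waiting times
    are exponential with the total rate; each event picks a site with
    probability proportional to its rate.
    """
    total = sum(rates)
    events = []
    if total <= 0:
        return events  # nothing can mutate
    t = 0.0
    while True:
        t += rng.expovariate(total)  # time to the next event
        if t >= branch_length:
            return events  # next event would fall past the end of the branch
        site = rng.choices(range(len(rates)), weights=rates)[0]
        events.append((t, site))
```

When branches are short relative to the total rate, the loop usually exits after zero or one event, which is the regime the abstract says the method exploits.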
Affiliation(s)
- Nicola De Maio
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
- William Boulton
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
- Lukas Weilguny
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
- Conor R. Walker
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
  - Department of Genetics, University of Cambridge, Cambridge, United Kingdom
- Yatish Turakhia
  - Department of Electrical and Computer Engineering, University of California San Diego, San Diego, California, United States of America
- Russell Corbett-Detig
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, California, United States of America
- Nick Goldman
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
6
De Maio N, Boulton W, Weilguny L, Walker CR, Turakhia Y, Corbett-Detig R, Goldman N. phastSim: efficient simulation of sequence evolution for pandemic-scale datasets. bioRxiv 2021:2021.03.15.435416. [PMID: 33758852 PMCID: PMC7987011 DOI: 10.1101/2021.03.15.435416] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/30/2022]
Abstract
Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, as well as being part of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software is available from https://github.com/NicolaDM/phastSim and allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.
Affiliation(s)
- Nicola De Maio
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- William Boulton
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Lukas Weilguny
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Conor R. Walker
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
  - Department of Genetics, University of Cambridge, Cambridge, CB2 3EH, UK
- Yatish Turakhia
  - Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92093, USA
- Russell Corbett-Detig
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Nick Goldman
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
7
Abstract
Genome-wide association studies in bacteria have great potential to deliver a better understanding of the genetic basis of many biologically important phenotypes, including antibiotic resistance, pathogenicity, and host adaptation. Such studies need however to account for the specificities of bacterial genomics, especially in terms of population structure, homologous recombination, and genomic plasticity. A powerful way to tackle this challenge is to use a phylogenetic approach, which is based on long-standing methodology for the evolutionary analysis of bacterial genomic data. Here we present both the theoretical and practical aspects involved in the use of phylogenetic methods for bacterial genome-wide association studies.
Affiliation(s)
- Xavier Didelot
  - School of Life Sciences and Department of Statistics, University of Warwick, Coventry, UK.
8
Zhou Z, Charlesworth J, Achtman M. Accurate reconstruction of bacterial pan- and core genomes with PEPPAN. Genome Res 2020; 30:1667-1679. [PMID: 33055096 PMCID: PMC7605250 DOI: 10.1101/gr.260828.120] [Citation(s) in RCA: 62] [Impact Index Per Article: 12.4] [Received: 01/20/2020] [Accepted: 09/01/2020] [Indexed: 12/22/2022]
Abstract
Bacterial genomes can contain traces of a complex evolutionary history, including extensive homologous recombination, gene loss, gene duplications, and horizontal gene transfer. To reconstruct the phylogenetic and population history of a set of multiple bacteria, it is necessary to examine their pangenome, the composite of all the genes in the set. Here we introduce PEPPAN, a novel pipeline that can reliably construct pangenomes from thousands of genetically diverse bacterial genomes that represent the diversity of an entire genus. PEPPAN outperforms existing pangenome methods by providing consistent gene and pseudogene annotations extended by similarity-based gene predictions, and identifying and excluding paralogs by combining tree- and synteny-based approaches. The PEPPAN package additionally includes PEPPAN_parser, which implements additional downstream analyses, including the calculation of trees based on accessory gene content or allelic differences between core genes. To test the accuracy of PEPPAN, we implemented SimPan, a novel pipeline for simulating the evolution of bacterial pangenomes. We compared the accuracy and speed of PEPPAN with four state-of-the-art pangenome pipelines using both empirical and simulated data sets. PEPPAN was more accurate and more specific than any of the other pipelines and was almost as fast as any of them. As a case study, we used PEPPAN to construct a pangenome of approximately 40,000 genes from 3052 representative genomes spanning at least 80 species of Streptococcus. The resulting gene and allelic trees provide an unprecedented overview of the genomic diversity of the entire Streptococcus genus.
Affiliation(s)
- Zhemin Zhou
  - Warwick Medical School, University of Warwick, Coventry CV4 7AL, United Kingdom
- Jane Charlesworth
  - Warwick Medical School, University of Warwick, Coventry CV4 7AL, United Kingdom
- Mark Achtman
  - Warwick Medical School, University of Warwick, Coventry CV4 7AL, United Kingdom
9
Bobay LM. CoreSimul: a forward-in-time simulator of genome evolution for prokaryotes modeling homologous recombination. BMC Bioinformatics 2020; 21:264. [PMID: 32580695 PMCID: PMC7315543 DOI: 10.1186/s12859-020-03619-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 03/02/2020] [Accepted: 06/19/2020] [Indexed: 12/26/2022]
Abstract
Background Prokaryotes are asexual, but these organisms frequently engage in homologous recombination, a process that differs from meiotic recombination in sexual organisms. Most tools developed to simulate genome evolution either assume sexual reproduction or the complete absence of DNA flux in the population. As a result, very few simulators are adapted to model prokaryotic genome evolution while accounting for recombination. Moreover, many simulators are based on the coalescent, which assumes a neutral model of genomic evolution, and those are best suited for organisms evolving under weak selective pressures, such as animals and plants. In contrast, prokaryotes are thought to be evolving under much stronger selective pressures, suggesting that forward-in-time simulators are better suited for these organisms. Results Here, I present CoreSimul, a forward-in-time simulator of core genome evolution for prokaryotes modeling homologous recombination. Simulations are guided by a phylogenetic tree and incorporate different substitution models, including models of codon selection. Conclusions CoreSimul is a flexible forward-in-time simulator that constitutes a significant addition to the limited list of available simulators applicable to prokaryote genome evolution.
Affiliation(s)
- Louis-Marie Bobay
  - Department of Biology, University of North Carolina Greensboro, 321 McIver Street, PO Box 26170, Greensboro, NC, 27402, USA.
10
Saber MM, Shapiro BJ. Benchmarking bacterial genome-wide association study methods using simulated genomes and phenotypes. Microb Genom 2020; 6:e000337. [PMID: 32100713 PMCID: PMC7200059 DOI: 10.1099/mgen.0.000337] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 10/16/2019] [Accepted: 01/23/2020] [Indexed: 11/18/2022]
Abstract
Genome-wide association studies (GWASs) have the potential to reveal the genetics of microbial phenotypes such as antibiotic resistance and virulence. Capitalizing on the growing wealth of bacterial sequence data, microbial GWAS methods aim to identify causal genetic variants while ignoring spurious associations. Bacteria reproduce clonally, leading to strong population structure and genome-wide linkage, making it challenging to separate true 'hits' (i.e. mutations that cause a phenotype) from non-causal linked mutations. GWAS methods attempt to correct for population structure in different ways, but their performance has not yet been systematically and comprehensively evaluated under a range of evolutionary scenarios. Here, we developed a bacterial GWAS simulator (BacGWASim) to generate bacterial genomes with varying rates of mutation, recombination and other evolutionary parameters, along with a subset of causal mutations underlying a phenotype of interest. We assessed the performance (recall and precision) of three widely used single-locus GWAS approaches (cluster-based, dimensionality-reduction and linear mixed models, implemented in plink, pyseer and gemma) and one relatively new multi-locus model implemented in pyseer, across a range of simulated sample sizes, recombination rates and causal mutation effect sizes. As expected, all methods performed better with larger sample sizes and effect sizes. The performance of clustering and dimensionality reduction approaches to correct for population structure was considerably variable according to the choice of parameters. Notably, the multi-locus elastic net (lasso) approach was consistently amongst the highest-performing methods, and had the highest power in detecting causal variants with both low and high effect sizes. Most methods reached the level of good performance (recall >0.75) for identifying causal mutations of strong effect size [log odds ratio (OR) ≥2] with a sample size of 2000 genomes. However, only elastic nets reached the level of reasonable performance (recall=0.35) for detecting markers with weaker effects (log OR ~1) in smaller samples. Elastic nets also showed superior precision and recall in controlling for genome-wide linkage, relative to single-locus models. However, all methods performed relatively poorly on highly clonal (low-recombining) genomes, suggesting room for improvement in method development. These findings show the potential for multi-locus models to improve bacterial GWAS performance. BacGWASim code and simulated data are publicly available to enable further comparisons and benchmarking of new methods.
Affiliation(s)
- Morteza M. Saber
  - Département de Sciences Biologiques, Université de Montréal, Montréal, QC, Canada
- B. Jesse Shapiro
  - Département de Sciences Biologiques, Université de Montréal, Montréal, QC, Canada
11
Ferrés I, Fresia P, Iraola G. simurg: simulate bacterial pangenomes in R. Bioinformatics 2020; 36:1273-1274. [PMID: 31584605 DOI: 10.1093/bioinformatics/btz735] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/19/2019] [Revised: 08/06/2019] [Accepted: 09/25/2019] [Indexed: 11/13/2022]
Abstract
MOTIVATION The pangenome concept describes genetic variability as the union of genes shared in a set of genomes and constitutes the current paradigm for comparative analysis of bacterial populations. However, there is a lack of tools to simulate pangenome variability and structure using defined evolutionary models. RESULTS We developed simurg, an R package for simulating bacterial pangenomes using different combinations of evolutionary constraints such as gene gain, gene loss and mutation rates. Our tool allows the straightforward and reproducible simulation of bacterial pangenomes using real sequence data, providing a valuable tool for benchmarking of pangenome software or comparing evolutionary hypotheses. AVAILABILITY AND IMPLEMENTATION The simurg package is released under the GPL-3 license, and is freely available for download from GitHub (https://github.com/iferres/simurg). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ignacio Ferrés
  - Microbial Genomics Laboratory, Institut Pasteur Montevideo, Uruguay
- Pablo Fresia
  - Microbial Genomics Laboratory, Institut Pasteur Montevideo, Uruguay
  - Unidad Mixta UMPI, Institut Pasteur Montevideo + INIA, Montevideo 11400, Uruguay
- Gregorio Iraola
  - Microbial Genomics Laboratory, Institut Pasteur Montevideo, Uruguay
  - Center for Integrative Biology, Universidad Mayor, Santiago de Chile 7510041, Chile
  - Wellcome Sanger Institute, Hinxton CB10 1SA, UK
12
Sipola A, Marttinen P, Corander J. Bacmeta: simulator for genomic evolution in bacterial metapopulations. Bioinformatics 2019; 34:2308-2310. [PMID: 29474733 DOI: 10.1093/bioinformatics/bty093] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 08/10/2017] [Accepted: 02/20/2018] [Indexed: 12/25/2022]
Abstract
Summary The advent of genomic data from densely sampled bacterial populations has created a need for flexible simulators by which models and hypotheses can be efficiently investigated in the light of empirical observations. Bacmeta provides fast stochastic simulation of neutral evolution within a large collection of interconnected bacterial populations with a completely adjustable connectivity network. Stochastic events of mutations, recombinations, insertions/deletions, migrations and micro-epidemics can be simulated in discrete non-overlapping generations with a Wright-Fisher model that operates on explicit sequence data of any desired genome length. Each model component, including locus, bacterial strain, population and ultimately the whole metapopulation, is efficiently simulated using C++ objects and detailed metadata from each level can be acquired. The software can be executed in a cluster environment using simple textual input files, enabling, e.g. large-scale simulations and likelihood-free inference. Availability and implementation Bacmeta is implemented with C++ for Linux, Mac and Windows. It is available at https://bitbucket.org/aleksisipola/bacmeta under the BSD 3-clause license. Supplementary information Supplementary data are available at Bioinformatics online.
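The Wright-Fisher scheme named in this abstract can be sketched in a few lines for a single population. This is an illustrative toy, not Bacmeta's code: `wright_fisher` and its parameters are hypothetical, only point mutation is modeled, and recombination, insertions/deletions, migration and micro-epidemics are all omitted.

```python
import random

ALPHABET = "ACGT"


def wright_fisher(population, generations, mu, rng):
    """Neutral Wright-Fisher evolution on explicit sequences.

    Each generation fully replaces the previous one. Every child copies a
    parent drawn uniformly at random (no selection), and each base then
    mutates to a different base with probability `mu`.
    """
    for _ in range(generations):
        next_generation = []
        for _ in range(len(population)):  # population size stays constant
            parent = rng.choice(population)
            child = "".join(
                rng.choice([b for b in ALPHABET if b != base])
                if rng.random() < mu else base
                for base in parent
            )
            next_generation.append(child)
        population = next_generation
    return population
```

A call such as `wright_fisher(["ACGT" * 10] * 100, 50, 1e-3, random.Random(42))` would evolve 100 clonal sequences for 50 generations. Passing a seeded `random.Random` keeps runs reproducible.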
Affiliation(s)
- Aleksi Sipola
  - Department of Mathematics and Statistics, University of Helsinki, Finland
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Finland
- Pekka Marttinen
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Finland
- Jukka Corander
  - Department of Mathematics and Statistics, University of Helsinki, Finland
  - Department of Biostatistics, University of Oslo, Norway

13
Abstract
Background: Pan-genome approaches afford the discovery of homology relations in a set of genomes by determining how gene families are distributed among the given genomes. The retrieval of a complete gene distribution among a class of genomes is an NP-hard problem because computational costs increase with the number of analyzed genomes; in fact, all-against-all gene comparisons are required to solve the problem completely. In the presence of phylogenetically distant genomes, the task of recognizing homologous genes becomes even more difficult because of the variability introduced by gene duplication and transmission. A challenge in this field is to design fast and adaptive similarity measures in order to find a suitable pan-genome structure of homology relations. Results: We present PanDelos, a stand-alone tool for the discovery of pan-genome contents among phylogenetically distant genomes. The methodology is based on information theory and network analysis. It is parameter-free because thresholds are automatically deduced from the context. PanDelos avoids sequence alignment by introducing a measure based on k-mer multiplicity. The k-mer length is defined according to general arguments rather than empirical considerations. Homology candidate relations are integrated into a global network, and groups of homologous genes are extracted by applying a community detection algorithm. Conclusions: PanDelos outperforms existing approaches, Roary and EDGAR, in terms of running time and quality of content discovery. Tests were run on collections of real genomes previously used in analogous studies and on synthetic benchmarks that represent a fully trusted ground truth. The software is available at https://github.com/GiugnoLab/PanDelos. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2417-6) contains supplementary material, which is available to authorized users.
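To make the idea of an alignment-free measure based on k-mer multiplicity concrete, here is a minimal Python sketch that compares two sequences through their k-mer multisets with a generalized Jaccard ratio (shared multiplicity over total multiplicity). This is one plausible form of such a measure and is an assumption of this example; the abstract does not state the exact formula, and the function names, the choice `k=3`, and the toy sequences are hypothetical.

```python
from collections import Counter


def kmer_multiset(seq, k):
    """Count how many times each k-mer occurs in seq (overlapping windows)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def generalized_jaccard(a, b, k=3):
    """Similarity in [0, 1] from k-mer multiplicities; no alignment needed.

    Illustrative stand-in, not necessarily the measure used by PanDelos.
    """
    ca, cb = kmer_multiset(a, k), kmer_multiset(b, k)
    keys = set(ca) | set(cb)
    # A Counter returns 0 for k-mers it has not seen, so min/max are safe.
    shared = sum(min(ca[x], cb[x]) for x in keys)
    total = sum(max(ca[x], cb[x]) for x in keys)
    return shared / total if total else 0.0


print(generalized_jaccard("MKVLAAGIVG", "MKVLAAGLVG"))  # similar sequences
print(generalized_jaccard("MKVLAAGIVG", "WWPPQQRRST"))  # unrelated sequences
```

In a pipeline of the kind the abstract describes, pairwise scores like these would become edge weights in a network from which groups of homologous genes are extracted by community detection.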
Affiliation(s)
- Vincenzo Bonnici
  - Department of Computer Science, University of Verona, Strada le Grazie, 15, Verona, 37134, Italy
- Rosalba Giugno
  - Department of Computer Science, University of Verona, Strada le Grazie, 15, Verona, 37134, Italy
- Vincenzo Manca
  - Department of Computer Science, University of Verona, Strada le Grazie, 15, Verona, 37134, Italy

14
Zhou Z, Alikhan NF, Sergeant MJ, Luhmann N, Vaz C, Francisco AP, Carriço JA, Achtman M. GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res 2018; 28:1395-1404. [PMID: 30049790 PMCID: PMC6120633 DOI: 10.1101/gr.232397.117] [Citation(s) in RCA: 610] [Impact Index Per Article: 87.1] [Received: 11/14/2017] [Accepted: 07/24/2018] [Indexed: 11/24/2022]
Abstract
Current methods struggle to reconstruct and visualize the genomic relationships of large numbers of bacterial genomes. GrapeTree facilitates the analyses of large numbers of allelic profiles by a static “GrapeTree Layout” algorithm that supports interactive visualizations of large trees within a web browser window. GrapeTree also implements a novel minimum spanning tree algorithm (MSTree V2) to reconstruct genetic relationships despite high levels of missing data. GrapeTree is a stand-alone package for investigating phylogenetic trees plus associated metadata and is also integrated into EnteroBase to facilitate cutting edge navigation of genomic relationships among bacterial pathogens.
Affiliation(s)
- Zhemin Zhou
  - Warwick Medical School, University of Warwick, Coventry, CV4 7AL, United Kingdom
- Nabil-Fareed Alikhan
  - Warwick Medical School, University of Warwick, Coventry, CV4 7AL, United Kingdom
- Martin J Sergeant
  - Warwick Medical School, University of Warwick, Coventry, CV4 7AL, United Kingdom
- Nina Luhmann
  - Warwick Medical School, University of Warwick, Coventry, CV4 7AL, United Kingdom
- Cátia Vaz
  - Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), 1000-029 Lisboa, Portugal
  - ADEETC, Instituto Superior de Engenharia de Lisboa, Instituto Politécnico de Lisboa, 1959-007 Lisboa, Portugal
- Alexandre P Francisco
  - Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), 1000-029 Lisboa, Portugal
  - Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisboa, Portugal
- João André Carriço
  - Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, 1649-004 Lisboa, Portugal
- Mark Achtman
  - Warwick Medical School, University of Warwick, Coventry, CV4 7AL, United Kingdom

15
Akita T, Takuno S, Innan H. Coalescent framework for prokaryotes undergoing interspecific homologous recombination. Heredity (Edinb) 2018; 120:474-484. [PMID: 29358726 PMCID: PMC5889408 DOI: 10.1038/s41437-017-0034-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 07/05/2017] [Revised: 10/04/2017] [Accepted: 10/23/2017] [Indexed: 12/11/2022] Open
Abstract
We consider the coalescent process for prokaryote species theoretically. Prokaryotes undergo homologous recombination with individuals of the same species (intraspecific recombination) and with individuals of other species (interspecific recombination). This work particularly focuses on interspecific recombination because intraspecific recombination has been well incorporated into the coalescent framework. We present a simulation framework for generating SNP (single-nucleotide polymorphism) patterns that allows external DNA from other species to integrate into the host genome. Using this simulation tool, msPro, we observed that the joint processes of intra- and interspecific recombination generate complex SNP patterns. The direct effect of interspecific recombination includes increased polymorphism. Because interspecific recombination is very rare in nature, it generates regions with exceptionally high polymorphism. Following interspecific recombination, intraspecific recombination cuts the integrated external DNA into small fragments, generating a complex SNP pattern that appears as if external DNA was integrated multiple times. The insight gained from our work using the msPro simulator will be useful for understanding and evaluating the relative contributions of intra- and interspecific recombination events in generating complex SNP patterns in prokaryotes.
Affiliation(s)
- Tetsuya Akita
  - Graduate University for Advanced Studies, Hayama, Kanagawa, 240-0193, Japan
  - National Research Institute of Far Seas Fisheries, Fisheries Research Agency, Yokohama, Kanagawa, 236-8648, Japan
- Shohei Takuno
  - Graduate University for Advanced Studies, Hayama, Kanagawa, 240-0193, Japan
- Hideki Innan
  - Graduate University for Advanced Studies, Hayama, Kanagawa, 240-0193, Japan

16
De Maio N, Worby CJ, Wilson DJ, Stoesser N. Bayesian reconstruction of transmission within outbreaks using genomic variants. PLoS Comput Biol 2018; 14:e1006117. [PMID: 29668677 PMCID: PMC5927459 DOI: 10.1371/journal.pcbi.1006117] [Citation(s) in RCA: 52] [Impact Index Per Article: 7.4] [Received: 11/09/2017] [Revised: 04/30/2018] [Accepted: 04/03/2018] [Indexed: 01/19/2023] Open
Abstract
Pathogen genome sequencing can reveal details of transmission histories and is a powerful tool in the fight against infectious disease. In particular, within-host pathogen genomic variants identified through heterozygous nucleotide base calls are a potential source of information to identify linked cases and infer direction and time of transmission. However, using such data effectively to model disease transmission presents a number of challenges, including differentiating genuine variants from those observed due to sequencing error, as well as the specification of a realistic model for within-host pathogen population dynamics. Here we propose a new Bayesian approach to transmission inference, BadTrIP (BAyesian epiDemiological TRansmission Inference from Polymorphisms), that explicitly models evolution of pathogen populations in an outbreak, transmission (including transmission bottlenecks), and sequencing error. BadTrIP enables the inference of host-to-host transmission from pathogen sequencing data and epidemiological data. By assuming that genomic variants are unlinked, our method does not require the computationally intensive and unreliable reconstruction of individual haplotypes. Using simulations we show that BadTrIP is robust in most scenarios and can accurately infer transmission events by efficiently combining information from genetic and epidemiological sources; thanks to its realistic model of pathogen evolution and the inclusion of epidemiological data, BadTrIP is also more accurate than existing approaches. BadTrIP is distributed as an open source package (https://bitbucket.org/nicofmay/badtrip) for the phylogenetic software BEAST2. We apply our method to reconstruct transmission history at the early stages of the 2014 Ebola outbreak, showcasing the power of within-host genomic variants to reconstruct transmission events. We present a new tool to reconstruct transmission events within outbreaks. 
Our approach makes use of pathogen genetic information, notably genetic variants at low frequency within host that are usually discarded, and combines it with epidemiological information of host exposure to infection. This leads to accurate reconstruction of transmission even in cases where abundant within-host pathogen genetic variation and weak transmission bottlenecks (multiple pathogen units colonising a new host at transmission) would otherwise make inference difficult due to the transmission history differing from the pathogen evolution history inferred from pathogen isolates. Also, the use of within-host pathogen genomic variants increases the resolution of the reconstruction of the transmission tree even in scenarios with limited within-outbreak pathogen genetic diversity: within-host pathogen populations that appear identical at the level of consensus sequences can be discriminated using within-host variants. Our Bayesian approach provides a measure of the confidence in different possible transmission histories, and is published as open source software. We show with simulations and with an analysis of the beginning of the 2014 Ebola outbreak that our approach is applicable in many scenarios, improves our understanding of transmission dynamics, and will contribute to finding and limiting sources and routes of transmission, and therefore preventing the spread of infectious disease.
Affiliation(s)
- Nicola De Maio
  - Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
- Colin J Worby
  - Department of Ecology and Evolutionary Biology, Princeton University, Princeton, New Jersey, United States of America
- Daniel J Wilson
  - Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
- Nicole Stoesser
  - Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom

17
Yu X, Reva ON. SWPhylo - A Novel Tool for Phylogenomic Inferences by Comparison of Oligonucleotide Patterns and Integration of Genome-Based and Gene-Based Phylogenetic Trees. Evol Bioinform Online 2018; 14:1176934318759299. [PMID: 29511354 PMCID: PMC5826093 DOI: 10.1177/1176934318759299] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/14/2017] [Accepted: 01/24/2018] [Indexed: 11/17/2022] Open
Abstract
Modern phylogenetic studies may benefit from the analysis of complete genome sequences of various microorganisms. Evolutionary inferences based on genome-scale analysis are believed to be more accurate than the gene-based alternative. However, the computational complexity of current phylogenomic procedures, the unsuitability of standard phylogenetic tools for processing genome-wide data, and the lack of reliable substitution models that correlate with alignment-free phylogenomic approaches deter microbiologists from using these opportunities. For example, the super-matrix and super-tree approaches of phylogenomics use multiple integrated genomic loci or individual gene-based trees to infer an overall consensus tree. However, these approaches potentially multiply errors of gene annotation and sequence alignment, not to mention the computational complexity and laboriousness of the methods. In this article, we demonstrate that the annotation- and alignment-free comparison of genome-wide tetranucleotide frequencies, termed oligonucleotide usage patterns (OUPs), allowed fast and reliable inference of phylogenetic trees. These were congruent with the corresponding whole genome super-matrix trees in terms of tree topology when compared with other known approaches including 16S ribosomal RNA and GyrA protein sequence comparison, complete genome-based MAUVE, and CVTree methods. A Web-based program to perform the alignment-free OUP-based phylogenomic inferences was implemented at http://swphylo.bi.up.ac.za/. Applicability of the tool was tested on different taxa from subspecies to intergeneric levels. Distinguishing between closely related taxonomic units may be enforced by providing the program with alignments of marker protein sequences, e.g., GyrA.
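The core idea of comparing genomes by tetranucleotide frequencies can be sketched in a few lines of Python. This example is a simplified illustration under stated assumptions, not the OUP computation implemented in SWPhylo: the function names are hypothetical, frequencies are plain relative counts over the 256 possible 4-mers, and Euclidean distance stands in for whatever pattern comparison the tool actually uses.

```python
import math
from collections import Counter
from itertools import product


def tetra_freqs(seq, k=4):
    """Relative frequencies of all 4**k k-mers over A/C/G/T, in sorted order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(p, q):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


genome_a = "ACGTACGTTTGACCAGT" * 10  # toy sequences, not real genomes
genome_b = "ACGTACGTTTGACCAGA" * 10
genome_c = "GGGCCCGGGCCCATATA" * 10
fa, fb, fc = (tetra_freqs(g) for g in (genome_a, genome_b, genome_c))
print(euclidean(fa, fb) < euclidean(fa, fc))  # closer composition, smaller distance
```

A matrix of such pairwise distances could then be passed to any distance-based tree builder (for example neighbor joining) without annotation or alignment.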
Affiliation(s)
- Xiaoyu Yu
  - Department of Biochemistry, Centre for Bioinformatics and Computational Biology, University of Pretoria, Pretoria, South Africa
- Oleg N Reva
  - Department of Biochemistry, Centre for Bioinformatics and Computational Biology, University of Pretoria, Pretoria, South Africa

18
Mortimer TD, Annis DS, O’Neill MB, Bohr LL, Smith TM, Poinar HN, Mosher DF, Pepperell CS. Adaptation in a Fibronectin Binding Autolysin of Staphylococcus saprophyticus. mSphere 2017; 2:e00511-17. [PMID: 29202045 PMCID: PMC5705806 DOI: 10.1128/msphere.00511-17] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 10/31/2017] [Accepted: 11/13/2017] [Indexed: 12/18/2022] Open
Abstract
Human-pathogenic bacteria are found in a variety of niches, including free-living, zoonotic, and microbiome environments. Identifying bacterial adaptations that enable invasive disease is an important means of gaining insight into the molecular basis of pathogenesis and understanding pathogen emergence. Staphylococcus saprophyticus, a leading cause of urinary tract infections, can be found in the environment, food, animals, and the human microbiome. We identified a selective sweep in the gene encoding the Aas adhesin, a key virulence factor that binds host fibronectin. We hypothesize that the mutation under selection (aas_2206A>C) facilitates colonization of the urinary tract, an environment where bacteria are subject to strong shearing forces. The mutation appears to have enabled emergence and expansion of a human-pathogenic lineage of S. saprophyticus. These results demonstrate the power of evolutionary genomic approaches in discovering the genetic basis of virulence and emphasize the pleiotropy and adaptability of bacteria occupying diverse niches. IMPORTANCE: Staphylococcus saprophyticus is an important cause of urinary tract infections (UTI) in women; such UTI are common, can be severe, and are associated with significant impacts to public health. In addition to being a cause of human UTI, S. saprophyticus can be found in the environment, in food, and associated with animals. After discovering that UTI strains of S. saprophyticus are for the most part closely related to each other, we sought to determine whether these strains are specially adapted to cause disease in humans. We found evidence suggesting that a mutation in the gene aas is advantageous in the context of human infection. We hypothesize that the mutation allows S. saprophyticus to survive better in the human urinary tract. These results show how bacteria found in the environment can evolve to cause disease.
Affiliation(s)
- Tatum D. Mortimer
  - Department of Medical Microbiology and Immunology, School of Medicine and Public Health, University of Wisconsin—Madison, Madison, Wisconsin, USA
  - Microbiology Doctoral Training Program, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Douglas S. Annis
  - Department of Biomolecular Chemistry, School of Medicine and Public Health, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Mary B. O’Neill
  - Department of Medical Microbiology and Immunology, School of Medicine and Public Health, University of Wisconsin—Madison, Madison, Wisconsin, USA
  - Laboratory of Genetics, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Lindsey L. Bohr
  - Department of Medical Microbiology and Immunology, School of Medicine and Public Health, University of Wisconsin—Madison, Madison, Wisconsin, USA
  - Microbiology Doctoral Training Program, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Tracy M. Smith
  - Department of Medical Microbiology and Immunology, School of Medicine and Public Health, University of Wisconsin—Madison, Madison, Wisconsin, USA
  - Department of Medicine, Division of Infectious Diseases, School of Medicine and Public Health, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Hendrik N. Poinar
  - McMaster Ancient DNA Centre, Department of Anthropology, McMaster University, Hamilton, Ontario, Canada
  - Department of Biology, McMaster University, Hamilton, Ontario, Canada
  - Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
  - Humans and the Microbiome Program, Canadian Institute for Advanced Research, Toronto, Ontario, Canada
- Deane F. Mosher
  - Department of Biomolecular Chemistry, School of Medicine and Public Health, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Caitlin S. Pepperell
  - Department of Medical Microbiology and Immunology, School of Medicine and Public Health, University of Wisconsin—Madison, Madison, Wisconsin, USA
  - Department of Medicine, Division of Infectious Diseases, School of Medicine and Public Health, University of Wisconsin—Madison, Madison, Wisconsin, USA

19
Mostowy R, Croucher NJ, Andam CP, Corander J, Hanage WP, Marttinen P. Efficient Inference of Recent and Ancestral Recombination within Bacterial Populations. Mol Biol Evol 2017; 34:1167-1182. [PMID: 28199698 PMCID: PMC5400400 DOI: 10.1093/molbev/msx066] [Citation(s) in RCA: 114] [Impact Index Per Article: 14.3] [Indexed: 01/25/2023] Open
Abstract
Prokaryotic evolution is affected by horizontal transfer of genetic material through recombination. Inference of an evolutionary tree of bacteria thus relies on accurate identification of the population genetic structure and recombination-derived mosaicism. Rapidly growing databases represent a challenge for computational methods to detect recombinations in bacterial genomes. We introduce a novel algorithm called fastGEAR which identifies lineages in diverse microbial alignments, and recombinations between them and from external origins. The algorithm detects both recent recombinations (affecting a few isolates) and ancestral recombinations between detected lineages (affecting entire lineages), thus providing insight into recombinations affecting deep branches of the phylogenetic tree. In simulations, fastGEAR had comparable power to detect recent recombinations and outstanding power to detect the ancestral ones, compared with state-of-the-art methods, often with a fraction of computational cost. We demonstrate the utility of the method by analyzing a collection of 616 whole-genomes of a recombinogenic pathogen Streptococcus pneumoniae, for which the method provided a high-resolution view of recombination across the genome. We examined in detail the penicillin-binding genes across the Streptococcus genus, demonstrating previously undetected genetic exchanges between different species at these three loci. Hence, fastGEAR can be readily applied to investigate mosaicism in bacterial genes across multiple species. Finally, fastGEAR correctly identified many known recombination hotspots and pointed to potential new ones. Matlab code and Linux/Windows executables are available at https://users.ics.aalto.fi/~pemartti/fastGEAR/ (last accessed February 6, 2017).
Affiliation(s)
- Rafal Mostowy
  - Department of Infectious Disease Epidemiology, St. Mary's Campus, Imperial College London, London, United Kingdom
- Nicholas J Croucher
  - Department of Infectious Disease Epidemiology, St. Mary's Campus, Imperial College London, London, United Kingdom
- Cheryl P Andam
  - Department of Epidemiology, Harvard TH Chan School of Public Health, Center for Communicable Disease Dynamics, Boston, MA
- Jukka Corander
  - Department of Mathematics and Statistics, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland
  - Department of Biostatistics, University of Oslo, Oslo, Norway
- William P Hanage
  - Department of Epidemiology, Harvard TH Chan School of Public Health, Center for Communicable Disease Dynamics, Boston, MA
- Pekka Marttinen
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Espoo, Finland

20
Abstract
Bacteria can exchange and acquire new genetic material from other organisms directly and via the environment. This process, known as bacterial recombination, has a strong impact on the evolution of bacteria, for example, leading to the spread of antibiotic resistance across clades and species, and to the avoidance of clonal interference. Recombination hinders phylogenetic and transmission inference because it creates patterns of substitutions (homoplasies) inconsistent with the hypothesis of a single evolutionary tree. Bacterial recombination is typically modeled as statistically akin to gene conversion in eukaryotes, i.e., using the coalescent with gene conversion (CGC). However, this model can be very computationally demanding as it needs to account for the correlations of evolutionary histories of even distant loci. So, with the increasing popularity of whole genome sequencing, the need has emerged for a faster approach to model and simulate bacterial genome evolution. We present a new model that approximates the coalescent with gene conversion: the bacterial sequential Markov coalescent (BSMC). Our approach is based on a similar idea to the sequential Markov coalescent (SMC), an approximation of the coalescent with crossover recombination. However, bacterial recombination poses hurdles to a sequential Markov approximation, as it leads to strong correlations and linkage disequilibrium across very distant sites in the genome. Our BSMC overcomes these difficulties, and shows a considerable reduction in computational demand compared to the exact CGC, and very similar patterns in simulated data. We implemented our BSMC model within the new simulation software FastSimBac. In addition to the decreased computational demand compared to previous bacterial genome evolution simulators, FastSimBac provides more general options for evolutionary scenarios, allowing population structure with migration, speciation, population size changes, and recombination hotspots.
FastSimBac is available from https://bitbucket.org/nicofmay/fastsimbac, and is distributed as open source under the terms of the GNU General Public License. Lastly, we use the BSMC within an Approximate Bayesian Computation (ABC) inference scheme, and suggest that parameters simulated under the exact CGC can correctly be recovered, further showcasing the accuracy of the BSMC. With this ABC we infer recombination rate, mutation rate, and recombination tract length of Bacillus cereus from a whole genome alignment.
Affiliation(s)
- Nicola De Maio
  - Institute for Emerging Infections, Oxford Martin School, University of Oxford, Oxford, OX1 3PA, United Kingdom
  - Nuffield Department of Medicine, University of Oxford, Oxford, OX1 3PA, United Kingdom
- Daniel J Wilson
  - Institute for Emerging Infections, Oxford Martin School, University of Oxford, Oxford, OX1 3PA, United Kingdom
  - Nuffield Department of Medicine, University of Oxford, Oxford, OX1 3PA, United Kingdom
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, OX1 3PA, United Kingdom