1
Břinda K, Lima L, Pignotti S, Quinones-Olvera N, Salikhov K, Chikhi R, Kucherov G, Iqbal Z, Baym M. Efficient and robust search of microbial genomes via phylogenetic compression. Nat Methods 2025; 22:692-697. [PMID: 40205174 DOI: 10.1038/s41592-025-02625-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Accepted: 02/12/2025] [Indexed: 04/11/2025]
Abstract
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as the Basic Local Alignment Search Tool (BLAST) and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
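The intuition behind ordering genomes by evolutionary relatedness before compressing them can be illustrated with a toy sketch. This is not the authors' pipeline: the simulated genomes, the 1% mutation rate, the use of `zlib`, and every function name below are assumptions made purely for illustration.

```python
import random
import zlib

def random_genome(length, rng):
    """Random DNA string, a stand-in for the ancestor of a lineage."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def mutate(genome, rate, rng):
    """Copy of genome with point substitutions at roughly the given per-base rate."""
    return "".join(
        rng.choice("ACGT") if rng.random() < rate else base for base in genome
    )

def compressed_size(genomes):
    """Size in bytes of the zlib-compressed concatenation of the genomes."""
    return len(zlib.compress("".join(genomes).encode("ascii"), 9))

rng = random.Random(0)
genome_length = 20_000  # hypothetical toy size, smaller than zlib's 32 KiB window

# Two unrelated lineages, each with four closely related genomes.
lineages = []
for _ in range(2):
    ancestor = random_genome(genome_length, rng)
    lineages.append([mutate(ancestor, 0.01, rng) for _ in range(4)])

# Order 1: relatives are adjacent, as a phylogeny-guided ordering would place them.
clustered = lineages[0] + lineages[1]
# Order 2: lineages alternate, so the nearest relative is two genomes away.
interleaved = [genome for pair in zip(*lineages) for genome in pair]

size_clustered = compressed_size(clustered)
size_interleaved = compressed_size(interleaved)
print(size_clustered, size_interleaved)
```

In this toy setting the clustered order should compress better, because each genome's close relative lies within the compressor's match window. Real collections and production compressors differ in scale and detail.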
Affiliation(s)
- Karel Břinda
  - Inria, Irisa, Univ. Rennes, Rennes, France
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Simone Pignotti
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - LIGM, CNRS, Univ. Gustave Eiffel, Marne-la-Vallée, France
- Kamil Salikhov
  - LIGM, CNRS, Univ. Gustave Eiffel, Marne-la-Vallée, France
- Rayan Chikhi
  - Institut Pasteur, Univ. Paris Cité, G5 Sequence Bioinformatics, Paris, France
- Zamin Iqbal
  - EMBL-EBI, Hinxton, UK
  - Milner Centre for Evolution, University of Bath, Bath, UK
- Michael Baym
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
2
Matthews CA, Watson-Haigh NS, Burton RA, Sheppard AE. A gentle introduction to pangenomics. Brief Bioinform 2024; 25:bbae588. [PMID: 39552065 PMCID: PMC11570541 DOI: 10.1093/bib/bbae588] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2024] [Revised: 09/12/2024] [Accepted: 11/01/2024] [Indexed: 11/19/2024] Open
Abstract
Pangenomes have emerged in response to limitations associated with traditional linear reference genomes. In contrast to a traditional reference that is (usually) assembled from a single individual, pangenomes aim to represent all of the genomic variation found in a group of organisms. The term 'pangenome' is currently used to describe multiple different types of genomic information, and limited language is available to differentiate between them. This is frustrating for researchers working in the field and confusing for researchers new to the field. Here, we provide an introduction to pangenomics relevant to both prokaryotic and eukaryotic organisms and propose a formalization of the language used to describe pangenomes (see the Glossary) to improve the specificity of discussion in the field.
Affiliation(s)
- Chelsea A Matthews
  - School of Agriculture, Food and Wine, Waite Campus, University of Adelaide, Urrbrae, South Australia 5064, Australia
- Nathan S Watson-Haigh
  - Australian Genome Research Facility, Victorian Comprehensive Cancer Centre, Melbourne, Victoria 3000, Australia
  - South Australian Genomics Centre, SAHMRI, North Terrace, Adelaide, South Australia 5000, Australia
  - Alkahest Inc., San Carlos, CA 94070, United States
- Rachel A Burton
  - School of Agriculture, Food and Wine, Waite Campus, University of Adelaide, Urrbrae, South Australia 5064, Australia
- Anna E Sheppard
  - School of Biological Sciences, University of Adelaide, Adelaide, South Australia 5005, Australia
3
Roberts M, Josephs EB. Previously unmeasured genetic diversity explains part of Lewontin's paradox in a k-mer-based meta-analysis of 112 plant species. bioRxiv 2024:2024.05.17.594778. [PMID: 38798362 PMCID: PMC11118579 DOI: 10.1101/2024.05.17.594778] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
At the molecular level, most evolution is expected to be neutral. A key prediction of this expectation is that the level of genetic diversity in a population should scale with population size. However, as was noted by Richard Lewontin in 1974 and reaffirmed by later studies, the slope of the population size-diversity relationship in nature is much weaker than expected under neutral theory. We hypothesize that one contributor to this paradox is that current methods relying on single nucleotide polymorphisms (SNPs) called from aligning short reads to a reference genome underestimate levels of genetic diversity in many species. To test this idea, we calculated nucleotide diversity (π) and k-mer-based metrics of genetic diversity across 112 plant species, amounting to over 205 terabases of DNA sequencing data from 27,488 individual plants. We then compared how these different metrics correlated with proxies of population size that account for both range size and population density variation across species. We found that our population size proxies scaled anywhere from about 3 to over 20 times faster with k-mer diversity than nucleotide diversity after adjusting for evolutionary history, mating system, life cycle habit, cultivation status, and invasiveness. The relationship between k-mer diversity and population size proxies also remains significant after correcting for genome size, whereas the analogous relationship for nucleotide diversity does not. These results suggest that variation not captured by common SNP-based analyses explains part of Lewontin's paradox in plants.
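For readers unfamiliar with the SNP-based baseline, nucleotide diversity (π) is conventionally defined as the average number of pairwise differences per site among aligned sequences. The sketch below computes it for a toy alignment. The function name and the example sequences are illustrative assumptions, and the k-mer-based metrics used in the study are not reproduced here.

```python
from itertools import combinations

def nucleotide_diversity(aligned_seqs):
    """Average pairwise differences per site for equal-length aligned sequences."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(aligned_seqs, 2))
    # Count mismatching sites over every unordered pair of sequences.
    total_diffs = sum(
        sum(a != b for a, b in zip(first, second)) for first, second in pairs
    )
    return total_diffs / (len(pairs) * length)

# Hypothetical toy alignment of three haplotypes, four sites each.
toy_alignment = ["ACGT", "ACGA", "ACTT"]
pi = nucleotide_diversity(toy_alignment)
print(pi)
```

Because this estimate only sees sites that survive alignment to a reference, variation in unaligned or absent sequence does not contribute to it, which is the gap the abstract attributes to SNP-based methods.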
Affiliation(s)
- Miles Roberts
  - Genetics and Genome Sciences Program, Michigan State University, East Lansing, MI
- Emily B. Josephs
  - Department of Plant Biology, Michigan State University, East Lansing, MI
  - Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI
  - Plant Resilience Institute, Michigan State University, East Lansing, MI
4
Garg V, Bohra A, Mascher M, Spannagl M, Xu X, Bevan MW, Bennetzen JL, Varshney RK. Unlocking plant genetics with telomere-to-telomere genome assemblies. Nat Genet 2024; 56:1788-1799. [PMID: 39048791 DOI: 10.1038/s41588-024-01830-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2024] [Accepted: 06/12/2024] [Indexed: 07/27/2024]
Abstract
Contiguous genome sequence assemblies will help us to realize the full potential of crop translational genomics. Recent advances in sequencing technologies, especially long-read sequencing strategies, have made it possible to construct gapless telomere-to-telomere (T2T) assemblies, thus offering novel insights into genome organization and function. Plant genomes pose unique challenges, such as a continuum of ancient to recent polyploidy and abundant highly similar and long repetitive elements. Owing to progress in sequencing approaches, for most crop plants, chromosome-scale reference genome assemblies are available, but T2T assembly construction remains challenging. Here we describe methods for haplotype-resolved, gapless T2T assembly construction in plants, including various crop species. We outline the impact of T2T assemblies in elucidating the roles of repetitive elements in gene regulation, as well as in pangenomics, functional genomics, genome-assisted breeding and targeted genome manipulation. In conjunction with sequence-enriched germplasm repositories, T2T assemblies thus hold great promise for basic and applied plant sciences.
Affiliation(s)
- Vanika Garg
  - WA State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Food Futures Institute, Murdoch University, Murdoch, Western Australia, Australia
- Abhishek Bohra
  - WA State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Food Futures Institute, Murdoch University, Murdoch, Western Australia, Australia
  - ICAR-Indian Institute of Pulses Research, Kanpur, India
- Martin Mascher
  - Leibniz Institute of Plant Genetics and Crop Plant Research, Gatersleben, Seeland, Germany
- Manuel Spannagl
  - WA State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Food Futures Institute, Murdoch University, Murdoch, Western Australia, Australia
  - Plant Genome and Systems Biology, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany
- Xun Xu
  - WA State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Food Futures Institute, Murdoch University, Murdoch, Western Australia, Australia
  - BGI-Shenzhen, Shenzhen, China
- Rajeev K Varshney
  - WA State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Food Futures Institute, Murdoch University, Murdoch, Western Australia, Australia
5
Mustafa H, Karasikov M, Mansouri Ghiasi N, Rätsch G, Kahles A. Label-guided seed-chain-extend alignment on annotated De Bruijn graphs. Bioinformatics 2024; 40:i337-i346. [PMID: 38940164 PMCID: PMC11211850 DOI: 10.1093/bioinformatics/btae226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION: Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.
RESULTS: We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations. To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
AVAILABILITY AND IMPLEMENTATION: The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
Affiliation(s)
- Harun Mustafa
  - Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
  - Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
  - Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- Mikhail Karasikov
  - Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
  - Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
  - Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- Nika Mansouri Ghiasi
  - Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, 8092, Switzerland
- Gunnar Rätsch
  - Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
  - Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
  - Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
  - ETH AI Center, Zurich, 8092, Switzerland
  - Department of Biology, ETH Zurich, Zurich, 8093, Switzerland
  - The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
- André Kahles
  - Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
  - Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
  - Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
  - The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
6
Břinda K, Lima L, Pignotti S, Quinones-Olvera N, Salikhov K, Chikhi R, Kucherov G, Iqbal Z, Baym M. Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression. bioRxiv 2024:2023.04.15.536996. [PMID: 37131636 PMCID: PMC10153118 DOI: 10.1101/2023.04.15.536996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as BLAST and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs, and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids, or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
7
Yu C, Zhao Y, Zhao C, Jin J, Mao K, Wang G. MiniDBG: A Novel and Minimal De Bruijn Graph for Read Mapping. IEEE/ACM Trans Comput Biol Bioinform 2024; 21:129-142. [PMID: 38060353 DOI: 10.1109/tcbb.2023.3340251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/07/2024]
Abstract
The De Bruijn graph (DBG) has been widely used in algorithms for indexing or organizing read and reference sequences in bioinformatics. However, a DBG model that can locate each node, edge and path on the sequence has not been proposed so far. Recently, the DBG has been used to represent reference sequences in read mapping tasks. In this setting, the paths of the DBG and the substrings of the reference sequence do not correspond one-to-one. This gives rise to false paths in the DBG, that is, paths that no substring of the reference produces. Moreover, if a candidate path of a read is true, we need to locate it and verify the candidate on the sequence. To solve these problems, we proposed a DBG model, called MiniDBG, which stores the position lists of a minimal set of edges. With the position lists, MiniDBG can locate any node, edge and path efficiently. We also proposed algorithms for generating MiniDBG from an original DBG and algorithms for locating edges or paths on the sequence. We designed and ran experiments on real datasets to compare them with BWT-based and position list-based methods. The experimental results show that MiniDBG can locate edges and paths efficiently with lower memory costs.
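The false-path problem described in this abstract can be made concrete with a small sketch. It builds a naive position list for every k-mer (edge) of a toy reference and uses it to tell true paths from false ones. This is not MiniDBG itself, which stores positions only for a minimal set of edges; the reference string, `k = 3`, and all function names are assumptions for illustration.

```python
from collections import defaultdict

def kmer_positions(reference, k):
    """Map each k-mer (an edge of the DBG) to its start positions in the reference."""
    positions = defaultdict(list)
    for i in range(len(reference) - k + 1):
        positions[reference[i:i + k]].append(i)
    return positions

def is_graph_path(query, positions, k):
    """True if every k-mer of query is an edge of the graph."""
    return len(query) >= k and all(
        query[i:i + k] in positions for i in range(len(query) - k + 1)
    )

def locate(query, positions, k):
    """Start positions where the k-mers of query occur consecutively in the reference."""
    if not is_graph_path(query, positions, k):
        return []
    starts = positions[query[:k]]
    return [
        s for s in starts
        if all(s + i in positions[query[i:i + k]]
               for i in range(1, len(query) - k + 1))
    ]

reference = "ACGTCGA"  # hypothetical toy reference
k = 3
index = kmer_positions(reference, k)

# "ACGA" walks the edges ACG -> CGA, both present in the graph,
# yet it is not a substring of the reference: a false path.
false_path_hits = locate("ACGA", index, k)
# "TCGA" is a true path and can be located on the reference.
true_path_hits = locate("TCGA", index, k)
print(is_graph_path("ACGA", index, k), false_path_hits, true_path_hits)
```

Storing a position list for every edge, as done here, is the memory-hungry baseline; the abstract's contribution is reducing that to a minimal set of edges while still being able to locate any path.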
8
Schamp CN, Dhowlaghar N, Hudson LK, Bryan DW, Zhong Q, Fozo EM, Gaballa A, Wiedmann M, Denes TG. Selection of mutant Listeria phages under food-relevant conditions can enhance application potential. Appl Environ Microbiol 2023; 89:e0100723. [PMID: 37800961 PMCID: PMC10617581 DOI: 10.1128/aem.01007-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Accepted: 08/04/2023] [Indexed: 10/07/2023] Open
Abstract
Bacteriophages are viruses that infect and kill bacteria. Currently, phage products are available for the control of the pathogen Listeria monocytogenes in food products in the United States. In this study, we explore whether experimental evolution can be used to generate phages with improved abilities to function under specific food-relevant conditions. Ultra-pasteurized oat and whole milk were chosen as test matrices as they represent different food groups, yet have similar physical traits and macronutrient composition. We showed that (i) wild-type phage LP-125 infection kinetics are different in the two matrices and (ii) LP-125 has a significantly higher burst size in oat milk. From this, we attempted to evolve LP-125 to have improved infection kinetics in whole milk. Ancestral LP-125 was passaged through 10 rounds of amplification in milk conditions. Plaque-purified DNA samples from milk-selected phages were isolated and sequenced, and mutations present in the isolated phages were identified. We found two nonsynonymous substitutions in LP125_108 and LP125_112 genes, which encode putative baseplate-associated glycerophosphoryl diester phosphodiesterase and baseplate protein, respectively. Protein structural modeling showed that the substituted amino acids in the mutant phages are predicted to localize to surface-exposed helices on the corresponding structures, which might affect the surface charge of proteins and their interaction with the bacterial cell. The phage containing the LP125_112 mutation adsorbed significantly faster than the ancestral phage in both oat and whole milk. Follow-up experiments suggest that fat content may be a key factor for the expression of the phenotype of this mutation.
IMPORTANCE Bacteriophages are one of the tools available to control the foodborne pathogen, Listeria monocytogenes. Phage products must work under a broad range of food conditions to be an effective control for L. monocytogenes. Here, we show that the experimental evolution of phages can be used to generate new phages with phenotypes useful under specific conditions. We used this approach to select for a mutant phage that more efficiently binds to L. monocytogenes that is grown in whole milk and oat milk. We show that the fat content of these milks is necessary for the expression of this phenotype. Our findings show that experimental evolution can be used to select for improved phages with better performance under specific conditions. This approach has the potential to support the development of condition-specific phage-based biocontrols in the food industry.
Affiliation(s)
- Claire N. Schamp
  - Department of Food Science, The University of Tennessee, Knoxville, Tennessee, USA
- Nitin Dhowlaghar
  - Department of Food Science, The University of Tennessee, Knoxville, Tennessee, USA
- Lauren K. Hudson
  - Department of Food Science, The University of Tennessee, Knoxville, Tennessee, USA
- Daniel W. Bryan
  - Department of Food Science, The University of Tennessee, Knoxville, Tennessee, USA
- Qixin Zhong
  - Department of Food Science, The University of Tennessee, Knoxville, Tennessee, USA
- Elizabeth M. Fozo
  - Department of Microbiology, The University of Tennessee, Knoxville, Tennessee, USA
- Ahmed Gaballa
  - Department of Food Science, Cornell University, Ithaca, New York, USA
- Martin Wiedmann
  - Department of Food Science, Cornell University, Ithaca, New York, USA
- Thomas G. Denes
  - Department of Food Science, The University of Tennessee, Knoxville, Tennessee, USA
9
Shi ZJ, Nayfach S, Pollard KS. Maast: genotyping thousands of microbial strains efficiently. Genome Biol 2023; 24:186. [PMID: 37563669 PMCID: PMC10416524 DOI: 10.1186/s13059-023-03030-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/07/2022] [Accepted: 07/31/2023] [Indexed: 08/12/2023] Open
Abstract
Existing single nucleotide polymorphism (SNP) genotyping algorithms do not scale for species with thousands of sequenced strains, nor do they account for conspecific redundancy. Here we present a bioinformatics tool, Maast, which empowers population genetic meta-analysis of microbes at an unrivaled scale. Maast implements a novel algorithm to heuristically identify a minimal set of diverse conspecific genomes, then constructs a reliable SNP panel for each species, and enables rapid and accurate genotyping using a hybrid of whole-genome alignment and k-mer exact matching. We demonstrate Maast's utility by genotyping thousands of Helicobacter pylori strains and tracking SARS-CoV-2 diversification.
Affiliation(s)
- Zhou Jason Shi
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
  - Gladstone Institutes of Data Science and Biotechnology, San Francisco, CA, USA
- Stephen Nayfach
  - Joint Genome Institute, Department of Energy, Walnut Creek, CA, USA
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Katherine S Pollard
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
  - Gladstone Institutes of Data Science and Biotechnology, San Francisco, CA, USA
  - Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, USA
10
Shipilina D, Pal A, Stankowski S, Chan YF, Barton NH. On the origin and structure of haplotype blocks. Mol Ecol 2023; 32:1441-1457. [PMID: 36433653 PMCID: PMC10946714 DOI: 10.1111/mec.16793] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/13/2022] [Revised: 11/16/2022] [Accepted: 11/18/2022] [Indexed: 11/27/2022]
Abstract
The term "haplotype block" is commonly used in the developing field of haplotype-based inference methods. We argue that the term should be defined based on the structure of the Ancestral Recombination Graph (ARG), which contains complete information on the ancestry of a sample. We use simulated examples to demonstrate key features of the relationship between haplotype blocks and ancestral structure, emphasizing the stochasticity of the processes that generate them. Even the simplest cases of neutrality or of a "hard" selective sweep produce a rich structure, often missed by commonly used statistics. We highlight a number of novel methods for inferring haplotype structure, based on the full ARG, or on a sequence of trees, and illustrate how they can be used to define haplotype blocks using an empirical data set. While the advent of new, computationally efficient methods makes it possible to apply these concepts broadly, they (and additional new methods) could benefit from adding features to explore haplotype blocks, as we define them. Understanding and applying the concept of the haplotype block will be essential to fully exploit long and linked-read sequencing technologies.
Affiliation(s)
- Daria Shipilina
  - Evolutionary Biology Program, Department of Ecology and Genetics (IEG), Uppsala University, Uppsala, Sweden
  - Institute of Science and Technology Austria, Klosterneuburg, Austria
  - Swedish Collegium for Advanced Study, Uppsala, Sweden
- Arka Pal
  - Institute of Science and Technology Austria, Klosterneuburg, Austria
- Sean Stankowski
  - Institute of Science and Technology Austria, Klosterneuburg, Austria
11
Akhter S, Westrin KJ, Zivi N, Nordal V, Kretzschmar WW, Delhomme N, Street NR, Nilsson O, Emanuelsson O, Sundström JF. Cone-setting in spruce is regulated by conserved elements of the age-dependent flowering pathway. New Phytol 2022; 236:1951-1963. [PMID: 36076311 PMCID: PMC9825996 DOI: 10.1111/nph.18449] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/28/2022] [Accepted: 08/23/2022] [Indexed: 06/15/2023]
Abstract
Reproductive phase change is well characterized in angiosperm model species, but less studied in gymnosperms. We utilize the early cone-setting acrocona mutant to study reproductive phase change in the conifer Picea abies (Norway spruce), a gymnosperm. The acrocona mutant frequently initiates cone-like structures, called transition shoots, in positions where wild-type P. abies always produces vegetative shoots. We collect acrocona and wild-type samples, and RNA-sequence their messenger RNA (mRNA) and microRNA (miRNA) fractions. We establish gene expression patterns and then use allele-specific transcript assembly to identify mutations in acrocona. We genotype a segregating population of inbred acrocona trees. A member of the SQUAMOSA BINDING PROTEIN-LIKE (SPL) gene family, PaSPL1, is active in reproductive meristems, whereas two putative negative regulators of PaSPL1, miRNA156 and the conifer specific miRNA529, are upregulated in vegetative and transition shoot meristems. We identify a mutation in a putative miRNA156/529 binding site of the acrocona PaSPL1 allele and show that the mutation renders the acrocona allele tolerant to these miRNAs. We show co-segregation between the early cone-setting phenotype and trees homozygous for the acrocona mutation. In conclusion, we demonstrate evolutionary conservation of the age-dependent flowering pathway and involvement of this pathway in regulating reproductive phase change in the conifer P. abies.
Affiliation(s)
- Shirin Akhter
  - Department of Plant Biology, Linnean Center for Plant Biology, Uppsala BioCentre, Swedish University of Agricultural Sciences (SLU), SE-750 07 Uppsala, Sweden
- Karl Johan Westrin
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, SE-171 65 Solna, Sweden
- Nathan Zivi
  - Department of Plant Biology, Linnean Center for Plant Biology, Uppsala BioCentre, Swedish University of Agricultural Sciences (SLU), SE-750 07 Uppsala, Sweden
  - Skogforsk, Uppsala Science Park, Uppsala, SE-751 83, Sweden
- Veronika Nordal
  - Department of Plant Biology, Linnean Center for Plant Biology, Uppsala BioCentre, Swedish University of Agricultural Sciences (SLU), SE-750 07 Uppsala, Sweden
- Warren W. Kretzschmar
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, SE-171 65 Solna, Sweden
- Nicolas Delhomme
  - Department of Forest Genetics and Plant Physiology, Umeå Plant Science Centre, Swedish University of Agricultural Sciences (SLU), SE-901 83 Umeå, Sweden
- Nathaniel R. Street
  - Department of Plant Physiology, Umeå Plant Science Centre, Umeå University, SE-901 87 Umeå, Sweden
- Ove Nilsson
  - Department of Forest Genetics and Plant Physiology, Umeå Plant Science Centre, Swedish University of Agricultural Sciences (SLU), SE-901 83 Umeå, Sweden
- Olof Emanuelsson
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, SE-171 65 Solna, Sweden
- Jens F. Sundström
  - Department of Plant Biology, Linnean Center for Plant Biology, Uppsala BioCentre, Swedish University of Agricultural Sciences (SLU), SE-750 07 Uppsala, Sweden
12
Hunt M, Letcher B, Malone KM, Nguyen G, Hall MB, Colquhoun RM, Lima L, Schatz MC, Ramakrishnan S, Iqbal Z. Minos: variant adjudication and joint genotyping of cohorts of bacterial genomes. Genome Biol 2022; 23:147. [PMID: 35791022 PMCID: PMC9254434 DOI: 10.1186/s13059-022-02714-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 10/06/2021] [Accepted: 06/20/2022] [Indexed: 12/30/2022] Open
Abstract
There are many short-read variant-calling tools, with different strengths and weaknesses. We present a tool, Minos, which combines outputs from arbitrary variant callers, increasing recall without loss of precision. We benchmark on 62 samples from three bacterial species and an outbreak of 385 Mycobacterium tuberculosis samples. Minos also enables joint genotyping; we demonstrate on a large (N=13k) M. tuberculosis cohort, building a map of non-synonymous SNPs and indels in a region where all such variants are assumed to cause rifampicin resistance. We quantify the correlation with phenotypic resistance and then replicate in a second cohort (N=10k).
Affiliation(s)
- Martin Hunt
  - EMBL-EBI, Cambridge, UK
  - Nuffield Department of Medicine, University of Oxford, Oxford, UK
- Rachel M Colquhoun
  - Institute of Evolutionary Biology, Ashworth Laboratories, University of Edinburgh, Edinburgh, UK
- Michael C Schatz
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
13
Yu C, Mao K, Zhao Y, Chang C, Wang G. StLiter: A Novel Algorithm to Iteratively Build the Compacted de Bruijn Graph From Many Complete Genomes. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:2471-2483. [PMID: 33630738 DOI: 10.1109/tcbb.2021.3062068] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/12/2023]
Abstract
Recently, the compacted de Bruijn graph (cDBG) of complete genome sequences was successfully used in read mapping due to its ability to deal with the repetitions in genomes. However, current approaches are not flexible enough to support frequently rebuilding the graph with different k-mer lengths. Instead of building the graph directly, how can we build the compacted de Bruijn graph of a longer k-mer based on that of a shorter k-mer? In this article, we present StLiter, a novel algorithm to build the compacted de Bruijn graph either directly from genome sequences or indirectly based on the graph of a shorter k-mer. For 100 simulated human genomes, StLiter can construct the graph of k-mer length 15-18 in 2.5-3.2 hours with a maximum of ∼70 GB memory when the reverse complements of the reference genomes are not considered, and in 4.5-5.9 hours when they are. In experiments, we compared StLiter with TwoPaCo, the state-of-the-art method for building the graph, on 4 datasets. For k-mer length 15-18, StLiter can build the graph 5-9 times faster than TwoPaCo with a lower peak memory cost. For k-mer lengths larger than 18, given the graph of a shorter (k-x)-mer, such as x = 1-2, StLiter can also build the graph more efficiently than TwoPaCo building the graph directly. The source code of StLiter can be downloaded from https://github.com/BioLab-cz/StLiter.
|
14
|
Ebler J, Ebert P, Clarke WE, Rausch T, Audano PA, Houwaart T, Mao Y, Korbel JO, Eichler EE, Zody MC, Dilthey AT, Marschall T. Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nat Genet 2022; 54:518-525. [PMID: 35410384 PMCID: PMC9005351 DOI: 10.1038/s41588-022-01043-w] [Citation(s) in RCA: 121] [Impact Index Per Article: 40.3] [Received: 02/15/2021] [Accepted: 03/03/2022] [Indexed: 12/30/2022]
Abstract
Typical genotyping workflows map reads to a reference genome before identifying genetic variants. Generating such alignments introduces reference biases and comes with a substantial computational burden. Furthermore, short-read lengths limit the ability to characterize repetitive genomic regions, which are particularly challenging for fast k-mer-based genotypers. In the present study, we propose a new algorithm, PanGenie, that leverages a haplotype-resolved pangenome reference together with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation, a process we refer to as genome inference. Compared with mapping-based approaches, PanGenie is more than 4 times faster at 30-fold coverage and achieves better genotype concordances for almost all variant types and coverages tested. Improvements are especially pronounced for large insertions (≥50 bp) and variants in repetitive regions, enabling the inclusion of these classes of variants in genome-wide association studies. PanGenie efficiently leverages the increasing amount of haplotype-resolved assemblies to unravel the functional impact of previously inaccessible variants while being faster compared with alignment-based workflows.
Affiliation(s)
- Jana Ebler
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | | | - Tobias Rausch
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
- European Molecular Biology Laboratory, GeneCore, Heidelberg, Germany
| | - Peter A Audano
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Torsten Houwaart
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Yafei Mao
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Jan O Korbel
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | | | - Alexander T Dilthey
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Institute of Medical Statistics and Computational Biology, University of Cologne, Cologne, Germany
- Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases, University of Cologne, Cologne, Germany
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
| |
|
15
|
Rivera D, Moreno-Switt AI, Denes TG, Hudson LK, Peters TL, Samir R, Aziz RK, Noben JP, Wagemans J, Dueñas F. Novel Salmonella Phage, vB_Sen_STGO-35-1, Characterization and Evaluation in Chicken Meat. Microorganisms 2022; 10:606. [PMID: 35336181 PMCID: PMC8954984 DOI: 10.3390/microorganisms10030606] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 01/27/2022] [Revised: 03/03/2022] [Accepted: 03/10/2022] [Indexed: 02/05/2023] Open
Abstract
Salmonellosis is one of the most frequently reported zoonotic foodborne diseases worldwide, and poultry is the most important reservoir of Salmonella enterica serovar Enteritidis. The use of lytic bacteriophages (phages) to reduce foodborne pathogens has emerged as a promising biocontrol intervention for Salmonella spp. Here, we describe and evaluate the newly isolated Salmonella phage STGO-35-1, including: (i) genomic and phenotypic characterization, (ii) an analysis of the reduction of Salmonella in chicken meat, and (iii) genome plasticity testing. Phage STGO-35-1 represents an unclassified siphovirus, with a length of 47,483 bp, a G + C content of 46.5%, a headful strategy of packaging, and a virulent lifestyle. Phage STGO-35-1 reduced S. Enteritidis counts in chicken meat by 2.5 orders of magnitude at 4 °C. We identified two receptor-binding proteins with affinity to LPS, and their encoding genes showed plasticity during an exposure assay. Phenotypic, proteomic, and genomic characteristics of STGO-35-1, as well as the Salmonella reduction in chicken meat, support the potential use of STGO-35-1 as a targeted biocontrol agent against S. Enteritidis in chicken meat. Additionally, computational analysis and a short exposure time assay allowed us to predict the plasticity of genes encoding putative receptor-binding proteins.
Affiliation(s)
- Dácil Rivera
- Escuela de Medicina Veterinaria, Facultad de Ciencias de la Vida, Universidad Andres Bello, Santiago 8320000, Chile;
- Millennium Initiative for Collaborative Research on Bacterial Resistance (MICROB-R), Santiago 7550000, Chile;
| | - Andrea I. Moreno-Switt
- Millennium Initiative for Collaborative Research on Bacterial Resistance (MICROB-R), Santiago 7550000, Chile;
- Escuela de Medicina Veterinaria, Facultad de Agronomía e Ingeniería Forestal, Facultad de Ciencias Biológicas, Facultad de Medicina, Pontificia Universidad Católica de Chile, Santiago 7810000, Chile
| | - Thomas G. Denes
- Department of Food Science, University of Tennessee, Knoxville, TN 37996, USA; (T.G.D.); (L.K.H.); (T.L.P.)
| | - Lauren K. Hudson
- Department of Food Science, University of Tennessee, Knoxville, TN 37996, USA; (T.G.D.); (L.K.H.); (T.L.P.)
| | - Tracey L. Peters
- Department of Food Science, University of Tennessee, Knoxville, TN 37996, USA; (T.G.D.); (L.K.H.); (T.L.P.)
| | - Reham Samir
- Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, 11562 Cairo, Egypt; (R.S.); (R.K.A.)
| | - Ramy K. Aziz
- Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, 11562 Cairo, Egypt; (R.S.); (R.K.A.)
- Microbiology and Immunology Research Program, Children’s Cancer Hospital Egypt 57357, 11617 Cairo, Egypt
| | - Jean-Paul Noben
- Biomedical Research Institute and Transnational University Limburg, Hasselt University, Agoralaan D, 3590 Hasselt, Belgium;
| | | | - Fernando Dueñas
- Escuela de Medicina Veterinaria, Facultad de Ciencias de la Vida, Universidad Andres Bello, Santiago 8320000, Chile;
| |
|
16
|
Balaji A, Sapoval N, Seto C, Leo Elworth R, Fu Y, Nute MG, Savidge T, Segarra S, Treangen TJ. KOMB: K-core based de novo characterization of copy number variation in microbiomes. Comput Struct Biotechnol J 2022; 20:3208-3222. [PMID: 35832621 PMCID: PMC9249589 DOI: 10.1016/j.csbj.2022.06.019] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/31/2022] [Revised: 06/08/2022] [Accepted: 06/09/2022] [Indexed: 11/29/2022] Open
Abstract
Characterizing metagenomes via k-mer-based, database-dependent taxonomic classification has yielded key insights into underlying microbiome dynamics. However, novel approaches are needed to track community dynamics and genomic flux within metagenomes, particularly in response to perturbations. We describe KOMB, a novel method for tracking genome-level dynamics within microbiomes. KOMB utilizes K-core decomposition to identify structural variations (SVs), specifically population-level copy number variation (CNV), within microbiomes. K-core decomposition partitions the graph into shells containing nodes of induced degree at least K, yielding reduced computational complexity compared to prior approaches. Through validation on a synthetic community, we show that KOMB recovers and profiles repetitive genomic regions in the sample. KOMB is shown to identify functionally important regions in Human Microbiome Project datasets, and was used to analyze longitudinal data and identify keystone taxa in fecal microbiota transplantation (FMT) samples. In summary, KOMB represents a novel graph-based, taxonomy-oblivious, and reference-free approach for tracking CNV within microbiomes. KOMB is open source and available for download at https://gitlab.com/treangenlab/komb.
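The K-core decomposition mentioned in the abstract can be illustrated with the standard peeling procedure. This is a generic textbook sketch, not KOMB's implementation; it assumes an undirected simple graph given as an edge list, and the function names are hypothetical.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: core number} by repeatedly peeling the node of
    smallest remaining degree (simple quadratic-time version)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        n = min(remaining, key=lambda x: degree[x])
        k = max(k, degree[n])  # the core level never decreases while peeling
        core[n] = k
        remaining.remove(n)
        for m in adj[n]:
            if m in remaining:
                degree[m] -= 1
    return core


def shells(edges):
    """Group nodes into shells: shell K holds the nodes with core number K."""
    grouped = defaultdict(set)
    for node, k in core_numbers(edges).items():
        grouped[k].add(node)
    return dict(grouped)


# A triangle a-b-c with a pendant node d attached to a.
print(shells([("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]))
```

In this toy graph the triangle forms the 2-shell and the pendant node falls into the 1-shell, which is the kind of partition the abstract refers to.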
Affiliation(s)
- Advait Balaji
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Nicolae Sapoval
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Charlie Seto
- Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, USA
| | - R.A. Leo Elworth
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Yilei Fu
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Michael G. Nute
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Tor Savidge
- Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, USA
| | - Santiago Segarra
- Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
- Corresponding author.
| | - Todd J. Treangen
- Department of Computer Science, Rice University, Houston, TX, USA
- Corresponding author.
| |
|
17
|
CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure. PLoS Comput Biol 2021; 17:e1009631. [PMID: 34813594 PMCID: PMC8651127 DOI: 10.1371/journal.pcbi.1009631] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/23/2021] [Revised: 12/07/2021] [Accepted: 11/11/2021] [Indexed: 11/19/2022] Open
Abstract
With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resemble underlying transcripts, but patterns such as those seen between co-evolving sites, or mapped read counts, become obscured. We have created a de Bruijn based de novo assembler for RNA-Seq data that utilizes a classification system to describe the complexity of underlying graphs from which contigs are created. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist. A by-product of this is information on the range of complexity of the underlying gene families present. As a demonstration of CStone's ability to assemble high-quality contigs, and to label them in this manner, both simulated and real data were used. For simulated data, ten million read pairs were generated from cDNA libraries representing four species, Drosophila melanogaster, Panthera pardus, Rattus norvegicus and Serinus canaria. These were assembled using CStone, Trinity and rnaSPAdes; the latter two being high-quality, well-established de novo assemblers. For real data, two RNA-Seq datasets, each consisting of ≈30 million read pairs, representing two adult D. melanogaster whole-body samples were used. The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism. Here we describe the details of CStone's assembly and classification process, and propose that similar classification systems can be incorporated into other de novo assembly tools.
Within a related side study, we explore the effects that chimeras within reference sets have on the identification of differentially expressed genes. CStone is available at: https://sourceforge.net/projects/cstone/. Within transcriptome reference sets, non-chimeric sequences are representations of transcribed genes, while artificially generated chimeric ones are mosaics of two or more pieces of DNA incorrectly pieced together. One area where such sets are utilized is in the quantification of gene expression patterns, where RNA-Seq reads are mapped to the sequences within, and subsequent count values reflect expression levels. Artificial chimeras can have a negative impact on count values by erroneously increasing variation in relation to the reads being mapped. Reference sets can be created from de novo assembled contigs, but chimeras can be introduced during the assembly process via the required traversal of graphs, representing gene families, constructed from the RNA-Seq data. Graph complexity determines how likely chimeras are to arise. We have created CStone, a de novo assembler that utilizes a classification system to describe such complexity. Contigs created by CStone are labelled in a manner that indicates whether or not they are non-chimeric. This encourages contig-dependent results to be presented with increased objectivity by maintaining the context of ambiguity associated with the assembly process. CStone has been tested extensively. Additionally, we have quantified the relationship between chimeras within reference sets and the identification of differentially expressed genes.
|
18
|
Krannich T, White WTJ, Niehus S, Holley G, Halldórsson BV, Kehr B. Population-scale detection of non-reference sequence variants using colored de Bruijn graphs. Bioinformatics 2021; 38:604-611. [PMID: 34726732 PMCID: PMC8756200 DOI: 10.1093/bioinformatics/btab749] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/12/2021] [Revised: 09/27/2021] [Accepted: 10/28/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION: With the increasing throughput of sequencing technologies, structural variant (SV) detection has become possible across tens of thousands of genomes. Non-reference sequence (NRS) variants have drawn less attention compared with other types of SVs due to the computational complexity of detecting them. When using short-read data, the detection of NRS variants inevitably involves a de novo assembly which requires high-quality sequence data at high coverage. Previous studies have demonstrated how sequence data of multiple genomes can be combined for the reliable detection of NRS variants. However, the algorithms proposed in these studies have limited scalability to larger sets of genomes. RESULTS: We introduce PopIns2, a tool to discover and characterize NRS variants in many genomes, which scales to considerably larger numbers of genomes than its predecessor PopIns. In this article, we briefly outline the PopIns2 workflow and highlight our novel algorithmic contributions. We developed an entirely new approach for merging contig assemblies of unaligned reads from many genomes into a single set of NRS using a colored de Bruijn graph. Our tests on simulated data indicate that the new merging algorithm ranks among the best approaches in terms of quality and reliability and that PopIns2 shows the best precision for a growing number of genomes processed. Results on the Polaris Diversity Cohort and a set of 1000 Icelandic human genomes demonstrate unmatched scalability for the application on population-scale datasets. AVAILABILITY AND IMPLEMENTATION: The source code of PopIns2 is available from https://github.com/kehrlab/PopIns2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | | | - Sebastian Niehus
- Regensburg Center for Interventional Immunology (RCI), 93053 Regensburg, Germany
| | | | - Bjarni V Halldórsson
- deCODE Genetics, Reykjavík 102, Iceland; Department of Engineering, School of Technology, Reykjavík University, Reykjavík 102, Iceland
| | - Birte Kehr
- To whom correspondence should be addressed.
| |
|
19
|
Danciu D, Karasikov M, Mustafa H, Kahles A, Rätsch G. Topology-based sparsification of graph annotations. Bioinformatics 2021; 37:i169-i176. [PMID: 34252940 PMCID: PMC8346655 DOI: 10.1093/bioinformatics/btab330] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Accepted: 05/03/2021] [Indexed: 01/03/2023] Open
Abstract
Motivation: Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. Results: In this article, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10 000 RNA-seq datasets show that RowDiff combined with multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST. Availability and implementation: RowDiff is implemented in C++ within the MetaGraph framework. The source code and the data used in the experiments are publicly available at https://github.com/ratschlab/row_diff.
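The idea of exploiting similar annotations on adjacent vertices can be sketched as follows. This is a simplified illustration only, not the RowDiff implementation: it assumes each vertex has at most one designated successor, that successor chains end at an anchor vertex (successor `None`) without cycles, and all names are hypothetical.

```python
def sparsify(rows, successor):
    """Store each row as its symmetric difference with the successor's row.
    rows: {vertex: set of labels}; successor: {vertex: next vertex or None}.
    Anchor vertices (no successor) keep their full row."""
    diffs = {}
    for vertex, labels in rows.items():
        nxt = successor.get(vertex)
        diffs[vertex] = set(labels) if nxt is None else set(labels) ^ set(rows[nxt])
    return diffs


def reconstruct(vertex, diffs, successor):
    """Recover the original row by XOR-ing the diffs along the successor chain."""
    labels = set()
    while vertex is not None:
        labels ^= diffs[vertex]
        vertex = successor.get(vertex)
    return labels


# Three vertices on a path a -> b -> c with nearly identical annotations.
rows = {"a": {"s1", "s2"}, "b": {"s1", "s2"}, "c": {"s1", "s2", "s3"}}
successor = {"a": "b", "b": "c", "c": None}

diffs = sparsify(rows, successor)
stored_before = sum(len(r) for r in rows.values())
stored_after = sum(len(d) for d in diffs.values())
print(stored_before, "->", stored_after)
```

Because neighbouring rows are similar, most diffs are empty or tiny, so the transformed matrix has fewer set bits while every original row remains recoverable.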
Affiliation(s)
- Daniel Danciu
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
| | - Mikhail Karasikov
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Harun Mustafa
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - André Kahles
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Gunnar Rätsch
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Zurich, Switzerland; Department of Biology, ETH Zurich, Zurich, Switzerland
| |
|
20
|
Alanko J, Alipanahi B, Settle J, Boucher C, Gagie T. Buffering updates enables efficient dynamic de Bruijn graphs. Comput Struct Biotechnol J 2021; 19:4067-4078. [PMID: 34377371 PMCID: PMC8326735 DOI: 10.1016/j.csbj.2021.06.047] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/15/2021] [Revised: 06/29/2021] [Accepted: 06/29/2021] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION: The de Bruijn graph has become a ubiquitous graph model for biological data ever since its initial introduction in the late 1990s. It has been used for a variety of purposes including genome assembly (Zerbino and Birney, 2008; Bankevich et al., 2012; Peng et al., 2012), variant detection (Alipanahi et al., 2020b; Iqbal et al., 2012), and storage of assembled genomes (Chikhi et al., 2016). For this reason, there have been over a dozen methods for building and representing the de Bruijn graph and its variants in a space- and time-efficient manner. RESULTS: With the exception of a few data structures (Muggli et al., 2019; Holley and Melsted, 2020; Crawford et al., 2018), compressed and compact de Bruijn graphs do not allow for the graph to be efficiently updated, meaning that data can be added or deleted. The most recent compressed dynamic de Bruijn graph (Alipanahi et al., 2020a) relies on dynamic bit vectors, which are slow in theory and practice. To address this shortcoming, we present a compressed dynamic de Bruijn graph that removes the necessity of dynamic bit vectors by buffering data that should be added or removed from the graph. We implement our method, which we refer to as BufBOSS, and compare its performance to Bifrost, DynamicBOSS, and FDBG. Our experiments demonstrate that BufBOSS achieves attractive trade-offs compared to other tools in terms of time, memory and disk, and has the best deletion performance by an order of magnitude.
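The buffering strategy described in the abstract can be illustrated with a toy k-mer set. This is only a sketch of the general pattern, not BufBOSS: a `frozenset` stands in for the static compressed structure, and the class name, the `buffer_limit` parameter and the rebuild policy are all assumptions made for the example.

```python
class BufferedKmerSet:
    """Static k-mer set plus small add/delete buffers that are merged
    into a rebuilt static structure once they grow past a limit."""

    def __init__(self, kmers=(), buffer_limit=4):
        self._static = frozenset(kmers)  # stand-in for a static compressed index
        self._added = set()
        self._deleted = set()
        self._limit = buffer_limit
        self.rebuilds = 0

    def add(self, kmer):
        self._deleted.discard(kmer)
        if kmer not in self._static:
            self._added.add(kmer)
        self._maybe_flush()

    def delete(self, kmer):
        self._added.discard(kmer)
        if kmer in self._static:
            self._deleted.add(kmer)
        self._maybe_flush()

    def __contains__(self, kmer):
        # Queries consult the buffers first, then the static structure.
        if kmer in self._added:
            return True
        return kmer in self._static and kmer not in self._deleted

    def _maybe_flush(self):
        if len(self._added) + len(self._deleted) >= self._limit:
            # One bulk rebuild replaces many small in-place updates.
            self._static = frozenset((self._static - self._deleted) | self._added)
            self._added.clear()
            self._deleted.clear()
            self.rebuilds += 1


s = BufferedKmerSet(["ACG", "CGT"], buffer_limit=3)
s.add("GTA")
s.delete("ACG")
print("GTA" in s, "ACG" in s, s.rebuilds)
s.add("TAC")  # third buffered update triggers a rebuild
print("TAC" in s, "CGT" in s, s.rebuilds)
```

The design trade-off is that the static structure never needs dynamic bit vectors; updates are cheap until the buffer fills, at which point a single rebuild absorbs them.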
Affiliation(s)
- Jarno Alanko
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
| | - Bahar Alipanahi
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| | - Jonathen Settle
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| | - Travis Gagie
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| |
|
21
|
Rahman A, Chikhi R, Medvedev P. Disk compression of k-mer sets. Algorithms Mol Biol 2021; 16:10. [PMID: 34154632 PMCID: PMC8218509 DOI: 10.1186/s13015-021-00192-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/05/2021] [Accepted: 06/08/2021] [Indexed: 12/23/2022] Open
Abstract
K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.
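To make the term "spectrum-preserving string set" concrete, the short sketch below checks that a smaller set of strings contains exactly the same k-mers as an explicit k-mer list. It illustrates the property only and is not UST-Compress or ESS-Compress; reverse complements are ignored and the helper names are hypothetical.

```python
def spectrum(strings, k):
    """Return the set of all k-mers occurring in the given strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def total_length(strings):
    """Total number of characters needed to write the strings out."""
    return sum(len(s) for s in strings)


k = 3
kmer_list = ["ACG", "CGT", "GTA", "TTC"]  # one string per k-mer
merged = ["ACGTA", "TTC"]                 # overlapping k-mers glued together

# Both representations encode the same k-mer set ...
assert spectrum(merged, k) == set(kmer_list)
# ... but the merged one needs fewer characters.
print(total_length(kmer_list), "->", total_length(merged))  # prints "12 -> 8"
```

Overlapping k-mers share k-1 characters, so gluing them into longer strings removes that redundancy before any general-purpose compressor is applied.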
Affiliation(s)
| | - Rayan Chikhi
- Department of Computational Biology, C3BI USR 3756 CNRS, Institut Pasteur, Paris, France
| | | |
|
22
|
High-Resolution Genomic Comparisons within Salmonella enterica Serotypes Derived from Beef Feedlot Cattle: Parsing the Roles of Cattle Source, Pen, Animal, Sample Type, and Production Period. Appl Environ Microbiol 2021; 87:e0048521. [PMID: 33863705 DOI: 10.1128/aem.00485-21] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/20/2022] Open
Abstract
Salmonella enterica is a major foodborne pathogen, and contaminated beef products have been identified as one of the primary sources of Salmonella-related outbreaks. Pathogenicity and antibiotic resistance of Salmonella are highly serotype and subpopulation specific, which makes it essential to understand high-resolution Salmonella population dynamics in cattle. Time of year, source of cattle, pen, and sample type (i.e., feces, hide, or lymph nodes) have previously been identified as important factors influencing the serotype distribution of Salmonella (e.g., Anatum, Lubbock, Cerro, Montevideo, Kentucky, Newport, and Norwich) that were isolated from a longitudinal sampling design in a research feedlot. In this study, we performed high-resolution genomic comparisons of Salmonella isolates within each serotype using both single-nucleotide polymorphism-based maximum-likelihood phylogeny and hierarchical clustering of core-genome multilocus sequence typing. The importance of the aforementioned features in clonal Salmonella expansion was further explored using a supervised machine learning algorithm. In addition, we identified and compared the resistance genes, plasmids, and pathogenicity island profiles of the isolates within each subpopulation. Our findings indicate that clonal expansion of Salmonella strains in cattle was mainly influenced by the randomization of block and pen, as well as the origin/source of the cattle, i.e., regardless of sampling time and sample type (i.e., feces, lymph node, or hide). Further research is needed concerning the role of the feedlot pen environment prior to cattle placement to better understand carryover contributions of existing strains of Salmonella and their bacteriophages. IMPORTANCE Salmonella serotypes isolated from outbreaks in humans can also be found in beef cattle and feedlots. Virulence factors and antibiotic resistance are among the primary defense mechanisms of Salmonella, and are often associated with clonal expansion. 
This makes understanding the subpopulation dynamics of Salmonella in cattle critical for effective mitigation. There remains a gap in the literature concerning subpopulation dynamics within Salmonella serotypes in feedlot cattle from the beginning of feeding up until slaughter. Here, we explore Salmonella population dynamics within each serotype using core-genome phylogeny and hierarchical classifications. We used machine learning to quantitatively parse the relative importance of both hierarchical and longitudinal clustering among cattle host samples. Our results reveal that Salmonella populations in cattle are highly clonal over a 6-month study period and that clonal dissemination of Salmonella in cattle is mainly influenced spatially by experimental block and pen, as well as by the geographical origin of the cattle.
|
23
|
Alipanahi B, Muggli MD, Jundi M, Noyes NR, Boucher C. Metagenome SNP calling via read-colored de Bruijn graphs. Bioinformatics 2021; 36:5275-5281. [PMID: 32049324 DOI: 10.1093/bioinformatics/btaa081] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/16/2018] [Revised: 01/08/2020] [Accepted: 02/03/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Metagenomics refers to the study of complex samples containing the genetic content of multiple individual organisms and has thus been used to elucidate the microbiome and resistome of a complex sample. The microbiome refers to all microbial organisms in a sample, and the resistome refers to all of the antimicrobial resistance (AMR) genes in pathogenic and non-pathogenic bacteria. Single-nucleotide polymorphisms (SNPs) can be effectively used to 'fingerprint' specific organisms and genes within the microbiome and resistome and trace their movement across various samples. However, to effectively use these SNPs for this traceability, a scalable and accurate metagenomics SNP caller is needed. Moreover, such an SNP caller should not be reliant on reference genomes since 95% of microbial species are unculturable, making the determination of a reference genome extremely challenging. In this article, we address this need. RESULTS: We present LueVari, a reference-free SNP caller based on the read-colored de Bruijn graph, an extension of the traditional de Bruijn graph that allows repeated regions longer than the k-mer length and shorter than the read length to be identified unambiguously. LueVari is able to identify SNPs in both AMR genes and chromosomal DNA from shotgun metagenomics data with reliable sensitivity (between 91% and 99%) and precision (between 71% and 99%), whereas the performance of competing methods varies widely. Furthermore, we show that LueVari constructs sequences containing the variation, which span up to 97.8% of genes in datasets, which can be helpful in detecting distinct AMR genes in large metagenomic datasets. AVAILABILITY AND IMPLEMENTATION: Code and datasets are publicly available at https://github.com/baharpan/cosmo/tree/LueVari. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bahar Alipanahi
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Martin D Muggli
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Musa Jundi
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Noelle R Noyes
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Christina Boucher
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| |
|
24
|
Horesh G, Blackwell GA, Tonkin-Hill G, Corander J, Heinz E, Thomson NR. A comprehensive and high-quality collection of Escherichia coli genomes and their genes. Microb Genom 2021; 7:000499. [PMID: 33417534 PMCID: PMC8208696 DOI: 10.1099/mgen.0.000499] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 09/21/2020] [Accepted: 12/07/2020] [Indexed: 01/25/2023] Open
Abstract
Escherichia coli is a highly diverse organism that includes a range of commensal and pathogenic variants found across a range of niches and worldwide. In addition to causing severe intestinal and extraintestinal disease, E. coli is considered a priority pathogen due to high levels of observed drug resistance. The diversity in the E. coli population is driven by high genome plasticity and a very large gene pool. All these have made E. coli one of the most well-studied organisms, as well as a commonly used laboratory strain. Today, there are thousands of sequenced E. coli genomes stored in public databases. While data is widely available, accessing the information in order to perform analyses can still be a challenge. Collecting relevant available data requires accessing different sources, where data may be stored in a range of formats, and often requires further manipulation and processing to apply various analyses and extract useful information. In this study, we collated and intensely curated a collection of over 10 000 E. coli and Shigella genomes to provide a single, uniform, high-quality dataset. Shigella were included as they are considered specialized pathovars of E. coli. We provide these data in a number of easily accessible formats that can be used as the foundation for future studies addressing the biological differences between E. coli lineages and the distribution and flow of genes in the E. coli population at a high resolution. The analysis we present emphasizes our lack of understanding of the true diversity of the E. coli species, and the biased nature of our current understanding of the genetic diversity of such a key pathogen.
Affiliation(s)
- Gal Horesh
- Parasites and Microbes, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1RQ, UK
| | - Grace A. Blackwell
- Parasites and Microbes, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1RQ, UK
- EMBL-EBI, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
| | - Gerry Tonkin-Hill
- Parasites and Microbes, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1RQ, UK
| | - Jukka Corander
- Parasites and Microbes, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1RQ, UK
- Department of Biostatistics, University of Oslo, Oslo, Norway
- Department of Mathematics and Statistics, Helsinki Institute for Information Technology (HIIT), University of Helsinki, Helsinki, Finland
| | - Eva Heinz
- Department of Vector Biology and Clinical Sciences, Liverpool School of Tropical Medicine, Liverpool L3 5QA, UK
| | - Nicholas R. Thomson
- Parasites and Microbes, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1RQ, UK
- Department of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London WC1E 7HT, UK
| |
|
25
|
Taiwo AO, Harper LA, Derbyshire MC. Impacts of fludioxonil resistance on global gene expression in the necrotrophic fungal plant pathogen Sclerotinia sclerotiorum. BMC Genomics 2021; 22:91. [PMID: 33516198 PMCID: PMC7847169 DOI: 10.1186/s12864-021-07402-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/27/2020] [Accepted: 01/21/2021] [Indexed: 01/23/2023]
Abstract
Background: The fungicide fludioxonil over-stimulates the fungal response to osmotic stress, leading to over-accumulation of glycerol and hyphal swelling and bursting. Fludioxonil-resistant fungal strains that are null-mutants for osmotic stress response genes are easily generated through continual sub-culturing on sub-lethal fungicide doses. Using this approach combined with RNA sequencing, we aimed to characterise the effects of mutations in osmotic stress response genes on the transcriptional profile of the important agricultural pathogen Sclerotinia sclerotiorum under standard laboratory conditions. Our objective was to understand the impact of disruption of the osmotic stress response on the global transcriptional regulatory network in an important agricultural pathogen. Results: We generated two fludioxonil-resistant S. sclerotiorum strains, which exhibited growth defects and hypersensitivity to osmotic stressors. Both had missense mutations in the homologue of the Neurospora crassa osmosensing two component histidine kinase gene OS1, and one had a disruptive in-frame deletion in a non-associated gene. RNA sequencing showed that both strains together differentially expressed 269 genes relative to the parent during growth in liquid broth. Of these, 185 (69%) were differentially expressed in both strains in the same direction, indicating similar effects of the different point mutations in OS1 on the transcriptome. Among these genes were numerous transmembrane transporters and secondary metabolite biosynthetic genes. Conclusions: Our study is an initial investigation into the kinds of processes regulated through the osmotic stress pathway in S. sclerotiorum. It highlights a possible link between secondary metabolism and osmotic stress signalling, which could be followed up in future studies. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-021-07402-x.
Affiliation(s)
- Akeem O Taiwo
- Centre for Crop and Disease Management, School of Molecular and Life Sciences, Curtin University, Perth, Australia
| | - Lincoln A Harper
- Centre for Crop and Disease Management, School of Molecular and Life Sciences, Curtin University, Perth, Australia
| | - Mark C Derbyshire
- Centre for Crop and Disease Management, School of Molecular and Life Sciences, Curtin University, Perth, Australia.
| |
|
26
|
Holley G, Beyter D, Ingimundardottir H, Møller PL, Kristmundsdottir S, Eggertsson HP, Halldorsson BV. Ratatosk: hybrid error correction of long reads enables accurate variant calling and assembly. Genome Biol 2021; 22:28. [PMID: 33419473 PMCID: PMC7792008 DOI: 10.1186/s13059-020-02244-4] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 07/17/2020] [Accepted: 12/15/2020] [Indexed: 12/20/2022]
Abstract
A major challenge of long-read sequencing data is their high error rate of up to 15%. We present Ratatosk, a method to correct long reads with short-read data. We demonstrate on 5 human genome trios that Ratatosk reduces the error rate of long reads 6-fold on average, with a median error rate as low as 0.22%. SNP calls in Ratatosk-corrected reads are nearly 99% accurate, and indel-call accuracy is increased by up to 37%. An assembly of Ratatosk-corrected reads from an Ashkenazi individual yields a contig N50 of 45 Mbp and fewer misassemblies than a PacBio HiFi reads assembly.
Affiliation(s)
| | | | | | - Peter L Møller
- Department of Biomedicine, Aarhus University, Aarhus, Denmark
| | - Snædis Kristmundsdottir
- deCODE genetics/Amgen Inc., Reykjavík, Iceland
- School of Technology, Reykjavik University, Reykjavík, Iceland
| | | | - Bjarni V Halldorsson
- deCODE genetics/Amgen Inc., Reykjavík, Iceland
- School of Technology, Reykjavik University, Reykjavík, Iceland
| |
|
27
|
Mutant and Recombinant Phages Selected from In Vitro Coevolution Conditions Overcome Phage-Resistant Listeria monocytogenes. Appl Environ Microbiol 2020; 86:AEM.02138-20. [PMID: 32887717 DOI: 10.1128/aem.02138-20] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 08/28/2020] [Accepted: 08/31/2020] [Indexed: 12/17/2022]
Abstract
Bacteriophages (phages) are currently available for use by the food industry to control the foodborne pathogen Listeria monocytogenes. Although phage biocontrols are effective under specific conditions, their use can select for phage-resistant bacteria that repopulate phage-treated environments. Here, we performed short-term coevolution experiments to investigate the impact of single phages and a two-phage cocktail on the regrowth of phage-resistant L. monocytogenes and the adaptation of the phages to overcome this resistance. We used whole-genome sequencing to identify mutations in the target host that confer phage resistance and in the phages that alter host range. We found that infections with Listeria phages LP-048, LP-125, or a combination of both select for different populations of phage-resistant L. monocytogenes bacteria with different regrowth times. Phages isolated from the end of the coevolution experiments were found to have gained the ability to infect phage-resistant mutants of L. monocytogenes and L. monocytogenes strains previously found to be broadly resistant to phage infection. Phages isolated from coinfected cultures were identified as recombinants of LP-048 and LP-125. Interestingly, recombination events occurred twice independently in a locus encoding two proteins putatively involved in DNA binding. We show that short-term coevolution of phages and their hosts can be utilized to obtain mutant and recombinant phages with adapted host ranges. These laboratory-evolved phages may be useful for limiting the emergence of phage resistance and for targeting strains that show general resistance to wild-type (WT) phages.
IMPORTANCE: Listeria monocytogenes is a life-threatening bacterial foodborne pathogen that can persist in food processing facilities for years. Phages can be used to control L. monocytogenes in food production, but phage-resistant bacterial subpopulations can regrow in phage-treated environments. Coevolution experiments were conducted on a Listeria phage-host system to provide insight into the genetic variation that emerges in both the phage and bacterial host under reciprocal selective pressure. As expected, mutations were identified in both phage and host, but additionally, recombination events were shown to have repeatedly occurred between closely related phages that coinfected L. monocytogenes. This study demonstrates that in vitro evolution of phages can be utilized to expand the host range and improve the long-term efficacy of phage-based control of L. monocytogenes. This approach may also be applied to other phage-host systems for applications in biocontrol, detection, and phage therapy.
|
28
|
The ecological and genomic basis of explosive adaptive radiation. Nature 2020; 586:75-79. [DOI: 10.1038/s41586-020-2652-7] [Citation(s) in RCA: 87] [Impact Index Per Article: 17.4] [Received: 02/14/2020] [Accepted: 05/22/2020] [Indexed: 12/22/2022]
|
29
|
Garimella KV, Iqbal Z, Krause MA, Campino S, Kekre M, Drury E, Kwiatkowski D, Sá JM, Wellems TE, McVean G. Detection of simple and complex de novo mutations with multiple reference sequences. Genome Res 2020; 30:1154-1169. [PMID: 32817236 PMCID: PMC7462078 DOI: 10.1101/gr.255505.119] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 08/02/2019] [Accepted: 07/17/2020] [Indexed: 12/25/2022]
Abstract
The characterization of de novo mutations in regions of high sequence and structural diversity from whole-genome sequencing data remains highly challenging. Complex structural variants tend to arise in regions of high repetitiveness and low complexity, challenging both de novo assembly, in which short reads do not capture the long-range context required for resolution, and mapping approaches, in which improper alignment of reads to a reference genome that is highly diverged from that of the sample can lead to false or partial calls. Long-read technologies can potentially solve such problems but are currently unfeasible to use at scale. Here we present Corticall, a graph-based method that combines the advantages of multiple technologies and prior data sources to detect arbitrary classes of genetic variant. We construct multisample, colored de Bruijn graphs from short-read data for all samples, align long-read–derived haplotypes and multiple reference data sources to restore graph connectivity information, and call variants using graph path-finding algorithms and a model for simultaneous alignment and recombination. We validate and evaluate the approach using extensive simulations and use it to characterize the rate and spectrum of de novo mutation events in 119 progeny from four Plasmodium falciparum experimental crosses, using long-read data on the parents to inform reconstructions of the progeny and to detect several known and novel nonallelic homologous recombination events.
Affiliation(s)
- Kiran V Garimella
- Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.,Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom.,Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, Oxfordshire, OX3 7LF, United Kingdom
| | - Zamin Iqbal
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom.,European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, United Kingdom
| | - Michael A Krause
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom.,The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom.,Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Susana Campino
- The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
| | - Mihir Kekre
- The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
| | - Eleanor Drury
- The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
| | - Dominic Kwiatkowski
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, Oxfordshire, OX3 7LF, United Kingdom.,The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
| | - Juliana M Sá
- Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Thomas E Wellems
- Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Gil McVean
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom.,Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, Oxfordshire, OX3 7LF, United Kingdom
| |
|
30
|
Petit RA, Read TD. Bactopia: a Flexible Pipeline for Complete Analysis of Bacterial Genomes. mSystems 2020; 5:e00190-20. [PMID: 32753501 PMCID: PMC7406220 DOI: 10.1128/msystems.00190-20] [Citation(s) in RCA: 107] [Impact Index Per Article: 21.4] [Received: 04/23/2020] [Accepted: 07/15/2020] [Indexed: 12/19/2022]
Abstract
Sequencing of bacterial genomes using Illumina technology has become such a standard procedure that often data are generated faster than can be conveniently analyzed. We created a new series of pipelines called Bactopia, built using Nextflow workflow software, to provide efficient comparative genomic analyses for bacterial species or genera. Bactopia consists of a data set setup step (Bactopia Data Sets [BaDs]), which creates a series of customizable data sets for the species of interest; the Bactopia Analysis Pipeline (BaAP), which performs quality control, genome assembly, and several other functions based on the available data sets and outputs the processed data to a structured directory format; and a series of Bactopia Tools (BaTs) that perform specific postprocessing on some or all of the processed data. BaTs include pan-genome analysis, computing average nucleotide identity between samples, extracting and profiling the 16S genes, and taxonomic classification using highly conserved genes. It is expected that the number of BaTs will increase to fill specific applications in the future. As a demonstration, we performed an analysis of 1,664 public Lactobacillus genomes, focusing on Lactobacillus crispatus, a species that is a common part of the human vaginal microbiome. Bactopia is an open source system that can scale from projects as small as one bacterial genome to ones including thousands of genomes and that allows for great flexibility in choosing comparison data sets and options for downstream analysis. Bactopia code can be accessed at https://www.github.com/bactopia/bactopia.
IMPORTANCE: It is now relatively easy to obtain a high-quality draft genome sequence of a bacterium, but bioinformatic analysis requires organization and optimization of multiple open source software tools. We present Bactopia, a pipeline for bacterial genome analysis, as an option for processing bacterial genome data. Bactopia also automates downloading of data from multiple public sources and species-specific customization. Because the pipeline is written in the Nextflow language, analyses can be scaled from individual genomes on a local computer to thousands of genomes using cloud resources. As a usage example, we processed 1,664 Lactobacillus genomes from public sources and used comparative analysis workflows (Bactopia Tools) to identify and analyze members of the L. crispatus species.
Affiliation(s)
- Robert A Petit
- Division of Infectious Diseases, Department of Medicine, Emory University School of Medicine, Atlanta, Georgia, USA
| | - Timothy D Read
- Division of Infectious Diseases, Department of Medicine, Emory University School of Medicine, Atlanta, Georgia, USA
| |
|
31
|
A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol 2020; 39:105-114. [PMID: 32690973 PMCID: PMC7801254 DOI: 10.1038/s41587-020-0603-3] [Citation(s) in RCA: 688] [Impact Index Per Article: 137.6] [Received: 09/18/2019] [Accepted: 05/31/2020] [Indexed: 01/08/2023]
Abstract
Comprehensive, high-quality reference genomes are required for functional characterization and taxonomic assignment of the human gut microbiota. We present the Unified Human Gastrointestinal Genome (UHGG) collection, comprising 204,938 nonredundant genomes from 4,644 gut prokaryotes. These genomes encode >170 million protein sequences, which we collated in the Unified Human Gastrointestinal Protein (UHGP) catalog. The UHGP more than doubles the number of gut proteins in comparison to those present in the Integrated Gene Catalog. More than 70% of the UHGG species lack cultured representatives, and 40% of the UHGP lack functional annotations. Intraspecies genomic variation analyses revealed a large reservoir of accessory genes and single-nucleotide variants, many of which are specific to individual human populations. The UHGG and UHGP collections will enable studies linking genotypes to phenotypes in the human gut microbiome. More than 200,000 gut prokaryotic reference genomes and the proteins they encode are collated, providing comprehensive resources for microbiome researchers.
|
32
|
Listeria monocytogenes is prevalent in retail produce environments but Salmonella enterica is rare. Food Control 2020. [DOI: 10.1016/j.foodcont.2020.107173] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Indexed: 12/16/2022]
|
33
|
Eizenga JM, Novak AM, Sibbesen JA, Heumos S, Ghaffaari A, Hickey G, Chang X, Seaman JD, Rounthwaite R, Ebler J, Rautiainen M, Garg S, Paten B, Marschall T, Sirén J, Garrison E. Pangenome Graphs. Annu Rev Genomics Hum Genet 2020; 21:139-162. [PMID: 32453966 DOI: 10.1146/annurev-genom-120219-080406] [Citation(s) in RCA: 136] [Impact Index Per Article: 27.2] [Indexed: 12/12/2022]
Abstract
Low-cost whole-genome assembly has enabled the collection of haplotype-resolved pangenomes for numerous organisms. In turn, this technological change is encouraging the development of methods that can precisely address the sequence and variation described in large collections of related genomes. These approaches often use graphical models of the pangenome to support algorithms for sequence alignment, visualization, functional genomics, and association studies. The additional information provided to these methods by the pangenome allows them to achieve superior performance on a variety of bioinformatic tasks, including read alignment, variant calling, and genotyping. Pangenome graphs stand to become a ubiquitous tool in genomics. Although it is unclear whether they will replace linear reference genomes, their ability to harmoniously relate multiple sequence and coordinate systems will make them useful irrespective of which pangenomic models become most common in the future.
Affiliation(s)
- Jordan M Eizenga
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Adam M Novak
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Jonas A Sibbesen
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Simon Heumos
- Quantitative Biology Center, University of Tübingen, 72076 Tübingen, Germany
| | - Ali Ghaffaari
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Glenn Hickey
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Xian Chang
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Josiah D Seaman
- Royal Botanic Gardens, Kew, Richmond TW9 3AB, United Kingdom.,School of Biological and Chemical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
| | - Robin Rounthwaite
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Jana Ebler
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Mikko Rautiainen
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Shilpa Garg
- Departments of Genetics and Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02215, USA.,Department of Data Sciences, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
| | - Benedict Paten
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany
| | - Jouni Sirén
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Erik Garrison
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| |
|
34
|
Almodaresi F, Pandey P, Ferdman M, Johnson R, Patro R. An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search. J Comput Biol 2020; 27:485-499. [PMID: 32176522 PMCID: PMC7185321 DOI: 10.1089/cmb.2019.0322] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 11/13/2022]
Abstract
The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure. In this article, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes (patterns of color occurrence) present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into thousands. We apply this encoding in the context of two different applications: the implicit cdbg used for a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved >11× better compression compared to Raman, Raman, and Rao (RRR).
Affiliation(s)
- Fatemeh Almodaresi
- Department of Computer Science, University of Maryland, College Park, Maryland
| | - Prashant Pandey
- School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
| | - Michael Ferdman
- Department of Computer Science, Stony Brook University, Stony Brook, New York
| | - Rob Johnson
- Department of Computer Science, Stony Brook University, Stony Brook, New York
- VMware Research, Palo Alto, California
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, Maryland
| |
|
35
|
Homburgvirus LP-018 Has a Unique Ability to Infect Phage-Resistant Listeria monocytogenes. Viruses 2019; 11:v11121166. [PMID: 31861087 PMCID: PMC6950383 DOI: 10.3390/v11121166] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 10/22/2019] [Revised: 12/11/2019] [Accepted: 12/15/2019] [Indexed: 12/17/2022]
Abstract
Listeria phage LP-018 is the only phage from a diverse collection of 120 phages able to form plaques on a phage-resistant Listeria monocytogenes strain lacking rhamnose in its cell wall teichoic acids. The aim of this study was to characterize phage LP-018 and to identify what types of mutations can confer resistance to LP-018. Whole genome sequencing and transmission electron microscopy revealed LP-018 to be a member of the Homburgvirus genus. One-step-growth curve analysis of LP-018 revealed an eclipse period of ~60-90 min and a burst size of ~2 PFU per infected cell. Despite slow growth and small burst size, LP-018 can inhibit the growth of Listeria monocytogenes at a high multiplicity of infection. Ten distinct LP-018-resistant mutants were isolated from infected Listeria monocytogenes 10403S and characterized by whole genome sequencing. In each mutant, a single mutation was identified in either the LMRG_00278 or LMRG_01613 encoding genes. Interestingly, LP-018 was able to bind to a representative phage-resistant mutant with a mutation in each gene, suggesting these mutations confer resistance through a mechanism independent of adsorption inhibition. Despite forming plaques on the rhamnose-deficient 10403S mutant, LP-018 showed reduced binding efficiency, and we did not observe inhibition of the strain under the conditions tested. Two mutants of LP-018 were also isolated and characterized, one with a single SNP in a gene encoding a BppU domain protein that likely alters its host range. LP-018 is shown to be a unique Listeria phage that, with additional evaluation, may be useful in biocontrol applications that aim to reduce the emergence of phage resistance.
|
36
|
Complete Genome Sequences of Two Listeria Phages of the Genus Pecentumvirus. Microbiol Resour Announc 2019; 8:8/46/e01229-19. [PMID: 31727716 PMCID: PMC6856282 DOI: 10.1128/mra.01229-19] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/20/2022]
Abstract
Bacteriophages isolated from environmental sources can be used as a biocontrol against the foodborne pathogen Listeria monocytogenes. Here, we present the complete genomes of LP-039 and LP-066, two Pecentumvirus bacteriophages that infect L. monocytogenes. The genome sizes of LP-039 and LP-066 are 136.2 kb and 139.0 kb, respectively.
|
37
|
Rivera D, Hudson LK, Denes TG, Hamilton-West C, Pezoa D, Moreno-Switt AI. Two Phages of the Genera Felixounavirus Subjected to 12 Hour Challenge on Salmonella Infantis Showed Distinct Genotypic and Phenotypic Changes. Viruses 2019; 11:E586. [PMID: 31252667 PMCID: PMC6669636 DOI: 10.3390/v11070586] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 05/15/2019] [Revised: 06/23/2019] [Accepted: 06/25/2019] [Indexed: 12/15/2022]
Abstract
Salmonella Infantis has in recent years been considered an emerging Salmonella serovar, as it has been associated with several outbreaks and multidrug resistance phenotypes. Phages appear as a possible alternative strategy to control Salmonella Infantis (SI). The aims of this work were to characterize two phages of the Felixounavirus genus, isolated using the same strain of SI, and to expose them to interact in challenge assays to identify genetic and phenotypic changes generated from these interactions. These two phages have a shared nucleotide identity of 97% and are differentiated by their host range: one phage has a wide host range (lysing 14 serovars), and the other has a narrow host range (lysing 6 serovars). During the 12 h challenge we compared: (1) optical density of SI, (2) proportion of SI survivors from phage-infected cultures, and (3) phage titer. Isolates obtained through the assays were evaluated by efficiency of plating (EOP) and by host-range characterization. Genomic modifications were characterized by evaluation of single nucleotide polymorphisms (SNPs). The optical density (600 nm) of phage-infected SI decreased, as compared to the uninfected control, by an average of 0.7 for SI infected with the wide-host-range (WHR) phage and by 0.3 for SI infected with the narrow-host-range (NHR) phage. The WHR phage reached a higher phage titer (7 × 10^11 PFU/mL), and a lower proportion of SI survivors was obtained from the challenge assay. In SI that interacted with phages, we identified SNPs in two genes (rfaK and rfaB), which are both involved in lipopolysaccharide (LPS) polymerization. Therefore, mutations that could impact potential phage receptors on the host surface were selected by lytic phage exposure. This work demonstrates that the interaction of Salmonella phages (WHR and NHR) with SI for 12 h in vitro leads to emergence of new phenotypic and genotypic traits in both phage and host. This information is crucial for the rational design of phage-based control strategies.
Affiliation(s)
- Dácil Rivera
  - Escuela de Medicina Veterinaria, Facultad de Ciencias de la Vida, Universidad Andres Bello, Santiago 8320000, Chile.
  - Departamento de Ciencia de los Alimentos y Tecnología Química, Facultad de Ciencias Químicas y Farmacéuticas, Universidad de Chile, Santiago 8380492, Chile.
- Lauren K Hudson
  - Department of Food Science, University of Tennessee, Knoxville, TN 37996, USA.
- Thomas G Denes
  - Department of Food Science, University of Tennessee, Knoxville, TN 37996, USA.
- Christopher Hamilton-West
  - Departamento de Medicina Preventiva Animal, Facultad de Ciencias Veterinarias, Universidad de Chile, Santiago 8330015, Chile.
- David Pezoa
  - Escuela de Medicina Veterinaria, Facultad de Ciencias, Universidad Mayor, Santiago 8580745, Chile.
- Andrea I Moreno-Switt
  - Escuela de Medicina Veterinaria, Facultad de Ciencias de la Vida, Universidad Andres Bello, Santiago 8320000, Chile.
  - Millennium Nucleus for Collaborative Research on Bacterial Resistance (MICROB-R), Santiago 7550000, Chile.
|
38
|
Cross-resistance to phage infection in Listeria monocytogenes serotype 1/2a mutants. Food Microbiol 2019; 84:103239. [PMID: 31421769 DOI: 10.1016/j.fm.2019.06.003] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 01/01/2019] [Revised: 05/31/2019] [Accepted: 06/03/2019] [Indexed: 01/22/2023]
Abstract
Bacteriophage-based biocontrols are one of several tools available to control Listeria monocytogenes in food and food processing environments. The objective of this study was to determine whether phage resistance that has been characterized with a select few Listeria phages would also confer resistance to a diverse collection of over 100 other Listeria phages. We show that some mutations that are likely to emerge in bacteriophage-treated populations of serotype 1/2a L. monocytogenes can lead to cross-resistance against almost all types of characterized Listeria phages. Out of the 120 phages that showed activity against the parental strain, only one could form visible plaques on the mutant strain of L. monocytogenes lacking rhamnose in its wall teichoic acids. An additional two phages showed signs of lytic activity against this mutant strain, although no visible plaques were observed. The findings presented here are consistent with other studies showing that mutations conferring phage resistance through loss of rhamnose likely pose the greatest challenge for phage-based biocontrol in serotype 1/2a strains.
|
39
|
Ultrafast search of all deposited bacterial and viral genomic data. Nat Biotechnol 2019; 37:152-159. [PMID: 30718882 PMCID: PMC6420049 DOI: 10.1038/s41587-018-0010-1] [Citation(s) in RCA: 71] [Impact Index Per Article: 11.8] [Received: 12/26/2017] [Accepted: 12/20/2018] [Indexed: 02/07/2023]
Abstract
Exponentially increasing amounts of unprocessed bacterial and viral genomic sequence data are stored in the global archives. The ability to query these data for sequence search-terms would facilitate both basic research and applications such as real-time genomic epidemiology and surveillance. However, this is not possible with current methods. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web-search to produce a searchable data structure named Bitsliced Genomic Signature Index (BIGSI). We indexed the entire global corpus of 447,833 bacterial and viral whole genome sequence datasets using 4 orders of magnitude less storage than previous methods. We applied our BIGSI search function to rapidly find resistance genes MCR-1/2/3, determine the host-range of 2827 plasmids, and quantify antibiotic resistance in archived datasets. Our index can grow incrementally as new (unprocessed or assembled) sequence datasets are deposited and can scale to millions of datasets.
|
40
|
Akhter S, Kretzschmar WW, Nordal V, Delhomme N, Street NR, Nilsson O, Emanuelsson O, Sundström JF. Integrative Analysis of Three RNA Sequencing Methods Identifies Mutually Exclusive Exons of MADS-Box Isoforms During Early Bud Development in Picea abies. FRONTIERS IN PLANT SCIENCE 2018; 9:1625. [PMID: 30483285 PMCID: PMC6243048 DOI: 10.3389/fpls.2018.01625] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 06/15/2018] [Accepted: 10/18/2018] [Indexed: 05/06/2023]
Abstract
Recent efforts to sequence the genomes and transcriptomes of several gymnosperm species have revealed an increased complexity in certain gene families in gymnosperms as compared to angiosperms. One example of this is the gymnosperm sister clade to angiosperm TM3-like MADS-box genes, which at least in the conifer lineage has expanded in number of genes. We have previously identified a member of this sub-clade, the conifer gene DEFICIENS AGAMOUS LIKE 19 (DAL19), as being specifically upregulated in cone-setting shoots. Here, we show through Sanger sequencing of mRNA-derived cDNA and mapping to assembled conifer genomic sequences that DAL19 produces six mature mRNA splice variants in Picea abies. These splice variants use alternate first and last exons, while their four central exons constitute a core region present in all six transcripts. Thus, they are likely to be transcript isoforms. Quantitative Real-Time PCR revealed that two mutually exclusive first DAL19 exons are differentially expressed across meristems that will form either male or female cones, or vegetative shoots. Furthermore, mRNA in situ hybridization revealed that two mutually exclusive last DAL19 exons were expressed in a cell-specific pattern within bud meristems. Based on these findings in DAL19, we developed a sensitive approach to transcript isoform assembly from short-read sequencing of mRNA. We applied this method to 42 putative MADS-box core regions in P. abies, from which we assembled 1084 putative transcripts. We manually curated these transcripts to arrive at 933 assembled transcript isoforms of 38 putative MADS-box genes. 152 of these isoforms, which we assign to 28 putative MADS-box genes, were differentially expressed across eight female, male, and vegetative buds. We further provide evidence of the expression of 16 out of the 38 putative MADS-box genes by mapping PacBio Iso-Seq circular consensus reads derived from pooled sample sequencing to assembled transcripts. 
In summary, our analyses reveal the use of mutually exclusive exons of MADS-box gene isoforms during early bud development in P. abies, and we find that the large number of identified MADS-box transcripts in P. abies results not only from expansion of the gene family through gene duplication events but also from the generation of numerous splice variants.
Affiliation(s)
- Shirin Akhter
  - Linnean Center for Plant Biology, Uppsala BioCenter, Department of Plant Biology, Swedish University of Agricultural Sciences, Uppsala, Sweden
- Warren W. Kretzschmar
  - Science for Life Laboratory, Department of Gene Technology, School of Engineering Sciences in Biotechnology, Chemistry and Health, KTH Royal Institute of Technology, Solna, Sweden
- Veronika Nordal
  - Linnean Center for Plant Biology, Uppsala BioCenter, Department of Plant Biology, Swedish University of Agricultural Sciences, Uppsala, Sweden
- Nicolas Delhomme
  - Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, Umeå, Sweden
- Nathaniel R. Street
  - Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, Umeå, Sweden
- Ove Nilsson
  - Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, Umeå, Sweden
- Olof Emanuelsson
  - Science for Life Laboratory, Department of Gene Technology, School of Engineering Sciences in Biotechnology, Chemistry and Health, KTH Royal Institute of Technology, Solna, Sweden
- Jens F. Sundström
  - Linnean Center for Plant Biology, Uppsala BioCenter, Department of Plant Biology, Swedish University of Agricultural Sciences, Uppsala, Sweden
  - *Correspondence: Jens F. Sundström,
|