1
Bálint B, Merényi Z, Hegedüs B, Grigoriev IV, Hou Z, Földi C, Nagy LG. ContScout: sensitive detection and removal of contamination from annotated genomes. Nat Commun 2024; 15:936. [PMID: 38296951 PMCID: PMC10831095 DOI: 10.1038/s41467-024-45024-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 01/06/2023] [Accepted: 01/08/2024] [Indexed: 02/02/2024] Open
Abstract
Contamination of genomes is an increasingly recognized problem affecting several downstream applications, from comparative evolutionary genomics to metagenomics. Here we introduce ContScout, a precise tool for eliminating foreign sequences from annotated genomes. It achieves high specificity and sensitivity on synthetic benchmark data even when the contaminant is a closely related species, outperforms competing tools, and can distinguish horizontal gene transfer from contamination. A screen of 844 eukaryotic genomes for contamination identified bacteria as the most common source, followed by fungi and plants. Furthermore, we show that contaminants in ancestral genome reconstructions lead to erroneous early origins of genes and inflate gene loss rates, leading to a false notion of complex ancestral genomes. Taken together, we offer here a tool for sensitive removal of foreign proteins, identify and remove contaminants from diverse eukaryotic genomes and evaluate their impact on phylogenomic analyses.
Affiliation(s)
- Balázs Bálint
  - Synthetic and Systems Biology Unit, HUN-REN Biological Research Centre, Szeged, Szeged, 6726, Hungary
- Zsolt Merényi
  - Synthetic and Systems Biology Unit, HUN-REN Biological Research Centre, Szeged, Szeged, 6726, Hungary
- Botond Hegedüs
  - Synthetic and Systems Biology Unit, HUN-REN Biological Research Centre, Szeged, Szeged, 6726, Hungary
- Igor V Grigoriev
  - U.S. Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
  - Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA, 94720, USA
- Zhihao Hou
  - Synthetic and Systems Biology Unit, HUN-REN Biological Research Centre, Szeged, Szeged, 6726, Hungary
  - Doctoral School of Biology, Faculty of Science and Informatics, University of Szeged, Szeged, 6720, Hungary
- Csenge Földi
  - Synthetic and Systems Biology Unit, HUN-REN Biological Research Centre, Szeged, Szeged, 6726, Hungary
  - Doctoral School of Biology, Faculty of Science and Informatics, University of Szeged, Szeged, 6720, Hungary
- László G Nagy
  - Synthetic and Systems Biology Unit, HUN-REN Biological Research Centre, Szeged, Szeged, 6726, Hungary
2
Nazarizadeh M, Nováková M, Drábková M, Catchen J, Olson PD, Štefka J. Highly resolved genome assembly and comparative transcriptome profiling reveal genes related to developmental stages of tapeworm Ligula intestinalis. Proc Biol Sci 2024; 291:20232563. [PMID: 38290545 PMCID: PMC10827431 DOI: 10.1098/rspb.2023.2563] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/16/2023] [Accepted: 01/02/2024] [Indexed: 02/01/2024] Open
Abstract
Ligula intestinalis (Cestoda: Diphyllobothriidae) is an emerging model organism for studies on parasite population biology and host-parasite interactions. However, a well-resolved genome and catalogue of its gene content has not been previously developed. Here, we present the first genome assembly of L. intestinalis, based on Oxford Nanopore Technologies, Illumina and Omni-C sequencing methodologies. We use transcriptome profiling to compare plerocercoid larvae and adult worms and identify differentially expressed genes (DEGs) associated with these life stages. The genome assembly is 775.3 mega (M)bp in size, with scaffold N50 value of 118 Mbp and encodes 27 256 predicted protein-coding sequences. Over 60% of the genome consists of repetitive sequences. Synteny analyses showed that the 10 largest scaffolds representing 75% of the genome display high correspondence to full chromosomes of cyclophyllidean tapeworms. Mapping RNA-seq data to the new reference genome, we identified 3922 differentially expressed genes in adults compared with plerocercoids. Gene ontology analyses revealed over-represented genes involved in reproductive development of the adult stage (e.g. sperm production) and significantly enriched DEGs associated with immune evasion of plerocercoids in their fish host. This study provides the first insights into the molecular biology of L. intestinalis and provides the most highly contiguous assembly to date of a diphyllobothriid tapeworm useful for population and comparative genomic investigations of parasitic flatworms.
Affiliation(s)
- Masoud Nazarizadeh
  - Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
  - Institute of Parasitology, Biology Centre CAS, České Budějovice, Czech Republic
- Milena Nováková
  - Institute of Parasitology, Biology Centre CAS, České Budějovice, Czech Republic
- Marie Drábková
  - Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
- Julian Catchen
  - Department of Evolution, Ecology and Behavior, University of Illinois, Urbana-Champaign, IL 61801, USA
- Peter D. Olson
  - Life Sciences Department, Natural History Museum, London, UK
- Jan Štefka
  - Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
  - Institute of Parasitology, Biology Centre CAS, České Budějovice, Czech Republic
3
Cornet L, Baurain D. Contamination detection in genomic data: more is not enough. Genome Biol 2022; 23:60. [PMID: 35189924 PMCID: PMC8862208 DOI: 10.1186/s13059-022-02619-9] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 10/14/2021] [Accepted: 01/18/2022] [Indexed: 12/20/2022] Open
Abstract
The decreasing cost of sequencing and concomitant augmentation of publicly available genomes have created an acute need for automated software to assess genomic contamination. During the last 6 years, 18 programs have been published, each with its own strengths and weaknesses. Deciding which tools to use becomes more and more difficult without an understanding of the underlying algorithms. We review these programs, benchmarking six of them, and present their main operating principles. This article is intended to guide researchers in the selection of appropriate tools for specific applications. Finally, we present future challenges in the developing field of contamination detection.
Affiliation(s)
- Luc Cornet
  - BCCM/IHEM, Mycology and Aerobiology, Sciensano, Bruxelles, Belgium
- Denis Baurain
  - InBioS-PhytoSYSTEMS, Eukaryotic Phylogenomics, University of Liège, Liège, Belgium
4
A high-quality fungal genome assembly resolved from a sample accidentally contaminated by multiple taxa. Biotechniques 2021; 72:39-50. [PMID: 34846173 DOI: 10.2144/btn-2021-0097] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/23/2022] Open
Abstract
Contamination in sequenced genomes is a relatively common problem and several methods to remove non-target sequences have been devised. Typically, the target and contaminating organisms reside in different kingdoms, simplifying their separation. The authors present the case of a genome for the ascomycete fungus Teratosphaeria eucalypti, contaminated by another ascomycete fungus and a bacterium. Approaching the problem as a low-complexity metagenomics project, the authors used two available software programs, BlobToolKit and anvi'o, to filter the contaminated genome. Both the de novo and reference-assisted approaches yielded a high-quality draft genome assembly for the target fungus. Incorporating reference sequences increased assembly completeness and visualization elucidated previously unknown genome features. The authors suggest that visualization should be routine in any sequencing project, regardless of suspected contamination.
5
Music of metagenomics-a review of its applications, analysis pipeline, and associated tools. Funct Integr Genomics 2021; 22:3-26. [PMID: 34657989 DOI: 10.1007/s10142-021-00810-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/29/2021] [Revised: 09/25/2021] [Accepted: 10/03/2021] [Indexed: 10/20/2022]
Abstract
This humble effort highlights the intricate details of metagenomics in a simple, poetic, and rhythmic way. The paper enforces the significance of the research area, provides details about major analytical methods, examines the taxonomy and assembly of genomes, emphasizes some tools, and concludes by celebrating the richness of the ecosystem populated by the "metagenome."
6
Keeling CI, Campbell EO, Batista PD, Shegelski VA, Trevoy SAL, Huber DPW, Janes JK, Sperling FAH. Chromosome-level genome assembly reveals genomic architecture of northern range expansion in the mountain pine beetle, Dendroctonus ponderosae Hopkins (Coleoptera: Curculionidae). Mol Ecol Resour 2021; 22:1149-1167. [PMID: 34637588 DOI: 10.1111/1755-0998.13528] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 05/21/2021] [Revised: 10/04/2021] [Accepted: 10/05/2021] [Indexed: 12/22/2022]
Abstract
Genome sequencing methods and assembly tools have improved dramatically since the 2013 publication of draft genome assemblies for the mountain pine beetle, Dendroctonus ponderosae Hopkins (Coleoptera: Curculionidae). We conducted proximity ligation library sequencing and scaffolding to improve contiguity, and then used linkage mapping and recent bioinformatic tools for correction and further improvement. The new assemblies have dramatically improved contiguity and gaps compared to the originals: N50 values increased 26- to 36-fold, and the number of gaps were reduced by half. Ninety per cent of the content of the assemblies is now contained in 12 and 11 scaffolds for the female and male assemblies, respectively. Based on linkage mapping information, the 12 largest scaffolds in both assemblies represent all 11 autosomal chromosomes and the neo-X chromosome. These assemblies now have nearly chromosome-sized scaffolds and will be instrumental for studying genomic architecture, chromosome evolution, population genomics, functional genomics, and adaptation in this and other pest insects. We also identified regions in two chromosomes, including the ancestral-X portion of the neo-X chromosome, with elevated differentiation between northern and southern Canadian populations.
Affiliation(s)
- Christopher I Keeling
  - Laurentian Forestry Centre, Canadian Forest Service, Natural Resources Canada, Québec, QC, Canada
  - Département de biochimie, de microbiologie et de bio-informatique, Université Laval, Québec, QC, Canada
- Erin O Campbell
  - Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada
- Philip D Batista
  - Faculty of Environment, University of Northern British Columbia, Prince George, BC, Canada
- Victor A Shegelski
  - Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada
- Stephen A L Trevoy
  - Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada
- Dezene P W Huber
  - Faculty of Environment, University of Northern British Columbia, Prince George, BC, Canada
- Jasmine K Janes
  - Biology Department, Vancouver Island University, Nanaimo, BC, Canada
  - School of Environmental and Rural Studies, University of New England, Armidale, NSW, Australia
- Felix A H Sperling
  - Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada
7
Whibley A, Kelley JL, Narum SR. The changing face of genome assemblies: Guidance on achieving high-quality reference genomes. Mol Ecol Resour 2021; 21:641-652. [PMID: 33326691 DOI: 10.1111/1755-0998.13312] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 09/17/2020] [Revised: 12/08/2020] [Accepted: 12/11/2020] [Indexed: 12/20/2022]
Abstract
The quality of genome assemblies has improved rapidly in recent years due to continual advances in sequencing technology, assembly approaches, and quality control. In the field of molecular ecology, this has led to the development of exceptional quality genome assemblies that will be important long-term resources for broader studies into ecological, conservation, evolutionary, and population genomics of naturally occurring species. Moreover, the extent to which a single reference genome represents the diversity within a species varies: pan-genomes will become increasingly important ecological genomics resources, particularly in systems found to have considerable presence-absence variation in their functional content. Here, we highlight advances in technology that have raised the bar for genome assembly and provide guidance on standards to achieve exceptional quality reference genomes. Key recommendations include the following: (a) Genome assemblies should include long-read sequencing except in rare cases where it is effectively impossible to acquire adequately preserved samples needed for high molecular weight DNA standards. (b) At least one scaffolding approach should be included with genome assembly such as Hi-C or optical mapping. (c) Genome assemblies should be carefully evaluated, this may involve utilising short read data for genome polishing, error correction, k-mer analyses, and estimating the percent of reads that map back to an assembly. Finally, a genome assembly is most valuable if all data and methods are made publicly available and the utility of a genome for further studies is verified through examples. While these recommendations are based on current technology, we anticipate that future advances will push the field further and the molecular ecology community should continue to adopt new approaches that attain the highest quality genome assemblies.
Affiliation(s)
- Shawn R Narum
  - University of Idaho, Moscow, ID, USA
  - Columbia River Inter-Tribal Fish Commission, Hagerman, ID, USA
8
Draft Genome Sequences of Thelohania contejeani and Cucumispora dikerogammari, Pathogenic Microsporidia of Freshwater Crustaceans. Microbiol Resour Announc 2021; 10:10/2/e01346-20. [PMID: 33446596 PMCID: PMC7849709 DOI: 10.1128/mra.01346-20] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 01/07/2023] Open
Abstract
We announce the draft genome sequences of two pathogenic microsporidia of European freshwater crustaceans, Thelohania contejeani (the causative agent of porcelain disease) and Cucumispora dikerogammari. Both species are implicated in mass mortalities in natural populations of their crayfish and amphipod hosts, respectively.
9
Cormier A, Chebbi MA, Giraud I, Wattier R, Teixeira M, Gilbert C, Rigaud T, Cordaux R. Comparative Genomics of Strictly Vertically Transmitted, Feminizing Microsporidia Endosymbionts of Amphipod Crustaceans. Genome Biol Evol 2020; 13:5995313. [PMID: 33216144 DOI: 10.1093/gbe/evaa245] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Accepted: 11/17/2020] [Indexed: 12/19/2022] Open
Abstract
Microsporidia are obligate intracellular eukaryotic parasites of vertebrates and invertebrates. Microsporidia are usually pathogenic and undergo horizontal transmission or a mix of horizontal and vertical transmission. However, cases of nonpathogenic microsporidia, strictly vertically transmitted from mother to offspring, have been reported in amphipod crustaceans. Some of them further evolved the ability to feminize their nontransmitting male hosts into transmitting females. However, our understanding of the evolution of feminization in microsporidia is hindered by a lack of genomic resources. We report the sequencing and analysis of three strictly vertically transmitted microsporidia species for which feminization induction has been demonstrated (Nosema granulosis) or is strongly suspected (Dictyocoela muelleri and Dictyocoela roeselum), along with a draft genome assembly of their host Gammarus roeselii. Contrary to horizontally transmitted microsporidia that form environmental spores that can be purified, feminizing microsporidia cannot be easily isolated from their host cells. Therefore, we cosequenced symbiont and host genomic DNA and devised a computational strategy to obtain genome assemblies for the different partners. Genomic comparison with feminizing Wolbachia bacterial endosymbionts of isopod crustaceans indicated independent evolution of feminization in microsporidia and Wolbachia at the molecular genetic level. Feminization thus represents a remarkable evolutionary convergence of eukaryotic and prokaryotic microorganisms. Furthermore, a comparative genomics analysis of microsporidia allowed us to identify several candidate genes for feminization, involving functions such as DNA binding and membrane fusion. The genomic resources we generated contribute to establish Gammarus roeselii and its microsporidia symbionts as a new model to study the evolution of symbiont-mediated feminization.
Affiliation(s)
- Alexandre Cormier
  - Laboratoire Ecologie et Biologie des Interactions, Equipe Ecologie Evolution Symbiose, Université de Poitiers, UMR CNRS 7267, France
- Mohamed Amine Chebbi
  - Laboratoire Ecologie et Biologie des Interactions, Equipe Ecologie Evolution Symbiose, Université de Poitiers, UMR CNRS 7267, France
- Isabelle Giraud
  - Laboratoire Ecologie et Biologie des Interactions, Equipe Ecologie Evolution Symbiose, Université de Poitiers, UMR CNRS 7267, France
- Rémi Wattier
  - Laboratoire Biogéosciences, Université Bourgogne Franche-Comté, UMR CNRS 6282, Dijon, France
- Maria Teixeira
  - Laboratoire Biogéosciences, Université Bourgogne Franche-Comté, UMR CNRS 6282, Dijon, France
- Clément Gilbert
  - Université Paris-Saclay, CNRS, IRD, UMR Évolution, Génomes, Comportement et Écologie, 91198 Gif-sur-Yvette, France
- Thierry Rigaud
  - Laboratoire Biogéosciences, Université Bourgogne Franche-Comté, UMR CNRS 6282, Dijon, France
- Richard Cordaux
  - Laboratoire Ecologie et Biologie des Interactions, Equipe Ecologie Evolution Symbiose, Université de Poitiers, UMR CNRS 7267, France
10
Manni M, Simao FA, Robertson HM, Gabaglio MA, Waterhouse RM, Misof B, Niehuis O, Szucsich NU, Zdobnov EM. The Genome of the Blind Soil-Dwelling and Ancestrally Wingless Dipluran Campodea augens: A Key Reference Hexapod for Studying the Emergence of Insect Innovations. Genome Biol Evol 2020; 12:3534-3549. [PMID: 31778187 PMCID: PMC6938034 DOI: 10.1093/gbe/evz260] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Accepted: 11/26/2019] [Indexed: 12/13/2022] Open
Abstract
The dipluran two-pronged bristletail Campodea augens is a blind ancestrally wingless hexapod with the remarkable capacity to regenerate lost body appendages such as its long antennae. As sister group to Insecta (sensu stricto), Diplura are key to understanding the early evolution of hexapods and the origin and evolution of insects. Here we report the 1.2-Gb draft genome of C. augens and results from comparative genomic analyses with other arthropods. In C. augens, we uncovered the largest chemosensory gene repertoire of ionotropic receptors in the animal kingdom, a massive expansion that might compensate for the loss of vision. We found a paucity of photoreceptor genes mirroring at the genomic level the secondary loss of an ancestral external photoreceptor organ. Expansions of detoxification and carbohydrate metabolism gene families might reflect adaptations for foraging behavior, and duplicated apoptotic genes might underlie its high regenerative potential. The C. augens genome represents one of the key references for studying the emergence of genomic innovations in insects, the most diverse animal group, and opens up novel opportunities to study the under-explored biology of diplurans.
Affiliation(s)
- Mosè Manni
  - Department of Genetic Medicine and Development, Swiss Institute of Bioinformatics, University of Geneva Medical School, Switzerland
- Felipe A Simao
  - Department of Genetic Medicine and Development, Swiss Institute of Bioinformatics, University of Geneva Medical School, Switzerland
- Hugh M Robertson
  - Department of Entomology, University of Illinois at Urbana-Champaign
- Marco A Gabaglio
  - Department of Genetic Medicine and Development, Swiss Institute of Bioinformatics, University of Geneva Medical School, Switzerland
- Robert M Waterhouse
  - Department of Ecology and Evolution, Swiss Institute of Bioinformatics, University of Lausanne, Switzerland
- Bernhard Misof
  - Center for Molecular Biodiversity Research, Zoological Research Museum Alexander Koenig, Bonn, Germany
- Oliver Niehuis
  - Department of Evolutionary Biology and Ecology, Albert Ludwig University, Institute of Biology I (Zoology), Freiburg, Germany
- Evgeny M Zdobnov
  - Department of Genetic Medicine and Development, Swiss Institute of Bioinformatics, University of Geneva Medical School, Switzerland
11
Derkarabetian S, Castillo S, Koo PK, Ovchinnikov S, Hedin M. A demonstration of unsupervised machine learning in species delimitation. Mol Phylogenet Evol 2019; 139:106562. [PMID: 31323334 PMCID: PMC6880864 DOI: 10.1016/j.ympev.2019.106562] [Citation(s) in RCA: 51] [Impact Index Per Article: 8.5] [Received: 03/17/2019] [Revised: 07/03/2019] [Accepted: 07/15/2019] [Indexed: 01/13/2023]
Abstract
One major challenge to delimiting species with genetic data is successfully differentiating population structure from species-level divergence, an issue exacerbated in taxa inhabiting naturally fragmented habitats. Many fields of science are now using machine learning, and in evolutionary biology supervised machine learning has recently been used to infer species boundaries. These supervised methods require training data with associated labels. Conversely, unsupervised machine learning (UML) uses inherent data structure and does not require user-specified training labels, potentially providing more objectivity in species delimitation. In the context of integrative taxonomy, we demonstrate the utility of three UML approaches (random forests, variational autoencoders, t-distributed stochastic neighbor embedding) for species delimitation in an arachnid taxon with high population genetic structure (Opiliones, Laniatores, Metanonychus). We find that UML approaches successfully cluster samples according to species-level divergences and not high levels of population structure, while model-based validation methods severely over-split putative species. UML offers intuitive data visualization in two-dimensional space, the ability to accommodate various data types, and has potential in many areas of systematic and evolutionary biology. We argue that machine learning methods are ideally suited for species delimitation and may perform well in many natural systems and across taxa with diverse biological characteristics.
Affiliation(s)
- Shahan Derkarabetian
  - Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA 02138, United States
  - Department of Biology, San Diego State University, San Diego, CA 92182, United States
  - Department of Evolution, Ecology, and Organismal Biology, University of California, Riverside, Riverside, CA 92521, United States
- Stephanie Castillo
  - Department of Biology, San Diego State University, San Diego, CA 92182, United States
  - Department of Entomology, University of California, Riverside, Riverside, CA 92521, United States
- Peter K Koo
  - Howard Hughes Medical Institute, Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, United States
- Sergey Ovchinnikov
  - Center for Systems Biology, Harvard University, Cambridge, MA 02138, United States
- Marshal Hedin
  - Department of Biology, San Diego State University, San Diego, CA 92182, United States
12
Low AJ, Koziol AG, Manninger PA, Blais B, Carrillo CD. ConFindr: rapid detection of intraspecies and cross-species contamination in bacterial whole-genome sequence data. PeerJ 2019; 7:e6995. [PMID: 31183253 PMCID: PMC6546082 DOI: 10.7717/peerj.6995] [Citation(s) in RCA: 83] [Impact Index Per Article: 13.8] [Received: 01/22/2019] [Accepted: 04/20/2019] [Indexed: 12/16/2022] Open
Abstract
Whole-genome sequencing (WGS) of bacterial pathogens is currently widely used to support public-health investigations. The ability to assess WGS data quality is critical to underpin the reliability of downstream analyses. Sequence contamination is a quality issue that could potentially impact WGS-based findings; however, existing tools do not readily identify contamination from closely-related organisms. To address this gap, we have developed a computational pipeline, ConFindr, for detection of intraspecies contamination. ConFindr determines the presence of contaminating sequences based on the identification of multiple alleles of core, single-copy, ribosomal-protein genes in raw sequencing reads. The performance of this tool was assessed using simulated and lab-generated Illumina short-read WGS data with varying levels of contamination (0-20% of reads) and varying genetic distance between the designated target and contaminant strains. Intraspecies and cross-species contamination was reliably detected in datasets containing 5% or more reads from a second, unrelated strain. ConFindr detected intraspecies contamination with higher sensitivity than existing tools, while also being able to automatically detect cross-species contamination with similar sensitivity. The implementation of ConFindr in quality-control pipelines will help to improve the reliability of WGS databases as well as the accuracy of downstream analyses. ConFindr is written in Python, and is freely available under the MIT License at github.com/OLC-Bioinformatics/ConFindr.
Affiliation(s)
- Andrew J Low
  - Ottawa Laboratory (Carling), Canadian Food Inspection Agency, Ottawa, Ontario, Canada
- Adam G Koziol
  - Ottawa Laboratory (Carling), Canadian Food Inspection Agency, Ottawa, Ontario, Canada
- Paul A Manninger
  - Ottawa Laboratory (Carling), Canadian Food Inspection Agency, Ottawa, Ontario, Canada
- Burton Blais
  - Ottawa Laboratory (Carling), Canadian Food Inspection Agency, Ottawa, Ontario, Canada
- Catherine D Carrillo
  - Ottawa Laboratory (Carling), Canadian Food Inspection Agency, Ottawa, Ontario, Canada
13
Zhang F, Ding Y, Zhou QS, Wu J, Luo A, Zhu CD. A High-quality Draft Genome Assembly of Sinella curviseta: A Soil Model Organism (Collembola). Genome Biol Evol 2019; 11:521-530. [PMID: 30668671 PMCID: PMC6389355 DOI: 10.1093/gbe/evz013] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Accepted: 01/16/2019] [Indexed: 12/25/2022] Open
Abstract
Sinella curviseta, among the most widespread springtails (Collembola) in Northern Hemisphere, has often been treated as a model organism in soil ecology and environmental toxicology. However, little information on its genetic knowledge severely hinders our understanding of its adaptations to the soil habitat. We present the largest genome assembly within Collembola using ∼44.86 Gb (118X) of single-molecule real-time Pacific Bioscience Sequel sequencing. The final assembly of 599 scaffolds was ∼381.46 Mb with a N50 length of 3.28 Mb, which captured 95.3% complete and 1.5% partial arthropod Benchmarking Universal Single-Copy Orthologs (n = 1066). Transcripts and circularized mitochondrial genome were also assembled. We predicted 23,943 protein-coding genes, of which 83.88% were supported by transcriptome-based evidence and 82.49% matched protein records in UniProt. In addition, we also identified 222,501 repeats and 881 noncoding RNAs. Phylogenetic reconstructions for Collembola support Tomoceridae sistered to the remaining Entomobryomorpha with the position of Symphypleona not fully resolved. Gene family evolution analyses identified 9,898 gene families, of which 156 experienced significant expansions or contractions. Our high-quality reference genome of S. curviseta provides the genetic basis for future investigations in evolutionary biology, soil ecology, and ecotoxicology.
Affiliation(s)
- Feng Zhang
  - Department of Entomology, College of Plant Protection, Nanjing Agricultural University
  - Key Laboratory of the Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Yinhuan Ding
  - Department of Entomology, College of Plant Protection, Nanjing Agricultural University
- Qing-Song Zhou
  - Key Laboratory of the Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Jun Wu
  - Nanjing Institute of Environmental Sciences under Ministry of Environmental Protection, Nanjing, China
- Arong Luo
  - Key Laboratory of the Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Chao-Dong Zhu
  - Key Laboratory of the Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
14
Acuña-Amador L, Primot A, Cadieu E, Roulet A, Barloy-Hubler F. Genomic repeats, misassembly and reannotation: a case study with long-read resequencing of Porphyromonas gingivalis reference strains. BMC Genomics 2018; 19:54. [PMID: 29338683 PMCID: PMC5771137 DOI: 10.1186/s12864-017-4429-4] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 09/26/2017] [Accepted: 12/29/2017] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND: Without knowledge of their genomic sequences, it is impossible to make functional models of the bacteria that make up human and animal microbiota. Unfortunately, the vast majority of publicly available genomes are only working drafts, an incompleteness that causes numerous problems and constitutes a major obstacle to genotypic and phenotypic interpretation. In this work, we began with an example from the class Bacteroidia in the phylum Bacteroidetes, which is preponderant among human orodigestive microbiota. We successfully identify the genetic loci responsible for assembly breaks and misassemblies and demonstrate the importance and usefulness of long-read sequencing and curated reannotation.
RESULTS: We showed that the fragmentation in Bacteroidia draft genomes assembled from massively parallel sequencing linearly correlates with genomic repeats of the same or greater size than the reads. We also demonstrated that some of these repeats, especially the long ones, correspond to misassembled loci in three reference Porphyromonas gingivalis genomes marked as circularized (thus complete or finished). We prove that even at modest coverage (30X), long-read resequencing together with PCR contiguity verification (rrn operons and an integrative and conjugative element or ICE) can be used to identify and correct the wrongly combined or assembled regions. Finally, although time-consuming and labor-intensive, consistent manual biocuration of three P. gingivalis strains allowed us to compare and correct the existing genomic annotations, resulting in a more accurate interpretation of the genomic differences among these strains.
CONCLUSIONS: In this study, we demonstrate the usefulness and importance of long-read sequencing in verifying published genomes (even when complete) and generating assemblies for new bacterial strains/species with high genomic plasticity. We also show that when combined with biological validation processes and diligent biocurated annotation, this strategy helps reduce the propagation of errors in shared databases, thus limiting false conclusions based on incomplete or misleading information.
Affiliation(s)
- Luis Acuña-Amador
  - Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France; Laboratorio de Investigación en Bacteriología Anaerobia, Centro de Investigación en Enfermedades Tropicales, Facultad de Microbiología, Universidad de Costa Rica, San José, Costa Rica
- Aline Primot
  - Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France
- Edouard Cadieu
  - Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France
- Alain Roulet
  - GenoToul Genome & Transcriptome (GeT-PlaGe), INRA, US1426, Castanet-Tolosan, France
- Frédérique Barloy-Hubler
  - Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France
|
15
|
Dittami SM, Corre E. Detection of bacterial contaminants and hybrid sequences in the genome of the kelp Saccharina japonica using Taxoblast. PeerJ 2017; 5:e4073. [PMID: 29158994 PMCID: PMC5695246 DOI: 10.7717/peerj.4073] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 07/27/2017] [Accepted: 10/30/2017] [Indexed: 12/03/2022]
Abstract
Modern genome sequencing strategies are highly sensitive to contamination, making the detection of foreign DNA sequences an important part of analysis pipelines. Here we use Taxoblast, a simple pipeline with a graphical user interface, for the post-assembly detection of contaminating sequences in the published genome of the kelp Saccharina japonica. Analyses were based on multiple blastn searches with short sequence fragments. They revealed a number of probable bacterial contaminants as well as hybrid scaffolds that contain both bacterial and algal sequences. This or similar types of analysis, in combination with manual curation, may thus constitute a useful complement to standard bioinformatics analyses prior to submission of genomic data to public repositories. Our analysis pipeline is open-source and freely available at http://sdittami.altervista.org/taxoblast and via SourceForge (https://sourceforge.net/projects/taxoblast).
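The fragment-based idea in this abstract (search short pieces of each scaffold separately, then look for scaffolds whose pieces disagree) can be illustrated with a small Python sketch. This is not Taxoblast's code: the function names, the fragment size, the label strings (`"bacteria"`, `"alga"`, `"no_hit"`) and the 20% threshold are all assumptions made for illustration, and the per-fragment labels are taken as given rather than produced by an actual blastn run.

```python
from collections import Counter


def split_into_fragments(sequence, size):
    """Cut a sequence into consecutive, non-overlapping fragments of
    `size` bases; the last fragment may be shorter."""
    if size <= 0:
        raise ValueError("fragment size must be positive")
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def classify_scaffold(fragment_labels, min_fraction=0.2):
    """Summarise per-fragment taxonomic labels for one scaffold.

    Fragments labelled "no_hit" are ignored. The scaffold is called
    "contaminant" if nearly all informative fragments are bacterial,
    "hybrid" if a substantial share of both kinds is present, and
    "target" otherwise. The threshold is an illustrative assumption.
    """
    counts = Counter(label for label in fragment_labels if label != "no_hit")
    informative = sum(counts.values())
    if informative == 0:
        return "unclassified"
    bacterial_share = counts["bacteria"] / informative
    if bacterial_share >= 1 - min_fraction:
        return "contaminant"
    if bacterial_share >= min_fraction:
        return "hybrid"
    return "target"


fragments = split_into_fragments("ACGTACGTAC", 4)
print(fragments)  # ['ACGT', 'ACGT', 'AC']

# Hypothetical labels, one per fragment of a longer scaffold.
print(classify_scaffold(["alga", "alga", "bacteria", "bacteria"]))  # hybrid
```

In practice each fragment would be searched against a nucleotide database and its label derived from the taxonomy of the best hits; flagged scaffolds would then go to manual curation, as the abstract recommends.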
Affiliation(s)
- Simon M Dittami
  - UMR8227-Sorbonne Universités CNRS UPMC, Station Biologique de Roscoff, Roscoff, Brittany, France
- Erwan Corre
  - FR2424-Sorbonne Universités CNRS UPMC, Station Biologique de Roscoff, Roscoff, Brittany, France
|
16
|
Abstract
The goal of many genome sequencing projects is to provide a complete representation of a target genome (or genomes) as underpinning data for further analyses. However, it can be problematic to identify which sequences in an assembly truly derive from the target genome(s) and which are derived from associated microbiome or contaminant organisms. We present BlobTools, a modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets. Using guanine+cytosine content of sequences, read coverage in sequencing libraries and taxonomy of sequence similarity matches, BlobTools can assist in primary partitioning of data, leading to improved assemblies, and screening of final assemblies for potential contaminants. Through simulated paired-end read datasets containing a mixture of metazoan and bacterial taxa, we illustrate the main BlobTools workflow and suggest useful parameters for taxonomic partitioning of low-complexity metagenome assemblies.
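The three signals named in this abstract (GC content, read coverage, and taxonomy of similarity matches) can be combined per contig in a few lines. The Python sketch below is only an illustration of that idea and does not use or reproduce the BlobTools command-line interface: `gc_content`, `summarise_by_taxon`, the dictionary keys, and the toy contigs are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def gc_content(sequence):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def summarise_by_taxon(contigs):
    """Group contigs by their assigned taxon and report, per group,
    the number of contigs, mean GC content, and mean read coverage."""
    groups = defaultdict(list)
    for contig in contigs:
        groups[contig["taxon"]].append(contig)
    return {
        taxon: {
            "n": len(members),
            "mean_gc": mean(gc_content(c["seq"]) for c in members),
            "mean_cov": mean(c["cov"] for c in members),
        }
        for taxon, members in groups.items()
    }


# Toy contigs (invented): sequence, read coverage, and a taxon label that
# would normally come from sequence similarity searches.
contigs = [
    {"taxon": "Nematoda", "seq": "ATATGC", "cov": 30.0},
    {"taxon": "Nematoda", "seq": "ATAT", "cov": 50.0},
    {"taxon": "Proteobacteria", "seq": "GGCC", "cov": 5.0},
]
for taxon, stats in summarise_by_taxon(contigs).items():
    print(taxon, stats)
```

Groups that differ sharply from the target in both GC content and coverage, as the invented bacterial contig does here, are the ones worth inspecting as possible contaminants.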
|