1
Léonard RR, Leleu M, Van Vlierberghe M, Cornet L, Kerff F, Baurain D. ToRQuEMaDA: tool for retrieving queried Eubacteria, metadata and dereplicating assemblies. PeerJ 2021; 9:e11348. [PMID: 33996287 PMCID: PMC8106394 DOI: 10.7717/peerj.11348] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/24/2020] [Accepted: 04/04/2021] [Indexed: 11/20/2022]
Abstract
TQMD is a tool for high-performance computing clusters which downloads, stores and produces lists of dereplicated prokaryotic genomes. It has been developed to counter the ever-growing number of prokaryotic genomes and their uneven taxonomic distribution. It is based on word-based alignment-free methods (k-mers), an iterative single-linkage approach and a divide-and-conquer strategy to remain both efficient and scalable. We studied the performance of TQMD by verifying the influence of its parameters and heuristics on the clustering outcome. We further compared TQMD to two other dereplication tools (dRep and Assembly-Dereplicator). Our results showed that TQMD is primarily optimized to dereplicate at higher taxonomic levels (phylum/class), as opposed to the other dereplication tools, but also works at lower taxonomic levels (species/strain) like the other dereplication tools. TQMD is available from source and as a Singularity container at https://bitbucket.org/phylogeno/tqmd.
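The general idea of k-mer based, single-linkage dereplication mentioned in the abstract can be illustrated with a minimal Python sketch. This is not TQMD's implementation: the function names, the Jaccard index as similarity measure, the word size `k`, the `threshold` value and the toy sequences are all assumptions chosen for illustration, and the iterative and divide-and-conquer aspects are left out.

```python
from itertools import combinations


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dereplicate(genomes: dict[str, str], k: int = 4, threshold: float = 0.5) -> list[str]:
    """Single-linkage clustering on k-mer similarity.

    Two genomes are linked when their Jaccard index is >= threshold
    (hypothetical cutoff); one representative is kept per cluster.
    """
    profiles = {name: kmers(seq, k) for name, seq in genomes.items()}
    parent = {name: name for name in genomes}  # union-find forest

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Single linkage: any pair above the threshold merges two clusters.
    for a, b in combinations(genomes, 2):
        if jaccard(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters: dict[str, list[str]] = {}
    for name in genomes:
        clusters.setdefault(find(name), []).append(name)

    # Representative choice is arbitrary here: alphabetically first member.
    return sorted(min(members) for members in clusters.values())


# Toy sequences, invented for this example only.
toy = {
    "g1": "ACGTACGTACGT",
    "g2": "ACGTACGTACGA",
    "g3": "TTTTGGGGCCCC",
}
print(dereplicate(toy))  # ['g1', 'g3']
```

In this toy input, `g1` and `g2` share four of five distinct 4-mers and collapse into one cluster, while `g3` shares none and stays separate. A real tool would pick representatives by quality criteria rather than by name.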
Affiliation(s)
- Raphaël R Léonard
  - InBioS - Centre d'Ingénierie des Protéines, Université de Liège, Liège, Belgium
  - InBioS - PhytoSYSTEMS, Eukaryotic Phylogenomics, Université de Liège, Liège, Belgium
- Marie Leleu
  - InBioS - PhytoSYSTEMS, Eukaryotic Phylogenomics, Université de Liège, Liège, Belgium
  - UGSF - Unité de Glycobiologie Structurale et Fonctionnelle, Université de Lille/CNRS, Lille, France
- Mick Van Vlierberghe
  - InBioS - PhytoSYSTEMS, Eukaryotic Phylogenomics, Université de Liège, Liège, Belgium
- Luc Cornet
  - InBioS - PhytoSYSTEMS, Eukaryotic Phylogenomics, Université de Liège, Liège, Belgium
  - Mycology and Aerobiology, Sciensano, Service Public Fédéral, Bruxelles, Belgium
- Frédéric Kerff
  - InBioS - Centre d'Ingénierie des Protéines, Université de Liège, Liège, Belgium
- Denis Baurain
  - InBioS - PhytoSYSTEMS, Eukaryotic Phylogenomics, Université de Liège, Liège, Belgium
2
Cornet L, Meunier L, Van Vlierberghe M, Léonard RR, Durieu B, Lara Y, Misztak A, Sirjacobs D, Javaux EJ, Philippe H, Wilmotte A, Baurain D. Consensus assessment of the contamination level of publicly available cyanobacterial genomes. PLoS One 2018; 13:e0200323. [PMID: 30044797 PMCID: PMC6059444 DOI: 10.1371/journal.pone.0200323] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 04/17/2018] [Accepted: 06/22/2018] [Indexed: 12/31/2022]
Abstract
Publicly available genomes are crucial for phylogenetic and metagenomic studies, in which contaminating sequences can be the cause of major problems. This issue is expected to be especially important for Cyanobacteria because axenic strains are notoriously difficult to obtain and keep in culture. Yet, despite their great scientific interest, no data are currently available concerning the quality of publicly available cyanobacterial genomes. As reliably detecting contaminants is a complex task, we designed a pipeline combining six methods in a consensus strategy to assess the contamination level of 440 genome assemblies of Cyanobacteria. Two methods are based on published reference databases of ribosomal genes (SSU rRNA 16S and ribosomal proteins), one is indirectly based on a reference database of marker genes (CheckM), and three are based on complete genome analysis. Among those genome-wide methods, Kraken and DIAMOND blastx share the same reference database that we derived from Ensembl Bacteria, whereas CONCOCT does not require any reference database, instead relying on differences in DNA tetramer frequencies. Given that all the six methods appear to have their own strengths and limitations, we used the consensus of their rankings to infer that >5% of cyanobacterial genome assemblies are highly contaminated by foreign DNA (i.e., contaminants were detected by 5 or 6 methods). Our results will help researchers to check the quality of publicly available genomic data before use in their own analyses. Moreover, we argue that journals should make mandatory the submission of raw read data along with genome assemblies in order to facilitate the detection of contaminants in sequence databases.
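The "detected by 5 or 6 methods" criterion can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the abstract describes a consensus of rankings, whereas this example reduces each method to a yes/no detection. The method labels are taken from the abstract, but the function name, the data layout and the assembly names are hypothetical, and the detections are invented toy values, not results from the study.

```python
# Labels for the six methods named in the abstract (identifiers are illustrative).
METHODS = frozenset({
    "rRNA_16S",
    "ribosomal_proteins",
    "CheckM",
    "Kraken",
    "DIAMOND_blastx",
    "CONCOCT",
})


def highly_contaminated(detections: dict[str, set[str]], min_methods: int = 5) -> list[str]:
    """Return assemblies flagged by at least min_methods of the known methods.

    detections maps an assembly name to the set of methods that reported
    contamination for it; unknown method labels are ignored.
    """
    return sorted(
        assembly
        for assembly, hits in detections.items()
        if len(hits & METHODS) >= min_methods
    )


# Invented toy input: not data from the paper.
toy_detections = {
    "asm_A": set(METHODS),                       # flagged by all six methods
    "asm_B": {"Kraken", "CheckM"},               # flagged by two methods
    "asm_C": set(METHODS) - {"CONCOCT"},         # flagged by five methods
}
print(highly_contaminated(toy_detections))  # ['asm_A', 'asm_C']
```

Requiring agreement between most methods trades sensitivity for confidence: an assembly flagged by only one or two methods is not reported, which limits false positives caused by the weaknesses of any single method.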
Affiliation(s)
- Luc Cornet
  - InBioS–PhytoSYSTEMS, Eukaryotic Phylogenomics, University of Liège, Liège, Belgium
  - UR Geology–Palaeobiogeology-Palaeobotany-Palaeopalynology, University of Liège, Liège, Belgium
- Loïc Meunier
  - InBioS–PhytoSYSTEMS, Eukaryotic Phylogenomics, University of Liège, Liège, Belgium
- Mick Van Vlierberghe
  - InBioS–PhytoSYSTEMS, Eukaryotic Phylogenomics, University of Liège, Liège, Belgium
- Raphaël R. Léonard
  - InBioS–PhytoSYSTEMS, Eukaryotic Phylogenomics, University of Liège, Liège, Belgium
  - InBioS–CIP, Macromolecular Crystallography, University of Liège, Liège, Belgium
- Benoit Durieu
  - InBioS–CIP, Centre for Protein Engineering, University of Liège, Liège, Belgium
- Yannick Lara
  - InBioS–CIP, Centre for Protein Engineering, University of Liège, Liège, Belgium
- Agnieszka Misztak
  - InBioS–PhytoSYSTEMS, Eukaryotic Phylogenomics, University of Liège, Liège, Belgium
  - Intercollegiate Faculty of Biotechnology UG-MUG, Gdansk, Poland
- Damien Sirjacobs
  - InBioS–PhytoSYSTEMS, Eukaryotic Phylogenomics, University of Liège, Liège, Belgium
- Emmanuelle J. Javaux
  - UR Geology–Palaeobiogeology-Palaeobotany-Palaeopalynology, University of Liège, Liège, Belgium
- Hervé Philippe
  - Centre for Biodiversity Theory and Modelling, Moulis, France
- Annick Wilmotte
  - InBioS–CIP, Centre for Protein Engineering, University of Liège, Liège, Belgium
- Denis Baurain
  - InBioS–PhytoSYSTEMS, Eukaryotic Phylogenomics, University of Liège, Liège, Belgium
  - * E-mail: