1
|
Tomasella F, Pizzi C. MetaComBin: combining abundances and overlaps for binning metagenomics reads. FRONTIERS IN BIOINFORMATICS 2025; 5:1504728. [PMID: 40099113 PMCID: PMC11912761 DOI: 10.3389/fbinf.2025.1504728] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2024] [Accepted: 01/27/2025] [Indexed: 03/19/2025] Open
Abstract
Introduction Metagenomics is the discipline that studies heterogeneous microbial samples extracted directly from their natural environment, for example, from soil, water, or the human body. The detection and quantification of species that populate microbial communities have been the subject of many recent studies based on classification and clustering, motivated by being the first step in more complex pipelines (e.g., for functional analysis, de novo assembly, or comparison of metagenomes). Metagenomics has an impact on both environmental studies and precision medicine; thus, it is crucial to improve the quality of species identification through computational tools. Methods In this paper, we explore the idea of improving the overall quality of metagenomics binning at the read level by proposing a computational framework that sequentially combines two complementary read-binning approaches: one based on species abundance determination and another one relying on read overlap in order to cluster reads together. We called this approach MetaComBin (metagenomics combined binning). Results and Discussion The results of our experiments with the MetaComBin approach showed that the combination of two tools, based on different approaches, can improve the clustering quality in realistic conditions where the number of species is not known beforehand.
Collapse
Affiliation(s)
| | - Cinzia Pizzi
- Department of Information Engineering, University of Padova, Padua, Italy
| |
Collapse
|
2
|
Mian E, Petrucci E, Pizzi C, Comin M. MISSH: Fast Hashing of Multiple Spaced Seeds. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:2330-2339. [PMID: 39320990 DOI: 10.1109/tcbb.2024.3467368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/27/2024]
Abstract
Alignment-free analysis of sequences has revolutionized the high-throughput processing of sequencing data within numerous bioinformatics pipelines. Hashing -mers represents a common function across various alignment-free applications, serving as a crucial tool for indexing, querying, and rapid similarity searching. More recently, spaced seeds, a specialized pattern that accommodates errors or mutations, have become a standard choice over traditional -mers. Spaced seeds offer enhanced sensitivity in many applications when compared to -mers. However, it's important to note that hashing spaced seeds significantly increases computational time. Furthermore, if multiple spaced seeds are employed, accuracy can be further improved, albeit at the expense of longer processing times. This paper addresses the challenge of efficiently hashing multiple spaced seeds. The proposed algorithms leverage the similarity of adjacent spaced seed hash values within an input sequence, allowing for the swift computation of subsequent hashes. Our experimental results, conducted across various tests, demonstrate a remarkable performance improvement over previously suggested algorithms, with potential speedups of up to 20 times. Additionally, we apply these efficient spaced seed hashing algorithms to a metagenomic application, specifically the classification of reads using Clark-S (Ounit and Lonardi, 2016). Our findings reveal a substantial speedup, effectively mitigating the slowdown caused by the utilization of multiple spaced seeds.
Collapse
|
3
|
Ali S, Chourasia P, Patterson M. From PDB files to protein features: a comparative analysis of PDB bind and STCRDAB datasets. Med Biol Eng Comput 2024; 62:2449-2483. [PMID: 38622438 DOI: 10.1007/s11517-024-03074-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2023] [Accepted: 03/13/2024] [Indexed: 04/17/2024]
Abstract
Understanding protein structures is crucial for various bioinformatics research, including drug discovery, disease diagnosis, and evolutionary studies. Protein structure classification is a critical aspect of structural biology, where supervised machine learning algorithms classify structures based on data from databases such as Protein Data Bank (PDB). However, the challenge lies in designing numerical embeddings for protein structures without losing essential information. Although some effort has been made in the literature, researchers have not effectively and rigorously combined the structural and sequence-based features for efficient protein classification to the best of our knowledge. To this end, we propose numerical embeddings that extract relevant features for protein sequences fetched from PDB structures from popular datasets such as PDB Bind and STCRDAB. The features are physicochemical properties such as aromaticity, instability index, flexibility, Grand Average of Hydropathy (GRAVY), isoelectric point, charge at pH, secondary structure fracture, molar extinction coefficient, and molecular weight. We also incorporate scaling features for the sliding windows (e.g., k-mers), which include Kyte and Doolittle (KD) hydropathy scale, Eisenberg hydrophobicity scale, Hydrophilicity scale, Flexibility of the amino acids, and Hydropathy scale. Multiple-feature selection aims to improve the accuracy of protein classification models. The results showed that the selected features significantly improved the predictive performance of existing embeddings.
Collapse
Affiliation(s)
- Sarwan Ali
- Georgia State University, Atlanta, GA, USA.
| | | | | |
Collapse
|
4
|
Mallawaarachchi V, Wickramarachchi A, Xue H, Papudeshi B, Grigson SR, Bouras G, Prahl RE, Kaphle A, Verich A, Talamantes-Becerra B, Dinsdale EA, Edwards RA. Solving genomic puzzles: computational methods for metagenomic binning. Brief Bioinform 2024; 25:bbae372. [PMID: 39082646 PMCID: PMC11289683 DOI: 10.1093/bib/bbae372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2024] [Revised: 06/05/2024] [Accepted: 07/15/2024] [Indexed: 08/03/2024] Open
Abstract
Metagenomics involves the study of genetic material obtained directly from communities of microorganisms living in natural environments. The field of metagenomics has provided valuable insights into the structure, diversity and ecology of microbial communities. Once an environmental sample is sequenced and processed, metagenomic binning clusters the sequences into bins representing different taxonomic groups such as species, genera, or higher levels. Several computational tools have been developed to automate the process of metagenomic binning. These tools have enabled the recovery of novel draft genomes of microorganisms allowing us to study their behaviors and functions within microbial communities. This review classifies and analyzes different approaches of metagenomic binning and different refinement, visualization, and evaluation techniques used by these methods. Furthermore, the review highlights the current challenges and areas of improvement present within the field of research.
Collapse
Affiliation(s)
- Vijini Mallawaarachchi
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| | - Anuradha Wickramarachchi
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
| | - Hansheng Xue
- School of Computing, National University of Singapore, Singapore 119077, Singapore
| | - Bhavya Papudeshi
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| | - Susanna R Grigson
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| | - George Bouras
- Adelaide Medical School, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
- The Department of Surgery—Otolaryngology Head and Neck Surgery, University of Adelaide and the Basil Hetzel Institute for Translational Health Research, Central Adelaide Local Health Network, Adelaide, SA 5011, Australia
| | - Rosa E Prahl
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
| | - Anubhav Kaphle
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
| | - Andrey Verich
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- The Kirby Institute, The University of New South Wales, Randwick, Sydney, NSW 2052, Australia
| | - Berenice Talamantes-Becerra
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
| | - Elizabeth A Dinsdale
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| | - Robert A Edwards
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| |
Collapse
|
5
|
Zhang Z, Xiao J, Wang H, Yang C, Huang Y, Yue Z, Chen Y, Han L, Yin K, Lyu A, Fang X, Zhang L. Exploring high-quality microbial genomes by assembling short-reads with long-range connectivity. Nat Commun 2024; 15:4631. [PMID: 38821971 PMCID: PMC11143213 DOI: 10.1038/s41467-024-49060-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2023] [Accepted: 05/17/2024] [Indexed: 06/02/2024] Open
Abstract
Although long-read sequencing enables the generation of complete genomes for unculturable microbes, its high cost limits the widespread adoption of long-read sequencing in large-scale metagenomic studies. An alternative method is to assemble short-reads with long-range connectivity, which can be a cost-effective way to generate high-quality microbial genomes. Here, we develop Pangaea, a bioinformatic approach designed to enhance metagenome assembly using short-reads with long-range connectivity. Pangaea leverages connectivity derived from physical barcodes of linked-reads or virtual barcodes by aligning short-reads to long-reads. Pangaea utilizes a deep learning-based read binning algorithm to assemble co-barcoded reads exhibiting similar sequence contexts and abundances, thereby improving the assembly of high- and medium-abundance microbial genomes. Pangaea also leverages a multi-thresholding algorithm strategy to refine assembly for low-abundance microbes. We benchmark Pangaea on linked-reads and a combination of short- and long-reads from simulation data, mock communities and human gut metagenomes. Pangaea achieves significantly higher contig continuity as well as more near-complete metagenome-assembled genomes (NCMAGs) than the existing assemblers. Pangaea also generates three complete and circular NCMAGs on the human gut microbiomes.
Collapse
Grants
- This research was partially supported by the Young Collaborative Research Grant (C2004-23Y, L.Z.), HMRF (11221026, L.Z.), the open project of BGI-Shenzhen, Shenzhen 518000, China (BGIRSZ20220012, L.Z.), the Hong Kong Research Grant Council Early Career Scheme (HKBU 22201419, L.Z.), HKBU Start-up Grant Tier 2 (RC-SGT2/19-20/SCI/007, L.Z.), HKBU IRCMS (No. IRCMS/19-20/D02, L.Z.).
- This research was partially supported by the open project of BGI-Shenzhen, Shenzhen 518000, China (BGIRSZ20220014, KJ.Y.).
- The study were partially supported by the Science Technology and Innovation Committee of Shenzhen Municipality, China (SGDX20190919142801722, XD.F.),
Collapse
Affiliation(s)
- Zhenmiao Zhang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
| | - Jin Xiao
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
| | - Hongbo Wang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
| | - Chao Yang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
| | | | - Zhen Yue
- BGI Research, Sanya, 572025, China
| | - Yang Chen
- State Key Laboratory of Dampness Syndrome of Chinese Medicine, The Second Affiliated Hospital of Guangzhou University of Chinese, Guangzhou, China
| | - Lijuan Han
- Department of Scientific Research, Kangmeihuada GeneTech Co., Ltd (KMHD), Shenzhen, China
| | - Kejing Yin
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
- Institute for Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
| | - Aiping Lyu
- School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China
| | - Xiaodong Fang
- BGI Research, Shenzhen, 518083, China
- BGI Research, Sanya, 572025, China
- Department of Scientific Research, Kangmeihuada GeneTech Co., Ltd (KMHD), Shenzhen, China
| | - Lu Zhang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China.
- Institute for Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China.
| |
Collapse
|
6
|
Ali S, Chourasia P, Patterson M. When Protein Structure Embedding Meets Large Language Models. Genes (Basel) 2023; 15:25. [PMID: 38254915 PMCID: PMC10815811 DOI: 10.3390/genes15010025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2023] [Revised: 12/16/2023] [Accepted: 12/21/2023] [Indexed: 01/24/2024] Open
Abstract
Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.
Collapse
Affiliation(s)
| | | | - Murray Patterson
- Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA; (S.A.); (P.C.)
| |
Collapse
|
7
|
Cavattoni M, Comin M. ClassGraph: Improving Metagenomic Read Classification with Overlap Graphs. J Comput Biol 2023. [PMID: 37023405 DOI: 10.1089/cmb.2022.0208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/08/2023] Open
Abstract
ABSTRACT Current technologies allow the sequencing of microbial communities directly from the environment without prior culturing. One of the major problems when analyzing a microbial sample is to taxonomically annotate its reads to identify the species it contains. Most methods that are currently available focus on the classification of reads using a set of reference genomes and their k-mers. While in terms of precision these methods have reached percentages of correctness close to perfection, in terms of sensitivity (the actual number of classified reads), the performance is often poor. One reason is that the reads in a sample can be very different from the corresponding reference genomes; for example, viral genomes are usually highly mutated. To address this issue, in this article, we propose ClassGraph, a new taxonomic classification method that makes use of the read overlap graph and applies a label propagation algorithm to refine the results of existing tools. We evaluated its performance on simulated and real datasets with several taxonomic classification tools, and the results showed an improved sensitivity and F-measure, while maintaining high precision. ClassGraph is capable of improving the classification accuracy, especially in difficult cases such as virus and real datasets, where traditional tools can classify <40% of reads.
Collapse
Affiliation(s)
| | - Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
| |
Collapse
|
8
|
Mallawaarachchi V, Lin Y. Accurate Binning of Metagenomic Contigs Using Composition, Coverage, and Assembly Graphs. J Comput Biol 2022; 29:1357-1376. [DOI: 10.1089/cmb.2022.0262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Affiliation(s)
- Vijini Mallawaarachchi
- School of Computing, College of Engineering and Computer Science, Australian National University, Canberra, Australia
| | - Yu Lin
- School of Computing, College of Engineering and Computer Science, Australian National University, Canberra, Australia
| |
Collapse
|
9
|
Zhou Y, Liu M, Yang J. Recovering metagenome-assembled genomes from shotgun metagenomic sequencing data: methods, applications, challenges, and opportunities. Microbiol Res 2022; 260:127023. [DOI: 10.1016/j.micres.2022.127023] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2021] [Revised: 03/07/2022] [Accepted: 04/05/2022] [Indexed: 12/12/2022]
|
10
|
Storato D, Comin M. K2Mem: Discovering Discriminative K-mers From Sequencing Data for Metagenomic Reads Classification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:220-229. [PMID: 34606462 DOI: 10.1109/tcbb.2021.3117406] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
The major problem when analyzing a metagenomic sample is to taxonomically annotate its reads to identify the species they contain. Most of the methods currently available focus on the classification of reads using a set of reference genomes and their k-mers. While in terms of precision these methods have reached percentages of correctness close to perfection, in terms of recall (the actual number of classified reads) the performances fall at around 50%. One of the reasons is the fact that the sequences in a sample can be very different from the corresponding reference genome, e.g., viral genomes are highly mutated. To address this issue, in this paper we study the problem of metagenomic reads classification by improving the reference k-mers library with novel discriminative k-mers from the input sequencing reads. We evaluated the performance in different conditions against several other tools and the results showed an improved F-measure, especially when close reference genomes are not available. Availability: https://github.com.
Collapse
|
11
|
Yoshimura Y, Hamada A, Augey Y, Akiyama M, Sakakibara Y. Genomic style: yet another deep-learning approach to characterize bacterial genome sequences. BIOINFORMATICS ADVANCES 2021; 1:vbab039. [PMID: 36700086 PMCID: PMC9710696 DOI: 10.1093/bioadv/vbab039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/13/2021] [Revised: 11/15/2021] [Accepted: 11/26/2021] [Indexed: 01/28/2023]
Abstract
Motivation Biological sequence classification is the most fundamental task in bioinformatics analysis. For example, in metagenome analysis, binning is a typical type of DNA sequence classification. In order to classify sequences, it is necessary to define sequence features. The k-mer frequency, base composition and alignment-based metrics are commonly used. On the other hand, in the field of image recognition using machine learning, image classification is broadly divided into those based on shape and those based on style. A style matrix was introduced as a method of expressing the style of an image (e.g. color usage and texture). Results We propose a novel sequence feature, called genomic style, inspired by image classification approaches, for classifying and clustering DNA sequences. As with the style of images, the DNA sequence is considered to have a genomic style unique to the bacterial species, and the style matrix concept is applied to the DNA sequence. Our main aim is to introduce the genomics style as yet another basic sequence feature for metagenome binning problem in replace of the most commonly used sequence feature k-mer frequency. Performance evaluations showed that our method using a style matrix has the potential for accurate binning when compared with state-of-the-art binning tools based on k-mer frequency. Availability and implementation The source code for the implementation of this genomic style method, along with the dataset for the performance evaluation, is available from https://github.com/friendflower94/binning-style. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
- Yuka Yoshimura
- Department of Biosciences and Informatics, Keio University, Yokohama 223-8522, Japan
| | - Akifumi Hamada
- Department of Biosciences and Informatics, Keio University, Yokohama 223-8522, Japan
| | - Yohann Augey
- Department of Biosciences and Informatics, Keio University, Yokohama 223-8522, Japan
| | - Manato Akiyama
- Department of Biosciences and Informatics, Keio University, Yokohama 223-8522, Japan
| | - Yasubumi Sakakibara
- Department of Biosciences and Informatics, Keio University, Yokohama 223-8522, Japan,To whom correspondence should be addressed.
| |
Collapse
|
12
|
Music of metagenomics-a review of its applications, analysis pipeline, and associated tools. Funct Integr Genomics 2021; 22:3-26. [PMID: 34657989 DOI: 10.1007/s10142-021-00810-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2021] [Revised: 09/25/2021] [Accepted: 10/03/2021] [Indexed: 10/20/2022]
Abstract
This humble effort highlights the intricate details of metagenomics in a simple, poetic, and rhythmic way. The paper enforces the significance of the research area, provides details about major analytical methods, examines the taxonomy and assembly of genomes, emphasizes some tools, and concludes by celebrating the richness of the ecosystem populated by the "metagenome."
Collapse
|
13
|
Andreace F, Pizzi C, Comin M. MetaProb 2: Metagenomic Reads Binning Based on Assembly Using Minimizers and K-Mers Statistics. J Comput Biol 2021; 28:1052-1062. [PMID: 34448593 DOI: 10.1089/cmb.2021.0270] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022] Open
Abstract
Current technologies allow the sequencing of microbial communities directly from the environment without prior culturing. One of the major problems when analyzing a microbial sample is to taxonomically annotate its reads to identify the species it contains. The major difficulties of taxonomic analysis are the lack of taxonomically related genomes in existing reference databases, the uneven abundance ratio of species, and sequencing errors. Microbial communities can be studied with reads clustering, a process referred to as genome binning. In this study, we present MetaProb 2 an unsupervised genome binning method based on reads assembly and probabilistic k-mers statistics. The novelties of MetaProb 2 are the use of minimizers to efficiently assemble reads into unitigs and a community detection algorithm based on graph modularity to cluster unitigs and to detect representative unitigs. The effectiveness of MetaProb 2 is demonstrated in both simulated and real datasets in comparison with state-of-art binning tools such as MetaProb, AbundanceBin, Bimeta, and MetaCluster. On real datasets, it is the only one capable of producing promising results while being parsimonious with computational resources.
Collapse
Affiliation(s)
- Francesco Andreace
- Department of Information Engineering, University of Padova, Padova, Italy
| | - Cinzia Pizzi
- Department of Information Engineering, University of Padova, Padova, Italy
| | - Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
| |
Collapse
|
14
|
Mallawaarachchi VG, Wickramarachchi AS, Lin Y. Improving metagenomic binning results with overlapped bins using assembly graphs. Algorithms Mol Biol 2021; 16:3. [PMID: 33947431 PMCID: PMC8097841 DOI: 10.1186/s13015-021-00185-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2021] [Accepted: 04/20/2021] [Indexed: 11/18/2022] Open
Abstract
Background Metagenomic sequencing allows us to study the structure, diversity and ecology in microbial communities without the necessity of obtaining pure cultures. In many metagenomics studies, the reads obtained from metagenomics sequencing are first assembled into longer contigs and these contigs are then binned into clusters of contigs where contigs in a cluster are expected to come from the same species. As different species may share common sequences in their genomes, one assembled contig may belong to multiple species. However, existing tools for binning contigs only support non-overlapped binning, i.e., each contig is assigned to at most one bin (species). Results In this paper, we introduce GraphBin2 which refines the binning results obtained from existing tools and, more importantly, is able to assign contigs to multiple bins. GraphBin2 uses the connectivity and coverage information from assembly graphs to adjust existing binning results on contigs and to infer contigs shared by multiple species. Experimental results on both simulated and real datasets demonstrate that GraphBin2 not only improves binning results of existing tools but also supports to assign contigs to multiple bins. Conclusion GraphBin2 incorporates the coverage information into the assembly graph to refine the binning results obtained from existing binning tools. GraphBin2 also enables the detection of contigs that may belong to multiple species. We show that GraphBin2 outperforms its predecessor GraphBin on both simulated and real datasets. GraphBin2 is freely available at https://github.com/Vini2/GraphBin2. Supplementary Information The online version contains supplementary material available at 10.1186/s13015-021-00185-6.
Collapse
|
15
|
Mallawaarachchi V, Wickramarachchi A, Lin Y. GraphBin: refined binning of metagenomic contigs using assembly graphs. Bioinformatics 2020; 36:3307-3313. [PMID: 32167528 DOI: 10.1093/bioinformatics/btaa180] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2019] [Revised: 02/18/2020] [Accepted: 03/10/2020] [Indexed: 12/17/2022] Open
Abstract
MOTIVATION The field of metagenomics has provided valuable insights into the structure, diversity and ecology within microbial communities. One key step in metagenomics analysis is to assemble reads into longer contigs which are then binned into groups of contigs that belong to different species present in the metagenomic sample. Binning of contigs plays an important role in metagenomics and most available binning algorithms bin contigs using genomic features such as oligonucleotide/k-mer composition and contig coverage. As metagenomic contigs are derived from the assembly process, they are output from the underlying assembly graph which contains valuable connectivity information between contigs that can be used for binning. RESULTS We propose GraphBin, a new binning method that makes use of the assembly graph and applies a label propagation algorithm to refine the binning result of existing tools. We show that GraphBin can make use of the assembly graphs constructed from both the de Bruijn graph and the overlap-layout-consensus approach. Moreover, we demonstrate improved experimental results from GraphBin in terms of identifying mis-binned contigs and binning of contigs discarded by existing binning tools. To the best of our knowledge, this is the first time that the information from the assembly graph has been used in a tool for the binning of metagenomic contigs. AVAILABILITY AND IMPLEMENTATION The source code of GraphBin is available at https://github.com/Vini2/GraphBin. CONTACT vijini.mallawaarachchi@anu.edu.au or yu.lin@anu.edu.au. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Vijini Mallawaarachchi
- Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra ACT 0200, Australia
| | - Anuradha Wickramarachchi
- Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra ACT 0200, Australia
| | - Yu Lin
- Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra ACT 0200, Australia
| |
Collapse
|
16
|
Guerrini V, Louza FA, Rosone G. Metagenomic analysis through the extended Burrows-Wheeler transform. BMC Bioinformatics 2020; 21:299. [PMID: 32938362 PMCID: PMC7493373 DOI: 10.1186/s12859-020-03628-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2020] [Accepted: 06/22/2020] [Indexed: 11/10/2022] Open
Abstract
Background The development of Next Generation Sequencing (NGS) has had a major impact on the study of genetic sequences. Among problems that researchers in the field have to face, one of the most challenging is the taxonomic classification of metagenomic reads, i.e., identifying the microorganisms that are present in a sample collected directly from the environment. The analysis of environmental samples (metagenomes) are particularly important to figure out the microbial composition of different ecosystems and it is used in a wide variety of fields: for instance, metagenomic studies in agriculture can help understanding the interactions between plants and microbes, or in ecology, they can provide valuable insights into the functions of environmental communities. Results In this paper, we describe a new lightweight alignment-free and assembly-free framework for metagenomic classification that compares each unknown sequence in the sample to a collection of known genomes. We take advantage of the combinatorial properties of an extension of the Burrows-Wheeler transform, and we sequentially scan the required data structures, so that we can analyze unknown sequences of large collections using little internal memory. The tool LiME (Lightweight Metagenomics via eBWT) is available at https://github.com/veronicaguerrini/LiME. Conclusions In order to assess the reliability of our approach, we run several experiments on NGS data from two simulated metagenomes among those provided in benchmarking analysis and on a real metagenome from the Human Microbiome Project. The experiment results on the simulated data show that LiME is competitive with the widely used taxonomic classifiers. It achieves high levels of precision and specificity – e.g. 99.9% of the positive control reads are correctly assigned and the percentage of classified reads of the negative control is less than 0.01% – while keeping a high sensitivity. On the real metagenome, we show that LiME is able to deliver classification results comparable to that of MagicBlast. Overall, the experiments confirm the effectiveness of our method and its high accuracy even in negative control samples.
Collapse
Affiliation(s)
- Veronica Guerrini
- Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo, 3, Pisa, Italy
| | - Felipe A Louza
- Faculty of Electrical Engineering, Federal University of Uberlândia, Uberlândia, Brazil
| | - Giovanna Rosone
- Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo, 3, Pisa, Italy.
| |
Collapse
|
17
|
Wickramarachchi A, Mallawaarachchi V, Rajan V, Lin Y. MetaBCC-LR: metagenomics binning by coverage and composition for long reads. Bioinformatics 2020; 36:i3-i11. [PMID: 32657364 PMCID: PMC7355282 DOI: 10.1093/bioinformatics/btaa441] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
MOTIVATION Metagenomics studies have provided key insights into the composition and structure of microbial communities found in different environments. Among the techniques used to analyse metagenomic data, binning is considered a crucial step to characterize the different species of micro-organisms present. The use of short-read data in most binning tools poses several limitations, such as insufficient species-specific signal, and the emergence of long-read sequencing technologies offers us opportunities to surmount them. However, most current metagenomic binning tools have been developed for short reads. The few tools that can process long reads either do not scale with increasing input size or require a database with reference genomes that are often unknown. In this article, we present MetaBCC-LR, a scalable reference-free binning method which clusters long reads directly based on their k-mer coverage histograms and oligonucleotide composition. RESULTS We evaluate MetaBCC-LR on multiple simulated and real metagenomic long-read datasets with varying coverages and error rates. Our experiments demonstrate that MetaBCC-LR substantially outperforms state-of-the-art reference-free binning tools, achieving ∼13% improvement in F1-score and ∼30% improvement in ARI compared to the best previous tools. Moreover, we show that using MetaBCC-LR before long-read assembly helps to enhance the assembly quality while significantly reducing the assembly cost in terms of time and memory usage. The efficiency and accuracy of MetaBCC-LR pave the way for more effective long-read-based metagenomics analyses to support a wide range of applications. AVAILABILITY AND IMPLEMENTATION The source code is freely available at: https://github.com/anuradhawick/MetaBCC-LR. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Anuradha Wickramarachchi
- Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra, ACT 0200, Australia
| | - Vijini Mallawaarachchi
- Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra, ACT 0200, Australia
| | - Vaibhav Rajan
- Department of Information Systems and Analytics, School of Computing, National University of Singapore, Singapore 117417, Singapore
| | - Yu Lin
- Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra, ACT 0200, Australia
| |
Collapse
|
18
|
Comin M, Di Camillo B, Pizzi C, Vandin F. Comparison of microbiome samples: methods and computational challenges. Brief Bioinform 2020; 22:88-95. [PMID: 32577746 DOI: 10.1093/bib/bbaa121] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2019] [Revised: 05/09/2020] [Accepted: 05/18/2020] [Indexed: 12/14/2022] Open
Abstract
The study of microbial communities crucially relies on the comparison of metagenomic next-generation sequencing data sets, for which several methods have been designed in recent years. Here, we review three key challenges in the comparison of such data sets: species identification and quantification, the efficient computation of distances between metagenomic samples and the identification of metagenomic features associated with a phenotype such as disease status. We present current solutions for such challenges, considering both reference-based methods relying on a database of reference genomes and reference-free methods working directly on all sequencing reads from the samples.
Collapse
|
19
|
Carr VR, Shkoporov A, Hill C, Mullany P, Moyes DL. Probing the Mobilome: Discoveries in the Dynamic Microbiome. Trends Microbiol 2020; 29:158-170. [PMID: 32448763 DOI: 10.1016/j.tim.2020.05.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2020] [Revised: 04/30/2020] [Accepted: 05/05/2020] [Indexed: 02/06/2023]
Abstract
There has been an explosion of metagenomic data representing human, animal, and environmental microbiomes. This provides an unprecedented opportunity for comparative and longitudinal studies of many functional aspects of the microbiome that go beyond taxonomic classification, such as profiling genetic determinants of antimicrobial resistance, interactions with the host, potentially clinically relevant functions, and the role of mobile genetic elements (MGEs). One of the most important but least studied of these aspects are the MGEs, collectively referred to as the 'mobilome'. Here we elaborate on the benefits and limitations of using different metagenomic protocols, discuss the relative merits of various sequencing technologies, and highlight relevant bioinformatics tools and pipelines to predict the presence of MGEs and their microbial hosts.
Collapse
Affiliation(s)
- Victoria R Carr
- Centre for Host-Microbiome Interactions, Faculty of Dentistry, Oral and Craniofacial Sciences, King's College London, London, UK; The Alan Turing Institute, British Library, London, UK.
| | - Andrey Shkoporov
- APC Microbiome Ireland, School of Microbiology, University College Cork, Cork, Ireland
| | - Colin Hill
- APC Microbiome Ireland, School of Microbiology, University College Cork, Cork, Ireland
| | - Peter Mullany
- Eastman Dental Institute, University College London, London, UK
| | - David L Moyes
- Centre for Host-Microbiome Interactions, Faculty of Dentistry, Oral and Craniofacial Sciences, King's College London, London, UK.
| |
Collapse
|
20
|
Li K, Lu Y, Deng L, Wang L, Shi L, Wang Z. Deconvolute individual genomes from metagenome sequences through short read clustering. PeerJ 2020; 8:e8966. [PMID: 32296615 PMCID: PMC7150542 DOI: 10.7717/peerj.8966] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2019] [Accepted: 03/24/2020] [Indexed: 12/17/2022] Open
Abstract
Metagenome assembly from short next-generation sequencing data is a challenging process due to its large scale and computational complexity. Clustering short reads by species before assembly offers a unique opportunity for parallel downstream assembly of genomes with individualized optimization. However, current read clustering methods suffer either false negative (under-clustering) or false positive (over-clustering) problems. Here we extended our previous read clustering software, SpaRC, by exploiting statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using synthetic and real-world datasets we demonstrated that this method has the potential to cluster almost all of the short reads from genomes with sufficient sequencing coverage. The improved read clustering in turn leads to improved downstream genome assembly quality.
Collapse
Affiliation(s)
- Kexue Li
- School of Mechanics Engineering and Automation, Shanghai University, Shanghai, China.,Shanghai Key Laboratory of Power Station Automation Technology, Shanghai, China
| | - Yakang Lu
- School of Mechanics Engineering and Automation, Shanghai University, Shanghai, China.,Shanghai Key Laboratory of Power Station Automation Technology, Shanghai, China
| | - Li Deng
- School of Mechanics Engineering and Automation, Shanghai University, Shanghai, China.,Shanghai Key Laboratory of Power Station Automation Technology, Shanghai, China.,Department of Energy, Joint Genome Institute, Walnut Creek, CA, USA
| | - Lili Wang
- School of Mechanics Engineering and Automation, Shanghai University, Shanghai, China.,Shanghai Key Laboratory of Power Station Automation Technology, Shanghai, China
| | - Lizhen Shi
- Department of Computer Science, Florida State University, Tallahassee, FL, USA
| | - Zhong Wang
- Department of Energy, Joint Genome Institute, Walnut Creek, CA, USA.,Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.,School of Natural Sciences, University of California at Merced, Merced, CA, USA
| |
Collapse
|
21
|
Dovrolis N, Kolios G, Spyrou GM, Maroulakou I. Computational profiling of the gut-brain axis: microflora dysbiosis insights to neurological disorders. Brief Bioinform 2020; 20:825-841. [PMID: 29186317 DOI: 10.1093/bib/bbx154] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2017] [Revised: 10/17/2017] [Indexed: 12/14/2022] Open
Abstract
Almost 2500 years after Hippocrates' observations on health and its direct association to the gastrointestinal tract, a paradigm shift has recently occurred, making the gut and its symbionts (bacteria, fungi, archaea and viruses) a point of convergence for studies. It is nowadays well established that the gut microflora's compositional diversity regulates via its genes (the microbiome) the host's health and provides preliminary insights into disease progression and regulation. The microbiome's involvement is evident in immunological and physiological studies that link changes in its biodiversity to its contributions to the host's phenotype but also in neurological investigations, substantiating the aptly named gut-brain axis. The definitive mechanisms of this last bidirectional interaction will be our main focus because it presents researchers with a new conundrum. In this review, we prospect current literature for computational analysis methodologies that accommodate the need for better understanding of the microbiome-gut-brain interactions and neurological disorder onset and progression, through cross-disciplinary systems biology applications. We will present bioinformatics tools used in exploring these synergies that help build and interpret microbial 16S ribosomal RNA data sets, produced by shotgun and high-throughput sequencing of healthy and neurological disorder samples stored in biological databases. These approaches provide alternative means for researchers to form hypotheses to their inquests faster, cheaper and swith precision. The goal of these studies relies on the integration of combined metagenomics and metabolomics assessments. An accurate characterization of the microbiome and its functionality can support new diagnostic, prognostic and therapeutic strategies for neurological disorders, customized for each individual host.
Collapse
|
22
|
Kyrgyzov O, Prost V, Gazut S, Farcy B, Brüls T. Binning unassembled short reads based on k-mer abundance covariance using sparse coding. Gigascience 2020; 9:giaa028. [PMID: 32219339 PMCID: PMC7099633 DOI: 10.1093/gigascience/giaa028] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2019] [Revised: 03/06/2020] [Accepted: 03/10/2020] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Sequence-binning techniques enable the recovery of an increasing number of genomes from complex microbial metagenomes and typically require prior metagenome assembly, incurring the computational cost and drawbacks of the latter, e.g., biases against low-abundance genomes and inability to conveniently assemble multi-terabyte datasets. RESULTS We present here a scalable pre-assembly binning scheme (i.e., operating on unassembled short reads) enabling latent genome recovery by leveraging sparse dictionary learning and elastic-net regularization, and its use to recover hundreds of metagenome-assembled genomes, including very low-abundance genomes, from a joint analysis of microbiomes from the LifeLines DEEP population cohort (n = 1,135, >1010 reads). CONCLUSION We showed that sparse coding techniques can be leveraged to carry out read-level binning at large scale and that, despite lower genome reconstruction yields compared to assembly-based approaches, bin-first strategies can complement the more widely used assembly-first protocols by targeting distinct genome segregation profiles. Read enrichment levels across 6 orders of magnitude in relative abundance were observed, indicating that the method has the power to recover genomes consistently segregating at low levels.
Collapse
Affiliation(s)
- Olexiy Kyrgyzov
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Université Paris-Saclay, 2 rue Gaston Crémieux, 91057 Evry, France
| | - Vincent Prost
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Université Paris-Saclay, 2 rue Gaston Crémieux, 91057 Evry, France
- Laboratoire Sciences des Données et de la Décision, LIST, CEA, Bâtiment 565, 91191 Gif-sur-Yvette, France
| | - Stéphane Gazut
- Laboratoire Sciences des Données et de la Décision, LIST, CEA, Bâtiment 565, 91191 Gif-sur-Yvette, France
| | - Bruno Farcy
- Atos Bull Technologies, 68 avenue Jean Jaurès, 78340 Les Clayes-sous-Bois, France
| | - Thomas Brüls
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Université Paris-Saclay, 2 rue Gaston Crémieux, 91057 Evry, France
| |
Collapse
|
23
|
Petrucci E, Noé L, Pizzi C, Comin M. Iterative Spaced Seed Hashing: Closing the Gap Between Spaced Seed Hashing and k-mer Hashing. J Comput Biol 2020; 27:223-233. [PMID: 31800307 DOI: 10.1089/cmb.2019.0298] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open
Abstract
Alignment-free classification of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Much work has been done to speed up the indexing of k-mers through hash-table and other data structures. These efforts have led to very fast indexes, but because they are k-mer based, they often lack sensitivity due to sequencing errors or polymorphisms. Spaced seeds are a special type of pattern that accounts for errors or mutations. They allow to improve the sensitivity and they are now routinely used instead of k-mers in many applications. The major drawback of spaced seeds is that they cannot be efficiently hashed and thus their usage increases substantially the computational time. In this article we address the problem of efficient spaced seed hashing. We propose an iterative algorithm that combines multiple spaced seed hashes by exploiting the similarity of adjacent hash values to efficiently compute the next hash. We report a series of experiments on HTS reads hashing, with several spaced seeds. Our algorithm can compute the hashing values of spaced seeds with a speedup in range of [3.5 × -7 × ], outperforming previous methods. Software and data sets are available at Iterative Spaced Seed Hashing.
Collapse
Affiliation(s)
- Enrico Petrucci
- Department of Information Engineering, University of Padova, Padova, Italy
| | - Laurent Noé
- CRIStAL UMR9189, Universit de Lille, Lille, France
| | - Cinzia Pizzi
- Department of Information Engineering, University of Padova, Padova, Italy
| | - Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
| |
Collapse
|
24
|
Pellegrina L, Pizzi C, Vandin F. Fast Approximation of Frequent k-Mers and Applications to Metagenomics. J Comput Biol 2019; 27:534-549. [PMID: 31891535 DOI: 10.1089/cmb.2019.0314] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023] Open
Abstract
Estimating the abundances of all k-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. Although several methods have been designed for the exact or approximate solution of this problem, they all require to process the entire data set, which can be extremely expensive for high-throughput sequencing data sets. Although in some applications it is crucial to estimate all k-mers and their abundances, in other situations it may be sufficient to report only frequent k-mers, which appear with relatively high frequency in a data set. This is the case, for example, in the computation of k-mers' abundance-based distances among data sets of reads, commonly used in metagenomic analyses. In this study, we develop, analyze, and test a sampling-based approach, called Sampling Algorithm for K-mErs approxIMAtion (SAKEIMA), to approximate the frequent k-mers and their frequencies in a high-throughput sequencing data set while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme and we show how the characterization of the Vapnik-Chervonenkis dimension, a core concept from statistical learning theory, of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA allows to rigorously approximate frequent k-mers by processing only a fraction of a data set and that the frequencies estimated by SAKEIMA lead to accurate estimates of k-mer-based distances between high-throughput sequencing data sets. Overall, SAKEIMA is an efficient and rigorous tool to estimate k-mers' abundances providing significant speedups in the analysis of large sequencing data sets.
Collapse
Affiliation(s)
| | - Cinzia Pizzi
- Department of Information Engineering, University of Padova, Padova, Italy
| | - Fabio Vandin
- Department of Information Engineering, University of Padova, Padova, Italy
| |
Collapse
|
25
|
Pellegrina L, Pizzi C, Vandin F. Fast Approximation of Frequent k-mers and Applications to Metagenomics. LECTURE NOTES IN COMPUTER SCIENCE 2019. [DOI: 10.1007/978-3-030-17083-7_13] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
|
26
|
Benavides A, Isaza JP, Niño-García JP, Alzate JF, Cabarcas F. CLAME: a new alignment-based binning algorithm allows the genomic description of a novel Xanthomonadaceae from the Colombian Andes. BMC Genomics 2018; 19:858. [PMID: 30537931 PMCID: PMC6288851 DOI: 10.1186/s12864-018-5191-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background Hot spring bacteria have unique biological adaptations to survive the extreme conditions of these environments; these bacteria produce thermostable enzymes that can be used in biotechnological and industrial applications. However, sequencing these bacteria is complex, since it is not possible to culture them. As an alternative, genome shotgun sequencing of whole microbial communities can be used. The problem is that the classification of sequences within a metagenomic dataset is very challenging particularly when they include unknown microorganisms since they lack genomic reference. We failed to recover a bacterium genome from a hot spring metagenome using the available software tools, so we develop a new tool that allowed us to recover most of this genome. Results We present a proteobacteria draft genome reconstructed from a Colombian’s Andes hot spring metagenome. The genome seems to be from a new lineage within the family Rhodanobacteraceae of the class Gammaproteobacteria, closely related to the genus Dokdonella. We were able to generate this genome thanks to CLAME. CLAME, from Spanish “CLAsificador MEtagenomico”, is a tool to group reads in bins. We show that most reads from each bin belong to a single chromosome. CLAME is very effective recovering most of the reads belonging to the predominant species within a metagenome. Conclusions We developed a tool that can be used to extract genomes (or parts of them) from a complex metagenome. Electronic supplementary material The online version of this article (10.1186/s12864-018-5191-y) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Andres Benavides
- Grupo SISTEMIC, Ingeniería Electrónica, Facultad de Ingeniería, Universidad de Antioquia UdeA, Calle 70 No, 52-21, Medellín, Colombia.
| | - Juan Pablo Isaza
- Centro Nacional de Secuenciación Genómica-CNSG, Sede de Investigación Universitaria-SIU, Universidad de Antioquia UdeA, Calle 70 No, 52-21, Medellín, Colombia.,Grupo de Parasitología, Departamento de Microbiología y Parasitología, Facultad de Medicina, Universidad de Antioquia UdeA, Calle 70 No, 52-21, Medellín, Colombia
| | - Juan Pablo Niño-García
- Escuela de Microbiología, Universidad de Antioquia UdeA, Calle 70 No, 52-21, Medellín, Colombia
| | - Juan Fernando Alzate
- Centro Nacional de Secuenciación Genómica-CNSG, Sede de Investigación Universitaria-SIU, Universidad de Antioquia UdeA, Calle 70 No, 52-21, Medellín, Colombia.,Grupo de Parasitología, Departamento de Microbiología y Parasitología, Facultad de Medicina, Universidad de Antioquia UdeA, Calle 70 No, 52-21, Medellín, Colombia
| | - Felipe Cabarcas
- Grupo SISTEMIC, Ingeniería Electrónica, Facultad de Ingeniería, Universidad de Antioquia UdeA, Calle 70 No, 52-21, Medellín, Colombia.,Centro Nacional de Secuenciación Genómica-CNSG, Sede de Investigación Universitaria-SIU, Universidad de Antioquia UdeA, Calle 70 No, 52-21, Medellín, Colombia
| |
Collapse
|
27
|
Abstract
Background Spaced-seeds, i.e. patterns in which some fixed positions are allowed to be wild-cards, play a crucial role in several bioinformatics applications involving substrings counting and indexing, by often providing better sensitivity with respect to k-mers based approaches. K-mers based approaches are usually fast, being based on efficient hashing and indexing that exploits the large overlap between consecutive k-mers. Spaced-seeds hashing is not as straightforward, and it is usually computed from scratch for each position in the input sequence. Recently, the FSH (Fast Spaced seed Hashing) approach was proposed to improve the time required for computation of the spaced seed hashing of DNA sequences with a speed-up of about 1.5 with respect to standard hashing computation. Results In this work we propose a novel algorithm, Fast Indexing for Spaced seed Hashing (FISH), based on the indexing of small blocks that can be combined to obtain the hashing of spaced-seeds of any length. The method exploits the fast computation of the hashing of runs of consecutive 1 in the spaced seeds, that basically correspond to k-mer of the length of the run. Conclusions We run several experiments, on NGS data from simulated and synthetic metagenomic experiments, to assess the time required for the computation of the hashing for each position in each read with respect to several spaced seeds. In our experiments, FISH can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, between 1.9x to 6.03x, depending on the structure of the spaced seeds. Electronic supplementary material The online version of this article (10.1186/s12859-018-2415-8) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Samuele Girotto
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
| | - Matteo Comin
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy.
| | - Cinzia Pizzi
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy.
| |
Collapse
|
28
|
Ariza-Jimenez L, Quintero OL, Pinel N. Unsupervised fuzzy binning of metagenomic sequence fragments on three-dimensional Barnes-Hut t-Stochastic Neighbor Embeddings. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2018; 2018:1315-1318. [PMID: 30440633 DOI: 10.1109/embc.2018.8512529] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Shotgun metagenomic studies attempt to reconstruct population genome sequences from complex microbial communities. In some traditional genome demarcation approaches, high-dimensional sequence data are embedded into two-dimensional spaces and subsequently binned into candidate genomic populations. One such approach uses a combination of the Barnes-Hut approximation and the $t -$Stochastic Neighbor Embedding (BH-SNE) algorithm for dimensionality reduction of DNA sequence data pentamer profiles; and demarcation of groups based on Gaussian mixture models within humanimposed boundaries. We found that genome demarcation from three-dimensional BH-SNE embeddings consistently results in more accurate binnings than 2-D embeddings. We further addressed the lack of a priori population number information by developing an unsupervised binning approach based on the Subtractive and Fuzzy c-means (FCM) clustering algorithms combined with internal clustering validity indices. Lastly, we addressed the subject of shared membership of individual data objects in a mixed community by assigning a degree of membership to individual objects using the FCM algorithm, and discriminated between confidently binned and uncertain sequence data objects from the community for subsequent biological interpretation. The binning of metagenome sequence fragments according to thresholds in the degree of membership opens the door for the identification of horizontally transferred elements and other genomic regions of uncertain assignment in which biologically meaningful information resides. The reported approach improves the unsupervised genome demarcation of populations within complex communities, increases the confidence in the coherence of the binned elements, and enables the identification of evolutionary processes ignored in hard-binning approaches in shotgun metagenomic studies.
Collapse
|
29
|
Qu W, Lin D, Zhang Z, Di W, Gao B, Zeng R. Metagenomics Investigation of Agarlytic Genes and Genomes in Mangrove Sediments in China: A Potential Repertory for Carbohydrate-Active Enzymes. Front Microbiol 2018; 9:1864. [PMID: 30177916 PMCID: PMC6109693 DOI: 10.3389/fmicb.2018.01864] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2018] [Accepted: 07/25/2018] [Indexed: 12/31/2022] Open
Abstract
Monosaccharides and oligosaccharides produced by agarose degradation exhibit potential in the fields of bioenergy, medicine, and cosmetics. Mangrove sediments (MGSs) provide a special environment to enrich enzymes for agarose degradation. However, representative investigations of the agarlytic genes in MGSs have been rarely reported. In this study, agarlytic genes in MGSs were researched in detail from the aspects of diversity, abundance, activity, and location through deep metagenomics sequencing. Functional genes in MGSs were usually incomplete but were shown as results, which could cause virtually high number of results in previous studies because multiple fragmented sequences could originate from the same genes. In our work, only complete and nonredundant (CNR) genes were analyzed to avoid virtually high amount of the results. The number of CNR agarlytic genes in our datasets was significantly higher than that in the datasets of previous studies. Twenty-one recombinant agarases with agarose-degrading activity were detected using heterologous expression based on numerous complete open-reading frames, which are rarely obtained in metagenomics sequencing of samples with complex microbial communities, such as MGSs. Aga2, which had the highest crude enzyme activity among the 21 recombinant agarases, was further purified and subjected to enzymatic characterization. With its high agarose-degrading activity, resistance to temperature changes and chemical agents, Aga2 could be a suitable option for industrial production. The agarase ratio with signal peptides to that without signal peptides in our MGS datasets was lower than that of other reported agarases. Six draft genomes, namely, Clusters 1-6, were recovered from the datasets. The taxonomic annotation of these genomes revealed that Clusters 1, 3, 5, and 6 were annotated as Desulfuromonas sp., Treponema sp., Ignavibacteriales spp., and Polyangiaceae spp., respectively. Meanwhile, Clusters 2 and 4 were potential new species. All these genomes were first reported and found to have abilities of degrading various important polysaccharides. The metabolic pathway of agarose in Cluster 4 was also speculated. Our results showed the capacity and activity of agarases in the MGS microbiome, and MGSs exert potential as a repertory for mining not only agarlytic genes but also almost all genes of the carbohydrate-active enzyme family.
Collapse
Affiliation(s)
- Wu Qu
- School of Life Sciences, Xiamen University, Xiamen, China
| | - Dan Lin
- Novogene Bioinformatics Technology Co. Ltd., Tianjin, China
| | - Zhouhao Zhang
- Novogene Bioinformatics Technology Co. Ltd., Tianjin, China
| | - Wenjie Di
- Key Laboratory of Marine Genetic Resources, Third Institute of Oceanography, State Oceanic Administration, Xiamen, China
| | - Boliang Gao
- School of Life Sciences, Xiamen University, Xiamen, China
| | - Runying Zeng
- Key Laboratory of Marine Genetic Resources, Third Institute of Oceanography, State Oceanic Administration, Xiamen, China.,Key Laboratory of Marine Genetic Resources, Xiamen, China
| |
Collapse
|
30
|
Girotto S, Comin M, Pizzi C. FSH: fast spaced seed hashing exploiting adjacent hashes. Algorithms Mol Biol 2018; 13:8. [PMID: 29588651 PMCID: PMC5863468 DOI: 10.1186/s13015-018-0125-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2017] [Accepted: 03/12/2018] [Indexed: 01/27/2023] Open
Abstract
Background Patterns with wildcards in specified positions, namely spaced seeds, are increasingly used instead of k-mers in many bioinformatics applications that require indexing, querying and rapid similarity search, as they can provide better sensitivity. Many of these applications require to compute the hashing of each position in the input sequences with respect to the given spaced seed, or to multiple spaced seeds. While the hashing of k-mers can be rapidly computed by exploiting the large overlap between consecutive k-mers, spaced seeds hashing is usually computed from scratch for each position in the input sequence, thus resulting in slower processing. Results The method proposed in this paper, fast spaced-seed hashing (FSH), exploits the similarity of the hash values of spaced seeds computed at adjacent positions in the input sequence. In our experiments we compute the hash for each positions of metagenomics reads from several datasets, with respect to different spaced seeds. We also propose a generalized version of the algorithm for the simultaneous computation of multiple spaced seeds hashing. In the experiments, our algorithm can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, between 1.6\documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\times$$\end{document}× to 5.3\documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\times$$\end{document}×, depending on the structure of the spaced seed. Conclusions Spaced seed hashing is a routine task for several bioinformatics application. FSH allows to perform this task efficiently and raise the question of whether other hashing can be exploited to further improve the speed up. This has the potential of major impact in the field, making spaced seed applications not only accurate, but also faster and more efficient. Availability The software FSH is freely available for academic use at: https://bitbucket.org/samu661/fsh/overview.
Collapse
|
31
|
Qian J, Marchiori D, Comin M. Fast and Sensitive Classification of Short Metagenomic Reads with SKraken. BIOMEDICAL ENGINEERING SYSTEMS AND TECHNOLOGIES 2018. [DOI: 10.1007/978-3-319-94806-5_12] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/08/2023]
|
32
|
Girotto S, Comin M, Pizzi C. Higher recall in metagenomic sequence classification exploiting overlapping reads. BMC Genomics 2017; 18:917. [PMID: 29244002 PMCID: PMC5731601 DOI: 10.1186/s12864-017-4273-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background In recent years several different fields, such as ecology, medicine and microbiology, have experienced an unprecedented development due to the possibility of direct sequencing of microbioimic samples. Among problems that researchers in the field have to deal with, taxonomic classification of metagenomic reads is one of the most challenging. State of the art methods classify single reads with almost 100% precision. However, very often, the performance in terms of recall falls at about 50%. As a consequence, state-of-the-art methods are indeed capable of correctly classify only half of the reads in the sample. How to achieve better performances in terms of overall quality of classification remains a largely unsolved problem. Results In this paper we propose a method for metagenomics CLassification Improvement with Overlapping Reads (CLIOR), that exploits the information carried by the overlapping reads graph of the input read dataset to improve recall, f-measure, and the estimated abundance of species. In this work, we applied CLIOR on top of the classification produced by the classifier Clark-l. Experiments on simulated and synthetic metagenomes show that CLIOR can lead to substantial improvement of the recall rate, sometimes doubling it. On average, on simulated datasets, the increase of recall is paired with an higher precision too, while on synthetic datasets it comes at expenses of a small loss of precision. On experiments on real metagenomes CLIOR is able to assign many more reads while keeping the abundance ratios in line with previous studies. Conclusions Our results showed that with CLIOR is possible to boost the recall of a state-of-the-art metagenomic classifier by inferring and/or correcting the assignment of reads with missing or erroneous labeling. CLIOR is not restricted to the reads classification algorithm used in our experiments, but it may be applied to other methods too. Finally, CLIOR does not need large computational resources, and it can be run on a laptop. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-4273-6) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Samuele Girotto
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, 35131, Italy
| | - Matteo Comin
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, 35131, Italy.
| | - Cinzia Pizzi
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, 35131, Italy.
| |
Collapse
|
33
|
Kanj S, Brüls T, Gazut S. Shared Nearest Neighbor Clustering in a Locality Sensitive Hashing Framework. J Comput Biol 2017; 25:236-250. [PMID: 28953425 DOI: 10.1089/cmb.2017.0113] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
We present a new algorithm to cluster high-dimensional sequence data and its application to the field of metagenomics, which aims at reconstructing individual genomes from a mixture of genomes sampled from an environmental site, without any prior knowledge of reference data (genomes) or the shape of clusters. Such problems typically cannot be solved directly with classical approaches seeking to estimate the density of clusters, for example, using the shared nearest neighbors (SNN) rule, due to the prohibitive size of contemporary sequence datasets. We explore here a new approach based on combining the SNN rule with the concept of locality sensitive hashing (LSH). The proposed method, called LSH-SNN, works by randomly splitting the input data into smaller-sized subsets (buckets) and employing the SNN rule on each of these buckets. Links can be created among neighbors sharing a sufficient number of elements, hence allowing clusters to be grown from linked elements. LSH-SNN can scale up to larger datasets consisting of millions of sequences, while achieving high accuracy across a variety of sample sizes and complexities.
Collapse
Affiliation(s)
- Sawsan Kanj
- 1 CEA , Genoscope, Evry, France .,2 CEA, LIST, Laboratoire d'Analyse de Données et Intelligence des Systèmes, Gif-sur-Yvette, France .,3 Université d'Evry , Evry, France .,4 CNRS-UMR 8030 , Evry, France .,5 Université Paris-Saclay , Evry, France
| | - Thomas Brüls
- 1 CEA , Genoscope, Evry, France .,3 Université d'Evry , Evry, France .,4 CNRS-UMR 8030 , Evry, France .,5 Université Paris-Saclay , Evry, France
| | - Stéphane Gazut
- 2 CEA, LIST, Laboratoire d'Analyse de Données et Intelligence des Systèmes, Gif-sur-Yvette, France
| |
Collapse
|