1
Ruperao P, Rangan P, Shah T, Sharma V, Rathore A, Mayes S, Pandey MK. Developing pangenomes for large and complex plant genomes and their representation formats. J Adv Res 2025:S2090-1232(25)00071-2. [PMID: 39894347 DOI: 10.1016/j.jare.2025.01.052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2024] [Revised: 01/27/2025] [Accepted: 01/27/2025] [Indexed: 02/04/2025] [Open Access]
Abstract
BACKGROUND The development of pangenomes has revolutionized genomic studies by capturing the complete genetic diversity within a species. Pangenome assembly integrates data from multiple individuals to construct a comprehensive genomic landscape, revealing both core and accessory genomic elements. This approach enables the identification of novel genes, structural variations, and gene presence-absence variations, providing insights into species evolution, adaptation, and trait variation. Representing pangenomes requires innovative visualization formats that effectively convey the complex genomic structures and variations. AIM This review delves into contemporary methodologies and recent advancements in constructing pangenomes, particularly in plant genomes. It examines the structure of pangenome representation, including format comparison, conversion, visualization techniques, and their implications for enhancing crop improvement strategies. KEY SCIENTIFIC CONCEPTS OF REVIEW Earlier comparative studies have illuminated novel gene sequences, copy number variations, and presence-absence variations across diverse crop species. The concept of a pan-genome, which captures multiple genetic variations from a broad spectrum of genotypes, offers a holistic perspective of a species' genetic makeup. However, constructing a pan-genome for plants with larger genomes poses challenges, including managing vast genome sequence data and comprehending the genetic variations within the germplasm. To address these challenges, researchers have explored cost-effective alternatives to encapsulate species diversity in a single assembly known as a pangenome. This involves reducing the volume of genome sequences while focusing on genetic variations. With the growing prominence of the pan-genome concept in plant genomics, several software tools have emerged to facilitate pangenome construction. 
This review sheds light on developing and utilizing software tools tailored for constructing pan-genomes in plants. It also discusses representation formats suitable for downstream analyses, offering valuable insights into the genetic landscape and evolutionary dynamics of plant species. In summary, this review underscores the significance of pan-genome construction and representation formats in resolving the genetic architecture of plants, particularly those with complex genomes. It provides a comprehensive overview of recent advancements, aiding in exploring and understanding plant genetic diversity.
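The split between core and accessory elements described in this abstract can be illustrated with a small sketch. It assumes a gene presence-absence table is already available as a mapping from accession name to the set of genes found in it. The accession and gene names below are hypothetical toy data, and the function is an illustration only, not a method taken from the review.

```python
from typing import Dict, Set


def classify_genes(pav: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split genes into core (present in every accession) and
    accessory (absent from at least one accession)."""
    if not pav:
        return {"core": set(), "accessory": set()}
    all_genes = set().union(*pav.values())   # every gene seen anywhere
    core = set.intersection(*pav.values())   # genes shared by all accessions
    return {"core": core, "accessory": all_genes - core}


# Hypothetical presence-absence data for three accessions.
pav = {
    "accession_A": {"gene1", "gene2", "gene3"},
    "accession_B": {"gene1", "gene2"},
    "accession_C": {"gene1", "gene3", "gene4"},
}

result = classify_genes(pav)
print(sorted(result["core"]))       # ['gene1']
print(sorted(result["accessory"]))  # ['gene2', 'gene3', 'gene4']
```

Real pangenome pipelines usually relax the core definition (for example, "present in at least 95% of accessions") to tolerate assembly and annotation gaps; the strict intersection here is the simplest possible choice.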
Affiliation(s)
- Pradeep Ruperao
  - Center of Excellence in Genomics and Systems Biology (CEGSB) and Center for Pre-Breeding Research (CPBR), International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
- Parimalan Rangan
  - ICAR-National Bureau of Plant Genetic Resources (NBPGR), New Delhi, India; Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, St Lucia, Australia
- Trushar Shah
  - International Institute of Tropical Agriculture (IITA), Nairobi, Kenya
- Vinay Sharma
  - Center of Excellence in Genomics and Systems Biology (CEGSB) and Center for Pre-Breeding Research (CPBR), International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
- Abhishek Rathore
  - International Maize and Wheat Improvement Center (CIMMYT), Nairobi, Kenya
- Sean Mayes
  - Center of Excellence in Genomics and Systems Biology (CEGSB) and Center for Pre-Breeding Research (CPBR), International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
- Manish K Pandey
  - Center of Excellence in Genomics and Systems Biology (CEGSB) and Center for Pre-Breeding Research (CPBR), International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
2
Fulke AB, Eranezhath S, Raut S, Jadhav HS. Recent toolset of metagenomics for taxonomical and functional annotation of marine associated viruses: A review. Regional Studies in Marine Science 2024; 77:103728. [DOI: 10.1016/j.rsma.2024.103728] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/02/2025]
3
Jiang L, Quail MA, Fraser-Govil J, Wang H, Shi X, Oliver K, Mellado Gomez E, Yang F, Ning Z. The Bioinformatic Applications of Hi-C and Linked Reads. Genomics, Proteomics & Bioinformatics 2024; 22:qzae048. [PMID: 38905513 PMCID: PMC11580686 DOI: 10.1093/gpbjnl/qzae048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2022] [Revised: 05/07/2024] [Accepted: 06/19/2024] [Indexed: 06/23/2024]
Abstract
Long-range sequencing grants insight into additional genetic information beyond what can be accessed by both short reads and modern long-read technology. Several new sequencing technologies, such as "Hi-C" and "Linked Reads", produce long-range datasets for high-throughput and high-resolution genome analyses, which are rapidly advancing the field of genome assembly, genome scaffolding, and more comprehensive variant identification. In this review, we focused on five major long-range sequencing technologies: high-throughput chromosome conformation capture (Hi-C), 10X Genomics Linked Reads, haplotagging, transposase enzyme linked long-read sequencing (TELL-seq), and single-tube long fragment read (stLFR). We detailed the mechanisms and data products of the five platforms and their important applications, evaluated the quality of sequencing data from different platforms, and discussed the currently available bioinformatics tools. This work will benefit the selection of appropriate long-range technology for specific biological studies.
Affiliation(s)
- Libo Jiang
  - School of Life Sciences and Medicine, Shandong University of Technology, Zibo 255049, China
- Michael A Quail
  - The Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Jack Fraser-Govil
  - The Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Haipeng Wang
  - School of Life Sciences and Medicine, Shandong University of Technology, Zibo 255049, China
- Xuequn Shi
  - College of Food Science and Technology, Hainan University, Haikou 570228, China
- Karen Oliver
  - The Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Esther Mellado Gomez
  - The Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Fengtang Yang
  - School of Life Sciences and Medicine, Shandong University of Technology, Zibo 255049, China
- Zemin Ning
  - The Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
4
Tolstoganov I, Chen Z, Pevzner P, Korobeynikov A. SpLitteR: diploid genome assembly using TELL-Seq linked-reads and assembly graphs. PeerJ 2024; 12:e18050. [PMID: 39351368 PMCID: PMC11441382 DOI: 10.7717/peerj.18050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Accepted: 08/15/2024] [Indexed: 10/04/2024] [Open Access]
Abstract
Background Recent advances in long-read sequencing technologies enabled accurate and contiguous de novo assemblies of large genomes and metagenomes. However, even long and accurate high-fidelity (HiFi) reads do not resolve repeats that are longer than the read lengths. This limitation negatively affects the contiguity of diploid genome assemblies since two haplomes share many long identical regions. To generate the telomere-to-telomere assemblies of diploid genomes, biologists now construct their HiFi-based phased assemblies and use additional experimental technologies to transform them into more contiguous diploid assemblies. The barcoded linked-reads, generated using an inexpensive TELL-Seq technology, provide an attractive way to bridge unresolved repeats in phased assemblies of diploid genomes. Results We developed the SpLitteR tool for diploid genome assembly using linked-reads and assembly graphs and benchmarked it against state-of-the-art linked-read scaffolders ARKS and SLR-superscaffolder using human HG002 genome and sheep gut microbiome datasets. The benchmark showed that SpLitteR scaffolding results in 1.5-fold increase in NGA50 compared to the baseline LJA assembly and other scaffolders while introducing no additional misassemblies on the human dataset. Conclusion We developed the SpLitteR tool for assembly graph phasing and scaffolding using barcoded linked-reads. We benchmarked SpLitteR on assembly graphs produced by various long-read assemblers and have demonstrated that TELL-Seq reads facilitate phasing and scaffolding in these graphs. This benchmarking demonstrates that SpLitteR improves upon the state-of-the-art linked-read scaffolders in the accuracy and contiguity metrics. SpLitteR is implemented in C++ as a part of the freely available SPAdes package and is available at https://github.com/ablab/spades/releases/tag/splitter-preprint.
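The contiguity gain reported in this abstract is expressed in NGA50, which requires aligning contigs to a reference and breaking them at misassemblies. As a simpler, reference-free illustration of the same idea, the sketch below computes plain N50 from a list of contig lengths. The lengths are hypothetical toy values, and this is not the evaluation procedure used by the SpLitteR authors.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Hypothetical contig lengths before and after joining the two longest contigs.
before = [100, 80, 60, 40, 20]
after = [180, 60, 40, 20]

print(n50(before))  # 80
print(n50(after))   # 180
```

Joining two contigs leaves the total length unchanged but raises N50, which is why scaffolders are commonly compared on this family of metrics.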
Affiliation(s)
- Ivan Tolstoganov
  - Department of Mathematics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
- Zhoutao Chen
  - Universal Sequencing Technology Corporation, Carlsbad, California, United States
- Pavel Pevzner
  - Department of Computer Science and Engineering, University of California, San Diego, San Diego, California, United States
- Anton Korobeynikov
  - Department of Statistical Modelling, Saint Petersburg State University, Saint Petersburg, Russia
  - Institute of Applied Computer Science, ITMO University, Saint Petersburg, Russia
5
Zhang Z, Xiao J, Wang H, Yang C, Huang Y, Yue Z, Chen Y, Han L, Yin K, Lyu A, Fang X, Zhang L. Exploring high-quality microbial genomes by assembling short-reads with long-range connectivity. Nat Commun 2024; 15:4631. [PMID: 38821971 PMCID: PMC11143213 DOI: 10.1038/s41467-024-49060-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2023] [Accepted: 05/17/2024] [Indexed: 06/02/2024] [Open Access]
Abstract
Although long-read sequencing enables the generation of complete genomes for unculturable microbes, its high cost limits the widespread adoption of long-read sequencing in large-scale metagenomic studies. An alternative method is to assemble short-reads with long-range connectivity, which can be a cost-effective way to generate high-quality microbial genomes. Here, we develop Pangaea, a bioinformatic approach designed to enhance metagenome assembly using short-reads with long-range connectivity. Pangaea leverages connectivity derived from physical barcodes of linked-reads or virtual barcodes by aligning short-reads to long-reads. Pangaea utilizes a deep learning-based read binning algorithm to assemble co-barcoded reads exhibiting similar sequence contexts and abundances, thereby improving the assembly of high- and medium-abundance microbial genomes. Pangaea also leverages a multi-thresholding algorithm strategy to refine assembly for low-abundance microbes. We benchmark Pangaea on linked-reads and a combination of short- and long-reads from simulation data, mock communities and human gut metagenomes. Pangaea achieves significantly higher contig continuity as well as more near-complete metagenome-assembled genomes (NCMAGs) than the existing assemblers. Pangaea also generates three complete and circular NCMAGs on the human gut microbiomes.
Grants
- This research was partially supported by the Young Collaborative Research Grant (C2004-23Y, L.Z.), HMRF (11221026, L.Z.), the open project of BGI-Shenzhen, Shenzhen 518000, China (BGIRSZ20220012, L.Z.), the Hong Kong Research Grant Council Early Career Scheme (HKBU 22201419, L.Z.), HKBU Start-up Grant Tier 2 (RC-SGT2/19-20/SCI/007, L.Z.), HKBU IRCMS (No. IRCMS/19-20/D02, L.Z.).
- This research was partially supported by the open project of BGI-Shenzhen, Shenzhen 518000, China (BGIRSZ20220014, KJ.Y.).
- This study was partially supported by the Science Technology and Innovation Committee of Shenzhen Municipality, China (SGDX20190919142801722, XD.F.).
Affiliation(s)
- Zhenmiao Zhang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
- Jin Xiao
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
- Hongbo Wang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
- Chao Yang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
- Zhen Yue
  - BGI Research, Sanya, 572025, China
- Yang Chen
  - State Key Laboratory of Dampness Syndrome of Chinese Medicine, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China
- Lijuan Han
  - Department of Scientific Research, Kangmeihuada GeneTech Co., Ltd (KMHD), Shenzhen, China
- Kejing Yin
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
  - Institute for Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
- Aiping Lyu
  - School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China
- Xiaodong Fang
  - BGI Research, Shenzhen, 518083, China
  - BGI Research, Sanya, 572025, China
  - Department of Scientific Research, Kangmeihuada GeneTech Co., Ltd (KMHD), Shenzhen, China
- Lu Zhang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
  - Institute for Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
6
Meleshko D, Prjbelski AD, Raiko M, Tomescu AI, Tilgner H, Hajirasouliha I. cloudrnaSPAdes: isoform assembly using bulk barcoded RNA sequencing data. Bioinformatics 2024; 40:btad781. [PMID: 38262343 PMCID: PMC10868327 DOI: 10.1093/bioinformatics/btad781] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2023] [Revised: 12/09/2023] [Accepted: 01/18/2024] [Indexed: 01/25/2024] [Open Access]
Abstract
MOTIVATION Recent advancements in long-read RNA sequencing have enabled the examination of full-length isoforms, previously uncaptured by short-read sequencing methods. An alternative powerful method for studying isoforms is through the use of barcoded short-read RNA reads, for which a barcode indicates whether two short-reads arise from the same molecule or not. Such techniques included the 10x Genomics linked-read based SParse Isoform Sequencing (SPIso-seq), as well as Loop-Seq, or Tell-Seq. Some applications, such as novel-isoform discovery, require very high coverage. Obtaining high coverage using long reads can be difficult, making barcoded RNA-seq data a valuable alternative for this task. However, most annotation pipelines are not able to work with a set of short reads instead of a single transcript, also not able to work with coverage gaps within a molecule if any. In order to overcome this challenge, we present an RNA-seq assembler that allows the determination of the expressed isoform per barcode. RESULTS In this article, we present cloudrnaSPAdes, a tool for assembling full-length isoforms from barcoded RNA-seq linked-read data in a reference-free fashion. Evaluating it on simulated and real human data, we found that cloudrnaSPAdes accurately assembles isoforms, even for genes with high isoform diversity. AVAILABILITY AND IMPLEMENTATION cloudrnaSPAdes is a feature release of a SPAdes assembler and version used for this article is available at https://github.com/1dayac/cloudrnaSPAdes-release.
Affiliation(s)
- Dmitry Meleshko
  - Tri-Institutional Computational Biology & Medicine Program, Weill Cornell Medicine of Cornell University, New York, NY 10021, United States
  - Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10021, United States
- Andrey D Prjbelski
  - Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
- Mikhail Raiko
  - Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St Petersburg State University, St Petersburg 199004, Russia
- Alexandru I Tomescu
  - Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
- Hagen Tilgner
  - Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY 10021, United States
  - Center for Neurogenetics, Weill Cornell Medicine, New York, NY 10021, United States
- Iman Hajirasouliha
  - Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10021, United States
  - Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine, New York, NY 10021, United States
7
Yang C, Zhang Z, Huang Y, Xie X, Liao H, Xiao J, Veldsman WP, Yin K, Fang X, Zhang L. LRTK: a platform agnostic toolkit for linked-read analysis of both human genome and metagenome. Gigascience 2024; 13:giae028. [PMID: 38869148 PMCID: PMC11170215 DOI: 10.1093/gigascience/giae028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Revised: 03/15/2024] [Accepted: 05/09/2024] [Indexed: 06/14/2024] [Open Access]
Abstract
BACKGROUND Linked-read sequencing technologies generate high-base quality short reads that contain extrapolative information on long-range DNA connectedness. These advantages of linked-read technologies are well known and have been demonstrated in many human genomic and metagenomic studies. However, existing linked-read analysis pipelines (e.g., Long Ranger) were primarily developed to process sequencing data from the human genome and are not suited for analyzing metagenomic sequencing data. Moreover, linked-read analysis pipelines are typically limited to 1 specific sequencing platform. FINDINGS To address these limitations, we present the Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform agnostic processing of linked-read sequencing data from both human genome and metagenome. LRTK provides functions to perform linked-read simulation, barcode sequencing error correction, barcode-aware read alignment and metagenome assembly, reconstruction of long DNA fragments, taxonomic classification and quantification, and barcode-assisted genomic variant calling and phasing. LRTK has the ability to process multiple samples automatically and provides users with the option to generate reproducible reports during processing of raw sequencing data and at multiple checkpoints throughout downstream analysis. We applied LRTK on linked reads from simulation, mock community, and real datasets for both human genome and metagenome. We showcased LRTK's ability to generate comparative performance results from preceding benchmark studies and to report these results in publication-ready HTML document plots. CONCLUSIONS LRTK provides comprehensive and flexible modules along with an easy-to-use Python-based workflow for processing linked-read sequencing datasets, thereby filling the current gap in the field caused by platform-centric genome-specific linked-read data analysis tools.
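The "reconstruction of long DNA fragments" step mentioned in this abstract rests on a simple idea: aligned reads that share a barcode and lie close together on the same chromosome probably came from one long molecule. The sketch below illustrates that idea only. It is not LRTK's implementation or API, and the read tuples, barcode names, and the `max_gap` threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Read = Tuple[str, str, int]            # (barcode, chromosome, alignment start)
Fragment = Tuple[str, str, int, int]   # (barcode, chromosome, first start, last start)


def reconstruct_fragments(reads: List[Read], max_gap: int = 50_000) -> List[Fragment]:
    """Group reads by (barcode, chromosome) and split each group into
    fragments wherever consecutive reads are more than max_gap apart."""
    by_key: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for barcode, chrom, pos in reads:
        by_key[(barcode, chrom)].append(pos)

    fragments: List[Fragment] = []
    for (barcode, chrom), positions in by_key.items():
        positions.sort()
        start = prev = positions[0]
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current fragment and start a new one.
                fragments.append((barcode, chrom, start, prev))
                start = pos
            prev = pos
        fragments.append((barcode, chrom, start, prev))
    return sorted(fragments)


# Hypothetical aligned reads: BC1 covers two distant regions of chr1.
reads = [
    ("BC1", "chr1", 1_000),
    ("BC1", "chr1", 20_000),
    ("BC1", "chr1", 500_000),
    ("BC2", "chr1", 3_000),
]

for fragment in reconstruct_fragments(reads):
    print(fragment)
# ('BC1', 'chr1', 1000, 20000)
# ('BC1', 'chr1', 500000, 500000)
# ('BC2', 'chr1', 3000, 3000)
```

Splitting on a distance threshold handles the common case where one barcode tags several molecules. A suitable threshold depends on the platform's fragment-length distribution, so the 50 kb default here is only a placeholder.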
Affiliation(s)
- Chao Yang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
- Zhenmiao Zhang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
- Yufen Huang
  - BGI Research, Shenzhen 518083, China
  - BGI Genomics, Shenzhen 518083, China
- Herui Liao
  - Department of Electrical Engineering, City University of Hong Kong, Hong Kong SAR 999077, Hong Kong
- Jin Xiao
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
- Werner Pieter Veldsman
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
- Kejing Yin
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
- Xiaodong Fang
  - BGI Genomics, Shenzhen 518083, China
  - BGI Research, Sanya 572025, China
- Lu Zhang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
  - Institute for Research and Continuing Education, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong
8
Naithani S, Deng CH, Sahu SK, Jaiswal P. Exploring Pan-Genomes: An Overview of Resources and Tools for Unraveling Structure, Function, and Evolution of Crop Genes and Genomes. Biomolecules 2023; 13:1403. [PMID: 37759803 PMCID: PMC10527062 DOI: 10.3390/biom13091403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Revised: 08/29/2023] [Accepted: 09/12/2023] [Indexed: 09/29/2023] [Open Access]
Abstract
The availability of multiple sequenced genomes from a single species made it possible to explore intra- and inter-specific genomic comparisons at higher resolution and build clade-specific pan-genomes of several crops. The pan-genomes of crops constructed from various cultivars, accessions, landraces, and wild ancestral species represent a compendium of genes and structural variations and allow researchers to search for the novel genes and alleles that were inadvertently lost in domesticated crops during the historical process of crop domestication or in the process of extensive plant breeding. Fortunately, many valuable genes and alleles associated with desirable traits like disease resistance, abiotic stress tolerance, plant architecture, and nutrition qualities exist in landraces, ancestral species, and crop wild relatives. The novel genes from the wild ancestors and landraces can be introduced back to high-yielding varieties of modern crops by implementing classical plant breeding, genomic selection, and transgenic/gene editing approaches. Thus, pan-genomics represents a great leap in plant research and offers new avenues for targeted breeding to mitigate the impact of global climate change. Here, we summarize the tools used for pan-genome assembly and annotations, web-portals hosting plant pan-genomes, etc. Furthermore, we highlight a few discoveries made in crops using the pan-genomic approach and the future potential of this emerging field of study.
Affiliation(s)
- Sushma Naithani
  - Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, USA
- Cecilia H. Deng
  - Molecular & Digital Breeding Group, New Cultivar Innovation, The New Zealand Institute for Plant and Food Research Limited, Private Bag 92169, Auckland 1142, New Zealand
- Sunil Kumar Sahu
  - State Key Laboratory of Agricultural Genomics, Key Laboratory of Genomics, Ministry of Agriculture, BGI Research, Shenzhen 518083, China
- Pankaj Jaiswal
  - Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, USA
9
Mak L, Meleshko D, Danko DC, Barakzai WN, Maharjan S, Belchikov N, Hajirasouliha I. Ariadne: synthetic long read deconvolution using assembly graphs. Genome Biol 2023; 24:197. [PMID: 37641111 PMCID: PMC10463629 DOI: 10.1186/s13059-023-03033-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2022] [Accepted: 08/07/2023] [Indexed: 08/31/2023] [Open Access]
Abstract
Synthetic long read sequencing techniques such as UST's TELL-Seq and Loop Genomics' LoopSeq combine 3′ barcoding with standard short-read sequencing to expand the range of linkage resolution from hundreds to tens of thousands of base-pairs. However, the lack of a 1:1 correspondence between a long fragment and a 3′ unique molecular identifier confounds the assignment of linkage between short reads. We introduce Ariadne, a novel assembly graph-based synthetic long read deconvolution algorithm, that can be used to extract single-species read-clouds from synthetic long read datasets to improve the taxonomic classification and de novo assembly of complex populations, such as metagenomes.
Affiliation(s)
- Lauren Mak
  - Tri-Institutional Computational Biology & Medicine Program, Weill Cornell Medicine of Cornell University, New York, USA
  - Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, New York, USA
- Dmitry Meleshko
  - Tri-Institutional Computational Biology & Medicine Program, Weill Cornell Medicine of Cornell University, New York, USA
  - Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, New York, USA
- David C Danko
  - Tri-Institutional Computational Biology & Medicine Program, Weill Cornell Medicine of Cornell University, New York, USA
  - Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, New York, USA
- Waris N Barakzai
  - Department of Computer Science, New York University, New York, USA
- Salil Maharjan
  - Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, New York, USA
- Natan Belchikov
  - Physiology, Biophysics & Systems Biology Program, Weill Cornell Medicine of Cornell University, New York, USA
- Iman Hajirasouliha
  - Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, New York, USA
  - Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine of Cornell University, New York, USA
10
Meleshko D, Prjbelski AD, Raiko M, Tomescu AI, Tilgner H, Hajirasouliha I. cloudrnaSPAdes: Isoform assembly using bulk barcoded RNA sequencing data. bioRxiv: the preprint server for biology 2023:2023.07.25.550587. [PMID: 37546844 PMCID: PMC10402000 DOI: 10.1101/2023.07.25.550587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
Motivation Recent advancements in long-read RNA sequencing have enabled the examination of full-length isoforms, previously uncaptured by short-read sequencing methods. An alternative powerful method for studying isoforms is through the use of barcoded short-read RNA reads, for which a barcode indicates whether two short-reads arise from the same molecule or not. Such techniques included the 10x Genomics linked-read based SParse Isoform Sequencing (SPIso-seq), as well as Loop-Seq, or Tell-Seq. Some applications, such as novel-isoform discovery, require very high coverage. Obtaining high coverage using long reads can be difficult, making barcoded RNA-seq data a valuable alternative for this task. However, most annotation pipelines are not able to work with a set of short reads instead of a single transcript, also not able to work with coverage gaps within a molecule if any. In order to overcome this challenge, we present an RNA-seq assembler allowing the determination of the expressed isoform per barcode. Results In this paper, we present cloudrnaSPAdes, a tool for assembling full-length isoforms from barcoded RNA-seq linked-read data in a reference-free fashion. Evaluating it on simulated and real human data, we found that cloudrnaSPAdes accurately assembles isoforms, even for genes with high isoform diversity. Availability cloudrnaSPAdes is a feature release of a SPAdes assembler and available at https://cab.spbu.ru/software/cloudrnaspades/.
Affiliation(s)
- Dmitry Meleshko
  - Tri-Institutional Computational Biology & Medicine Program, Weill Cornell Medicine of Cornell University, NY, 10021, USA
  - Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine, NY, 10021, USA
- Mikhail Raiko
  - Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia, 199004
- Hagen Tilgner
  - Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY, 10021, USA
  - Center for Neurogenetics, Weill Cornell Medicine, New York, NY, USA
- Iman Hajirasouliha
  - Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine, NY, 10021, USA
  - Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine, NY, 10021, USA
11
Baltoumas FA, Karatzas E, Paez-Espino D, Venetsianou NK, Aplakidou E, Oulas A, Finn RD, Ovchinnikov S, Pafilis E, Kyrpides NC, Pavlopoulos GA. Exploring microbial functional biodiversity at the protein family level-From metagenomic sequence reads to annotated protein clusters. Frontiers in Bioinformatics 2023; 3:1157956. [PMID: 36959975 PMCID: PMC10029925 DOI: 10.3389/fbinf.2023.1157956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2023] [Accepted: 02/21/2023] [Indexed: 03/06/2023] [Open Access]
Abstract
Metagenomics has enabled accessing the genetic repertoire of natural microbial communities. Metagenome shotgun sequencing has become the method of choice for studying and classifying microorganisms from various environments. To this end, several methods have been developed to process and analyze the sequence data from raw reads to end-products such as predicted protein sequences or families. In this article, we provide a thorough review to simplify such processes and discuss the alternative methodologies that can be followed in order to explore biodiversity at the protein family level. We provide details for analysis tools and we comment on their scalability as well as their advantages and disadvantages. Finally, we report the available data repositories and recommend various approaches for protein family annotation related to phylogenetic distribution, structure prediction and metadata enrichment.
Affiliation(s)
- Fotis A. Baltoumas
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
- Evangelos Karatzas
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
- David Paez-Espino
  - Lawrence Berkeley National Laboratory, DOE Joint Genome Institute, Berkeley, CA, United States
- Nefeli K. Venetsianou
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
- Eleni Aplakidou
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
- Anastasis Oulas
  - The Cyprus Institute of Neurology and Genetics, Nicosia, Cyprus
- Robert D. Finn
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, United Kingdom
- Sergey Ovchinnikov
  - John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, United States
- Evangelos Pafilis
  - Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion, Greece
- Nikos C. Kyrpides
  - Lawrence Berkeley National Laboratory, DOE Joint Genome Institute, Berkeley, CA, United States
- Georgios A. Pavlopoulos
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
  - Center of New Biotechnologies and Precision Medicine, Department of Medicine, School of Health Sciences, National and Kapodistrian University of Athens, Athens, Greece
  - Hellenic Army Academy, Vari, Greece
|
12
|
Qi Y, Gu S, Zhang Y, Guo L, Xu M, Cheng X, Wang O, Sun Y, Chen J, Fang X, Liu X, Deng L, Fan G. MetaTrass: A high-quality metagenome assembler of the human gut microbiome by cobarcoding sequencing reads. iMeta 2022; 1:e46. [PMID: 38867906 PMCID: PMC10989976 DOI: 10.1002/imt2.46] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/28/2022] [Revised: 06/28/2022] [Accepted: 07/20/2022] [Indexed: 06/14/2024]
Abstract
Metagenomic evidence of great genetic diversity within the nonconserved regions of human gut microbial genomes calls for new methods to elucidate species-level variability at high resolution. However, current approaches cannot meet this methodological challenge. In this study, we proposed an efficient binning-first-and-assembly-later strategy, named MetaTrass, to recover high-quality species-resolved genomes based on public reference genomes and the single-tube long fragment read (stLFR) technology, which enables cobarcoding. MetaTrass can generate genomes with longer contiguity, higher completeness, and lower contamination than those produced by conventional assembly-first-and-binning-later strategies. In a simulation study on a mock microbial community, MetaTrass showed the potential to improve assembly contiguity from kb to Mb without accuracy loss, compared with other methods based on next-generation sequencing technology. From four human fecal samples, MetaTrass successfully retrieved 178 high-quality genomes, whereas the best-performing conventional strategies provided only 58. Most importantly, these high-quality genomes confirmed the high level of genetic diversity among different samples and unveiled much more. MetaTrass was designed to work with metagenomic reads sequenced by stLFR technology, but it is also applicable to other types of cobarcoding libraries. With its high capability of assembling high-quality genomes from metagenomic data sets, MetaTrass seeks to facilitate the study of spatial characters and dynamics of complex microbial communities at enhanced resolution. The open-source code of MetaTrass is available at https://github.com/BGI-Qingdao/MetaTrass.
Affiliation(s)
- Yanwei Qi
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - State Key Laboratory of Agricultural Genomics, BGI‐Shenzhen, Shenzhen, China
  - China National GeneBank, BGI‐Shenzhen, Shenzhen, China
- Shengqiang Gu
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Lidong Guo
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Mengyang Xu
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - State Key Laboratory of Agricultural Genomics, BGI‐Shenzhen, Shenzhen, China
  - China National GeneBank, BGI‐Shenzhen, Shenzhen, China
  - BGI‐Shenzhen, BGI‐Shenzhen, Shenzhen, China
- Xiaofang Cheng
  - BGI‐Shenzhen, BGI‐Shenzhen, Shenzhen, China
  - MGI, BGI‐Shenzhen, Shenzhen, China
- Ou Wang
  - BGI‐Shenzhen, BGI‐Shenzhen, Shenzhen, China
  - MGI, BGI‐Shenzhen, Shenzhen, China
- Ying Sun
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
- Xiaodong Fang
  - BGI‐Shenzhen, BGI‐Shenzhen, Shenzhen, China
  - BGI Genomics, BGI‐Shenzhen, Shenzhen, China
- Xin Liu
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - State Key Laboratory of Agricultural Genomics, BGI‐Shenzhen, Shenzhen, China
  - China National GeneBank, BGI‐Shenzhen, Shenzhen, China
- Li Deng
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - State Key Laboratory of Agricultural Genomics, BGI‐Shenzhen, Shenzhen, China
  - China National GeneBank, BGI‐Shenzhen, Shenzhen, China
- Guangyi Fan
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - State Key Laboratory of Agricultural Genomics, BGI‐Shenzhen, Shenzhen, China
  - China National GeneBank, BGI‐Shenzhen, Shenzhen, China
  - BGI‐Shenzhen, BGI‐Shenzhen, Shenzhen, China
|
13
|
Faure R, Lavenier D. QuickDeconvolution: fast and scalable deconvolution of linked-read sequencing data. Bioinformatics Advances 2022; 2:vbac068. [PMID: 36699389 PMCID: PMC9710601 DOI: 10.1093/bioadv/vbac068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2022] [Revised: 08/22/2022] [Accepted: 09/21/2022] [Indexed: 01/28/2023]
Abstract
Motivation: Recently introduced linked-read technologies, such as the 10x Chromium system, use microfluidics to tag multiple short reads from the same long fragment (50-200 kb) with a small sequence, called a barcode. They are inexpensive and easy to prepare, combining the accuracy of short-read sequencing with the long-range information of barcodes. The same barcode can be used for several different fragments, which complicates the analyses. Results: We present QuickDeconvolution (QD), new software for deconvolving a set of reads sharing a barcode, i.e. separating the reads from the different fragments. QD takes only sequencing data as input, without the need for a reference genome. We show that QD outperforms existing software in terms of accuracy, speed and scalability, making it capable of deconvolving previously inaccessible data sets. In particular, we demonstrate here the first example in the literature of a successfully deconvolved animal sequencing dataset, a 33-Gb Drosophila melanogaster dataset. We show that the taxonomic assignment of linked reads can be improved by deconvolving reads with QD before taxonomic classification. Availability and implementation: Code and instructions are available on https://github.com/RolandFaure/QuickDeconvolution. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
|
14
|
Jiang F, Yang N, Huang H. Characterization and phylogenetic analysis of the mitochondrial genome sequence of Heniochus acuminatus. Mitochondrial DNA B Resour 2022; 7:1694-1695. [PMID: 36188664 PMCID: PMC9518272 DOI: 10.1080/23802359.2022.2049016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
In this study, the complete mitochondrial genome of Heniochus acuminatus was sequenced and annotated for the first time. The entire mitogenome is 16,584 bp in length and consists of 13 protein-coding genes (PCGs), 22 transfer RNA (tRNA) genes, two ribosomal RNA (rRNA) genes, and a non-coding control region. Phylogenetic analysis by the maximum-likelihood (ML) method revealed that H. acuminatus belongs to the Chaetodontidae family and is closely related to other Heniochus fish. The complete mitochondrial genome of H. acuminatus will be helpful for population genetics and molecular systematics.
Affiliation(s)
- Fangyan Jiang
  - Key Laboratory of Utilization and Conservation for Tropical Marine Bioresources of Ministry of Education, Hainan Tropical Ocean University, Sanya, China
  - Key Laboratory of Tropical Marine Fishery Resources Protection and Utilization of Hainan Province, Hainan Tropical Ocean University, Sanya, China
- Ning Yang
  - Key Laboratory of Utilization and Conservation for Tropical Marine Bioresources of Ministry of Education, Hainan Tropical Ocean University, Sanya, China
  - Key Laboratory of Tropical Marine Fishery Resources Protection and Utilization of Hainan Province, Hainan Tropical Ocean University, Sanya, China
- Hai Huang
  - Key Laboratory of Utilization and Conservation for Tropical Marine Bioresources of Ministry of Education, Hainan Tropical Ocean University, Sanya, China
  - Key Laboratory of Tropical Marine Fishery Resources Protection and Utilization of Hainan Province, Hainan Tropical Ocean University, Sanya, China
|
15
|
Meleshko D, Yang R, Marks P, Williams S, Hajirasouliha I. Efficient detection and assembly of non-reference DNA sequences with synthetic long reads. Nucleic Acids Res 2022; 50:e108. [PMID: 35924489 PMCID: PMC9561269 DOI: 10.1093/nar/gkac653] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/27/2022] [Revised: 06/10/2022] [Accepted: 08/01/2022] [Indexed: 11/14/2022]
Abstract
Recent pan-genome studies have revealed an abundance of DNA sequences in human genomes that are not present in the reference genome. A lion's share of these non-reference sequences (NRSs) cannot be reliably assembled or placed on the reference genome. Improvements in long-read and synthetic long-read (aka linked-read) technologies have great potential for the characterization of NRSs. While synthetic long reads require less input DNA than long-read datasets, they are algorithmically more challenging to use. Except for computationally expensive whole-genome assembly methods, there is no synthetic long-read method for NRS detection. We propose a novel integrated alignment-based and local assembly-based algorithm, Novel-X, that uses the barcode information encoded in synthetic long reads to improve the detection of such events without a whole-genome de novo assembly. Our evaluations demonstrate that Novel-X finds many non-reference sequences that cannot be found by state-of-the-art short-read methods. We applied Novel-X to a diverse set of 68 samples from the Polaris HiSeq 4000 PGx cohort. Novel-X discovered 16 691 NRS insertions of size > 300 bp (total length 18.2 Mb). Many of them are population specific or may have a functional impact.
Affiliation(s)
- Dmitry Meleshko
  - Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, NY 10021, USA
  - Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, NY 10021, USA
- Rui Yang
  - Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, NY 10021, USA
  - Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, NY 10021, USA
- Patrick Marks
  - 10x Genomics Inc., Stoneridge Mall Road, Pleasanton, CA 94566, USA
- Stephen Williams
  - 10x Genomics Inc., Stoneridge Mall Road, Pleasanton, CA 94566, USA
- Iman Hajirasouliha
  - Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, NY 10021, USA
  - Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine, NY 10021, USA
|
16
|
Yang C, Chowdhury D, Zhang Z, Cheung WK, Lu A, Bian Z, Zhang L. A review of computational tools for generating metagenome-assembled genomes from metagenomic sequencing data. Comput Struct Biotechnol J 2021; 19:6301-6314. [PMID: 34900140 PMCID: PMC8640167 DOI: 10.1016/j.csbj.2021.11.028] [Citation(s) in RCA: 97] [Impact Index Per Article: 24.3] [Received: 08/30/2021] [Revised: 11/17/2021] [Accepted: 11/17/2021] [Indexed: 12/16/2022]
Abstract
Metagenomic sequencing provides a culture-independent avenue to investigate complex microbial communities by constructing metagenome-assembled genomes (MAGs). A MAG represents a microbial genome as a group of sequences from a genome assembly that share similar characteristics. It enables us to identify novel species and understand their potential functions in a dynamic ecosystem. Many computational tools have been developed to construct and annotate MAGs from metagenomic sequencing data; however, a comprehensive introduction to their background and practical performance is still lacking. In this paper, we have thoroughly investigated the computational tools designed for both upstream and downstream analyses, including metagenome assembly, metagenome binning, gene prediction, functional annotation, taxonomic classification, and profiling. We have categorized the commonly used tools into distinct groups based on their functional background and introduced the underlying core algorithms and associated information to provide a comparative outlook. Furthermore, we have emphasized the computational requirements and offered guidance to help users select the most efficient tools. Finally, we have indicated current limitations, potential solutions, and future perspectives for further improving the tools for MAG construction and annotation. We believe that our work provides a consolidated resource for the current stage of MAG studies and sheds light on the future development of more effective MAG analysis tools for metagenomic sequencing.
Key Words
- CNN, convolutional neural network
- DBG, De Bruijn graph
- GTDB, Genome Taxonomy Database
- Gene functional annotation
- Gene prediction
- Genome assembly
- HMM, Hidden Markov Model
- KEGG, Kyoto Encyclopedia of Genes and Genomes
- LCA, lowest common ancestor
- LPA, label propagation algorithm
- MAGs, metagenome-assembled genomes
- Metagenome binning
- Metagenome-assembled genomes
- Metagenomic sequencing
- Microbial abundance profiling
- OLC, overlap-layout consensus
- ONT, Oxford Nanopore Technologies
- ORFs, open reading frames
- PacBio, Pacific Biosciences
- QC, quality control
- SLR, synthetic long reads
- TNFs, tetranucleotide frequencies
- Taxonomic classification
Affiliation(s)
- Chao Yang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Debajyoti Chowdhury
  - Computational Medicine Lab, Hong Kong Baptist University, Hong Kong Special Administrative Region
  - Institute of Integrated Bioinformedicine and Translational Sciences, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Zhenmiao Zhang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
- William K. Cheung
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Aiping Lu
  - Computational Medicine Lab, Hong Kong Baptist University, Hong Kong Special Administrative Region
  - Institute of Integrated Bioinformedicine and Translational Sciences, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Zhaoxiang Bian
  - Institute of Brain and Gut Research, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
  - Chinese Medicine Clinical Study Center, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Lu Zhang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
  - Computational Medicine Lab, Hong Kong Baptist University, Hong Kong Special Administrative Region
|
17
|
Guo L, Xu M, Wang W, Gu S, Zhao X, Chen F, Wang O, Xu X, Seim I, Fan G, Deng L, Liu X. SLR-superscaffolder: a de novo scaffolding tool for synthetic long reads using a top-to-bottom scheme. BMC Bioinformatics 2021; 22:158. [PMID: 33765921 PMCID: PMC7993450 DOI: 10.1186/s12859-021-04081-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/28/2020] [Accepted: 03/16/2021] [Indexed: 12/30/2022]
Abstract
Background: Synthetic long reads (SLR) with long-range co-barcoding information are now widely applied in genomics research. Although several tools have been developed for each specific SLR technique, a robust standalone scaffolder with high efficiency is warranted for hybrid genome assembly. Results: In this work, we developed a standalone scaffolding tool, SLR-superscaffolder, to link together contigs in draft assemblies using co-barcoding and paired-end read information. Our top-to-bottom scheme first builds a global scaffold graph based on Jaccard similarity to determine the order and orientation of contigs, and then locally improves the scaffolds with the aid of paired-end information. We also used a screening algorithm to reduce the negative effect of misassembled contigs in the input assembly. We applied SLR-superscaffolder to a human single-tube long fragment read sequencing dataset and increased the scaffold NG50 of its corresponding draft assembly 1349-fold. Moreover, benchmarking on different input contigs showed that this approach overall outperformed existing SLR scaffolders, providing longer contiguity and fewer misassemblies, especially for short contigs assembled from next-generation sequencing data. The open-source code of SLR-superscaffolder is available at https://github.com/BGI-Qingdao/SLR-superscaffolder. Conclusions: SLR-superscaffolder can dramatically improve the contiguity of a draft assembly by integrating a hybrid assembly strategy.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04081-z.
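To illustrate the Jaccard-similarity idea mentioned in this abstract, the following minimal Python sketch scores contig pairs by the overlap of their barcode sets. It is not taken from SLR-superscaffolder: the function names, the `min_similarity` threshold, the barcode labels, and the choice to compare whole-contig barcode sets are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def score_contig_pairs(barcodes_by_contig: dict, min_similarity: float = 0.1) -> list:
    """Return (contig_a, contig_b, similarity) for every pair at or above the
    threshold, most similar first. Hypothetical helper, not the tool's API."""
    edges = []
    for name_a, name_b in combinations(sorted(barcodes_by_contig), 2):
        sim = jaccard(barcodes_by_contig[name_a], barcodes_by_contig[name_b])
        if sim >= min_similarity:
            edges.append((name_a, name_b, sim))
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


# Toy input: barcodes observed on reads mapped to each contig (made-up labels).
example = {
    "contig1": {"BX01", "BX02", "BX03"},
    "contig2": {"BX02", "BX03", "BX04"},
    "contig3": {"BX09"},
}
print(score_contig_pairs(example))
```

Contigs that share many barcodes are likely to come from the same long DNA fragments, so high-scoring pairs are candidate neighbours in a scaffold graph. Determining order and orientation, as the abstract describes, would need further information that this sketch does not model.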
Affiliation(s)
- Lidong Guo
  - BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, 518083, China
  - BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China
  - State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Mengyang Xu
  - BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China
  - State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - BGI-Shenzhen, Shenzhen, 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen, 518120, China
- Wenchao Wang
  - BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China
- Shengqiang Gu
  - BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, 518083, China
- Xia Zhao
  - MGI, BGI-Shenzhen, Shenzhen, 518083, China
- Fang Chen
  - MGI, BGI-Shenzhen, Shenzhen, 518083, China
- Ou Wang
  - BGI-Shenzhen, Shenzhen, 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen, 518120, China
- Xun Xu
  - BGI-Shenzhen, Shenzhen, 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen, 518120, China
- Inge Seim
  - Integrative Biology Laboratory, College of Life Sciences, Nanjing Normal University, Nanjing, 210046, China
  - School of Biology and Environmental Science, Queensland University of Technology, Brisbane, 4000, Australia
- Guangyi Fan
  - BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China
  - State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - BGI-Shenzhen, Shenzhen, 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen, 518120, China
- Li Deng
  - BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China
  - State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - BGI-Shenzhen, Shenzhen, 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen, 518120, China
- Xin Liu
  - BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China
  - State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - BGI-Shenzhen, Shenzhen, 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen, 518120, China
|
18
|
Zhang Z, Liu G, Chen Y, Xue W, Ji Q, Xu Q, Zhang H, Fan G, Huang H, Jiang L, Chen J. Comparison of different sequencing strategies for assembling chromosome-level genomes of extremophiles with variable GC content. iScience 2021; 24:102219. [PMID: 33748707 PMCID: PMC7961107 DOI: 10.1016/j.isci.2021.102219] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/25/2020] [Revised: 01/20/2021] [Accepted: 02/18/2021] [Indexed: 01/23/2023]
Abstract
In this study, six bacterial isolates with variable GC content, including Escherichia coli as a mesophilic reference strain, were selected to compare hybrid assembly strategies based on next-generation sequencing (NGS) of short reads, single-tube long-fragment read (stLFR) sequencing, and Oxford Nanopore Technologies (ONT) sequencing platforms. We obtained the complete genomes using the hybrid assembler Unicycler based on the NGS and ONT reads; the other assemblies were generated de novo from NGS, stLFR, and ONT reads using different strategies. The contiguity, accuracy, completeness, sequencing costs, and DNA material requirements of the investigated strategies were compared systematically. Although all sequencing data could be assembled into accurate whole-genome sequences, the stLFR sequencing data yielded scaffolds with greater contiguity and more complete gene function than the NGS assemblies. Our research provides a low-cost chromosome-level genome assembly strategy for large-scale sequencing of extremophile genomes with different GC contents.
Affiliation(s)
- Zhidong Zhang
  - College of Biotechnology and Pharmaceutical Engineering, Nanjing Tech University, Nanjing 211816, China
  - Institute of Applied Microbiology, Xinjiang Academy of Agricultural Sciences/Xinjiang Key Laboratory of Special Environmental Microbiology, Urumqi, Xinjiang 830091, China
- Guilin Liu
  - BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China
- Yao Chen
  - College of Biotechnology and Pharmaceutical Engineering, Nanjing Tech University, Nanjing 211816, China
- Weizhen Xue
  - BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China
- Qianyue Ji
  - BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China
- Qiwu Xu
  - BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China
- He Zhang
  - BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China
- Guangyi Fan
  - BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China
  - BGI-Shenzhen, Shenzhen, Guangdong 518083, China
- He Huang
  - School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, Nanjing 210023, China
  - School of Pharmaceutical Sciences, Nanjing Tech University, Nanjing 211816, China
- Ling Jiang
  - College of Food Science and Light Industry, Nanjing Tech University, Nanjing 211816, China
- Jianwei Chen
  - BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China
  - BGI-Shenzhen, Shenzhen, Guangdong 518083, China
  - Qingdao-Europe Advanced Institute for Life Sciences, BGI-Shenzhen, Qingdao 266555, China
  - Laboratory of Genomics and Molecular Biomedicine, Department of Biology, University of Copenhagen, Universitetsparken 13, Copenhagen 2100, Denmark
|
19
|
Majidian S, Kahaei MH, de Ridder D. Hap10: reconstructing accurate and long polyploid haplotypes using linked reads. BMC Bioinformatics 2020; 21:253. [PMID: 32552661 PMCID: PMC7302376 DOI: 10.1186/s12859-020-03584-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/08/2020] [Accepted: 06/05/2020] [Indexed: 01/23/2023]
Abstract
BACKGROUND: Haplotype information is essential for many genetic and genomic analyses, including genotype-phenotype associations in humans, animals and plants. Haplotype assembly is a method for reconstructing haplotypes from DNA sequencing reads. With the advent of new sequencing technologies, new algorithms are needed to ensure long and accurate haplotypes. While a few linked-read haplotype assembly algorithms are available for diploid genomes, to the best of our knowledge, no algorithms specifically exploiting linked reads have yet been proposed for polyploids. RESULTS: The first haplotyping algorithm designed for linked reads generated from a polyploid genome is presented, built on a typical short-read haplotyping method, SDhaP. Using the input aligned reads and called variants, the haplotype-relevant information is extracted. Next, reads with the same barcodes are combined to produce molecule-specific fragments. Then, these fragments are clustered into strongly connected components, which are used as input to a haplotype assembly core in order to estimate accurate and long haplotypes. CONCLUSIONS: Hap10 is a novel algorithm for haplotype assembly of polyploid genomes using linked reads. The performance of the algorithm is evaluated in a number of simulation scenarios, and its applicability is demonstrated on a real dataset of sweet potato.
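The step in this abstract where reads sharing a barcode are combined into molecule-specific fragments can be sketched roughly as follows. This is only an illustration under stated assumptions, not Hap10's implementation: the read representation (a barcode plus a mapping from variant index to observed allele), the function name, and the rule of dropping variants on which reads of the same barcode disagree are all hypothetical.

```python
from collections import defaultdict


def merge_reads_by_barcode(reads):
    """Combine reads that share a barcode into one fragment per barcode.

    `reads` is an iterable of (barcode, {variant_index: allele}) pairs.
    A variant is kept in a fragment only when every read of that barcode
    covering it reports the same allele (an assumed conflict rule).
    """
    observed = defaultdict(lambda: defaultdict(set))
    for barcode, alleles in reads:
        for variant_index, allele in alleles.items():
            observed[barcode][variant_index].add(allele)

    fragments = {}
    for barcode, per_variant in observed.items():
        fragments[barcode] = {
            variant_index: next(iter(seen))
            for variant_index, seen in sorted(per_variant.items())
            if len(seen) == 1  # skip variants with conflicting observations
        }
    return fragments


# Toy input: three reads tagged BX1 and one tagged BX2 (made-up barcodes).
reads = [
    ("BX1", {0: 0, 1: 1}),
    ("BX1", {1: 1, 2: 0}),
    ("BX2", {0: 1}),
    ("BX1", {2: 1}),  # disagrees with the earlier BX1 read at variant 2
]
print(merge_reads_by_barcode(reads))
```

Merging by barcode yields fragments that span more variants than any single short read. Note that a barcode can tag more than one molecule in real linked-read data, so a practical method would also need to separate distant read groups; that is outside this sketch.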
Affiliation(s)
- Sina Majidian
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, 16846-13114, Iran
- Mohammad Hossein Kahaei
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, 16846-13114, Iran
- Dick de Ridder
  - Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
|
20
|
Chen Z, Pham L, Wu TC, Mo G, Xia Y, Chang PL, Porter D, Phan T, Che H, Tran H, Bansal V, Shaffer J, Belda-Ferre P, Humphrey G, Knight R, Pevzner P, Pham S, Wang Y, Lei M. Ultralow-input single-tube linked-read library method enables short-read second-generation sequencing systems to routinely generate highly accurate and economical long-range sequencing information. Genome Res 2020; 30:898-909. [PMID: 32540955 PMCID: PMC7370886 DOI: 10.1101/gr.260380.119] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 12/19/2019] [Accepted: 06/10/2020] [Indexed: 02/06/2023]
Abstract
Long-range sequencing information is required for haplotype phasing, de novo assembly, and structural variation detection. Current long-read sequencing technologies can provide valuable long-range information but at a high cost with low accuracy and high DNA input requirements. We have developed a single-tube Transposase Enzyme Linked Long-read Sequencing (TELL-seq) technology, which enables a low-cost, high-accuracy, and high-throughput short-read second-generation sequencer to generate over 100 kb of long-range sequencing information with as little as 0.1 ng input material. In a PCR tube, millions of clonally barcoded beads are used to uniquely barcode long DNA molecules in an open bulk reaction without dilution and compartmentation. The barcoded linked-reads are used to successfully assemble genomes ranging from microbes to human. These linked-reads also generate megabase-long phased blocks and provide a cost-effective tool for detecting structural variants in a genome, which are important to identify compound heterozygosity in recessive Mendelian diseases and discover genetic drivers and diagnostic biomarkers in cancers.
Affiliation(s)
- Zhoutao Chen
  - Universal Sequencing Technology Corporation, Carlsbad, California 92011, USA
- Long Pham
  - Universal Sequencing Technology Corporation, Carlsbad, California 92011, USA
- Tsai-Chin Wu
  - Universal Sequencing Technology Corporation, Carlsbad, California 92011, USA
- Guoya Mo
  - Universal Sequencing Technology Corporation, Carlsbad, California 92011, USA
- Yu Xia
  - Universal Sequencing Technology Corporation, Carlsbad, California 92011, USA
- Peter L Chang
  - Universal Sequencing Technology Corporation, Carlsbad, California 92011, USA
- Devin Porter
  - Universal Sequencing Technology Corporation, Carlsbad, California 92011, USA
- Tan Phan
  - Bioturing Incorporated, San Diego, California 92121, USA
- Huu Che
  - Bioturing Incorporated, San Diego, California 92121, USA
- Hao Tran
  - Bioturing Incorporated, San Diego, California 92121, USA
  - Faculty of Information Technology, University of Science, Vietnam National University, Ho Chi Minh City, 700 000 Vietnam
- Vikas Bansal
  - Department of Pediatrics, University of California San Diego, La Jolla, California 92161, USA
- Justin Shaffer
  - Center for Microbiome Innovation and Departments of Pediatrics, Bioengineering, and Computer Science and Engineering, University of California San Diego, La Jolla, California 92093, USA
- Pedro Belda-Ferre
  - Center for Microbiome Innovation and Departments of Pediatrics, Bioengineering, and Computer Science and Engineering, University of California San Diego, La Jolla, California 92093, USA
- Greg Humphrey
  - Center for Microbiome Innovation and Departments of Pediatrics, Bioengineering, and Computer Science and Engineering, University of California San Diego, La Jolla, California 92093, USA
- Rob Knight
  - Center for Microbiome Innovation and Departments of Pediatrics, Bioengineering, and Computer Science and Engineering, University of California San Diego, La Jolla, California 92093, USA
- Pavel Pevzner
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, California 92093, USA
- Son Pham
  - Bioturing Incorporated, San Diego, California 92121, USA
- Yong Wang
  - Universal Sequencing Technology Corporation, Canton, Massachusetts 02021, USA
- Ming Lei
  - Universal Sequencing Technology Corporation, Canton, Massachusetts 02021, USA
|