1. Vrček L, Bresson X, Laurent T, Schmitz M, Kawaguchi K, Šikić M. Geometric deep learning framework for de novo genome assembly. Genome Res 2025; 35:839-849. [PMID: 39472021; PMCID: PMC12047240; DOI: 10.1101/gr.279307.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2024] [Accepted: 10/18/2024] [Indexed: 03/16/2025]
Abstract
The critical stage of every de novo genome assembler is identifying paths in assembly graphs that correspond to the reconstructed genomic sequences. The existing algorithmic methods struggle with this, primarily due to repetitive regions causing complex graph tangles, leading to fragmented assemblies. Here, we introduce GNNome, a framework for path identification based on geometric deep learning that enables training models on assembly graphs without relying on existing assembly strategies. By leveraging only the symmetries inherent to the problem, GNNome reconstructs assemblies from PacBio HiFi reads with contiguity and quality comparable to those of the state-of-the-art tools across several species. With every new genome assembled telomere-to-telomere, the amount of reliable training data at our disposal increases. Combining the straightforward generation of abundant simulated data for diverse genomic structures with the AI approach makes the proposed framework a plausible cornerstone for future work on reconstructing complex genomes with different degrees of ploidy and aneuploidy. To facilitate such developments, we make the framework and the best-performing model publicly available, provided as a tool that can directly be used to assemble new haploid genomes.
Affiliation(s)
- Lovro Vrček
  - Genome Institute of Singapore, A*STAR, Singapore 138672
  - Faculty of Electrical Engineering and Computing, University of Zagreb, 10000, Zagreb, Croatia
- Xavier Bresson
  - School of Computing, National University of Singapore, Singapore 117417
- Thomas Laurent
  - Department of Mathematics, Loyola Marymount University, Los Angeles, California 90045, USA
- Martin Schmitz
  - Genome Institute of Singapore, A*STAR, Singapore 138672
  - School of Computing, National University of Singapore, Singapore 117417
- Kenji Kawaguchi
  - School of Computing, National University of Singapore, Singapore 117417
- Mile Šikić
  - Genome Institute of Singapore, A*STAR, Singapore 138672
  - Faculty of Electrical Engineering and Computing, University of Zagreb, 10000, Zagreb, Croatia
2. Qin Q, Popic V, Wienand K, Yu H, White E, Khorgade A, Shin A, Georgescu C, Campbell CD, Dondi A, Beerenwinkel N, Vazquez F, Al'Khafaji AM, Haas BJ. Accurate fusion transcript identification from long- and short-read isoform sequencing at bulk or single-cell resolution. Genome Res 2025; 35:967-986. [PMID: 40086881; PMCID: PMC12047241; DOI: 10.1101/gr.279200.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2024] [Accepted: 01/06/2025] [Indexed: 03/16/2025]
Abstract
Gene fusions are found as cancer drivers in diverse adult and pediatric cancers. Accurate detection of fusion transcripts is essential in cancer clinical diagnostics and prognostics and for guiding therapeutic development. Most currently available methods for fusion transcript detection are compatible with Illumina RNA-seq involving highly accurate short-read sequences. Recent advances in long-read isoform sequencing enable the detection of fusion transcripts at unprecedented resolution in bulk and single-cell samples. Here, we developed a new computational tool, CTAT-LR-Fusion, to detect fusion transcripts from long-read RNA-seq with or without companion short reads, with applications to bulk or single-cell transcriptomes. We demonstrate that CTAT-LR-Fusion exceeds the fusion detection accuracy of alternative methods as benchmarked with simulated and genuine long-read RNA-seq. Using short- and long-read RNA-seq, we further apply CTAT-LR-Fusion to bulk transcriptomes of nine tumor cell lines and to tumor single cells derived from a melanoma sample and three metastatic high-grade serous ovarian carcinoma samples. In both bulk and single-cell RNA-seq, long isoform reads yield higher sensitivity for fusion detection than short reads with notable exceptions. By combining short and long reads in CTAT-LR-Fusion, we are able to further maximize the detection of fusion splicing isoforms and fusion-expressing tumor cells.
Affiliation(s)
- Qian Qin
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Victoria Popic
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Kirsty Wienand
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Houlin Yu
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Emily White
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Akanksha Khorgade
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Asa Shin
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Arthur Dondi
  - Department of Biosystems Science and Engineering, ETH Zurich, 4056 Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, 4056 Basel, Switzerland
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, 4056 Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, 4056 Basel, Switzerland
- Francisca Vazquez
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Aziz M Al'Khafaji
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Brian J Haas
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
3. Mink S, Attenberger C, Busch Y, Kiefer J, Peter W, Cadamuro J, Steiert TA, Franke A, Gassner C. Merging High-Throughput, Amplicon-Based Second and Third Generation Sequencing Data: An Integrative and Modular Data Analysis Framework for Haplotype Prediction and Output Evaluation. Int J Mol Sci 2025; 26:3443. [PMID: 40244459; PMCID: PMC11990026; DOI: 10.3390/ijms26073443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2025] [Revised: 03/30/2025] [Accepted: 04/02/2025] [Indexed: 04/18/2025] [Open Access]
Abstract
Despite providing highly accurate results, the short reads generated by second generation sequencing have major limitations in mapping complex genomic regions. Longer reads can resolve these issues and additionally phase distant variants. The third generation sequencing platform ONT currently achieves the longest sequencing reads but falls short in sequencing accuracy. Additionally, deriving phased haplotypes from amplicon-based NGS data remains a complex and time-consuming task that requires extensive bioinformatic expertise. We constructed an integrative, open-access modular data-analysis framework that allows for automated processing of high-throughput sequencing data from both second (Illumina) and third generation (ONT) sequencing platforms, combining the strengths of both technologies. Variant information is automatically evaluated and color-coded for discrepancies. Haplotypes are listed by frequency. All parts of the framework can be used independently. The framework's performance was validated using synthetic data and tested with real-life data by analyzing partly homologous FUT1/2/3 sequencing data from 400 blood donors.
Affiliation(s)
- Sylvia Mink
  - Central Medical Laboratories, Carinagasse 41, 6800 Feldkirch, Austria
  - Institute of Translational Medicine, Private University in the Principality of Liechtenstein, 9495 Triesen, Liechtenstein
- Christian Attenberger
  - Faculty of Medical Science, Private University in the Principality of Liechtenstein, 9495 Triesen, Liechtenstein
- Yannik Busch
  - Stefan-Morsch-Stiftung, 55765 Birkenfeld, Germany
- Wolfgang Peter
  - Stefan-Morsch-Stiftung, 55765 Birkenfeld, Germany
  - Institute for Transfusion Medicine, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50923 Cologne, Germany
- Janne Cadamuro
  - Department of Laboratory Medicine, Paracelsus Medical University Salzburg, 5020 Salzburg, Austria
- Tim A. Steiert
  - Institute of Clinical Molecular Biology, Christian-Albrechts-University and University Medical Center Schleswig-Holstein, 24118 Kiel, Germany
- Andre Franke
  - Institute of Clinical Molecular Biology, Christian-Albrechts-University and University Medical Center Schleswig-Holstein, 24118 Kiel, Germany
- Christoph Gassner
  - Institute of Translational Medicine, Private University in the Principality of Liechtenstein, 9495 Triesen, Liechtenstein
4. Chen X, Baker D, Dolzhenko E, Devaney JM, Noya J, Berlyoung AS, Brandon R, Hruska KS, Lochovsky L, Kruszka P, Newman S, Farrow E, Thiffault I, Pastinen T, Kasperaviciute D, Gilissen C, Vissers L, Hoischen A, Berger S, Vilain E, Délot E, Eberle MA. Genome-wide profiling of highly similar paralogous genes using HiFi sequencing. Nat Commun 2025; 16:2340. [PMID: 40057485; PMCID: PMC11890787; DOI: 10.1038/s41467-025-57505-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2024] [Accepted: 02/21/2025] [Indexed: 05/13/2025] [Open Access]
Abstract
Variant calling is hindered in segmental duplications by sequence homology. We developed Paraphase, a HiFi-based informatics method that resolves highly similar genes by phasing all haplotypes of paralogous genes together. We applied Paraphase to 160 long (>10 kb) segmental duplication regions across the human genome with high (>99%) sequence similarity, encoding 316 genes. Analysis across five ancestral populations revealed highly variable copy numbers of these regions. We identified 23 paralog groups with exceptionally low within-group diversity, where extensive gene conversion and unequal crossing over contribute to highly similar gene copies. Furthermore, our analysis of 36 trios identified 7 de novo SNVs and 4 de novo gene conversion events, 2 of which are non-allelic. Finally, we summarized extensive genetic diversity in 9 medically relevant genes previously considered challenging to genotype. Paraphase provides a framework for resolving gene paralogs, enabling accurate testing in medically relevant genes and population-wide studies of previously inaccessible genes.
Affiliation(s)
- Emily Farrow
  - Genomic Medicine Center, Children's Mercy Kansas City, Kansas City, MO, USA
  - UMKC School of Medicine, University of Missouri Kansas City, Kansas City, MO, USA
  - Department of Pediatrics, Children's Mercy Kansas City, Kansas City, MO, USA
- Isabelle Thiffault
  - Genomic Medicine Center, Children's Mercy Kansas City, Kansas City, MO, USA
  - UMKC School of Medicine, University of Missouri Kansas City, Kansas City, MO, USA
  - Department of Pathology and Laboratory Medicine, Children's Mercy Kansas City, Kansas City, MO, USA
- Tomi Pastinen
  - Genomic Medicine Center, Children's Mercy Kansas City, Kansas City, MO, USA
  - UMKC School of Medicine, University of Missouri Kansas City, Kansas City, MO, USA
- Christian Gilissen
  - Department of Human Genetics, Radboud University Medical Center, Nijmegen, The Netherlands
  - Research Institute for Medical Innovation, Radboud University Medical Center, Nijmegen, The Netherlands
- Lisenka Vissers
  - Department of Human Genetics, Radboud University Medical Center, Nijmegen, The Netherlands
  - Research Institute for Medical Innovation, Radboud University Medical Center, Nijmegen, The Netherlands
- Alexander Hoischen
  - Department of Human Genetics, Radboud University Medical Center, Nijmegen, The Netherlands
  - Research Institute for Medical Innovation, Radboud University Medical Center, Nijmegen, The Netherlands
  - Radboud Center for Infectious Diseases (RCI), Department of Internal Medicine, Radboud University Medical Center, Nijmegen, The Netherlands
  - Radboud Expertise Center for Immunodeficiency and Autoinflammation and Radboud Center for Infectious Disease (RCI), Radboud University Medical Center, Nijmegen, The Netherlands
- Seth Berger
  - Center for Genetics Medicine Research, Children's National Hospital, Washington, DC, USA
- Eric Vilain
  - Institute for Clinical and Translational Science, University of California, Irvine, CA, USA
- Emmanuèle Délot
  - Institute for Clinical and Translational Science, University of California, Irvine, CA, USA
5. Ren W, Fang Z, Dolzhenko E, Saunders CT, Cheng Z, Popic V, Peltz G. A Murine Database of Structural Variants Enables the Genetic Architecture of a Spontaneous Murine Lymphoma to be Characterized. bioRxiv 2025:2025.01.09.632219. [PMID: 39868308; PMCID: PMC11761040; DOI: 10.1101/2025.01.09.632219] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2025]
Abstract
A more complete map of the pattern of genetic variation among inbred mouse strains is essential for characterizing the genetic architecture of the many available mouse genetic models of important biomedical traits. Although structural variants (SVs) are a major component of genetic variation, they have not been adequately characterized among inbred strains due to methodological limitations. To address this, we generated high-quality long-read sequencing data for 40 inbred strains and designed a pipeline to optimally identify and validate different types of SVs. This generated a database for 40 inbred strains with 573,191 SVs, including 10,815 duplications and 2,115 inversions, which also contains 70 million SNPs and 7.5 million insertions/deletions. Analysis of this SV database led to the discovery of a novel bi-genic model for susceptibility to a B cell lymphoma that spontaneously develops in SJL mice, which was initially described 55 years ago. The first genetic factor is a previously identified endogenous retrovirus encoded protein that stimulates CD4 T cells to produce the cytokines required for lymphoma growth. The second genetic factor is a newly found deletion SV, which ablates a protein whose promotes B lymphoma development in SJL mice. Characterizing the genetic architecture of SJL lymphoma susceptibility could provide new insight into the pathogenesis of a human lymphoma that has similarities with this murine lymphoma.
Affiliation(s)
- Wenlong Ren
  - Department of Anesthesia, Pain and Perioperative Medicine, Stanford University School of Medicine, Stanford CA 94305
- Zhuoqing Fang
  - Department of Anesthesia, Pain and Perioperative Medicine, Stanford University School of Medicine, Stanford CA 94305
- Zhuanfen Cheng
  - Department of Anesthesia, Pain and Perioperative Medicine, Stanford University School of Medicine, Stanford CA 94305
- Gary Peltz
  - Department of Anesthesia, Pain and Perioperative Medicine, Stanford University School of Medicine, Stanford CA 94305
6. Kong T, Wang Y, Liu B. xRead: a coverage-guided approach for scalable construction of read overlapping graph. Gigascience 2025; 14:giaf007. [PMID: 39960665; PMCID: PMC11831799; DOI: 10.1093/gigascience/giaf007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2024] [Revised: 11/29/2024] [Accepted: 01/10/2025] [Indexed: 02/20/2025] [Open Access]
Abstract
BACKGROUND The development of long-read sequencing is promising for high-quality and comprehensive de novo assembly of various species around the world. However, it is still challenging for assemblers to handle thousands of genomes, tens of gigabase-level assembly sizes, and terabase-level datasets efficiently, which is a bottleneck to large-scale de novo sequencing studies. A major cause is read overlapping graph construction, for which state-of-the-art tools usually require terabyte-level RAM and tens of days for large genomes. Such low performance and scalability are not suited to handling the numerous samples being sequenced. FINDINGS Herein, we propose xRead, a novel iterative overlapping graph construction approach that achieves high performance, scalability, and yield simultaneously. Under the guidance of its coverage-based model, xRead converts read overlapping into heuristic read-mapping and incremental graph construction tasks with highly controllable RAM usage and faster speed. It enables the processing of very large datasets (such as the 1.28 Tb Ambystoma mexicanum dataset) with less than 64 GB RAM and markedly lower time costs. Moreover, benchmarks suggest that it can produce highly accurate and well-connected overlapping graphs, which also support various kinds of downstream assembly strategies. CONCLUSIONS xRead is able to break through the major bottleneck in graph construction and lays a new foundation for de novo assembly. This tool is suited to handling a large number of datasets from large genomes and may play important roles in many de novo sequencing studies.
Affiliation(s)
- Tangchao Kong
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
  - Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Yadong Wang
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
  - Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Bo Liu
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
  - Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
7. Chaabane F, Pillonel T, Bertelli C. MeSS and assembly_finder: a toolkit for in silico metagenomic sample generation. Bioinformatics 2024; 41:btae760. [PMID: 39739308; PMCID: PMC11755095; DOI: 10.1093/bioinformatics/btae760] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2024] [Revised: 11/17/2024] [Accepted: 12/30/2024] [Indexed: 01/02/2025] [Open Access]
Abstract
SUMMARY The intrinsic complexity of the microbiota, combined with technical variability, renders shotgun metagenomics challenging to analyze for routine clinical or research applications. In silico data generation offers a controlled environment that allows, for example, benchmarking of bioinformatics tools, optimization of study design and statistical power, or validation of targeted applications. Here, we propose assembly_finder and the Metagenomic Sequence Simulator (MeSS), two easy-to-use Bioconda packages, as part of a benchmarking toolkit to download genomes and simulate shotgun metagenomics samples, respectively. Outperforming existing tools in speed while requiring less memory, MeSS reproducibly generates accurate complex communities based on a list of taxonomic ranks and their abundance. AVAILABILITY AND IMPLEMENTATION All code is released under the MIT License and is available at https://github.com/metagenlab/MeSS and https://github.com/metagenlab/assembly_finder.
Affiliation(s)
- Farid Chaabane
  - Institute of Microbiology, Lausanne University Hospital and University of Lausanne, Lausanne, 1011, Switzerland
- Trestan Pillonel
  - Institute of Microbiology, Lausanne University Hospital and University of Lausanne, Lausanne, 1011, Switzerland
- Claire Bertelli
  - Institute of Microbiology, Lausanne University Hospital and University of Lausanne, Lausanne, 1011, Switzerland
8. Calvo-Roitberg E, Daniels RF, Pai AA. Challenges in identifying mRNA transcript starts and ends from long-read sequencing data. Genome Res 2024; 34:1719-1734. [PMID: 39567236; DOI: 10.1101/gr.279559.124] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/07/2024] [Accepted: 08/16/2024] [Indexed: 11/22/2024]
Abstract
Long-read sequencing (LRS) technologies have the potential to revolutionize scientific discoveries in RNA biology through the comprehensive identification and quantification of full-length mRNA isoforms. Despite great promise, challenges remain in the widespread implementation of LRS technologies for RNA-based applications, including concerns about low coverage, high sequencing error, and the lack of robust computational pipelines. Although much focus has been placed on defining mRNA exon composition and structure with LRS data, the ability to assess the terminal ends of isoforms, specifically transcription start and end sites, has been characterized less carefully. Such characterization is crucial for completely delineating full mRNA molecules and regulatory consequences. However, there are substantial inconsistencies in both start and end coordinates of LRS reads spanning a gene, such that LRS reads often fail to accurately recapitulate annotated or empirically derived terminal ends of mRNA molecules. Here, we describe the specific challenges of identifying and quantifying mRNA terminal ends with LRS technologies and how these issues influence biological interpretations of LRS data. We then review recent experimental and computational advances designed to alleviate these problems, with ideal use cases for each approach. Finally, we outline anticipated developments and necessary improvements for the characterization of terminal ends from LRS data.
Affiliation(s)
- Ezequiel Calvo-Roitberg
  - RNA Therapeutics Institute, University of Massachusetts Chan Medical School, Worcester, Massachusetts 01605, USA
- Rachel F Daniels
  - RNA Therapeutics Institute, University of Massachusetts Chan Medical School, Worcester, Massachusetts 01605, USA
- Athma A Pai
  - RNA Therapeutics Institute, University of Massachusetts Chan Medical School, Worcester, Massachusetts 01605, USA
9. Brown NK, Shivakumar VS, Langmead B. Improved pangenomic classification accuracy with chain statistics. bioRxiv 2024:2024.10.29.620953. [PMID: 39554056; PMCID: PMC11565826; DOI: 10.1101/2024.10.29.620953] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2024]
Abstract
Compressed full-text indexes enable efficient sequence classification against a pangenome or tree-of-life index. Past work on compressed-index classification used matching statistics or pseudo-matching lengths to capture the fine-grained co-linearity of exact matches. But these fail to capture coarse-grained information about whether seeds appear co-linearly in the reference. We present a novel approach that additionally obtains coarse-grained co-linearity ("chain") statistics. We do this without using a chaining algorithm, which would require superlinear time in the number of matches. We start with a collection of strings, avoiding the multiple-alignment step required by graph approaches. We rapidly compute multi-maximal unique matches (multi-MUMs) and identify BWT sub-runs that correspond to these multi-MUMs. From these, we select those that can be "tunneled," and mark these with the corresponding multi-MUM identifiers. This yields an O(r + n/d)-space index for a collection of d sequences having a length-n BWT consisting of r maximal equal-character runs. Using the index, we simultaneously compute fine-grained matching statistics and coarse-grained chain statistics in linear time with respect to query length. We found that this substantially improves classification accuracy compared to past compressed-indexing approaches and reaches the same level of accuracy as less efficient alignment-based methods.
Affiliation(s)
- Nathaniel K. Brown
  - Department of Computer Science, Johns Hopkins University, Baltimore MD 21218
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore MD 21218
10. Bilgrav Saether K, Eisfeldt J. Detecting transposable elements in long-read genomes using sTELLeR. Bioinformatics 2024; 40:btae686. [PMID: 39558574; PMCID: PMC11601167; DOI: 10.1093/bioinformatics/btae686] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2024] [Revised: 11/05/2024] [Accepted: 11/14/2024] [Indexed: 11/20/2024] [Open Access]
Abstract
MOTIVATION Repeat elements, such as transposable elements (TEs), are highly repetitive DNA sequences that compose around 50% of the genome. TEs such as Alu, SVA, HERV, and L1 elements can cause disease by disrupting genes, causing frameshift mutations, or altering splicing patterns. These elements are challenging to characterize using short-read genome sequencing because of its read length and the repetitive nature of TEs. Long-read genome sequencing (lrGS) enables bridging of TEs, allowing increased resolution across repetitive DNA sequences. lrGS therefore presents an opportunity for improved TE detection and analysis, not only from a research perspective but also for future clinical detection. When choosing an lrGS TE caller, parameters such as runtime, CPU hours, sensitivity, precision, and compatibility with inclusion into pipelines are crucial for efficient detection. RESULTS We therefore developed sTELLeR, (s) Transposable ELement in Long (e) Read, for accurate, fast, and effective TE detection. In particular, sTELLeR exhibits higher precision and sensitivity for calling Alu elements than similar tools. The caller is 5-48× as fast and uses <2% of the CPU hours compared to competing callers. The caller is haplotype aware and outputs results in a variant call format (VCF) file, enabling compatibility with other variant callers and downstream analysis. AVAILABILITY AND IMPLEMENTATION sTELLeR is a Python-based tool and is available at https://github.com/kristinebilgrav/sTELLeR. Altogether, we show that sTELLeR is a fast, sensitive, and precise caller for detection of TEs, and can easily be implemented into variant calling workflows.
Affiliation(s)
- Kristine Bilgrav Saether
  - Department of Molecular Medicine and Surgery, Karolinska Institute, Stockholm 171 76, Sweden
  - Clinical Genomics Facility, Science for Life Laboratory, Stockholm 171 76, Sweden
- Jesper Eisfeldt
  - Department of Molecular Medicine and Surgery, Karolinska Institute, Stockholm 171 76, Sweden
  - Clinical Genomics Facility, Science for Life Laboratory, Stockholm 171 76, Sweden
  - Department of Clinical Genetics and Genomics, Karolinska University Hospital, Stockholm 171 77, Sweden
11. Luo J, Zhang Z, Ma X, Yan C, Luo H. GTasm: a genome assembly method using graph transformers and HiFi reads. Front Genet 2024; 15:1495657. [PMID: 39525812; PMCID: PMC11543488; DOI: 10.3389/fgene.2024.1495657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2024] [Accepted: 10/14/2024] [Indexed: 11/16/2024] [Open Access]
Abstract
Motivation Genome assembly aims to reconstruct the whole chromosome-scale genome sequence. An accurate and complete chromosome-scale genome sequence serves as an indispensable foundation for downstream genomics analyses. Because genome sequences contain complex repeat regions, assembly results are commonly fragmented. Highly accurate long reads can greatly enhance the integrity of genome assembly results. Results Here we introduce GTasm, an assembly method that uses a graph transformer network to find optimal assembly results based on assembly graphs. From the assembly graph, GTasm first extracts features of vertices and edges. GTasm then scores the edges with the graph transformer model and adopts a heuristic algorithm to find optimal paths in the assembly graph, each path corresponding to a contig. The graph transformer model is trained using simulated HiFi reads from CHM13, and GTasm is compared with other assembly methods using a real HiFi read set. Experimental results show that GTasm produces good assembly results and performs well on the NA50 and NGA50 evaluation metrics. Applying deep learning models to genome assembly can improve the continuity and accuracy of assembly results. The code is available from https://github.com/chu-xuezhe/GTasm.
Affiliation(s)
- Junwei Luo
  - School of Software, Henan Polytechnic University, Jiaozuo, China
- Ziheng Zhang
  - School of Software, Henan Polytechnic University, Jiaozuo, China
- Xinliang Ma
  - School of Software, Henan Polytechnic University, Jiaozuo, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
12. Krannich T, Ternovoj D, Paraskevopoulou S, Fuchs S. CIEVaD: A Lightweight Workflow Collection for the Rapid and On-Demand Deployment of End-to-End Testing for Genomic Variant Detection. Viruses 2024; 16:1444. [PMID: 39339920; PMCID: PMC11437481; DOI: 10.3390/v16091444] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2024] [Revised: 09/03/2024] [Accepted: 09/07/2024] [Indexed: 09/30/2024] [Open Access]
Abstract
The identification of genomic variants has become a routine task in the age of genome sequencing. In particular, small genomic variants of a single or few nucleotides are routinely investigated for their impact on an organism's phenotype. Hence, the precise and robust detection of the variants' exact genomic locations and changes in nucleotide composition is vital in many biological applications. Although a plethora of methods exist for the many key steps of variant detection, thoroughly testing the detection process and evaluating its results is still a cumbersome procedure. In this work, we present a collection of easy-to-apply and highly modifiable workflows to facilitate the generation of synthetic test data, as well as to evaluate the agreement of a user-provided set of variants with the test data. The workflows are implemented in Nextflow and are open-source and freely available on GitHub under the GPL-3.0 license.
Affiliation(s)
- Thomas Krannich
  - Genome Competence Center, Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
- Dimitri Ternovoj
  - Genome Competence Center, Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
- Stephan Fuchs
  - Genome Competence Center, Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
13. Sapoval N, Liu Y, Curry KD, Kille B, Huang W, Kokroko N, Nute MG, Tyshaieva A, Dilthey A, Molloy EK, Treangen TJ. Lightweight taxonomic profiling of long-read metagenomic datasets with Lemur and Magnet. bioRxiv 2024:2024.06.01.596961. [PMID: 38895276; PMCID: PMC11185576; DOI: 10.1101/2024.06.01.596961] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
The advent of long-read sequencing of microbiomes necessitates the development of new taxonomic profilers tailored to long-read shotgun metagenomic datasets. Here, we introduce Lemur and Magnet, a pair of tools optimized for lightweight and accurate taxonomic profiling for long-read shotgun metagenomic datasets. Lemur is a marker-gene-based method that leverages an EM algorithm to reduce false positive calls while preserving true positives; Magnet is a whole-genome read-mapping-based method that provides detailed presence and absence calls for bacterial genomes. We demonstrate that Lemur and Magnet can run in minutes to hours on a laptop with 32 GB of RAM, even for large inputs, a crucial feature given the portability of long-read sequencing machines. Furthermore, the marker gene database used by Lemur is only 4 GB and contains information from over 300,000 RefSeq genomes. Lemur and Magnet are open-source and available at https://github.com/treangenlab/lemur and https://github.com/treangenlab/magnet.
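As a rough illustration of how an EM algorithm can suppress false positive taxa, the sketch below estimates relative abundances from per-read likelihoods and then drops taxa whose estimated abundance falls under a cutoff. It is not Lemur's implementation: the function name `em_abundances`, the likelihood-matrix input, and the `prune_below` threshold are all assumptions made for this example.

```python
from typing import List


def em_abundances(
    likelihoods: List[List[float]],
    n_iter: int = 100,
    prune_below: float = 1e-3,
) -> List[float]:
    """Estimate taxon abundances with a plain EM loop (illustrative only).

    likelihoods[r][t] is an assumed, precomputed P(read r | taxon t).
    prune_below is a hypothetical cutoff for discarding weakly supported taxa.
    """
    n_taxa = len(likelihoods[0])
    abundances = [1.0 / n_taxa] * n_taxa  # start from a uniform prior

    for _ in range(n_iter):
        counts = [0.0] * n_taxa
        # E-step: split each read across taxa in proportion to
        # abundance * likelihood.
        for row in likelihoods:
            weights = [a * p for a, p in zip(abundances, row)]
            total = sum(weights)
            if total == 0.0:
                continue  # read is not explained by any taxon
            for t, w in enumerate(weights):
                counts[t] += w / total
        assigned = sum(counts)
        if assigned == 0.0:
            break
        # M-step: new abundances are the normalised expected read counts.
        abundances = [c / assigned for c in counts]

    # Drop taxa under the cutoff and renormalise the remainder.
    kept = [a if a >= prune_below else 0.0 for a in abundances]
    norm = sum(kept)
    return [a / norm for a in kept] if norm > 0.0 else kept


if __name__ == "__main__":
    # Eight reads match only taxon A; two reads match A and B equally well.
    reads = [[1.0, 0.0]] * 8 + [[0.5, 0.5]] * 2
    print(em_abundances(reads))
```

In the toy input, taxon B is supported only by reads that match taxon A equally well, so each EM iteration shifts more of those reads to A and B's estimated abundance decays towards zero before being pruned.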
Affiliation(s)
- Nicolae Sapoval
- Department of Computer Science, Rice University, Houston, TX 77005, USA
| | - Yunxi Liu
- Department of Computer Science, Rice University, Houston, TX 77005, USA
| | - Kristen D. Curry
- Department of Computer Science, Rice University, Houston, TX 77005, USA
| | - Bryce Kille
- Department of Computer Science, Rice University, Houston, TX 77005, USA
| | - Wenyu Huang
- Department of Computer Science, Rice University, Houston, TX 77005, USA
| | - Natalie Kokroko
- Department of Computer Science, Rice University, Houston, TX 77005, USA
| | - Michael G. Nute
- Department of Computer Science, Rice University, Houston, TX 77005, USA
| | - Alona Tyshaieva
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA
| | - Alexander Dilthey
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA
| | - Erin K. Molloy
- Department of Bioengineering, Rice University, Houston, TX 77005, USA
| | - Todd J. Treangen
- Department of Computer Science, Rice University, Houston, TX 77005, USA
- Department of Bioengineering, Rice University, Houston, TX 77005, USA
|
14
|
Luo C, Liu YH, Zhou XM. VolcanoSV enables accurate and robust structural variant calling in diploid genomes from single-molecule long read sequencing. Nat Commun 2024; 15:6956. [PMID: 39138168 PMCID: PMC11322167 DOI: 10.1038/s41467-024-51282-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Accepted: 07/31/2024] [Indexed: 08/15/2024] Open
Abstract
Structural variants (SVs) significantly contribute to human genome diversity and play a crucial role in precision medicine. Although advancements in single-molecule long-read sequencing offer a groundbreaking resource for SV detection, identifying SV breakpoints and sequences accurately and robustly remains challenging. We introduce VolcanoSV, an innovative hybrid SV detection pipeline that utilizes both a reference genome and local de novo assembly to generate a phased diploid assembly. VolcanoSV uses phased SNPs and unique k-mer similarity analysis, enabling precise haplotype-resolved SV discovery. VolcanoSV is adept at constructing comprehensive genetic maps encompassing SNPs, small indels, and all types of SVs, making it well-suited for human genomics studies. Our extensive experiments demonstrate that VolcanoSV surpasses state-of-the-art assembly-based tools in the detection of insertion and deletion SVs, exhibiting superior recall, precision, F1 scores, and genotype accuracy across a diverse range of datasets, including low-coverage (10x) datasets. VolcanoSV outperforms assembly-based tools in the identification of complex SVs, including translocations, duplications, and inversions, in both simulated and real cancer data. Moreover, VolcanoSV is robust to various evaluation parameters and accurately identifies breakpoints and SV sequences.
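The recall, precision, and F1 scores named in this abstract follow the standard definitions over true positive, false positive, and false negative calls. A minimal sketch, assuming the counts have already been produced by some comparison of called SVs against a truth set (that matching step is not shown, and the function name is hypothetical):

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from call counts.

    tp: called SVs that match the truth set
    fp: called SVs with no match in the truth set
    fn: truth-set SVs that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(precision_recall_f1(tp=90, fp=10, fn=30))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why benchmarks usually report all three values together.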
Affiliation(s)
- Can Luo
- Department of Biomedical Engineering, Vanderbilt University, Nashville, TN, USA
| | - Yichen Henry Liu
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA
| | - Xin Maizie Zhou
- Department of Biomedical Engineering, Vanderbilt University, Nashville, TN, USA.
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA.
- Data Science Institute, Vanderbilt University, Nashville, TN, USA.
|
15
|
Mustafa H, Karasikov M, Mansouri Ghiasi N, Rätsch G, Kahles A. Label-guided seed-chain-extend alignment on annotated De Bruijn graphs. Bioinformatics 2024; 40:i337-i346. [PMID: 38940164 PMCID: PMC11211850 DOI: 10.1093/bioinformatics/btae226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. RESULTS We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
AVAILABILITY AND IMPLEMENTATION The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
Affiliation(s)
- Harun Mustafa
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
| | - Mikhail Karasikov
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
| | - Nika Mansouri Ghiasi
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, 8092, Switzerland
| | - Gunnar Rätsch
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- ETH AI Center, Zurich, 8092, Switzerland
- Department of Biology, ETH Zurich, Zurich, 8093, Switzerland
- The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
| | - André Kahles
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
|
16
|
Kim J, Steinegger M. Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA. Nat Methods 2024; 21:971-973. [PMID: 38769467 DOI: 10.1038/s41592-024-02273-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Accepted: 04/11/2024] [Indexed: 05/22/2024]
Abstract
Metagenomic taxonomic classifiers analyze either DNA or amino acid (AA) sequences. Metabuli (https://metabuli.steineggerlab.com), however, jointly analyzes both DNA and AA to leverage AA conservation for sensitive homology detection and DNA mutations for specific differentiation of closely related taxa. In the Critical Assessment of Metagenome Interpretation 2 plant-associated dataset, Metabuli covered 99% and 98% of classifications of state-of-the-art DNA- and AA-based classifiers, respectively.
Affiliation(s)
- Jaebeom Kim
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
| | - Martin Steinegger
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea.
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea.
- Institute of Molecular Biology and Genetics, Seoul National University, Seoul, Republic of Korea.
- Artificial Intelligence Institute, Seoul National University, Seoul, Republic of Korea.
|
17
|
Su Y, Yu Z, Jin S, Ai Z, Yuan R, Chen X, Xue Z, Guo Y, Chen D, Liang H, Liu Z, Liu W. Comprehensive assessment of mRNA isoform detection methods for long-read sequencing data. Nat Commun 2024; 15:3972. [PMID: 38730241 PMCID: PMC11087464 DOI: 10.1038/s41467-024-48117-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2022] [Accepted: 04/19/2024] [Indexed: 05/12/2024] Open
Abstract
The advancement of Long-Read Sequencing (LRS) techniques has significantly increased the length of sequencing to several kilobases, thereby facilitating the identification of alternative splicing events and isoform expressions. Recently, numerous computational tools for isoform detection using long-read sequencing data have been developed. Nevertheless, there remains a deficiency in comparative studies that systematically evaluate the performance of these tools, which are implemented with different algorithms, under various simulations that encompass potential influencing factors. In this study, we conducted a benchmark analysis of thirteen methods implemented in nine tools capable of identifying isoform structures from long-read RNA-seq data. We evaluated their performances using simulated data, which represented diverse sequencing platforms generated by an in-house simulator, RNA sequins (sequencing spike-ins) data, as well as experimental data. Our findings demonstrate IsoQuant as a highly effective tool for isoform detection with LRS, with Bambu and StringTie2 also exhibiting strong performance. These results offer valuable guidance for future research on alternative splicing analysis and the ongoing improvement of tools for isoform detection using LRS data.
Affiliation(s)
- Yaqi Su
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
| | - Zhejian Yu
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
| | - Siqian Jin
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
| | - Zhipeng Ai
- Division of Human Reproduction and Developmental Genetics, Women's Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310006, Zhejiang, China
| | - Ruihong Yuan
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
| | - Xinyi Chen
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
| | - Ziwei Xue
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
| | - Yixin Guo
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
| | - Di Chen
- Center for Reproductive Medicine of the Second Affiliated Hospital Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China
- Centre for Regeneration and Cell Therapy of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
| | - Hongqing Liang
- Division of Human Reproduction and Developmental Genetics, Women's Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310006, Zhejiang, China
| | - Zuozhu Liu
- Zhejiang University-Angel Align Inc. R&D Center for Intelligent Healthcare, Zhejiang University-University of Illinois at Urbana-Champaign Institute (ZJU-UIUC Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China
| | - Wanlu Liu
- Department of Orthopedic Surgery of the Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310009, Zhejiang, China.
- Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), International Campus, Zhejiang University, Haining, 314400, Zhejiang, China.
- Future Health Laboratory, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China.
- Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Zhejiang University, Hangzhou, 310058, Zhejiang, China.
|
18
|
Cooley NP, Wright ES. Many purported pseudogenes in bacterial genomes are bona fide genes. BMC Genomics 2024; 25:365. [PMID: 38622536 PMCID: PMC11017572 DOI: 10.1186/s12864-024-10137-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 02/17/2024] [Indexed: 04/17/2024] Open
Abstract
BACKGROUND Microbial genomes are largely comprised of protein coding sequences, yet some genomes contain many pseudogenes caused by frameshifts or internal stop codons. These pseudogenes are believed to result from gene degradation during evolution but could also be technical artifacts of genome sequencing or assembly. RESULTS Using a combination of observational and experimental data, we show that many putative pseudogenes are attributable to errors that are incorporated into genomes during assembly. Within 126,564 publicly available genomes, we observed that nearly identical genomes often substantially differed in pseudogene counts. Causal inference implicated assembler, sequencing platform, and coverage as likely causative factors. Reassembly of genomes from raw reads confirmed that each variable affects the number of putative pseudogenes in an assembly. Furthermore, simulated sequencing reads corroborated our observations that the quality and quantity of raw data can significantly impact the number of pseudogenes in an assembler-dependent fashion. The number of unexpected pseudogenes due to internal stops was highly correlated (R² = 0.96) with average nucleotide identity to the ground truth genome, implying relative pseudogene counts can be used as a proxy for overall assembly correctness. Applying our method to assemblies in RefSeq resulted in rejection of 3.6% of assemblies due to significantly elevated pseudogene counts. Reassembly from real reads obtained from high coverage genomes showed considerable variability in spurious pseudogenes beyond that observed with simulated reads, reinforcing the finding that high coverage is necessary to mitigate assembly errors. CONCLUSIONS Collectively, these results demonstrate that many pseudogenes in microbial genome assemblies are actually genes. Our results suggest that high read coverage is required for correct assembly and indicate an inflated number of pseudogenes due to internal stops is indicative of poor overall assembly quality.
Affiliation(s)
- Nicholas P Cooley
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Erik S Wright
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA.
- Center for Evolutionary Biology and Medicine, Pittsburgh, PA, USA.
|
19
|
Liu YH, Luo C, Golding SG, Ioffe JB, Zhou XM. Tradeoffs in alignment and assembly-based methods for structural variant detection with long-read sequencing data. Nat Commun 2024; 15:2447. [PMID: 38503752 PMCID: PMC10951360 DOI: 10.1038/s41467-024-46614-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2022] [Accepted: 03/04/2024] [Indexed: 03/21/2024] Open
Abstract
Long-read sequencing offers long contiguous DNA fragments, facilitating diploid genome assembly and structural variant (SV) detection. Efficient and robust algorithms for SV identification are crucial with increasing data availability. Alignment-based methods, favored for their computational efficiency and lower coverage requirements, are prominent. Alternative approaches, relying solely on available reads for de novo genome assembly and employing assembly-based tools for SV detection via comparison to a reference genome, demand significantly more computational resources. However, the lack of comprehensive benchmarking constrains our comprehension and hampers further algorithm development. Here we systematically compare 14 read alignment-based SV calling methods (including 4 deep learning-based methods and 1 hybrid method), and 4 assembly-based SV calling methods, alongside 4 upstream aligners and 7 assemblers. Assembly-based tools excel in detecting large SVs, especially insertions, and exhibit robustness to evaluation parameter changes and coverage fluctuations. Conversely, alignment-based tools demonstrate superior genotyping accuracy at low sequencing coverage (5-10×) and excel in detecting complex SVs, like translocations, inversions, and duplications. Our evaluation provides performance insights, highlighting the absence of a universally superior tool. We furnish guidelines across 31 criteria combinations, aiding users in selecting the most suitable tools for diverse scenarios and offering directions for further method development.
Affiliation(s)
- Yichen Henry Liu
- Department of Computer Science, Vanderbilt University, 37235, Nashville, TN, USA
| | - Can Luo
- Department of Biomedical Engineering, Vanderbilt University, 37235, Nashville, TN, USA
| | - Staunton G Golding
- Department of Biomedical Engineering, Vanderbilt University, 37235, Nashville, TN, USA
| | - Jacob B Ioffe
- Department of Computer Science, Vanderbilt University, 37235, Nashville, TN, USA
| | - Xin Maizie Zhou
- Department of Computer Science, Vanderbilt University, 37235, Nashville, TN, USA.
- Department of Biomedical Engineering, Vanderbilt University, 37235, Nashville, TN, USA.
- Data Science Institute, Vanderbilt University, 37235, Nashville, TN, USA.
|
20
|
Hui X, Yang J, Sun J, Liu F, Pan W. MCSS: microbial community simulator based on structure. Front Microbiol 2024; 15:1358257. [PMID: 38516019 PMCID: PMC10956353 DOI: 10.3389/fmicb.2024.1358257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Accepted: 02/20/2024] [Indexed: 03/23/2024] Open
Abstract
De novo assembly plays a pivotal role in metagenomic analysis, and the incorporation of third-generation sequencing technology can significantly improve the integrity and accuracy of assembly results. Recently, with advancements in sequencing technology (Hi-Fi, ultra-long), several long-read-based bioinformatic tools have been developed. However, the validation of the performance and reliability of these tools is a crucial concern. To address this gap, we present MCSS (microbial community simulator based on structure), which has the capability to generate simulated microbial community and sequencing datasets based on the structure attributes of real microbiome communities. The evaluation results indicate that it can generate simulated communities that exhibit both diversity and similarity to actual community structures. Additionally, MCSS generates synthetic PacBio Hi-Fi and Oxford Nanopore Technologies (ONT) long reads for the species within the simulated community. This innovative tool provides a valuable resource for benchmarking and refining metagenomic analysis methods. Code available at: https://github.com/panlab-bio/mcss.
Affiliation(s)
- Xingqi Hui
- Zhengzhou Research Base, State Key Laboratory of Cotton Biology, School of Agricultural Sciences, Zhengzhou University, Zhengzhou, China
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences (ICR, CAAS), Shenzhen, China
| | - Jinbao Yang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences (ICR, CAAS), Shenzhen, China
- College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Jinhuan Sun
- Key Laboratory of Plant Molecular Physiology, CAS Center for Excellence in Molecular Plant Sciences, Institute of Botany, Chinese Academy of Sciences, Beijing, China
| | - Fang Liu
- Zhengzhou Research Base, State Key Laboratory of Cotton Biology, School of Agricultural Sciences, Zhengzhou University, Zhengzhou, China
- National Key Laboratory of Cotton Bio-Breeding and Integrated Utilization, Institute of Cotton Research, Chinese Academy of Agricultural Sciences (ICR, CAAS), Anyang, China
| | - Weihua Pan
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences (ICR, CAAS), Shenzhen, China
|
21
|
Rautiainen M. Ribotin: automated assembly and phasing of rDNA morphs. Bioinformatics 2024; 40:btae124. [PMID: 38441320 PMCID: PMC10948282 DOI: 10.1093/bioinformatics/btae124] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 10/04/2023] [Revised: 01/19/2024] [Accepted: 03/01/2024] [Indexed: 03/20/2024] Open
Abstract
MOTIVATION The ribosomal DNA (rDNA) arrays are highly repetitive and homogenous regions which exist in all life. Due to their repetitiveness, current assembly methods do not fully assemble the rDNA arrays in humans and many other eukaryotes, and so variation within the rDNA arrays cannot be effectively studied. RESULTS Here, we present the tool ribotin to assemble full length rDNA copies, or morphs. Ribotin uses a combination of highly accurate long reads and extremely long nanopore reads to resolve the variation between rDNA morphs. We show that ribotin successfully recovers the most abundant morphs in human and nonhuman genomes. We also find that genome wide consensus sequences of the rDNA arrays frequently produce a mosaic sequence that does not exist in the genome. AVAILABILITY AND IMPLEMENTATION Ribotin is available on https://github.com/maickrau/ribotin and as a package on bioconda.
Affiliation(s)
- Mikko Rautiainen
- Institute for Molecular Medicine Finland (FIMM), Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
|
22
|
Qin Q, Popic V, Yu H, White E, Khorgade A, Shin A, Wienand K, Dondi A, Beerenwinkel N, Vazquez F, Al’Khafaji AM, Haas BJ. CTAT-LR-fusion: accurate fusion transcript identification from long and short read isoform sequencing at bulk or single cell resolution. bioRxiv 2024:2024.02.24.581862. [PMID: 38464114 PMCID: PMC10925146 DOI: 10.1101/2024.02.24.581862] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Gene fusions are found as cancer drivers in diverse adult and pediatric cancers. Accurate detection of fusion transcripts is essential in cancer clinical diagnostics, prognostics, and for guiding therapeutic development. Most currently available methods for fusion transcript detection are compatible with Illumina RNA-seq involving highly accurate short read sequences. Recent advances in long read isoform sequencing enable the detection of fusion transcripts at unprecedented resolution in bulk and single cell samples. Here we developed a new computational tool CTAT-LR-fusion to detect fusion transcripts from long read RNA-seq with or without companion short reads, with applications to bulk or single cell transcriptomes. We demonstrate that CTAT-LR-fusion exceeds fusion detection accuracy of alternative methods as benchmarked with simulated and real long read RNA-seq. Using short and long read RNA-seq, we further apply CTAT-LR-fusion to bulk transcriptomes of nine tumor cell lines, and to tumor single cells derived from a melanoma sample and three metastatic high grade serous ovarian carcinoma samples. In both bulk and in single cell RNA-seq, long isoform reads yielded higher sensitivity for fusion detection than short reads with notable exceptions. By combining short and long reads in CTAT-LR-fusion, we are able to further maximize detection of fusion splicing isoforms and fusion-expressing tumor cells. CTAT-LR-fusion is available at https://github.com/TrinityCTAT/CTAT-LR-fusion/wiki.
Affiliation(s)
- Qian Qin
- Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142 USA
| | - Victoria Popic
- Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142 USA
| | - Houlin Yu
- Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142 USA
| | - Emily White
- Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142 USA
| | - Akanksha Khorgade
- Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142 USA
| | - Asa Shin
- Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142 USA
| | - Kirsty Wienand
- Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142 USA
| | - Arthur Dondi
- ETH Zurich, Department of Biosystems Science and Engineering, Schanzenstrasse 44, 4056 Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Schanzenstrasse 44, 4056 Basel, Switzerland
| | - Niko Beerenwinkel
- ETH Zurich, Department of Biosystems Science and Engineering, Schanzenstrasse 44, 4056 Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Schanzenstrasse 44, 4056 Basel, Switzerland
| | - Francisca Vazquez
- Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142 USA
| | - Aziz M. Al’Khafaji
- Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142 USA
| | - Brian J. Haas
- Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142 USA
|
23
|
Hou Y, Wang L, Pan W. Comparison of Hi-C-Based Scaffolding Tools on Plant Genomes. Genes (Basel) 2023; 14:2147. [PMID: 38136968 PMCID: PMC10742964 DOI: 10.3390/genes14122147] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/09/2023] [Revised: 11/03/2023] [Accepted: 11/13/2023] [Indexed: 12/24/2023] Open
Abstract
De novo genome assembly holds paramount significance in the field of genomics. Scaffolding, as a pivotal component within the genome assembly process, is instrumental in determining the orientation and arrangement of contigs, ultimately facilitating the generation of a chromosome-level assembly. Scaffolding is contingent on supplementary linkage information, including paired-end reads, bionano, physical mapping, genetic mapping, and Hi-C (an abbreviation for High-throughput Chromosome Conformation Capture). In recent years, Hi-C has emerged as the predominant source of linkage information in scaffolding, attributed to its capacity to offer long-range signals, leading to the development of numerous Hi-C-based scaffolding tools. However, to the best of our knowledge, there has been a paucity of comprehensive studies assessing and comparing the efficacy of these tools. In order to address this gap, we meticulously selected six tools, namely LACHESIS, pin_hic, YaHS, SALSA2, 3d-DNA, and ALLHiC, and conducted a comparative analysis of their performance across haploid, diploid, and polyploid genomes. This endeavor has yielded valuable insights in advancing the field of genome scaffolding research.
Affiliation(s)
- Yuze Hou
- College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan 030024, China
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
| | - Li Wang
- College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan 030024, China;
| | - Weihua Pan
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
| |
|
24
|
Sun Y, Wang M, Cao L, Seim I, Zhou L, Chen J, Wang H, Zhong Z, Chen H, Fu L, Li M, Li C, Sun S. Mosaic environment-driven evolution of the deep-sea mussel Gigantidas platifrons bacterial endosymbiont. Microbiome 2023; 11:253. [PMID: 37974296 PMCID: PMC10652631 DOI: 10.1186/s40168-023-01695-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2022] [Accepted: 10/11/2023] [Indexed: 11/19/2023]
Abstract
BACKGROUND: The within-species diversity of symbiotic bacteria represents an important genetic resource for their environmental adaptation, especially for horizontally transmitted endosymbionts. Although strain-level intraspecies variation has recently been detected in many deep-sea endosymbionts, their ecological role in environmental adaptation, their genome evolution pattern under heterogeneous geochemical environments, and the underlying molecular forces remain unclear. RESULTS: Here, we conducted a fine-scale metagenomic analysis of the deep-sea mussel Gigantidas platifrons bacterial endosymbiont collected from distinct habitats: hydrothermal vent and methane seep. Endosymbiont genomes were assembled using a pipeline that distinguishes within-species variation and revealed highly heterogeneous compositions in mussels from different habitats. Phylogenetic analysis separated the assemblies into three distinct environment-linked clades. Their functional differentiation follows a mosaic evolutionary pattern. Core genes, essential for central metabolic function and symbiosis, were conserved across all clades. Clade-specific genes associated with heavy metal resistance, pH homeostasis, and nitrate utilization exhibited signals of accelerated evolution. Notably, transposable elements and plasmids contributed to the genetic reshuffling of the symbiont genomes and likely accelerated adaptive evolution through pseudogenization and the introduction of new genes. CONCLUSIONS: The current study uncovers the environment-driven evolution of deep-sea symbionts mediated by mobile genetic elements. Its findings highlight a potentially common and critical role of within-species diversity in animal-microbiome symbioses.
Affiliation(s)
- Yan Sun
- CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
| | - Minxiao Wang
- CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
| | - Lei Cao
- CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
| | - Inge Seim
- Integrative Biology Laboratory, College of Life Sciences, Nanjing Normal University, Nanjing, 210046, China
- School of Biology and Environmental Science, Queensland University of Technology, Brisbane, QLD, 4000, Australia
| | - Li Zhou
- CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
| | - Jianwei Chen
- BGI Research-Qingdao, BGI, Qingdao, 266555, China
| | - Hao Wang
- CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
| | - Zhaoshan Zhong
- CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
| | - Hao Chen
- CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
| | - Lulu Fu
- CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
| | - Mengna Li
- CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
| | - Chaolun Li
- CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
- South China Sea Institute of Oceanology, Chinese Academy of Sciences, Guangzhou, 510301, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Song Sun
- CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| |
|
25
|
Liu J, Liu F, Pan W. Improving the Completeness of Chromosome-Level Assembly by Recalling Sequences from Lost Contigs. Genes (Basel) 2023; 14:1926. [PMID: 37895275 PMCID: PMC10606404 DOI: 10.3390/genes14101926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Revised: 09/13/2023] [Accepted: 09/20/2023] [Indexed: 10/29/2023] Open
Abstract
For a long time, the construction of complete reference genomes for complex eukaryotic genomes has been hindered by the limitations of sequencing technologies. Recently, the Pacific Biosciences (PacBio) HiFi data and Oxford Nanopore Technologies (ONT) Ultra-Long data, leveraging their respective advantages in accuracy and length, have provided an opportunity for generating complete chromosome sequences. Nevertheless, for the majority of genomes, the chromosome-level assemblies generated using existing methods still miss a high proportion of sequences due to losing small contigs in the step of assembly and scaffolding. To address this shortcoming, in this paper, we propose a novel method that is able to identify and fill the gaps in the chromosome-level assembly by recalling the sequences in the lost small contigs. Experimental results on both real and simulated datasets demonstrate that this method is able to improve the completeness of the chromosome-level assembly.
Affiliation(s)
- Junyang Liu
- Zhengzhou Research Base, State Key Laboratory of Cotton Biology, School of Agricultural Sciences, Zhengzhou University, Zhengzhou 450001, China;
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences (ICR, CAAS), Shenzhen 518120, China
| | - Fang Liu
- Zhengzhou Research Base, State Key Laboratory of Cotton Biology, School of Agricultural Sciences, Zhengzhou University, Zhengzhou 450001, China;
- National Key Laboratory of Cotton Bio-Breeding and Integrated Utilization, Institute of Cotton Research, Chinese Academy of Agricultural Sciences (ICR, CAAS), Anyang 455000, China
| | - Weihua Pan
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences (ICR, CAAS), Shenzhen 518120, China
| |
|
26
|
Mestre-Tomás J, Liu T, Pardo-Palacios F, Conesa A. SQANTI-SIM: a simulator of controlled transcript novelty for lrRNA-seq benchmark. bioRxiv 2023:2023.08.23.554392. [PMID: 37662216 PMCID: PMC10473693 DOI: 10.1101/2023.08.23.554392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2023]
Abstract
Long-read RNA-seq has emerged as a powerful tool for transcript discovery, even in well-annotated organisms. However, assessing the accuracy of different methods in identifying annotated and novel transcripts remains a challenge. Here, we present SQANTI-SIM, a versatile utility that wraps around popular long-read simulators to allow precise management of transcript novelty based on the structural categories defined by SQANTI3. By selectively excluding specific transcripts from the reference dataset, SQANTI-SIM effectively emulates scenarios involving unannotated transcripts. Furthermore, the tool provides customizable features and supports the simulation of additional types of data, representing the first multi-omics simulation tool for the lrRNA-seq field. We demonstrate the effectiveness of SQANTI-SIM by benchmarking five transcriptome reconstruction pipelines using the simulated data.
Affiliation(s)
- Jorge Mestre-Tomás
- Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
| | - Tianyuan Liu
- Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
| | - Francisco Pardo-Palacios
- Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
| | - Ana Conesa
- Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
| |
|
27
|
Luo J, Guan T, Chen G, Yu Z, Zhai H, Yan C, Luo H. SLHSD: hybrid scaffolding method based on short and long reads. Brief Bioinform 2023; 24:7152317. [PMID: 37141142 DOI: 10.1093/bib/bbad169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2022] [Revised: 01/08/2023] [Accepted: 04/12/2023] [Indexed: 05/05/2023] Open
Abstract
In genome assembly, scaffolding can obtain more complete and continuous scaffolds. Current scaffolding methods usually adopt one type of read to construct a scaffold graph and then orient and order contigs. However, scaffolding with the strengths of two or more types of reads seems to be a better solution to some tricky problems. Combining the advantages of different types of data is significant for scaffolding. Here, a hybrid scaffolding method (SLHSD) is presented that simultaneously leverages the precision of short reads and the length advantage of long reads. Building an optimal scaffold graph is an important foundation for getting scaffolds. SLHSD uses a new algorithm that combines long and short read alignment information to determine whether to add an edge and how to calculate the edge weight in a scaffold graph. In addition, SLHSD develops a strategy to ensure that edges with high confidence can be added to the graph with priority. Then, a linear programming model is used to detect and remove remaining false edges in the graph. We compared SLHSD with other scaffolding methods on five datasets. Experimental results show that SLHSD outperforms other methods. The open-source code of SLHSD is available at https://github.com/luojunwei/SLHSD.
Affiliation(s)
- Junwei Luo
- School of Software, Henan Polytechnic University, Jiaozuo 454003, China
| | - Ting Guan
- School of Software, Henan Polytechnic University, Jiaozuo 454003, China
| | - Guolin Chen
- School of Software, Henan Polytechnic University, Jiaozuo 454003, China
| | - Zhonghua Yu
- School of Software, Henan Polytechnic University, Jiaozuo 454003, China
| | - Haixia Zhai
- School of Software, Henan Polytechnic University, Jiaozuo 454003, China
| | - Chaokun Yan
- School of Computer and Information Engineering, Henan University, Kaifeng 475001, China
| | - Huimin Luo
- School of Computer and Information Engineering, Henan University, Kaifeng 475001, China
| |
|