1
|
Urban JM, Gerbi SA, Spradling AC. Chromosome-scale scaffolds of the fungus gnat genome reveal multi-Mb-scale chromosome-folding interactions, centromeric enrichments of retrotransposons, and candidate telomere sequences. BMC Genomics 2025; 26:443. [PMID: 40325439 PMCID: PMC12051294 DOI: 10.1186/s12864-025-11573-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2025] [Accepted: 04/04/2025] [Indexed: 05/07/2025]
Abstract
BACKGROUND The lower Dipteran fungus gnat, Bradysia (aka Sciara) coprophila, has compelling chromosome biology. Paternal chromosomes are eliminated during male meiosis I and both maternal X sister chromatids are retained in male meiosis II. Embryos start with three copies of the X chromosome, but 1-2 copies are eliminated from somatic cells as part of sex determination, and one is eliminated in the germline to restore diploidy. In addition, there is gene amplification in larval polytene chromosomes, and the X polytene chromosome folds back on itself mediated by extremely long-range interactions between three loci. These developmentally normal events present opportunities to study chromosome behaviors that are unusual in other systems. Moreover, little is known about the centromeric and telomeric sequences of lower Dipterans in general, and there are recent claims of horizontally-transferred genes in fungus gnats. Overall, there is a pressing need to learn more about the fungus gnat chromosome sequences. RESULTS We produced the first chromosome-scale models of the X and autosomal chromosomes where each somatic chromosome is represented by a single scaffold. Extensive analysis supports the chromosome identity and structural accuracy of the scaffolds, demonstrating they are co-linear with historical polytene maps, consistent with evolutionary expectations, and have accurate centromere positions, chromosome lengths, and copy numbers. The positions of alleged horizontally-transferred genes in the nuclear chromosomes were broadly confirmed by genomic analyses of the chromosome scaffolds using Hi-C and single-molecule long-read datasets. The chromosomal context of repeats shows family-specific biases, such as retrotransposons correlated with the centromeres. Moreover, scaffold termini were enriched with arrays of retrotransposon-related sequence as well as nucleosome-length (~175 bp) satellite repeats. Finally, the Hi-C data captured Mb-scale physical interactions on the X chromosome that are seen in polytene spreads, and we characterize these interesting "fold-back regions" at the sequence level for the first time. CONCLUSIONS The chromosome scaffolds were shown to be of exceptional quality, including loci harboring horizontally-transferred genes. Repeat analyses demonstrate family-specific biases and telomere repeat candidates. Hi-C analyses revealed the sequences of ultra-long-range interactions on the X chromosome. The chromosome-scale scaffolds pave the way for further studies of the unusual chromosome movements in Bradysia coprophila.
Affiliation(s)
- John M Urban
  - Carnegie Institution for Science, Department of Embryology, Howard Hughes Medical Institute Research Laboratories, 3520 San Martin Drive, Baltimore, MD, 21218, USA
- Susan A Gerbi
  - Division of Biology and Medicine, Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, RI, 02912, USA
- Allan C Spradling
  - Carnegie Institution for Science, Department of Embryology, Howard Hughes Medical Institute Research Laboratories, 3520 San Martin Drive, Baltimore, MD, 21218, USA
|
2
|
Chen X, Urban JM, Wurlitzer J, Wei X, Han J, O'Connor SE, Rudolf JD, Köllner TG, Chen F. Canonical terpene synthases in arthropods: Intraphylum gene transfer. Proc Natl Acad Sci U S A 2024; 121:e2413007121. [PMID: 39671179 DOI: 10.1073/pnas.2413007121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2024] [Accepted: 11/11/2024] [Indexed: 12/14/2024]
Abstract
Insects employ terpenoids for communication both within and between species. While terpene synthases derived from isoprenyl diphosphate synthase have been shown to catalyze terpenoid biosynthesis in some insects, canonical terpene synthases (TPS) commonly found in plants, fungi, and bacteria were previously unidentified in insects. This study reveals the presence of TPS genes in insects, likely originating via horizontal gene transfer from noninsect arthropods. By examining 361 insect genomes, we identified TPS genes in five species of the Sciaridae family (fungus gnats). Additionally, TPS genes were found in Collembola (springtails) and Acariformes (mites) among diverse noninsect arthropods. Selected TPS enzymes from Sciaridae, Collembola, and Acariformes display monoterpene, sesquiterpene, and/or diterpene synthase activities. Through comprehensive protein database search and phylogenetic analysis, the TPS genes in Sciaridae were found to be most closely related to those in Acariformes, suggesting transfer of TPS genes from Acariformes to Sciaridae. In the model Sciaridae Bradysia coprophila, all five TPS genes are most highly expressed in adult males, suggesting a sex- and developmental stage-specific role of their terpenoid products. The finding of TPS genes in insects and their possible evolutionary origin through intraphylum gene transfer within arthropods sheds light on metabolic innovation in insects.
Affiliation(s)
- Xinlu Chen
  - Department of Plant Sciences, University of Tennessee, Knoxville, TN 37996
- John M Urban
  - HHMI Research Laboratories, Carnegie Institution for Science, Baltimore, MD 21218
  - Department of Embryology, Carnegie Institution for Science, Baltimore, MD 21218
- Jens Wurlitzer
  - Department of Natural Product Biosynthesis, Max-Planck-Institute for Chemical Ecology, Jena 07745, Germany
- Xiuting Wei
  - Department of Chemistry, University of Florida, Gainesville, FL 32611
- Jin Han
  - Department of Plant Sciences, University of Tennessee, Knoxville, TN 37996
- Sarah E O'Connor
  - Department of Natural Product Biosynthesis, Max-Planck-Institute for Chemical Ecology, Jena 07745, Germany
- Jeffrey D Rudolf
  - Department of Chemistry, University of Florida, Gainesville, FL 32611
- Tobias G Köllner
  - Department of Natural Product Biosynthesis, Max-Planck-Institute for Chemical Ecology, Jena 07745, Germany
- Feng Chen
  - Department of Plant Sciences, University of Tennessee, Knoxville, TN 37996
|
3
|
Mukherjee K, Dole-Muinos D, Ajayi A, Rossi M, Prosperi M, Boucher C. Finding Overlapping Rmaps via Clustering. IEEE/ACM Trans Comput Biol Bioinform 2021; PP:1-1. [PMID: 34890332 DOI: 10.1109/tcbb.2021.3132534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Optical mapping has been largely automated, and first produces single molecule restriction maps, called Rmaps, which are assembled to generate genome wide optical maps. Since the location and orientation of each Rmap is unknown, the first problem in the analysis of this data is finding related Rmaps, i.e., pairs of Rmaps that share the same orientation and have significant overlap in their genomic location. Although heuristics for identifying related Rmaps exist, they all require quantization of the data, which leads to a loss in precision. In this paper, we propose a clustering method based on Gaussian mixture modelling, which we refer to as OMclust, that finds overlapping Rmaps without quantization. Using both simulated and real datasets, we show that OMclust substantially improves the precision (from 48.3% to 73.3%) over the state-of-the-art methods while also reducing CPU time and memory consumption. Further, we integrated OMclust into the error correction methods (Elmeri and cOMet) to demonstrate the increase in the performance of these methods. When OMclust was combined with cOMet to error correct Rmap data generated from human DNA, it was able to error correct close to 3x more Rmaps, and reduced the CPU time by more than 35x.
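To make the quantization issue mentioned in this abstract concrete, here is a minimal sketch. It assumes an Rmap is represented as a list of fragment lengths in bp and uses a hypothetical fixed bin width; the values, the `quantize` helper, and the binning scheme are illustrative assumptions, not the procedure of any tool named in the abstract.

```python
def quantize(fragments, bin_width):
    """Map each fragment length (bp) to the index of its fixed-width bin."""
    return [length // bin_width for length in fragments]


# Two noisy measurements of the same three fragments (hypothetical values).
rmap_a = [3950, 12100, 7980]
rmap_b = [4020, 11900, 8010]

BIN_WIDTH = 4000  # hypothetical bin width in bp

codes_a = quantize(rmap_a, BIN_WIDTH)
codes_b = quantize(rmap_b, BIN_WIDTH)

# Largest raw disagreement between corresponding fragments.
max_diff = max(abs(a - b) for a, b in zip(rmap_a, rmap_b))

print(codes_a, codes_b, max_diff)
```

In this toy case the raw lengths never differ by more than 200 bp, yet every fragment falls into a different bin in the two Rmaps, which is the kind of precision loss a method working on unquantized lengths could avoid.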
|
4
|
Urban JM, Foulk MS, Bliss JE, Coleman CM, Lu N, Mazloom R, Brown SJ, Spradling AC, Gerbi SA. High contiguity de novo genome assembly and DNA modification analyses for the fungus fly, Sciara coprophila, using single-molecule sequencing. BMC Genomics 2021; 22:643. [PMID: 34488624 PMCID: PMC8419958 DOI: 10.1186/s12864-021-07926-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 02/14/2021] [Accepted: 08/08/2021] [Indexed: 12/26/2022]
Abstract
BACKGROUND The lower Dipteran fungus fly, Sciara coprophila, has many unique biological features that challenge the rule of genome DNA constancy. For example, Sciara undergoes paternal chromosome elimination and maternal X chromosome nondisjunction during spermatogenesis, paternal X elimination during embryogenesis, intrachromosomal DNA amplification of DNA puff loci during larval development, and germline-limited chromosome elimination from all somatic cells. Paternal chromosome elimination in Sciara was the first observation of imprinting, though the mechanism remains a mystery. Here, we present the first draft genome sequence for Sciara coprophila to take a large step forward in addressing these features. RESULTS We assembled the Sciara genome using PacBio, Nanopore, and Illumina sequencing. To find an optimal assembly using these datasets, we generated 44 short-read and 50 long-read assemblies. We ranked assemblies using 27 metrics assessing contiguity, gene content, and dataset concordance. The highest-ranking assemblies were scaffolded using BioNano optical maps. RNA-seq datasets from multiple life stages and both sexes facilitated genome annotation. A set of 66 metrics was used to select the first draft assembly for Sciara. Nearly half of the Sciara genome sequence was anchored into chromosomes, and all scaffolds were classified as X-linked or autosomal by coverage. CONCLUSIONS We determined that X-linked genes in Sciara males undergo dosage compensation. An entire bacterial genome from the Rickettsia genus, a group known to be endosymbionts in insects, was co-assembled with the Sciara genome, opening the possibility that Rickettsia may function in sex determination in Sciara. Finally, the signal level of the PacBio and Nanopore data support the presence of cytosine and adenine modifications in the Sciara genome, consistent with a possible role in imprinting.
Affiliation(s)
- John M Urban
  - Department of Molecular Biology, Cell Biology and Biochemistry, Brown University Division of Biology and Medicine, Sidney Frank Hall for Life Sciences, 185 Meeting Street, Providence, RI, 02912, USA
  - Department of Embryology, Carnegie Institution for Science, Howard Hughes Medical Institute Research Laboratories, 3520 San Martin Drive, Baltimore, MD, 21218, USA
- Michael S Foulk
  - Department of Molecular Biology, Cell Biology and Biochemistry, Brown University Division of Biology and Medicine, Sidney Frank Hall for Life Sciences, 185 Meeting Street, Providence, RI, 02912, USA
  - Present Address: Department of Biology, Mercyhurst University, Erie, PA, 16546, USA
- Jacob E Bliss
  - Department of Molecular Biology, Cell Biology and Biochemistry, Brown University Division of Biology and Medicine, Sidney Frank Hall for Life Sciences, 185 Meeting Street, Providence, RI, 02912, USA
- C Michelle Coleman
  - KSU Bioinformatics Center, Kansas State University Division of Biology, Ackert Hall, Manhattan, Kansas, 66502, USA
- Nanyan Lu
  - KSU Bioinformatics Center, Kansas State University Division of Biology, Ackert Hall, Manhattan, Kansas, 66502, USA
- Reza Mazloom
  - KSU Bioinformatics Center, Kansas State University Division of Biology, Ackert Hall, Manhattan, Kansas, 66502, USA
- Susan J Brown
  - KSU Bioinformatics Center, Kansas State University Division of Biology, Ackert Hall, Manhattan, Kansas, 66502, USA
- Allan C Spradling
  - Department of Embryology, Carnegie Institution for Science, Howard Hughes Medical Institute Research Laboratories, 3520 San Martin Drive, Baltimore, MD, 21218, USA
- Susan A Gerbi
  - Department of Molecular Biology, Cell Biology and Biochemistry, Brown University Division of Biology and Medicine, Sidney Frank Hall for Life Sciences, 185 Meeting Street, Providence, RI, 02912, USA
|
5
|
Walve R, Puglisi SJ, Salmela L. Space-Efficient Indexing of Spaced Seeds for Accurate Overlap Computation of Raw Optical Mapping Data. IEEE/ACM Trans Comput Biol Bioinform 2021; PP:2454-2462. [PMID: 34057895 DOI: 10.1109/tcbb.2021.3085086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
A key problem in processing raw optical mapping data (Rmaps) is finding Rmaps originating from the same genomic region. These sets of related Rmaps can be used to correct errors in Rmap data, and to find overlaps between Rmaps to assemble consensus optical maps. Previous Rmap overlap aligners are computationally very expensive and do not scale to large eukaryotic data sets. We present Selkie, an Rmap overlap aligner based on a spaced (l,k)-mer index which was pioneered in the Rmap error correction tool Elmeri. Here we present a space efficient version of the index which is twice as fast as prior art while using just a quarter of the memory on a human data set. Moreover, our index can be used for filtering candidates for Rmap overlap computation, whereas Elmeri used the index only for error correction of Rmaps. By combining our filtering of Rmaps with the exhaustive, but highly accurate, algorithm of Valouev et al. (2006), Selkie maintains or increases the accuracy of finding overlapping Rmaps on a bacterial dataset while being at least four times faster. Furthermore, for finding overlaps in a human dataset, Selkie is up to two orders of magnitude faster than previous methods.
|
6
|
Mukherjee K, Rossi M, Salmela L, Boucher C. Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph. Algorithms Mol Biol 2021; 16:6. [PMID: 34034751 PMCID: PMC8147420 DOI: 10.1186/s13015-021-00182-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2021] [Accepted: 04/13/2021] [Indexed: 11/10/2022]
Abstract
Genome wide optical maps are high resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary software that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore, is unable to scale for relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as RMAPPER, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) only successfully ran on E. coli. Moreover, on the human genome RMAPPER was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, RMAPPER, is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/Rmapper.
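The "numeric representation" idea in this abstract can be illustrated with a toy in silico digest. The sketch below is a simplification under stated assumptions: the sequence is a made-up string, the motif GAATTC (the EcoRI recognition site) is used only as an example, cuts are placed at the start of each motif occurrence, and the `in_silico_digest` helper is hypothetical rather than part of RMAPPER or any other cited tool.

```python
def in_silico_digest(sequence, motif):
    """Return fragment lengths from cutting at the start of every motif occurrence."""
    cut_sites = []
    pos = sequence.find(motif)
    while pos != -1:
        cut_sites.append(pos)
        pos = sequence.find(motif, pos + 1)
    # Fragment boundaries: sequence start, each cut site, sequence end.
    boundaries = [0] + cut_sites + [len(sequence)]
    # Drop zero-length fragments (e.g. a motif at position 0).
    return [end - start for start, end in zip(boundaries, boundaries[1:]) if end > start]


toy_sequence = "AAGAATTCGGGGGAATTCTT"  # made-up 20 bp sequence
fragments = in_silico_digest(toy_sequence, "GAATTC")
print(fragments)
```

The resulting ordered list of fragment lengths is the kind of numeric signature that optical map assembly and alignment methods operate on; real Rmaps additionally carry sizing noise and missing or spurious sites.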
|
7
|
Raeisi Dehkordi S, Luebeck J, Bafna V. FaNDOM: Fast nested distance-based seeding of optical maps. Patterns (N Y) 2021; 2:100248. [PMID: 34027500 PMCID: PMC8134938 DOI: 10.1016/j.patter.2021.100248] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 01/20/2021] [Revised: 03/08/2021] [Accepted: 04/01/2021] [Indexed: 12/25/2022]
Abstract
Optical mapping (OM) provides single-molecule readouts of fluorescently labeled sequence motifs on long fragments of DNA, resolved to nucleotide-level coordinates. With the advent of microfluidic technologies for analysis of DNA molecules, it is possible to inexpensively generate long OM data (>150 kbp) at high coverage. In addition to scaffolding for de novo assembly, OM data can be aligned to a reference genome for identification of genomic structural variants. We introduce FaNDOM (Fast Nested Distance Seeding of Optical Maps), an optical map alignment tool that greatly reduces the search space of the alignment process. On four benchmark human datasets, FaNDOM was significantly (4-14×) faster than competing tools while maintaining comparable sensitivity and specificity. We used FaNDOM to map variants in three cancer cell lines and identified many biologically interesting structural variants, including deletions, duplications, gene fusions and gene-disrupting rearrangements. FaNDOM is publicly available at https://github.com/jluebeck/FaNDOM.
Affiliation(s)
- Siavash Raeisi Dehkordi
  - Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA 92093, USA
- Jens Luebeck
  - Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA 92093, USA
  - Bioinformatics & Systems Biology Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
- Vineet Bafna
  - Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA 92093, USA
|
8
|
Salmela L, Mukherjee K, Puglisi SJ, Muggli MD, Boucher C. Fast and accurate correction of optical mapping data via spaced seeds. Bioinformatics 2020; 36:682-689. [PMID: 31504206 PMCID: PMC7005598 DOI: 10.1093/bioinformatics/btz663] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/10/2019] [Revised: 07/25/2019] [Accepted: 08/30/2019] [Indexed: 11/24/2022]
Abstract
Motivation Optical mapping data is used in many core genomics applications, including structural variation detection, scaffolding assembled contigs and mis-assembly detection. However, the pervasiveness of spurious and deleted cut sites in the raw data, which are called Rmaps, makes their assembly and alignment challenging. Although there exists another method to error correct Rmap data, named cOMet, it is unable to scale to even moderately large sized genomes. The challenge faced in error correction is in determining pairs of Rmaps that originate from the same region of the same genome. Results We create an efficient method for determining pairs of Rmaps that contain significant overlaps between them. Our method relies on the novel and nontrivial adaptation and application of spaced seeds in the context of optical mapping, which allows for spurious and deleted cut sites to be accounted for. We apply our method to detecting and correcting these errors. The resulting error correction method, referred to as Elmeri, improves upon the results of state-of-the-art correction methods but in a fraction of the time. More specifically, cOMet required 9.9 CPU days to error correct Rmap data generated from the human genome, whereas Elmeri required less than 15 CPU hours and improved the quality of the Rmaps by more than four times compared to cOMet. Availability and implementation Elmeri is publicly available under GNU Affero General Public License at https://github.com/LeenaSalmela/Elmeri. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Helsinki 00100, Finland
- Kingshuk Mukherjee
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
- Simon J Puglisi
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Helsinki 00100, Finland
- Martin D Muggli
  - Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
|
9
|
Yuan Y, Chung CYL, Chan TF. Advances in optical mapping for genomic research. Comput Struct Biotechnol J 2020; 18:2051-2062. [PMID: 32802277 PMCID: PMC7419273 DOI: 10.1016/j.csbj.2020.07.018] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Received: 05/10/2020] [Revised: 07/08/2020] [Accepted: 07/24/2020] [Indexed: 12/28/2022]
Abstract
Recent advances in optical mapping have allowed the construction of improved genome assemblies with greater contiguity. Optical mapping also enables genome comparison and identification of large-scale structural variations. Association of these large-scale genomic features with biological functions is an important goal in plant and animal breeding and in medical research. Optical mapping has also been used in microbiology and still plays an important role in strain typing and epidemiological studies. Here, we review the development of optical mapping in recent decades to illustrate its importance in genomic research. We detail its applications and algorithms to show its specific advantages. Finally, we discuss the challenges required to facilitate the optimization of optical mapping and improve its future development and application.
Key Words
- 3D, three-dimensional
- DBG, de Bruijn graph
- DLS, direct label and stain
- DNA, deoxyribonucleic acid
- Genome assembly
- Hi-C, high-throughput chromosome conformation capture
- Mb, million base pair
- Next generation sequencing
- OLC, overlap-layout-consensus
- Optical mapping
- PCR, polymerase chain reaction
- PacBio, Pacific Biosciences
- SRS, short-read sequencing
- SV, structural variation
- Structural variation
- bp, base pair
- kb, kilobase pair
Affiliation(s)
- Yuxuan Yuan
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
  - AoE Centre for Genomic Studies on Plant-Environment Interaction for Sustainable Agriculture and Food Security, The Chinese University of Hong Kong, Hong Kong SAR, China
- Claire Yik-Lok Chung
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
- Ting-Fung Chan
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
  - AoE Centre for Genomic Studies on Plant-Environment Interaction for Sustainable Agriculture and Food Security, The Chinese University of Hong Kong, Hong Kong SAR, China
|
10
|
Abstract
BACKGROUND The long reads produced by third generation sequencing technologies have significantly boosted the results of genome assembly but still, genome-wide assemblies solely based on read data cannot be produced. Thus, for example, optical mapping data has been used to further improve genome assemblies but it has mostly been applied in a post-processing stage after contig assembly. RESULTS We propose OPTICALKERMIT which directly integrates genome wide optical maps into contig assembly. We show how genome wide optical maps can be used to localize reads on the genome and then we adapt the Kermit method, which originally incorporated genetic linkage maps to the miniasm assembler, to use this information in contig assembly. Our experimental results show that incorporating genome wide optical maps to the contig assembly of miniasm increases NGA50 while the number of misassemblies decreases or stays the same. Furthermore, when compared to the Canu assembler, OPTICALKERMIT produces an assembly with almost three times higher NGA50 with a lower number of misassemblies on real A. thaliana reads. CONCLUSIONS OPTICALKERMIT successfully incorporates optical mapping data directly to contig assembly of eukaryotic genomes. Our results show that this is a promising approach to improve the contiguity of genome assemblies.
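Since this abstract reports results in terms of NGA50, a short sketch of the underlying calculation may help. It assumes the common definition used by assembly evaluation tools: NG50 is the length of the block at which the cumulative length of blocks, taken from longest to shortest, first reaches half of the reference genome size, and NGA50 is the same statistic computed on aligned blocks after contigs are split at misassemblies. The `ng50` helper and all lengths below are hypothetical and are not taken from the paper.

```python
def ng50(lengths, genome_size):
    """Length of the block at which the running total of blocks, longest first,
    first reaches half of genome_size; None if it is never reached."""
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= genome_size:
            return length
    return None


GENOME_SIZE = 200  # hypothetical reference size

# Hypothetical contig lengths from an assembly.
contigs = [80, 40, 30]

# Same assembly after splitting the 80-unit contig at one misassembly
# into aligned blocks of 45 and 35.
aligned_blocks = [45, 35, 40, 30]

ng50_value = ng50(contigs, GENOME_SIZE)
nga50_value = ng50(aligned_blocks, GENOME_SIZE)
print(ng50_value, nga50_value)
```

In this toy case the misassembled contig inflates NG50 relative to NGA50, which is why NGA50 together with the misassembly count is a more conservative indicator of contiguity.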
Affiliation(s)
- Miika Leinonen
  - Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Pietari Kalmin katu 5, Helsinki, Finland
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Pietari Kalmin katu 5, Helsinki, Finland
|
11
|
Mukherjee K, Alipanahi B, Kahveci T, Salmela L, Boucher C. Aligning optical maps to de Bruijn graphs. Bioinformatics 2020; 35:3250-3256. [PMID: 30698651 DOI: 10.1093/bioinformatics/btz069] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 09/05/2018] [Revised: 12/31/2018] [Accepted: 01/25/2019] [Indexed: 11/14/2022]
Abstract
MOTIVATION Optical maps are high-resolution restriction maps (Rmaps) that give a unique numeric representation to a genome. Used in concert with sequence reads, they provide a useful tool for genome assembly and for discovering structural variations and rearrangements. Although they have been a regular feature of modern genome assembly projects, optical maps have been mainly used in a post-processing step and not in the genome assembly process itself. Several methods have been proposed for pairwise alignment of single molecule optical maps (called Rmaps), or for aligning optical maps to assembled reads. However, the problem of aligning an Rmap to a graph representing the sequence data of the same genome has not been studied before. Such an alignment provides a mapping between two sets of data, optical maps and sequence data, which will facilitate the usage of optical maps in the sequence assembly step itself. RESULTS We define the problem of aligning an Rmap to a de Bruijn graph and present the first algorithm for solving this problem, which is based on a seed-and-extend approach. We demonstrate that our method is capable of aligning 73% of Rmaps generated from the Escherichia coli genome to the de Bruijn graph constructed from short reads generated from the same genome. We validate the alignments and show that our method achieves an accuracy of 99.6%. We also show that our method scales to larger genomes. In particular, we show that 76% of Rmaps can be aligned to the de Bruijn graph in the case of human data. AVAILABILITY AND IMPLEMENTATION The software for aligning optical maps to de Bruijn graphs, omGraph, is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/omGraph. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kingshuk Mukherjee
  - Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, USA
- Bahar Alipanahi
  - Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, USA
- Tamer Kahveci
  - Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, USA
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland
- Christina Boucher
  - Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, USA
|
12
|
Bouwens A, Deen J, Vitale R, D’Huys L, Goyvaerts V, Descloux A, Borrenberghs D, Grussmayer K, Lukes T, Camacho R, Su J, Ruckebusch C, Lasser T, Van De Ville D, Hofkens J, Radenovic A, Frans Janssen KP. Identifying microbial species by single-molecule DNA optical mapping and resampling statistics. NAR Genom Bioinform 2020; 2:lqz007. [PMID: 33575560 PMCID: PMC7671359 DOI: 10.1093/nargab/lqz007] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/30/2019] [Accepted: 09/12/2019] [Indexed: 12/13/2022]
Abstract
Single-molecule DNA mapping has the potential to serve as a powerful complement to high-throughput sequencing in metagenomic analysis. Offering longer read lengths and forgoing the need for complex library preparation and amplification, mapping stands to provide an unbiased view into the composition of complex viromes and/or microbiomes. To fully enable mapping-based metagenomics, sensitivity and specificity of DNA map analysis and identification need to be improved. Using detailed simulations and experimental data, we first demonstrate how fluorescence imaging of surface stretched, sequence specifically labeled DNA fragments can yield highly sensitive identification of targets. Second, a new analysis technique is introduced to increase specificity of the analysis, allowing even closely related species to be resolved. Third, we show how an increase in resolution improves sensitivity. Finally, we demonstrate that these methods are capable of identifying species with long genomes such as bacteria with high sensitivity.
Affiliation(s)
- Arno Bouwens
  - Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
  - Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- Jochem Deen
  - Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
  - School of Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- Raffaele Vitale
  - Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
  - LASIR CNRS, Université de Lille, 59655 Villeneuve d’Ascq, France
- Laurens D’Huys
  - Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
- Vince Goyvaerts
  - Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
- Adrien Descloux
  - Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
  - School of Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- Kristin Grussmayer
  - Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
  - School of Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- Tomas Lukes
  - School of Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- Rafael Camacho
  - Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
- Jia Su
  - Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
- Cyril Ruckebusch
  - LASIR CNRS, Université de Lille, 59655 Villeneuve d’Ascq, France
- Theo Lasser
  - School of Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- Dimitri Van De Ville
  - Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
  - Center for Neuroprosthetics, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
  - Department of Radiology and Medical Informatics, Université de Genève, 1205 Genève, Switzerland
- Johan Hofkens
  - Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
- Aleksandra Radenovic
  - Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
  - School of Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
| | | |
|
13
|
Muggli MD, Puglisi SJ, Boucher C. Kohdista: an efficient method to index and query possible Rmap alignments. Algorithms Mol Biol 2019; 14:25. [PMID: 31867049 PMCID: PMC6907254 DOI: 10.1186/s13015-019-0160-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/01/2018] [Accepted: 11/19/2019] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Genome-wide optical maps are ordered high-resolution restriction maps that give the position of occurrence of restriction cut sites corresponding to one or more restriction enzymes. These genome-wide optical maps are assembled using an overlap-layout-consensus approach using raw optical map data, which are referred to as Rmaps. Due to the high error-rate of Rmap data, finding the overlap between Rmaps remains challenging. RESULTS We present Kohdista, which is an index-based algorithm for finding pairwise alignments between single molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching, and the application of modern index-based data structures. In particular, we combine the use of the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions. CONCLUSION We demonstrate Kohdista is the only method that is capable of finding a significant number of high quality pairwise Rmap alignments for large eukaryote organisms in reasonable time.
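As a rough illustration of the data this abstract describes, an optical map can be held as an ordered list of cut-site positions and an Rmap as the list of fragment sizes between consecutive cuts. The sketch below is not Kohdista's method (it uses no automaton, GCSA index, or wavelet tree); it is a naive position-by-position comparison, and the function names, the `rel_tol` sizing tolerance, and the example values are all hypothetical.

```python
from typing import List


def fragments_from_sites(cut_sites: List[int]) -> List[int]:
    """Convert ordered cut-site positions (bp) into fragment sizes."""
    return [b - a for a, b in zip(cut_sites, cut_sites[1:])]


def fragments_match(a: List[int], b: List[int], rel_tol: float = 0.1) -> bool:
    """Naive check: same fragment count, each pair within a relative tolerance.

    rel_tol is an assumed sizing tolerance, not a value from the paper.
    """
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= rel_tol * max(x, y) for x, y in zip(a, b))


# Hypothetical reference cut sites and a noisy observed Rmap.
reference = fragments_from_sites([0, 5000, 12000, 20000, 23000])
observed = [5200, 6800, 8100, 2900]

print(reference)
print(fragments_match(reference, observed))
```

A comparison this simple fails as soon as a cut site is missed or a spurious one is added, because the fragment counts no longer agree; an alignment method for real Rmaps has to tolerate exactly those cases.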
|
14
|
Leung AKY, Liu MCJ, Li L, Lai YYY, Chu C, Kwok PY, Ho PL, Yip KY, Chan TF. OMMA enables population-scale analysis of complex genomic features and phylogenomic relationships from nanochannel-based optical maps. Gigascience 2019; 8:giz079. [PMID: 31289833 PMCID: PMC6615982 DOI: 10.1093/gigascience/giz079] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 08/05/2018] [Revised: 01/13/2019] [Accepted: 06/16/2019] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND Optical mapping is an emerging technology that complements sequencing-based methods in genome analysis. It is widely used in improving genome assemblies and detecting structural variations by providing information over much longer (up to 1 Mb) reads. Current standards in optical mapping analysis involve assembling optical maps into contigs and aligning them to a reference, which is limited to pairwise comparison and becomes bias-prone when analyzing multiple samples. FINDINGS We present a new method, OMMA, that extends optical mapping to the study of complex genomic features by simultaneously interrogating optical maps across many samples in a reference-independent manner. OMMA captures and characterizes complex genomic features, e.g., multiple haplotypes, copy number variations, and subtelomeric structures, when applied to 154 human samples across the 26 populations sequenced in the 1000 Genomes Project. For small genomes such as those of pathogenic bacteria, OMMA accurately reconstructs the phylogenomic relationships and identifies functional elements across 21 Acinetobacter baumannii strains. CONCLUSIONS With the increasing data throughput of optical mapping systems, the use of this technology in comparative genome analysis across many samples will become feasible. OMMA is a timely solution that can address this computational need. The OMMA software is available at https://github.com/TF-Chan-Lab/OMTools.
Affiliation(s)
- Melissa Chun-Jiao Liu
  - Carol Yu Center for Infection and Department of Microbiology, The University of Hong Kong, Queen Mary Hospital, Pok Fu Lam, Hong Kong
- Le Li
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
- Yvonne Yuk-Yin Lai
  - Cardiovascular Research Institute, University of California, San Francisco, CA 94153, USA
  - Institute of Human Genetics, University of California, San Francisco, CA 94153, USA
- Catherine Chu
  - Cardiovascular Research Institute, University of California, San Francisco, CA 94153, USA
  - Institute of Human Genetics, University of California, San Francisco, CA 94153, USA
- Pui-Yan Kwok
  - Cardiovascular Research Institute, University of California, San Francisco, CA 94153, USA
  - Institute of Human Genetics, University of California, San Francisco, CA 94153, USA
- Pak-Leung Ho
  - Carol Yu Center for Infection and Department of Microbiology, The University of Hong Kong, Queen Mary Hospital, Pok Fu Lam, Hong Kong
- Kevin Y Yip
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
  - Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, Hong Kong
- Ting-Fung Chan
  - School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong
  - State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong
  - Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, Hong Kong
|
15
|
Abstract
The computational reconstruction of genome sequences from shotgun sequencing data has been greatly simplified by the advent of sequencing technologies that generate long reads. In the case of relatively small genomes (e.g., bacterial or viral), complete genome sequences can frequently be reconstructed computationally without the need for further experiments. However, large and complex genomes, such as those of most animals and plants, continue to pose significant challenges. In such genomes, assembly software produces incomplete and fragmented reconstructions that require additional experimentally derived information and manual intervention in order to reconstruct individual chromosome arms. Recent technologies originally designed to capture chromatin structure have been shown to effectively complement sequencing data, leading to much more contiguous reconstructions of genomes than previously possible. Here, we survey these technologies and the algorithms used to assemble and analyze large eukaryotic genomes, placed within the historical context of genome scaffolding technologies that have been in existence since the dawn of the genomic era.
Affiliation(s)
- Jay Ghurye
  - Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- Mihai Pop
  - Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
|
16
|
Mukherjee K, Washimkar D, Muggli MD, Salmela L, Boucher C. Error correcting optical mapping data. Gigascience 2018; 7:5005021. [PMID: 29846578 PMCID: PMC6007263 DOI: 10.1093/gigascience/giy061] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 06/22/2017] [Accepted: 05/16/2018] [Indexed: 12/31/2022] Open
Abstract
Optical mapping is a unique system that is capable of producing high-resolution, high-throughput genomic map data that give information about the structure of a genome. Recently it has been used for scaffolding contigs and for assembly validation for large-scale sequencing projects, including the maize, goat, and Amborella genomes. However, a major impediment in the use of these data is the variety and quantity of errors in the raw optical mapping data, which are called Rmaps. The challenges associated with using Rmap data are analogous to dealing with insertions and deletions in the alignment of long reads. Moreover, they are arguably harder to tackle since the data are numerical and susceptible to inaccuracy. We develop cOMet to error correct Rmap data, which to the best of our knowledge is the only optical mapping error correction method. Our experimental results demonstrate that cOMet has high precision and corrects 82.49% of insertion errors and 77.38% of deletion errors in Rmap data generated from the Escherichia coli K-12 reference genome. Out of the deletion errors corrected, 98.26% are true errors. Similarly, out of the insertion errors corrected, 82.19% are true errors. It also successfully scales to large genomes, improving the quality of 78% and 99% of the Rmaps in the plum and goat genomes, respectively. Last, we show the utility of error correction by demonstrating how it improves the assembly of Rmap data. Error corrected Rmap data results in an assembly that is more contiguous and covers a larger fraction of the genome.
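To make the insertion and deletion errors mentioned in this abstract concrete, the sketch below simulates both on an Rmap stored as a list of fragment sizes. It assumes that a deletion error is a missed cut site (two adjacent fragments merge) and an insertion error is a spurious cut site (one fragment splits); the abstract does not define them, so this reading is an assumption. This is not cOMet's correction procedure, and the function names and example values are hypothetical.

```python
from typing import List


def delete_cut(fragments: List[int], i: int) -> List[int]:
    """Simulate a missed cut site: fragments i and i+1 merge into one."""
    return fragments[:i] + [fragments[i] + fragments[i + 1]] + fragments[i + 2:]


def insert_cut(fragments: List[int], i: int, offset: int) -> List[int]:
    """Simulate a spurious cut site: fragment i splits at `offset` bp."""
    if not 0 < offset < fragments[i]:
        raise ValueError("offset must fall strictly inside the fragment")
    return fragments[:i] + [offset, fragments[i] - offset] + fragments[i + 1:]


# Hypothetical error-free Rmap (fragment sizes in bp).
true_rmap = [5000, 7000, 8000, 3000]

merged = delete_cut(true_rmap, 1)       # fragments 1 and 2 become one
split = insert_cut(true_rmap, 2, 2500)  # fragment 2 becomes two

print(merged)
print(split)
```

Both error types change the number of fragments while leaving the total length unchanged, which is one reason they resemble indels in sequence alignment while acting on numerical values rather than characters.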
Affiliation(s)
- Kingshuk Mukherjee
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville
- Darshan Washimkar
  - Department of Computer Science, Colorado State University, Fort Collins
- Martin D Muggli
  - Department of Computer Science, Colorado State University, Fort Collins
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville
|