1
Kwon D, Park N, Wy S, Lee D, Chai HH, Cho IC, Lee J, Kwon K, Kim H, Moon Y, Kim J, Park W, Kim J. A chromosome-level genome assembly of the Korean crossbred pig Nanchukmacdon (Sus scrofa). Sci Data 2023; 10:761. [PMID: 37923776 PMCID: PMC10624824 DOI: 10.1038/s41597-023-02661-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/21/2023] [Accepted: 10/17/2023] [Indexed: 11/06/2023]
Abstract
As plentiful high-quality genome assemblies have been accumulated, reference-guided genome assembly can be a good approach to reconstruct a high-quality assembly. Here, we present a chromosome-level genome assembly of the Korean crossbred pig called Nanchukmacdon (the NCMD assembly) using the reference-guided assembly approach with short and long reads. The NCMD assembly contains 20 chromosome-level scaffolds with a total size of 2.38 Gbp (N50: 138.77 Mbp). Its BUSCO score is 93.1%, which is comparable to the pig reference assembly, and a total of 20,588 protein-coding genes, 8,651 non-coding genes, and 996.14 Mbp of repetitive elements are annotated. The NCMD assembly was also used to close many gaps in the pig reference assembly. This NCMD assembly and annotation provide foundational resources for the genomic analyses of pig and related species.
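The N50 figure quoted in the abstract is a standard contiguity statistic: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the NCMD assembly or its pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)   # longest scaffolds first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running sum to half of the
        # total assembly size defines the N50.
        if running >= half_total:
            return length


# Toy lengths (arbitrary units), not real assembly data.
toy_scaffolds = [50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

For the toy input the total is 150, so the cumulative sum first reaches 75 at the second-longest scaffold and the N50 is 40.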
Affiliation(s)
- Daehong Kwon
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
- Nayoung Park
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
- Suyeon Wy
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
- Daehwan Lee
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
- Han-Ha Chai
  - Animal Genomics and Bioinformatics Division, National Institute of Animal Science, RDA, Wanju, 55365, Republic of Korea
- In-Cheol Cho
  - Subtropical Livestock Research Institute, National Institute of Animal Science, RDA, Jeju, 63242, Republic of Korea
- Jongin Lee
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
- Kisang Kwon
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
- Heesun Kim
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
- Youngbeen Moon
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
- Juyeon Kim
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
- Woncheoul Park
  - Animal Genomics and Bioinformatics Division, National Institute of Animal Science, RDA, Wanju, 55365, Republic of Korea
- Jaebum Kim
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
2
Mukherjee K, Dole-Muinos D, Ajayi A, Rossi M, Prosperi M, Boucher C. Finding Overlapping Rmaps via Clustering. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; PP:1-1. [PMID: 34890332 DOI: 10.1109/tcbb.2021.3132534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Optical mapping has been largely automated, and first produces single molecule restriction maps, called Rmaps, which are assembled to generate genome wide optical maps. Since the location and orientation of each Rmap are unknown, the first problem in the analysis of this data is finding related Rmaps, i.e., pairs of Rmaps that share the same orientation and have significant overlap in their genomic location. Although heuristics for identifying related Rmaps exist, they all require quantization of the data, which leads to a loss in precision. In this paper, we propose a Gaussian mixture modelling clustering based method, which we refer to as OMclust, that finds overlapping Rmaps without quantization. Using both simulated and real datasets, we show that OMclust substantially improves the precision (from 48.3% to 73.3%) over the state-of-the-art methods while also reducing CPU time and memory consumption. Further, we integrated OMclust into the error correction methods (Elmeri and Comet) to demonstrate the increase in the performance of these methods. When OMclust was combined with Comet to error correct Rmap data generated from human DNA, it was able to error correct close to 3x more Rmaps, and reduced the CPU time by more than 35x.
3
Walve R, Puglisi SJ, Salmela L. Space-Efficient Indexing of Spaced Seeds for Accurate Overlap Computation of Raw Optical Mapping Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; PP:2454-2462. [PMID: 34057895 DOI: 10.1109/tcbb.2021.3085086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
A key problem in processing raw optical mapping data (Rmaps) is finding Rmaps originating from the same genomic region. These sets of related Rmaps can be used to correct errors in Rmap data, and to find overlaps between Rmaps to assemble consensus optical maps. Previous Rmap overlap aligners are computationally very expensive and do not scale to large eukaryotic data sets. We present Selkie, an Rmap overlap aligner based on a spaced (l,k)-mer index which was pioneered in the Rmap error correction tool Elmeri. Here we present a space efficient version of the index which is twice as fast as prior art while using just a quarter of the memory on a human data set. Moreover, our index can be used for filtering candidates for Rmap overlap computation, whereas Elmeri used the index only for error correction of Rmaps. By combining our filtering of Rmaps with the exhaustive, but highly accurate, algorithm of Valouev et al. (2006), Selkie maintains or increases the accuracy of finding overlapping Rmaps on a bacterial dataset while being at least four times faster. Furthermore, for finding overlaps in a human dataset, Selkie is up to two orders of magnitude faster than previous methods.
4
Mukherjee K, Rossi M, Salmela L, Boucher C. Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph. Algorithms Mol Biol 2021; 16:6. [PMID: 34034751 PMCID: PMC8147420 DOI: 10.1186/s13015-021-00182-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2021] [Accepted: 04/13/2021] [Indexed: 11/10/2022]
Abstract
Genome wide optical maps are high resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary software that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore is unable to scale to relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as RMAPPER, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) only successfully ran on E. coli. Moreover, on the human genome RMAPPER was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, RMAPPER, is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/Rmapper .
5
Vašinek M, Běhálek M, Gajdoš P, Fillerová R, Kriegová E. Determining Optical Mapping Errors by Simulations. Bioinformatics 2021; 37:3391-3397. [PMID: 33983386 DOI: 10.1093/bioinformatics/btab259] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2020] [Revised: 04/19/2021] [Accepted: 04/23/2021] [Indexed: 11/14/2022]
Abstract
MOTIVATION Optical mapping is a complementary technology to traditional DNA sequencing technologies, such as next-generation sequencing (NGS). It provides genome-wide, high-resolution restriction maps from single, stained molecules of DNA. It can be used to detect large and small structural variants, copy number variations, and complex rearrangements. Optical mapping is affected by different kinds of errors in comparison with traditional DNA sequencing technologies. It is important to understand the source of these errors and how they affect the obtained data. This paper proposes a novel approach to modeling errors in the data obtained from the Bionano Genomics Inc. Saphyr system with Direct Label and Stain (DLS) chemistry. Some studies have already addressed this issue for older instruments with nicking enzymes, but we are unaware of a study that addresses this new system. RESULTS The main result is a framework for studying errors in the data obtained from the Saphyr instrument with DLS chemistry. The framework's main component is a simulation that computes how major sources of errors for this instrument (a false site, a missing site, and resolution errors) affect the distribution of fragment lengths in optical maps. The simulation is parametrized by variables describing these errors, and we use a differential evolution algorithm to find the parameters that best fit the data from the instrument. The experimental results demonstrate that this approach can be used to study errors in optical mapping data analysis. AVAILABILITY Source codes supporting the presented results are available at: https://github.com/mvasinek/olgen-om-error-prediction. The data underlying this article are available on the Bionano Genomics Inc. website, at: https://bionanogenomics.com/library/datasets/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
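The three error sources named in the abstract (missing sites, false sites, and limited resolution) can be illustrated with a small forward simulation. The sketch below is not the authors' framework: the function names, the parameters (`p_missing`, `p_false_per_kb`, `resolution`), and the simple per-kilobase false-site model are all assumptions chosen for illustration, and the parameter-fitting step by differential evolution is not shown.

```python
import random


def simulate_rmap(true_sites, molecule_length, p_missing, p_false_per_kb,
                  resolution, rng):
    """Apply three hypothetical error types to sorted label positions (bp)."""
    # Missing sites: each true site is dropped with probability p_missing.
    observed = [s for s in true_sites if rng.random() >= p_missing]

    # False sites: each kilobase gains one spurious site with probability
    # p_false_per_kb, placed uniformly along the molecule (assumed model).
    n_false = sum(1 for _ in range(molecule_length // 1000)
                  if rng.random() < p_false_per_kb)
    observed += [rng.uniform(0, molecule_length) for _ in range(n_false)]
    observed.sort()

    # Resolution: sites closer than `resolution` cannot be told apart,
    # so they are merged into a single site at their midpoint.
    merged = []
    for site in observed:
        if merged and site - merged[-1] < resolution:
            merged[-1] = (merged[-1] + site) / 2
        else:
            merged.append(site)
    return merged


def fragment_lengths(sites, molecule_length):
    """Distances between consecutive sites, including both molecule ends."""
    bounds = [0] + list(sites) + [molecule_length]
    return [b - a for a, b in zip(bounds, bounds[1:])]


rng = random.Random(0)  # fixed seed so the run is repeatable
noisy = simulate_rmap([10_000, 25_000, 26_000, 60_000], 100_000,
                      p_missing=0.2, p_false_per_kb=0.01,
                      resolution=1_500, rng=rng)
print(fragment_lengths(noisy, 100_000))
```

Repeating such a simulation over many molecules and comparing the resulting fragment-length histogram with observed data is the kind of objective an optimizer could minimize, which is one plausible reading of how the fitted parameters are obtained.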
Affiliation(s)
- Michal Vašinek
  - Department of Computer Science, Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, Ostrava, 708 00, Czech Republic
- Marek Běhálek
  - Department of Computer Science, Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, Ostrava, 708 00, Czech Republic
- Petr Gajdoš
  - Department of Computer Science, Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, Ostrava, 708 00, Czech Republic
- Regina Fillerová
  - Department of Immunology, Faculty of Medicine and Dentistry, Palacky University and University Hospital, Olomouc, 779 00, Czech Republic
- Eva Kriegová
  - Department of Immunology, Faculty of Medicine and Dentistry, Palacky University and University Hospital, Olomouc, 779 00, Czech Republic
6
Chromonomer: A Tool Set for Repairing and Enhancing Assembled Genomes Through Integration of Genetic Maps and Conserved Synteny. G3-GENES GENOMES GENETICS 2020; 10:4115-4128. [PMID: 32912931 PMCID: PMC7642942 DOI: 10.1534/g3.120.401485] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Indexed: 01/21/2023]
Abstract
The pace of the sequencing and computational assembly of novel reference genomes is accelerating. Though DNA sequencing technologies and assembly software tools continue to improve, biological features of genomes such as repetitive sequence as well as molecular artifacts that often accompany sequencing library preparation can lead to fragmented or chimeric assemblies. If left uncorrected, defects like these trammel progress on understanding genome structure and function, or worse, positively mislead this research. Fortunately, integration of additional, independent streams of information, such as a marker-dense genetic map and conserved orthologous gene order from related taxa, can be used to scaffold together unlinked, disordered fragments and to restructure a reference genome where it is incorrectly joined. We present a tool set for automating these processes, one that additionally tracks any changes to the assembly and to the genetic map, and which allows the user to scrutinize these changes with the help of web-based, graphical visualizations. Chromonomer takes a user-defined reference genome, a map of genetic markers, and, optionally, conserved synteny information to construct an improved reference genome of chromosome models: a “chromonome”. We demonstrate Chromonomer’s performance on genome assemblies and genetic maps that have disparate characteristics and levels of quality.
7
Salmela L, Mukherjee K, Puglisi SJ, Muggli MD, Boucher C. Fast and accurate correction of optical mapping data via spaced seeds. Bioinformatics 2020; 36:682-689. [PMID: 31504206 PMCID: PMC7005598 DOI: 10.1093/bioinformatics/btz663] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/10/2019] [Revised: 07/25/2019] [Accepted: 08/30/2019] [Indexed: 11/24/2022]
Abstract
Motivation Optical mapping data is used in many core genomics applications, including structural variation detection, scaffolding assembled contigs and mis-assembly detection. However, the pervasiveness of spurious and deleted cut sites in the raw data, which are called Rmaps, makes their assembly and alignment challenging. Although there exists another method to error correct Rmap data, named cOMet, it is unable to scale to even moderately large genomes. The challenge faced in error correction is in determining pairs of Rmaps that originate from the same region of the same genome. Results We create an efficient method for determining pairs of Rmaps that contain significant overlaps between them. Our method relies on the novel and nontrivial adaptation and application of spaced seeds in the context of optical mapping, which allows for spurious and deleted cut sites to be accounted for. We apply our method to detecting and correcting these errors. The resulting error correction method, referred to as Elmeri, improves upon the results of state-of-the-art correction methods but in a fraction of the time. More specifically, cOMet required 9.9 CPU days to error correct Rmap data generated from the human genome, whereas Elmeri required less than 15 CPU hours and improved the quality of the Rmaps by more than four times compared to cOMet. Availability and implementation Elmeri is publicly available under GNU Affero General Public License at https://github.com/LeenaSalmela/Elmeri. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Helsinki 00100, Finland
- Kingshuk Mukherjee
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
- Simon J Puglisi
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Helsinki 00100, Finland
- Martin D Muggli
  - Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
8
Yuan Y, Chung CYL, Chan TF. Advances in optical mapping for genomic research. Comput Struct Biotechnol J 2020; 18:2051-2062. [PMID: 32802277 PMCID: PMC7419273 DOI: 10.1016/j.csbj.2020.07.018] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Received: 05/10/2020] [Revised: 07/08/2020] [Accepted: 07/24/2020] [Indexed: 12/28/2022]
Abstract
Recent advances in optical mapping have allowed the construction of improved genome assemblies with greater contiguity. Optical mapping also enables genome comparison and identification of large-scale structural variations. Association of these large-scale genomic features with biological functions is an important goal in plant and animal breeding and in medical research. Optical mapping has also been used in microbiology and still plays an important role in strain typing and epidemiological studies. Here, we review the development of optical mapping in recent decades to illustrate its importance in genomic research. We detail its applications and algorithms to show its specific advantages. Finally, we discuss the challenges required to facilitate the optimization of optical mapping and improve its future development and application.
Key Words
- 3D, three-dimensional
- DBG, de Bruijn graph
- DLS, direct label and stain
- DNA, deoxyribonucleic acid
- Genome assembly
- Hi-C, high-throughput chromosome conformation capture
- Mb, million base pair
- Next generation sequencing
- OLC, overlap-layout-consensus
- Optical mapping
- PCR, polymerase chain reaction
- PacBio, Pacific Biosciences
- SRS, short-read sequencing
- SV, structural variation
- Structural variation
- bp, base pair
- kb, kilobase pair
Affiliation(s)
- Yuxuan Yuan
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
  - AoE Centre for Genomic Studies on Plant-Environment Interaction for Sustainable Agriculture and Food Security, The Chinese University of Hong Kong, Hong Kong SAR, China
- Claire Yik-Lok Chung
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
- Ting-Fung Chan
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
  - AoE Centre for Genomic Studies on Plant-Environment Interaction for Sustainable Agriculture and Food Security, The Chinese University of Hong Kong, Hong Kong SAR, China
9
Mukherjee K, Alipanahi B, Kahveci T, Salmela L, Boucher C. Aligning optical maps to de Bruijn graphs. Bioinformatics 2020; 35:3250-3256. [PMID: 30698651 DOI: 10.1093/bioinformatics/btz069] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 09/05/2018] [Revised: 12/31/2018] [Accepted: 01/25/2019] [Indexed: 11/14/2022]
Abstract
MOTIVATION Optical maps are high-resolution restriction maps (Rmaps) that give a unique numeric representation to a genome. Used in concert with sequence reads, they provide a useful tool for genome assembly and for discovering structural variations and rearrangements. Although they have been a regular feature of modern genome assembly projects, optical maps have mainly been used in a post-processing step and not in the genome assembly process itself. Several methods have been proposed for pairwise alignment of single molecule optical maps, called Rmaps, or for aligning optical maps to assembled reads. However, the problem of aligning an Rmap to a graph representing the sequence data of the same genome has not been studied before. Such an alignment provides a mapping between two sets of data, optical maps and sequence data, which will facilitate the use of optical maps in the sequence assembly step itself. RESULTS We define the problem of aligning an Rmap to a de Bruijn graph and present the first algorithm for solving this problem, which is based on a seed-and-extend approach. We demonstrate that our method is capable of aligning 73% of Rmaps generated from the Escherichia coli genome to the de Bruijn graph constructed from short reads generated from the same genome. We validate the alignments and show that our method achieves an accuracy of 99.6%. We also show that our method scales to larger genomes. In particular, we show that 76% of Rmaps can be aligned to the de Bruijn graph in the case of human data. AVAILABILITY AND IMPLEMENTATION The software for aligning optical maps to a de Bruijn graph, omGraph, is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/omGraph. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kingshuk Mukherjee
  - Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, USA
- Bahar Alipanahi
  - Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, USA
- Tamer Kahveci
  - Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, USA
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland
- Christina Boucher
  - Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, USA
10
Mukherjee K, Washimkar D, Muggli MD, Salmela L, Boucher C. Error correcting optical mapping data. Gigascience 2018; 7:5005021. [PMID: 29846578 PMCID: PMC6007263 DOI: 10.1093/gigascience/giy061] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 06/22/2017] [Accepted: 05/16/2018] [Indexed: 12/31/2022]
Abstract
Optical mapping is a unique system that is capable of producing high-resolution, high-throughput genomic map data that gives information about the structure of a genome. Recently it has been used for scaffolding contigs and for assembly validation for large-scale sequencing projects, including the maize, goat, and Amborella genomes. However, a major impediment in the use of this data is the variety and quantity of errors in the raw optical mapping data, which are called Rmaps. The challenges associated with using Rmap data are analogous to dealing with insertions and deletions in the alignment of long reads. Moreover, they are arguably harder to tackle since the data are numerical and susceptible to inaccuracy. We develop cOMet to error correct Rmap data, which to the best of our knowledge is the only optical mapping error correction method. Our experimental results demonstrate that cOMet has high precision and corrects 82.49% of insertion errors and 77.38% of deletion errors in Rmap data generated from the Escherichia coli K-12 reference genome. Out of the deletion errors corrected, 98.26% are true errors. Similarly, out of the insertion errors corrected, 82.19% are true errors. It also successfully scales to large genomes, improving the quality of 78% and 99% of the Rmaps in the plum and goat genomes, respectively. Last, we show the utility of error correction by demonstrating how it improves the assembly of Rmap data. Error corrected Rmap data results in an assembly that is more contiguous and covers a larger fraction of the genome.
Affiliation(s)
- Kingshuk Mukherjee
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville
- Darshan Washimkar
  - Department of Computer Science, Colorado State University, Fort Collins
- Martin D Muggli
  - Department of Computer Science, Colorado State University, Fort Collins
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville