101
|
Li Q, Li H, Huang W, Xu Y, Zhou Q, Wang S, Ruan J, Huang S, Zhang Z. A chromosome-scale genome assembly of cucumber (Cucumis sativus L.). Gigascience 2019; 8:giz072. [PMID: 31216035 PMCID: PMC6582320 DOI: 10.1093/gigascience/giz072] [Citation(s) in RCA: 116] [Impact Index Per Article: 19.3] [Received: 12/15/2018] [Revised: 03/13/2019] [Accepted: 05/23/2019] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Accurate and complete reference genome assemblies are fundamental for biological research. Cucumber is an important vegetable crop and model system for sex determination and vascular biology. Low-coverage Sanger sequences and high-coverage short Illumina sequences have been used to assemble draft cucumber genomes, but the incompleteness and low quality of these genomes limit their use in comparative genomics and genetic research. A high-quality and complete cucumber genome assembly is therefore essential. FINDINGS We assembled single-molecule real-time (SMRT) long reads to generate an improved cucumber reference genome. This version contains 174 contigs with a total length of 226.2 Mb and an N50 of 8.9 Mb, and provides 29.0 Mb more sequence data than previous versions. Using 10X Genomics and high-throughput chromosome conformation capture (Hi-C) data, 89 contigs (∼211.0 Mb) were directly linked into 7 pseudo-chromosome sequences. The newly assembled regions show much higher guanine-cytosine or adenine-thymine content than found previously, which is likely to have been inaccessible to Illumina sequencing. The new assembly contains 1,374 full-length long terminal retrotransposons and 1,078 novel genes including 239 tandemly duplicated genes. For example, we found 4 tandemly duplicated tyrosylprotein sulfotransferases, in contrast to the single copy of the gene found previously and in most other plants. CONCLUSION This high-quality genome presents novel features of the cucumber genome and will serve as a valuable resource for genetic research in cucumber and plant comparative genomics.
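The contig N50 quoted in this abstract (8.9 Mb) is a standard contiguity statistic: the length L such that contigs of length L or longer together cover at least half of the assembly. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative assumptions, not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    account for at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example (hypothetical lengths): total is 20, and the two longest
# contigs (6 + 5 = 11) are the first to cover at least half of it.
print(n50([2, 3, 4, 5, 6]))
```

NG50, which appears in some of the entries below, is the same calculation with an estimated genome size substituted for the assembly total.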
Affiliation(s)
- Qing Li
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, No.12, Haidian District, Beijing 100081, China
| | - Hongbo Li
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, No.12, Haidian District, Beijing 100081, China
| | - Wu Huang
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, No.12, Haidian District, Beijing 100081, China
- Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 7, Pengfei Road, Dapeng District, Shenzhen 518124, China
| | - Yuanchao Xu
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, No.12, Haidian District, Beijing 100081, China
| | - Qian Zhou
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, No.12, Haidian District, Beijing 100081, China
- Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 7, Pengfei Road, Dapeng District, Shenzhen 518124, China
| | - Shenhao Wang
- College of Horticulture, Northwest A&F University, Yangling, Shaanxi 712100, China
| | - Jue Ruan
- Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 7, Pengfei Road, Dapeng District, Shenzhen 518124, China
| | - Sanwen Huang
- Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 7, Pengfei Road, Dapeng District, Shenzhen 518124, China
| | - Zhonghua Zhang
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, No.12, Haidian District, Beijing 100081, China
|
102
|
Abstract
The computational reconstruction of genome sequences from shotgun sequencing data has been greatly simplified by the advent of sequencing technologies that generate long reads. In the case of relatively small genomes (e.g., bacterial or viral), complete genome sequences can frequently be reconstructed computationally without the need for further experiments. However, large and complex genomes, such as those of most animals and plants, continue to pose significant challenges. In such genomes, assembly software produces incomplete and fragmented reconstructions that require additional experimentally derived information and manual intervention in order to reconstruct individual chromosome arms. Recent technologies originally designed to capture chromatin structure have been shown to effectively complement sequencing data, leading to much more contiguous reconstructions of genomes than previously possible. Here, we survey these technologies and the algorithms used to assemble and analyze large eukaryotic genomes, placed within the historical context of genome scaffolding technologies that have been in existence since the dawn of the genomic era.
Affiliation(s)
- Jay Ghurye
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
| | - Mihai Pop
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
|
103
|
Genome assembly of a tropical maize inbred line provides insights into structural variation and crop improvement. Nat Genet 2019; 51:1052-1059. [DOI: 10.1038/s41588-019-0427-6] [Citation(s) in RCA: 171] [Impact Index Per Article: 28.5] [Received: 12/07/2018] [Accepted: 04/25/2019] [Indexed: 01/15/2023]
|
104
|
Wallberg A, Bunikis I, Pettersson OV, Mosbech MB, Childers AK, Evans JD, Mikheyev AS, Robertson HM, Robinson GE, Webster MT. A hybrid de novo genome assembly of the honeybee, Apis mellifera, with chromosome-length scaffolds. BMC Genomics 2019; 20:275. [PMID: 30961563 PMCID: PMC6454739 DOI: 10.1186/s12864-019-5642-0] [Citation(s) in RCA: 152] [Impact Index Per Article: 25.3] [Received: 07/27/2018] [Accepted: 03/24/2019] [Indexed: 01/27/2023] Open
Abstract
Background The ability to generate long sequencing reads and access long-range linkage information is revolutionizing the quality and completeness of genome assemblies. Here we use a hybrid approach that combines data from four genome sequencing and mapping technologies to generate a new genome assembly of the honeybee Apis mellifera. We first generated contigs based on PacBio sequencing libraries, which were then merged with linked-read 10x Chromium data followed by scaffolding using a BioNano optical genome map and a Hi-C chromatin interaction map, complemented by a genetic linkage map. Results Each of the assembly steps reduced the number of gaps and incorporated a substantial amount of additional sequence into scaffolds. The new assembly (Amel_HAv3) is significantly more contiguous and complete than the previous one (Amel_4.5), based mainly on Sanger sequencing reads. N50 of contigs is 120-fold higher (5.381 Mbp compared to 0.053 Mbp) and we anchor > 98% of the sequence to chromosomes. All of the 16 chromosomes are represented as single scaffolds with an average of three sequence gaps per chromosome. The improvements are largely due to the inclusion of repetitive sequence that was unplaced in previous assemblies. In particular, our assembly is highly contiguous across centromeres and telomeres and includes hundreds of AvaI and AluI repeats associated with these features. Conclusions The improved assembly will be of utility for refining gene models, studying genome function, mapping functional genetic variation, identification of structural variants, and comparative genomics.
Affiliation(s)
- Andreas Wallberg
- Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Ignas Bunikis
- Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Olga Vinnere Pettersson
- Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Mai-Britt Mosbech
- Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Anna K Childers
- USDA-ARS Insect Genetics and Biochemistry Research Unit, Fargo, ND, USA
- USDA-ARS Bee Research Lab, Beltsville, MD, USA
| | - Jay D Evans
- USDA-ARS Bee Research Lab, Beltsville, MD, USA
| | | | - Hugh M Robertson
- Department of Entomology and Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Gene E Robinson
- Department of Entomology and Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Matthew T Webster
- Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden.
|
105
|
Limasset A, Flot JF, Peterlongo P. Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs. Bioinformatics 2019; 36:1374-1381. [DOI: 10.1093/bioinformatics/btz102] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 07/23/2018] [Revised: 01/07/2019] [Accepted: 02/18/2019] [Indexed: 12/25/2022] Open
Abstract
Motivation
Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large datasets or consider reads as mere suites of k-mers, without taking into account their full-length sequence information.
Results
We propose a new method to correct short reads using de Bruijn graphs and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance then of unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond.
Availability and implementation
The implementation is open source, available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package.
Supplementary information
Supplementary data are available at Bioinformatics online.
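The abstract describes filtering a de Bruijn graph by k-mer abundance so that most sequencing errors are removed before reads are mapped back. The sketch below illustrates only that first idea, keeping k-mers seen at least a threshold number of times; it is not Bcool's implementation. It builds no compacted graph, applies no unitig-abundance filter, ignores reverse complements, and the names `count_kmers`, `solid_kmers`, and `min_abundance` are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(reads, k, min_abundance):
    """Return the set of k-mers seen at least min_abundance times.

    K-mers created by a sequencing error tend to be rare, so a simple
    abundance threshold discards most of them.
    """
    counts = count_kmers(reads, k)
    return {kmer for kmer, count in counts.items() if count >= min_abundance}


# Hypothetical toy reads: the third read carries one substitution (T -> A).
reads = ["ACGTAC", "ACGTAC", "ACGAAC"]
print(sorted(solid_kmers(reads, k=3, min_abundance=2)))
```

With these toy reads, the k-mers introduced by the substitution occur only once and fall below the threshold, while the k-mers shared by the two error-free reads are kept.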
Affiliation(s)
- Antoine Limasset
- Evolutionary Biology & Ecology, Université Libre de Bruxelles (ULB), Bruxelles, Belgium
| | - Jean-François Flot
- Evolutionary Biology & Ecology, Université Libre de Bruxelles (ULB), Bruxelles, Belgium
- Interuniversity Institute of Bioinformatics in Brussels – (IB)², Brussels, Belgium
|
106
|
Xu CQ, Liu H, Zhou SS, Zhang DX, Zhao W, Wang S, Chen F, Sun YQ, Nie S, Jia KH, Jiao SQ, Zhang RG, Yun QZ, Guan W, Wang X, Gao Q, Bennetzen JL, Maghuly F, Porth I, Van de Peer Y, Wang XR, Ma Y, Mao JF. Genome sequence of Malania oleifera, a tree with great value for nervonic acid production. Gigascience 2019; 8:giy164. [PMID: 30689848 PMCID: PMC6377399 DOI: 10.1093/gigascience/giy164] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 08/11/2018] [Revised: 11/12/2018] [Accepted: 12/17/2018] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND Malania oleifera, a member of the Olacaceae family, is an IUCN red listed tree, endemic and restricted to the Karst region of southwest China. This tree's seed is valued for its high content of precious fatty acids (especially nervonic acid). However, studies on its genetic makeup and fatty acid biogenesis are severely hampered by a lack of molecular and genetic tools. FINDINGS We generated 51 Gb and 135 Gb of raw DNA sequences, using Pacific Biosciences (PacBio) single-molecule real-time and 10× Genomics sequencing, respectively. A final genome assembly, with a scaffold N50 size of 4.65 Mb and a total length of 1.51 Gb, was obtained by primary assembly based on PacBio long reads plus scaffolding with 10× Genomics reads. Identified repeats constituted ∼82% of the genome, and 24,064 protein-coding genes were predicted with high support. The genome has low heterozygosity and shows no evidence for recent whole genome duplication. Metabolic pathway genes relating to the accumulation of long-chain fatty acid were identified and studied in detail. CONCLUSIONS Here, we provide the first genome assembly and gene annotation for M. oleifera. The availability of these resources will be of great importance for conservation biology and for the functional genomics of nervonic acid biosynthesis.
Affiliation(s)
- Chao-Qun Xu
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, National Engineering Laboratory for Tree Breeding, School of Nature Conservation, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, 100083, China
| | - Hui Liu
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, National Engineering Laboratory for Tree Breeding, School of Nature Conservation, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, 100083, China
| | - Shan-Shan Zhou
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, National Engineering Laboratory for Tree Breeding, School of Nature Conservation, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, 100083, China
| | - Dong-Xu Zhang
- College of Life Science, Datong University, Datong, Shanxi, 037009, China
| | - Wei Zhao
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, National Engineering Laboratory for Tree Breeding, School of Nature Conservation, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, 100083, China
| | - Sihai Wang
- Yunnan Key Laboratory of Forest Plant Cultivation and Utilization, State Forestry Administration Key Laboratory of Yunnan Rare and Endangered Species Conservation and Propagation, Yunnan Academy of Forestry, Kunming, Yunnan, 650201, China
| | - Fu Chen
- The Camellia Institute, Yunnan Academy of Forestry, Guangnan, Yunnan, 663300, China
| | - Yan-Qiang Sun
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, National Engineering Laboratory for Tree Breeding, School of Nature Conservation, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, 100083, China
| | - Shuai Nie
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, National Engineering Laboratory for Tree Breeding, School of Nature Conservation, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, 100083, China
| | - Kai-Hua Jia
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, National Engineering Laboratory for Tree Breeding, School of Nature Conservation, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, 100083, China
| | - Si-Qian Jiao
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, National Engineering Laboratory for Tree Breeding, School of Nature Conservation, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, 100083, China
| | - Ren-Gang Zhang
- Beijing Ori-Gene Science and Technology Co. Ltd, Beijing, 102206, China
| | - Quan-Zheng Yun
- Beijing Ori-Gene Science and Technology Co. Ltd, Beijing, 102206, China
| | - Wenbin Guan
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, National Engineering Laboratory for Tree Breeding, School of Nature Conservation, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, 100083, China
| | - Xuewen Wang
- The Camellia Institute, Yunnan Academy of Forestry, Guangnan, Yunnan, 663300, China
- Department of Genetics, University of Georgia, Athens, GA 30602, USA
| | - Qiong Gao
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, National Engineering Laboratory for Tree Breeding, School of Nature Conservation, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, 100083, China
| | - Jeffrey L Bennetzen
- The Camellia Institute, Yunnan Academy of Forestry, Guangnan, Yunnan, 663300, China
- Department of Genetics, University of Georgia, Athens, GA 30602, USA
| | - Fatemeh Maghuly
- Plant Biotechnology Unit (PBU), Dept. Biotechnology, BOKU-VIBT, University of Natural Resources and Life Sciences, Muthgasse 18, Vienna 1190, Austria
| | - Ilga Porth
- Département des sciences du bois et de la forêt, 1030, Avenue de la Médecine, Université Laval, Québec (Québec) G1V 0A6, Canada
- Institute for System and Integrated Biology, Pavillon Charles-Eugène-Marchand, 1030, Avenue de la Médecine, Université Laval, Québec (Québec) G1V 0A6, Canada
- Centre d'Étude de la Forêt, 1030, Avenue de la Médecine, Université Laval, Québec (Québec) G1V 0A6, Canada
| | - Yves Van de Peer
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent 9052, Belgium
- VIB Center for Plant Systems Biology, Ghent 9052, Belgium
- Centre for Microbial Ecology and Genomics, Department of Biochemistry, Genetics and Microbiology Genetics, University of Pretoria, Private bag X20, Pretoria 0028, South Africa
| | - Xiao-Ru Wang
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, National Engineering Laboratory for Tree Breeding, School of Nature Conservation, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, 100083, China
- Department of Ecology and Environmental Science, UPSC, Umeå University, Umeå SE-901 87, Sweden
| | - Yongpeng Ma
- Yunnan Key Laboratory for Integrative Conservation of Plant Species with Extremely Small Population, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
| | - Jian-Feng Mao
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, National Engineering Laboratory for Tree Breeding, School of Nature Conservation, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, 100083, China
|
107
|
Abstract
Advances in long read and long range sequencing technologies have enabled chromosome length resolution for de novo genome assemblies even in the absence of complementary resources such as physical maps. Herein, I introduce a few methods for quality control and discuss potential pitfalls when assembling insect genomes with long reads.
Affiliation(s)
- Surya Saha
- Sol Genomics Network, Boyce Thompson Institute, Ithaca, NY, USA.
|
108
|
The Genome of the North American Brown Bear or Grizzly: Ursus arctos ssp. horribilis. Genes (Basel) 2018; 9:genes9120598. [PMID: 30513700 PMCID: PMC6315469 DOI: 10.3390/genes9120598] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 10/29/2018] [Revised: 11/23/2018] [Accepted: 11/28/2018] [Indexed: 11/17/2022] Open
Abstract
The grizzly bear (Ursus arctos ssp. horribilis) represents the largest population of brown bears in North America. Its genome was sequenced using a microfluidic partitioning library construction technique, and these data were supplemented with sequencing from a nanopore-based long read platform. The final assembly was 2.33 Gb with a scaffold N50 of 36.7 Mb, and the genome is of comparable size to that of its close relative the polar bear (2.30 Gb). An analysis using 4104 highly conserved mammalian genes indicated that 96.1% were found to be complete within the assembly. An automated annotation of the genome identified 19,848 protein coding genes. Our study shows that the combination of the two sequencing modalities that we used is sufficient for the construction of highly contiguous reference quality mammalian genomes. The assembled genome sequence and the supporting raw sequence reads are available from the NCBI (National Center for Biotechnology Information) under the bioproject identifier PRJNA493656, and the assembly described in this paper is version QXTK01000000.
|
109
|
Lopez JV, Kamel B, Medina M, Collins T, Baums IB. Multiple Facets of Marine Invertebrate Conservation Genomics. Annu Rev Anim Biosci 2018; 7:473-497. [PMID: 30485758 DOI: 10.1146/annurev-animal-020518-115034] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/09/2022]
Abstract
Conservation genomics aims to preserve the viability of populations and the biodiversity of living organisms. Invertebrate organisms represent 95% of animal biodiversity; however, few genomic resources currently exist for the group. The subset of marine invertebrates includes the most ancient metazoan lineages and possesses codes for unique gene products and possible keys to adaptation. The benefits of supporting invertebrate conservation genomics research (e.g., likely discovery of novel genes, protein regulatory mechanisms, genomic innovations, and transposable elements) outweigh the various hurdles (rare, small, or polymorphic starting materials). Here we review best conservation genomics practices in the laboratory and in silico when applied to marine invertebrates and also showcase unique features in several case studies of acroporid corals, crown-of-thorns starfish, apple snails, and abalone. Marine conservation genomics should also address how diversity can lead to unique marine innovations, the impact of deleterious variation, and how genomic monitoring and profiling could positively affect broader conservation goals (e.g., value of baseline data for in situ/ex situ genomic stocks).
Affiliation(s)
- Jose V Lopez
- Department of Biological Sciences, Halmos College of Natural Sciences and Oceanography, Nova Southeastern University, Dania Beach, Florida 33004, USA
| | - Bishoy Kamel
- Department of Biology, Center for Evolutionary and Theoretical Immunology, University of New Mexico, Albuquerque, New Mexico 87131, USA
| | - Mónica Medina
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Timothy Collins
- Department of Biological Sciences, Florida International University, Miami, Florida 33199, USA
| | - Iliana B Baums
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
|
110
|
Tigmint: correcting assembly errors using linked reads from large molecules. BMC Bioinformatics 2018; 19:393. [PMID: 30367597 PMCID: PMC6204047 DOI: 10.1186/s12859-018-2425-6] [Citation(s) in RCA: 75] [Impact Index Per Article: 10.7] [Received: 06/08/2018] [Accepted: 10/09/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity. As a result, assembly errors are common. In the absence of a reference genome, these misassemblies may be identified by comparing the sequencing data to the assembly and looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembled sequence. Although tools exist to identify and correct misassemblies using Illumina paired-end and mate-pair sequencing, no such tool yet exists that makes use of the long distance information of the large molecules provided by linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint to address this gap. RESULTS To demonstrate the effectiveness of Tigmint, we applied it to assemblies of a human genome using short reads assembled with ABySS 2.0 and other assemblers. Tigmint reduced the number of misassemblies identified by QUAST in the ABySS assembly by 216 (27%). While scaffolding with ARCS alone more than doubled the scaffold NGA50 of the assembly from 3 to 8 Mbp, the combination of Tigmint and ARCS improved the scaffold NGA50 of the assembly over five-fold to 16.4 Mbp. This notable improvement in contiguity highlights the utility of assembly correction in refining assemblies. We demonstrate the utility of Tigmint in correcting the assemblies of multiple tools, as well as in using Chromium reads to correct and scaffold assemblies of long single-molecule sequencing. CONCLUSIONS Scaffolding an assembly that has been corrected with Tigmint yields a final assembly that is both more correct and substantially more contiguous than an assembly that has not been corrected. 
Using single-molecule sequencing in combination with linked reads enables a genome sequence assembly that achieves both a high sequence contiguity as well as high scaffold contiguity, a feat not currently achievable with either technology alone.
|
111
|
Numanagić I, Gökkaya AS, Zhang L, Berger B, Alkan C, Hach F. Fast characterization of segmental duplications in genome assemblies. Bioinformatics 2018; 34:i706-i714. [PMID: 30423092 PMCID: PMC6129265 DOI: 10.1093/bioinformatics/bty586] [Citation(s) in RCA: 52] [Impact Index Per Article: 7.4] [Indexed: 01/07/2023] Open
Abstract
Motivation Segmental duplications (SDs) or low-copy repeats, are segments of DNA > 1 Kbp with high sequence identity that are copied to other regions of the genome. SDs are among the most important sources of evolution, a common cause of genomic structural variation and several are associated with diseases of genomic origin including schizophrenia and autism. Despite their functional importance, SDs present one of the major hurdles for de novo genome assembly due to the ambiguity they cause in building and traversing both state-of-the-art overlap-layout-consensus and de Bruijn graphs. This causes SD regions to be misassembled, collapsed into a unique representation, or completely missing from assembled reference genomes for various organisms. In turn, this missing or incorrect information limits our ability to fully understand the evolution and the architecture of the genomes. Despite the essential need to accurately characterize SDs in assemblies, there has been only one tool that was developed for this purpose, called Whole-Genome Assembly Comparison (WGAC); its primary goal is SD detection. WGAC is comprised of several steps that employ different tools and custom scripts, which makes this strategy difficult and time consuming to use. Thus there is still a need for algorithms to characterize within-assembly SDs quickly, accurately, and in a user friendly manner. Results Here we introduce SEgmental Duplication Evaluation Framework (SEDEF) to rapidly detect SDs through sophisticated filtering strategies based on Jaccard similarity and local chaining. We show that SEDEF accurately detects SDs while maintaining substantial speed up over WGAC that translates into practical run times of minutes instead of weeks. Notably, our algorithm captures up to 25% 'pairwise error' between segments, whereas previous studies focused on only 10%, allowing us to more deeply track the evolutionary history of the genome. 
Availability and implementation SEDEF is available at https://github.com/vpc-ccg/sedef.
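The abstract mentions filtering candidate segment pairs by Jaccard similarity. As a rough illustration of that measure only, and not SEDEF's actual filtering or chaining algorithm, the sketch below computes the Jaccard index between the k-mer sets of two sequences; the helper names `kmer_set` and `jaccard` and the toy sequences are assumptions.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a, seq_b, k):
    """Jaccard index |A & B| / |A | B| of the two k-mer sets.

    Returns 0.0 when neither sequence is long enough to yield a k-mer.
    """
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Toy example (hypothetical sequences): the 2-mer sets {AC, CG, GT} and
# {AC, CG, GA} share 2 of 4 distinct 2-mers.
print(jaccard("ACGT", "ACGA", k=2))
```

Tools working at genome scale typically estimate this quantity from sampled k-mers rather than from full sets, since comparing complete k-mer sets for every candidate pair would be too slow.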
Affiliation(s)
- Ibrahim Numanagić
- Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Alim S Gökkaya
- Department of Computer Engineering, Bilkent University, Ankara, Turkey
| | - Lillian Zhang
- Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara, Turkey
| | - Faraz Hach
- Vancouver Prostate Centre, Vancouver, Canada
- Department of Urologic Sciences, University of British Columbia, Vancouver, Canada
|
112
|
Pollard MO, Gurdasani D, Mentzer AJ, Porter T, Sandhu MS. Long reads: their purpose and place. Hum Mol Genet 2018; 27:R234-R241. [PMID: 29767702 PMCID: PMC6061690 DOI: 10.1093/hmg/ddy177] [Citation(s) in RCA: 202] [Impact Index Per Article: 28.9] [Received: 05/03/2018] [Accepted: 05/08/2018] [Indexed: 12/20/2022] Open
Abstract
In recent years long-read technologies have moved from being a niche and specialist field to a point of relative maturity likely to feature frequently in the genomic landscape. Analogous to next generation sequencing, the cost of sequencing using long-read technologies has materially dropped whilst the instrument throughput continues to increase. Together these changes present the prospect of sequencing large numbers of individuals with the aim of fully characterizing genomes at high resolution. In this article, we will endeavour to present an introduction to long-read technologies showing: what long reads are; how they are distinct from short reads; why long reads are useful and how they are being used. We will highlight the recent developments in this field, and the applications and potential of these technologies in medical research, and clinical diagnostics and therapeutics.
Affiliation(s)
- Martin O Pollard
- Human Genetics - Wellcome Sanger Institute, Hinxton, Cambridge, UK
- University of Cambridge - Department of Medicine, Addenbrookes Hospital, Box 157, Hills Road, Cambridge, UK
| | - Deepti Gurdasani
- Human Genetics - Wellcome Sanger Institute, Hinxton, Cambridge, UK
- University of Cambridge - Department of Medicine, Addenbrookes Hospital, Box 157, Hills Road, Cambridge, UK
| | - Alexander J Mentzer
- Human Genetics - Wellcome Sanger Institute, Hinxton, Cambridge, UK
- Wellcome Centre for Human Genetics, Roosevelt Drive, Oxford, UK
| | - Tarryn Porter
- Human Genetics - Wellcome Sanger Institute, Hinxton, Cambridge, UK
- University of Cambridge - Department of Medicine, Addenbrookes Hospital, Box 157, Hills Road, Cambridge, UK
| | - Manjinder S Sandhu
- Human Genetics - Wellcome Sanger Institute, Hinxton, Cambridge, UK
- University of Cambridge - Department of Medicine, Addenbrookes Hospital, Box 157, Hills Road, Cambridge, UK
|
113
|
Coombe L, Zhang J, Vandervalk BP, Chu J, Jackman SD, Birol I, Warren RL. ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers. BMC Bioinformatics 2018; 19:234. [PMID: 29925315 PMCID: PMC6011487 DOI: 10.1186/s12859-018-2243-x] [Citation(s) in RCA: 53] [Impact Index Per Article: 7.6] [Received: 05/02/2018] [Accepted: 06/13/2018] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time. RESULTS Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50 = 4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly (NG50 = 14.74 Mbp vs. 25.94 Mbp before and after ARKS), which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders (~ 10.5 h compared to 75.7 h, on average). Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n = 13). CONCLUSIONS ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. 
The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run time performances on experimental human genome datasets. Furthermore, the novel distance estimator in ARKS utilizes barcoding information from linked reads to estimate gap sizes. It accomplishes this by modeling the relationship between known distances of a region within contigs and calculating associated Jaccard indices. ARKS has the potential to provide correct, chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes.
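The abstract above says the ARKS distance estimator relates known within-contig distances to Jaccard indices derived from barcode information. As a minimal sketch of the Jaccard index itself (not the ARKS implementation), the Python below compares two sets of linked-read barcodes. The function name, the barcode labels and the idea of comparing two contig ends are illustrative assumptions.

```python
def jaccard_index(barcodes_a, barcodes_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of barcodes."""
    a, b = set(barcodes_a), set(barcodes_b)
    union = a | b
    if not union:
        # Two empty sets share nothing; define the index as 0.0 here.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical barcodes observed at the end of one contig and the
# start of another (labels are made up for illustration).
end_of_contig_1 = {"BX01", "BX02", "BX03", "BX04"}
start_of_contig_2 = {"BX03", "BX04", "BX05"}

shared = jaccard_index(end_of_contig_1, start_of_contig_2)
print(f"Jaccard index: {shared:.2f}")
```

Regions that lie close together on the same molecule tend to share more barcodes, which is why a higher index is usually read as a shorter distance. How ARKS maps an index to a gap size is described in the paper, not here.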
Affiliation(s)
- Lauren Coombe
  - BC Cancer Genome Sciences Centre, Vancouver, BC V5Z 4S6 Canada
- Jessica Zhang
  - BC Cancer Genome Sciences Centre, Vancouver, BC V5Z 4S6 Canada
- Justin Chu
  - BC Cancer Genome Sciences Centre, Vancouver, BC V5Z 4S6 Canada
- Inanc Birol
  - BC Cancer Genome Sciences Centre, Vancouver, BC V5Z 4S6 Canada
- René L. Warren
  - BC Cancer Genome Sciences Centre, Vancouver, BC V5Z 4S6 Canada
114
Campbell CR, Poelstra JW, Yoder AD. What is Speciation Genomics? The roles of ecology, gene flow, and genomic architecture in the formation of species. Biol J Linn Soc Lond 2018. [DOI: 10.1093/biolinnean/bly063] [Citation(s) in RCA: 69] [Impact Index Per Article: 9.9] [Indexed: 01/24/2023]
Affiliation(s)
- J W Poelstra
  - Department of Biology, Duke University, Durham, NC, USA
- Anne D Yoder
  - Department of Biology, Duke University, Durham, NC, USA
115
Tørresen OK, Brieuc MSO, Solbakken MH, Sørhus E, Nederbragt AJ, Jakobsen KS, Meier S, Edvardsen RB, Jentoft S. Genomic architecture of haddock (Melanogrammus aeglefinus) shows expansions of innate immune genes and short tandem repeats. BMC Genomics 2018; 19:240. [PMID: 29636006 PMCID: PMC5894186 DOI: 10.1186/s12864-018-4616-y] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 08/16/2017] [Accepted: 03/22/2018] [Indexed: 02/06/2023] Open
Abstract
Background Increased availability of genome assemblies for non-model organisms has resulted in invaluable biological and genomic insight into numerous vertebrates, including teleosts. Sequencing of the Atlantic cod (Gadus morhua) genome and the genomes of many of its relatives (Gadiformes) demonstrated a shared loss of the major histocompatibility complex (MHC) II genes 100 million years ago. An improved version of the Atlantic cod genome assembly shows an extreme density of tandem repeats compared to other vertebrate genome assemblies. Highly contiguous assemblies are therefore needed to further investigate the unusual immune system of the Gadiformes, and whether the high density of tandem repeats found in Atlantic cod is a shared trait in this group. Results Here, we have sequenced and assembled the genome of haddock (Melanogrammus aeglefinus) – a relative of Atlantic cod – using a combination of PacBio and Illumina reads. Comparative analyses reveal that the haddock genome contains an even higher density of tandem repeats outside and within protein coding sequences than Atlantic cod. Further, both species show an elevated number of tandem repeats in genes mainly involved in signal transduction compared to other teleosts. A characterization of the immune gene repertoire demonstrates a substantial expansion of MHCI in Atlantic cod compared to haddock. In contrast, the Toll-like receptors show a similar pattern of gene losses and expansions. For the NOD-like receptors (NLRs), another gene family associated with the innate immune system, we find a large expansion common to all teleosts, with possible lineage-specific expansions in zebrafish, stickleback and the codfishes. Conclusions The generation of a highly contiguous genome assembly of haddock revealed that the high density of short tandem repeats as well as expanded immune gene families is not unique to Atlantic cod – but possibly a feature common to all, or most, codfishes.
A shared expansion of NLR genes in teleosts suggests that the NLRs have a more substantial role in the innate immunity of teleosts than other vertebrates. Moreover, we find that high copy number genes combined with variable genome assembly qualities may impede complete characterization of these genes, i.e. the numbers of NLRs in different teleost species might be underestimates.
Affiliation(s)
- Ole K Tørresen
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, Oslo, Norway
- Marine S O Brieuc
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, Oslo, Norway
- Monica H Solbakken
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, Oslo, Norway
- Elin Sørhus
  - Institute of Marine Research, Bergen, Norway
- Alexander J Nederbragt
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, Oslo, Norway
  - Biomedical Informatics Research Group, Department of Informatics, University of Oslo, Oslo, Norway
- Kjetill S Jakobsen
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, Oslo, Norway
- Sissel Jentoft
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, Oslo, Norway
116
Jones SJ, Haulena M, Taylor GA, Chan S, Bilobram S, Warren RL, Hammond SA, Mungall KL, Choo C, Kirk H, Pandoh P, Ally A, Dhalla N, Tam AKY, Troussard A, Paulino D, Coope RJN, Mungall AJ, Moore R, Zhao Y, Birol I, Ma Y, Marra M, Jones SJM. The Genome of the Northern Sea Otter (Enhydra lutris kenyoni). Genes (Basel) 2017; 8:genes8120379. [PMID: 29232880 PMCID: PMC5748697 DOI: 10.3390/genes8120379] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 09/12/2017] [Revised: 11/28/2017] [Accepted: 12/01/2017] [Indexed: 11/21/2022] Open
Abstract
The northern sea otter inhabits coastal waters of the northern Pacific Ocean and is the largest member of the Mustelidae family. DNA sequencing methods that utilize microfluidic partitioned and non-partitioned library construction were used to establish the sea otter genome. The final assembly provided 2.426 Gbp of highly contiguous assembled genomic sequences with a scaffold N50 length of over 38 Mbp. We generated transcriptome data derived from a lymphoma to aid in the determination of functional elements. The assembled genome sequence and underlying sequence data are available at the National Center for Biotechnology Information (NCBI) under the BioProject accession number PRJNA388419.
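The scaffold N50 quoted above is a contiguity statistic: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The Python below is a minimal sketch of that calculation. The function name and the toy lengths are illustrative assumptions and are not taken from the sea otter assembly. Passing an estimated genome size in place of the assembly total gives the related NG50 statistic.

```python
def n50(lengths, genome_size=None):
    """Return N50 of the given sequence lengths.

    If genome_size is supplied, half of that value is used as the
    target instead of half of the assembly total, which yields NG50.
    Returns 0 if the target is never reached (or the input is empty).
    """
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if running * 2 >= total:
            return length
    return 0


# Toy scaffold lengths (arbitrary units), purely for illustration.
toy_lengths = [80, 70, 50, 40, 30, 20]
print("N50:", n50(toy_lengths))
print("NG50 for an assumed genome size of 400:", n50(toy_lengths, genome_size=400))
```

Because the statistic depends only on the sorted lengths, a few very long scaffolds can raise it sharply, so it is usually reported alongside the total assembly size.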
Affiliation(s)
- Samantha J Jones
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
- Gregory A Taylor
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Simon Chan
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Steven Bilobram
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- S Austin Hammond
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Karen L Mungall
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Caleb Choo
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Heather Kirk
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Pawan Pandoh
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Adrian Ally
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Noreen Dhalla
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Angela K Y Tam
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Armelle Troussard
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Daniel Paulino
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Robin J N Coope
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Andrew J Mungall
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Richard Moore
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Yongjun Zhao
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Inanc Birol
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
- Yussanne Ma
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Marco Marra
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
- Steven J M Jones
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
  - Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC V5A 1S6, Canada.