1
|
Abstract
Presently, inference of the long-range structure of DNA templates is limited by short read lengths, and accurate template counts suffer from distortions introduced during PCR amplification. We explore the utility of introducing random mutations in identical or nearly identical templates to create distinguishable patterns that are inherited during subsequent copying. We simulate applications of this process under assumptions of error-free sequencing and perfect mapping, using cytosine deamination as a model for mutation. The simulations demonstrate that, within readily achievable conditions of nucleotide conversion and sequence coverage, we can accurately count the number of otherwise identical molecules as well as connect variants separated by long spans of identical sequence. We discuss many potential applications, such as transcript profiling, isoform assembly, haplotype phasing, and de novo genome assembly.
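The counting idea in this abstract can be illustrated with a toy simulation. The sketch below is not the authors' simulation: it assumes error-free, full-length reads (the paper works with short reads), models deamination as an independent C-to-T conversion at each cytosine, and uses hypothetical names and parameters (`mutate`, `simulate_count`, `max_copies`).

```python
import random


def mutate(template, conversion_rate, rng):
    """Convert each C to T independently with probability conversion_rate."""
    return "".join(
        "T" if base == "C" and rng.random() < conversion_rate else base
        for base in template
    )


def simulate_count(template, n_molecules, conversion_rate, max_copies=50, seed=0):
    """Return (number of reads, number of distinct mutation patterns).

    Every starting molecule gets its own random conversion pattern, and the
    pattern is inherited by all of its copies. Amplification is deliberately
    uneven: each molecule is copied between 1 and max_copies times.
    """
    rng = random.Random(seed)
    originals = [mutate(template, conversion_rate, rng) for _ in range(n_molecules)]
    reads = [pattern for pattern in originals
             for _ in range(rng.randint(1, max_copies))]
    return len(reads), len(set(reads))


# Toy template with 80 cytosines (assumed sequence, for illustration only).
template = "ACGTCCGATC" * 20

n_reads, n_distinct = simulate_count(template, n_molecules=100, conversion_rate=0.5)
print(n_reads, n_distinct)
```

With enough convertible positions, the number of distinct patterns matches the number of starting molecules even though the read count is inflated by uneven copying. With a conversion rate of zero, every molecule looks the same and the count collapses to one.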
|
2
|
Narzisi G, O'Rawe JA, Iossifov I, Fang H, Lee YH, Wang Z, Wu Y, Lyon GJ, Wigler M, Schatz MC. Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Nat Methods 2014; 11:1033-6. [PMID: 25128977 PMCID: PMC4180789 DOI: 10.1038/nmeth.3069] [Citation(s) in RCA: 153] [Impact Index Per Article: 13.9] [Received: 04/15/2014] [Accepted: 07/11/2014] [Indexed: 12/30/2022]
Abstract
We present an open-source algorithm, Scalpel, which combines mapping and assembly for sensitive and specific discovery of indels in exome-capture data. A detailed repeat analysis coupled with a self-tuning k-mer strategy allows Scalpel to outperform other state-of-the-art approaches for indel discovery, particularly in regions containing near-perfect repeats. We analyze 593 families from the Simons Simplex Collection and demonstrate Scalpel's power to detect long (≥20 bp) transmitted events, and enrichment for de novo likely gene-disrupting indels in autistic children.
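The abstract does not say how the k-mer size is tuned. One plausible reading, sketched below purely as an assumption, is to raise k until a region contains no repeated k-mer, so that a de Bruijn graph built from it has no repeat-induced ambiguity. The function names and the example sequence are hypothetical and are not taken from Scalpel.

```python
def has_repeated_kmer(sequence, k):
    """Return True if any k-mer occurs more than once in sequence."""
    seen = set()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def smallest_repeat_free_k(sequence, k_min, k_max, step=1):
    """Return the smallest k in [k_min, k_max] with no repeated k-mer, else None."""
    for k in range(k_min, k_max + 1, step):
        if not has_repeated_kmer(sequence, k):
            return k
    return None


# "ACGT" occurs twice, so k = 3 and k = 4 both see a repeat.
region = "ACGTACGTTT"
print(smallest_repeat_free_k(region, k_min=3, k_max=9))
```

A larger k resolves longer repeats but needs deeper coverage to keep the graph connected, which is presumably why an adaptive choice is preferable to a single fixed value.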
Affiliation(s)
- Giuseppe Narzisi
- [1] Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA. [2] New York Genome Center, New York, USA
- Jason A O'Rawe
- [1] Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA. [2] Stony Brook University, Stony Brook, New York, USA
- Ivan Iossifov
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
- Han Fang
- [1] Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA. [2] Stony Brook University, Stony Brook, New York, USA
- Yoon-Ha Lee
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
- Zihua Wang
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
- Yiyang Wu
- [1] Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA. [2] Stony Brook University, Stony Brook, New York, USA
- Gholson J Lyon
- [1] Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA. [2] Stony Brook University, Stony Brook, New York, USA
- Michael Wigler
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
- Michael C Schatz
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
|
3
|
Levy-Sakin M, Grunwald A, Kim S, Gassman NR, Gottfried A, Antelman J, Kim Y, Ho S, Samuel R, Michalet X, Lin RR, Dertinger T, Kim AS, Chung S, Colyer RA, Weinhold E, Weiss S, Ebenstein Y. Toward single-molecule optical mapping of the epigenome. ACS Nano 2014; 8:14-26. [PMID: 24328256 PMCID: PMC4022788 DOI: 10.1021/nn4050694] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 05/12/2023]
Abstract
The past decade has seen an explosive growth in the utilization of single-molecule techniques for the study of complex systems. The ability to resolve phenomena otherwise masked by ensemble averaging has made these approaches especially attractive for the study of biological systems, where stochastic events lead to inherent inhomogeneity at the population level. The complex composition of the genome has made it an ideal system to study at the single-molecule level, and methods aimed at resolving genetic information from long, individual, genomic DNA molecules have been in use for the last 30 years. These methods, and particularly optical-based mapping of DNA, have been instrumental in highlighting genomic variation and contributed significantly to the assembly of many genomes including the human genome. Nanotechnology and nanoscopy have been a strong driving force for advancing genomic mapping approaches, allowing both better manipulation of DNA on the nanoscale and enhanced optical resolving power for analysis of genomic information. During the past few years, these developments have been adopted also for epigenetic studies. The common principle for these studies is the use of advanced optical microscopy for the detection of fluorescently labeled epigenetic marks on long, extended DNA molecules. Here we will discuss recent single-molecule studies for the mapping of chromatin composition and epigenetic DNA modifications, such as DNA methylation.
Affiliation(s)
- Michal Levy-Sakin
- Raymond and Beverly Sackler Faculty of Exact Sciences, School of Chemistry, Tel Aviv University, Tel Aviv, Israel
- Assaf Grunwald
- Raymond and Beverly Sackler Faculty of Exact Sciences, School of Chemistry, Tel Aviv University, Tel Aviv, Israel
- Soohong Kim
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Natalie R. Gassman
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Anna Gottfried
- Institute of Organic Chemistry, RWTH Aachen University, Aachen, Germany
- Josh Antelman
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Younggyu Kim
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Sam Ho
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Robin Samuel
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Xavier Michalet
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Ron R. Lin
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Thomas Dertinger
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Andrew S. Kim
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Sangyoon Chung
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Ryan A. Colyer
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Elmar Weinhold
- Institute of Organic Chemistry, RWTH Aachen University, Aachen, Germany
- Shimon Weiss
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Yuval Ebenstein
- Raymond and Beverly Sackler Faculty of Exact Sciences, School of Chemistry, Tel Aviv University, Tel Aviv, Israel
- Corresponding authors: Y. Ebenstein, S. Weiss
|
4
|
Vezzi F, Narzisi G, Mishra B. Reevaluating assembly evaluations with feature response curves: GAGE and assemblathons. PLoS One 2012; 7:e52210. [PMID: 23284938 PMCID: PMC3532452 DOI: 10.1371/journal.pone.0052210] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.8] [Received: 10/01/2012] [Accepted: 11/16/2012] [Indexed: 11/19/2022]
Abstract
In just the last decade, a multitude of bio-technologies and software pipelines have emerged to revolutionize genomics. To further their central goal, they aim to accelerate and improve the quality of de novo whole-genome assembly starting from short DNA sequences/reads. However, the performance of each of these tools is contingent on the length and quality of the sequencing data, the structure and complexity of the genome sequence, and the resolution and quality of long-range information. Furthermore, in the absence of any metric that captures the most fundamental "features" of a high-quality assembly, there is no obvious recipe for users to select the most desirable assembler/assembly. This situation has prompted the scientific community to rely on crowd-sourcing through international competitions, such as Assemblathons or GAGE, with the intention of identifying the best assembler(s) and their features. Somewhat circuitously, the only available approach to gauge de novo assemblies and assemblers relies solely on the availability of a high-quality fully assembled reference genome sequence. Still worse, reference-guided evaluations are often difficult to analyze, leading to conclusions that are difficult to interpret. In this paper, we circumvent many of these issues by relying upon a tool, dubbed FRCurve, which is capable of evaluating de novo assemblies from the read layouts even when no reference exists. We extend the FRCurve approach to cases where layout information may have been obscured, as is true in many de Bruijn-graph-based algorithms. As a by-product, FRCurve now expands its applicability to a much wider class of assemblers - thus, identifying higher-quality members of this group, their inter-relations as well as sensitivity to carefully selected features, with or without the support of a reference sequence or layout for the reads. The paper concludes by reevaluating several recently conducted assembly competitions and the datasets that have resulted from them.
Affiliation(s)
- Francesco Vezzi
- School of Computer Science and Communication, KTH Royal Institute of Technology, Science for Life Laboratory, Solna, Sweden.
|
5
|
AGORA: Assembly Guided by Optical Restriction Alignment. BMC Bioinformatics 2012; 13:189. [PMID: 22856673 PMCID: PMC3431216 DOI: 10.1186/1471-2105-13-189] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Received: 03/28/2012] [Accepted: 06/28/2012] [Indexed: 11/10/2022]
Abstract
Background: Genome assembly is difficult due to repeated sequences within the genome, which create ambiguities and cause the final assembly to be broken up into many separate sequences (contigs). Long range linking information, such as mate-pairs or mapping data, is necessary to help assembly software resolve repeats, thereby leading to a more complete reconstruction of genomes. Prior work has used optical maps for validating assemblies and scaffolding contigs, after an initial assembly has been produced. However, optical maps have not previously been used within the genome assembly process. Here, we use optical map information within the popular de Bruijn graph assembly paradigm to eliminate paths in the de Bruijn graph which are not consistent with the optical map and help determine the correct reconstruction of the genome.
Results: We developed a new algorithm called AGORA: Assembly Guided by Optical Restriction Alignment. AGORA is the first algorithm to use optical map information directly within the de Bruijn graph framework to help produce an accurate assembly of a genome that is consistent with the optical map information provided. Our simulations on bacterial genomes show that AGORA is effective at producing assemblies closely matching the reference sequences. Additionally, we show that noise in the optical map can have a strong impact on the final assembly quality for some complex genomes, and we also measure how various characteristics of the starting de Bruijn graph may impact the quality of the final assembly. Lastly, we show that a proper choice of restriction enzyme for the optical map may substantially improve the quality of the final assembly.
Conclusions: Our work shows that optical maps can be used effectively to assemble genomes within the de Bruijn graph assembly framework. Our experiments also provide insights into the characteristics of the mapping data that most affect the performance of our algorithm, indicating the potential benefit of more accurate optical mapping technologies, such as nano-coding.
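The consistency test at the heart of this idea can be sketched in a few lines of Python. This is only an illustration of comparing in-silico restriction fragments against an optical map; it is not AGORA's algorithm, which the abstract does not spell out. The function names, the relative `tolerance` parameter, the recognition site, and the toy sequences are all assumptions.

```python
def fragment_lengths(sequence, site):
    """Fragment lengths after cutting at the start of every occurrence of site."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    # Skip empty fragments (for example, a site at position 0).
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def consistent_with_map(sequence, site, optical_map, tolerance=0.1):
    """True if every in-silico fragment is within a relative tolerance of the map."""
    fragments = fragment_lengths(sequence, site)
    if len(fragments) != len(optical_map):
        return False
    return all(abs(f - m) <= tolerance * m for f, m in zip(fragments, optical_map))


site = "GAATTC"  # recognition site chosen only as an example
# Two candidate reconstructions that differ in how a repeat is arranged.
candidate_a = "AAAA" + site + "CCCCCCCC" + site + "TT"
candidate_b = "AAAA" + site + "CC" + site + "TTTTTTTT"
optical_map = [4, 14, 8]  # hypothetical measured fragment sizes

for name, candidate in [("a", candidate_a), ("b", candidate_b)]:
    print(name, fragment_lengths(candidate, site),
          consistent_with_map(candidate, site, optical_map))
```

Both candidates have the same length and the same two restriction sites, but only the first reproduces the fragment order in the map, so the second path would be discarded. A real implementation would also need to handle missing or spurious cuts and sizing noise, which the abstract notes can strongly affect the result.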
|
6
|
Kim S, Gottfried A, Lin RR, Dertinger T, Kim AS, Chung S, Colyer RA, Weinhold E, Weiss S, Ebenstein Y. Enzymatically incorporated genomic tags for optical mapping of DNA-binding proteins. Angew Chem Int Ed Engl 2012; 51:3578-81. [PMID: 22344826 DOI: 10.1002/anie.201107714] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Received: 11/02/2011] [Revised: 12/19/2011] [Indexed: 11/08/2022]
Affiliation(s)
- Soohong Kim
- Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
|
7
|
Kim S, Gottfried A, Lin RR, Dertinger T, Kim AS, Chung S, Colyer RA, Weinhold E, Weiss S, Ebenstein Y. Enzymatically Incorporated Genomic Tags for Optical Mapping of DNA-Binding Proteins. Angew Chem Int Ed Engl 2012. [DOI: 10.1002/ange.201107714] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 02/02/2023]
|
8
|
Abstract
The whole-genome sequence assembly (WGSA) problem is among the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: on the one hand, metrics like N50 and number of contigs focus only on size without proportionately emphasizing the correctness of the assembly; on the other hand, comparisons performed on simulated datasets can be highly biased by the unrealistic assumptions in the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess the overall assembly quality and correctness: FRC transparently captures the trade-offs between contigs' quality and their sizes. Nevertheless, the relationship among the different features and their relative importance remains unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques like principal and independent component analysis we were able to estimate the "excess-dimensionality" of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes the assembly quality. Applying independent component analysis we identified a subset of features that better describe the assemblers' performance. We demonstrated that by focusing on a reduced set of highly informative features we can use the FRC to better describe and compare the performance of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state-of-the-art simulators, leads to not-so-realistic results.
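To see why N50 says nothing about correctness, it helps to look at how it is computed. The sketch below uses the conventional definition (the largest length L such that contigs of length at least L account for at least half of the assembled bases). The function name and the example contig lengths are illustrative and are not drawn from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("n50 is undefined for an empty assembly")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length


# Two hypothetical assemblies with identical contig lengths. Even if the
# second one were full of misjoins, its N50 would be exactly the same,
# because the metric only ever sees the lengths.
clean_assembly = [2, 3, 4, 5, 6, 7, 8, 9, 10]
misassembled = list(clean_assembly)
print(n50(clean_assembly), n50(misassembled))
```

An assembler that aggressively joins contigs, rightly or wrongly, therefore raises its N50, which is the bias that feature-based evaluation is meant to expose.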
|
9
|
Schatz MC, Phillippy AM, Sommer DD, Delcher AL, Puiu D, Narzisi G, Salzberg SL, Pop M. Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies. Brief Bioinform 2011; 14:213-24. [PMID: 22199379 DOI: 10.1093/bib/bbr074] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Indexed: 11/13/2022]
Abstract
Since its launch in 2004, the open-source AMOS project has released several innovative DNA sequence analysis applications including: Hawkeye, a visual analytics tool for inspecting the structure of genome assemblies; the Assembly Forensics and FRCurve pipelines for systematically evaluating the quality of a genome assembly; and AMOScmp, the first comparative genome assembler. These applications have been used to assemble and analyze dozens of genomes ranging in complexity from simple microbial species through mammalian genomes. Recent efforts have been focused on enhancing support for new data characteristics brought on by second- and now third-generation sequencing. This review describes the major components of AMOS in light of these challenges, with an emphasis on methods for assessing assembly quality and the visual analytics capabilities of Hawkeye. These interactive graphical aspects are essential for navigating and understanding the complexities of a genome assembly, from the overall genome structure down to individual bases. Hawkeye and AMOS are available open source at http://amos.sourceforge.net.
|
10
|
Scholz MB, Lo CC, Chain PSG. Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Curr Opin Biotechnol 2011; 23:9-15. [PMID: 22154470 DOI: 10.1016/j.copbio.2011.11.013] [Citation(s) in RCA: 194] [Impact Index Per Article: 13.9] [Received: 09/26/2011] [Revised: 11/09/2011] [Accepted: 11/10/2011] [Indexed: 12/24/2022]
Abstract
The recent technological advances in next generation sequencing have brought the field closer to the goal of reconstructing all genomes within a community by presenting high throughput sequencing at much lower costs. While these next-generation sequencing technologies have allowed a massive increase in available raw sequence data, there are a number of new informatics challenges and difficulties that must be addressed to improve the current state, and fulfill the promise of, metagenomics.
Affiliation(s)
- Matthew B Scholz
- Genome Science Group, Los Alamos National Laboratory, Los Alamos, NM 87545, United States
|
11
|
Comparing de novo genome assembly: the long and short of it. PLoS One 2011; 6:e19175. [PMID: 21559467 PMCID: PMC3084767 DOI: 10.1371/journal.pone.0019175] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.9] [Received: 11/22/2010] [Accepted: 03/29/2011] [Indexed: 01/30/2023]
Abstract
Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled a growing interest in the whole-genome sequence assembly (WGSA) problem, thereby inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics to compare the assembled sequences emphasize only size, poorly capturing the contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) and a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-offs between contigs' quality and their sizes. For this purpose, most of the publicly available major sequence assemblers--both for low-coverage long (Sanger) and high-coverage short (Illumina) read technologies--are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read-lengths, coverages, accuracies, and with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing "next-generation" assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium.
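The abstract does not give the construction of the curve, so the following Python sketch is a simplified reading of the feature-response idea rather than the AMOS implementation. Contigs are taken longest first, and each point records how many suspicious features have accumulated against the fraction of the genome covered so far. The function name, the `(length, feature_count)` input format, and the example numbers are assumptions.

```python
def feature_response_curve(contigs, genome_size):
    """Return (cumulative feature count, approximate coverage) points.

    contigs is a list of (length, feature_count) pairs, where feature_count
    is the number of suspicious features (for example, possible misassembly
    signals) reported for that contig.
    """
    ordered = sorted(contigs, key=lambda contig: contig[0], reverse=True)
    points = []
    total_length = 0
    total_features = 0
    for length, feature_count in ordered:
        total_length += length
        total_features += feature_count
        points.append((total_features, total_length / genome_size))
    return points


# Hypothetical assembly of a 1000 bp genome.
contigs = [(300, 0), (500, 2), (200, 5)]
for features, coverage in feature_response_curve(contigs, genome_size=1000):
    print(features, coverage)
```

Under this reading, an assembly whose curve reaches high coverage at a low feature count is preferable, and two assemblies with the same contig sizes can still be told apart by their feature counts.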
|