1
Abstract
The current genomic revolution was made possible by joint advances in genome sequencing technologies and computational approaches for analyzing sequence data. The close interaction between biologists and computational scientists is perhaps most apparent in the development of approaches for sequencing entire genomes, a feat that would not be possible without sophisticated computational tools called genome assemblers (short for genome sequence assemblers). Here, we survey the key developments in algorithms for assembling genome sequences since the development of the first DNA sequencing methods more than 35 years ago.
Affiliation(s)
- Jared T Simpson
- Ontario Institute for Cancer Research, Toronto, Ontario M5G 0A3, Canada;
2
Abstract
Background: Structural variations in human genomes, such as deletions, play an important role in cancer development. Next-generation sequencing technologies have been central in providing ways to detect such variations. Methods like paired-end mapping make it possible to analyze data from several samples simultaneously in order to, e.g., distinguish tumor-specific from patient-specific variations. However, it has been shown that, especially in this setting, overlapping deletions need to be taken into consideration explicitly. Existing tools have only minor capabilities to call overlapping deletions and are unable to unravel complex signals to obtain consistent predictions. Result: We present a first approach specifically designed to cluster short-read paired-end data into possibly overlapping deletion predictions. The method does not make any assumptions on the composition of the data, such as the number of samples, heterogeneity, polyploidy, etc. Taking paired ends mapped to a reference genome as input, it iteratively merges mappings into clusters based on a similarity score that takes both the putative location and size of a deletion into account. Conclusion: We demonstrate that agglomerative clustering is suitable to predict deletions. Analyzing real data from three samples of a cancer patient, we found putatively overlapping deletions and observed that, as a side effect, erroneous mappings are mostly identified as singleton clusters. An evaluation on simulated data shows high accuracy in separating overlapping from single deletions, compared to other methods that can output overlapping clusters.
Affiliation(s)
- Roland Wittler
- Genome Informatics, Faculty of Technology and Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, 33594 Bielefeld, Germany.
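The abstract of entry 2 describes iteratively merging paired-end mappings into clusters using a similarity score over the putative location and size of a deletion. The following is only a minimal sketch of that general idea, not the authors' implementation: the `Cluster` class, the `similarity` formula, the scale parameters, and the stopping threshold are all hypothetical choices made for illustration, and each mapping is reduced to an assumed `(position, size)` pair.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Cluster:
    """A group of paired-end mappings supporting one putative deletion."""
    members: list  # list of (position, size) tuples; representation is assumed

    @property
    def centroid(self):
        # Mean position and mean implied deletion size of the members.
        n = len(self.members)
        return (sum(p for p, _ in self.members) / n,
                sum(s for _, s in self.members) / n)


def similarity(a, b, pos_scale=200.0, size_scale=100.0):
    """Hypothetical score in (0, 1]; 1 means identical location and size."""
    (pa, sa), (pb, sb) = a.centroid, b.centroid
    return 1.0 / (1.0 + abs(pa - pb) / pos_scale + abs(sa - sb) / size_scale)


def cluster_deletions(mappings, min_similarity=0.5):
    """Agglomerative clustering: merge the most similar pair until none qualifies."""
    clusters = [Cluster([m]) for m in mappings]
    while len(clusters) > 1:
        # Find the pair of clusters with the highest similarity.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]))
        if similarity(clusters[i], clusters[j]) < min_similarity:
            break  # remaining clusters are too dissimilar to merge
        merged = Cluster(clusters[i].members + clusters[j].members)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    # Two overlapping deletions at the same locus (sizes ~500 and ~2000)
    # plus one isolated mapping that should remain a singleton.
    example = [(1000, 500), (1010, 510), (1005, 495),
               (1020, 2000), (1030, 2010), (50000, 300)]
    for c in cluster_deletions(example):
        print(len(c.members), c.centroid)
```

Because the score penalizes size differences as well as distance, mappings at the same locus that imply very different deletion sizes stay in separate clusters, which is what allows overlapping deletions to be reported side by side in this sketch.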
3
Discovery of mutations in Saccharomyces cerevisiae by pooled linkage analysis and whole-genome sequencing. Genetics 2010; 186:1127-37. [PMID: 20923977 DOI: 10.1534/genetics.110.123232] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.0] [Indexed: 11/18/2022]
Abstract
Many novel and important mutations arise in model organisms and human patients that can be difficult or impossible to identify using standard genetic approaches, especially for complex traits. Working with a previously uncharacterized dominant Saccharomyces cerevisiae mutant with impaired vacuole inheritance, we developed a pooled linkage strategy based on next-generation DNA sequencing to specifically identify functional mutations from among a large excess of polymorphisms, incidental mutations, and sequencing errors. The VAC6-1 mutation was verified to correspond to PHO81-R701S, the highest priority candidate reported by VAMP, the new software platform developed for these studies. Sequence data further revealed the large extent of strain background polymorphisms and structural alterations present in the host strain, which occurred by several mechanisms including a novel Ty insertion. The results provide a snapshot of the ongoing genomic changes that ultimately result in strain divergence and evolution, as well as a general model for the discovery of functional mutations in many organisms.
4
Meader S, Hillier LW, Locke D, Ponting CP, Lunter G. Genome assembly quality: assessment and improvement using the neutral indel model. Genome Res 2010; 20:675-84. [PMID: 20305016 PMCID: PMC2860169 DOI: 10.1101/gr.096966.109] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.3] [Received: 06/08/2009] [Accepted: 09/23/2009] [Indexed: 11/24/2022]
Abstract
We describe a statistical and comparative-genomic approach for quantifying error rates of genome sequence assemblies. The method exploits not substitutions but the pattern of insertions and deletions (indels) in genome-scale alignments for closely related species. Using two- or three-way alignments, the approach estimates the amount of aligned sequence containing clusters of nucleotides that were wrongly inserted or deleted during sequencing or assembly. Thus, the method is well-suited to assessing fine-scale sequence quality within single assemblies, between different assemblies of a single set of reads, and between genome assemblies for different species. When applying this approach to four primate genome assemblies, we found that average gap error rates per base varied considerably, by up to sixfold. As expected, bacterial artificial chromosome (BAC) sequences contained lower, but still substantial, predicted numbers of errors, arguing for caution in regarding BACs as the epitome of genome fidelity. We then mapped short reads, at approximately 10-fold statistical coverage, from a Bornean orangutan onto the Sumatran orangutan genome assembly originally constructed from capillary reads. This resulted in a reduced gap error rate and a separation of error-prone from high-fidelity sequence. Over 5000 predicted indel errors in protein-coding sequence were corrected in a hybrid assembly. Our approach contributes a new fine-scale quality metric for assemblies that should facilitate development of improved genome sequencing and assembly strategies.
Affiliation(s)
- Stephen Meader
- Medical Research Council Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford OX1 3QX, United Kingdom
- LaDeana W. Hillier
- The Genome Center at Washington University, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Devin Locke
- The Genome Center at Washington University, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Chris P. Ponting
- Medical Research Council Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford OX1 3QX, United Kingdom
- Gerton Lunter
- Medical Research Council Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford OX1 3QX, United Kingdom
- The Wellcome Trust Centre for Human Genetics, Oxford OX3 7BN, United Kingdom
5
Abstract
Research into genome assembly algorithms has experienced a resurgence due to new challenges created by the development of next generation sequencing technologies. Several genome assemblers have been published in recent years specifically targeted at the new sequence data; however, the ever-changing technological landscape leads to the need for continued research. In addition, the low cost of next generation sequencing data has led to an increased use of sequencing in new settings. For example, the new field of metagenomics relies on large-scale sequencing of entire microbial communities instead of isolate genomes, leading to new computational challenges. In this article, we outline the major algorithmic approaches for genome assembly and describe recent developments in this domain.
Affiliation(s)
- Mihai Pop
- Department of Computer Science and the Center for Bioinformatics and Computational Biology at the University of Maryland, College Park, MD 20742, USA.
6
Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res 2009; 19:1270-8. [PMID: 19447966 DOI: 10.1101/gr.088633.108] [Citation(s) in RCA: 202] [Impact Index Per Article: 12.6] [Indexed: 11/24/2022]
Abstract
Recent studies show that along with single nucleotide polymorphisms and small indels, larger structural variants among human individuals are common. The Human Genome Structural Variation Project aims to identify and classify deletions, insertions, and inversions (>5 Kbp) in a small number of normal individuals with a fosmid-based paired-end sequencing approach using traditional sequencing technologies. The realization of new ultra-high-throughput sequencing platforms now makes it feasible to detect the full spectrum of genomic variation among many individual genomes, including cancer patients and others suffering from diseases of genomic origin. Unfortunately, existing algorithms for identifying structural variation (SV) among individuals have not been designed to handle the short read lengths and the errors implied by the "next-gen" sequencing (NGS) technologies. In this paper, we give combinatorial formulations for the SV detection between a reference genome sequence and a next-gen-based, paired-end, whole genome shotgun-sequenced individual. We describe efficient algorithms for each of the formulations we give, which all turn out to be fast and quite reliable; they are also applicable to all next-gen sequencing methods (Illumina, 454 Life Sciences [Roche], ABI SOLiD, etc.) and traditional capillary sequencing technology. We apply our algorithms to identify SV among individual genomes very recently sequenced by Illumina technology.
Affiliation(s)
- Fereydoun Hormozdiari
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada V5A 1S6
7
Axelrod N, Lin Y, Ng PC, Stockwell TB, Crabtree J, Huang J, Kirkness E, Strausberg RL, Frazier ME, Venter JC, Kravitz S, Levy S. The HuRef Browser: a web resource for individual human genomics. Nucleic Acids Res 2008; 37:D1018-24. [PMID: 19036787 PMCID: PMC2686481 DOI: 10.1093/nar/gkn939] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Abstract
The HuRef Genome Browser is a web application for the navigation and analysis of the previously published genome of a human individual, termed HuRef. The browser provides a comparative view between the NCBI human reference sequence and the HuRef assembly, and it enables the navigation of the HuRef genome in the context of HuRef, NCBI and Ensembl annotations. Single nucleotide polymorphisms, indels, inversions, structural and copy-number variations are shown in the context of existing functional annotations on either genome in the comparative view. Demonstrated here are some potential uses of the browser to enable a better understanding of individual human genetic variation. The browser provides full access to the underlying reads with sequence and quality information, the genome assembly and the evidence supporting the identification of DNA polymorphisms. The HuRef Browser is a unique and versatile tool for browsing genome assemblies and studying individual human sequence variation in a diploid context. The browser is available online at http://huref.jcvi.org.
Affiliation(s)
- Nelson Axelrod
- J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA.
8
Akagi K, Li J, Stephens RM, Volfovsky N, Symer DE. Extensive variation between inbred mouse strains due to endogenous L1 retrotransposition. Genome Res 2008; 18:869-80. [PMID: 18381897 PMCID: PMC2413154 DOI: 10.1101/gr.075770.107] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.2] [Received: 12/20/2007] [Accepted: 03/27/2008] [Indexed: 12/13/2022]
Abstract
Numerous inbred mouse strains comprise models for human diseases and diversity, but the molecular differences between them are mostly unknown. Several mammalian genomes have been assembled, providing a framework for identifying structural variations. To identify variants between inbred mouse strains at a single nucleotide resolution, we aligned 26 million individual sequence traces from four laboratory mouse strains to the C57BL/6J reference genome. We discovered and analyzed over 10,000 intermediate-length genomic variants (from 100 nucleotides to 10 kilobases), distinguishing these strains from the C57BL/6J reference. Approximately 85% of such variants are due to recent mobilization of endogenous retrotransposons, predominantly L1 elements, greatly exceeding that reported in humans. Many genes' structures and expression are altered directly by polymorphic L1 retrotransposons, including Drosha (also called Rnasen), Parp8, Scn1a, Arhgap15, and others, including novel genes. L1 polymorphisms are distributed nonrandomly across the genome, as they are excluded significantly from the X chromosome and from genes associated with the cell cycle, but are enriched in receptor genes. Thus, recent endogenous L1 retrotransposition has diversified genomic structures and transcripts extensively, distinguishing mouse lineages and driving a major portion of natural genetic variation.
Affiliation(s)
- Keiko Akagi
- Mouse Cancer Genetics Program, Center for Cancer Research, National Cancer Institute, Frederick, Maryland 21702, USA
- Jingfeng Li
- Basic Research Laboratory, Center for Cancer Research, National Cancer Institute, Frederick, Maryland 21702, USA
- Robert M. Stephens
- Advanced Biomedical Computing Center, Advanced Technology Program, SAIC-Frederick, Inc., Frederick, Maryland 21702, USA
- Natalia Volfovsky
- Advanced Biomedical Computing Center, Advanced Technology Program, SAIC-Frederick, Inc., Frederick, Maryland 21702, USA
- David E. Symer
- Basic Research Laboratory, Center for Cancer Research, National Cancer Institute, Frederick, Maryland 21702, USA
- Laboratory of Biochemistry and Molecular Biology, Center for Cancer Research, National Cancer Institute, Frederick, Maryland 21702, USA
9
Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AWC, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, Venter JC. The diploid genome sequence of an individual human. PLoS Biol 2008; 5:e254. [PMID: 17803354 PMCID: PMC1964779 DOI: 10.1371/journal.pbio.0050254] [Citation(s) in RCA: 1129] [Impact Index Per Article: 66.4] [Received: 05/09/2007] [Accepted: 07/30/2007] [Indexed: 01/20/2023]
Abstract
Presented here is a genome sequence of an individual human. It was produced from approximately 32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2-206 bp), 292,102 heterozygous insertion/deletion events (indels)(1-571 bp), 559,473 homozygous indels (1-82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.
Affiliation(s)
- Samuel Levy
- J. Craig Venter Institute, Rockville, Maryland, USA.
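The variant counts quoted in the abstract of entry 9 can be cross-checked against its statement that non-SNP variation accounts for 22% of all events. The short calculation below is a sketch that assumes the "events" are exactly the five enumerated classes (SNPs, block substitutions, heterozygous indels, homozygous indels, inversions); segmental duplications and copy number variation regions are not counted in the abstract and are therefore left out. The variable names are illustrative only.

```python
# Counts copied from the abstract of entry 9.
snps = 3_213_401
non_snp = {
    "block substitutions": 53_823,
    "heterozygous indels": 292_102,
    "homozygous indels": 559_473,
    "inversions": 90,
}

# Assumption: the enumerated classes make up all counted variant events.
non_snp_total = sum(non_snp.values())
total_events = snps + non_snp_total
non_snp_share = non_snp_total / total_events

print(f"non-SNP events: {non_snp_total:,}")
print(f"all events:     {total_events:,}")
print(f"non-SNP share:  {non_snp_share:.1%}")
```

Under that assumption the total comes to 4,118,889 events, consistent with "more than 4.1 million DNA variants", and the non-SNP share rounds to the 22% given in the abstract.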
10
Phillippy AM, Schatz MC, Pop M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol 2008; 9:R55. [PMID: 18341692 PMCID: PMC2397507 DOI: 10.1186/gb-2008-9-3-r55] [Citation(s) in RCA: 185] [Impact Index Per Article: 10.9] [Received: 10/16/2007] [Revised: 01/10/2008] [Accepted: 03/14/2008] [Indexed: 01/08/2023]
Abstract
A collection of software tools is combined for the first time in an automated pipeline for detecting large-scale genome assembly errors and for validating genome assemblies. We present the first collection of tools aimed at automated genome assembly validation. This work formalizes several mechanisms for detecting mis-assemblies, and describes their implementation in our automated validation pipeline, called amosvalidate. We demonstrate the application of our pipeline in both bacterial and eukaryotic genome assemblies, and highlight several assembly errors in both draft and finished genomes. The software described is compatible with common assembly formats and is released, open-source, at .
Affiliation(s)
- Adam M Phillippy
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA.
11
Choi JH, Kim S, Tang H, Andrews J, Gilbert DG, Colbourne JK. A machine-learning approach to combined evidence validation of genome assemblies. Bioinformatics 2008; 24:744-50. [PMID: 18204064 DOI: 10.1093/bioinformatics/btm608] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 01/25/2023]
Abstract
MOTIVATION: While it is common to refer to 'the genome sequence' as if it were a single, complete and contiguous DNA string, it is in fact an assembly of millions of small, partially overlapping DNA fragments. Sophisticated computer algorithms (assemblers and scaffolders) merge these DNA fragments into contigs, and place these contigs into sequence scaffolds using the paired-end sequences derived from large-insert DNA libraries. Each step in this automated process is susceptible to producing errors; hence, the resulting draft assembly represents (in practice) only a likely assembly that requires further validation. Knowing which parts of the draft assembly are likely free of errors is critical if researchers are to draw reliable conclusions from the assembled sequence data. RESULTS: We develop a machine-learning method to detect assembly errors in sequence assemblies. Several in silico measures for assembly validation have been proposed by various researchers. Using three benchmarking Drosophila draft genomes, we evaluate these techniques along with some new measures that we propose, including the good-minus-bad coverage (GMB), the good-to-bad-ratio (RGB), the average Z-score (AZ) and the average absolute Z-score (ASZ). Our results show that the GMB measure performs better than the others in both its sensitivity and its specificity for assembly error detection. Nevertheless, no single method performs sufficiently well to reliably detect genomic regions requiring attention for further experimental verification. To utilize the advantages of all these measures, we develop a novel machine learning approach that combines these individual measures to achieve a higher prediction accuracy (i.e. greater than 90%). Our combined evidence approach avoids the difficult and often ad hoc selection of many parameters the individual measures require, and significantly improves the overall precisions on the benchmarking data sets.
Affiliation(s)
- Jeong-Hyeon Choi
- The Center for Genomics and Bioinformatics, School of Informatics and Department of Biology, Indiana University, IN 47405, USA
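The abstract of entry 11 names a good-minus-bad coverage (GMB) measure but does not define it. Purely as an illustration, the sketch below takes one plausible reading: a mate pair counts as "good" when its implied insert size lies within a few standard deviations of the library mean, and GMB at a position is the number of good inserts spanning it minus the number of bad ones. This reading, the function name `gmb_coverage`, the `(start, end)` input format, and the `max_z` cutoff are all assumptions and should not be taken as the authors' definition.

```python
def gmb_coverage(contig_length, mate_pairs, mean, sd, max_z=3.0):
    """Per-position good-minus-bad insert coverage (illustrative reading only).

    mate_pairs: iterable of (start, end) insert spans on the contig, with
    end exclusive. mean and sd describe the library insert-size distribution.
    """
    good = [0] * contig_length
    bad = [0] * contig_length
    for start, end in mate_pairs:
        # Z-score of the observed insert size against the library.
        z = ((end - start) - mean) / sd
        track = good if abs(z) <= max_z else bad
        # Count this insert at every contig position it spans.
        for pos in range(max(start, 0), min(end, contig_length)):
            track[pos] += 1
    return [g - b for g, b in zip(good, bad)]


if __name__ == "__main__":
    # Two inserts of the expected size and one stretched insert.
    pairs = [(0, 5), (2, 7), (3, 10)]
    print(gmb_coverage(10, pairs, mean=5.0, sd=0.5))
```

In this toy reading, regions where the value drops to zero or below are the ones a validation step would flag for closer inspection.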
12
Abstract
MOTIVATION: Many genomes are sequenced by a collaboration of several centers, and then each center produces an assembly using its own assembly software. The collaborators then pick the draft assembly that they judge to be the best, and the information contained in the other assemblies is usually not used. METHODS: We have developed a technique that we call assembly reconciliation that can merge draft genome assemblies. It takes one draft assembly, detects apparent errors, and, when possible, patches the problem areas using pieces from alternative draft assemblies. It also closes gaps in places where one of the alternative assemblies has spanned the gap correctly. RESULTS: Using the Assembly Reconciliation technique, we produced reconciled assemblies of six Drosophila species in collaboration with Agencourt Bioscience and The J. Craig Venter Institute. These assemblies are now the official (CAF1) assemblies used for analysis. We also produced a reconciled assembly of the Rhesus Macaque genome, and this assembly is available from our website http://www.genome.umd.edu. AVAILABILITY: The reconciliation software is available for download from http://www.genome.umd.edu/software.htm
Affiliation(s)
- Aleksey V Zimin
- IPST, University of Maryland, College Park, Agencourt Bioscience Inc., Beverly, MA.
13
Schatz MC, Phillippy AM, Shneiderman B, Salzberg SL. Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biol 2007; 8:R34. [PMID: 17349036 PMCID: PMC1868940 DOI: 10.1186/gb-2007-8-3-r34] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.9] [Received: 10/25/2006] [Revised: 01/10/2007] [Accepted: 03/09/2007] [Indexed: 11/14/2022]
Abstract
Genome sequencing remains an inexact science, and genome sequences can contain significant errors if they are not carefully examined. Hawkeye is our new visual analytics tool for genome assemblies, designed to aid in identifying and correcting assembly errors. Users can analyze all levels of an assembly along with summary statistics and assembly metrics, and are guided by a ranking component towards likely mis-assemblies. Hawkeye is freely available and released as part of the open source AMOS project http://amos.sourceforge.net/hawkeye.
Affiliation(s)
- Michael C Schatz
- Center for Bioinformatics and Computational Biology, Biomolecular Sciences Building, University of Maryland, College Park, Maryland, 20742, USA
- Adam M Phillippy
- Center for Bioinformatics and Computational Biology, Biomolecular Sciences Building, University of Maryland, College Park, Maryland, 20742, USA
- Ben Shneiderman
- Department of Computer Science and Human-Computer Interaction Lab, A.V. Williams Building, University of Maryland, College Park, Maryland, 20742, USA
- Steven L Salzberg
- Center for Bioinformatics and Computational Biology, Biomolecular Sciences Building, University of Maryland, College Park, Maryland, 20742, USA