3
Luo J, Wang J, Zhang Z, Wu FX, Li M, Pan Y. EPGA: de novo assembly using the distributions of reads and insert size. Bioinformatics 2014; 31:825-33. [PMID: 25406329 DOI: 10.1093/bioinformatics/btu762] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Indexed: 12/13/2022]
Abstract
MOTIVATION: In genome assembly, the primary issue is how to determine the upstream and downstream sequence regions of sequence seeds when constructing long contigs or scaffolds. When a sequence seed is extended, repetitive regions in the genome often produce multiple feasible extension candidates, which increases the difficulty of genome assembly. The commonly accepted solution is to choose one candidate based on read overlaps and paired-end (mate-pair) reads. However, this solution struggles with some complex repetitive regions. In addition, sequencing errors may produce false repetitive regions, and uneven sequencing depth leaves some sequence regions with too few or too many reads. These problems prevent existing assemblers from producing satisfactory assembly results. RESULTS: In this article, we develop an algorithm, called extract paths for genome assembly (EPGA), which extracts paths from the de Bruijn graph for genome assembly. EPGA uses a new score function to evaluate extension candidates based on the distributions of reads and insert size. The distribution of reads addresses problems caused by sequencing errors and short repetitive regions. By assessing the variation in the distribution of insert size, EPGA addresses problems introduced by some complex repetitive regions. To handle uneven sequencing depth, EPGA uses relative mapping to evaluate extension candidates. We compare the performance of EPGA with that of other popular assemblers on real datasets. The experimental results demonstrate that EPGA can effectively obtain longer and more accurate contigs and scaffolds.
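To make the notion of "multiple feasible extension candidates" concrete, here is a minimal Python sketch of a de Bruijn graph built from reads, in which a repeated (k-1)-mer ends up with more than one successor. It is only an illustration under simplifying assumptions (error-free reads, a single strand, no coverage counts); the function names `build_de_bruijn` and `extension_candidates` are hypothetical, and EPGA's score function over read and insert-size distributions is not reproduced here.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extension_candidates(graph, node):
    """Return the sorted successors of a node; more than one means a branch."""
    return sorted(graph.get(node, ()))


# Toy reads (hypothetical data): "TAC" is followed by "G" in one read and "T" in another.
reads = ["ACGTAC", "CGTACG", "GTACTT"]
graph = build_de_bruijn(reads, k=4)
branch = extension_candidates(graph, "TAC")  # two candidates: a branching node
```

An assembler extending a seed that ends in `TAC` must pick one of the two successors; choosing between them is the step that scoring schemes such as the one described in the abstract are meant to handle.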
Affiliation(s)
- Junwei Luo, Jianxin Wang, Zhen Zhang, Fang-Xiang Wu, Min Li, Yi Pan
- School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
4
Bresler M, Sheehan S, Chan AH, Song YS. Telescoper: de novo assembly of highly repetitive regions. Bioinformatics 2013; 28:i311-i317. [PMID: 22962446 PMCID: PMC3436826 DOI: 10.1093/bioinformatics/bts399] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Indexed: 12/31/2022]
Abstract
Motivation: With advances in sequencing technology, it has become faster and cheaper to obtain short-read data from which to assemble genomes. Although there has been considerable progress in the field of genome assembly, producing high-quality de novo assemblies from short reads remains challenging, primarily because of the complex repeat structures found in the genomes of most higher organisms. The telomeric regions of many genomes are particularly difficult to assemble, though much could be gained from the study of these regions, as their evolution has not been fully characterized and they have been linked to aging. Results: In this article, we tackle the problem of assembling highly repetitive regions by developing a novel algorithm that iteratively extends long paths through a series of read-overlap graphs and evaluates them based on a statistical framework. Our algorithm, Telescoper, uses short- and long-insert libraries in an integrated way throughout the assembly process. Results on real and simulated data demonstrate that our approach can effectively resolve much of the complex repeat structure found in the telomeres of yeast genomes, especially when longer long-insert libraries are used. Availability: Telescoper is publicly available for download at sourceforge.net/p/telescoper. Contact: yss@eecs.berkeley.edu Supplementary Information: Supplementary data are available at Bioinformatics online.
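For readers unfamiliar with read-overlap graphs, the following Python sketch shows the basic idea: reads are nodes, and a directed edge links two reads when a suffix of one exactly matches a prefix of the other. This is an assumption-laden illustration (exact matches only, quadratic all-pairs comparison, hypothetical names `overlap_length` and `build_overlap_graph`); it does not reproduce Telescoper's graph construction or its statistical evaluation of paths.

```python
def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(reads, min_overlap):
    """Return {read: {next_read: overlap_length}} for all sufficiently long overlaps."""
    graph = {}
    for a in reads:
        for b in reads:
            if a == b:
                continue
            n = overlap_length(a, b, min_overlap)
            if n:
                graph.setdefault(a, {})[b] = n
    return graph


# Toy reads (hypothetical data) that chain together with 5-base overlaps.
reads = ["ATTAGACC", "AGACCTGC", "CCTGCCGG"]
overlap_graph = build_overlap_graph(reads, min_overlap=3)
```

Raising `min_overlap` removes short, likely spurious edges; in repetitive regions many reads share long overlaps, which is why additional evidence such as insert-size libraries is needed to choose among paths.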
Affiliation(s)
- Ma'ayan Bresler
- Department of EECS, University of California, Berkeley, CA 94720, USA
5
Abstract
MOTIVATION: Recent studies in genomics have highlighted the significance of structural variation in determining individual variation. Current methods for identifying structural variation, however, are predominantly focused on either assembling whole genomes from scratch or identifying the relatively small changes between a genome and a reference sequence. While significant progress has been made in recent years on both de novo assembly and resequencing (read mapping) methods, few attempts have been made to bridge the gap between them. RESULTS: In this paper, we present a computational method for incorporating a reference sequence into an assembly algorithm. We propose a novel graph construction that builds upon the well-known de Bruijn graph to incorporate the reference, and describe a simple algorithm, based on iterative message passing, which uses this information to significantly improve assembly results. We validate our method by applying it to a series of 5 Mb simulated genomes derived from both mammalian and bacterial references. The results of applying our method to these simulated data are presented along with a discussion of the benefits and drawbacks of this technique.
Affiliation(s)
- Nathaniel Parrish
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, USA
- Benjamin Sudakov
- Department of Mathematics, University of California Los Angeles, Los Angeles, California, USA
- Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, USA
6
Pham SK, Antipov D, Sirotkin A, Tesler G, Pevzner PA, Alekseyev MA. Pathset graphs: a novel approach for comprehensive utilization of paired reads in genome assembly. J Comput Biol 2012; 20:359-71. [PMID: 22803627 DOI: 10.1089/cmb.2012.0098] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]
Abstract
One of the key advances in genome assembly that has led to a significant improvement in contig lengths has been improved algorithms for utilization of paired reads (mate-pairs). While in most assemblers, mate-pair information is used in a post-processing step, the recently proposed Paired de Bruijn Graph (PDBG) approach incorporates the mate-pair information directly in the assembly graph structure. However, the PDBG approach faces difficulties when the variation in the insert sizes is high. To address this problem, we first transform mate-pairs into edge-pair histograms that allow one to better estimate the distance between edges in the assembly graph that represent regions linked by multiple mate-pairs. Further, we combine the ideas of mate-pair transformation and PDBGs to construct new data structures for genome assembly: pathsets and pathset graphs.
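The idea of turning mate-pairs into per-edge-pair histograms can be sketched in Python as follows. This is a rough illustration, not the paper's definition: it assumes each mate-pair is given as a hypothetical tuple `(edge_a, offset_a, edge_b, offset_b)` of read-start offsets within two graph edges, that the nominal insert size is the distance between the two read starts, and that the implied distance between edge starts is `offset_a + insert_size - offset_b`. The names `edge_pair_histograms` and `most_supported_distance` are likewise invented for this example.

```python
from collections import Counter, defaultdict


def edge_pair_histograms(mate_pairs, insert_size):
    """Build, for each (edge_a, edge_b) pair, a histogram of implied edge distances."""
    histograms = defaultdict(Counter)
    for edge_a, offset_a, edge_b, offset_b in mate_pairs:
        # Assumed model: start(edge_b) - start(edge_a) = offset_a + insert_size - offset_b.
        implied = offset_a + insert_size - offset_b
        histograms[(edge_a, edge_b)][implied] += 1
    return histograms


def most_supported_distance(histograms, edge_a, edge_b):
    """Return the distance backed by the most mate-pairs, or None if the pair is unlinked."""
    counts = histograms.get((edge_a, edge_b))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy mate-pairs (hypothetical data) with a nominal insert size of 200.
mate_pairs = [
    ("e1", 10, "e2", 5),
    ("e1", 20, "e2", 15),
    ("e1", 30, "e2", 27),
    ("e1", 5, "e3", 0),
]
histograms = edge_pair_histograms(mate_pairs, insert_size=200)
estimate = most_supported_distance(histograms, "e1", "e2")
```

Pooling many mate-pairs that link the same two edges is what lets the estimate stay stable even when individual insert sizes vary; a real implementation would likely smooth or fit the histogram rather than take its single highest bin.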
Affiliation(s)
- Son K Pham
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
7
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012; 19:455-77. [PMID: 22506599 DOI: 10.1089/cmb.2012.0021] [Citation(s) in RCA: 16443] [Impact Index Per Article: 1370.3] [Indexed: 01/25/2023]
Abstract
The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online (http://bioinf.spbau.ru/spades). It is distributed as open source software.
Affiliation(s)
- Anton Bankevich
- Algorithmic Biology Laboratory, St. Petersburg Academic University, Russian Academy of Sciences, St. Petersburg, Russia