1
Bae H, Min S, Choi HS, Yoon S. DNA Privacy: Analyzing Malicious DNA Sequences Using Deep Neural Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:888-898. [PMID: 32809941 DOI: 10.1109/tcbb.2020.3017191] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Recent advances in next-generation sequencing technologies have led to the successful insertion of video information into DNA using synthesized oligonucleotides. Several attempts have been made to embed larger data into living organisms. This process of embedding messages is called steganography and it is used for hiding and watermarking data to protect intellectual property. In contrast, steganalysis is a group of algorithms that serves to detect hidden information from covert media. Various methods have been developed to detect messages embedded in conventional covert channels. However, conventional steganalysis algorithms are mostly limited to common covert media. Most common detection approaches, such as frequency analysis-based methods, often overlook important signals when directly applied to DNA steganography and are easily bypassed by recently developed steganography techniques. To address the limitations of conventional approaches, a sequence-learning-based malicious DNA sequence analysis method based on neural networks has been proposed. The proposed method learns intrinsic distributions and identifies distribution variations using a classification score to predict whether a sequence is to be a coding or non-coding sequence. Based on our experiments and results, we have developed a framework to safeguard security against DNA steganography.
2
Tahir M, Sardaraz M, Mehmood Z, Khan MS. ESREEM: Efficient Short Reads Error Estimation Computational Model for Next-generation Genome Sequencing. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200614171832] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/22/2022]
Abstract
Aims: To assess the error profile in NGS data generated from high-throughput sequencing machines.
Background: Short-read sequencing data from Next Generation Sequencing (NGS) are currently being generated by a number of research projects. Depicting the errors produced by NGS platforms and expressing accurate genetic variation from reads are two inter-dependent phases. It has high significance in various analyses, such as genome sequence assembly, SNP calling, evolutionary studies, and haplotype inference. The systematic and random errors show an incidence profile for each of the sequencing platforms, i.e., Illumina sequencing, Pacific Biosciences, 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Ion Torrent sequencing, and Oxford Nanopore sequencing. Advances in NGS deliver galactic data with the addition of errors. Some ratio of these errors may emulate genuine true biological signals, i.e., mutation, and may subsequently negate the results. Various independent applications have been proposed to correct the sequencing errors. Systematic analysis of these algorithms shows that state-of-the-art models are missing.
Objective: In this paper, an efficient error estimation computational model called ESREEM is proposed to assess the error rates in NGS data.
Methods: The proposed model prospects the analysis that there exists a true linear regression association between the number of reads containing errors and the number of reads sequenced. The model is based on a probabilistic error model integrated with the Hidden Markov Model (HMM).
Result: The proposed model is evaluated on several benchmark datasets and the results obtained are compared with state-of-the-art algorithms.
Conclusions: Experimental results analyses show that the proposed model efficiently estimates errors and runs in less time as compared to others.
Affiliation(s)
- Muhammad Tahir
  - Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan
- Muhammad Sardaraz
  - Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan
- Zahid Mehmood
  - Department of Software Engineering, University of Engineering and Technology, Taxila, Pakistan
- Muhammad Saud Khan
  - Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan
3
Zhao L, Xie J, Bai L, Chen W, Wang M, Zhang Z, Wang Y, Zhao Z, Li J. Mining statistically-solid k-mers for accurate NGS error correction. BMC Genomics 2018; 19:912. [PMID: 30598110 PMCID: PMC6311904 DOI: 10.1186/s12864-018-5272-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/10/2022]
Abstract
BACKGROUND NGS data contains many machine-induced errors. The most advanced methods for the error correction heavily depend on the selection of solid k-mers. A solid k-mer is a k-mer frequently occurring in NGS reads. The other k-mers are called weak k-mers. A solid k-mer does not likely contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers, that are likely to contain no errors, into the remaining set of solid k-mers. Identification of these two subsets of k-mers can improve the correction performance. RESULTS We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and combine them to determine f0. To identify the two special subsets of k-mers, we use the z-score of k-mers which measures the number of standard deviations a k-mer's frequency is from the mean. Then these statistically-solid k-mers are used to construct a Bloom filter for error correction. Our method is markedly superior to the state-of-art methods, tested on both real and synthetic NGS data sets. CONCLUSION The z-score is adequate to distinguish solid k-mers from weak k-mers, particularly useful for pinpointing out solid k-mers having very low frequency. Applying z-score on k-mer can markedly improve the error correction accuracy.
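As an illustration of the terms used in this abstract, the following Python sketch counts k-mers, splits them into solid and weak sets at a frequency cutoff, and computes a z-score per k-mer. It is a minimal sketch under assumptions, not the authors' method: the function names and the toy reads are hypothetical, the z-score uses a single global mean and standard deviation over all distinct k-mers (the abstract does not say which population the mean is taken over), and the cutoff f0 is passed in by hand rather than derived from the Gamma and Gaussian mixture fit the abstract describes.

```python
from collections import Counter
from statistics import mean, pstdev


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_cutoff(counts, f0):
    """Return (solid, weak) k-mer sets; solid means frequency >= f0.

    f0 is supplied by the caller here (an assumption of this sketch).
    """
    solid = {kmer for kmer, c in counts.items() if c >= f0}
    weak = set(counts) - solid
    return solid, weak


def zscores(counts):
    """Number of standard deviations each k-mer frequency lies from the mean.

    Assumption: one global mean/std over all distinct k-mers.
    """
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: (c - mu) / sigma for kmer, c in counts.items()}


# Hypothetical toy reads; the third read differs from the others at one base.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
solid, weak = split_by_cutoff(counts, f0=2)
scores = zscores(counts)
```

With these toy reads, the k-mers that overlap the differing base occur once and fall into the weak set, while k-mers shared by several reads are solid.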
Affiliation(s)
- Liang Zhao
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Jin Xie
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Lin Bai
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Wen Chen
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Mingju Wang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Zhonglei Zhang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Yiqi Wang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Zhe Zhao
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Jinyan Li
  - Advanced Analytics Institute, Faculty of Engineering & IT, University of Technology Sydney, NSW 2007, Australia
4
Huang YT, Huang YW. An efficient error correction algorithm using FM-index. BMC Bioinformatics 2017; 18:524. [PMID: 29179672 PMCID: PMC5704532 DOI: 10.1186/s12859-017-1940-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/09/2017] [Accepted: 11/14/2017] [Indexed: 11/10/2022]
Abstract
BACKGROUND High-throughput sequencing offers higher throughput and lower cost for sequencing a genome. However, sequencing errors, including mismatches and indels, may be produced during sequencing. Because errors may reduce the accuracy of subsequent de novo assembly, error correction is necessary prior to assembly. However, existing correction methods still face trade-offs among correction power, accuracy, and speed. RESULTS We develop a novel overlap-based error correction algorithm using FM-index (called FMOE). FMOE first identifies overlapping reads by aligning a query read simultaneously against multiple reads compressed by FM-index. Subsequently, sequencing errors are corrected by k-mer voting from overlapping reads only. The experimental results indicate that FMOE has the highest correction power with comparable accuracy and speed. Our algorithm performs better on long-read than on short-read datasets when compared with others. The assembly results indicate that each algorithm has its own strengths and weaknesses, whereas FMOE is good for long or good-quality reads. CONCLUSIONS FMOE is freely available at https://github.com/ythuang0522/FMOC.
Affiliation(s)
- Yao-Ting Huang
  - Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan
- Yu-Wen Huang
  - Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan
5
Savel D, LaFramboise T, Grama A, Koyuturk M. Pluribus-Exploring the Limits of Error Correction Using a Suffix Tree. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:1378-1388. [PMID: 27362987 PMCID: PMC5754272 DOI: 10.1109/tcbb.2016.2586060] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/06/2023]
Abstract
Next generation sequencing technologies enable efficient and cost-effective genome sequencing. However, sequencing errors increase the complexity of the de novo assembly process, and reduce the quality of the assembled sequences. Many error correction techniques utilizing substring frequencies have been developed to mitigate this effect. In this paper, we present a novel and effective method called Pluribus, for correcting sequencing errors using a generalized suffix trie. Pluribus utilizes multiple manifestations of an error in the trie to accurately identify errors and suggest corrections. We show that Pluribus produces the least number of false positives across a diverse set of real sequencing datasets when compared to other methods. Furthermore, Pluribus can be used in conjunction with other contemporary error correction methods to achieve higher levels of accuracy than either tool alone. These increases in error correction accuracy are also realized in the quality of the contigs that are generated during assembly. We explore, in depth, the behavior of Pluribus to explain the observed improvement in accuracy and assembly performance. Pluribus is freely available at http://compbio.case.edu/pluribus/.
6
Lee B, Moon T, Yoon S, Weissman T. DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing. PLoS One 2017; 12:e0181463. [PMID: 28749987 PMCID: PMC5531809 DOI: 10.1371/journal.pone.0181463] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 03/20/2017] [Accepted: 06/30/2017] [Indexed: 11/29/2022]
Abstract
We consider the correction of errors from nucleotide sequences produced by next-generation targeted amplicon sequencing. The next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq.
Affiliation(s)
- Byunghan Lee
  - Electrical and Computer Engineering, Seoul National University, Seoul, Korea
- Taesup Moon
  - College of Information and Communication Engineering, Sungkyunkwan University, Suwon, Korea
- Sungroh Yoon
  - Electrical and Computer Engineering, Seoul National University, Seoul, Korea
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea
  - Neurology and Neurological Sciences, Stanford University, Stanford, California, United States of America
- Tsachy Weissman
  - Electrical Engineering, Stanford University, Stanford, California, United States of America
7
Ahola V, Wahlberg N, Frilander MJ. Butterfly Genomics: Insights from the Genome of Melitaea cinxia. Ann Zool Fenn 2017. [DOI: 10.5735/086.054.0123] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/19/2022]
Affiliation(s)
- Virpi Ahola
  - Department of Biosciences, P.O. Box 65, FI-00014 University of Helsinki, Finland
- Niklas Wahlberg
  - Department of Biology, Lund University, Sölvegatan 37, SE-223 62 Lund, Sweden
- Mikko J. Frilander
  - Institute of Biotechnology, P.O. Box 56, FI-00014 University of Helsinki, Finland
8
Garrido-Cardenas JA, Garcia-Maroto F, Alvarez-Bermejo JA, Manzano-Agugliaro F. DNA Sequencing Sensors: An Overview. Sensors 2017; 17:s17030588. [PMID: 28335417 PMCID: PMC5375874 DOI: 10.3390/s17030588] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Received: 01/28/2017] [Revised: 03/09/2017] [Accepted: 03/11/2017] [Indexed: 12/23/2022]
Abstract
The first sequencing of a complete genome was published forty years ago by the double Nobel Prize in Chemistry winner Frederick Sanger. That corresponded to the small sized genome of a bacteriophage, but since then there have been many complex organisms whose DNA have been sequenced. This was possible thanks to continuous advances in the fields of biochemistry and molecular genetics, but also in other areas such as nanotechnology and computing. Nowadays, sequencing sensors based on genetic material have little to do with those used by Sanger. The emergence of mass sequencing sensors, or new generation sequencing (NGS) meant a quantitative leap both in the volume of genetic material that was able to be sequenced in each trial, as well as in the time per run and its cost. One can envisage that incoming technologies, already known as fourth generation sequencing, will continue to cheapen the trials by increasing DNA reading lengths in each run. All of this would be impossible without sensors and detection systems becoming smaller and more precise. This article provides a comprehensive overview on sensors for DNA sequencing developed within the last 40 years.
9
Zhao L, Chen Q, Li W, Jiang P, Wong L, Li J. MapReduce for accurate error correction of next-generation sequencing data. Bioinformatics 2017; 33:3844-3851. [PMID: 28205674 DOI: 10.1093/bioinformatics/btx089] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 09/01/2016] [Accepted: 02/14/2017] [Indexed: 11/14/2022]
Affiliation(s)
- Liang Zhao
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
  - Taihe Hospital, Hubei University of Medicine, Hubei, China
- Qingfeng Chen
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Wencui Li
  - Taihe Hospital, Hubei University of Medicine, Hubei, China
- Peng Jiang
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Limsoon Wong
  - School of Computing, National University of Singapore, Singapore, Singapore
- Jinyan Li
  - Advanced Analytics Institute and Centre for Health Technologies, University of Technology Sydney, Broadway, NSW, Australia
10
From next-generation resequencing reads to a high-quality variant data set. Heredity (Edinb) 2016; 118:111-124. [PMID: 27759079 DOI: 10.1038/hdy.2016.102] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Received: 04/24/2016] [Revised: 09/03/2016] [Accepted: 09/06/2016] [Indexed: 12/11/2022]
Abstract
Sequencing has revolutionized biology by permitting the analysis of genomic variation at an unprecedented resolution. High-throughput sequencing is fast and inexpensive, making it accessible for a wide range of research topics. However, the produced data contain subtle but complex types of errors, biases and uncertainties that impose several statistical and computational challenges to the reliable detection of variants. To tap the full potential of high-throughput sequencing, a thorough understanding of the data produced as well as the available methodologies is required. Here, I review several commonly used methods for generating and processing next-generation resequencing data, discuss the influence of errors and biases together with their resulting implications for downstream analyses and provide general guidelines and recommendations for producing high-quality single-nucleotide polymorphism data sets from raw reads by highlighting several sophisticated reference-based methods representing the current state of the art.
11
Akogwu I, Wang N, Zhang C, Gong P. A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis. Hum Genomics 2016; 10 Suppl 2:20. [PMID: 27461106 PMCID: PMC4965716 DOI: 10.1186/s40246-016-0068-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 01/17/2023]
Abstract
BACKGROUND Innumerable opportunities for new genomic research have been stimulated by advancement in high-throughput next-generation sequencing (NGS). However, the pitfall of NGS data abundance is the complication of distinction between true biological variants and sequence error alterations during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but independent evaluation of the impact of such dataset features as read length, genome size, and coverage depth on their performance is lacking. This comparative study aims to investigate the strength and weakness as well as limitations of some newest k-spectrum-based methods and to provide recommendations for users in selecting suitable methods with respect to specific NGS datasets. METHODS Six k-spectrum-based methods, i.e., Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel, were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10× to 120×), read length (36 to 100 bp), and genome size (4.6 to 143 MB). Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (i.e., true positives, false positive, false negative, recall, precision, gain, and F-score) for assessing the correction quality of each method. RESULTS Results from computational experiments indicate that Musket had the best overall performance across the spectra of examined variants reflected in the six datasets. The lowest accuracy of Musket (F-score = 0.81) occurred to a dataset with a medium read length (56 bp), a medium coverage (50×), and a small-sized genome (5.4 MB). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets. CONCLUSIONS This study demonstrates that various factors such as coverage depth, read length, and genome size may influence performance of individual k-spectrum-based error correction methods. 
Thus, efforts have to be paid in choosing appropriate methods for error correction of specific NGS datasets. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six testing datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diversified parameter settings (k-mer values and edit distances) and to compare them against other non-k-spectrum-based classes of error correction methods.
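The abstract names the metrics (recall, precision, gain, and F-score) but does not give their formulas. The Python sketch below uses the definitions commonly applied in error-correction benchmarks, which is an assumption here rather than something stated in the abstract: gain is taken as (TP - FP) / (TP + FN), and the F-score as the harmonic mean of precision and recall. The function name and the example counts are hypothetical.

```python
def correction_metrics(tp, fp, fn):
    """Summarize error-correction quality from base-level counts.

    tp: erroneous bases that were corrected properly
    fp: correct bases that were wrongly changed
    fn: erroneous bases that were left uncorrected
    The formulas are the commonly used ones (assumed, not from the abstract).
    """
    errors = tp + fn          # all true errors in the data
    changes = tp + fp         # all bases the tool modified
    recall = tp / errors if errors else 0.0
    precision = tp / changes if changes else 0.0
    gain = (tp - fp) / errors if errors else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "gain": gain,
        "f_score": f_score,
    }


# Hypothetical counts, for illustration only.
metrics = correction_metrics(tp=80, fp=20, fn=20)
```

Gain can be negative: a tool that introduces more errors than it removes has FP greater than TP, which recall and precision alone do not make obvious.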
Affiliation(s)
- Isaac Akogwu
  - School of Computing, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
- Nan Wang
  - School of Computing, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
- Chaoyang Zhang
  - School of Computing, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
- Ping Gong
  - Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA
12
Zhu X, Wang J, Peng B, Shete S. Empirical estimation of sequencing error rates using smoothing splines. BMC Bioinformatics 2016; 17:177. [PMID: 27102907 PMCID: PMC4840868 DOI: 10.1186/s12859-016-1052-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/18/2015] [Accepted: 04/14/2016] [Indexed: 01/24/2023]
Abstract
Background Next-generation sequencing has been used by investigators to address a diverse range of biological problems through, for example, polymorphism and mutation discovery and microRNA profiling. However, compared to conventional sequencing, the error rates for next-generation sequencing are often higher, which impacts the downstream genomic analysis. Recently, Wang et al. (BMC Bioinformatics 13:185, 2012) proposed a shadow regression approach to estimate the error rates for next-generation sequencing data based on the assumption of a linear relationship between the number of reads sequenced and the number of reads containing errors (denoted as shadows). However, this linear read-shadow relationship may not be appropriate for all types of sequence data. Therefore, it is necessary to estimate the error rates in a more reliable way without assuming linearity. We proposed an empirical error rate estimation approach that employs cubic and robust smoothing splines to model the relationship between the number of reads sequenced and the number of shadows. Results We performed simulation studies using a frequency-based approach to generate the read and shadow counts directly, which can mimic the real sequence counts data structure. Using simulation, we investigated the performance of the proposed approach and compared it to that of shadow linear regression. The proposed approach provided more accurate error rate estimations than the shadow linear regression approach for all the scenarios tested. We also applied the proposed approach to assess the error rates for the sequence data from the MicroArray Quality Control project, a mutation screening study, the Encyclopedia of DNA Elements project, and bacteriophage PhiX DNA samples. 
Conclusions The proposed empirical error rate estimation approach does not assume a linear relationship between the error-free read and shadow counts and provides more accurate estimations of error rates for next-generation, short-read sequencing data.
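To make the linear read-shadow assumption discussed in this abstract concrete, the Python sketch below fits an ordinary least-squares line to pairs of read counts and shadow counts. It is only an illustration of the linearity assumption: the function name and the data are hypothetical, and the step that converts the fitted relationship into an error rate is not described in the abstract, so it is not attempted here. The spline-based alternative the authors propose is likewise not shown.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with >= 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical counts: shadows grow in exact proportion to reads.
read_counts = [100, 200, 400, 800]
shadow_counts = [2, 4, 8, 16]
slope, intercept = fit_line(read_counts, shadow_counts)
```

If the points curve away from a straight line, a single slope misrepresents the relationship, which is the situation the abstract gives as motivation for dropping the linearity assumption.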
Affiliation(s)
- Xuan Zhu
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Jian Wang
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Bo Peng
  - Department of Bioinformatics & Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Sanjay Shete
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
  - Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
13
Feng S, Lo CC, Li PE, Chain PSG. ADEPT, a dynamic next generation sequencing data error-detection program with trimming. BMC Bioinformatics 2016; 17:109. [PMID: 26928302 PMCID: PMC4772517 DOI: 10.1186/s12859-016-0967-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 02/10/2015] [Accepted: 02/22/2016] [Indexed: 01/16/2023]
Abstract
Background Illumina is the most widely used next generation sequencing technology and produces millions of short reads that contain errors. These sequencing errors constitute a major problem in applications such as de novo genome assembly, metagenomics analysis and single nucleotide polymorphism discovery. Results In this study, we present ADEPT, a dynamic error detection method that is based on the quality scores of each nucleotide and its neighboring nucleotides, together with their positions within the read, and that compares these to the position-specific quality score distribution of all bases within the sequencing run. This method greatly improves upon other available methods in terms of the true positive rate of error discovery without affecting the false positive rate, particularly within the middle of reads. Conclusions ADEPT is the only tool to date that dynamically assesses errors within reads by comparing position-specific and neighboring base quality scores with the distribution of quality scores for the dataset being analyzed. The result is a method that is less prone to position-dependent under-prediction, which is one of the most prominent issues in error prediction. The outcome is that ADEPT improves upon prior efforts in identifying true errors, primarily within the middle of reads, while reducing the false positive rate.
Affiliation(s)
- Shihai Feng
  - Genome Science Group, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, 87545, USA
- Chien-Chi Lo
  - Genome Science Group, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, 87545, USA
- Po-E Li
  - Genome Science Group, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, 87545, USA
- Patrick S G Chain
  - Genome Science Group, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, 87545, USA
14
Alic AS, Tomas A, Medina I, Blanquer I. MuffinEc: Error correction for de Novo assembly via greedy partitioning and sequence alignment. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2015.09.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/25/2022]
15
Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novo stand-alone error correction methods for NGS data. Wiley Interdisciplinary Reviews: Computational Molecular Science 2016. [DOI: 10.1002/wcms.1239] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Indexed: 12/14/2022]
Affiliation(s)
- Andy S. Alic
  - Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
- David Ruzafa
  - Departamento de Química Física e Instituto de Biotecnología, Facultad de Ciencias; Universidad de Granada; Granada Spain
- Joaquin Dopazo
  - Department of Computational Genomics; Príncipe Felipe Research Centre (CIPF); Valencia Spain
  - CIBER de Enfermedades Raras (CIBERER); Valencia Spain
  - Functional Genomics Node (INB) at CIPF; Valencia Spain
- Ignacio Blanquer
  - Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
  - Biomedical Imaging Research Group GIBI 2; Polytechnic University Hospital La Fe; Valencia Spain
16
Pathogen Discovery. Mol Microbiol 2016. [DOI: 10.1128/9781555819071.ch7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
17
Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016; 17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 190] [Impact Index Per Article: 21.1] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022]
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
|
18
|
Abstract
BACKGROUND Continued advances in next generation short-read sequencing technologies are increasing throughput and read lengths, while driving down error rates. Taking advantage of the high coverage sampling used in many applications, several error correction algorithms have been developed to improve data quality further. However, correcting errors in high coverage sequence data requires significant computing resources. METHODS We propose a different approach to handle erroneous sequence data. Presently, error rates of high-throughput platforms such as the Illumina HiSeq are within 1%. Moreover, the errors are not uniformly distributed across reads, and a large percentage of reads are in fact error-free. The ability to predict such perfect reads can significantly reduce the running time of applications. We present a simple and fast k-spectrum analysis based method to identify error-free reads. The filtration process that identifies and weeds out erroneous reads can be customized at several levels of stringency, depending on the needs of the downstream application. RESULTS Our experiments show that if around 80% of the reads in a dataset are perfect, then our method retains almost 99.9% of them with a precision of more than 90%. Though filtering out reads identified as erroneous by our method reduces the average coverage by about 7%, we found that the remaining reads provide coverage as uniform as that of the original dataset. We demonstrate the effectiveness of our approach on an example downstream application: when an error correction algorithm such as Reptile, which relies on collectively analyzing the reads in a dataset to identify and correct erroneous bases, instead uses reads predicted to be perfect by our method to correct the other reads, the overall accuracy improves further by up to 10%.
CONCLUSIONS Thanks to continuous technological improvements, the coverage and accuracy of reads from dominant sequencing platforms have now reached a level where we can envision simply filtering out reads with errors, thus making error correction less important. Our algorithm is a first attempt to propose and demonstrate this new paradigm. Moreover, our demonstration is applicable to any error correction algorithm as a downstream application, which in turn gives a new class of error-correcting algorithms as a by-product.
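The abstract above does not spell out how the k-spectrum filter works, so the following is only a minimal sketch of the general idea, assuming a plain count threshold: a read is predicted to be error-free when every k-mer it contains occurs at least `min_count` times in the whole dataset. The function names, `k` and `min_count` are illustrative assumptions, not the authors' method or parameters.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across all reads (the k-spectrum)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def predict_perfect_reads(reads, k=4, min_count=2):
    """Keep reads whose k-mers all occur at least min_count times.

    Hypothetical filter: min_count plays the role of a stringency level.
    Reads shorter than k contain no k-mers and are kept unchanged.
    """
    counts = kmer_counts(reads, k)
    return [
        read for read in reads
        if all(counts[read[i:i + k]] >= min_count
               for i in range(len(read) - k + 1))
    ]


# Three identical reads plus one read with a single substituted base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
kept = predict_perfect_reads(reads, k=4, min_count=2)
```

In this toy input the substituted base creates k-mers that occur only once, so the fourth read is dropped. Raising `min_count` makes the filter stricter at the cost of discarding more reads from low-coverage regions.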
|
19
|
Abstract
Background In highly parallel next-generation sequencing (NGS) techniques, millions to billions of short reads are produced from a genomic sequence in a single run. Owing to limitations of NGS technologies, the reads may contain errors. The error rate of the reads can be reduced by trimming and by correcting the erroneous bases. Correction yields higher-quality data, and the computational complexity of many biological applications is greatly reduced if the reads are corrected first. We have developed a novel error correction algorithm called EC and compared it with four other state-of-the-art algorithms using both real and simulated sequencing reads. Results We have done extensive and rigorous experiments that reveal that EC is indeed an effective, scalable, and efficient error correction tool. The real reads that we employed in our performance evaluation are Illumina-generated short reads of various lengths. The six experimental datasets we used are taken from the Sequence Read Archive (SRA) at NCBI. The simulated reads are obtained by picking substrings from random positions of reference genomes. To introduce errors, some of the bases of the simulated reads are changed to other bases with some probability. Conclusions Error correction is a vital problem in biology, especially for NGS data. In this paper we present a novel algorithm, called Error Corrector (EC), for correcting substitution errors in biological sequencing reads. We plan to investigate the possibility of employing the techniques introduced in this research paper to handle insertion and deletion errors as well. Software availability The implementation is freely available for non-commercial purposes. It can be downloaded from: http://engr.uconn.edu/~rajasek/EC.zip.
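The read simulation described in this abstract (substrings from random positions, with bases substituted at some probability) can be sketched roughly as follows. This is an illustrative reconstruction under simple assumptions: a uniform per-base `error_rate`, a uniform choice among the three other bases, and a reference containing only A, C, G and T. The function name and parameters are hypothetical and do not come from the EC software.

```python
import random


def simulate_reads(reference, read_len, n_reads, error_rate, seed=0):
    """Return (start, true_read, noisy_read) tuples sampled from reference.

    Each base of a sampled read is replaced by a different base with
    probability error_rate (substitution errors only, no indels).
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        true_read = reference[start:start + read_len]
        noisy = []
        for base in true_read:
            if rng.random() < error_rate:
                # Substitute with one of the three other bases.
                base = rng.choice([b for b in "ACGT" if b != base])
            noisy.append(base)
        reads.append((start, true_read, "".join(noisy)))
    return reads


reference = "ACGTTGCAAGCTTAGC"
clean = simulate_reads(reference, read_len=5, n_reads=10, error_rate=0.0)
noisy = simulate_reads(reference, read_len=5, n_reads=10, error_rate=1.0)
```

Keeping the true read next to the noisy one makes it straightforward to score a corrector afterwards, since every introduced error is known.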
|
20
|
Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads. Gigascience 2015; 4:48. [PMID: 26500767 PMCID: PMC4615873 DOI: 10.1186/s13742-015-0089-y] [Citation(s) in RCA: 329] [Impact Index Per Article: 32.9] [Received: 06/01/2015] [Accepted: 10/09/2015] [Indexed: 11/10/2022]
Abstract
BACKGROUND Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing. FINDINGS We developed a k-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted k-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted k-mers, Rcorrector computes a local threshold at every position in a read. CONCLUSIONS Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.
|
21
|
Kowalski T, Grabowski S, Deorowicz S. Indexing Arbitrary-Length k-Mers in Sequencing Reads. PLoS One 2015; 10:e0133198. [PMID: 26182400 PMCID: PMC4504488 DOI: 10.1371/journal.pone.0133198] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 02/13/2015] [Accepted: 06/24/2015] [Indexed: 11/25/2022]
Abstract
We propose a lightweight data structure for indexing and querying collections of NGS read data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive with existing algorithms in space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments.
Affiliation(s)
- Tomasz Kowalski
  - Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland
- Szymon Grabowski
  - Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland
- Sebastian Deorowicz
  - Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
|
22
|
Allam A, Kalnis P, Solovyev V. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data. Bioinformatics 2015; 31:3421-8. [DOI: 10.1093/bioinformatics/btv415] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Received: 11/11/2014] [Accepted: 07/08/2015] [Indexed: 11/12/2022]
|
23
|
Sheikhizadeh S, de Ridder D. ACE: accurate correction of errors using K-mer tries. Bioinformatics 2015; 31:3216-8. [DOI: 10.1093/bioinformatics/btv332] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 11/23/2014] [Accepted: 05/22/2015] [Indexed: 11/13/2022]
|
24
|
Salehi F, Baronio R, Idrogo-Lam R, Vu H, Hall LV, Kaiser P, Lathrop RH. CHOPER filters enable rare mutation detection in complex mutagenesis populations by next-generation sequencing. PLoS One 2015; 10:e0116877. [PMID: 25692681 PMCID: PMC4333345 DOI: 10.1371/journal.pone.0116877] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/04/2014] [Accepted: 12/08/2014] [Indexed: 01/12/2023]
Abstract
Next-generation sequencing (NGS) has revolutionized genetics and enabled the accurate identification of many genetic variants across many genomes. However, detection of biologically important low-frequency variants within genetically heterogeneous populations remains challenging, because they are difficult to distinguish from intrinsic NGS sequencing error rates. Approaches to overcome these limitations are essential to detect rare mutations in large cohorts, virus or microbial populations, mitochondria heteroplasmy, and other heterogeneous mixtures such as tumors. Modifications in library preparation can overcome some of these limitations, but are experimentally challenging and restricted to skilled biologists. This paper describes a novel quality filtering and base pruning pipeline, called Complex Heterogeneous Overlapped Paired-End Reads (CHOPER), designed to detect sequence variants in a complex population with high sequence similarity derived from All-Codon-Scanning (ACS) mutagenesis. A novel fast alignment algorithm, designed for the specified application, has O(n) time complexity. CHOPER was applied to a p53 cancer mutant reactivation study derived from ACS mutagenesis. Relative to error filtering based on Phred quality scores, CHOPER improved accuracy by about 13% while discarding only half as many bases. These results are a step toward extending the power of NGS to the analysis of genetically heterogeneous populations.
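The abstract compares CHOPER against error filtering based on Phred quality scores. As background for that baseline only (this is not CHOPER), a Phred score Q corresponds to an error probability of 10^(-Q/10), and FASTQ files commonly store Q as the character with code Q + 33. The sketch below masks bases under a threshold; the function names and the `min_q=20` cutoff are illustrative assumptions.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string that uses the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]


def mask_low_quality(seq, quality_string, min_q=20):
    """Replace bases whose Phred score is below min_q with 'N'."""
    quals = decode_phred33(quality_string)
    return "".join(
        base if q >= min_q else "N"
        for base, q in zip(seq, quals)
    )


# 'I' encodes Q40, '!' encodes Q0, '5' encodes Q20, '+' encodes Q10.
masked = mask_low_quality("ACGT", "I!5+", min_q=20)
```

A threshold of Q20 keeps bases with an estimated error probability of at most 1%. Filtering this way discards bases purely on reported quality, which is the trade-off the abstract's comparison is about.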
Affiliation(s)
- Faezeh Salehi
  - Department of Computer Science, University of California Irvine, Irvine, CA, 92697, United States of America
  - Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA, 92697, United States of America
  - * E-mail: (FS); (PK)
- Roberta Baronio
  - Department of Biological Chemistry, University of California Irvine, Irvine, CA, 92697, United States of America
  - Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA, 92697, United States of America
- Ryan Idrogo-Lam
  - Department of Computer Science, University of California Irvine, Irvine, CA, 92697, United States of America
- Huy Vu
  - Department of Computer Science, University of California Irvine, Irvine, CA, 92697, United States of America
- Linda V. Hall
  - Department of Biological Chemistry, University of California Irvine, Irvine, CA, 92697, United States of America
  - Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA, 92697, United States of America
- Peter Kaiser
  - Department of Biological Chemistry, University of California Irvine, Irvine, CA, 92697, United States of America
  - Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA, 92697, United States of America
  - Chao Family Comprehensive Cancer Center, University of California Irvine, Irvine, CA, 92697, United States of America
  - * E-mail: (FS); (PK)
- Richard H. Lathrop
  - Department of Computer Science, University of California Irvine, Irvine, CA, 92697, United States of America
  - Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA, 92697, United States of America
  - Chao Family Comprehensive Cancer Center, University of California Irvine, Irvine, CA, 92697, United States of America
|
25
|
Schulz MH, Weese D, Holtgrewe M, Dimitrova V, Niu S, Reinert K, Richard H. Fiona: a parallel and automatic strategy for read error correction. Bioinformatics 2015; 30:i356-63. [PMID: 25161220 PMCID: PMC4147893 DOI: 10.1093/bioinformatics/btu440] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.3] [Indexed: 11/18/2022]
Abstract
Motivation: Automatic error correction of high-throughput sequencing data can have a dramatic impact on the amount of usable base pairs and their quality. It has been shown that the performance of tasks such as de novo genome assembly and SNP calling can be dramatically improved after read error correction. While a large number of methods specialized for correcting substitution errors as found in Illumina data exist, few methods for the correction of indel errors, common to technologies like 454 or Ion Torrent, have been proposed. Results: We present Fiona, a new stand-alone read error–correction method. Fiona provides a new statistical approach for sequencing error detection and optimal error correction and estimates its parameters automatically. Fiona is able to correct substitution, insertion and deletion errors and can be applied to any sequencing technology. It uses an efficient implementation of the partial suffix array to detect read overlaps with different seed lengths in parallel. We tested Fiona on several real datasets from a variety of organisms with different read lengths and compared its performance with state-of-the-art methods. Fiona shows a constantly higher correction accuracy over a broad range of datasets from 454 and Ion Torrent sequencers, without compromise in speed. Conclusion: Fiona is an accurate parameter-free read error–correction method that can be run on inexpensive hardware and can make use of multicore parallelization whenever available. Fiona was implemented using the SeqAn library for sequence analysis and is publicly available for download at http://www.seqan.de/projects/fiona. Contact: mschulz@mmci.uni-saarland.de or hugues.richard@upmc.fr Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Marcel H Schulz, David Weese, Manuel Holtgrewe, Viktoria Dimitrova, Sijia Niu, Knut Reinert, Hugues Richard
  - 'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany
  - Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA
  - Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany
  - Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France
  - CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France
|
26
|
Rare biosphere exploration using high-throughput sequencing: research progress and perspectives. Conserv Genet 2014. [DOI: 10.1007/s10592-014-0678-9] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Indexed: 11/27/2022]
|
27
|
Ahola V, Lehtonen R, Somervuo P, Salmela L, Koskinen P, Rastas P, Välimäki N, Paulin L, Kvist J, Wahlberg N, Tanskanen J, Hornett EA, Ferguson LC, Luo S, Cao Z, de Jong MA, Duplouy A, Smolander OP, Vogel H, McCoy RC, Qian K, Chong WS, Zhang Q, Ahmad F, Haukka JK, Joshi A, Salojärvi J, Wheat CW, Grosse-Wilde E, Hughes D, Katainen R, Pitkänen E, Ylinen J, Waterhouse RM, Turunen M, Vähärautio A, Ojanen SP, Schulman AH, Taipale M, Lawson D, Ukkonen E, Mäkinen V, Goldsmith MR, Holm L, Auvinen P, Frilander MJ, Hanski I. The Glanville fritillary genome retains an ancient karyotype and reveals selective chromosomal fusions in Lepidoptera. Nat Commun 2014; 5:4737. [PMID: 25189940 PMCID: PMC4164777 DOI: 10.1038/ncomms5737] [Citation(s) in RCA: 158] [Impact Index Per Article: 14.4] [Received: 04/11/2014] [Accepted: 07/17/2014] [Indexed: 12/30/2022]
Abstract
Previous studies have reported that chromosome synteny in Lepidoptera has been well conserved, yet the number of haploid chromosomes varies widely from 5 to 223. Here we report the genome (393 Mb) of the Glanville fritillary butterfly (Melitaea cinxia; Nymphalidae), a widely recognized model species in metapopulation biology and eco-evolutionary research, which has the putative ancestral karyotype of n=31. Using phylogenetic analyses of Nymphalidae and of other Lepidoptera, combined with orthologue-level comparisons of chromosomes, we conclude that the ancestral lepidopteran karyotype has been n=31 for at least 140 My. We show that fusion chromosomes have retained the ancestral chromosome segments and that very few rearrangements have occurred across the fusion sites. The same, shortest ancestral chromosomes have independently participated in fusion events in species with smaller karyotypes. The short chromosomes have a higher rearrangement rate than the long ones. These characteristics highlight distinctive features of the evolutionary dynamics of butterflies and moths. Butterflies and moths (Lepidoptera) vary in chromosome number. Here, the authors sequence the genome of the Glanville fritillary butterfly, Melitaea cinxia, show it has the ancestral lepidopteran karyotype and provide insight into how chromosomal fusions have shaped karyotype evolution in butterflies and moths.
Affiliation(s)
- Virpi Ahola
  - [1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2]
- Rainer Lehtonen
  - [1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [3] Institute of Biomedicine, University of Helsinki, FI-00014 Helsinki, Finland [4] Center of Excellence in Cancer Genetics, University of Helsinki, FI-00014 Helsinki, Finland [5] [6]
- Panu Somervuo
  - [1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [3]
- Leena Salmela
  - Department of Computer Science & Helsinki Institute for Information Technology HIIT, University of Helsinki, FI-00014 Helsinki, Finland
- Patrik Koskinen
  - [1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
- Pasi Rastas
  - Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
- Niko Välimäki
  - [1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biomedicine, University of Helsinki, FI-00014 Helsinki, Finland
- Lars Paulin
  - Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
- Jouni Kvist
  - Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
- Niklas Wahlberg
  - Department of Biology, University of Turku, FI-20014 Turku, Finland
- Jaakko Tanskanen
  - [1] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [2] Biotechnology and Food Research, MTT Agrifood Research Finland, FI-31600 Jokioinen, Finland
- Emily A Hornett
  - [1] Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, UK [2] Department of Biology, Pennsylvania State University, Pennsylvania 16802, USA
- Shiqi Luo
  - College of Life Sciences, Peking University, Beijing 100871, P.R. China
- Zijuan Cao
  - College of Life Sciences, Peking University, Beijing 100871, P.R. China
- Maaike A de Jong
  - [1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] School of Biological Sciences, University of Bristol, Bristol BS8 1UG, UK
- Anne Duplouy
  - Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
- Heiko Vogel
  - Department of Entomology, Max Planck Institute for Chemical Ecology, D-07745 Jena, Germany
- Rajiv C McCoy
  - Department of Biology, Stanford University, Stanford, California 94305, USA
- Kui Qian
  - Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
- Wong Swee Chong
  - Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
- Qin Zhang
  - BioMediTech, University of Tampere, FI-33520 Tampere, Finland
- Freed Ahmad
  - Department of Information Technology, University of Turku, FI-20014 Turku, Finland
- Jani K Haukka
  - BioMediTech, University of Tampere, FI-33520 Tampere, Finland
- Aruj Joshi
  - BioMediTech, University of Tampere, FI-33520 Tampere, Finland
- Jarkko Salojärvi
  - Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
- Ewald Grosse-Wilde
  - Department of Evolutionary Neuroethology, Max Planck Institute for Chemical Ecology, D-07745 Jena, Germany
- Daniel Hughes
  - [1] European Bioinformatics Institute, Hinxton CB10 1SD, UK [2] Baylor College of Medicine, Human Genome Sequencing Center, Houston, Texas 77030-3411, USA
- Riku Katainen
  - [1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biomedicine, University of Helsinki, FI-00014 Helsinki, Finland
- Esa Pitkänen
  - [1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biomedicine, University of Helsinki, FI-00014 Helsinki, Finland
- Johannes Ylinen
  - Department of Computer Science & Helsinki Institute for Information Technology HIIT, University of Helsinki, FI-00014 Helsinki, Finland
- Robert M Waterhouse
  - [1] Department of Genetic Medicine and Development, University of Geneva Medical School & Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland [2] Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA [3] The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Mikko Turunen
  - Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland
- Anna Vähärautio
  - [1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Department of Pathology, University of Helsinki, FI-00014 Helsinki, Finland [3] Science for Life Laboratory, Department of Biosciences and Nutrition, Karolinska Institutet, SE-14183 Stockholm, Sweden
- Sami P Ojanen
  - Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
- Alan H Schulman
  - [1] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [2] Biotechnology and Food Research, MTT Agrifood Research Finland, FI-31600 Jokioinen, Finland
- Minna Taipale
  - [1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Science for Life Laboratory, Department of Biosciences and Nutrition, Karolinska Institutet, SE-14183 Stockholm, Sweden
- Daniel Lawson
  - European Bioinformatics Institute, Hinxton CB10 1SD, UK
- Esko Ukkonen
  - Department of Computer Science & Helsinki Institute for Information Technology HIIT, University of Helsinki, FI-00014 Helsinki, Finland
- Veli Mäkinen
  - Department of Computer Science & Helsinki Institute for Information Technology HIIT, University of Helsinki, FI-00014 Helsinki, Finland
- Marian R Goldsmith
  - Department of Biological Sciences, University of Rhode Island, Kingston, Rhode Island 02881-0816, USA
- Liisa Holm
  - [1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [3]
- Petri Auvinen
  - [1] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [2]
- Mikko J Frilander
  - [1] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [2]
- Ilkka Hanski
  - Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
|
28
|
Molnar M, Ilie L. Correcting Illumina data. Brief Bioinform 2014; 16:588-99. [PMID: 25183248 DOI: 10.1093/bib/bbu029] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 06/23/2014] [Accepted: 08/02/2014] [Indexed: 11/12/2022]
Abstract
Next-generation sequencing technologies revolutionized the ways in which genetic information is obtained and have opened the door for many essential applications in biomedical sciences. Hundreds of gigabytes of data are being produced, and all applications are affected by the errors in the data. Many programs have been designed to correct these errors, most of them targeting the data produced by the dominant technology of Illumina. We present a thorough comparison of these programs. Both HiSeq and MiSeq types of Illumina data are analyzed, and correcting performance is evaluated as the gain in depth and breadth of coverage, as given by correct reads and k-mers. Time and memory requirements, scalability and parallelism are considered as well. Practical guidelines are provided for the effective use of these tools. We also evaluate the efficiency of the current state-of-the-art programs for correcting Illumina data and provide research directions for further improvement.
|
29
|
Abstract
Motivation: PacBio single molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and higher error rate. Errors include numerous indels and complicate downstream analysis like mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping of short reads on long reads provides sufficient coverage to eliminate up to 99% of errors, but at the expense of prohibitive running times and considerable amounts of disk and memory space. Results: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy. Availability and implementation: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec. Contact: lordec@lirmm.fr. Supplementary information: Supplementary data are available at Bioinformatics online.
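The graph-traversal idea in this abstract can be illustrated with a toy sketch, which is not LoRDEC's implementation. It uses a plain dictionary rather than a succinct de Bruijn graph, and it simply enumerates every bounded-length walk between two anchor nodes instead of choosing the path that best matches the long read. The function names, `k` and `max_steps` are assumptions made for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def find_paths(graph, source, target, max_steps):
    """Return sequences spelled by walks from source to target.

    Only walks of at most max_steps edges are explored, which bounds the
    search in the same spirit as limiting the length of a corrected region.
    """
    results = []
    stack = [(source, source)]  # (current node, sequence spelled so far)
    while stack:
        node, spelled = stack.pop()
        if node == target and spelled != source:
            results.append(spelled)
            continue
        if len(spelled) - len(source) >= max_steps:
            continue
        for nxt in sorted(graph.get(node, ())):
            stack.append((nxt, spelled + nxt[-1]))
    return results


# Accurate short reads define the graph; the anchors stand in for trusted
# k-mers flanking an erroneous region of a long read.
short_reads = ["ACGTAC", "GTACGG"]
graph = build_de_bruijn(short_reads, k=4)
candidates = find_paths(graph, "CGT", "CGG", max_steps=6)
```

Each candidate is a sequence supported by the short reads that could replace the region between the two anchors. A practical corrector would additionally score candidates against the original long-read region, for example by edit distance.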
Affiliation(s)
- Leena Salmela
- Department of Computer Science and Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Finland and LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, 34095 Montpellier Cedex 5, France
| | - Eric Rivals
- Department of Computer Science and Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Finland and LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, 34095 Montpellier Cedex 5, France
|
30
|
Lim EC, Müller J, Hagmann J, Henz SR, Kim ST, Weigel D. Trowel: a fast and accurate error correction module for Illumina sequencing reads. Bioinformatics 2014; 30:3264-5. [PMID: 25075116 DOI: 10.1093/bioinformatics/btu513] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION The ability to accurately read the order of nucleotides in DNA and RNA is fundamental for modern biology. Errors in next-generation sequencing can lead to many artifacts, from erroneous genome assemblies to mistaken inferences about RNA editing. Uneven coverage in datasets also contributes to false corrections. RESULT We introduce Trowel, a massively parallelized and highly efficient error correction module for Illumina read data. Trowel both corrects erroneous base calls and boosts base qualities based on the k-mer spectrum. With high-quality k-mers and relevant base information, Trowel achieves high accuracy for different short read sequencing applications. The latency in the data path has been significantly reduced because of efficient data access and data structures. In performance evaluations, Trowel was highly competitive with other tools regardless of coverage, genome size, read length and fragment size. AVAILABILITY AND IMPLEMENTATION Trowel is written in C++ and is provided under the General Public License v3.0 (GPLv3). It is available at http://trowel-ec.sourceforge.net. CONTACT euncheon.lim@tue.mpg.de or weigel@tue.mpg.de SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Eun-Cheon Lim
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
| | - Jonas Müller
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
| | - Jörg Hagmann
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
| | - Stefan R Henz
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
| | - Sang-Tae Kim
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
| | - Detlef Weigel
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
|
31
|
Greenfield P, Duesing K, Papanicolaou A, Bauer DC. Blue: correcting sequencing errors using consensus and context. Bioinformatics 2014; 30:2723-32. [PMID: 24919879 DOI: 10.1093/bioinformatics/btu368] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Bioinformatics tools, such as assemblers and aligners, are expected to produce more accurate results when given better quality sequence data as their starting point. This expectation has led to the development of stand-alone tools whose sole purpose is to detect and remove sequencing errors. A good error-correcting tool would be a transparent component in a bioinformatics pipeline, simply taking sequence data in any of the standard formats and producing a higher quality version of the same data containing far fewer errors. It should not only be able to correct all of the types of errors found in real sequence data (substitutions, insertions, deletions and uncalled bases), but also be fast enough and scalable enough to be usable on the large datasets being produced by current sequencing technologies, and work on data derived from both haploid and diploid organisms. RESULTS This article presents Blue, an error-correction algorithm based on k-mer consensus and context. Blue can correct substitution, deletion and insertion errors, as well as uncalled bases. It accepts both FASTQ and FASTA formats, and corrects quality scores for corrected bases. Blue also maintains the pairing of reads, both within a file and between pairs of files, making it compatible with downstream tools that depend on read pairing. Blue is memory efficient, scalable and faster than other published tools, and usable on large sequencing datasets. On the tests undertaken, Blue also proved to be generally more accurate than other published algorithms, resulting in more accurately aligned reads and the assembly of longer contigs containing fewer errors. One significant feature of Blue is that its k-mer consensus table does not have to be derived from the set of reads being corrected. This decoupling makes it possible to correct one dataset, such as a small set of 454 mate-pair reads, with the consensus derived from another dataset, such as Illumina reads derived from the same DNA sample. Such cross-correction can greatly improve the quality of small (and expensive) sets of long reads, leading to even better assemblies and higher quality finished genomes. AVAILABILITY AND IMPLEMENTATION The code for Blue and its related tools is available from http://www.bioinformatics.csiro.au/Blue. These programs are written in C# and run natively under Windows and under Mono on Linux.
Affiliation(s)
- Paul Greenfield
- CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
| | - Konsta Duesing
- CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
| | - Alexie Papanicolaou
- CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
| | - Denis C Bauer
- CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
|
32
|
Knief C. Analysis of plant microbe interactions in the era of next generation sequencing technologies. FRONTIERS IN PLANT SCIENCE 2014; 5:216. [PMID: 24904612 PMCID: PMC4033234 DOI: 10.3389/fpls.2014.00216] [Citation(s) in RCA: 120] [Impact Index Per Article: 10.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/15/2014] [Accepted: 04/30/2014] [Indexed: 05/18/2023]
Abstract
Next generation sequencing (NGS) technologies have impressively accelerated research in biological science in recent years by enabling the production of large volumes of sequence data at a drastically lower price per base, compared to traditional sequencing methods. The recent and ongoing developments in the field allow addressing research questions in plant-microbe biology that were not conceivable just a few years ago. The present review provides an overview of NGS technologies and their usefulness for the analysis of microorganisms that live in association with plants. Possible limitations of the different sequencing systems, in particular sources of errors and bias, are critically discussed and methods are disclosed that help to overcome these shortcomings. A focus will be on the application of NGS methods in metagenomic studies, including the analysis of microbial communities by amplicon sequencing, which can be considered as a targeted metagenomic approach. Different applications of NGS technologies are exemplified by selected research articles that address the biology of the plant associated microbiota to demonstrate the worth of the new methods.
Affiliation(s)
- Claudia Knief
- Institute of Crop Science and Resource Conservation - Molecular Biology of the Rhizosphere, Faculty of Agriculture, University of Bonn, Bonn, Germany
|
33
|
Wirawan A, Harris RS, Liu Y, Schmidt B, Schröder J. HECTOR: a parallel multistage homopolymer spectrum based error corrector for 454 sequencing data. BMC Bioinformatics 2014; 15:131. [PMID: 24885381 PMCID: PMC4023493 DOI: 10.1186/1471-2105-15-131] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2013] [Accepted: 04/24/2014] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND Current-generation sequencing technologies are able to produce low-cost, high-throughput reads. However, the produced reads are imperfect and may contain various sequencing errors. Although many error correction methods have been developed in recent years, none explicitly targets homopolymer-length errors in the 454 sequencing reads. RESULTS We present HECTOR, a parallel multistage homopolymer spectrum based error corrector for 454 sequencing data. In this algorithm, for the first time we have investigated a novel homopolymer spectrum based approach to handle homopolymer insertions or deletions, which are the dominant sequencing errors in 454 pyrosequencing reads. We have evaluated the performance of HECTOR, in terms of correction quality, runtime and parallel scalability, using both simulated and real pyrosequencing datasets. This performance has been further compared to that of Coral, a state-of-the-art error corrector which is based on multiple sequence alignment and Acacia, a recently published error corrector for amplicon pyrosequences. Our evaluations reveal that HECTOR demonstrates comparable correction quality to Coral, but runs 3.7× faster on average. In addition, HECTOR performs well even when the coverage of the dataset is low. CONCLUSION Our homopolymer spectrum based approach is theoretically capable of processing arbitrary-length homopolymer-length errors, with a linear time complexity. HECTOR employs a multi-threaded design based on a master-slave computing model. Our experimental results show that HECTOR is a practical 454 pyrosequencing read error corrector which is competitive in terms of both correction quality and speed. The source code and all simulated data are available at: http://hector454.sourceforge.net.
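The idea of working on homopolymers rather than single bases can be sketched as follows. This is only an illustration of run-length encoding a read and taking windows of consecutive runs; the function names and the window definition are assumptions and are not taken from HECTOR's source code.

```python
from itertools import groupby


def homopolymer_runs(read):
    """Run-length encode a read as (base, run_length) pairs."""
    return [(base, len(list(group))) for base, group in groupby(read)]


def run_windows(read, k):
    """Return every window of k consecutive homopolymer runs in a read."""
    runs = homopolymer_runs(read)
    return [tuple(runs[i:i + k]) for i in range(len(runs) - k + 1)]


def same_run_bases(read_a, read_b):
    """True when two reads differ at most in their homopolymer lengths."""
    bases_a = [base for base, _ in homopolymer_runs(read_a)]
    bases_b = [base for base, _ in homopolymer_runs(read_b)]
    return bases_a == bases_b


# A typical 454-style error: the A run is one base short, the G run one long.
print(homopolymer_runs("AAACGGT"))
print(same_run_bases("AAACGGT", "AACGGGT"))
```

In this representation, an insertion or deletion inside a homopolymer changes only a run length, so it can be treated much like a substitution in an ordinary k-mer spectrum.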
Affiliation(s)
- Adrianto Wirawan
- Institut für Informatik, Johannes Gutenberg Universität Mainz, Mainz, Germany.
|
34
|
Wang C, Grohme MA, Mali B, Schill RO, Frohme M. Towards decrypting cryptobiosis--analyzing anhydrobiosis in the tardigrade Milnesium tardigradum using transcriptome sequencing. PLoS One 2014; 9:e92663. [PMID: 24651535 PMCID: PMC3961413 DOI: 10.1371/journal.pone.0092663] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2013] [Accepted: 02/25/2014] [Indexed: 11/18/2022] Open
Abstract
Background Many tardigrade species are capable of anhydrobiosis; however, mechanisms underlying their extreme desiccation resistance remain elusive. This study attempts to quantify the anhydrobiotic transcriptome of the limno-terrestrial tardigrade Milnesium tardigradum. Results A prerequisite for differential gene expression analysis was the generation of a reference hybrid transcriptome atlas by assembly of Sanger, 454 and Illumina sequence data. The final assembly yielded 79,064 contigs (>100 bp) after removal of ribosomal RNAs. Around 50% of them could be annotated by SwissProt and NCBI non-redundant protein sequences. Analysis using CEGMA predicted 232 (93.5%) out of the 248 highly conserved eukaryotic genes in the assembly. We used this reference transcriptome for mapping and quantifying the expression of transcripts regulated under anhydrobiosis in a time-series during dehydration and rehydration. 834 of the transcripts were found to be differentially expressed in a single stage (dehydration/inactive tun/rehydration) and 184 were overlapping in two stages while 74 were differentially expressed in all three stages. We have found interesting patterns of differentially expressed transcripts that are in concordance with a common hypothesis of metabolic shutdown during anhydrobiosis. This included down-regulation of several proteins of the DNA replication and translational machinery and protein degradation. Among others, heat shock proteins Hsp27 and Hsp30c were up-regulated in response to dehydration and rehydration. In addition, we observed up-regulation of polyubiquitin-B upon rehydration together with a higher expression level of several DNA repair proteins during rehydration than in the dehydration stage. Conclusions Most of the transcripts identified to be differentially expressed had distinct cellular function. Our data suggest a concerted molecular adaptation in M. tardigradum that permits extreme forms of ametabolic states such as anhydrobiosis. It is tempting to surmise that the desiccation tolerance of tardigrades can be achieved by a constitutive cellular protection system, probably in conjunction with other mechanisms such as rehydration-induced cellular repair.
Affiliation(s)
- Chong Wang
- Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
| | - Markus A. Grohme
- Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
| | - Brahim Mali
- Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
| | - Ralph O. Schill
- Biological Institute, Zoology, University of Stuttgart, Stuttgart, Germany
| | - Marcus Frohme
- Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
|
35
|
Heo Y, Wu XL, Chen D, Ma J, Hwu WM. BLESS: bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics 2014; 30:1354-62. [PMID: 24451628 DOI: 10.1093/bioinformatics/btu030] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
MOTIVATION Rapid advances in next-generation sequencing (NGS) technology have led to an exponential increase in the amount of genomic information. However, NGS reads contain far more errors than data from traditional sequencing methods, and downstream genomic analysis results can be improved by correcting the errors. Unfortunately, all the previous error correction methods required a large amount of memory, making them unsuitable for processing reads from large genomes with commodity computers. RESULTS We present a novel algorithm that produces accurate correction results with much less memory compared with previous solutions. The algorithm, named BLoom-filter-based Error correction Solution for high-throughput Sequencing reads (BLESS), uses a single minimum-sized Bloom filter, and is also able to tolerate a higher false-positive rate, thus allowing us to correct errors with a 40× memory usage reduction on average compared with previous methods. Meanwhile, BLESS can extend reads like DNA assemblers to correct errors at the end of reads. Evaluations using real and simulated reads showed that BLESS could generate more accurate results than existing solutions. After errors were corrected using BLESS, 69% of initially unaligned reads could be aligned correctly. Additionally, de novo assembly results became 50% longer with 66% fewer assembly errors. AVAILABILITY AND IMPLEMENTATION Freely available at http://sourceforge.net/p/bless-ec CONTACT dchen@illinois.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
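To make the memory argument concrete, the following is a minimal sketch of storing frequently seen k-mers in a Bloom filter. It is not the BLESS implementation: the class `KmerBloomFilter`, the helper `solid_kmer_filter`, the bit-array size, the number of hash functions and the count threshold are all illustrative assumptions, and the in-memory `Counter` used for counting is a simplification that a memory-frugal tool would avoid.

```python
import hashlib
from collections import Counter


class KmerBloomFilter:
    """Fixed-size Bloom filter over k-mers (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # May return True for a k-mer never added (false positive),
        # but never returns False for one that was added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer)
        )


def solid_kmer_filter(reads, k, min_count, num_bits=1 << 16, num_hashes=3):
    """Insert every k-mer seen at least min_count times into a Bloom filter."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    bloom = KmerBloomFilter(num_bits, num_hashes)
    for kmer, n in counts.items():
        if n >= min_count:
            bloom.add(kmer)
    return bloom


bloom = solid_kmer_filter(["ACGTAC", "ACGTAC", "ACGTTC"], k=4, min_count=2)
print("ACGT" in bloom)  # a k-mer that was inserted is always reported present
```

The filter stores only bits, not the k-mers themselves, which is where the memory saving comes from; the price is an occasional false positive, so any corrector built on it has to tolerate a k-mer being wrongly reported as present.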
Affiliation(s)
- Yun Heo
- Department of Electrical and Computer Engineering, Department of Bioengineering and Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
|
36
|
McElroy K, Thomas T, Luciani F. Deep sequencing of evolving pathogen populations: applications, errors, and bioinformatic solutions. MICROBIAL INFORMATICS AND EXPERIMENTATION 2014; 4:1. [PMID: 24428920 PMCID: PMC3902414 DOI: 10.1186/2042-5783-4-1] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/14/2013] [Accepted: 01/07/2014] [Indexed: 12/15/2022]
Abstract
Deep sequencing harnesses the high throughput nature of next generation sequencing technologies to generate population samples, treating information contained in individual reads as meaningful. Here, we review applications of deep sequencing to pathogen evolution. Pioneering deep sequencing studies from the virology literature are discussed, such as whole genome Roche-454 sequencing analyses of the dynamics of the rapidly mutating pathogens hepatitis C virus and HIV. Extension of the deep sequencing approach to bacterial populations is then discussed, including the impacts of emerging sequencing technologies. While it is clear that deep sequencing has unprecedented potential for assessing the genetic structure and evolutionary history of pathogen populations, bioinformatic challenges remain. We summarise current approaches to overcoming these challenges, in particular methods for detecting low frequency variants in the context of sequencing error and reconstructing individual haplotypes from short reads.
Affiliation(s)
- Kerensa McElroy
- Centre for Marine Bio-Innovation and School of Biotechnology and Biomolecular Sciences, UNSW, Sydney, NSW 2052, Australia.
|
37
|
|
38
|
El-Metwally S, Ouda OM, Helmy M. Approaches and Challenges of Next-Generation Sequence Assembly Stages. NEXT GENERATION SEQUENCING TECHNOLOGIES AND CHALLENGES IN SEQUENCE ASSEMBLY 2014. [DOI: 10.1007/978-1-4939-0715-1_9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
|
39
|
El-Metwally S, Hamza T, Zakaria M, Helmy M. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol 2013; 9:e1003345. [PMID: 24348224 PMCID: PMC3861042 DOI: 10.1371/journal.pcbi.1003345] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open
Abstract
Decoding DNA symbols using next-generation sequencers was a major breakthrough in genomic research. Despite the many advantages of next-generation sequencers, e.g., the high-throughput sequencing rate and relatively low cost of sequencing, the assembly of the reads produced by these sequencers still remains a major challenge. In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four basic stages: preprocessing filtering, a graph construction process, a graph simplification process, and postprocessing filtering. Here we discuss them as a framework of four stages for data analysis and processing and survey a variety of techniques, algorithms, and software tools used during each stage. We also discuss the challenges that face current assemblers in the next-generation environment to determine the current state-of-the-art. We recommend a layered architecture approach for constructing a general assembler that can handle the sequences generated by different sequencing platforms.
Affiliation(s)
- Sara El-Metwally
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Taher Hamza
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Magdi Zakaria
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Mohamed Helmy
- Botany Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
- Biotechnology Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
|
40
|
Dorn C, Grunert M, Sperling SR. Application of high-throughput sequencing for studying genomic variations in congenital heart disease. Brief Funct Genomics 2013; 13:51-65. [PMID: 24095982 DOI: 10.1093/bfgp/elt040] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
Congenital heart diseases (CHD) represent the most common birth defect in humans. The majority of cases are caused by a combination of complex genetic alterations and environmental influences. In the past, many disease-causing mutations have been identified; however, there is still a large proportion of cardiac malformations whose precise origin is unknown. High-throughput sequencing technologies established in recent years offer novel opportunities to further study the genetic background underlying the disease. In this review, we provide a roadmap for designing and analyzing high-throughput sequencing studies focused on CHD, but also with general applicability to other complex diseases. The three main next-generation sequencing (NGS) platforms including their particular advantages and disadvantages are presented. To identify potentially disease-related genomic variations and genes, different filtering steps and gene prioritization strategies are discussed. In addition, available control datasets based on NGS are summarized. Finally, we provide an overview of current studies already using NGS technologies and showing that these techniques will help to further unravel the complex genetics underlying CHD.
Affiliation(s)
- Cornelia Dorn
- Department of Cardiovascular Genetics, Experimental and Clinical Research Center (ECRC), Charité-University Medicine Berlin and Max Delbrück Center (MDC) for Molecular Medicine, Lindenberger Weg 80, 13125 Berlin, Germany. Department of Biochemistry, Free University Berlin, Berlin, Germany. Tel.: +49-(0)30-450540123; Fax: +49-(0)30-84131699;
|
41
|
Farrer RA, Henk DA, MacLean D, Studholme DJ, Fisher MC. Using false discovery rates to benchmark SNP-callers in next-generation sequencing projects. Sci Rep 2013; 3:1512. [PMID: 23518929 PMCID: PMC3604800 DOI: 10.1038/srep01512] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2012] [Accepted: 02/25/2013] [Indexed: 12/16/2022] Open
Abstract
Sequence alignments form the basis for many comparative and population genomic studies. Alignment tools provide a range of accuracies dependent on the divergence between the sequences and the alignment methods. Despite widespread use, there is no standard method for assessing the accuracy of a dataset and alignment strategy after resequencing. We present a framework and tool for determining the overall accuracies of an input read dataset, alignment and SNP-calling method, provided an isolate in that dataset has a corresponding, or closely related, reference sequence available. In addition to this tool for comparing False Discovery Rates (FDR), we include a method for determining homozygous and heterozygous positions from an alignment using binomial probabilities for an expected error rate. We benchmark this method against other SNP callers using our FDR method with three fungal genomes, finding that it was able to achieve a high level of accuracy. These tools are available at http://cfdr.sourceforge.net/.
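The abstract states only that homozygous and heterozygous positions are determined from binomial probabilities for an expected error rate. One way such a rule could look is sketched below; the three competing hypotheses, the default error rate of 0.01 and the function names are assumptions chosen for illustration and should not be read as the authors' published method.

```python
from math import comb


def binom_pmf(successes, trials, p):
    """Probability of exactly `successes` successes in `trials` Bernoulli(p) draws."""
    return comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)


def call_genotype(depth, alt_count, error_rate=0.01):
    """Pick the most likely genotype for one aligned position.

    Hypothetical model: under a homozygous-reference site, alternate bases
    appear only through errors; under a heterozygous site, about half of
    the reads carry the alternate base; under a homozygous-alternate site,
    all reads do except for errors.
    """
    likelihoods = {
        "hom_ref": binom_pmf(alt_count, depth, error_rate),
        "het": binom_pmf(alt_count, depth, 0.5),
        "hom_alt": binom_pmf(alt_count, depth, 1 - error_rate),
    }
    return max(likelihoods, key=likelihoods.get)


# Thirty reads cover the position; the number carrying the alternate base varies.
for alt in (0, 14, 30):
    print(alt, call_genotype(depth=30, alt_count=alt))
```

A production caller would also weigh base and mapping qualities and require a minimum depth, none of which is modelled here.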
Affiliation(s)
- Rhys A Farrer
- Department of Infectious Disease Epidemiology, St Mary's Hospital, Imperial College London, London, UK.
|
42
|
Abstract
MOTIVATION High-throughput next-generation sequencing technologies enable increasingly fast and affordable sequencing of genomes and transcriptomes, with a broad range of applications. The quality of the sequencing data is crucial for all applications. A significant portion of the data produced contains errors, and ever more efficient error correction programs are needed. RESULTS We propose RACER (Rapid and Accurate Correction of Errors in Reads), a new software program for correcting errors in sequencing data. RACER has better error-correcting performance than existing programs, is faster and requires less memory. To support our claims, we performed extensive comparison with the existing leading programs on a variety of real datasets. AVAILABILITY RACER is freely available for non-commercial use at www.csd.uwo.ca/~ilie/RACER/.
Affiliation(s)
- Lucian Ilie
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, ON, Canada
|
43
|
Abstract
The extremely high error rates reported by Keegan et al. in ‘A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE’ (PLoS Comput Biol 2012;8:e1002541) for many next-generation sequencing datasets prompted us to re-examine their results. Our analysis reveals that the presence of conserved artificial sequences, e.g. Illumina adapters, and other naturally occurring sequence motifs accounts for most of the reported errors. We conclude that DRISEE reports inflated levels of sequencing error, particularly for Illumina data. Tools offered for evaluating large datasets need scrupulous review before they are implemented.
|
44
|
Liu Y, Schröder J, Schmidt B. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics 2012. [PMID: 23202746 DOI: 10.1093/bioinformatics/bts690] [Citation(s) in RCA: 180] [Impact Index Per Article: 13.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
MOTIVATION The imperfect sequence data produced by next-generation sequencing technologies have motivated the development of a number of short-read error correctors in recent years. The majority of methods focus on the correction of substitution errors, which are the dominant error source in data produced by Illumina sequencing technology. Existing tools either score high in terms of recall or precision but not consistently high in terms of both measures. RESULTS In this article, we present Musket, an efficient multistage k-mer-based corrector for Illumina short-read data. We use the k-mer spectrum approach and introduce three correction techniques in a multistage workflow: two-sided conservative correction, one-sided aggressive correction and voting-based refinement. Our performance evaluation results, in terms of correction quality and de novo genome assembly measures, reveal that Musket is consistently one of the top performing correctors. In addition, Musket is multi-threaded using a master-slave model and demonstrates superior parallel scalability compared with all other evaluated correctors as well as a highly competitive overall execution time. AVAILABILITY Musket is available at http://musket.sourceforge.net.
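The k-mer spectrum approach mentioned in this abstract can be illustrated with a small substitution corrector. This is a generic sketch and does not reproduce Musket's three stages: the function names, the count threshold `min_count` and the rule of accepting a change only when exactly one alternative base fits are assumptions made for the example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across all reads (the k-mer spectrum)."""
    return Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )


def correct_read(read, counts, k, min_count):
    """Fix isolated substitution errors using trusted (frequent) k-mers."""

    def trusted(kmer):
        return counts[kmer] >= min_count

    bases = list(read)
    for pos in range(len(bases)):
        current = "".join(bases)
        # Start offsets of all k-mer windows that cover this position.
        starts = range(max(0, pos - k + 1), min(pos, len(bases) - k) + 1)
        if any(trusted(current[s:s + k]) for s in starts):
            continue  # at least one trusted k-mer supports this base
        candidates = []
        for alt in "ACGT":
            if alt == bases[pos]:
                continue
            trial = current[:pos] + alt + current[pos + 1:]
            if all(trusted(trial[s:s + k]) for s in starts):
                candidates.append(alt)
        if len(candidates) == 1:  # change only when the fix is unambiguous
            bases[pos] = candidates[0]
    return "".join(bases)


# Hypothetical data: three accurate copies and one read with a C->A error.
reads = ["ACGGTCATTGCA"] * 3 + ["ACGGTAATTGCA"]
spectrum = kmer_counts(reads, k=4)
print(correct_read("ACGGTAATTGCA", spectrum, k=4, min_count=2))
```

Requiring a unique candidate trades recall for precision, which is the same tension between the two measures that the abstract describes.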
Affiliation(s)
- Yongchao Liu
- Institut für Informatik, Johannes Gutenberg Universität Mainz, Mainz 55099, Germany.
|
45
|
Carneiro AR, Ramos RTJ, Barbosa HPM, Schneider MPC, Barh D, Azevedo V, Silva A. Quality of prokaryote genome assembly: Indispensable issues of factors affecting prokaryote genome assembly quality. Gene 2012; 505:365-7. [DOI: 10.1016/j.gene.2012.06.016] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2012] [Revised: 06/09/2012] [Accepted: 06/11/2012] [Indexed: 12/21/2022]
|
46
|
Wang XV, Blades N, Ding J, Sultana R, Parmigiani G. Estimation of sequencing error rates in short reads. BMC Bioinformatics 2012; 13:185. [PMID: 22846331 PMCID: PMC3495688 DOI: 10.1186/1471-2105-13-185] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2011] [Accepted: 07/13/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of this data can be affected by several factors and it is important to have simple and reliable approaches for monitoring it at the level of individual experiments. RESULTS We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number for a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the difference between the sample of interest and a reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/~vwang/shadowRegression.html. CONCLUSIONS The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.
Affiliation(s)
- Xin Victoria Wang
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
|
47
|
Yang X, Chockalingam SP, Aluru S. A survey of error-correction methods for next-generation sequencing. Brief Bioinform 2012; 14:56-66. [DOI: 10.1093/bib/bbs015] [Citation(s) in RCA: 177] [Impact Index Per Article: 13.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
|
48
|
Burriesci MS, Lehnert EM, Pringle JR. Fulcrum: condensing redundant reads from high-throughput sequencing studies. Bioinformatics 2012; 28:1324-7. [PMID: 22419786 DOI: 10.1093/bioinformatics/bts123] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Ultra-high-throughput sequencing produces duplicate and near-duplicate reads, which can consume computational resources in downstream applications. A tool that collapses such reads should reduce storage and assembly complications and costs. RESULTS We developed Fulcrum to collapse identical and near-identical Illumina and 454 reads (such as those from PCR clones) into single error-corrected sequences; it can process paired-end as well as single-end reads. Fulcrum is customizable and can be deployed on a single machine, a local network or a commercially available MapReduce cluster, and it has been optimized to maximize ease-of-use, cross-platform compatibility and future scalability. Sequence datasets have been collapsed by up to 71%, and the reduced number and improved quality of the resulting sequences allow assemblers to produce longer contigs while using less memory.
Affiliation(s)
- Matthew S Burriesci
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305-5120, USA
|
49
|
Pickrell WO, Rees MI, Chung SK. Next Generation Sequencing Methodologies - An Overview. CHALLENGES AND OPPORTUNITIES OF NEXT-GENERATION SEQUENCING FOR BIOMEDICAL RESEARCH 2012; 89:1-26. [DOI: 10.1016/b978-0-12-394287-6.00001-x] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Indexed: 12/11/2022]
|
50
|
Wijaya E, Frith MC, Asai K, Horton P. RecountDB: a database of mapped and count corrected transcribed sequences. Nucleic Acids Res 2011; 40:D1089-92. [PMID: 22139942 PMCID: PMC3245132 DOI: 10.1093/nar/gkr1172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2022] Open
Abstract
The field of gene expression analysis continues to benefit from data generated by next-generation sequencing, which enables transcripts to be measured with unmatched accuracy and resolution. But the high-throughput reads from these technologies also contain many errors, which can compromise the ability to accurately detect and quantify rare transcripts. Fortunately, techniques exist to ameliorate the effects of sequencer error. We present RecountDB, a secondary database derived from primary data in NCBI's short read archive. RecountDB holds sequence counts from RNA-seq and 5′ capped transcription start site experiments, corrected and mapped to the relevant genome. Via a searchable and browsable interface, users can obtain corrected data in formats useful for transcriptomic analysis. The database is currently populated with 2265 entries from 45 organisms and is continuously growing. RecountDB is publicly available at: http://recountdb.cbrc.jp.
Affiliation(s)
- Edward Wijaya
- Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa 277-8562, Japan
|