101
Evaluation of the impact of Illumina error correction tools on de novo genome assembly. BMC Bioinformatics 2017; 18:374. [PMID: 28821237 PMCID: PMC5563063 DOI: 10.1186/s12859-017-1784-8] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Received: 03/21/2017] [Accepted: 08/11/2017] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND: Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods. RESULTS: For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy. CONCLUSIONS: We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.
102
Whole-Genome Sequences of Two Carbapenem-Resistant Klebsiella quasipneumoniae Strains Isolated from a Tertiary Hospital in Johor, Malaysia. Genome Announcements 2017; 5(32):e00768-17. [PMID: 28798179 PMCID: PMC5552988 DOI: 10.1128/genomea.00768-17] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 11/20/2022]
Abstract
We report the whole-genome sequences of two carbapenem-resistant clinical isolates of Klebsiella quasipneumoniae subsp. similipneumoniae obtained from two different patients. Both strains contained three different extended-spectrum β-lactamase genes and showed strikingly high pairwise average nucleotide identity of 99.99% despite being isolated 3 years apart from the same hospital.
103
Malhotra R, Jha M, Poss M, Acharya R. A random forest classifier for detecting rare variants in NGS data from viral populations. Comput Struct Biotechnol J 2017; 15:388-395. [PMID: 28819548 PMCID: PMC5548337 DOI: 10.1016/j.csbj.2017.07.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 03/14/2017] [Revised: 07/01/2017] [Accepted: 07/03/2017] [Indexed: 11/28/2022] Open
Abstract
We propose a random forest classifier for distinguishing rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method uses counts of k-mers of varying lengths from the reads of a viral population to train a random forest classifier, called MultiRes, that classifies k-mers as erroneous or rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of k-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that k-mers of a given size constitute a frame. We evaluate MultiRes on simulated and real viral population datasets, which consist of many low-frequency variants, and compare it to the error detection methods used in correction tools known in the literature. MultiRes produces 4 to 500 times fewer false-positive k-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and de novo assembly. It has high recall of the true k-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and fewer false-positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available on GitHub at https://github.com/raunaq-m/MultiRes.
Affiliation(s)
- Raunaq Malhotra
  - The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16802, USA
- Manjari Jha
  - The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16802, USA
- Mary Poss
  - Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA
- Raj Acharya
  - School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA
104
Dlugosz M, Deorowicz S. RECKONER: read error corrector based on KMC. Bioinformatics 2017; 33:1086-1089. [PMID: 28062451 DOI: 10.1093/bioinformatics/btw746] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 02/10/2016] [Accepted: 11/24/2016] [Indexed: 11/12/2022] Open
Abstract
Summary: The presence of sequencing errors in data produced by next-generation sequencers affects the quality of downstream analyses. Their accuracy can be improved by performing error correction of sequencing reads. We introduce a new correction algorithm capable of processing high-error-rate data from eukaryotic genomes of close to 500 Mbp using less than 4 GB of RAM in about 35 min on a 16-core computer. Availability and Implementation: The program is freely available at http://sun.aei.polsl.pl/REFRESH/reckoner . Contact: sebastian.deorowicz@polsl.pl. Supplementary information: Supplementary data are available at Bioinformatics online.
105
Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, Birol I. ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Res 2017; 27:768-777. [PMID: 28232478 PMCID: PMC5411771 DOI: 10.1101/gr.214346.116] [Citation(s) in RCA: 413] [Impact Index Per Article: 51.6] [Received: 08/07/2016] [Accepted: 02/14/2017] [Indexed: 01/19/2023]
Abstract
The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species, between or within individuals critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics' Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp.
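As a rough illustration of the idea in this abstract, the sketch below stores k-mers in a Bloom filter and walks an implicit de Bruijn graph by querying the four possible successor k-mers. It is a toy under stated assumptions, not the ABySS 2.0 implementation: the class `BloomFilter`, the helpers `kmers` and `successors`, the filter size, the hash choice (salted BLAKE2b) and the example reads are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed with several salted hashes (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # One bit position per salt value; the salt makes the hashes differ.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq: str, k: int):
    """Yield every k-mer of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def successors(bf: BloomFilter, kmer: str) -> list:
    """Neighbours of a k-mer in the implicit de Bruijn graph: no edges are stored."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bf]


# Hypothetical reads, used only to populate the filter.
reads = ["ACGTACGGA", "GTACGGATT"]
K = 5
bf = BloomFilter(num_bits=4096, num_hashes=3)
for read in reads:
    for km in kmers(read, K):
        bf.add(km)

print(successors(bf, "ACGTA"))
```

The memory saving comes from storing only bits rather than the k-mers themselves. The cost is that membership queries can yield false positives, which a real assembler has to detect and remove by other means.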
Affiliation(s)
- Shaun D Jackman
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Benjamin P Vandervalk
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Hamid Mohamadi
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Justin Chu
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Sarah Yeo
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- S Austin Hammond
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Golnaz Jahesh
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Hamza Khan
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Lauren Coombe
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Rene L Warren
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
- Inanc Birol
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
106
Somervuo P, Yu DW, Xu CC, Ji Y, Hultman J, Wirta H, Ovaskainen O. Quantifying uncertainty of taxonomic placement in DNA barcoding and metabarcoding. Methods Ecol Evol 2017. [DOI: 10.1111/2041-210x.12721] [Citation(s) in RCA: 62] [Impact Index Per Article: 7.8] [Indexed: 11/28/2022]
Affiliation(s)
- Panu Somervuo
  - Department of Biosciences, University of Helsinki, P.O. Box 65, Helsinki FI‐00014, Finland
- Douglas W. Yu
  - State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, 32 Jiaochang East Road, Kunming, Yunnan 650223, China
  - School of Biological Sciences, University of East Anglia, Norwich Research Park, Norwich, Norfolk NR4 7TJ, UK
- Charles C.Y. Xu
  - State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, 32 Jiaochang East Road, Kunming, Yunnan 650223, China
  - Groningen Institute for Evolutionary Life Sciences, University of Groningen, P.O. Box 11103, 9700 CC Groningen, The Netherlands
- Yinqiu Ji
  - State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, 32 Jiaochang East Road, Kunming, Yunnan 650223, China
- Jenni Hultman
  - Department of Food and Environmental Sciences, University of Helsinki, P.O. Box 56, Helsinki FI‐00014, Finland
- Helena Wirta
  - Department of Agricultural Sciences, University of Helsinki, P.O. Box 27, Helsinki FI‐00014, Finland
- Otso Ovaskainen
  - Department of Biosciences, University of Helsinki, P.O. Box 65, Helsinki FI‐00014, Finland
  - Centre for Biodiversity Dynamics, Department of Biology, Norwegian University of Science and Technology, N‐7491 Trondheim, Norway
107
Gerhard GS, Bann DV, Broach J, Goldenberg D. Pitfalls of exome sequencing: a case study of the attribution of HABP2 rs7080536 in familial non-medullary thyroid cancer. NPJ Genom Med 2017; 2:8. [PMID: 28884020 PMCID: PMC5584869 DOI: 10.1038/s41525-017-0011-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 02/10/2016] [Revised: 02/07/2017] [Accepted: 02/28/2017] [Indexed: 02/06/2023] Open
Abstract
Next-generation sequencing using exome capture is a common approach used for analysis of familial cancer syndromes. Despite the development of robust computational algorithms, the accrued experience of analyzing exome data sets and published guidelines, the analytical process remains an ad hoc series of important decisions and interpretations that require significant oversight. Processes and tools used for sequence data generation have matured and are standardized to a significant degree. For the remainder of the analytical pipeline, however, the results can be highly dependent on the choices made and careful review of results. We used primary exome sequence data, generously provided by the corresponding author, from a family with highly penetrant familial non-medullary thyroid cancer reported to be caused by HABP2 rs7080536 to review the importance of several key steps in the application of exome sequencing for discovery of new familial cancer genes. Differences in allele frequencies across populations, probabilities of familial segregation, functional impact predictions, corroborating biological support, and inconsistent replication studies can play major roles in influencing interpretation of results. In the case of HABP2 rs7080536 and familial non-medullary thyroid cancer, these factors led to the conclusion of an association that most data and our re-analysis fail to support, although larger studies from diverse populations will be needed to definitively determine its role.
Affiliation(s)
- Glenn S. Gerhard
  - Lewis Katz School of Medicine at Temple University, Philadelphia, PA 19140, USA
- James Broach
  - Penn State College of Medicine, Hershey, PA 17033, USA
108
Leray M, Knowlton N. Random sampling causes the low reproducibility of rare eukaryotic OTUs in Illumina COI metabarcoding. PeerJ 2017; 5:e3006. [PMID: 28348924 PMCID: PMC5364921 DOI: 10.7717/peerj.3006] [Citation(s) in RCA: 74] [Impact Index Per Article: 9.3] [Received: 10/21/2016] [Accepted: 01/20/2017] [Indexed: 12/15/2022] Open
Abstract
DNA metabarcoding, the PCR-based profiling of natural communities, is becoming the method of choice for biodiversity monitoring because it circumvents some of the limitations inherent to traditional ecological surveys. However, potential sources of bias that can affect the reproducibility of this method remain to be quantified. The interpretation of differences in patterns of sequence abundance and the ecological relevance of rare sequences remain particularly uncertain. Here we used one artificial mock community to explore the significance of abundance patterns and disentangle the effects of two potential biases on data reproducibility: indexed PCR primers and random sampling during Illumina MiSeq sequencing. We amplified a short fragment of the mitochondrial Cytochrome c Oxidase Subunit I (COI) for a single mock sample containing equimolar amounts of total genomic DNA from 34 marine invertebrates belonging to six phyla. We used seven indexed broad-range primers and sequenced the resulting library on two consecutive Illumina MiSeq runs. The total number of Operational Taxonomic Units (OTUs) was ∼4 times higher than expected based on the composition of the mock sample. Moreover, the total number of reads for the 34 components of the mock sample differed by up to three orders of magnitude. However, 79 out of 86 of the unexpected OTUs were represented by <10 sequences that did not appear consistently across replicates. Our data suggest that random sampling of rare OTUs (e.g., small associated fauna such as parasites) accounted for most of variation in OTU presence–absence, whereas biases associated with indexed PCRs accounted for a larger amount of variation in relative abundance patterns. These results suggest that random sampling during sequencing leads to the low reproducibility of rare OTUs. We suggest that the strategy for handling rare OTUs should depend on the objectives of the study. 
Systematic removal of rare OTUs may avoid inflating diversity based on common β descriptors but will exclude positive records of taxa that are functionally important. Our results further reinforce the need for technical replicates (parallel PCR and sequencing from the same sample) in metabarcoding experimental designs. Data reproducibility should be determined empirically as it will depend upon the sequencing depth, the type of sample, the sequence analysis pipeline, and the number of replicates. Moreover, estimating relative biomasses or abundances based on read counts remains elusive at the OTU level.
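The effect of random sampling on rare OTUs described in this abstract can be illustrated with a small simulation. This is a minimal sketch, not the authors' analysis: the community composition, the sequencing depth, the seed and the helper names `sequence_run` and `detected` are all assumptions chosen for illustration.

```python
import random


def sequence_run(relative_abundance: dict, depth: int, rng: random.Random) -> dict:
    """Draw `depth` reads at random in proportion to abundance; return counts per OTU."""
    otus = list(relative_abundance)
    weights = [relative_abundance[otu] for otu in otus]
    counts = dict.fromkeys(otus, 0)
    for otu in rng.choices(otus, weights=weights, k=depth):
        counts[otu] += 1
    return counts


def detected(counts: dict) -> set:
    """OTUs that received at least one read in a run."""
    return {otu for otu, n in counts.items() if n > 0}


rng = random.Random(0)

# Hypothetical community: 5 abundant OTUs and 20 rare ones.
community = {f"common_{i}": 1.0 for i in range(5)}
community.update({f"rare_{i}": 0.0005 for i in range(20)})

# Two replicate runs of the same library at the same depth.
run_a = sequence_run(community, depth=2000, rng=rng)
run_b = sequence_run(community, depth=2000, rng=rng)

shared = detected(run_a) & detected(run_b)
only_one = detected(run_a) ^ detected(run_b)
print("detected in both runs:", sorted(shared))
print("detected in only one run:", sorted(only_one))
```

Abundant OTUs appear in every replicate. Each rare OTU is expected to receive well under one read per run here, so its presence or absence varies between replicates even though nothing about the sample changed.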
Affiliation(s)
- Matthieu Leray
  - National Museum of Natural History, Smithsonian Institution, Washington, D.C., USA; Smithsonian Tropical Research Institute, Smithsonian Institution, Panama City, Balboa, Ancon, Republic of Panama
- Nancy Knowlton
  - National Museum of Natural History, Smithsonian Institution, Washington, D.C., USA
109
Zhao L, Chen Q, Li W, Jiang P, Wong L, Li J. MapReduce for accurate error correction of next-generation sequencing data. Bioinformatics 2017; 33:3844-3851. [PMID: 28205674 DOI: 10.1093/bioinformatics/btx089] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 09/01/2016] [Accepted: 02/14/2017] [Indexed: 11/14/2022] Open
Affiliation(s)
- Liang Zhao
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
  - Taihe Hospital, Hubei University of Medicine, Hubei, China
- Qingfeng Chen
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Wencui Li
  - Taihe Hospital, Hubei University of Medicine, Hubei, China
- Peng Jiang
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Limsoon Wong
  - School of Computing, National University of Singapore, Singapore, Singapore
- Jinyan Li
  - Advanced Analytics Institute and Centre for Health Technologies, University of Technology Sydney, Broadway, NSW, Australia
110
Next generation sequencing of gonadal transcriptome suggests standard maternal inheritance of mitochondrial DNA in Eurhomalea rufa (Veneridae). Mar Genomics 2017. [DOI: 10.1016/j.margen.2016.11.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 11/18/2022]
111
Dolle DD, Liu Z, Cotten M, Simpson JT, Iqbal Z, Durbin R, McCarthy SA, Keane TM. Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes. Genome Res 2016; 27:300-309. [PMID: 27986821 PMCID: PMC5287235 DOI: 10.1101/gr.211748.116] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 06/22/2016] [Accepted: 12/14/2016] [Indexed: 01/04/2023]
Abstract
We are rapidly approaching the point where we have sequenced millions of human genomes. There is a pressing need for new data structures to store raw sequencing data and efficient algorithms for population scale analysis. Current reference-based data formats do not fully exploit the redundancy in population sequencing nor take advantage of shared genetic variation. In recent years, the Burrows–Wheeler transform (BWT) and FM-index have been widely employed as a full-text searchable index for read alignment and de novo assembly. We introduce the concept of a population BWT and use it to store and index the sequencing reads of 2705 samples from the 1000 Genomes Project. A key feature is that, as more genomes are added, identical read sequences are increasingly observed, and compression becomes more efficient. We assess the support in the 1000 Genomes read data for every base position of two human reference assembly versions, identifying that 3.2 Mbp with population support was lost in the transition from GRCh37 with 13.7 Mbp added to GRCh38. We show that the vast majority of variant alleles can be uniquely described by overlapping 31-mers and show how rapid and accurate SNP and indel genotyping can be carried out across the genomes in the population BWT. We use the population BWT to carry out nonreference queries to search for the presence of all known viral genomes and discover human T-lymphotropic virus 1 integrations in six samples in a recognized epidemiological distribution.
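To make the BWT and FM-index mentioned in this abstract more concrete, here is a minimal sketch of building a Burrows-Wheeler transform and counting pattern occurrences by backward search. It is a naive teaching version on a single short string, not the population BWT described in the paper: the function names, the quadratic-time suffix sorting and the linear-time `occ` scan are simplifying assumptions.

```python
def bwt_index(text: str):
    """Build the BWT of text plus a '$' sentinel, and the first-column offsets."""
    text += "$"  # sentinel sorts before A, C, G, T
    # Naive suffix array: fine for a demo, far too slow for real read sets.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)

    # first[c] = number of characters in the text strictly smaller than c.
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    return bwt, first


def occ(bwt: str, ch: str, i: int) -> int:
    """Occurrences of ch in bwt[:i]; real FM-indexes precompute sampled ranks."""
    return bwt[:i].count(ch)


def count_occurrences(bwt: str, first: dict, pattern: str) -> int:
    """Backward search: narrow the suffix-array interval one character at a time."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ(bwt, ch, lo)
        hi = first[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt, first = bwt_index("GATTACAGATTACA")
print(bwt)
print(count_occurrences(bwt, first, "GATTACA"))
print(count_occurrences(bwt, first, "TTA"))
```

The query touches only the transformed string and small count tables, never the original text, which is why an index of this kind can answer k-mer presence queries without aligning reads to a reference.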
Affiliation(s)
- Dirk D Dolle
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
- Zhicheng Liu
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom; European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, United Kingdom
- Matthew Cotten
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
- Jared T Simpson
  - Ontario Institute for Cancer Research, Toronto, Ontario M5G 0A3, Canada; Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3G4, Canada
- Zamin Iqbal
  - Wellcome Trust Centre for Human Genetics, Oxford OX3 7BN, United Kingdom
- Richard Durbin
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
- Shane A McCarthy
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
- Thomas M Keane
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom; European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, United Kingdom
112
Gunn L, Finn S, Hurley D, Bai L, Wall E, Iversen C, Threlfall JE, Fanning S. Molecular Characterization of Salmonella Serovars Anatum and Ealing Associated with Two Historical Outbreaks, Linked to Contaminated Powdered Infant Formula. Front Microbiol 2016; 7:1664. [PMID: 27818652 PMCID: PMC5073096 DOI: 10.3389/fmicb.2016.01664] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/15/2016] [Accepted: 10/05/2016] [Indexed: 11/13/2022] Open
Abstract
Powdered infant formula (PIF) is not intended to be produced as a sterile product unless explicitly stated and on occasion may become contaminated during production with pathogens such as Salmonella enterica. This retrospective study focused on two historically reported salmonellosis outbreaks associated with PIF from the United Kingdom and France, in 1985 and 1996/1997. In this paper, the molecular characterization of the two outbreak-associated Salmonella serovars, Anatum and Ealing, is reported. Initially the isolates were analyzed using pulsed-field gel electrophoresis (PFGE), which revealed the clonal nature of the two outbreaks. Following this, two representative isolates, one from each serovar, were selected for whole genome sequencing (WGS), wherein analysis focused on the Salmonella pathogenicity islands. Furthermore, the ability of these isolates to survive the host intercellular environment was determined using an ex vivo gentamicin protection assay. Results suggest a high level of genetic diversity that may have contributed to survival and virulence of isolates from these outbreaks.
Affiliation(s)
- Lynda Gunn
  - UCD-Centre for Food Safety, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
- Sarah Finn
  - UCD-Centre for Food Safety, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
- Daniel Hurley
  - UCD-Centre for Food Safety, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
- Li Bai
  - UCD-Centre for Food Safety, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland; Key Laboratory of Food Safety Risk Assessment, Ministry of Health, China National Center for Food Safety Risk Assessment, Beijing, China
- Ellen Wall
  - UCD-Centre for Food Safety, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
- Carol Iversen
  - UCD-Centre for Food Safety, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
- Séamus Fanning
  - UCD-Centre for Food Safety, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland; Key Laboratory of Food Safety Risk Assessment, Ministry of Health, China National Center for Food Safety Risk Assessment, Beijing, China; Institute for Global Food Security, Queen's University Belfast, Belfast, UK
113
From next-generation resequencing reads to a high-quality variant data set. Heredity (Edinb) 2016; 118:111-124. [PMID: 27759079 DOI: 10.1038/hdy.2016.102] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Received: 04/24/2016] [Revised: 09/03/2016] [Accepted: 09/06/2016] [Indexed: 12/11/2022] Open
Abstract
Sequencing has revolutionized biology by permitting the analysis of genomic variation at an unprecedented resolution. High-throughput sequencing is fast and inexpensive, making it accessible for a wide range of research topics. However, the produced data contain subtle but complex types of errors, biases and uncertainties that impose several statistical and computational challenges to the reliable detection of variants. To tap the full potential of high-throughput sequencing, a thorough understanding of the data produced as well as the available methodologies is required. Here, I review several commonly used methods for generating and processing next-generation resequencing data, discuss the influence of errors and biases together with their resulting implications for downstream analyses and provide general guidelines and recommendations for producing high-quality single-nucleotide polymorphism data sets from raw reads by highlighting several sophisticated reference-based methods representing the current state of the art.
114
Hou D, Chen C, Seely EJ, Chen S, Song Y. High-Throughput Sequencing-Based Immune Repertoire Study during Infectious Disease. Front Immunol 2016; 7:336. [PMID: 27630639 PMCID: PMC5005336 DOI: 10.3389/fimmu.2016.00336] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Received: 01/06/2016] [Accepted: 08/19/2016] [Indexed: 11/13/2022] Open
Abstract
The selectivity of the adaptive immune response is based on the enormous diversity of T and B cell antigen-specific receptors. The immune repertoire, the collection of T and B cells with functional diversity in the circulatory system at any given time, is dynamic and reflects the essence of immune selectivity. In this article, we review the recent advances in immune repertoire study of infectious diseases, which were achieved by traditional techniques and high-throughput sequencing (HTS) techniques. HTS techniques enable the determination of complementary regions of lymphocyte receptors with unprecedented efficiency and scale. This progress in methodology enhances the understanding of immunologic changes during pathogen challenge and also provides a basis for further development of novel diagnostic markers, immunotherapies, and vaccines.
Affiliation(s)
- Dongni Hou
  - Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai, China
- Cuicui Chen
  - Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai, China
- Eric John Seely
  - Department of Medicine, Division of Pulmonary and Critical Care Medicine, University of California San Francisco, San Francisco, CA, USA
- Shujing Chen
  - Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai, China
- Yuanlin Song
  - Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai, China
115
Heo Y, Ramachandran A, Hwu WM, Ma J, Chen D. BLESS 2: accurate, memory-efficient and fast error correction method. Bioinformatics 2016; 32:2369-71. [PMID: 27153708 DOI: 10.1093/bioinformatics/btw146] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 07/25/2015] [Accepted: 03/12/2016] [Indexed: 11/14/2022]
Abstract
The most important features of error correction tools for sequencing data are accuracy, memory efficiency and fast runtime. The previous version of BLESS was highly memory-efficient and accurate, but it was too slow to handle reads from large genomes. We have developed a new version of BLESS to improve runtime and accuracy while maintaining a small memory usage. The new version, called BLESS 2, has an error correction algorithm that is more accurate than BLESS, and the algorithm has been parallelized using hybrid MPI and OpenMP programming. BLESS 2 was compared with five top-performing tools, and it was found to be the fastest when it was executed on two computing nodes using MPI, with each node containing twelve cores. Also, BLESS 2 showed at least 11% higher gain while retaining the memory efficiency of the previous version for large genomes. AVAILABILITY AND IMPLEMENTATION: Freely available at https://sourceforge.net/projects/bless-ec CONTACT: dchen@illinois.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yun Heo
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Anand Ramachandran
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Wen-Mei Hwu
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Jian Ma
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Deming Chen
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
116
Bremges A, Singer E, Woyke T, Sczyrba A. MeCorS: Metagenome-enabled error correction of single cell sequencing reads. Bioinformatics 2016; 32:2199-201. [PMID: 27153586 PMCID: PMC4937190 DOI: 10.1093/bioinformatics/btw144] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 12/20/2015] [Accepted: 03/09/2016] [Indexed: 11/12/2022] Open
Abstract
We present a new tool, MeCorS, to correct chimeric reads and sequencing errors in Illumina data generated from single amplified genomes (SAGs). It uses sequence information derived from accompanying metagenome sequencing to accurately correct errors in SAG reads, even from ultra-low coverage regions. In evaluations on real data, we show that MeCorS outperforms BayesHammer, the most widely used state-of-the-art approach. MeCorS performs particularly well in correcting chimeric reads, which greatly improves both accuracy and contiguity of de novo SAG assemblies. AVAILABILITY AND IMPLEMENTATION: https://github.com/metagenomics/MeCorS CONTACT: abremges@cebitec.uni-bielefeld.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Andreas Bremges
  - Center for Biotechnology and Faculty of Technology, Bielefeld University, Bielefeld 33615, Germany
  - U.S. Department of Energy Joint Genome Institute, Walnut Creek, CA 94598, USA
- Esther Singer
  - U.S. Department of Energy Joint Genome Institute, Walnut Creek, CA 94598, USA
- Tanja Woyke
  - U.S. Department of Energy Joint Genome Institute, Walnut Creek, CA 94598, USA
- Alexander Sczyrba
  - Center for Biotechnology and Faculty of Technology, Bielefeld University, Bielefeld 33615, Germany
  - U.S. Department of Energy Joint Genome Institute, Walnut Creek, CA 94598, USA
|
117
|
Sameith K, Roscito JG, Hiller M. Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly. Brief Bioinform 2016; 18:1-8. [PMID: 26868358 PMCID: PMC5221426 DOI: 10.1093/bib/bbw003] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 10/22/2015] [Revised: 01/02/2016] [Indexed: 11/13/2022] Open
Abstract
Next-generation sequencers such as Illumina can now produce reads up to 300 bp with high throughput, which is attractive for genome assembly. A first step in genome assembly is to computationally correct sequencing errors. However, correcting all errors in these longer reads is challenging. Here, we show that reads with remaining errors after correction often overlap repeats, where short erroneous k-mers occur in other copies of the repeat. We developed an iterative error correction pipeline that runs the previously published String Graph Assembler (SGA) in multiple rounds of k-mer-based correction with an increasing k-mer size, followed by a final round of overlap-based correction. By combining the advantages of small and large k-mers, this approach corrects more errors in repeats and minimizes the total amount of erroneous reads. We show that higher read accuracy increases contig lengths two to three times. We provide SGA-Iteratively Correcting Errors (https://github.com/hillerlab/IterativeErrorCorrection/) that implements iterative error correction by using modules from SGA.
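The multi-round idea described in this abstract can be illustrated with a toy sketch. The Python below is not SGA and not the authors' pipeline: it is a minimal k-mer-spectrum corrector, run once per k in increasing order. The function names (`kmer_counts`, `correct_read`, `iterative_correction`) and the trust threshold `min_count` are assumptions made for illustration, only single-base substitutions are considered, and the final overlap-based round is omitted.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_read(read, counts, k, min_count):
    """Fix a base only if exactly one substitution makes all k-mers covering it trusted."""
    bases = list(read)
    for pos in range(len(bases)):
        # Start indices of the k-mers that cover position `pos`.
        start = max(0, pos - k + 1)
        end = min(pos, len(bases) - k)

        def trusted(seq):
            return all(counts[seq[i:i + k]] >= min_count
                       for i in range(start, end + 1))

        current = "".join(bases)
        if trusted(current):
            continue
        candidates = []
        for base in "ACGT":
            if base == bases[pos]:
                continue
            trial = current[:pos] + base + current[pos + 1:]
            if trusted(trial):
                candidates.append(base)
        # Ambiguous positions (zero or several candidates) are left unchanged.
        if len(candidates) == 1:
            bases[pos] = candidates[0]
    return "".join(bases)


def iterative_correction(reads, k_values, min_count=2):
    """Run one correction round per k, recounting k-mers on the corrected reads."""
    for k in k_values:
        counts = kmer_counts(reads, k)
        reads = [correct_read(r, counts, k, min_count) for r in reads]
    return reads
```

Calling `iterative_correction(reads, k_values=(4, 6))` on a list of read strings runs two rounds. In this sketch a small k gives higher k-mer counts at low coverage, while a larger k is less likely to mistake an erroneous k-mer for a trusted one because it also occurs in another repeat copy, which is the trade-off the abstract refers to.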
Affiliation(s)
- Katrin Sameith
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
- Juliana G Roscito
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
- Michael Hiller
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
  - Corresponding author. Michael Hiller. Max Planck Institute of Molecular Cell Biology and Genetics & Max Planck Institute for the Physics of Complex Systems, 01307 Dresden, Germany. E-mail:
|
118
|
Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016; 17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 190] [Impact Index Per Article: 21.1] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022] Open
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
|
119
|
Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads. Gigascience 2015; 4:48. [PMID: 26500767 PMCID: PMC4615873 DOI: 10.1186/s13742-015-0089-y] [Citation(s) in RCA: 329] [Impact Index Per Article: 32.9] [Received: 06/01/2015] [Accepted: 10/09/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing. FINDINGS We developed a k-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted k-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted k-mers, Rcorrector computes a local threshold at every position in a read. CONCLUSIONS Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.
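The contrast between a global and a per-position threshold can be made concrete with a small sketch. The Python below is not Rcorrector's algorithm: it uses a plain k-mer counter instead of a De Bruijn graph, and the local rule (a k-mer is untrusted if its count falls below a fraction of the highest count in a window of neighbouring k-mers) is a stand-in chosen for illustration. The names `count_kmers`, `untrusted_global`, `untrusted_local`, `window` and `fraction` are all hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def untrusted_global(read, counts, k, threshold):
    """Start positions of k-mers whose count is below one fixed threshold."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < threshold]


def untrusted_local(read, counts, k, window=5, fraction=0.5):
    """Start positions of k-mers that are rare relative to their neighbours.

    Assumed rule: the threshold at position i is `fraction` times the
    largest k-mer count within `window` positions on either side of i.
    """
    per_position = [counts[read[i:i + k]] for i in range(len(read) - k + 1)]
    flagged = []
    for i, count in enumerate(per_position):
        lo = max(0, i - window)
        hi = min(len(per_position), i + window + 1)
        local_threshold = fraction * max(per_position[lo:hi])
        if count < local_threshold:
            flagged.append(i)
    return flagged
```

With reads from one highly expressed and one lowly expressed transcript, a global threshold set for the abundant transcript flags every k-mer of the rare one, whereas the local rule flags only k-mers that are rare compared with the rest of the same read.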
|
120
|
Abstract
FermiKit is a variant calling pipeline for Illumina whole-genome germline data. It de novo assembles short reads and then maps the assembly against a reference genome to call SNPs, short insertions/deletions and structural variations. FermiKit takes about one day to assemble 30-fold human whole-genome data on a modern 16-core server with 85 GB RAM at the peak, and calls variants in half an hour to an accuracy comparable to the current practice. FermiKit assembly is a reduced representation of raw data while retaining most of the original information. AVAILABILITY AND IMPLEMENTATION https://github.com/lh3/fermikit. CONTACT hengli@broadinstitute.org.
Affiliation(s)
- Heng Li
  - Genomics Platform, Broad Institute, Cambridge, MA 02142, USA
|