1
|
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
Collapse
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
| | - Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | | | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
| |
Collapse
|
2
|
Sami A, El-Metwally S, Rashad MZ. MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads. BMC Bioinformatics 2024; 25:61. [PMID: 38321434 PMCID: PMC10848413 DOI: 10.1186/s12859-024-05681-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2023] [Accepted: 01/29/2024] [Indexed: 02/08/2024] Open
Abstract
BACKGROUND The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality as errors become more prevalent. This introduces the need to utilize different approaches to detect and filtrate errors, and data quality assurance is moved from the hardware space to the software preprocessing stages. RESULTS We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms. These models are trained on features extracted through the computation of Term Frequency-Inverse Document Frequency (TF_IDF) values from various datasets such as E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1 and Metriaclima zebra. Notably, Naive Bayes demonstrated robust performance across various datasets, displaying high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, tools like Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed rates above 99%. BFC and RECKONER exceeded 98%, while Fiona had 95.78%. For the Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For the Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also producing reasonable numbers of mapped reads to the reference genome. CONCLUSIONS This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.
Collapse
Affiliation(s)
- Amira Sami
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
| | - Sara El-Metwally
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt.
- Biomedical Informatics Department, Faculty of Computer Science and Engineering, New Mansoura University, Gamasa, 35712, Egypt.
| | - M Z Rashad
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
| |
Collapse
|
3
|
Długosz M, Deorowicz S. Illumina reads correction: evaluation and improvements. Sci Rep 2024; 14:2232. [PMID: 38278837 PMCID: PMC11222498 DOI: 10.1038/s41598-024-52386-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2023] [Accepted: 01/18/2024] [Indexed: 01/28/2024] Open
Abstract
The paper focuses on the correction of Illumina WGS sequencing reads. We provide an extensive evaluation of the existing correctors. To this end, we measure an impact of the correction on variant calling (VC) as well as de novo assembly. It shows, that in selected cases read correction improves the VC results quality. We also examine the algorithms behaviour in a processing of Illumina NovaSeq reads, with different reads quality characteristics than in older sequencers. We show that most of the algorithms are ready to cope with such reads. Finally, we introduce a new version of RECKONER, our read corrector, by optimizing it and equipping with a new correction strategy. Currently, RECKONER allows to correct high-coverage human reads in less than 2.5 h, is able to cope with two types of reads errors: indels and substitutions, and utilizes a new, based on a two lengths of oligomers, correction verification technique.
Collapse
Affiliation(s)
- Maciej Długosz
- Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100, Gliwice, Poland
| | - Sebastian Deorowicz
- Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100, Gliwice, Poland.
| |
Collapse
|
4
|
Tang T, Hutvagner G, Wang W, Li J. Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies. Brief Funct Genomics 2022; 21:387-398. [PMID: 35848773 DOI: 10.1093/bfgp/elac016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2022] [Revised: 06/10/2022] [Accepted: 06/14/2022] [Indexed: 11/14/2022] Open
Abstract
Next-Generation Sequencing has produced incredible amounts of short-reads sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both the de novo assembly and error correction methods utilize the overlaps between reads data, a concern is that the will the sequencing errors bring up negative effects on genome assemblies also affect the compression of the NGS data. This work addresses two problems: how current error correction algorithms can enable the compression algorithms to make the sequence data much more compact, and whether the sequence-modified reads by the error-correction algorithms will lead to quality improvement for de novo contig assembly. As multiple sets of short reads are often produced by a single biomedical project in practice, we propose a graph-based method to reorder the files in the collection of multiple sets and then compress them simultaneously for a further compression improvement after error correction. We use examples to illustrate that accurate error correction algorithms can significantly reduce the number of mismatched nucleotides in the reference-free compression, hence can greatly improve the compression performance. Extensive test on practical collections of multiple short-read sets does confirm that the compression performance on the error-corrected data (with unchanged size) significantly outperforms that on the original data, and that the file reordering idea contributes furthermore. The error correction on the original reads has also resulted in quality improvements of the genome assemblies, sometimes remarkably. However, it is still an open question that how to combine appropriate error correction methods with an assembly algorithm so that the assembly performance can be always significantly improved.
Collapse
Affiliation(s)
- Tao Tang
- Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia.,School of Mordern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210003, Jiangsu, China
| | - Gyorgy Hutvagner
- School of Biomedical Engineering, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
| | - Wenjian Wang
- School of Computer and Information Technology, Shanxi University, Shanxi Road, 030006, Shanxi, China
| | - Jinyan Li
- Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
| |
Collapse
|
5
|
Kallenborn F, Cascitti J, Schmidt B. CARE 2.0: reducing false-positive sequencing error corrections using machine learning. BMC Bioinformatics 2022; 23:227. [PMID: 35698033 PMCID: PMC9195321 DOI: 10.1186/s12859-022-04754-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2022] [Accepted: 05/30/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools. RESULTS We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders-of-magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 is able to achieve high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads CARE 2.0 generates only 1.2M false positives (FPs) (and 801.4M true positives (TPs)) at a highly competitive runtime while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data. CONCLUSION False-positive corrections can negatively influence down-stream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, higher-quality datasets are produced which improve k-mer analysis and de-novo assembly in real-world datasets which demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE .
Collapse
Affiliation(s)
- Felix Kallenborn
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany.
| | - Julian Cascitti
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
| | - Bertil Schmidt
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
| |
Collapse
|
6
|
Mencke JL, He Y, Filippov AA, Nikolich MP, Belew AT, Fouts DE, McGann PT, Swierczewski BE, Getnet D, Ellison DW, Margulieux KR. Identification and Characterization of vB_PreP_EPr2, a Lytic Bacteriophage of Pan-Drug Resistant Providencia rettgeri. Viruses 2022; 14:v14040708. [PMID: 35458437 PMCID: PMC9026810 DOI: 10.3390/v14040708] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2022] [Revised: 03/14/2022] [Accepted: 03/23/2022] [Indexed: 12/04/2022] Open
Abstract
Providencia rettgeri is an emerging opportunistic Gram-negative pathogen with reports of increasing antibiotic resistance. Pan-drug resistant (PDR) P. rettgeri infections are a growing concern, demonstrating a need for the development of alternative treatment options which is fueling a renewed interest in bacteriophage (phage) therapy. Here, we identify and characterize phage vB_PreP_EPr2 (EPr2) with lytic activity against PDR P. rettgeri MRSN 845308, a clinical isolate that carries multiple antibiotic resistance genes. EPr2 was isolated from an environmental water sample and belongs to the family Autographiviridae, subfamily Studiervirinae and genus Kayfunavirus, with a genome size of 41,261 base pairs. Additional phenotypic characterization showed an optimal MOI of 1 and a burst size of 12.3 ± 3.4 PFU per bacterium. EPr2 was determined to have a narrow host range against a panel of clinical P. rettgeri strains. Despite this fact, EPr2 is a promising lytic phage with potential for use as an alternative therapeutic for treatment of PDR P. rettgeri infections.
Collapse
Affiliation(s)
- Jaime L. Mencke
- Wound Infections Department, Bacterial Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA; (J.L.M.); (Y.H.); (A.A.F.); (M.P.N.); (A.T.B.); (D.G.)
- F. Edward Hébert School of Medicine, Uniformed Services University of the Health Sciences, Bethesda, MD 20814, USA
| | - Yunxiu He
- Wound Infections Department, Bacterial Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA; (J.L.M.); (Y.H.); (A.A.F.); (M.P.N.); (A.T.B.); (D.G.)
| | - Andrey A. Filippov
- Wound Infections Department, Bacterial Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA; (J.L.M.); (Y.H.); (A.A.F.); (M.P.N.); (A.T.B.); (D.G.)
| | - Mikeljon P. Nikolich
- Wound Infections Department, Bacterial Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA; (J.L.M.); (Y.H.); (A.A.F.); (M.P.N.); (A.T.B.); (D.G.)
| | - Ashton T. Belew
- Wound Infections Department, Bacterial Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA; (J.L.M.); (Y.H.); (A.A.F.); (M.P.N.); (A.T.B.); (D.G.)
| | | | - Patrick T. McGann
- Multidrug-Resistant Organism Repository and Surveillance Network (MRSN), Bacterial Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA;
| | - Brett E. Swierczewski
- Bacterial Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA;
| | - Derese Getnet
- Wound Infections Department, Bacterial Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA; (J.L.M.); (Y.H.); (A.A.F.); (M.P.N.); (A.T.B.); (D.G.)
| | - Damon W. Ellison
- Wound Infections Department, Bacterial Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA; (J.L.M.); (Y.H.); (A.A.F.); (M.P.N.); (A.T.B.); (D.G.)
- Correspondence: (D.W.E.); (K.R.M.)
| | - Katie R. Margulieux
- Wound Infections Department, Bacterial Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA; (J.L.M.); (Y.H.); (A.A.F.); (M.P.N.); (A.T.B.); (D.G.)
- Correspondence: (D.W.E.); (K.R.M.)
| |
Collapse
|
7
|
Sharma A, Jain P, Mahgoub A, Zhou Z, Mahadik K, Chaterji S. Lerna: transformer architectures for configuring error correction tools for short- and long-read genome sequencing. BMC Bioinformatics 2022; 23:25. [PMID: 34991450 PMCID: PMC8734100 DOI: 10.1186/s12859-021-04547-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2021] [Accepted: 12/20/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Sequencing technologies are prone to errors, making error correction (EC) necessary for downstream applications. EC tools need to be manually configured for optimal performance. We find that the optimal parameters (e.g., k-mer size) are both tool- and dataset-dependent. Moreover, evaluating the performance (i.e., Alignment-rate or Gain) of a given tool usually relies on a reference genome, but quality reference genomes are not always available. We introduce Lerna for the automated configuration of k-mer-based EC tools. Lerna first creates a language model (LM) of the uncorrected genomic reads, and then, based on this LM, calculates a metric called the perplexity metric to evaluate the corrected reads for different parameter choices. Next, it finds the one that produces the highest alignment rate without using a reference genome. The fundamental intuition of our approach is that the perplexity metric is inversely correlated with the quality of the assembly after error correction. Therefore, Lerna leverages the perplexity metric for automated tuning of k-mer sizes without needing a reference genome. RESULTS First, we show that the best k-mer value can vary for different datasets, even for the same EC tool. This motivates our design that automates k-mer size selection without using a reference genome. Second, we show the gains of our LM using its component attention-based transformers. We show the model's estimation of the perplexity metric before and after error correction. The lower the perplexity after correction, the better the k-mer size. We also show that the alignment rate and assembly quality computed for the corrected reads are strongly negatively correlated with the perplexity, enabling the automated selection of k-mer values for better error correction, and hence, improved assembly quality. We validate our approach on both short and long reads. Additionally, we show that our attention-based models have significant runtime improvement for the entire pipeline-18[Formula: see text] faster than previous works, due to parallelizing the attention mechanism and the use of JIT compilation for GPU inferencing. CONCLUSION Lerna improves de novo genome assembly by optimizing EC tools. Our code is made available in a public repository at: https://github.com/icanforce/lerna-genomics .
Collapse
Affiliation(s)
| | - Pranjal Jain
- Indian Institute of Technology Bombay, Mumbai, India
| | | | | | | | | |
Collapse
|
8
|
He PY, Shao XQ, Duan SF, Han DY, Li K, Shi JY, Zhang RP, Han PJ, Wang QM, Bai FY. Highly diverged lineages of Saccharomyces paradoxus in temperate to subtropical climate zones in China. Yeast 2021; 39:69-82. [PMID: 34961959 DOI: 10.1002/yea.3688] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2021] [Revised: 12/19/2021] [Accepted: 12/21/2021] [Indexed: 11/06/2022] Open
Abstract
The wild yeast Saccharomyces paradoxus has become a new model in ecology and evolutionary biology. Different lineages of S. paradoxus have been recognized across the world, but the distribution and genetic diversity of the species remain unknown in China, where the origin of its sibling species S. cerevisiae lies. In this study, we investigated the ecological and geographic distribution of S. paradoxus through an extensive field survey in China and performed population genomic analysis on a set of S. paradoxus strains, including 27 strains, representing different geographic and ecological origins within China, and 59 strains representing all the known lineages of the species recognized in the other regions of the world so far. We found two distinct lineages of S. paradoxus in China. The majority of the Chinese strains studied belong to the Far East lineage, and six strains belong to a novel highly diverged lineage. The distribution of these two lineages overlaps ecologically and geographically in temperate to subtropical climate zones in China. With the addition of the new China lineage, the Eurasian population of S. paradoxus exhibits higher genetic diversity than the American population. We observed more possible lineage-specific introgression events from the Eurasian lineages than from the American lineages. Our results expand the knowledge on ecology, genetic diversity, biogeography, and evolution of S. paradoxus.
Collapse
Affiliation(s)
- Peng-Yu He
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Xu-Qian Shao
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China.,Genetic Engineering Division, China National Intellectual Property Administration (CNIPA), Beijing, China
| | - Shou-Fu Duan
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Da-Yong Han
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Kuan Li
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Jun-Yan Shi
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Ri-Peng Zhang
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Pei-Jie Han
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Qi-Ming Wang
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China.,College of Life Sciences, Institute of Life Sciences and Green Development, Hebei University, Baoding, Hebei, China
| | - Feng-Yan Bai
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| |
Collapse
|
9
|
Bower OJ, McCarthy A, Lea RA, Alanis-Lobato G, Zohren J, Gerri C, Turner JMA, Niakan KK. Generating CRISPR-Cas9-Mediated Null Mutations and Screening Targeting Efficiency in Human Pluripotent Stem Cells. Curr Protoc 2021; 1:e232. [PMID: 34432381 DOI: 10.1002/cpz1.232] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/13/2023]
Abstract
CRISPR-Cas9 mutagenesis facilitates the investigation of gene function in a number of developmental and cellular contexts. Human pluripotent stem cells (hPSCs), either embryonic or induced, are a tractable cellular model to investigate molecular mechanisms involved in early human development and cell fate decisions. hPSCs also have broad potential in regenerative medicine to model, investigate, and ameliorate diseases. Here, we provide an optimized protocol for efficient CRISPR-Cas9 genome editing of hPSCs to investigate the functional role of genes by engineering null mutations. We emphasize the importance of screening single guide RNAs (sgRNAs) to identify those with high targeting efficiency for generation of clonally derived null mutant hPSC lines. We provide important considerations for targeting genes that may have a role in hPSC maintenance. We also present methods to evaluate the on-target mutation spectrum and unintended karyotypic changes. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol 1: Selecting and ligating sgRNAs into expression plasmids Basic Protocol 2: Validation of sgRNA via in vitro transcription and cleavage assay Basic Protocol 3: Nucleofection of primed human embryonic stem cells Basic Protocol 4: MiSeq analysis of indel mutations Basic Protocol 5: Single cell cloning of targeted hPSCs Basic Protocol 6: Karyotyping of targeted hPSCs.
Collapse
Affiliation(s)
- Oliver J Bower
- Human Embryo and Stem Cell Laboratory, The Francis Crick Institute, London, United Kingdom
| | - Afshan McCarthy
- Human Embryo and Stem Cell Laboratory, The Francis Crick Institute, London, United Kingdom
| | - Rebecca A Lea
- Human Embryo and Stem Cell Laboratory, The Francis Crick Institute, London, United Kingdom
| | - Gregorio Alanis-Lobato
- Human Embryo and Stem Cell Laboratory, The Francis Crick Institute, London, United Kingdom
| | - Jasmin Zohren
- Sex Chromosome Biology Laboratory, The Francis Crick Institute, London, United Kingdom
| | - Claudia Gerri
- Human Embryo and Stem Cell Laboratory, The Francis Crick Institute, London, United Kingdom
| | - James M A Turner
- Sex Chromosome Biology Laboratory, The Francis Crick Institute, London, United Kingdom
| | - Kathy K Niakan
- Human Embryo and Stem Cell Laboratory, The Francis Crick Institute, London, United Kingdom
- The Centre for Trophoblast Research, Department of Physiology, Development and Neuroscience, University of Cambridge, Cambridge, United Kingdom
| |
Collapse
|
10
|
Kallenborn F, Hildebrandt A, Schmidt B. CARE: context-aware sequencing read error correction. Bioinformatics 2021; 37:889-895. [PMID: 32818262 DOI: 10.1093/bioinformatics/btaa738] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2020] [Revised: 07/14/2020] [Accepted: 08/14/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes. RESULTS We present CARE-an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. AVAILABILITYAND IMPLEMENTATION CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Felix Kallenborn
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
| | - Andreas Hildebrandt
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
| | - Bertil Schmidt
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
| |
Collapse
|
11
|
Frequent loss of heterozygosity in CRISPR-Cas9-edited early human embryos. Proc Natl Acad Sci U S A 2021; 118:2004832117. [PMID: 34050011 DOI: 10.1073/pnas.2004832117] [Citation(s) in RCA: 138] [Impact Index Per Article: 34.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
CRISPR-Cas9 genome editing is a promising technique for clinical applications, such as the correction of disease-associated alleles in somatic cells. The use of this approach has also been discussed in the context of heritable editing of the human germ line. However, studies assessing gene correction in early human embryos report low efficiency of mutation repair, high rates of mosaicism, and the possibility of unintended editing outcomes that may have pathologic consequences. We developed computational pipelines to assess single-cell genomics and transcriptomics datasets from OCT4 (POU5F1) CRISPR-Cas9-targeted and control human preimplantation embryos. This allowed us to evaluate on-target mutations that would be missed by more conventional genotyping techniques. We observed loss of heterozygosity in edited cells that spanned regions beyond the POU5F1 on-target locus, as well as segmental loss and gain of chromosome 6, on which the POU5F1 gene is located. Unintended genome editing outcomes were present in ∼16% of the human embryo cells analyzed and spanned 4-20 kb. Our observations are consistent with recent findings indicating complexity at on-target sites following CRISPR-Cas9 genome editing. Our work underscores the importance of further basic research to assess the safety of genome editing techniques in human embryos, which will inform debates about the potential clinical use of this technology.
Collapse
|
12
|
Heo Y, Manikandan G, Ramachandran A, Chen D. Comprehensive Evaluation of Error-Correction Methodologies for Genome Sequencing Data. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
|
13
|
Mitchell K, Brito JJ, Mandric I, Wu Q, Knyazev S, Chang S, Martin LS, Karlsberg A, Gerasimov E, Littman R, Hill BL, Wu NC, Yang HT, Hsieh K, Chen L, Littman E, Shabani T, Enik G, Yao D, Sun R, Schroeder J, Eskin E, Zelikovsky A, Skums P, Pop M, Mangul S. Benchmarking of computational error-correction methods for next-generation sequencing data. Genome Biol 2020; 21:71. [PMID: 32183840 PMCID: PMC7079412 DOI: 10.1186/s13059-020-01988-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2019] [Accepted: 03/06/2020] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown. RESULTS In this paper, we evaluate the ability of error correction algorithms to fix errors across different types of datasets that contain various levels of heterogeneity. We highlight the advantages and limitations of computational error correction techniques across different domains of biology, including immunogenomics and virology. To demonstrate the efficacy of our technique, we apply the UMI-based high-fidelity sequencing protocol to eliminate sequencing errors from both simulated data and the raw reads. We then perform a realistic evaluation of error-correction methods. CONCLUSIONS In terms of accuracy, we find that method performance varies substantially across different types of datasets with no single method performing best on all types of examined data. Finally, we also identify the techniques that offer a good balance between precision and sensitivity.
Collapse
Affiliation(s)
- Keith Mitchell
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Jaqueline J Brito
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
| | - Igor Mandric
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
| | - Qiaozhen Wu
- Department of Mathematics, University of California Los Angeles, 520 Portola Plaza, Los Angeles, CA, 90095, USA
| | - Sergey Knyazev
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
| | - Sei Chang
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Lana S Martin
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
| | - Aaron Karlsberg
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
| | - Ekaterina Gerasimov
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
| | - Russell Littman
- UCLA Bioinformatics, 621 Charles E Young Dr S, Los Angeles, CA, 90024, USA
| | - Brian L Hill
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Nicholas C Wu
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Harry Taegyun Yang
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Kevin Hsieh
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Linus Chen
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Eli Littman
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Taylor Shabani
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - German Enik
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Douglas Yao
- Department of Molecular, Cell, and Developmental Biology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
| | - Ren Sun
- Department of Molecular and Medical Pharmacology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
| | - Jan Schroeder
- Epigenetics & Reprogramming Laboratory, Monash University, 15 Innovation Walk, Melbourne, VIC, 3800, Australia
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- The Laboratory of Bioinformatics, I.M, Sechenov First Moscow State Medical University, Moscow, Russia, 119991
| | - Pavel Skums
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
| | - Mihai Pop
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20742, USA
| | - Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA.
| |
Collapse
|
14
|
Das AK, Goswami S, Lee K, Park SJ. A hybrid and scalable error correction algorithm for indel and substitution errors of long reads. BMC Genomics 2019; 20:948. [PMID: 31856721 PMCID: PMC6923905 DOI: 10.1186/s12864-019-6286-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND Long-read sequencing has shown the promises to overcome the short length limitations of second-generation sequencing by providing more complete assembly. However, the computation of the long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to the short reads. METHODS In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high throughput Illumina short-read sequences to rectify the PacBio long-read sequences.ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low and high coverage regions, followed by a majority voting to rectify each substituted error base. RESULTS ParLECH outperforms latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation results demonstrate that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% bases of an E. coli PacBio dataset with the reference genome, proving its accuracy. CONCLUSION ParLECH can scale to over terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.
Collapse
Affiliation(s)
- Arghya Kusum Das
- Department of Computer Science and Software Engineering, University of Wisconsin at Platteville, Platteville, WI USA
| | - Sayan Goswami
- School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
| | - Kisung Lee
- School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
| | - Seung-Jong Park
- School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
| |
Collapse
|
15
|
Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models. Sci Rep 2019; 9:16157. [PMID: 31695060 PMCID: PMC6834855 DOI: 10.1038/s41598-019-52196-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2019] [Accepted: 10/07/2019] [Indexed: 01/30/2023] Open
Abstract
The performance of most error-correction (EC) algorithms that operate on genomics reads is dependent on the proper choice of its configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction and consequently improve genome assembly. We perform this in an adaptive manner, adapted to different datasets and to EC tools, due to the observation that different configuration parameters are optimal for different datasets, i.e., from different platforms and species, and vary with the EC algorithm being applied. We use language modeling techniques from the Natural Language Processing (NLP) domain in our algorithmic suite, Athena, to automatically tune the performance-sensitive configuration parameters. Through the use of N-Gram and Recurrent Neural Network (RNN) language modeling, we validate the intuition that the EC performance can be computed quantitatively and efficiently using the “perplexity” metric, repurposed from NLP. After training the language model, we show that the perplexity metric calculated from a sample of the test (or production) data has a strong negative correlation with the quality of error correction of erroneous NGS reads. Therefore, we use the perplexity metric to guide a hill climbing-based search, converging toward the best configuration parameter value. Our approach is suitable for both de novo and comparative sequencing (resequencing), eliminating the need for a reference genome to serve as the ground truth. We find that Athena can automatically find the optimal value of k with a very high accuracy for 7 real datasets and using 3 different k-mer based EC algorithms, Lighter, Blue, and Racer. The inverse relation between the perplexity metric and alignment rate exists under all our tested conditions—for real and synthetic datasets, for all kinds of sequencing errors (insertion, deletion, and substitution), and for high and low error rates. The absolute value of that correlation is at least 73%. In our experiments, the best value of k found by Athena achieves an alignment rate within 0.53% of the oracle best value of k found through brute force searching (i.e., scanning through the entire range of k values). Athena’s selected value of k lies within the top-3 best k values using N-Gram models and the top-5 best k values using RNN models With best parameter selection by Athena, the assembly quality (NG50) is improved by a Geometric Mean of 4.72X across the 7 real datasets.
Collapse
|
16
|
Fischer-Hwang I, Ochoa I, Weissman T, Hernaez M. Denoising of Aligned Genomic Data. Sci Rep 2019; 9:15067. [PMID: 31636330 PMCID: PMC6803637 DOI: 10.1038/s41598-019-51418-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2019] [Accepted: 09/25/2019] [Indexed: 12/30/2022] Open
Abstract
Noise in genomic sequencing data is known to have effects on various stages of genomic data analysis pipelines. Variant identification is an important step of many of these pipelines, and is increasingly being used in clinical settings to aid medical practices. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual chromosome as well as whole genome sequencing (WGS) data sets. In the WGS data set, denoising led to identification of almost 2,000 additional true variants, and elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE .
Collapse
Affiliation(s)
- Irena Fischer-Hwang
- Stanford University, Department of Electrical Engineering, Stanford, 94305, USA.
| | - Idoia Ochoa
- University of Illinois Urbana-Champaign, Department of Electrical and Computer Engineering, Urbana, 61801, USA
| | - Tsachy Weissman
- Stanford University, Department of Electrical Engineering, Stanford, 94305, USA
| | - Mikel Hernaez
- University of Illinois Urbana-Champaign, Carl R. Woese Institute for Genomic Biology, Urbana, 61801, USA.
| |
Collapse
|
17
|
Heydari M, Miclotte G, Van de Peer Y, Fostier J. Illumina error correction near highly repetitive DNA regions improves de novo genome assembly. BMC Bioinformatics 2019; 20:298. [PMID: 31159722 PMCID: PMC6545690 DOI: 10.1186/s12859-019-2906-2] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2019] [Accepted: 05/17/2019] [Indexed: 11/10/2022] Open
Abstract
Background Several standalone error correction tools have been proposed to correct sequencing errors in Illumina data in order to facilitate de novo genome assembly. However, in a recent survey, we showed that state-of-the-art assemblers often did not benefit from this pre-correction step. We found that many error correction tools introduce new errors in reads that overlap highly repetitive DNA regions such as low-complexity patterns or short homopolymers, ultimately leading to a more fragmented assembly. Results We propose BrownieCorrector, an error correction tool for Illumina sequencing data that focuses on the correction of only those reads that overlap short DNA patterns that are highly repetitive in the genome. BrownieCorrector extracts all reads that contain such a pattern and clusters them into different groups using a community detection algorithm that takes into account both the sequence similarity between overlapping reads and their respective paired-end reads. Each cluster holds reads that originate from the same genomic region and hence each cluster can be corrected individually, thus providing a consistent correction for all reads within that cluster. Conclusions BrownieCorrector is benchmarked using six real Illumina datasets for different eukaryotic genomes. The prior use of BrownieCorrector improves assembly results over the use of uncorrected reads in all cases. In comparison with other error correction tools, BrownieCorrector leads to the best assembly results in most cases even though less than 2% of the reads within a dataset are corrected. Additionally, we investigate the impact of error correction on hybrid assembly where the corrected Illumina reads are supplemented with PacBio data. Our results confirm that BrownieCorrector improves the quality of hybrid genome assembly as well. BrownieCorrector is written in standard C++11 and released under GPL license. BrownieCorrector relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at https://github.com/biointec/browniecorrector. Electronic supplementary material The online version of this article (10.1186/s12859-019-2906-2) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Mahdi Heydari
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium.,Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
| | - Giles Miclotte
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium.,Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
| | - Yves Van de Peer
- Bioinformatics Institute Ghent, Ghent, B-9052, Belgium.,Center for Plant Systems Biology, VIB, Ghent, B-9052, Belgium.,Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, B-9052, Belgium.,Department of Genetics, Genome Research Institute, University of Pretoria, Pretoria, South Africa
| | - Jan Fostier
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium. .,Bioinformatics Institute Ghent, Ghent, B-9052, Belgium.
| |
Collapse
|
18
|
Zhao L, Xie J, Bai L, Chen W, Wang M, Zhang Z, Wang Y, Zhao Z, Li J. Mining statistically-solid k-mers for accurate NGS error correction. BMC Genomics 2018; 19:912. [PMID: 30598110 PMCID: PMC6311904 DOI: 10.1186/s12864-018-5272-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND NGS data contains many machine-induced errors. The most advanced methods for the error correction heavily depend on the selection of solid k-mers. A solid k-mer is a k-mer frequently occurring in NGS reads. The other k-mers are called weak k-mers. A solid k-mer does not likely contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers, that are likely to contain no errors, into the remaining set of solid k-mers. Identification of these two subsets of k-mers can improve the correction performance. RESULTS We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and combine them to determine f0. To identify the two special subsets of k-mers, we use the z-score of k-mers which measures the number of standard deviations a k-mer's frequency is from the mean. Then these statistically-solid k-mers are used to construct a Bloom filter for error correction. Our method is markedly superior to the state-of-art methods, tested on both real and synthetic NGS data sets. CONCLUSION The z-score is adequate to distinguish solid k-mers from weak k-mers, particularly useful for pinpointing out solid k-mers having very low frequency. Applying z-score on k-mer can markedly improve the error correction accuracy.
Collapse
Affiliation(s)
- Liang Zhao
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Jin Xie
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Lin Bai
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Wen Chen
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Mingju Wang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Zhonglei Zhang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Yiqi Wang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Zhe Zhao
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering & IT, University of Technology Sydney, NSW 2007, Australia
| |
Collapse
|
19
|
Savel D, LaFramboise T, Grama A, Koyuturk M. Pluribus-Exploring the Limits of Error Correction Using a Suffix Tree. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:1378-1388. [PMID: 27362987 PMCID: PMC5754272 DOI: 10.1109/tcbb.2016.2586060] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Next generation sequencing technologies enable efficient and cost-effective genome sequencing. However, sequencing errors increase the complexity of the de novo assembly process, and reduce the quality of the assembled sequences. Many error correction techniques utilizing substring frequencies have been developed to mitigate this effect. In this paper, we present a novel and effective method called Pluribus, for correcting sequencing errors using a generalized suffix trie. Pluribus utilizes multiple manifestations of an error in the trie to accurately identify errors and suggest corrections. We show that Pluribus produces the least number of false positives across a diverse set of real sequencing datasets when compared to other methods. Furthermore, Pluribus can be used in conjunction with other contemporary error correction methods to achieve higher levels of accuracy than either tool alone. These increases in error correction accuracy are also realized in the quality of the contigs that are generated during assembly. We explore, in-depth, the behavior of Pluribus , to explain the observed improvement in accuracy and assembly performance. Pluribus is freely available at http://compbio. CASE edu/pluribus/.
Collapse
|
20
|
Evaluation of the impact of Illumina error correction tools on de novo genome assembly. BMC Bioinformatics 2017; 18:374. [PMID: 28821237 PMCID: PMC5563063 DOI: 10.1186/s12859-017-1784-8] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2017] [Accepted: 08/11/2017] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods. RESULTS For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy. CONCLUSIONS We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.
Collapse
|
21
|
Dlugosz M, Deorowicz S. RECKONER: read error corrector based on KMC. Bioinformatics 2017; 33:1086-1089. [PMID: 28062451 DOI: 10.1093/bioinformatics/btw746] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2016] [Accepted: 11/24/2016] [Indexed: 11/12/2022] Open
Abstract
Summary Presence of sequencing errors in data produced by next-generation sequencers affects quality of downstream analyzes. Accuracy of them can be improved by performing error correction of sequencing reads. We introduce a new correction algorithm capable of processing eukaryotic close to 500 Mbp-genome-size, high error-rated data using less than 4 GB of RAM in about 35 min on 16-core computer. Availability and Implementation Program is freely available at http://sun.aei.polsl.pl/REFRESH/reckoner . Contact sebastian.deorowicz@polsl.pl. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
|
22
|
Zhao L, Chen Q, Li W, Jiang P, Wong L, Li J. MapReduce for accurate error correction of next-generation sequencing data. Bioinformatics 2017; 33:3844-3851. [PMID: 28205674 DOI: 10.1093/bioinformatics/btx089] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2016] [Accepted: 02/14/2017] [Indexed: 11/14/2022] Open
Affiliation(s)
- Liang Zhao
- School of Computing and Electronic Information, Guangxi University, Nanning, China
- Taihe Hospital, Hubei University of Medicine, Hubei, China
| | - Qingfeng Chen
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Wencui Li
- Taihe Hospital, Hubei University of Medicine, Hubei, China
| | - Peng Jiang
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Limsoon Wong
- School of Computing, National University of Singapore, Singapore, Singapore
| | - Jinyan Li
- Advanced Analytics Institute and Centre for Health Technologies, University of Technology Sydney, Broadway, NSW, Australia
| |
Collapse
|
23
|
From next-generation resequencing reads to a high-quality variant data set. Heredity (Edinb) 2016; 118:111-124. [PMID: 27759079 DOI: 10.1038/hdy.2016.102] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2016] [Revised: 09/03/2016] [Accepted: 09/06/2016] [Indexed: 12/11/2022] Open
Abstract
Sequencing has revolutionized biology by permitting the analysis of genomic variation at an unprecedented resolution. High-throughput sequencing is fast and inexpensive, making it accessible for a wide range of research topics. However, the produced data contain subtle but complex types of errors, biases and uncertainties that impose several statistical and computational challenges to the reliable detection of variants. To tap the full potential of high-throughput sequencing, a thorough understanding of the data produced as well as the available methodologies is required. Here, I review several commonly used methods for generating and processing next-generation resequencing data, discuss the influence of errors and biases together with their resulting implications for downstream analyses and provide general guidelines and recommendations for producing high-quality single-nucleotide polymorphism data sets from raw reads by highlighting several sophisticated reference-based methods representing the current state of the art.
Collapse
|
24
|
Genome Sequence and Analysis of a Stress-Tolerant, Wild-Derived Strain of Saccharomyces cerevisiae Used in Biofuels Research. G3-GENES GENOMES GENETICS 2016; 6:1757-66. [PMID: 27172212 PMCID: PMC4889671 DOI: 10.1534/g3.116.029389] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/02/2022]
Abstract
The genome sequences of more than 100 strains of the yeast Saccharomyces cerevisiae have been published. Unfortunately, most of these genome assemblies contain dozens to hundreds of gaps at repetitive sequences, including transposable elements, tRNAs, and subtelomeric regions, which is where novel genes generally reside. Relatively few strains have been chosen for genome sequencing based on their biofuel production potential, leaving an additional knowledge gap. Here, we describe the nearly complete genome sequence of GLBRCY22-3 (Y22-3), a strain of S. cerevisiae derived from the stress-tolerant wild strain NRRL YB-210 and subsequently engineered for xylose metabolism. After benchmarking several genome assembly approaches, we developed a pipeline to integrate Pacific Biosciences (PacBio) and Illumina sequencing data and achieved one of the highest quality genome assemblies for any S. cerevisiae strain. Specifically, the contig N50 is 693 kbp, and the sequences of most chromosomes, the mitochondrial genome, and the 2-micron plasmid are complete. Our annotation predicts 92 genes that are not present in the reference genome of the laboratory strain S288c, over 70% of which were expressed. We predicted functions for 43 of these genes, 28 of which were previously uncharacterized and unnamed. Remarkably, many of these genes are predicted to be involved in stress tolerance and carbon metabolism and are shared with a Brazilian bioethanol production strain, even though the strains differ dramatically at most genetic loci. The Y22-3 genome sequence provides an exceptionally high-quality resource for basic and applied research in bioenergy and genetics.
Collapse
|
25
|
The A, C, G, and T of Genome Assembly. BIOMED RESEARCH INTERNATIONAL 2016; 2016:6329217. [PMID: 27247941 PMCID: PMC4877455 DOI: 10.1155/2016/6329217] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/28/2015] [Accepted: 12/22/2015] [Indexed: 11/18/2022]
Abstract
Genome assembly in its two decades of history has produced significant research, in terms of both biotechnology and computational biology. This contribution delineates sequencing platforms and their characteristics, examines key steps involved in filtering and processing raw data, explains assembly frameworks, and discusses quality statistics for the assessment of the assembled sequence. Furthermore, the paper explores recent Ubuntu-based software environments oriented towards genome assembly as well as some avenues for future research.
Collapse
|
26
|
Ni S, Stoneking M. Improvement in detection of minor alleles in next generation sequencing by base quality recalibration. BMC Genomics 2016; 17:139. [PMID: 26920804 PMCID: PMC4769523 DOI: 10.1186/s12864-016-2463-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2015] [Accepted: 02/11/2016] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Minor allele detection in very high coverage sequence data (>1000X) has many applications such as detecting mtDNA heteroplasmy, somatic mutations in cancer or tumors, SNP calling in pool sequencing, etc., where reads with low frequency are not necessarily sequence error but may instead convey biological information. However, the suitability of common base quality recalibration tools for such applications has not been investigated in detail. RESULTS We show that the widely used tool GATK BaseRecalibration has several limitations in minor allele detection. First, GATK IndelRealignment fails to work if the sequence coverage is above a certain level since it then becomes computationally infeasible. Second, the accuracy of the base quality largely depends on the database of known SNPs as the control, which limits the ability of de novo minor allele detection. Third, GATK reduces the base quality of sequence errors at the cost of reducing scores for true minor alleles. To overcome these limitations, we present a novel approach called SEGREG, which applies segmented regression to control sequences (e.g. phiX174 DNA) spiked into a sequencing run. Based on simulations SEGREG improves both the accuracy of base quality scores and the detection of minor alleles. We further investigate sequence error and recalibration parameters by applying a Logarithm Likelihood Ratio (LLR) approach to SEGREG recalibrated base quality scores for phiX174 DNA sequenced to very high coverage, and for mtDNA genome sequences previously analyzed for heteroplasmic variants. CONCLUSIONS Our results suggest that SEGREG improves base recalibration without suffering the limitations discussed above, and the LLR approach benefits from SEGREG in identifying more true minor alleles, while avoiding false positives from sequencing error.
Collapse
Affiliation(s)
- Shengyu Ni
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, D04103, Germany.
| | - Mark Stoneking
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, D04103, Germany.
| |
Collapse
|
27
|
Sameith K, Roscito JG, Hiller M. Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly. Brief Bioinform 2016; 18:1-8. [PMID: 26868358 PMCID: PMC5221426 DOI: 10.1093/bib/bbw003] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2015] [Revised: 01/02/2016] [Indexed: 11/13/2022] Open
Abstract
Next-generation sequencers such as Illumina can now produce reads up to 300 bp with high throughput, which is attractive for genome assembly. A first step in genome assembly is to computationally correct sequencing errors. However, correcting all errors in these longer reads is challenging. Here, we show that reads with remaining errors after correction often overlap repeats, where short erroneous k-mers occur in other copies of the repeat. We developed an iterative error correction pipeline that runs the previously published String Graph Assembler (SGA) in multiple rounds of k-mer-based correction with an increasing k-mer size, followed by a final round of overlap-based correction. By combining the advantages of small and large k-mers, this approach corrects more errors in repeats and minimizes the total amount of erroneous reads. We show that higher read accuracy increases contig lengths two to three times. We provide SGA-Iteratively Correcting Errors (https://github.com/hillerlab/IterativeErrorCorrection/) that implements iterative error correction by using modules from SGA.
Collapse
Affiliation(s)
- Katrin Sameith
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
| | - Juliana G Roscito
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
| | - Michael Hiller
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
- Corresponding author. Michael Hiller. Max Planck Institute of Molecular Cell Biology and Genetics & Max Planck Institute for the Physics of Complex Systems, 01307 Dresden, Germany. E-mail:
| |
Collapse
|
28
|
Alic AS, Tomas A, Medina I, Blanquer I. MuffinEc: Error correction for de Novo assembly via greedy partitioning and sequence alignment. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2015.09.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
29
|
Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novostand-alone error correction methods for NGS data. WILEY INTERDISCIPLINARY REVIEWS: COMPUTATIONAL MOLECULAR SCIENCE 2016. [DOI: 10.1002/wcms.1239] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Affiliation(s)
- Andy S. Alic
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
| | - David Ruzafa
- Departamento de Quìmica Fìsica e Instituto de Biotecnologìa, Facultad de Ciencias; Universidad de Granada; Granada Spain
| | - Joaquin Dopazo
- Department of Computational Genomics; Príncipe Felipe Research Centre (CIPF); Valencia Spain
- CIBER de Enfermedades Raras (CIBERER); Valencia Spain
- Functional Genomics Node (INB) at CIPF; Valencia Spain
| | - Ignacio Blanquer
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
- Biomedical Imaging Research Group GIBI 2; Polytechnic University Hospital La Fe; Valencia Spain
| |
Collapse
|
30
|
Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016; 17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 190] [Impact Index Per Article: 21.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022] Open
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
Collapse
|
31
|
Abstract
Background In highly parallel next-generation sequencing (NGS) techniques millions to billions of short reads are produced from a genomic sequence in a single run. Due to the limitation of the NGS technologies, there could be errors in the reads. The error rate of the reads can be reduced with trimming and by correcting the erroneous bases of the reads. It helps to achieve high quality data and the computational complexity of many biological applications will be greatly reduced if the reads are first corrected. We have developed a novel error correction algorithm called EC and compared it with four other state-of-the-art algorithms using both real and simulated sequencing reads. Results We have done extensive and rigorous experiments that reveal that EC is indeed an effective, scalable, and efficient error correction tool. Real reads that we have employed in our performance evaluation are Illumina-generated short reads of various lengths. Six experimental datasets we have utilized are taken from sequence and read archive (SRA) at NCBI. The simulated reads are obtained by picking substrings from random positions of reference genomes. To introduce errors, some of the bases of the simulated reads are changed to other bases with some probabilities. Conclusions Error correction is a vital problem in biology especially for NGS data. In this paper we present a novel algorithm, called Error Corrector (EC), for correcting substitution errors in biological sequencing reads. We plan to investigate the possibility of employing the techniques introduced in this research paper to handle insertion and deletion errors also. Software availability The implementation is freely available for non-commercial purposes. It can be downloaded from: http://engr.uconn.edu/~rajasek/EC.zip.
Collapse
|
32
|
Abstract
We introduce a new divide and conquer approach to deal with the problem of de novo genome assembly in the presence of ultra-deep sequencing data (i.e. coverage of 1000x or higher). Our proposed meta-assembler Slicembler partitions the input data into optimal-sized ‘slices’ and uses a standard assembly tool (e.g. Velvet, SPAdes, IDBA_UD and Ray) to assemble each slice individually. Slicembler uses majority voting among the individual assemblies to identify long contigs that can be merged to the consensus assembly. To improve its efficiency, Slicembler uses a generalized suffix tree to identify these frequent contigs (or fraction thereof). Extensive experimental results on real ultra-deep sequencing data (8000x coverage) and simulated data show that Slicembler significantly improves the quality of the assembly compared with the performance of the base assembler. In fact, most of the times, Slicembler generates error-free assemblies. We also show that Slicembler is much more resistant against high sequencing error rate than the base assembler. Availability and implementation: Slicembler can be accessed at http://slicembler.cs.ucr.edu/. Contact:hamid.mirebrahim@email.ucr.edu
Collapse
Affiliation(s)
- Hamid Mirebrahim
- Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
| | - Timothy J Close
- Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
| | - Stefano Lonardi
- Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
| |
Collapse
|
33
|
Gézsi A, Bolgár B, Marx P, Sarkozy P, Szalai C, Antal P. VariantMetaCaller: automated fusion of variant calling pipelines for quantitative, precision-based filtering. BMC Genomics 2015; 16:875. [PMID: 26510841 PMCID: PMC4625715 DOI: 10.1186/s12864-015-2050-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2015] [Accepted: 10/06/2015] [Indexed: 11/10/2022] Open
Abstract
Background The low concordance between different variant calling methods still poses a challenge for the wide-spread application of next-generation sequencing in research and clinical practice. A wide range of variant annotations can be used for filtering call sets in order to improve the precision of the variant calls, but the choice of the appropriate filtering thresholds is not straightforward. Variant quality score recalibration provides an alternative solution to hard filtering, but it requires large-scale, genomic data. Results We evaluated germline variant calling pipelines based on BWA and Bowtie 2 aligners in combination with GATK UnifiedGenotyper, GATK HaplotypeCaller, FreeBayes and SAMtools variant callers, using simulated and real benchmark sequencing data (NA12878 with Illumina Platinum Genomes). We argue that these pipelines are not merely discordant, but they extract complementary useful information. We introduce VariantMetaCaller to test the hypothesis that the automated fusion of measurement related information allows better performance than the recommended hard-filtering settings or recalibration and the fusion of the individual call sets without using annotations. VariantMetaCaller uses Support Vector Machines to combine multiple information sources generated by variant calling pipelines and estimates probabilities of variants. This novel method had significantly higher sensitivity and precision than the individual variant callers in all target region sizes, ranging from a few hundred kilobases to whole exomes. We also demonstrated that VariantMetaCaller supports a quantitative, precision based filtering of variants under wider conditions. Specifically, the computed probabilities of the variants can be used to order the variants, and for a given threshold, probabilities can be used to estimate precision. Precision then can be directly translated to the number of true called variants, or equivalently, to the number of false calls, which allows finding problem-specific balance between sensitivity and precision. Conclusions VariantMetaCaller can be applied to small target regions and whole exomes as well, and it can be used in cases of organisms for which highly accurate variant call sets are not yet available, therefore it can be a viable alternative to hard filtering in cases where variant quality score recalibration cannot be used. VariantMetaCaller is freely available at http://bioinformatics.mit.bme.hu/VariantMetaCaller. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-2050-y) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- András Gézsi
- Department of Genetics, Cell- and Immunobiology, Semmelweis University, Nagyvárad tér 4, Budapest, H-1089, Hungary. .,Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary.
| | - Bence Bolgár
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary.
| | - Péter Marx
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary.
| | - Peter Sarkozy
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary.
| | - Csaba Szalai
- Department of Genetics, Cell- and Immunobiology, Semmelweis University, Nagyvárad tér 4, Budapest, H-1089, Hungary.
| | - Péter Antal
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary.
| |
Collapse
|
34
|
Kowalski T, Grabowski S, Deorowicz S. Indexing Arbitrary-Length k-Mers in Sequencing Reads. PLoS One 2015; 10:e0133198. [PMID: 26182400 PMCID: PMC4504488 DOI: 10.1371/journal.pone.0133198] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2015] [Accepted: 06/24/2015] [Indexed: 11/25/2022] Open
Abstract
We propose a lightweight data structure for indexing and querying collections of NGS reads data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive to the existing algorithms in the space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments.
Collapse
Affiliation(s)
- Tomasz Kowalski
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland
| | - Szymon Grabowski
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland
| | - Sebastian Deorowicz
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
| |
Collapse
|
35
|
Allam A, Kalnis P, Solovyev V. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data. Bioinformatics 2015; 31:3421-8. [DOI: 10.1093/bioinformatics/btv415] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2014] [Accepted: 07/08/2015] [Indexed: 11/12/2022] Open
|
36
|
Marçais G, Yorke JA, Zimin A. QuorUM: An Error Corrector for Illumina Reads. PLoS One 2015; 10:e0130821. [PMID: 26083032 PMCID: PMC4471408 DOI: 10.1371/journal.pone.0130821] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2014] [Accepted: 05/26/2015] [Indexed: 11/18/2022] Open
Abstract
Motivation Illumina Sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with low (advertised 1%) error rate, 100 × coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed an error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads and preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available. Results We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented making use of current multi-core computing architectures and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared to using the original reads with SOAPdenovo for the data sets investigated. Availability QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu. Contact gmarcais@umd.edu.
Collapse
Affiliation(s)
- Guillaume Marçais
- IPST, University of Maryland, College Park, MD, USA
- * E-mail: (AZ), (GM)
| | | | - Aleksey Zimin
- IPST, University of Maryland, College Park, MD, USA
- * E-mail: (AZ), (GM)
| |
Collapse
|
37
|
Sheikhizadeh S, de Ridder D. ACE: accurate correction of errors usingK-mer tries. Bioinformatics 2015; 31:3216-8. [DOI: 10.1093/bioinformatics/btv332] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2014] [Accepted: 05/22/2015] [Indexed: 11/13/2022] Open
|
38
|
Yu YW, Yorukoglu D, Peng J, Berger B. Quality score compression improves genotyping accuracy. Nat Biotechnol 2015; 33:240-3. [PMID: 25748910 PMCID: PMC4439189 DOI: 10.1038/nbt.3170] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Affiliation(s)
- Y. William Yu
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; Department of Mathematics at MIT
| | - Deniz Yorukoglu
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - Jian Peng
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; Department of Mathematics at MIT
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; Department of Mathematics at MIT
| |
Collapse
|
39
|
Marinier E, Brown DG, McConkey BJ. Pollux: platform independent error correction of single and mixed genomes. BMC Bioinformatics 2015; 16:10. [PMID: 25592313 PMCID: PMC4307147 DOI: 10.1186/s12859-014-0435-6] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2014] [Accepted: 12/17/2014] [Indexed: 12/13/2022] Open
Abstract
Background Second-generation sequencers generate millions of relatively short, but error-prone, reads. These errors make sequence assembly and other downstream projects more challenging. Correcting these errors improves the quality of assemblies and projects which benefit from error-free reads. Results We have developed a general-purpose error corrector that corrects errors introduced by Illumina, Ion Torrent, and Roche 454 sequencing technologies and can be applied to single- or mixed-genome data. In addition to correcting substitution errors, we locate and correct insertion, deletion, and homopolymer errors while remaining sensitive to low coverage areas of sequencing projects. Using published data sets, we correct 94% of Illumina MiSeq errors, 88% of Ion Torrent PGM errors, 85% of Roche 454 GS Junior errors. Introduced errors are 20 to 70 times more rare than successfully corrected errors. Furthermore, we show that the quality of assemblies improves when reads are corrected by our software. Conclusions Pollux is highly effective at correcting errors across platforms, and is consistently able to perform as well or better than currently available error correction software. Pollux provides general-purpose error correction and may be used in applications with or without assembly.
Collapse
Affiliation(s)
- Eric Marinier
- David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1, Canada.
| | - Daniel G Brown
- David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1, Canada.
| | - Brendan J McConkey
- Department of Biology, University of Waterloo, 200 University Ave W, N2L3G1 Waterloo, Canada.
| |
Collapse
|
40
|
Ilie L, Haider B, Molnar M, Solis-Oba R. SAGE: String-overlap Assembly of GEnomes. BMC Bioinformatics 2014; 15:302. [PMID: 25225118 PMCID: PMC4174676 DOI: 10.1186/1471-2105-15-302] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2014] [Accepted: 08/01/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND De novo genome assembly of next-generation sequencing data is one of the most important current problems in bioinformatics, essential in many biological applications. In spite of significant amount of work in this area, better solutions are still very much needed. RESULTS We present a new program, SAGE, for de novo genome assembly. As opposed to most assemblers, which are de Bruijn graph based, SAGE uses the string-overlap graph. SAGE builds upon great existing work on string-overlap graph and maximum likelihood assembly, bringing an important number of new ideas, such as the efficient computation of the transitive reduction of the string overlap graph, the use of (generalized) edge multiplicity statistics for more accurate estimation of read copy counts, and the improved use of mate pairs and min-cost flow for supporting edge merging. The assemblies produced by SAGE for several short and medium-size genomes compared favourably with those of existing leading assemblers. CONCLUSIONS SAGE benefits from innovations in almost every aspect of the assembly process: error correction of input reads, string-overlap graph construction, read copy counts estimation, overlap graph analysis and reduction, contig extraction, and scaffolding. We hope that these new ideas will help advance the current state-of-the-art in an essential area of research in genomics.
Collapse
Affiliation(s)
- Lucian Ilie
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, Ontario Canada
| | - Bahlul Haider
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, Ontario Canada
| | - Michael Molnar
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, Ontario Canada
| | - Roberto Solis-Oba
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, Ontario Canada
| |
Collapse
|
41
|
Molnar M, Ilie L. Correcting Illumina data. Brief Bioinform 2014; 16:588-99. [PMID: 25183248 DOI: 10.1093/bib/bbu029] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2014] [Accepted: 08/02/2014] [Indexed: 11/12/2022] Open
Abstract
Next-generation sequencing technologies revolutionized the ways in which genetic information is obtained and have opened the door for many essential applications in biomedical sciences. Hundreds of gigabytes of data are being produced, and all applications are affected by the errors in the data. Many programs have been designed to correct these errors, most of them targeting the data produced by the dominant technology of Illumina. We present a thorough comparison of these programs. Both HiSeq and MiSeq types of Illumina data are analyzed, and correcting performance is evaluated as the gain in depth and breadth of coverage, as given by correct reads and k-mers. Time and memory requirements, scalability and parallelism are considered as well. Practical guidelines are provided for the effective use of these tools. We also evaluate the efficiency of the current state-of-the-art programs for correcting Illumina data and provide research directions for further improvement.
Collapse
|
42
|
Greenfield P, Duesing K, Papanicolaou A, Bauer DC. Blue: correcting sequencing errors using consensus and context. ACTA ACUST UNITED AC 2014; 30:2723-32. [PMID: 24919879 DOI: 10.1093/bioinformatics/btu368] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Bioinformatics tools, such as assemblers and aligners, are expected to produce more accurate results when given better quality sequence data as their starting point. This expectation has led to the development of stand-alone tools whose sole purpose is to detect and remove sequencing errors. A good error-correcting tool would be a transparent component in a bioinformatics pipeline, simply taking sequence data in any of the standard formats and producing a higher quality version of the same data containing far fewer errors. It should not only be able to correct all of the types of errors found in real sequence data (substitutions, insertions, deletions and uncalled bases), but it has to be both fast enough and scalable enough to be usable on the large datasets being produced by current sequencing technologies, and work on data derived from both haploid and diploid organisms. RESULTS This article presents Blue, an error-correction algorithm based on k-mer consensus and context. Blue can correct substitution, deletion and insertion errors, as well as uncalled bases. It accepts both FASTQ and FASTA formats, and corrects quality scores for corrected bases. Blue also maintains the pairing of reads, both within a file and between pairs of files, making it compatible with downstream tools that depend on read pairing. Blue is memory efficient, scalable and faster than other published tools, and usable on large sequencing datasets. On the tests undertaken, Blue also proved to be generally more accurate than other published algorithms, resulting in more accurately aligned reads and the assembly of longer contigs containing fewer errors. One significant feature of Blue is that its k-mer consensus table does not have to be derived from the set of reads being corrected. This decoupling makes it possible to correct one dataset, such as small set of 454 mate-pair reads, with the consensus derived from another dataset, such as Illumina reads derived from the same DNA sample. Such cross-correction can greatly improve the quality of small (and expensive) sets of long reads, leading to even better assemblies and higher quality finished genomes. AVAILABILITY AND IMPLEMENTATION The code for Blue and its related tools are available from http://www.bioinformatics.csiro.au/Blue. These programs are written in C# and run natively under Windows and under Mono on Linux.
Collapse
Affiliation(s)
- Paul Greenfield
- CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
| | - Konsta Duesing
- CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
| | - Alexie Papanicolaou
- CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
| | - Denis C Bauer
- CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
| |
Collapse
|
43
|
Knief C. Analysis of plant microbe interactions in the era of next generation sequencing technologies. FRONTIERS IN PLANT SCIENCE 2014; 5:216. [PMID: 24904612 PMCID: PMC4033234 DOI: 10.3389/fpls.2014.00216] [Citation(s) in RCA: 120] [Impact Index Per Article: 10.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/15/2014] [Accepted: 04/30/2014] [Indexed: 05/18/2023]
Abstract
Next generation sequencing (NGS) technologies have impressively accelerated research in biological science during the last years by enabling the production of large volumes of sequence data to a drastically lower price per base, compared to traditional sequencing methods. The recent and ongoing developments in the field allow addressing research questions in plant-microbe biology that were not conceivable just a few years ago. The present review provides an overview of NGS technologies and their usefulness for the analysis of microorganisms that live in association with plants. Possible limitations of the different sequencing systems, in particular sources of errors and bias, are critically discussed and methods are disclosed that help to overcome these shortcomings. A focus will be on the application of NGS methods in metagenomic studies, including the analysis of microbial communities by amplicon sequencing, which can be considered as a targeted metagenomic approach. Different applications of NGS technologies are exemplified by selected research articles that address the biology of the plant associated microbiota to demonstrate the worth of the new methods.
Collapse
Affiliation(s)
- Claudia Knief
- Institute of Crop Science and Resource Conservation—Molecular Biology of the Rhizosphere, Faculty of Agriculture, University of BonnBonn, Germany
| |
Collapse
|
44
|
Yu YW, Yorukoglu D, Berger B. Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification. RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY : ... ANNUAL INTERNATIONAL CONFERENCE, RECOMB ... : PROCEEDINGS. RECOMB (CONFERENCE : 2005- ) 2014; 8394:385-399. [PMID: 28825060 DOI: 10.1007/978-3-319-05269-4_31] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]
Abstract
UNLABELLED It is becoming increasingly impractical to indefinitely store raw sequencing data for later processing in an uncompressed state. In this paper, we describe a scalable compressive framework, Read-Quality-Sparsifier (RQS), which substantially outperforms the compression ratio and speed of other de novo quality score compression methods while maintaining SNP-calling accuracy. Surprisingly, RQS also improves the SNP-calling accuracy on a gold-standard, real-life sequencing dataset (NA12878) using a k-mer density profile constructed from 77 other individuals from the 1000 Genomes Project. This improvement in downstream accuracy emerges from the observation that quality score values within NGS datasets are inherently encoded in the k-mer landscape of the genomic sequences. To our knowledge, RQS is the first scalable sequence based quality compression method that can efficiently compress quality scores of terabyte-sized and larger sequencing datasets. AVAILABILITY An implementation of our method, RQS, is available for download at: http://rqs.csail.mit.edu/.
Collapse
Affiliation(s)
- Y William Yu
- Massachusetts Institute of Technology, Cambridge MA 02139, USA http://people.csail.mit.edu/bab/
| | - Deniz Yorukoglu
- Massachusetts Institute of Technology, Cambridge MA 02139, USA http://people.csail.mit.edu/bab/
| | - Bonnie Berger
- Massachusetts Institute of Technology, Cambridge MA 02139, USA http://people.csail.mit.edu/bab/
| |
Collapse
|
45
|
Zhou X, Rokas A. Prevention, diagnosis and treatment of high-throughput sequencing data pathologies. Mol Ecol 2014; 23:1679-700. [DOI: 10.1111/mec.12680] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2013] [Revised: 01/17/2014] [Accepted: 01/22/2014] [Indexed: 12/17/2022]
Affiliation(s)
- Xiaofan Zhou
- Department of Biological Sciences; Vanderbilt University; Nashville TN 37235 USA
| | - Antonis Rokas
- Department of Biological Sciences; Vanderbilt University; Nashville TN 37235 USA
| |
Collapse
|