1
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024]
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
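As a minimal sketch of the two ideas this abstract names, the snippet below counts overlapping k-mers in a sequence and lists the k-mers that never occur in it. The function names and the toy sequence are illustrative assumptions, not taken from the review. Nullomers are usually defined relative to a whole genome, so the "absent" set here only illustrates the concept for one short string and one fixed k.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def absent_kmers(sequence: str, k: int, alphabet: str = "ACGT") -> list[str]:
    """List all k-mers over the alphabet that never occur in the sequence."""
    observed = count_kmers(sequence, k)
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return sorted(kmer for kmer in candidates if kmer not in observed)


toy_sequence = "ACGTACGTTAGC"  # made-up example, not real genomic data
counts = count_kmers(toy_sequence, 2)
missing = absent_kmers(toy_sequence, 2)
```

Enumerating all 4^k candidates is only practical for small k; larger values would need a more compact representation.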
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
2
Sami A, El-Metwally S, Rashad MZ. MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads. BMC Bioinformatics 2024; 25:61. [PMID: 38321434 PMCID: PMC10848413 DOI: 10.1186/s12859-024-05681-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 01/29/2024] [Indexed: 02/08/2024]
Abstract
BACKGROUND The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality, as errors become more prevalent. This creates a need for different approaches to detect and filter errors, and moves data quality assurance from the hardware space to the software preprocessing stages.
RESULTS We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms. These models are trained on features extracted through the computation of Term Frequency-Inverse Document Frequency (TF-IDF) values from various datasets such as E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1 and Metriaclima zebra. Notably, Naive Bayes demonstrated robust performance across various datasets, displaying high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, tools like Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed rates above 99%. BFC and RECKONER exceeded 98%, while Fiona had 95.78%. For Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also producing reasonable numbers of reads mapped to the reference genome.
CONCLUSIONS This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.
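The feature extraction step described above can be sketched as follows, treating each read as a document and each k-mer as a term. This is only an illustration under stated assumptions: the abstract does not say which TF-IDF variant, k value, or tokenization MAC-ErrorReads uses, so the textbook form tf × log(N / df) and the helper names `kmers` and `tfidf_features` are hypothetical choices.

```python
import math
from collections import Counter


def kmers(read: str, k: int) -> list[str]:
    """Split a read into its overlapping k-mers."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def tfidf_features(reads: list[str], k: int) -> list[dict[str, float]]:
    """Return one {k-mer: TF-IDF weight} mapping per read."""
    docs = [Counter(kmers(read, k)) for read in reads]
    n_docs = len(docs)
    # Document frequency: in how many reads does each k-mer appear at least once?
    doc_freq = Counter(term for doc in docs for term in doc)
    features = []
    for doc in docs:
        total = sum(doc.values())
        features.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in doc.items()
        })
    return features


# Toy reads, invented for illustration only.
features = tfidf_features(["ACGT", "ACGA", "TTTT"], k=2)
```

Vectors like these would then be passed, together with error/no-error labels, to a supervised classifier such as Naive Bayes.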
Affiliation(s)
- Amira Sami
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
- Sara El-Metwally
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
  - Biomedical Informatics Department, Faculty of Computer Science and Engineering, New Mansoura University, Gamasa, 35712, Egypt
- M Z Rashad
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
3
Beiki H, Murdoch BM, Park CA, Kern C, Kontechy D, Becker G, Rincon G, Jiang H, Zhou H, Thorne J, Koltes JE, Michal JJ, Davenport K, Rijnkels M, Ross PJ, Hu R, Corum S, McKay S, Smith TPL, Liu W, Ma W, Zhang X, Xu X, Han X, Jiang Z, Hu ZL, Reecy JM. Enhanced bovine genome annotation through integration of transcriptomics and epi-transcriptomics datasets facilitates genomic biology. Gigascience 2024; 13:giae019. [PMID: 38626724 PMCID: PMC11020238 DOI: 10.1093/gigascience/giae019] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/11/2023] [Revised: 07/29/2023] [Accepted: 03/27/2024] [Indexed: 04/18/2024]
Abstract
BACKGROUND The accurate identification of the functional elements in the bovine genome is a fundamental requirement for high-quality analysis of data informing both genome biology and genomic selection. Functional annotation of the bovine genome was performed to identify a more complete catalog of transcript isoforms across bovine tissues.
RESULTS A total of 160,820 unique transcripts (50% protein coding) representing 34,882 unique genes (60% protein coding) were identified across tissues. Among them, 118,563 transcripts (73% of the total) were structurally validated by independent datasets (PacBio isoform sequencing data, Oxford Nanopore Technologies sequencing data, de novo assembled transcripts from RNA sequencing data) and comparison with Ensembl and NCBI gene sets. In addition, all transcripts were supported by extensive data from different technologies such as whole transcriptome termini site sequencing, RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression, chromatin immunoprecipitation sequencing, and assay for transposase-accessible chromatin using sequencing. A large proportion of identified transcripts (69%) were unannotated, of which 86% were produced by annotated genes and 14% by unannotated genes. A median of two 5' untranslated regions were expressed per gene. Around 50% of protein-coding genes in each tissue were bifunctional and transcribed both coding and noncoding isoforms. Furthermore, we identified 3,744 genes that functioned as noncoding genes in fetal tissues but as protein-coding genes in adult tissues. Our new bovine genome annotation extended more than 11,000 annotated gene borders compared to Ensembl or NCBI annotations. The resulting bovine transcriptome was integrated with publicly available quantitative trait loci data to study tissue-tissue interconnection involved in different traits and construct the first bovine trait similarity network.
CONCLUSIONS These validated results show significant improvement over current bovine genome annotations.
Affiliation(s)
- Hamid Beiki
  - Department of Animal Science, Iowa State University, Ames, IA 50011, USA
- Brenda M Murdoch
  - Department of Animal and Veterinary and Food Science, University of Idaho, ID 83844, USA
- Carissa A Park
  - Department of Animal Science, Iowa State University, Ames, IA 50011, USA
- Chandlar Kern
  - Department of Animal Science, Pennsylvania State University, PA 16802, USA
- Denise Kontechy
  - Department of Animal and Veterinary and Food Science, University of Idaho, ID 83844, USA
- Gabrielle Becker
  - Department of Animal and Veterinary and Food Science, University of Idaho, ID 83844, USA
- Honglin Jiang
  - Department of Animal and Poultry Sciences, Virginia Tech, VA 24060, USA
- Huaijun Zhou
  - Department of Animal Science, University of California, Davis, CA 95616, USA
- Jacob Thorne
  - Department of Animal and Veterinary and Food Science, University of Idaho, ID 83844, USA
- James E Koltes
  - Department of Animal Science, Iowa State University, Ames, IA 50011, USA
- Jennifer J Michal
  - Department of Animal Science, Washington State University, WA 99164, USA
- Kimberly Davenport
  - Department of Animal and Veterinary and Food Science, University of Idaho, ID 83844, USA
- Monique Rijnkels
  - Department of Veterinary Integrative Biosciences, Texas A&M University, TX 77843, USA
- Pablo J Ross
  - Department of Animal Science, University of California, Davis, CA 95616, USA
- Rui Hu
  - Department of Animal and Poultry Sciences, Virginia Tech, VA 24060, USA
- Sarah Corum
  - Zoetis, Parsippany-Troy Hills, NJ 07054, USA
- Wansheng Liu
  - Department of Animal Science, Pennsylvania State University, PA 16802, USA
- Wenzhi Ma
  - Department of Animal Science, Pennsylvania State University, PA 16802, USA
- Xiaohui Zhang
  - Department of Animal Science, Washington State University, WA 99164, USA
- Xiaoqing Xu
  - Department of Animal Science, University of California, Davis, CA 95616, USA
- Xuelei Han
  - Department of Animal Science, Washington State University, WA 99164, USA
- Zhihua Jiang
  - Department of Animal Science, Washington State University, WA 99164, USA
- Zhi-Liang Hu
  - Department of Animal Science, Iowa State University, Ames, IA 50011, USA
- James M Reecy
  - Department of Animal Science, Iowa State University, Ames, IA 50011, USA
4
Yan L, Yin Z, Zhang H, Zhao Z, Wang M, Müller A, Kallenborn F, Wichmann A, Wei Y, Niu B, Schmidt B, Liu W. RabbitQCPlus 2.0: More efficient and versatile quality control for sequencing data. Methods 2023; 216:39-50. [PMID: 37330158 DOI: 10.1016/j.ymeth.2023.06.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2023] [Revised: 05/26/2023] [Accepted: 06/12/2023] [Indexed: 06/19/2023]
Abstract
Assessing the quality of sequencing data plays a crucial role in downstream data analysis. However, existing tools often achieve sub-optimal efficiency, especially when dealing with compressed files or performing complicated quality control operations such as over-representation analysis and error correction. We present RabbitQCPlus, an ultra-efficient quality control tool for modern multi-core systems. RabbitQCPlus uses vectorization, memory copy reduction, parallel (de)compression, and optimized data structures to achieve substantial performance gains. It is 1.1 to 5.4 times faster when performing basic quality control operations compared to state-of-the-art applications yet requires fewer compute resources. Moreover, RabbitQCPlus is at least 4 times faster than other applications when processing gzip-compressed FASTQ files and 1.3 times faster with the error correction module turned on. Furthermore, it takes less than 4 minutes to process 280 GB of plain FASTQ sequencing data, while other applications take at least 22 minutes on a 48-core server when enabling the per-read over-representation analysis. C++ sources are available at https://github.com/RabbitBio/RabbitQCPlus.
Affiliation(s)
- Lifeng Yan
  - School of Software, Shandong University, Jinan, China
- Zekun Yin
  - School of Software, Shandong University, Jinan, China
- Hao Zhang
  - School of Software, Shandong University, Jinan, China
- Zhan Zhao
  - School of Software, Shandong University, Jinan, China
- Mingkai Wang
  - School of Software, Shandong University, Jinan, China
- André Müller
  - Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
- Felix Kallenborn
  - Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
- Alexander Wichmann
  - Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
- Yanjie Wei
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Beifang Niu
  - Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
- Bertil Schmidt
  - Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
- Weiguo Liu
  - School of Software, Shandong University, Jinan, China
5
Cheng C, Fei Z, Xiao P. Methods to improve the accuracy of next-generation sequencing. Front Bioeng Biotechnol 2023; 11:982111. [PMID: 36741756 PMCID: PMC9895957 DOI: 10.3389/fbioe.2023.982111] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/30/2022] [Accepted: 01/11/2023] [Indexed: 01/21/2023]
Abstract
Next-generation sequencing (NGS) is now used in all fields of life science, where it has greatly promoted the development of basic research and is gradually being applied in clinical diagnosis. However, the cost and throughput advantages of NGS are offset by large tradeoffs with respect to read length and accuracy. Specifically, its high error rate makes it extremely difficult to detect SNPs or low-abundance mutations, limiting clinical applications such as pharmacogenomics studies based primarily on SNPs and early clinical diagnosis based primarily on low-abundance mutations. Sanger sequencing is still considered the gold standard because of its high accuracy, so NGS results require verification by Sanger sequencing in clinical practice. To maintain high-quality NGS data, a variety of improvements at the levels of template preparation, sequencing strategy, and data processing have been developed. This review summarizes the general procedures of NGS platforms, highlighting the improvements that help eliminate errors at each step. The challenges and future development of NGS in clinical application are also discussed.
6
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
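To make the de Bruijn graph (DBG) strategy mentioned above more concrete, here is a minimal sketch of graph construction: each distinct k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function name and toy reads are assumptions for illustration. A real assembler would also handle reverse complements, sequencing errors, and graph traversal, none of which appear here.

```python
from collections import defaultdict


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    seen_kmers = set()
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen_kmers:
                continue  # add one edge per distinct k-mer
            seen_kmers.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads, invented for illustration.
graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
```

Following the edges AC, CG, GT, TA spells out ACGTA, the sequence that both toy reads overlap to form.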
7
Lightweight Pattern Matching Method for DNA Sequencing in Internet of Medical Things. Computational Intelligence and Neuroscience 2022; 2022:6980335. [PMID: 36120669 PMCID: PMC9477578 DOI: 10.1155/2022/6980335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Revised: 06/28/2022] [Accepted: 07/29/2022] [Indexed: 11/18/2022]
Abstract
DNA sequencing is an area of medical science that is gaining prominence, and it has been used to detect the genetic mutations responsible for disease. This research focuses on pattern identification methods for DNA-sequencing problems in various applications, such as the alignment and assembly of short reads from next-generation sequencing (NGS), the comparison of DNA sequences, and determining the frequency of a pattern in a sequence. Because DNA sequences are often subject to mutation, approximate matching is as well suited to many applications as exact matching, so recognizing pattern similarity becomes necessary. Approximate matching can also be used in virtually every application that calls for pattern matching, for example spell-checking, spam filtering, and search engines. With the traditional approach, finding a similar pattern when the sequence length is ls and the pattern length is lp takes O(ls · lp) time, because every character of the sequence is compared repeatedly with the pattern. The research aimed to reduce the time complexity of pattern matching by introducing an approach named “optimized pattern similarity identification” (OPSI). This method constructs a table, entitled “shift beyond for avoiding redundant comparison” (SBARC), that records the character distance to skip during matching, so that characters of the text already compared with the pattern are bypassed. OPSI finds the spots in the sequence where similar patterns occur while ignoring at most è mismatches. The experiments identified a time complexity of O(ls · è), where the allowed number of mismatches è is much smaller than the size of the pattern. Aspects such as scalability, generalizability, and performance of the OPSI algorithm are discussed. Compared with the Hamming distance-based approximate pattern matching algorithm, the proposed algorithm is found to be 69% more efficient.
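For reference, the traditional baseline this abstract compares against can be sketched as a sliding-window Hamming comparison that abandons a window once the mismatch budget is exceeded. This is not OPSI and does not build an SBARC table, whose construction the abstract does not specify. The function name, parameter names, and toy strings are assumptions.

```python
def approx_matches(sequence: str, pattern: str, max_mismatches: int) -> list[int]:
    """Return start positions where pattern matches with at most max_mismatches substitutions."""
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        mismatches = 0
        for offset, symbol in enumerate(pattern):
            if sequence[start + offset] != symbol:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # budget exceeded, give up on this window
        else:
            # The inner loop finished without a break, so the window is a match.
            hits.append(start)
    return hits


# Toy sequence and pattern, invented for illustration.
one_mismatch_hits = approx_matches("ACGTACCTAGGT", "ACGT", max_mismatches=1)
exact_hits = approx_matches("ACGTACCTAGGT", "ACGT", max_mismatches=0)
```

In the worst case every window is compared in full, which is the O(ls · lp) behavior described above.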
8
Kallenborn F, Cascitti J, Schmidt B. CARE 2.0: reducing false-positive sequencing error corrections using machine learning. BMC Bioinformatics 2022; 23:227. [PMID: 35698033 PMCID: PMC9195321 DOI: 10.1186/s12859-022-04754-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/03/2022] [Accepted: 05/30/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have a negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools.
RESULTS We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations, its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders of magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 is able to achieve high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads, CARE 2.0 generates only 1.2M false positives (FPs) and 801.4M true positives (TPs) at a highly competitive runtime, while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data.
CONCLUSION False-positive corrections can negatively influence downstream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, higher-quality datasets are produced, which improve k-mer analysis and de-novo assembly in real-world datasets and demonstrate the applicability of machine learning techniques to sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE.
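The counts quoted in this abstract can be turned into precision values with a short calculation, using precision = TP / (TP + FP). The abstract only gives bounds for the other tools (at least 3.9M FPs, at most 814.5M TPs), possibly from different programs, so pairing the highest TP count with the lowest FP count yields an upper bound on their precision rather than a measured value. The helper name is an assumption.

```python
def precision(true_positives: float, false_positives: float) -> float:
    """Fraction of applied corrections that were correct."""
    return true_positives / (true_positives + false_positives)


# Counts in millions, as quoted in the abstract.
care_precision = precision(801.4, 1.2)
# Best case for any other tool: the most TPs combined with the fewest FPs.
other_tools_upper_bound = precision(814.5, 3.9)

print(f"CARE 2.0 precision: {care_precision:.4f}")
print(f"Upper bound for other tools: {other_tools_upper_bound:.4f}")
```

Because precision rises with TPs and falls with FPs, no tool within the stated bounds can exceed the second value.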
Affiliation(s)
- Felix Kallenborn
  - Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
- Julian Cascitti
  - Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
- Bertil Schmidt
  - Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
9
Nogueira-Rodrigues J, Leite SC, Pinto-Costa R, Sousa SC, Luz LL, Sintra MA, Oliveira R, Monteiro AC, Pinheiro GG, Vitorino M, Silva JA, Simão S, Fernandes VE, Provazník J, Benes V, Cruz CD, Safronov BV, Magalhães A, Reis CA, Vieira J, Vieira CP, Tiscórnia G, Araújo IM, Sousa MM. Rewired glycosylation activity promotes scarless regeneration and functional recovery in spiny mice after complete spinal cord transection. Dev Cell 2021; 57:440-450.e7. [PMID: 34986324 DOI: 10.1016/j.devcel.2021.12.008] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 03/26/2021] [Revised: 11/26/2021] [Accepted: 12/08/2021] [Indexed: 12/11/2022]
Abstract
Regeneration of adult mammalian central nervous system (CNS) axons is abortive, resulting in inability to recover function after CNS lesion, including spinal cord injury (SCI). Here, we show that the spiny mouse (Acomys) is an exception to other mammals, being capable of spontaneous and fast restoration of function after severe SCI, re-establishing hind limb coordination. Remarkably, Acomys assembles a scarless pro-regenerative tissue at the injury site, providing a unique structural continuity of the initial spinal cord geometry. The Acomys SCI site shows robust axon regeneration of multiple tracts, synapse formation, and electrophysiological signal propagation. Transcriptomic analysis of the spinal cord following transcriptome reconstruction revealed that Acomys rewires glycosylation biosynthetic pathways, culminating in a specific pro-regenerative proteoglycan signature at SCI site. Our work uncovers that a glycosylation switch is critical for axon regeneration after SCI and identifies β3gnt7, a crucial enzyme of keratan sulfate biosynthesis, as an enhancer of axon growth.
Collapse
Affiliation(s)
- Joana Nogueira-Rodrigues
- Nerve Regeneration Group, Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal; Graduate Program in Molecular and Cell Biology, Instituto de Ciências Biomédicas Abel Salazar (ICBAS), University of Porto, 4050-313 Porto, Portugal
| | - Sérgio C Leite
- Nerve Regeneration Group, Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal
| | - Rita Pinto-Costa
- Nerve Regeneration Group, Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal
| | - Sara C Sousa
- Nerve Regeneration Group, Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal; Graduate Program in Molecular and Cell Biology, Instituto de Ciências Biomédicas Abel Salazar (ICBAS), University of Porto, 4050-313 Porto, Portugal
| | - Liliana L Luz
- Neuronal Networks Group, Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal
| | - Maria A Sintra
- Nerve Regeneration Group, Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal
| | - Raquel Oliveira
- Translational NeuroUrology Group, Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal; Department of Biomedicine, Experimental Biology Unit, Faculty of Medicine of Porto, University of Porto, 4200-319 Porto, Portugal; Regeneration Group, Wolfson Centre for Age-Related Diseases, Institute of Psychiatry, Psychology and Neuroscience, King's College London WC2R 2LS, London, UK
| | - Ana C Monteiro
- Nerve Regeneration Group, Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal
| | - Gonçalo G Pinheiro
- Molecular & Regenerative Medicine Laboratory, Centro de Ciências do Mar (CCMAR), University of Algarve, 8005-139 Faro, Portugal; Faculty of Medicine and Biomedical Sciences, University of Algarve, 8005-139 Faro, Portugal
| | - Marta Vitorino
- Molecular & Regenerative Medicine Laboratory, Centro de Ciências do Mar (CCMAR), University of Algarve, 8005-139 Faro, Portugal; Faculty of Medicine and Biomedical Sciences, University of Algarve, 8005-139 Faro, Portugal
| | - Joana A Silva
- Faculty of Medicine and Biomedical Sciences, University of Algarve, 8005-139 Faro, Portugal
| | - Sónia Simão
- Faculty of Medicine and Biomedical Sciences, University of Algarve, 8005-139 Faro, Portugal; Algarve Biomedical Center Research Institute (ABC-RI), University of Algarve, 8005-139 Faro, Portugal
| | - Vitor E Fernandes
- Faculty of Medicine and Biomedical Sciences, University of Algarve, 8005-139 Faro, Portugal; Algarve Biomedical Center Research Institute (ABC-RI), University of Algarve, 8005-139 Faro, Portugal
| | - Jan Provazník
- Genomics Core Facility, European Molecular Biology Laboratory (EMBL), 69117 Heidelberg, Germany
| | - Vladimir Benes
- Genomics Core Facility, European Molecular Biology Laboratory (EMBL), 69117 Heidelberg, Germany
| | - Célia D Cruz
- Translational NeuroUrology Group, Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal; Department of Biomedicine, Experimental Biology Unit, Faculty of Medicine of Porto, University of Porto, 4200-319 Porto, Portugal
| | - Boris V Safronov
- Neuronal Networks Group, Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal
| | - Ana Magalhães
- Glycobiology in Cancer Group, Institute of Molecular Pathology and Immunology, IPATIMUP), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal; Department of Molecular Biology, Instituto de Ciências Biomédicas Abel Salazar (ICBAS), University of Porto, 4050-313 Porto, Portugal
| | - Celso A Reis
- Glycobiology in Cancer Group, Institute of Molecular Pathology and Immunology, IPATIMUP), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal; Department of Molecular Biology, Instituto de Ciências Biomédicas Abel Salazar (ICBAS), University of Porto, 4050-313 Porto, Portugal; Department of Pathology, Faculty of Medicine of Porto, University of Porto, 4200-319 Porto, Portugal
| | - Jorge Vieira
- Phenotypic Evolution Group, Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal
| | - Cristina P Vieira
- Phenotypic Evolution Group, Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal
| | - Gustavo Tiscórnia
- Molecular & Regenerative Medicine Laboratory, Centro de Ciências do Mar (CCMAR), University of Algarve, 8005-139 Faro, Portugal; Clinica Eugin, Research and Development, 08006 Barcelona, Spain
| | - Inês M Araújo
- Faculty of Medicine and Biomedical Sciences, University of Algarve, 8005-139 Faro, Portugal; Algarve Biomedical Center Research Institute (ABC-RI), University of Algarve, 8005-139 Faro, Portugal; Champalimaud Research Program, Champalimaud Center for the Unknown, 1400-038 Lisbon, Portugal
| | - Mónica M Sousa
- Nerve Regeneration Group, Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, 4200-135 Porto, Portugal.
| |
|
10
|
Leinonen M, Salmela L. Extraction of long k-mers using spaced seeds. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; PP:1-1. [PMID: 34529572 DOI: 10.1109/tcbb.2021.3113131] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 06/13/2023]
Abstract
The extraction of k-mers from reads is an important task in many bioinformatics applications, such as all DNA sequence analysis methods based on de Bruijn graphs. These methods tend to be more accurate when the used k-mers are unique in the analyzed DNA, and thus the use of longer k-mers is preferred. When the read lengths of short read sequencing technologies increase, the error rate will become the determining factor for the largest possible value of k. Here we propose LoMeX which uses spaced seeds to extract long k-mers accurately even in the presence of sequencing errors. Our experiments show that LoMeX can extract long k-mers from current Illumina reads with a similar or higher recall than a standard k-mer counting tool. Furthermore, our experiments on simulated data show that when the read length further increases enabling even longer k-mers, the performance of standard k-mer counters declines, whereas LoMeX still extracts long k-mers successfully.
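The idea of a spaced seed can be illustrated with a minimal Python sketch. This is not the LoMeX algorithm; it only shows how a mask with "care" (`1`) and "don't care" (`0`) positions lets k-mers that differ at an ignored position fall into the same group. The function names, the mask, and the toy reads are assumptions made for illustration.

```python
from collections import defaultdict

def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def apply_seed(kmer, mask):
    """Keep only the characters at positions where the mask has '1'."""
    return "".join(c for c, m in zip(kmer, mask) if m == "1")

def group_by_seed(reads, mask):
    """Group all k-mers (k = len(mask)) by their spaced-seed key."""
    groups = defaultdict(set)
    k = len(mask)
    for read in reads:
        for kmer in kmers(read, k):
            groups[apply_seed(kmer, mask)].add(kmer)
    return dict(groups)

# Two toy reads that differ by one substitution (G vs. C at the third base).
reads = ["ACGTAC", "ACCTAC"]
mask = "11011"  # hypothetical seed: ignore the middle position
groups = group_by_seed(reads, mask)
```

With this mask, the k-mers `ACGTA` and `ACCTA` share the key `ACTA`, so a single mismatch at the ignored position does not separate them.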
|
11
|
Zhang X, Ping P, Hutvagner G, Blumenstein M, Li J. Aberration-corrected ultrafine analysis of miRNA reads at single-base resolution: a k-mer lattice approach. Nucleic Acids Res 2021; 49:e106. [PMID: 34291293 PMCID: PMC8631080 DOI: 10.1093/nar/gkab610] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/21/2020] [Revised: 07/01/2021] [Accepted: 07/06/2021] [Indexed: 12/21/2022] Open
Abstract
Raw sequencing reads of miRNAs contain machine-made substitution errors, or even insertions and deletions (indels). Although the error rate can be low at 0.1%, precise rectification of these errors is critically important because isoform variation analysis at single-base resolution such as novel isomiR discovery, editing events understanding, differential expression analysis, or tissue-specific isoform identification is very sensitive to base positions and copy counts of the reads. Existing error correction methods do not work for miRNA sequencing data attributed to miRNAs’ length and per-read-coverage properties distinct from DNA or mRNA sequencing reads. We present a novel lattice structure combining kmers, (k – 1)mers and (k + 1)mers to address this problem. The method is particularly effective for the correction of indel errors. Extensive tests on datasets having known ground truth of errors demonstrate that the method is able to remove almost all of the errors, without introducing any new error, to improve the data quality from every-50-reads containing one error to every-1300-reads containing one error. Studies on experimental miRNA sequencing datasets show that the errors are often rectified at the 5′ ends and the seed regions of the reads, and that there are remarkable changes after the correction in miRNA isoform abundance, volume of singleton reads, overall entropy, isomiR families, tissue-specific miRNAs, and rare-miRNA quantities.
Affiliation(s)
- Xuan Zhang
- Data Science Institute, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
| | - Pengyao Ping
- Data Science Institute, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
| | - Gyorgy Hutvagner
- School of Biomedical Engineering, Faculty of Engineering and IT, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
| | - Michael Blumenstein
- Faculty of Engineering and IT, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
| | - Jinyan Li
- To whom correspondence should be addressed. Tel: +61 295149264; Fax: +61 295149264;
| |
|
12
|
Kallenborn F, Hildebrandt A, Schmidt B. CARE: context-aware sequencing read error correction. Bioinformatics 2021; 37:889-895. [PMID: 32818262 DOI: 10.1093/bioinformatics/btaa738] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/08/2020] [Revised: 07/14/2020] [Accepted: 08/14/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes. RESULTS We present CARE, an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections, which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. AVAILABILITY AND IMPLEMENTATION CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
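As a rough illustration of minhashing for read similarity, the Python sketch below builds a small MinHash signature from a read's k-mer set and estimates Jaccard similarity from the fraction of matching signature entries. It is a generic textbook-style sketch, not CARE's implementation; the salted SHA-1 hashing, `k=4`, and `num_hashes=16` are arbitrary assumptions, and the reads must be at least `k` bases long.

```python
import hashlib

def kmer_set(read, k):
    """Return the set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def minhash_signature(read, k=4, num_hashes=16):
    """For each salted hash function, keep the smallest hash over all k-mers."""
    kms = kmer_set(read, k)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{km}".encode()).digest()[:8], "big")
            for km in kms
        ))
    return signature

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching entries, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

read_a = "ACGTACGTTAGC"
read_b = "ACGTACGTTAGG"  # toy read with one changed base
sig_a = minhash_signature(read_a)
sig_b = minhash_signature(read_b)
similarity = estimated_similarity(sig_a, sig_b)
```

Reads whose signatures agree in many positions are candidates for the more expensive alignment step, which is the general role minhashing plays in similarity search.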
Affiliation(s)
- Felix Kallenborn
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
| | - Andreas Hildebrandt
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
| | - Bertil Schmidt
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
| |
|
13
|
He F, Steige KA, Kovacova V, Göbel U, Bouzid M, Keightley PD, Beyer A, de Meaux J. Cis-regulatory evolution spotlights species differences in the adaptive potential of gene expression plasticity. Nat Commun 2021; 12:3376. [PMID: 34099660 PMCID: PMC8184852 DOI: 10.1038/s41467-021-23558-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 07/28/2020] [Accepted: 04/29/2021] [Indexed: 11/09/2022] Open
Abstract
Phenotypic plasticity is the variation in phenotype that a single genotype can produce in different environments and, as such, is an important component of individual fitness. However, whether the effect of new mutations, and hence evolution, depends on the direction of plasticity remains controversial. Here, we identify the cis-acting modifications that have reshaped gene expression in response to dehydration stress in three Arabidopsis species. Our study shows that the direction of effects of most cis-regulatory variants differentiating the response between A. thaliana and the sister species A. lyrata and A. halleri depends on the direction of pre-existing plasticity in gene expression. A comparison of the rate of cis-acting variant accumulation in each lineage indicates that the selective forces driving adaptive evolution in gene expression favor regulatory changes that magnify the stress response in A. lyrata. The evolutionary constraints measured on the amino-acid sequence of these genes support this interpretation. In contrast, regulatory changes that mitigate the plastic response to stress evolved more frequently in A. halleri. Our results demonstrate that pre-existing plasticity may be a stepping stone for adaptation, but its selective remodeling differs between lineages.
Affiliation(s)
- F He
- CEPLAS, University of Cologne, Cologne, Germany
| | - K A Steige
- CEPLAS, University of Cologne, Cologne, Germany
| | - V Kovacova
- CECAD, University of Cologne, Cologne, Germany
| | - U Göbel
- CEPLAS, University of Cologne, Cologne, Germany
| | - M Bouzid
- CEPLAS, University of Cologne, Cologne, Germany
| | - P D Keightley
- Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK
| | - A Beyer
- CEPLAS, University of Cologne, Cologne, Germany
| | - J de Meaux
- CEPLAS, University of Cologne, Cologne, Germany.
| |
|
14
|
Zhang X, Liu Y, Yu Z, Blumenstein M, Hutvagner G, Li J. Instance-based error correction for short reads of disease-associated genes. BMC Bioinformatics 2021; 22:142. [PMID: 34078284 PMCID: PMC8170817 DOI: 10.1186/s12859-021-04058-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/16/2021] [Accepted: 03/02/2021] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Genomic reads from sequencing platforms contain random errors. Global correction algorithms have been developed, aiming to rectify all possible errors in the reads using generic genome-wide patterns. However, the non-uniform sequencing depths hinder the global approach to conduct effective error removal. As some genes may get under-corrected or over-corrected by the global approach, we conduct instance-based error correction for short reads of disease-associated genes or pathways. The paramount requirement is to ensure the relevant reads, instead of the whole genome, are error-free to provide significant benefits for single-nucleotide polymorphism (SNP) or variant calling studies on the specific genes. RESULTS To rectify possible errors in the short reads of disease-associated genes, our novel idea is to exploit local sequence features and statistics directly related to these genes. Extensive experiments are conducted in comparison with state-of-the-art methods on both simulated and real datasets of lung cancer associated genes (including single-end and paired-end reads). The results demonstrated the superiority of our method with the best performance on precision, recall and gain rate, as well as on sequence assembly results (e.g., N50, the length of contig and contig quality). CONCLUSION Instance-based strategy makes it possible to explore fine-grained patterns focusing on specific genes, providing high precision error correction and convincing gene sequence assembly. SNP case studies show that errors occurring at some traditional SNP areas can be accurately corrected, providing high precision and sensitivity for investigations on disease-causing point mutations.
Affiliation(s)
- Xuan Zhang
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Yuansheng Liu
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Zuguo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, 411105, China
| | - Michael Blumenstein
- Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Gyorgy Hutvagner
- Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia.
| |
|
15
|
Heo Y, Manikandan G, Ramachandran A, Chen D. Comprehensive Evaluation of Error-Correction Methodologies for Genome Sequencing Data. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
|
16
|
Li HD, Zhang W, Luo Y, Wang J. IsoDetect: Detection of Splice Isoforms from Third Generation Long Reads Based on Short Feature Sequences. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200316101205] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
Abstract
Background:
Transcriptome annotation is the basis for understanding gene structures and analysing gene expression. The transcriptome annotation of many organisms such as humans is far from complete, due partly to the challenge in the identification of isoforms that are produced from the same gene through alternative splicing. Third generation sequencing (TGS) reads provide unprecedented opportunity for detecting isoforms due to their long length that exceeds the length of most isoforms. One limitation of current TGS reads-based isoform detection methods is that they are exclusively based on sequence reads, without incorporating the sequence information of annotated isoforms.
Objective:
We aim to develop a method to detect isoforms by incorporating annotated isoforms.
Methods:
Based on annotated isoforms, we propose a splice isoform detection method called IsoDetect. First, the sequences at exon-exon junctions are extracted from annotated isoforms as “short feature sequences”, which are used to distinguish splice isoforms. Second, we align these feature sequences to long reads and partition long reads into groups that contain the same set of feature sequences, thereby avoiding the pair-wise comparison among the large number of long reads. Third, clustering and consensus generation are carried out based on sequence similarity. For the long reads that do not contain any short feature sequence, clustering analysis based on sequence similarity is performed to identify isoforms. Therefore, our method can detect not only known but also novel isoforms.
Result:
Tested on two datasets from Calypte anna and Zebra Finch, IsoDetect shows higher speed and good accuracies compared with four existing methods.
Conclusion:
IsoDetect may become a promising method for isoform detection.
Affiliation(s)
- Hong-Dong Li
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| | - Wenjing Zhang
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| | - Yuwen Luo
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| | - Jianxin Wang
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| |
|
17
|
Reis M, Wiegleb G, Claude J, Lata R, Horchler B, Ha NT, Reimer C, Vieira CP, Vieira J, Posnien N. Multiple loci linked to inversions are associated with eye size variation in species of the Drosophila virilis phylad. Sci Rep 2020; 10:12832. [PMID: 32732947 PMCID: PMC7393161 DOI: 10.1038/s41598-020-69719-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/24/2020] [Accepted: 07/14/2020] [Indexed: 11/26/2022] Open
Abstract
The size and shape of organs is tightly controlled to achieve optimal function. Natural morphological variations often represent functional adaptations to an ever-changing environment. For instance, variation in head morphology is pervasive in insects and the underlying molecular basis is starting to be revealed in the Drosophila genus for species of the melanogaster group. However, it remains unclear whether similar diversifications are governed by similar or different molecular mechanisms over longer timescales. To address this issue, we used species of the virilis phylad because they have been diverging from D. melanogaster for at least 40 million years. Our comprehensive morphological survey revealed remarkable differences in eye size and head shape among these species with D. novamexicana having the smallest eyes and southern D. americana populations having the largest eyes. We show that the genetic architecture underlying eye size variation is complex with multiple associated genetic variants located on most chromosomes. Our genome wide association study (GWAS) strongly suggests that some of the putative causative variants are associated with the presence of inversions. Indeed, northern populations of D. americana share derived inversions with D. novamexicana and they show smaller eyes compared to southern ones. Intriguingly, we observed a significant enrichment of genes involved in eye development on the 4th chromosome after intersecting chromosomal regions associated with phenotypic differences with those showing high differentiation among D. americana populations. We propose that variants associated with chromosomal inversions contribute to both intra- and interspecific variation in eye size among species of the virilis phylad.
Affiliation(s)
- Micael Reis
- Department of Developmental Biology, Göttingen Center for Molecular Biosciences (GZMB), University of Goettingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Gordon Wiegleb
- Department of Developmental Biology, Göttingen Center for Molecular Biosciences (GZMB), University of Goettingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany; International Max Planck Research School for Genome Science, Am Fassberg 11, 37077, Göttingen, Germany
| | - Julien Claude
- Institut Des Sciences de l'Evolution de Montpellier, CNRS/UM2/IRD, 2 Place Eugène Bataillon, cc64, 34095, Montpellier Cedex 5, France
| | - Rodrigo Lata
- Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Porto, Portugal; Instituto de Biologia Molecular e Celular (IBMC), Universidade do Porto, Porto, Portugal
| | - Britta Horchler
- Department of Developmental Biology, Göttingen Center for Molecular Biosciences (GZMB), University of Goettingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany
| | - Ngoc-Thuy Ha
- Animal Breeding and Genetics Group, Department of Animal Sciences, University of Goettingen, Albrecht-Thaer-Weg 3, 37075, Göttingen, Germany; Center for Integrated Breeding Research, University of Goettingen, Albrecht-Thaer-Weg 3, 37075, Göttingen, Germany
| | - Christian Reimer
- Animal Breeding and Genetics Group, Department of Animal Sciences, University of Goettingen, Albrecht-Thaer-Weg 3, 37075, Göttingen, Germany; Center for Integrated Breeding Research, University of Goettingen, Albrecht-Thaer-Weg 3, 37075, Göttingen, Germany
| | - Cristina P Vieira
- Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Porto, Portugal; Instituto de Biologia Molecular e Celular (IBMC), Universidade do Porto, Porto, Portugal
| | - Jorge Vieira
- Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Porto, Portugal; Instituto de Biologia Molecular e Celular (IBMC), Universidade do Porto, Porto, Portugal
| | - Nico Posnien
- Department of Developmental Biology, Göttingen Center for Molecular Biosciences (GZMB), University of Goettingen, Justus-von-Liebig-Weg 11, 37077, Göttingen, Germany.
| |
|
18
|
Mitchell K, Brito JJ, Mandric I, Wu Q, Knyazev S, Chang S, Martin LS, Karlsberg A, Gerasimov E, Littman R, Hill BL, Wu NC, Yang HT, Hsieh K, Chen L, Littman E, Shabani T, Enik G, Yao D, Sun R, Schroeder J, Eskin E, Zelikovsky A, Skums P, Pop M, Mangul S. Benchmarking of computational error-correction methods for next-generation sequencing data. Genome Biol 2020; 21:71. [PMID: 32183840 PMCID: PMC7079412 DOI: 10.1186/s13059-020-01988-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 05/18/2019] [Accepted: 03/06/2020] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown. RESULTS In this paper, we evaluate the ability of error correction algorithms to fix errors across different types of datasets that contain various levels of heterogeneity. We highlight the advantages and limitations of computational error correction techniques across different domains of biology, including immunogenomics and virology. To demonstrate the efficacy of our technique, we apply the UMI-based high-fidelity sequencing protocol to eliminate sequencing errors from both simulated data and the raw reads. We then perform a realistic evaluation of error-correction methods. CONCLUSIONS In terms of accuracy, we find that method performance varies substantially across different types of datasets with no single method performing best on all types of examined data. Finally, we also identify the techniques that offer a good balance between precision and sensitivity.
Affiliation(s)
- Keith Mitchell
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Jaqueline J Brito
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
| | - Igor Mandric
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
| | - Qiaozhen Wu
- Department of Mathematics, University of California Los Angeles, 520 Portola Plaza, Los Angeles, CA, 90095, USA
| | - Sergey Knyazev
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
| | - Sei Chang
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Lana S Martin
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
| | - Aaron Karlsberg
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
| | - Ekaterina Gerasimov
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
| | - Russell Littman
- UCLA Bioinformatics, 621 Charles E Young Dr S, Los Angeles, CA, 90024, USA
| | - Brian L Hill
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Nicholas C Wu
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Harry Taegyun Yang
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Kevin Hsieh
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Linus Chen
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Eli Littman
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Taylor Shabani
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - German Enik
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Douglas Yao
- Department of Molecular, Cell, and Developmental Biology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
| | - Ren Sun
- Department of Molecular and Medical Pharmacology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
| | - Jan Schroeder
- Epigenetics & Reprogramming Laboratory, Monash University, 15 Innovation Walk, Melbourne, VIC, 3800, Australia
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, Russia, 119991
| | - Pavel Skums
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
| | - Mihai Pop
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20742, USA
| | - Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA.
| |
|
19
|
Das AK, Goswami S, Lee K, Park SJ. A hybrid and scalable error correction algorithm for indel and substitution errors of long reads. BMC Genomics 2019; 20:948. [PMID: 31856721 PMCID: PMC6923905 DOI: 10.1186/s12864-019-6286-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND Long-read sequencing has shown promise in overcoming the short length limitations of second-generation sequencing by providing more complete assembly. However, the computation of the long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to the short reads. METHODS In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high throughput Illumina short-read sequences to rectify the PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low and high coverage regions, followed by a majority voting to rectify each substituted error base. RESULTS ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation results demonstrate that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% bases of an E. coli PacBio dataset with the reference genome, proving its accuracy. CONCLUSION ParLECH can scale to over terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.
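The widest path (maximum min-coverage path) mentioned in the abstract can be sketched with a Dijkstra-style search that maximizes the smallest edge weight along a path. The Python code below is a generic illustration on a tiny hand-made graph, not ParLECH's distributed implementation; the node labels, coverage values, and function name are hypothetical.

```python
import heapq

def widest_path(graph, source, target):
    """Return (path, width), where width is the largest achievable
    minimum edge coverage over all paths from source to target."""
    best = {source: float("inf")}   # best known bottleneck per node
    prev = {}                       # predecessor on the best path
    heap = [(-float("inf"), source)]  # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if node == target:
            break
        if width < best.get(node, 0):
            continue  # stale heap entry
        for nxt, coverage in graph.get(node, []):
            candidate = min(width, coverage)
            if candidate > best.get(nxt, 0):
                best[nxt] = candidate
                prev[nxt] = node
                heapq.heappush(heap, (-candidate, nxt))
    if target not in best:
        return None, 0
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], best[target]

# Toy graph: nodes stand in for de Bruijn graph vertices,
# edge weights for k-mer coverage (all values are made up).
graph = {
    "A": [("B", 5), ("C", 2)],
    "B": [("D", 3)],
    "C": [("D", 9)],
}
path, width = widest_path(graph, "A", "D")
```

Here the route through `B` has bottleneck coverage 3, while the route through `C` is limited to 2 by its first edge, so the search prefers the former even though the latter contains the single highest-coverage edge.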
Affiliation(s)
- Arghya Kusum Das
- Department of Computer Science and Software Engineering, University of Wisconsin at Platteville, Platteville, WI USA
| | - Sayan Goswami
- School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
| | - Kisung Lee
- School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
| | - Seung-Jong Park
- School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
| |
|
20
|
Ahmed N, Lévy J, Ren S, Mushtaq H, Bertels K, Al-Ars Z. GASAL2: a GPU accelerated sequence alignment library for high-throughput NGS data. BMC Bioinformatics 2019; 20:520. [PMID: 31653208 PMCID: PMC6815017 DOI: 10.1186/s12859-019-3086-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 04/02/2019] [Accepted: 09/06/2019] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND Due to the computational complexity of sequence alignment algorithms, various accelerated solutions have been proposed to speed up this analysis. NVBIO is the only available GPU library that accelerates sequence alignment of high-throughput NGS data, but has limited performance. In this article we present GASAL2, a GPU library for aligning DNA and RNA sequences that outperforms existing CPU and GPU libraries. RESULTS The GASAL2 library provides specialized, accelerated kernels for local, global and all types of semi-global alignment. Pairwise sequence alignment can be performed with and without traceback. GASAL2 outperforms the fastest CPU-optimized SIMD implementations such as SeqAn and Parasail, as well as NVIDIA's own GPU-based library known as NVBIO. GASAL2 is unique in performing sequence packing on GPU, which is up to 750x faster than NVBIO. Overall on Geforce GTX 1080 Ti GPU, GASAL2 is up to 21x faster than Parasail on a dual socket hyper-threaded Intel Xeon system with 28 cores and up to 13x faster than NVBIO with a query length of up to 300 bases and 100 bases, respectively. GASAL2 alignment functions are asynchronous/non-blocking and allow full overlap of CPU and GPU execution. The paper shows how to use GASAL2 to accelerate BWA-MEM, speeding up the local alignment by 20x, which gives an overall application speedup of 1.3x vs. CPU with up to 12 threads. CONCLUSIONS The library provides high performance APIs for local, global and semi-global alignment that can be easily integrated into various bioinformatics tools.
Affiliation(s)
- Nauman Ahmed
- Delft University of Technology, Delft, Netherlands and University of Engineering and Technology, Lahore, Pakistan
| | - Jonathan Lévy
- Delft University of Technology, Netherlands, Delft, Netherlands
| | - Shanshan Ren
- Delft University of Technology, Netherlands, Delft, Netherlands
| | - Hamid Mushtaq
- Maastricht UMC+, Netherlands, Maastricht, Netherlands
| | - Koen Bertels
- Delft University of Technology, Netherlands, Delft, Netherlands
| | - Zaid Al-Ars
- Delft University of Technology, Netherlands, Delft, Netherlands
| |
|
21
|
GAAP: A Genome Assembly + Annotation Pipeline. Biomed Research International 2019; 2019:4767354. [PMID: 31346518 PMCID: PMC6617929 DOI: 10.1155/2019/4767354] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 01/24/2019] [Revised: 05/20/2019] [Accepted: 05/26/2019] [Indexed: 12/24/2022]
Abstract
Genomic analysis begins with de novo assembly of short-read fragments in order to reconstruct full-length base sequences without exploiting a reference genome sequence. Then, in the annotation step, gene locations are identified within the base sequences, and the structures and functions of these genes are determined. Recently, a wide range of powerful tools have been developed and published for whole-genome analysis, enabling even individual researchers in small laboratories to perform whole-genome analyses on their objects of interest. However, these analytical tools are generally complex and use diverse algorithms, parameter setting methods, and input formats; thus, it remains difficult for individual researchers to select, utilize, and combine these tools to obtain their final results. To resolve these issues, we have developed a genome analysis pipeline (GAAP) for semiautomated, iterative, and high-throughput analysis of whole-genome data. This pipeline is designed to perform read correction, de novo genome (transcriptome) assembly, gene prediction, and functional annotation using a range of proven tools and databases. We aim to assist non-IT researchers by describing each stage of analysis in detail and discussing current approaches. We also provide practical advice on how to access and use the bioinformatics tools and databases and how to implement the provided suggestions. Whole-genome analysis of Toxocara canis is used as case study to show intermediate results at each stage, demonstrating the practicality of the proposed method.
|
22
|
|
23
|
Manekar SC, Sathe SR. Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art. Curr Genomics 2019; 20:2-15. [PMID: 31015787 PMCID: PMC6446480 DOI: 10.2174/1389202919666181026101326] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 07/23/2018] [Revised: 10/05/2018] [Accepted: 10/24/2018] [Indexed: 12/24/2022] Open
Abstract
Background In bioinformatics, estimating k-mer abundance histograms, or just enumerating the number of unique k-mers and the number of singletons, is desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tuning runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, measuring sequencing error rates, etc. Different methods for cardinality estimation in sequencing data have been developed in recent years. Objective In this article, we present a comparative assessment of the different k-mer frequency estimation programs (ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and unique-kmers.py)) to assess their relative merits and demerits. Methods Principally, the miscounts/error-rates of these tools are analyzed by rigorous experimental analysis for a varied range of k. We also present experimental results on runtime, scalability for larger datasets, memory, CPU utilization as well as parallelism of k-mer frequency estimation methods. Results The results indicate that ntCard is more accurate in estimating F0, f1 and full k-mer abundance histograms compared with other methods. ntCard is the fastest but it has more memory requirements compared to KmerGenie. Conclusion The results of this evaluation may serve as a roadmap to potential users and practitioners of streaming algorithms for estimating k-mer coverage frequencies, to assist them in identifying an appropriate method. Such analysis of the results also helps researchers to discover remaining open research questions, effective combinations of existing techniques and possible avenues for future research.
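To make the estimated quantities concrete, here is a minimal sketch that computes them exactly on a small input: the k-mer abundance histogram, the number of distinct k-mers (read here as F0) and the number of singletons (read here as f1). The tools compared in the abstract estimate these values with streaming techniques; this exact, in-memory version is only an illustration, and the function name `kmer_histogram` is hypothetical.

```python
from collections import Counter

def kmer_histogram(reads, k):
    """Exact k-mer abundance histogram for a small set of reads.

    Returns (histogram, f0, f1), where histogram[c] is the number of
    distinct k-mers seen exactly c times, f0 is the number of distinct
    k-mers and f1 is the number of k-mers seen exactly once.
    """
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    histogram = Counter(counts.values())
    return histogram, len(counts), histogram.get(1, 0)
```

For example, `kmer_histogram(["ACGTA", "CGTAC"], 3)` sees the 3-mers ACG and TAC once and CGT and GTA twice, so there are four distinct k-mers, two of them singletons.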
Affiliation(s)
- Swati C Manekar
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur, India
| | - Shailesh R Sathe
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur, India
| |
|
24
|
Limasset A, Flot JF, Peterlongo P. Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs. Bioinformatics 2019; 36:1374-1381. [DOI: 10.1093/bioinformatics/btz102] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 07/23/2018] [Revised: 01/07/2019] [Accepted: 02/18/2019] [Indexed: 12/25/2022] Open
Abstract
Motivation
Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large datasets or consider reads as mere suites of k-mers, without taking into account their full-length sequence information.
Results
We propose a new method to correct short reads using de Bruijn graphs and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance then of unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond.
Availability and implementation
The implementation is open source, available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Antoine Limasset
- Evolutionary Biology & Ecology, Université Libre de Bruxelles (ULB), Bruxelles, Belgium
| | - Jean-François Flot
- Evolutionary Biology & Ecology, Université Libre de Bruxelles (ULB), Bruxelles, Belgium
- Interuniversity Institute of Bioinformatics in Brussels – (IB) 2, Brussels, Belgium
| | | |
|
25
|
Ershov V, Tarasov A, Lapidus A, Korobeynikov A. IonHammer: Homopolymer-Space Hamming Clustering for IonTorrent Read Error Correction. J Comput Biol 2019; 26:124-127. [DOI: 10.1089/cmb.2018.0152] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022] Open
Affiliation(s)
- Vasily Ershov
- Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia
| | - Artem Tarasov
- European Molecular Biology Laboratory, Heidelberg, Germany
| | - Alla Lapidus
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
| | - Anton Korobeynikov
- Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
| |
|
26
|
Zhao L, Xie J, Bai L, Chen W, Wang M, Zhang Z, Wang Y, Zhao Z, Li J. Mining statistically-solid k-mers for accurate NGS error correction. BMC Genomics 2018; 19:912. [PMID: 30598110 PMCID: PMC6311904 DOI: 10.1186/s12864-018-5272-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND NGS data contains many machine-induced errors. The most advanced methods for error correction depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer frequently occurring in NGS reads. The other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors to the remaining set of solid k-mers. Identification of these two subsets of k-mers can improve the correction performance. RESULTS We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and combine them to determine f0. To identify the two special subsets of k-mers, we use the z-score of k-mers, which measures the number of standard deviations a k-mer's frequency is from the mean. Then these statistically-solid k-mers are used to construct a Bloom filter for error correction. Our method is markedly superior to the state-of-the-art methods, tested on both real and synthetic NGS data sets. CONCLUSION The z-score is adequate to distinguish solid k-mers from weak k-mers, and is particularly useful for pinpointing solid k-mers having very low frequency. Applying the z-score to k-mers can markedly improve the error correction accuracy.
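The z-score described in this abstract can be illustrated with a short sketch. The abstract does not say which population the mean and standard deviation are taken over, so the version below assumes the simplest reading (all k-mers in the supplied count table); the function name `kmer_z_scores` is hypothetical, and the Gamma/Gaussian mixture fit used to choose f0 is not reproduced.

```python
from statistics import mean, pstdev

def kmer_z_scores(counts):
    """Z-score of each k-mer's frequency relative to all k-mers in `counts`.

    `counts` maps k-mer -> observed frequency. Using the global mean and
    population standard deviation is an assumption made for illustration.
    """
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All frequencies are identical, so no k-mer deviates from the mean.
        return {kmer: 0.0 for kmer in counts}
    return {kmer: (c - mu) / sigma for kmer, c in counts.items()}
```

A k-mer whose z-score is strongly negative sits well below the typical frequency and would be a candidate weak k-mer under this reading.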
Affiliation(s)
- Liang Zhao
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Jin Xie
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Lin Bai
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Wen Chen
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Mingju Wang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Zhonglei Zhang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Yiqi Wang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
| | - Zhe Zhao
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering & IT, University of Technology Sydney, NSW 2007, Australia
| |
|
27
|
Manekar SC, Sathe SR. A benchmark study of k-mer counting methods for high-throughput sequencing. Gigascience 2018; 7:5140149. [PMID: 30346548 PMCID: PMC6280066 DOI: 10.1093/gigascience/giy125] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 09/25/2017] [Accepted: 10/16/2018] [Indexed: 11/25/2022] Open
Abstract
The rapid development of high-throughput sequencing technologies means that hundreds of gigabytes of sequencing data can be produced in a single study. Many bioinformatics tools require counts of substrings of length k in DNA/RNA sequencing reads obtained for applications such as genome and transcriptome assembly, error correction, multiple sequence alignment, and repeat detection. Recently, several techniques have been developed to count k-mers in large sequencing datasets, with a trade-off between the time and memory required to perform this function. We assessed several k-mer counting programs and evaluated their relative performance, primarily on the basis of runtime and memory usage. We also considered additional parameters such as disk usage, accuracy, parallelism, the impact of compressed input, performance in terms of counting large k values and the scalability of the application to larger datasets. We make specific recommendations for the setup of a current state-of-the-art program and suggestions for further development.
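The counting task itself can be sketched in a few lines. The example below counts canonical k-mers, meaning a k-mer and its reverse complement are treated as the same key; that convention is common among k-mer counters but is an assumption here, since the abstract does not mention it. The function names are hypothetical, and the in-memory dictionary is exactly what the benchmarked programs are designed to avoid on large inputs.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def count_canonical_kmers(reads, k):
    """Count canonical k-mers, skipping any window with a non-ACGT base."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts
```

Counting canonical forms means that reads sequenced from opposite strands contribute to the same entries.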
Affiliation(s)
- Swati C Manekar
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440 010, India
| | - Shailesh R Sathe
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440 010, India
| |
|
28
|
Mukherjee K, Washimkar D, Muggli MD, Salmela L, Boucher C. Error correcting optical mapping data. Gigascience 2018; 7:5005021. [PMID: 29846578 PMCID: PMC6007263 DOI: 10.1093/gigascience/giy061] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 06/22/2017] [Accepted: 05/16/2018] [Indexed: 12/31/2022] Open
Abstract
Optical mapping is a unique system that is capable of producing high-resolution, high-throughput genomic map data that gives information about the structure of a genome. Recently it has been used for scaffolding contigs and for assembly validation for large-scale sequencing projects, including the maize, goat, and Amborella genomes. However, a major impediment in the use of this data is the variety and quantity of errors in the raw optical mapping data, which are called Rmaps. The challenges associated with using Rmap data are analogous to dealing with insertions and deletions in the alignment of long reads. Moreover, they are arguably harder to tackle since the data are numerical and susceptible to inaccuracy. We develop cOMet to error correct Rmap data, which to the best of our knowledge is the only optical mapping error correction method. Our experimental results demonstrate that cOMet has high precision and corrects 82.49% of insertion errors and 77.38% of deletion errors in Rmap data generated from the Escherichia coli K-12 reference genome. Out of the deletion errors corrected, 98.26% are true errors. Similarly, out of the insertion errors corrected, 82.19% are true errors. It also successfully scales to large genomes, improving the quality of 78% and 99% of the Rmaps in the plum and goat genomes, respectively. Last, we show the utility of error correction by demonstrating how it improves the assembly of Rmap data. Error corrected Rmap data results in an assembly that is more contiguous and covers a larger fraction of the genome.
Affiliation(s)
- Kingshuk Mukherjee
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville
| | - Darshan Washimkar
- Department of Computer Science, Colorado State University, Fort Collins
| | - Martin D Muggli
- Department of Computer Science, Colorado State University, Fort Collins
| | - Leena Salmela
- Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville
| |
|
29
|
Yoon S, Kim D, Kang K, Park WJ. TraRECo: a greedy approach based de novo transcriptome assembler with read error correction using consensus matrix. BMC Genomics 2018; 19:653. [PMID: 30180798 PMCID: PMC6123912 DOI: 10.1186/s12864-018-5034-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/10/2017] [Accepted: 08/23/2018] [Indexed: 01/15/2023] Open
Abstract
BACKGROUND The challenges when developing a good de novo transcriptome assembler include how to deal with read errors and sequence repeats. Almost all de novo assemblers utilize a de Bruijn graph, with which complexity grows linearly with data size while suffering from errors and repeats. Although one can correct the errors by inspecting the topological structure of the graph, this is not an easy task when there are too many branches. Two research directions are to improve either the graph reliability or the path search precision, and in this study, we focused on the former. RESULTS We present TraRECo, a greedy approach to de novo assembly employing error-aware graph construction. In the proposed approach, we built contigs by direct read alignment within a distance margin and performed a junction search to construct splicing graphs. While doing so, a contig of length l was represented by a 4 × l matrix (called a consensus matrix), in which each element was the base count of the aligned reads so far. A representative sequence was obtained by taking the majority in each column of the consensus matrix to be used for further read alignment. Once the splicing graphs had been obtained, we used IsoLasso to find paths with a noticeable read depth. The experiments using real and simulated reads show that the method provided considerable improvement in sensitivity and moderately better performance when comparing sensitivity and precision. This was achieved by the error-aware graph construction using the consensus matrix, with which the reads having errors were made usable for the graph construction (otherwise, they might have been eventually discarded). This improved the quality of the coverage depth information used in the subsequent path search step and finally the reliability of the graph. CONCLUSIONS De novo assembly is mainly used to explore undiscovered isoforms and must be able to represent as many reads as possible in an efficient way. 
In this sense, TraRECo provides us with a potential alternative for improving graph reliability even though the computational burden is much higher than the single k-mer in the de Bruijn graph approach.
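The consensus matrix described in this abstract (a 4 × l table of base counts with a per-column majority vote) can be sketched as follows. This is an illustration of the data structure only, not TraRECo's implementation: the function names, the `(offset, sequence)` input format and the use of "N" for uncovered columns are assumptions, and the read alignment that produces the offsets is not shown.

```python
BASES = "ACGT"

def consensus_matrix(aligned_reads, length):
    """Build a 4 x length matrix of base counts.

    `aligned_reads` is an iterable of (offset, sequence) pairs giving the
    contig column where each read starts (a hypothetical input format).
    """
    matrix = [[0] * length for _ in BASES]
    for offset, seq in aligned_reads:
        for i, base in enumerate(seq):
            col = offset + i
            if 0 <= col < length and base in BASES:
                matrix[BASES.index(base)][col] += 1
    return matrix

def representative_sequence(matrix):
    """Majority base per column; 'N' where no read covers the column."""
    out = []
    for col in range(len(matrix[0])):
        column = [row[col] for row in matrix]
        top = max(column)
        out.append(BASES[column.index(top)] if top > 0 else "N")
    return "".join(out)
```

With reads `[(0, "ACGT"), (0, "ACCT"), (1, "CGTA"), (0, "ACGT")]` and length 5, the single C at column 2 is outvoted by three G counts, so the representative sequence is `ACGTA`.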
Affiliation(s)
- Seokhyun Yoon
- Department of Electronics Eng., College of Engineering, Dankook University, Yongin-si, Korea
| | - Daeseung Kim
- Department of Microbiology, College of Natural Sciences, Dankook University, Cheonan-si, Korea
| | - Keunsoo Kang
- Department of Microbiology, College of Natural Sciences, Dankook University, Cheonan-si, Korea.
| | - Woong June Park
- Department of Molecular Biology, College of Natural Sciences, Dankook University, Cheonan-si, Korea
| |
|
30
|
Xia C, Wang M, Cornejo OE, Jiwan DA, See DR, Chen X. Secretome Characterization and Correlation Analysis Reveal Putative Pathogenicity Mechanisms and Identify Candidate Avirulence Genes in the Wheat Stripe Rust Fungus Puccinia striiformis f. sp. tritici. Front Microbiol 2017; 8:2394. [PMID: 29312156 PMCID: PMC5732408 DOI: 10.3389/fmicb.2017.02394] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 09/17/2017] [Accepted: 11/20/2017] [Indexed: 12/30/2022] Open
Abstract
Stripe (yellow) rust, caused by Puccinia striiformis f. sp. tritici (Pst), is one of the most destructive diseases of wheat worldwide. Planting resistant cultivars is an effective way to control this disease, but race-specific resistance can be overcome quickly due to the rapidly evolving Pst population. Studying the pathogenicity mechanisms is critical for understanding how Pst virulence changes and how to develop wheat cultivars with durable resistance to stripe rust. We re-sequenced 7 Pst isolates and included an additional 7 previously sequenced isolates to represent balanced virulence/avirulence profiles for several avirulence loci in secretome analyses. We observed an uneven distribution of heterozygosity among the isolates. Secretome comparison of Pst with other rust fungi identified a large portion of species-specific secreted proteins, suggesting that they may have specific roles when interacting with the wheat host. Thirty-two effectors of Pst were identified from its secretome. We identified candidates for Avr genes corresponding to six Yr genes by correlating polymorphisms for effector genes to the virulence/avirulence profiles of the 14 Pst isolates. The putative AvYr76 was present in the avirulent isolates, but absent in the virulent isolates, suggesting that deleting the coding region of the candidate avirulence gene has produced races virulent to resistance gene Yr76. We conclude that incorporating avirulence/virulence phenotypes into correlation analysis with variations in genomic structure and secretome, particularly presence/absence polymorphisms of effectors, is an efficient way to identify candidate Avr genes in Pst. The candidate effector genes provide a rich resource for further studies to determine the evolutionary history of Pst populations and the co-evolutionary arms race between Pst and wheat. The Avr candidates identified in this study will lead to cloning avirulence genes in Pst, which will enable us to understand molecular mechanisms underlying Pst-wheat interactions, to determine the effectiveness of resistance genes and further to develop durable resistance to stripe rust.
Affiliation(s)
- Chongjing Xia
- Department of Plant Pathology, Washington State University, Pullman, WA, United States
| | - Meinan Wang
- Department of Plant Pathology, Washington State University, Pullman, WA, United States
| | - Omar E. Cornejo
- School of Biological Sciences, Washington State University, Pullman, WA, United States
| | - Derick A. Jiwan
- Wheat Health, Genetics, and Quality Research Unit, Agricultural Research Service, U.S. Department of Agriculture, Pullman, WA, United States
| | - Deven R. See
- Department of Plant Pathology, Washington State University, Pullman, WA, United States
- Wheat Health, Genetics, and Quality Research Unit, Agricultural Research Service, U.S. Department of Agriculture, Pullman, WA, United States
| | - Xianming Chen
- Department of Plant Pathology, Washington State University, Pullman, WA, United States
- Wheat Health, Genetics, and Quality Research Unit, Agricultural Research Service, U.S. Department of Agriculture, Pullman, WA, United States
| |
|
31
|
Huang YT, Huang YW. An efficient error correction algorithm using FM-index. BMC Bioinformatics 2017; 18:524. [PMID: 29179672 PMCID: PMC5704532 DOI: 10.1186/s12859-017-1940-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/09/2017] [Accepted: 11/14/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND High-throughput sequencing offers higher throughput and lower cost for sequencing a genome. However, sequencing errors, including mismatches and indels, may be produced during sequencing. Because errors may reduce the accuracy of subsequent de novo assembly, error correction is necessary prior to assembly. However, existing correction methods still face trade-offs among correction power, accuracy, and speed. RESULTS We develop a novel overlap-based error correction algorithm using FM-index (called FMOE). FMOE first identifies overlapping reads by aligning a query read simultaneously against multiple reads compressed by FM-index. Subsequently, sequencing errors are corrected by k-mer voting from overlapping reads only. The experimental results indicate that FMOE has the highest correction power with comparable accuracy and speed. Our algorithm performs better on long-read than on short-read datasets when compared with others. The assembly results indicate that each algorithm has its own strengths and weaknesses, whereas FMOE is good for long or good-quality reads. CONCLUSIONS FMOE is freely available at https://github.com/ythuang0522/FMOC.
Affiliation(s)
- Yao-Ting Huang
- Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan
| | - Yu-Wen Huang
- Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan
| |
|
32
|
Illingworth CJR, Roy S, Beale MA, Tutill H, Williams R, Breuer J. On the effective depth of viral sequence data. Virus Evol 2017; 3:vex030. [PMID: 29250429 PMCID: PMC5724399 DOI: 10.1093/ve/vex030] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Indexed: 01/18/2023] Open
Abstract
Genome sequence data are of great value in describing evolutionary processes in viral populations. However, in such studies, the extent to which data accurately describes the viral population is a matter of importance. Multiple factors may influence the accuracy of a dataset, including the quantity and nature of the sample collected, and the subsequent steps in viral processing. To investigate this phenomenon, we sequenced replica datasets spanning a range of viruses, and in which the point at which samples were split was different in each case, from a dataset in which independent samples were collected from a single patient to another in which all processing steps up to sequencing were applied to a single sample before splitting the sample and sequencing each replicate. We conclude that neither a high read depth nor a high template number in a sample guarantee the precision of a dataset. Measures of consistency calculated from within a single biological sample may also be insufficient; distortion of the composition of a population by the experimental procedure or genuine within-host diversity between samples may each affect the results. Where it is possible, data from replicate samples should be collected to validate the consistency of short-read sequence data.
Affiliation(s)
- Christopher J R Illingworth
- Department of Genetics, University of Cambridge, Cambridge, UK; Department of Applied Maths and Theoretical Physics, Centre for Mathematical Sciences, University of Cambridge, Cambridge, UK
| | - Sunando Roy
- Division of Infection and Immunity, University College London, London, UK
| | | | - Helena Tutill
- Division of Infection and Immunity, University College London, London, UK
| | - Rachel Williams
- Division of Infection and Immunity, University College London, London, UK
| | - Judith Breuer
- Division of Infection and Immunity, University College London, London, UK
| |
|
33
|
Savel D, LaFramboise T, Grama A, Koyuturk M. Pluribus-Exploring the Limits of Error Correction Using a Suffix Tree. IEEE/ACM Trans Comput Biol Bioinform 2017; 14:1378-1388. [PMID: 27362987 PMCID: PMC5754272 DOI: 10.1109/tcbb.2016.2586060] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/06/2023]
Abstract
Next generation sequencing technologies enable efficient and cost-effective genome sequencing. However, sequencing errors increase the complexity of the de novo assembly process, and reduce the quality of the assembled sequences. Many error correction techniques utilizing substring frequencies have been developed to mitigate this effect. In this paper, we present a novel and effective method called Pluribus, for correcting sequencing errors using a generalized suffix trie. Pluribus utilizes multiple manifestations of an error in the trie to accurately identify errors and suggest corrections. We show that Pluribus produces the least number of false positives across a diverse set of real sequencing datasets when compared to other methods. Furthermore, Pluribus can be used in conjunction with other contemporary error correction methods to achieve higher levels of accuracy than either tool alone. These increases in error correction accuracy are also realized in the quality of the contigs that are generated during assembly. We explore, in-depth, the behavior of Pluribus, to explain the observed improvement in accuracy and assembly performance. Pluribus is freely available at http://compbio.case.edu/pluribus/.
|
34
|
Paracchini V, Petrillo M, Lievens A, Puertas Gallardo A, Martinsohn JT, Hofherr J, Maquet A, Silva APB, Kagkli DM, Querci M, Patak A, Angers-Loustau A. Novel nuclear barcode regions for the identification of flatfish species. Food Control 2017; 79:297-308. [PMID: 28867876 PMCID: PMC5446357 DOI: 10.1016/j.foodcont.2017.04.009] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 11/07/2016] [Revised: 04/05/2017] [Accepted: 04/06/2017] [Indexed: 01/30/2023]
Abstract
The development of an efficient seafood traceability framework is crucial for the management of sustainable fisheries and the monitoring of potential substitution fraud across the food chain. Recent studies have shown the potential of DNA barcoding methods in this framework, with most of the efforts focusing on using mitochondrial targets such as the cytochrome oxidase 1 and cytochrome b genes. In this article, we show the identification of novel targets in the nuclear genome, and their associated primers, to be used for the efficient identification of flatfishes of the Pleuronectidae family. In addition, different in silico methods are described to generate a dataset of barcode reference sequences from the ever-growing wealth of publicly available sequence information, replacing, where possible, labour-intensive laboratory work. The short amplicon lengths render the analysis of these new barcode target regions ideally suited to next-generation sequencing techniques, allowing characterisation of multiple fish species in mixed and processed samples. Their location in the nucleus also improves currently used methods by allowing the identification of hybrid individuals.
Affiliation(s)
- Valentina Paracchini
- European Commission, Joint Research Centre (JRC), via E. Fermi 2749, 21027 Ispra, Italy
| | - Mauro Petrillo
- European Commission, Joint Research Centre (JRC), via E. Fermi 2749, 21027 Ispra, Italy
| | - Antoon Lievens
- European Commission, Joint Research Centre (JRC), via E. Fermi 2749, 21027 Ispra, Italy
| | | | | | - Johann Hofherr
- European Commission, Joint Research Centre (JRC), via E. Fermi 2749, 21027 Ispra, Italy
| | - Alain Maquet
- European Commission, Joint Research Centre (JRC), Retieseweg 111, 2440 Geel, Belgium
| | | | - Dafni Maria Kagkli
- European Commission, Joint Research Centre (JRC), via E. Fermi 2749, 21027 Ispra, Italy
| | - Maddalena Querci
- European Commission, Joint Research Centre (JRC), via E. Fermi 2749, 21027 Ispra, Italy
| | - Alex Patak
- European Commission, Joint Research Centre (JRC), via E. Fermi 2749, 21027 Ispra, Italy
| | | |
|
35
|
Song L, Huang W, Kang J, Huang Y, Ren H, Ding K. Comparison of error correction algorithms for Ion Torrent PGM data: application to hepatitis B virus. Sci Rep 2017; 7:8106. [PMID: 28808243 PMCID: PMC5556038 DOI: 10.1038/s41598-017-08139-y] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 01/05/2017] [Accepted: 07/05/2017] [Indexed: 01/26/2023] Open
Abstract
Ion Torrent Personal Genome Machine (PGM) technology is a mid-length read, low-cost and high-speed next-generation sequencing platform with a relatively high insertion and deletion (indel) error rate. A full systematic assessment of the effectiveness of various error correction algorithms in PGM viral datasets (e.g., hepatitis B virus (HBV)) has not been performed. We examined 19 quality-trimmed PGM datasets for the HBV reverse transcriptase (RT) region and found a total error rate of 0.48% ± 0.12%. Deletion errors were clearly present at the ends of homopolymer runs. Tests using both real and simulated data showed that the algorithms differed in their abilities to detect and correct errors and that the error rate and sequencing depth significantly affected the performance. Of the algorithms tested, Pollux showed a better overall performance but tended to over-correct 'genuine' substitution variants, whereas Fiona proved to be better at distinguishing these variants from sequencing errors. We found that the combined use of Pollux and Fiona gave the best results when error-correcting Ion Torrent PGM viral data.
Affiliation(s)
- Liting Song
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, Department of Infectious Diseases, The Second Affiliated Hospital, Chongqing Medical University, Chongqing, 400010, P.R. China
- Wenxun Huang
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, Department of Infectious Diseases, The Second Affiliated Hospital, Chongqing Medical University, Chongqing, 400010, P.R. China
- Juan Kang
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, Department of Infectious Diseases, The Second Affiliated Hospital, Chongqing Medical University, Chongqing, 400010, P.R. China
- Yuan Huang
  - Center for Hepatobillary and Pancreatic Diseases, Beijing Tsinghua Changgung Hospital, Medical Center, Tsinghua University, Beijing, 100044, P.R. China
- Hong Ren
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, Department of Infectious Diseases, The Second Affiliated Hospital, Chongqing Medical University, Chongqing, 400010, P.R. China
- Keyue Ding
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, Department of Infectious Diseases, The Second Affiliated Hospital, Chongqing Medical University, Chongqing, 400010, P.R. China

36

Lee B, Moon T, Yoon S, Weissman T. DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing. PLoS One 2017; 12:e0181463. [PMID: 28749987 PMCID: PMC5531809 DOI: 10.1371/journal.pone.0181463] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 03/20/2017] [Accepted: 06/30/2017] [Indexed: 11/29/2022]
Abstract
We consider the correction of errors from nucleotide sequences produced by next-generation targeted amplicon sequencing. The next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq.
Affiliation(s)
- Byunghan Lee
  - Electrical and Computer Engineering, Seoul National University, Seoul, Korea
- Taesup Moon
  - College of Information and Communication Engineering, Sungkyunkwan University, Suwon, Korea
  - * E-mail: (TM); (SY)
- Sungroh Yoon
  - Electrical and Computer Engineering, Seoul National University, Seoul, Korea
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea
  - Neurology and Neurological Sciences, Stanford University, Stanford, California, United States of America
  - * E-mail: (TM); (SY)
- Tsachy Weissman
  - Electrical Engineering, Stanford University, Stanford, California, United States of America

37

Ahola V, Wahlberg N, Frilander MJ. Butterfly Genomics: Insights from the Genome of Melitaea cinxia. ANN ZOOL FENN 2017. [DOI: 10.5735/086.054.0123] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/19/2022]
Affiliation(s)
- Virpi Ahola
  - Department of Biosciences, P.O. Box 65, FI-00014 University of Helsinki, Finland
- Niklas Wahlberg
  - Department of Biology, Lund University, Sölvegatan 37, SE-223 62 Lund, Sweden
- Mikko J. Frilander
  - Institute of Biotechnology, P.O. Box 56, FI-00014 University of Helsinki, Finland

38

Salmela L, Walve R, Rivals E, Ukkonen E. Accurate self-correction of errors in long reads using de Bruijn graphs. Bioinformatics 2017; 33:799-806. [PMID: 27273673 PMCID: PMC5351550 DOI: 10.1093/bioinformatics/btw321] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Received: 03/19/2016] [Revised: 05/03/2016] [Accepted: 05/16/2016] [Indexed: 12/04/2022]
Abstract
Motivation: New long read sequencing technologies, like PacBio SMRT and Oxford NanoPore, can produce sequencing reads up to 50 000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilization of the reads in, e.g. de novo genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second generation sequencing technologies to correct the long reads. Results: We present an error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of k-mers, and second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments, the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75×, the throughput of the new method is at least 20% higher. Availability and Implementation: LoRMA is freely available at http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/. Contact: leena.salmela@cs.helsinki.fi.
Affiliation(s)
- Leena Salmela
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
- Riku Walve
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
- Eric Rivals
  - LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, Montpellier, France
- Esko Ukkonen
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland

39

Zhao L, Chen Q, Li W, Jiang P, Wong L, Li J. MapReduce for accurate error correction of next-generation sequencing data. Bioinformatics 2017; 33:3844-3851. [PMID: 28205674 DOI: 10.1093/bioinformatics/btx089] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 09/01/2016] [Accepted: 02/14/2017] [Indexed: 11/14/2022]
Affiliation(s)
- Liang Zhao
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
  - Taihe Hospital, Hubei University of Medicine, Hubei, China
- Qingfeng Chen
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Wencui Li
  - Taihe Hospital, Hubei University of Medicine, Hubei, China
- Peng Jiang
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Limsoon Wong
  - School of Computing, National University of Singapore, Singapore, Singapore
- Jinyan Li
  - Advanced Analytics Institute and Centre for Health Technologies, University of Technology Sydney, Broadway, NSW, Australia

40

Jue NK, Batta-Lona PG, Trusiak S, Obergfell C, Bucklin A, O'Neill MJ, O'Neill RJ. Rapid Evolutionary Rates and Unique Genomic Signatures Discovered in the First Reference Genome for the Southern Ocean Salp, Salpa thompsoni (Urochordata, Thaliacea). Genome Biol Evol 2016; 8:3171-3186. [PMID: 27624472 PMCID: PMC5174732 DOI: 10.1093/gbe/evw215] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Indexed: 02/07/2023]
Abstract
A preliminary genome sequence has been assembled for the Southern Ocean salp, Salpa thompsoni (Urochordata, Thaliacea). Despite the ecological importance of this species in Antarctic pelagic food webs and its potential role as an indicator of changing Southern Ocean ecosystems in response to climate change, no genomic resources are available for S. thompsoni or any closely related urochordate species. Using a multiple-platform, multiple-individual approach, we have produced a 318,767,936-bp genome sequence, covering >50% of the estimated 602 Mb (±173 Mb) genome size for S. thompsoni. Using a nonredundant set of predicted proteins, >50% (16,823) of sequences showed significant homology to known proteins and ∼38% (12,151) of the total protein predictions were associated with Gene Ontology functional information. We have generated 109,958 SNP variant and 9,782 indel predictions for this species, serving as a resource for future phylogenomic and population genetic studies. Comparing the salp genome to available assemblies for four other urochordates, Botryllus schlosseri, Ciona intestinalis, Ciona savignyi and Oikopleura dioica, we found that S. thompsoni shares the previously estimated rapid rates of evolution for these species. High mutation rates are thus independent of genome size, suggesting that rates of evolution >1.5 times that observed for vertebrates are a broad taxonomic characteristic of urochordates. Tests for positive selection implemented in PAML revealed a small number of genes with sites undergoing rapid evolution, including genes involved in ribosome biogenesis and metabolic and immune process that may be reflective of both adaptation to polar, planktonic environments as well as the complex life history of the salps. Finally, we performed an initial survey of small RNAs, revealing the presence of known, conserved miRNAs, as well as novel miRNA genes; unique piRNAs; and mature miRNA signatures for varying developmental stages. 
Collectively, these resources provide a genomic foundation supporting S. thompsoni as a model species for further examination of the exceptional rates and patterns of genomic evolution shown by urochordates. Additionally, genomic data will allow for the development of molecular indicators of key life history events and processes and afford new understandings and predictions of impacts of climate change on this key species of Antarctic pelagic ecosystems.
Affiliation(s)
- Nathaniel K Jue
  - Department of Molecular and Cell Biology, Institute for Systems Genomics, University of Connecticut, CT
  - Present address: School of Natural Sciences, California State University, Monterey Bay, CA
- Paola G Batta-Lona
  - Department of Marine Sciences, University of Connecticut, CT
  - Present address: Departamento de Biotecnologia Marina, CICESE, Ensenada, B.C. Mexico
- Sarah Trusiak
  - Department of Molecular and Cell Biology, Institute for Systems Genomics, University of Connecticut, CT
- Craig Obergfell
  - Department of Molecular and Cell Biology, Institute for Systems Genomics, University of Connecticut, CT
- Ann Bucklin
  - Department of Marine Sciences, University of Connecticut, CT
- Michael J O'Neill
  - Department of Molecular and Cell Biology, Institute for Systems Genomics, University of Connecticut, CT
- Rachel J O'Neill
  - Department of Molecular and Cell Biology, Institute for Systems Genomics, University of Connecticut, CT

41

From next-generation resequencing reads to a high-quality variant data set. Heredity (Edinb) 2016; 118:111-124. [PMID: 27759079 DOI: 10.1038/hdy.2016.102] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Received: 04/24/2016] [Revised: 09/03/2016] [Accepted: 09/06/2016] [Indexed: 12/11/2022]
Abstract
Sequencing has revolutionized biology by permitting the analysis of genomic variation at an unprecedented resolution. High-throughput sequencing is fast and inexpensive, making it accessible for a wide range of research topics. However, the produced data contain subtle but complex types of errors, biases and uncertainties that impose several statistical and computational challenges to the reliable detection of variants. To tap the full potential of high-throughput sequencing, a thorough understanding of the data produced as well as the available methodologies is required. Here, I review several commonly used methods for generating and processing next-generation resequencing data, discuss the influence of errors and biases together with their resulting implications for downstream analyses and provide general guidelines and recommendations for producing high-quality single-nucleotide polymorphism data sets from raw reads by highlighting several sophisticated reference-based methods representing the current state of the art.

42

Improved Efficiency and Reliability of NGS Amplicon Sequencing Data Analysis for Genetic Diagnostic Procedures Using AGSA Software. BIOMED RESEARCH INTERNATIONAL 2016; 2016:5623089. [PMID: 27656653 PMCID: PMC5021467 DOI: 10.1155/2016/5623089] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 04/25/2016] [Accepted: 06/28/2016] [Indexed: 11/23/2022]
Abstract
Screening for BRCA mutations in women with familial risk of breast or ovarian cancer is an ideal situation for high-throughput sequencing, providing large amounts of low cost data. However, the 454 (Roche) and Ion Torrent (Thermo Fisher) technologies produce homopolymer-associated indel errors, complicating their use in routine diagnostics. We developed software, named AGSA, which helps to detect false positive mutations in homopolymeric sequences. Seventy-two familial breast cancer cases were analysed in parallel by amplicon 454 pyrosequencing and Sanger dideoxy sequencing for genetic variations of the BRCA genes. All 565 variants detected by dideoxy sequencing were also detected by pyrosequencing. Furthermore, pyrosequencing detected 42 variants that were missed with the Sanger technique. Six amplicons contained homopolymer tracts in the coding sequence that were systematically misread by the software supplied by Roche. Read data plotted as histograms by AGSA software aided the analysis considerably and allowed validation of the majority of homopolymers. As an optimisation, an additional 250 patients were analysed using microfluidic amplification of regions of interest (Access Array Fluidigm) of the BRCA genes, followed by 454 sequencing and AGSA analysis. AGSA complements a complete line of high-throughput diagnostic sequence analysis, reducing time and costs while increasing reliability, notably for homopolymer tracts.

43

Akogwu I, Wang N, Zhang C, Gong P. A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis. Hum Genomics 2016; 10 Suppl 2:20. [PMID: 27461106 PMCID: PMC4965716 DOI: 10.1186/s40246-016-0068-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 01/17/2023]
Abstract
BACKGROUND Innumerable opportunities for new genomic research have been stimulated by advancement in high-throughput next-generation sequencing (NGS). However, the pitfall of NGS data abundance is the complication of distinction between true biological variants and sequence error alterations during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but independent evaluation of the impact of such dataset features as read length, genome size, and coverage depth on their performance is lacking. This comparative study aims to investigate the strength and weakness as well as limitations of some newest k-spectrum-based methods and to provide recommendations for users in selecting suitable methods with respect to specific NGS datasets. METHODS Six k-spectrum-based methods, i.e., Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel, were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10× to 120×), read length (36 to 100 bp), and genome size (4.6 to 143 MB). Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (i.e., true positives, false positive, false negative, recall, precision, gain, and F-score) for assessing the correction quality of each method. RESULTS Results from computational experiments indicate that Musket had the best overall performance across the spectra of examined variants reflected in the six datasets. The lowest accuracy of Musket (F-score = 0.81) occurred to a dataset with a medium read length (56 bp), a medium coverage (50×), and a small-sized genome (5.4 MB). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets. CONCLUSIONS This study demonstrates that various factors such as coverage depth, read length, and genome size may influence performance of individual k-spectrum-based error correction methods. 
Thus, efforts have to be paid in choosing appropriate methods for error correction of specific NGS datasets. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six testing datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diversified parameter settings (k-mer values and edit distances) and to compare them against other non-k-spectrum-based classes of error correction methods.
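The metrics named in the abstract above (recall, precision, gain, and F-score) can all be derived from the counts of true positives, false positives, and false negatives. The following is a minimal sketch and not ECET's implementation: the function name `correction_metrics` is hypothetical, and the gain formula, (TP - FP) / (TP + FN), is the definition commonly used in the read error correction literature, assumed here because the abstract does not state it.

```python
def correction_metrics(tp: int, fp: int, fn: int) -> dict:
    """Derive error-correction quality metrics from raw counts.

    tp: erroneous bases that were correctly fixed
    fp: correct bases that were wrongly changed
    fn: erroneous bases that were left uncorrected
    """
    # Precision: fraction of the changes made that were right.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of the true errors that were fixed.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Gain (assumed definition): net share of errors removed; it goes
    # negative when a tool introduces more errors than it removes.
    gain = (tp - fp) / (tp + fn) if (tp + fn) else 0.0
    # F-score: harmonic mean of precision and recall.
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "gain": gain,
        "f_score": f_score,
    }


if __name__ == "__main__":
    print(correction_metrics(tp=80, fp=20, fn=20))
```

With 80 true positives, 20 false positives, and 20 false negatives, precision, recall, and F-score each come out to 0.8 and gain to 0.6, which shows how gain penalizes miscorrections that the F-score alone does not expose.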
Affiliation(s)
- Isaac Akogwu
  - School of Computing, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
- Nan Wang
  - School of Computing, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
- Chaoyang Zhang
  - School of Computing, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
- Ping Gong
  - Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA

44

Miclotte G, Heydari M, Demeester P, Rombauts S, Van de Peer Y, Audenaert P, Fostier J. Jabba: hybrid error correction for long sequencing reads. Algorithms Mol Biol 2016; 11:10. [PMID: 27148393 PMCID: PMC4855726 DOI: 10.1186/s13015-016-0075-7] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 12/10/2015] [Accepted: 04/25/2016] [Indexed: 11/13/2022]
Abstract
Background: Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. Results: In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is the use of a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of MEMs in the context of third generation reads are presented. Conclusion: Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using a very low amount of CPU time. From this we conclude that pseudo alignment with MEMs is a fast and reliable method to map long highly erroneous sequences on a de Bruijn graph.

45

Sameith K, Roscito JG, Hiller M. Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly. Brief Bioinform 2016; 18:1-8. [PMID: 26868358 PMCID: PMC5221426 DOI: 10.1093/bib/bbw003] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 10/22/2015] [Revised: 01/02/2016] [Indexed: 11/13/2022]
Abstract
Next-generation sequencers such as Illumina can now produce reads up to 300 bp with high throughput, which is attractive for genome assembly. A first step in genome assembly is to computationally correct sequencing errors. However, correcting all errors in these longer reads is challenging. Here, we show that reads with remaining errors after correction often overlap repeats, where short erroneous k-mers occur in other copies of the repeat. We developed an iterative error correction pipeline that runs the previously published String Graph Assembler (SGA) in multiple rounds of k-mer-based correction with an increasing k-mer size, followed by a final round of overlap-based correction. By combining the advantages of small and large k-mers, this approach corrects more errors in repeats and minimizes the total amount of erroneous reads. We show that higher read accuracy increases contig lengths two to three times. We provide SGA-Iteratively Correcting Errors (https://github.com/hillerlab/IterativeErrorCorrection/) that implements iterative error correction by using modules from SGA.
Affiliation(s)
- Katrin Sameith
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
- Juliana G Roscito
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
- Michael Hiller
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
  - Corresponding author. Michael Hiller. Max Planck Institute of Molecular Cell Biology and Genetics & Max Planck Institute for the Physics of Complex Systems, 01307 Dresden, Germany. E-mail:

46

Alic AS, Tomas A, Medina I, Blanquer I. MuffinEc: Error correction for de Novo assembly via greedy partitioning and sequence alignment. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2015.09.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/25/2022]

47

Tong L, Yang C, Wu PY, Wang MD. Evaluating the impact of sequencing error correction for RNA-seq data with ERCC RNA spike-in controls. ... IEEE-EMBS INTERNATIONAL CONFERENCE ON BIOMEDICAL AND HEALTH INFORMATICS 2016; 2016:74-77. [PMID: 27532064 DOI: 10.1109/bhi.2016.7455838] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/11/2022]
Abstract
Sequencing errors are a major issue for several next-generation sequencing-based applications such as de novo assembly and single nucleotide polymorphism detection. Several error-correction methods have been developed to improve raw data quality. However, error-correction performance is hard to evaluate because of the lack of a ground truth. In this study, we propose a novel approach that uses ERCC RNA spike-in controls as the ground truth to facilitate error-correction performance evaluation. After aligning raw and corrected RNA-seq data, we characterized the quality of reads by three metrics: mismatch patterns (e.g., the substitution rate of A to C) of reads aligned with one mismatch, mismatch patterns of reads aligned with two mismatches, and the percentage increase of reads aligned to the reference. We observed that the mismatch patterns for reads aligned with one mismatch are significantly correlated between ERCC spike-ins and real RNA samples. Based on these observations, we conclude that ERCC spike-ins can serve as ground truths for error correction beyond their previous applications for validation of dynamic range and fold-change response. Also, the mismatch patterns for ERCC reads aligned with one mismatch can serve as a novel and reliable metric to evaluate the performance of error-correction tools.
Affiliation(s)
- Li Tong
  - Dept. of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA
- Cheng Yang
  - Dept. of Biomedical Engineering, Peking University, No.5 Yiheyuan Road Haidian District, Beijing, P.R. China 100871
- Po-Yen Wu
  - School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
- May D Wang
  - Dept. of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA

48

Angers-Loustau A, Petrillo M, Paracchini V, Kagkli DM, Rischitor PE, Puertas Gallardo A, Patak A, Querci M, Kreysa J. Towards Plant Species Identification in Complex Samples: A Bioinformatics Pipeline for the Identification of Novel Nuclear Barcode Candidates. PLoS One 2016; 11:e0147692. [PMID: 26807711 PMCID: PMC4725681 DOI: 10.1371/journal.pone.0147692] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 09/10/2015] [Accepted: 01/07/2016] [Indexed: 11/25/2022]
Abstract
Monitoring of the food chain to fight fraud and protect consumer health relies on the availability of methods to correctly identify the species present in samples, for which DNA barcoding is a promising candidate. The nuclear genome is a rich potential source of barcode targets, but has been relatively unexploited until now. Here, we show the development and use of a bioinformatics pipeline that processes available genome sequences to automatically screen large numbers of input candidates, identifies novel nuclear barcode targets and designs associated primer pairs, according to a specific set of requirements. We applied this pipeline to identify novel barcodes for plant species, a kingdom for which the currently available solutions are known to be insufficient. We tested one of the identified primer pairs and show its capability to correctly identify the plant species in simple and complex samples, validating the output of our approach.
Affiliation(s)
- Alexandre Angers-Loustau
  - Molecular Biology and Genomic Unit, Institute for Health and Consumer Protection, Joint Research Center, European Commission, Ispra, Italy
- Mauro Petrillo
  - Molecular Biology and Genomic Unit, Institute for Health and Consumer Protection, Joint Research Center, European Commission, Ispra, Italy
- Valentina Paracchini
  - Molecular Biology and Genomic Unit, Institute for Health and Consumer Protection, Joint Research Center, European Commission, Ispra, Italy
- Dafni M. Kagkli
  - Molecular Biology and Genomic Unit, Institute for Health and Consumer Protection, Joint Research Center, European Commission, Ispra, Italy
- Patricia E. Rischitor
  - Molecular Biology and Genomic Unit, Institute for Health and Consumer Protection, Joint Research Center, European Commission, Ispra, Italy
- Antonio Puertas Gallardo
  - Molecular Biology and Genomic Unit, Institute for Health and Consumer Protection, Joint Research Center, European Commission, Ispra, Italy
- Alex Patak
  - Molecular Biology and Genomic Unit, Institute for Health and Consumer Protection, Joint Research Center, European Commission, Ispra, Italy
- Maddalena Querci
  - Molecular Biology and Genomic Unit, Institute for Health and Consumer Protection, Joint Research Center, European Commission, Ispra, Italy
- Joachim Kreysa
  - Molecular Biology and Genomic Unit, Institute for Health and Consumer Protection, Joint Research Center, European Commission, Ispra, Italy

49

Park SJ, Saito-Adachi M, Komiyama Y, Nakai K. Advances, practice, and clinical perspectives in high-throughput sequencing. Oral Dis 2016; 22:353-64. [DOI: 10.1111/odi.12403] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 11/10/2015] [Revised: 11/16/2015] [Accepted: 11/16/2015] [Indexed: 01/06/2023]
Affiliation(s)
- S-J Park
  - Human Genome Center; The Institute of Medical Science; The University of Tokyo; Tokyo Japan
- M Saito-Adachi
  - Division of Cancer Genomics; National Cancer Center Research Institute; Tokyo Japan
- Y Komiyama
  - Human Genome Center; The Institute of Medical Science; The University of Tokyo; Tokyo Japan
- K Nakai
  - Human Genome Center; The Institute of Medical Science; The University of Tokyo; Tokyo Japan

50

Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novo stand-alone error correction methods for NGS data. WILEY INTERDISCIPLINARY REVIEWS: COMPUTATIONAL MOLECULAR SCIENCE 2016. [DOI: 10.1002/wcms.1239] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Indexed: 12/14/2022]
Affiliation(s)
- Andy S. Alic
  - Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
- David Ruzafa
  - Departamento de Química Física e Instituto de Biotecnología, Facultad de Ciencias; Universidad de Granada; Granada Spain
- Joaquin Dopazo
  - Department of Computational Genomics; Príncipe Felipe Research Centre (CIPF); Valencia Spain
  - CIBER de Enfermedades Raras (CIBERER); Valencia Spain
  - Functional Genomics Node (INB) at CIPF; Valencia Spain
- Ignacio Blanquer
  - Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
  - Biomedical Imaging Research Group GIBI 2; Polytechnic University Hospital La Fe; Valencia Spain