1
SoundharaPandiyan N, Alphonse CRW, Thanumalaya S, Vincent SGP, Kannan RR. Genome sequencing of Caridina pseudogracilirostris and its comparative analysis with malacostracan crustaceans. 3 Biotech 2024; 14:276. [PMID: 39464522 PMCID: PMC11499489 DOI: 10.1007/s13205-024-04121-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Accepted: 10/04/2024] [Indexed: 10/29/2024]
Abstract
Caridina pseudogracilirostris is commonly found in the brackish waters of the southwestern coastal regions of India. This study provides a comprehensive genomic investigation of the shrimp species C. pseudogracilirostris, offering insights into its genetic makeup, evolutionary dynamics, and functional annotations. The genomic DNA was isolated from tissue samples, sequenced using next-generation sequencing (NGS), and deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database (Accession No: PRJNA847710). De novo sequencing indicated a genome size of 1.31 Gbp with a low heterozygosity of about 0.81%. Repeat masking and annotation revealed that repetitive elements constitute 24.60% of the genome, with simple sequence repeats (SSRs) accounting for 7.26%. Gene prediction identified 14,101 genes, with functional annotations indicating involvement in critical biological processes such as development, cellular function, immunological responses, and reproduction. Furthermore, phylogenetic analysis revealed genomic links among Malacostraca species, indicating gene duplication as a strategy for genetic diversity and adaptation. C. pseudogracilirostris has 1,856 duplicated genes, reflecting a distinct genomic architecture and evolutionary strategy within the Malacostraca branch. These findings enhance our understanding of the genetic characteristics and evolutionary relationships of C. pseudogracilirostris, providing significant insights into the overall evolutionary dynamics of the Malacostraca group. Supplementary Information: The online version contains supplementary material available at 10.1007/s13205-024-04121-4.
Affiliation(s)
- NandhaGopal SoundharaPandiyan
- Centre for Molecular and Nanomedical Sciences, Centre for Nanoscience and Nanotechnology, School of Bio and Chemical Engineering, Sathyabama Institute of Science and Technology, Chennai, Tamil Nadu 600119 India
- Carlton Ranjith Wilson Alphonse
- Centre for Molecular and Nanomedical Sciences, Centre for Nanoscience and Nanotechnology, School of Bio and Chemical Engineering, Sathyabama Institute of Science and Technology, Chennai, Tamil Nadu 600119 India
- Rajaretinam Rajesh Kannan
- Department of Biotechnology, Sharda School of Engineering and Technology, Sharda University, Plot No, 32, 34, Knowledge Park III, Greater Noida, Uttar Pradesh 201306 India
2
Sami A, El-Metwally S, Rashad MZ. MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads. BMC Bioinformatics 2024; 25:61. [PMID: 38321434 PMCID: PMC10848413 DOI: 10.1186/s12859-024-05681-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 01/29/2024] [Indexed: 02/08/2024]
Abstract
BACKGROUND The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality, as errors become more prevalent. This creates a need for different approaches to detect and filter errors, and data quality assurance moves from the hardware space to the software preprocessing stages. RESULTS We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms. These models are trained on features extracted through the computation of Term Frequency-Inverse Document Frequency (TF-IDF) values from various datasets such as E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1 and Metriaclima zebra. Notably, Naive Bayes demonstrated robust performance across various datasets, displaying high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, tools like Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed rates above 99%. BFC and RECKONER exceeded 98%, while Fiona had 95.78%. For the Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For the Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also producing reasonable numbers of mapped reads to the reference genome.
CONCLUSIONS This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.
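The TF-IDF feature extraction mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that each read is treated as a "document" and its overlapping k-mers as "terms"; the abstract does not state the tokenization, the value of k, or the exact TF-IDF variant, so the function names, `k=3`, and the unsmoothed formula below are all illustrative assumptions, and no classifier is shown.

```python
import math
from collections import Counter


def kmer_tokens(read: str, k: int) -> list[str]:
    """Split a read into its overlapping k-mers (assumed tokenization)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def tfidf_vectors(reads: list[str], k: int = 3) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (k-mer -> weight) per read.

    Uses the plain, unsmoothed definition tf * log(N / df); this variant
    is an assumption, not necessarily the one used by MAC-ErrorReads.
    """
    docs = [Counter(kmer_tokens(r, k)) for r in reads]
    n_docs = len(docs)

    # Document frequency: in how many reads does each k-mer occur?
    df: Counter = Counter()
    for doc in docs:
        df.update(doc.keys())

    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append(
            {kmer: (count / total) * math.log(n_docs / df[kmer])
             for kmer, count in doc.items()}
        )
    return vectors


if __name__ == "__main__":
    for vec in tfidf_vectors(["ACGT", "ACGA"], k=3):
        print(vec)
```

In this toy input the k-mer `ACG` occurs in both reads and therefore gets weight zero, while the k-mers unique to one read get a positive weight; vectors of this kind could then be passed to any supervised classifier.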
Affiliation(s)
- Amira Sami
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
- Sara El-Metwally
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt.
- Biomedical Informatics Department, Faculty of Computer Science and Engineering, New Mansoura University, Gamasa, 35712, Egypt.
- M Z Rashad
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
3
Długosz M, Deorowicz S. Illumina reads correction: evaluation and improvements. Sci Rep 2024; 14:2232. [PMID: 38278837 PMCID: PMC11222498 DOI: 10.1038/s41598-024-52386-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Accepted: 01/18/2024] [Indexed: 01/28/2024]
Abstract
The paper focuses on the correction of Illumina WGS sequencing reads. We provide an extensive evaluation of the existing correctors. To this end, we measure the impact of the correction on variant calling (VC) as well as de novo assembly. The evaluation shows that, in selected cases, read correction improves the quality of the VC results. We also examine the algorithms' behaviour when processing Illumina NovaSeq reads, whose quality characteristics differ from those of older sequencers. We show that most of the algorithms are ready to cope with such reads. Finally, we introduce a new version of RECKONER, our read corrector, by optimizing it and equipping it with a new correction strategy. Currently, RECKONER can correct high-coverage human reads in less than 2.5 h, copes with two types of read errors (indels and substitutions), and uses a new correction verification technique based on two oligomer lengths.
Affiliation(s)
- Maciej Długosz
- Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100, Gliwice, Poland
- Sebastian Deorowicz
- Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100, Gliwice, Poland.
4
Cohen JI, Turgman-Cohen S. The Conservation Genetics of Iris lacustris (Dwarf Lake Iris), a Great Lakes Endemic. Plants (Basel, Switzerland) 2023; 12:2557. [PMID: 37447118 DOI: 10.3390/plants12132557] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/05/2023] [Revised: 05/26/2023] [Accepted: 07/03/2023] [Indexed: 07/15/2023]
Abstract
Iris lacustris, a northern Great Lakes endemic, is a rare species known from 165 occurrences across Lakes Michigan and Huron in the United States and Canada. Due to multiple factors, including habitat loss, lack of seed dispersal, patterns of reproduction, and forest succession, the species is threatened. Early population genetic studies using isozymes and allozymes recovered no to limited genetic variation within the species. To better explore genetic variation across the geographic range of I. lacustris and to identify units for conservation, we used tunable Genotyping-by-Sequencing (tGBS) with 171 individuals across 24 populations from Michigan and Wisconsin, and because the species is polyploid, we filtered the single nucleotide polymorphism (SNP) matrices using polyRAD to recognize diploid and tetraploid loci. Based on multiple population genetic approaches, we resolved three to four population clusters that are geographically structured across the range of the species. The species migrated from west to east across its geographic range, and minimal genetic exchange has occurred among populations. Four units for conservation are recognized, but nine adaptive units were identified, providing evidence for local adaptation across the geographic range of the species. Population genetic analyses with all, diploid, and tetraploid loci recovered similar results, which suggests that methods may be robust to variation in ploidy level.
Affiliation(s)
- James Isaac Cohen
- Department of Botany and Plant Ecology, Weber State University, 1415 Edvalson St., Dept. 2504, Ogden, UT 84408-2504, USA
- Salomon Turgman-Cohen
- E.S. Witchger School of Engineering, Marian University, 3200 Cold Spring Road, Indianapolis, IN 46222-1997, USA
5
Gordon JL, Oliva Chavez AS, Martinez D, Vachiery N, Meyer DF. Possible biased virulence attenuation in the Senegal strain of Ehrlichia ruminantium by ntrX gene conversion from an inverted segmental duplication. PLoS One 2023; 18:e0266234. [PMID: 36800354 PMCID: PMC9937504 DOI: 10.1371/journal.pone.0266234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2021] [Accepted: 03/16/2022] [Indexed: 02/18/2023]
Abstract
Ehrlichia ruminantium is a tick-borne intracellular pathogen of ruminants that causes heartwater, a disease present in sub-Saharan Africa, islands in the Indian Ocean and the Caribbean, inducing significant economic losses. At present, three avirulent strains of E. ruminantium (Gardel, Welgevonden and Senegal isolates) have been produced by a process of serial passaging in mammalian cells in vitro, but unfortunately their use as vaccines does not offer a large range of protection against other strains, possibly due to the genetic diversity present within the species. So far no genetic basis for virulence attenuation has been identified in any E. ruminantium strain that could offer targets to facilitate vaccine production. Virulence-attenuated Senegal strains have been produced twice independently, and require many fewer passages to attenuate than the other strains. We compared the genomes of a virulent and attenuated Senegal strain and identified a likely attenuator gene, ntrX, a global transcription regulator and member of a two-component system that is linked to environmental sensing. This gene has an inverted partial duplicate close to the parental gene that shows evidence of gene conversion in different E. ruminantium strains. The pseudogenisation of the gene in the avirulent Senegal strain occurred by gene conversion from the duplicate to the parent, transferring a 4 bp deletion which is unique to the Senegal strain partial duplicate amongst the wild isolates. We confirmed by RT-PCR that the ntrX gene is not expressed in the avirulent Senegal strain. The inverted duplicate structure combined with the 4 bp deletion in the Senegal strain can explain both the attenuation and the faster speed of attenuation in the Senegal strain relative to other strains of E. ruminantium. Our results identify ntrX as a promising target for the generation of attenuated strains of E. ruminantium by random or directed mutagenesis that could be used for vaccine production.
Affiliation(s)
- Jonathan L. Gordon
- CIRAD, UMR ASTRE, Petit-Bourg, Guadeloupe, France
- ASTRE, CIRAD, INRAe, Univ Montpellier, Montpellier, France
- Adela S. Oliva Chavez
- CIRAD, UMR ASTRE, Petit-Bourg, Guadeloupe, France
- ASTRE, CIRAD, INRAe, Univ Montpellier, Montpellier, France
- Damien F. Meyer
- CIRAD, UMR ASTRE, Petit-Bourg, Guadeloupe, France
- ASTRE, CIRAD, INRAe, Univ Montpellier, Montpellier, France
6
Cohen JI, Ruane LG. Conservation genetics of Phlox hirsuta, a serpentine endemic. Conserv Genet 2022. [DOI: 10.1007/s10592-022-01478-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
7
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
8
Tang T, Hutvagner G, Wang W, Li J. Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies. Brief Funct Genomics 2022; 21:387-398. [PMID: 35848773 DOI: 10.1093/bfgp/elac016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2022] [Revised: 06/10/2022] [Accepted: 06/14/2022] [Indexed: 11/14/2022]
Abstract
Next-Generation Sequencing has produced enormous amounts of short-read sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both de novo assembly and error correction methods utilize the overlaps between reads, a concern is whether the sequencing errors that degrade genome assemblies also affect the compression of NGS data. This work addresses two problems: how current error correction algorithms can enable the compression algorithms to make the sequence data much more compact, and whether the reads modified by the error-correction algorithms will lead to quality improvement for de novo contig assembly. As multiple sets of short reads are often produced by a single biomedical project in practice, we propose a graph-based method to reorder the files in the collection of multiple sets and then compress them simultaneously for a further compression improvement after error correction. We use examples to illustrate that accurate error correction algorithms can significantly reduce the number of mismatched nucleotides in reference-free compression, and hence can greatly improve the compression performance. Extensive tests on practical collections of multiple short-read sets confirm that the compression performance on the error-corrected data (with unchanged size) significantly outperforms that on the original data, and that the file reordering idea contributes further. The error correction on the original reads has also resulted in quality improvements of the genome assemblies, sometimes remarkably. However, how to combine appropriate error correction methods with an assembly algorithm so that the assembly performance is always significantly improved remains an open question.
Affiliation(s)
- Tao Tang
- Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia; School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210003, Jiangsu, China
- Gyorgy Hutvagner
- School of Biomedical Engineering, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
- Wenjian Wang
- School of Computer and Information Technology, Shanxi University, Shanxi Road, 030006, Shanxi, China
- Jinyan Li
- Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
9
Liu S, Koslicki D. CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices. Bioinformatics 2022; 38:i28-i35. [PMID: 35758788 PMCID: PMC9235470 DOI: 10.1093/bioinformatics/btac237] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/29/2022]
Abstract
Motivation K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time- and space-efficient. Results We derived the theoretical expression of the bias factor due to truncation and showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. Availability and implementation A Python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. Supplementary information Supplementary data are available at Bioinformatics online.
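The truncation idea described in this abstract can be sketched on exact k-mer sets. This is a simplified illustration, not the CMash implementation: the real method works on bottom-m sketches stored in a ternary search tree, whereas the hypothetical helpers below (`kmers`, `truncate_kmers`, `jaccard`, `containment`) use plain Python sets, and prefix truncation is assumed as the truncation rule.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def truncate_kmers(kmer_set: set[str], k: int) -> set[str]:
    """Derive a k-mer set from a stored k_max-mer set by keeping prefixes.

    This avoids re-scanning the sequence for every k, at the cost of
    missing the k-mers that start in the last (k_max - k) positions.
    """
    return {kmer[:k] for kmer in kmer_set}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set[str], b: set[str]) -> float:
    """Containment index |A ∩ B| / |A|: fraction of A found in B."""
    return len(a & b) / len(a) if a else 0.0


if __name__ == "__main__":
    seq = "ACGTTGCA"
    stored = kmers(seq, 5)                 # built once with k_max = 5
    approx = truncate_kmers(stored, 3)     # derived 3-mers
    exact = kmers(seq, 3)                  # recomputed 3-mers
    print(sorted(approx), sorted(exact))
    print(containment(approx, exact))
```

On this short sequence the truncated set is a strict subset of the exact 3-mer set because the k-mers near the end of the sequence are lost; on long sequences that loss is a small fraction of the total, which is consistent with the abstract's statement that the truncation bias is negligible in practice.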
Affiliation(s)
- Shaopeng Liu
- Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA
- David Koslicki
- Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA; Department of Computer Science and Engineering, Pennsylvania State University, State College, PA 16801, USA; Department of Biology, Pennsylvania State University, State College, PA 16801, USA
10
Kallenborn F, Cascitti J, Schmidt B. CARE 2.0: reducing false-positive sequencing error corrections using machine learning. BMC Bioinformatics 2022; 23:227. [PMID: 35698033 PMCID: PMC9195321 DOI: 10.1186/s12859-022-04754-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/03/2022] [Accepted: 05/30/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools. RESULTS We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders-of-magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 is able to achieve high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads CARE 2.0 generates only 1.2M false positives (FPs) (and 801.4M true positives (TPs)) at a highly competitive runtime while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data. CONCLUSION False-positive corrections can negatively influence down-stream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. 
Thus, higher-quality datasets are produced, which improves k-mer analysis and de-novo assembly on real-world datasets and demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE.
Affiliation(s)
- Felix Kallenborn
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany.
- Julian Cascitti
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
- Bertil Schmidt
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
11
Tandonnet S, Haq M, Turner A, Grana T, Paganopoulou P, Adams S, Dhawan S, Kanzaki N, Nuez I, Félix MA, Pires-daSilva A. De Novo Genome Assembly of Auanema Melissensis, a Trioecious Free-Living Nematode. J Nematol 2022; 54:20220059. [PMID: 36879950 PMCID: PMC9984802 DOI: 10.2478/jofnem-2022-0059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2022] [Indexed: 02/09/2023]
Abstract
Nematodes of the genus Auanema are interesting models for studying sex determination mechanisms because their populations consist of three sexual morphs (males, females, and hermaphrodites) and produce skewed sex ratios. Here, we introduce a new undescribed species of this genus, Auanema melissensis n. sp., together with its draft nuclear genome. This species is also trioecious and does not cross with the other described species A. rhodensis or A. freiburgensis. Similar to A. freiburgensis, A. melissensis' maternal environment influences the hermaphrodite versus female sex determination of the offspring. The genome of A. melissensis is ~60 Mb, containing 11,040 protein-coding genes and 8.07% of repeat sequences. Using the estimated ancestral chromosomal gene content (Nigon elements), it was possible to identify putative X chromosome scaffolds.
Affiliation(s)
- Sophie Tandonnet
- School of Life Sciences, University of Warwick, Coventry, CV4 7AL, UK
- Maairah Haq
- School of Life Sciences, University of Warwick, Coventry, CV4 7AL, UK
- Anisa Turner
- School of Life Sciences, University of Warwick, Coventry, CV4 7AL, UK
- Theresa Grana
- Department of Biological Sciences, University of Mary Washington, Fredericksburg, VA 22401, USA
- Sally Adams
- School of Life Sciences, University of Warwick, Coventry, CV4 7AL, UK
- Sandhya Dhawan
- School of Life Sciences, University of Warwick, Coventry, CV4 7AL, UK
- Natsumi Kanzaki
- Kansai Research Center, Forestry and Forest Products Research Institute, Fushimi, Kyoto 612-0855, Japan
- Isabelle Nuez
- Institut Jacques Monod, CNRS UMR7592, Université Paris-Diderot, 75013 Paris, France
- Marie-Anne Félix
- Institut Jacques Monod, CNRS UMR7592, Université Paris-Diderot, 75013 Paris, France
12
Schroeder A, Pallavicini A, Edomi P, Pansera M, Camatti E. Suitability of a dual COI marker for marine zooplankton DNA metabarcoding. Marine Environmental Research 2021; 170:105444. [PMID: 34399186 DOI: 10.1016/j.marenvres.2021.105444] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/22/2021] [Revised: 08/02/2021] [Accepted: 08/03/2021] [Indexed: 06/13/2023]
Abstract
As DNA metabarcoding has become an emerging tool for surveying biodiversity, including its application in legally binding assessments, reliable and efficient barcodes are required, especially for the highly diverse group of zooplankton. This study compares the efficiency of two mitochondrial COI barcodes based on the internal primers mlCOIintF and mlCOIintR, using mesozooplankton samples collected in a Mediterranean lagoon. Our results indicate that, after a slight adjustment, the mlCOIintR primer in combination with jdgLCO1490 (herein) performs very comparably to the much more widely used primer system mlCOIintF/jgHCO2198+dgHCO2198 in terms of taxonomic resolution, species detection and relative abundance measured as numbers of reads. Because this barcode is not suitable for some groups, such as Ctenophora, a combination of the two barcodes may be the best option, relying on the Folmer region in its entirety without the risk of losing information because of a limited primer match.
Affiliation(s)
- Anna Schroeder
- National Research Council, Institute of Marine Science (CNR ISMAR) Venice, Arsenale Tesa 104, Castello 2737/F, 30122, Venice, Italy; University of Trieste, Department of Life Sciences, Via Licio Giorgieri 5, 34127, Trieste, Italy.
- Alberto Pallavicini
- University of Trieste, Department of Life Sciences, Via Licio Giorgieri 5, 34127, Trieste, Italy; Stazione Zoologica Anton Dohrn, Villa Comunale, 80121, Naples, Italy.
- Paolo Edomi
- University of Trieste, Department of Life Sciences, Via Licio Giorgieri 5, 34127, Trieste, Italy.
- Marco Pansera
- National Research Council, Institute of Marine Science (CNR ISMAR) Venice, Arsenale Tesa 104, Castello 2737/F, 30122, Venice, Italy; Stazione Zoologica Anton Dohrn, Villa Comunale, 80121, Naples, Italy.
- Elisa Camatti
- National Research Council, Institute of Marine Science (CNR ISMAR) Venice, Arsenale Tesa 104, Castello 2737/F, 30122, Venice, Italy.
13
Kallenborn F, Hildebrandt A, Schmidt B. CARE: context-aware sequencing read error correction. Bioinformatics 2021; 37:889-895. [PMID: 32818262 DOI: 10.1093/bioinformatics/btaa738] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/08/2020] [Revised: 07/14/2020] [Accepted: 08/14/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes. RESULTS We present CARE, an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections, which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. AVAILABILITY AND IMPLEMENTATION CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
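The minhashing concept named in this abstract can be illustrated with a small sketch of read similarity estimation. It is a generic MinHash illustration, not CARE's implementation: the k-mer length, the number of hash functions, the use of salted BLAKE2b as the hash family, and all function names are assumptions made for this example.

```python
import hashlib


def kmers(seq: str, k: int) -> set[str]:
    """All distinct k-mers of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer: str, seed: int) -> int:
    """Seeded 64-bit hash of a k-mer (salted BLAKE2b, an arbitrary choice)."""
    digest = hashlib.blake2b(
        kmer.encode("ascii"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq: str, k: int = 4, num_hashes: int = 16) -> list[int]:
    """Signature = minimum hash value over the read's k-mers, per hash function."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("read is shorter than k")
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature entries estimates the k-mer Jaccard index."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    read_a = "ACGTACGTTGCAAC"
    read_b = "ACGTACGTTGCATC"   # one substitution relative to read_a
    sig_a = minhash_signature(read_a)
    sig_b = minhash_signature(read_b)
    print(estimate_jaccard(sig_a, sig_b))
```

Because signatures are short and fixed in length, reads that share signature entries can be looked up through hash tables instead of all-against-all comparison, which is the general reason minhashing suits similarity search over large read collections.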
Affiliation(s)
- Felix Kallenborn
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
- Andreas Hildebrandt
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
- Bertil Schmidt
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
14
Garcia-Garcia S, Cortese MF, Rodríguez-Algarra F, Tabernero D, Rando-Segura A, Quer J, Buti M, Rodríguez-Frías F. Next-generation sequencing for the diagnosis of hepatitis B: current status and future prospects. Expert Rev Mol Diagn 2021; 21:381-396. [PMID: 33880971 DOI: 10.1080/14737159.2021.1913055] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/26/2021] [Accepted: 03/31/2021] [Indexed: 02/07/2023]
Abstract
INTRODUCTION Hepatitis B virus (HBV) causes a complex and persistent infection with a major impact on patients' health. Viral-genome sequencing can provide valuable information for characterizing virus genotype, infection dynamics and drug and vaccine resistance. AREAS COVERED This article reviews the current literature to describe the next-generation sequencing progress that facilitated a more comprehensive study of HBV quasispecies in diagnosis and clinical monitoring. EXPERT OPINION HBV variability plays a key role in liver disease progression and treatment efficacy. Second-generation sequencing improved the sensitivity for detecting and quantifying mutations, mixed genotypes and viral recombination. Third-generation sequencing enables the analysis of the entire HBV genome, although the high error rate limits its use in clinical practice.
Affiliation(s)
- Selene Garcia-Garcia
  - Liver Pathology Unit, Departments of Biochemistry and Microbiology, Hospital Universitari Vall d'Hebron, Universitat Autònoma De Barcelona, Barcelona, Spain
  - Clinical Biochemistry Research Group, Vall d'Hebron Institut Recerca (VHIR), Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
- Maria Francesca Cortese
  - Liver Pathology Unit, Departments of Biochemistry and Microbiology, Hospital Universitari Vall d'Hebron, Universitat Autònoma De Barcelona, Barcelona, Spain
  - Clinical Biochemistry Research Group, Vall d'Hebron Institut Recerca (VHIR), Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
- Francisco Rodríguez-Algarra
  - Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, UK
- David Tabernero
  - Centro De Investigación Biomédica En Red De Enfermedades Hepáticas Y Digestivas, Instituto De Salud Carlos III, Madrid, Spain
- Ariadna Rando-Segura
  - Liver Pathology Unit, Departments of Biochemistry and Microbiology, Hospital Universitari Vall d'Hebron, Universitat Autònoma De Barcelona, Barcelona, Spain
- Josep Quer
  - Centro De Investigación Biomédica En Red De Enfermedades Hepáticas Y Digestivas, Instituto De Salud Carlos III, Madrid, Spain
  - Liver Unit, Liver Disease Laboratory-Viral Hepatitis, Vall d'Hebron Institut Recerca-Hospital Universitari Vall d'Hebron, Universitat Autònoma De Barcelona, Barcelona, Spain
- Maria Buti
  - Centro De Investigación Biomédica En Red De Enfermedades Hepáticas Y Digestivas, Instituto De Salud Carlos III, Madrid, Spain
  - Liver Unit, Department of Internal Medicine, Hospital Universitari Vall d'Hebron, Universitat Autònoma De Barcelona, Barcelona, Spain
- Francisco Rodríguez-Frías
  - Liver Pathology Unit, Departments of Biochemistry and Microbiology, Hospital Universitari Vall d'Hebron, Universitat Autònoma De Barcelona, Barcelona, Spain
  - Clinical Biochemistry Research Group, Vall d'Hebron Institut Recerca (VHIR), Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
  - Centro De Investigación Biomédica En Red De Enfermedades Hepáticas Y Digestivas, Instituto De Salud Carlos III, Madrid, Spain
|
15
|
Heo Y, Manikandan G, Ramachandran A, Chen D. Comprehensive Evaluation of Error-Correction Methodologies for Genome Sequencing Data. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
|
16
|
Mitchell K, Brito JJ, Mandric I, Wu Q, Knyazev S, Chang S, Martin LS, Karlsberg A, Gerasimov E, Littman R, Hill BL, Wu NC, Yang HT, Hsieh K, Chen L, Littman E, Shabani T, Enik G, Yao D, Sun R, Schroeder J, Eskin E, Zelikovsky A, Skums P, Pop M, Mangul S. Benchmarking of computational error-correction methods for next-generation sequencing data. Genome Biol 2020; 21:71. [PMID: 32183840 PMCID: PMC7079412 DOI: 10.1186/s13059-020-01988-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 05/18/2019] [Accepted: 03/06/2020] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown. RESULTS In this paper, we evaluate the ability of error correction algorithms to fix errors across different types of datasets that contain various levels of heterogeneity. We highlight the advantages and limitations of computational error correction techniques across different domains of biology, including immunogenomics and virology. To demonstrate the efficacy of our technique, we apply the UMI-based high-fidelity sequencing protocol to eliminate sequencing errors from both simulated data and the raw reads. We then perform a realistic evaluation of error-correction methods. CONCLUSIONS In terms of accuracy, we find that method performance varies substantially across different types of datasets with no single method performing best on all types of examined data. Finally, we also identify the techniques that offer a good balance between precision and sensitivity.
Affiliation(s)
- Keith Mitchell
  - Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Jaqueline J Brito
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
- Igor Mandric
  - Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
  - Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- Qiaozhen Wu
  - Department of Mathematics, University of California Los Angeles, 520 Portola Plaza, Los Angeles, CA, 90095, USA
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- Sei Chang
  - Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Lana S Martin
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
- Aaron Karlsberg
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
- Ekaterina Gerasimov
  - Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- Russell Littman
  - UCLA Bioinformatics, 621 Charles E Young Dr S, Los Angeles, CA, 90024, USA
- Brian L Hill
  - Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Nicholas C Wu
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Harry Taegyun Yang
  - Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Kevin Hsieh
  - Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Linus Chen
  - Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Eli Littman
  - Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Taylor Shabani
  - Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- German Enik
  - Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Douglas Yao
  - Department of Molecular, Cell, and Developmental Biology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
- Ren Sun
  - Department of Molecular and Medical Pharmacology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
- Jan Schroeder
  - Epigenetics & Reprogramming Laboratory, Monash University, 15 Innovation Walk, Melbourne, VIC, 3800, Australia
- Eleazar Eskin
  - Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
  - The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, Russia, 119991
- Pavel Skums
  - Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- Mihai Pop
  - Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20742, USA
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
|
17
|
Quantitative Trait Loci (QTL) Analysis of Fruit and Agronomic Traits of Tropical Pumpkin (Cucurbita moschata) in an Organic Production System. HORTICULTURAE 2020. [DOI: 10.3390/horticulturae6010014] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
Abstract
Interest in the development of organically grown vegetable crops has risen over the past decades due to consumer preferences. However, most crops that have desirable consumer traits have been bred in conventional growing conditions, and their transfer to an organic setting is challenging. Here, the organically grown Hawaiian pumpkin (Cucurbita moschata) accession ‘Shima’ was crossed with the conventionally grown Puerto Rican variety ‘Taina Dorada’ to develop a backcross (BC1) population, where ‘Shima’ was the recurrent parent. A total of 202 BC1 (‘Shima’ X F1) progenies were planted in a certified organic field, and twelve traits were evaluated. We used genotyping-by-sequencing (GBS) to identify the Quantitative Trait Loci (QTL) associated with insect tolerance along with commercially desirable traits. A total of 1582 single nucleotide polymorphisms (SNPs) were identified, from which 711 SNPs were used to develop a genetic map and perform QTL mapping. Reads associated with significant QTLs were aligned to the publicly available Cucurbita moschata genome, which identified several markers linked to genes that have been previously reported to be associated with the corresponding trait in other crop systems, such as melon (Cucumis melo L.). This research provides a resource for marker-assisted selection (MAS) efforts in Cucurbita moschata, as well as serving as a model study to improve cultivars that are transitioning from a conventional to an organic setting.
|
18
|
Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models. Sci Rep 2019; 9:16157. [PMID: 31695060 PMCID: PMC6834855 DOI: 10.1038/s41598-019-52196-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 04/01/2019] [Accepted: 10/07/2019] [Indexed: 01/30/2023] Open
Abstract
The performance of most error-correction (EC) algorithms that operate on genomics reads is dependent on the proper choice of its configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction and consequently improve genome assembly. We perform this in an adaptive manner, adapted to different datasets and to EC tools, due to the observation that different configuration parameters are optimal for different datasets, i.e., from different platforms and species, and vary with the EC algorithm being applied. We use language modeling techniques from the Natural Language Processing (NLP) domain in our algorithmic suite, Athena, to automatically tune the performance-sensitive configuration parameters. Through the use of N-Gram and Recurrent Neural Network (RNN) language modeling, we validate the intuition that the EC performance can be computed quantitatively and efficiently using the “perplexity” metric, repurposed from NLP. After training the language model, we show that the perplexity metric calculated from a sample of the test (or production) data has a strong negative correlation with the quality of error correction of erroneous NGS reads. Therefore, we use the perplexity metric to guide a hill climbing-based search, converging toward the best configuration parameter value. Our approach is suitable for both de novo and comparative sequencing (resequencing), eliminating the need for a reference genome to serve as the ground truth. We find that Athena can automatically find the optimal value of k with a very high accuracy for 7 real datasets and using 3 different k-mer based EC algorithms, Lighter, Blue, and Racer. 
The inverse relation between the perplexity metric and alignment rate exists under all our tested conditions: for real and synthetic datasets, for all kinds of sequencing errors (insertion, deletion, and substitution), and for high and low error rates. The absolute value of that correlation is at least 73%. In our experiments, the best value of k found by Athena achieves an alignment rate within 0.53% of the oracle best value of k found through brute force searching (i.e., scanning through the entire range of k values). Athena’s selected value of k lies within the top-3 best k values using N-Gram models and the top-5 best k values using RNN models. With the best parameter selection by Athena, the assembly quality (NG50) is improved by a geometric mean of 4.72X across the 7 real datasets.
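To make the perplexity metric concrete, the following Python sketch trains a character-level N-Gram model on reads and scores other reads with it. It is a minimal illustration, not Athena's code: the trigram order, the add-one smoothing, the four-letter alphabet, and the function names `train_ngram` and `perplexity` are assumptions made here.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumed alphabet; ambiguous bases such as N are ignored here


def train_ngram(reads, n=3):
    """Count n-grams and their (n-1)-symbol contexts over the training reads."""
    ngram_counts, context_counts = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            gram = read[i:i + n]
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts


def perplexity(reads, model, n=3):
    """Perplexity of the reads under the model, with add-one smoothing."""
    ngram_counts, context_counts = model
    log_prob, total = 0.0, 0
    for read in reads:
        for i in range(len(read) - n + 1):
            gram = read[i:i + n]
            p = (ngram_counts[gram] + 1) / (context_counts[gram[:-1]] + len(ALPHABET))
            log_prob += math.log2(p)
            total += 1
    # Assumes at least one n-gram was scored (reads no shorter than n).
    return 2 ** (-log_prob / total)
```

Under this toy model, a read containing n-grams that were rare or absent in the training reads receives a higher perplexity than a read that matches them, which is the direction of the relation the abstract reports.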
|
19
|
Heydari M, Miclotte G, Van de Peer Y, Fostier J. Illumina error correction near highly repetitive DNA regions improves de novo genome assembly. BMC Bioinformatics 2019; 20:298. [PMID: 31159722 PMCID: PMC6545690 DOI: 10.1186/s12859-019-2906-2] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 01/15/2019] [Accepted: 05/17/2019] [Indexed: 11/10/2022] Open
Abstract
Background Several standalone error correction tools have been proposed to correct sequencing errors in Illumina data in order to facilitate de novo genome assembly. However, in a recent survey, we showed that state-of-the-art assemblers often did not benefit from this pre-correction step. We found that many error correction tools introduce new errors in reads that overlap highly repetitive DNA regions such as low-complexity patterns or short homopolymers, ultimately leading to a more fragmented assembly. Results We propose BrownieCorrector, an error correction tool for Illumina sequencing data that focuses on the correction of only those reads that overlap short DNA patterns that are highly repetitive in the genome. BrownieCorrector extracts all reads that contain such a pattern and clusters them into different groups using a community detection algorithm that takes into account both the sequence similarity between overlapping reads and their respective paired-end reads. Each cluster holds reads that originate from the same genomic region and hence each cluster can be corrected individually, thus providing a consistent correction for all reads within that cluster. Conclusions BrownieCorrector is benchmarked using six real Illumina datasets for different eukaryotic genomes. The prior use of BrownieCorrector improves assembly results over the use of uncorrected reads in all cases. In comparison with other error correction tools, BrownieCorrector leads to the best assembly results in most cases even though less than 2% of the reads within a dataset are corrected. Additionally, we investigate the impact of error correction on hybrid assembly where the corrected Illumina reads are supplemented with PacBio data. Our results confirm that BrownieCorrector improves the quality of hybrid genome assembly as well. BrownieCorrector is written in standard C++11 and released under GPL license. 
BrownieCorrector relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at https://github.com/biointec/browniecorrector. Electronic supplementary material The online version of this article (10.1186/s12859-019-2906-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Mahdi Heydari
  - Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium
  - Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
- Giles Miclotte
  - Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium
  - Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
- Yves Van de Peer
  - Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
  - Center for Plant Systems Biology, VIB, Ghent, B-9052, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, B-9052, Belgium
  - Department of Genetics, Genome Research Institute, University of Pretoria, Pretoria, South Africa
- Jan Fostier
  - Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium
  - Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
|
20
|
Chromosome-Wide Evolution and Sex Determination in the Three-Sexed Nematode Auanema rhodensis. G3-GENES GENOMES GENETICS 2019; 9:1211-1230. [PMID: 30770412 PMCID: PMC6469403 DOI: 10.1534/g3.119.0011] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Indexed: 12/31/2022]
Abstract
Trioecy, a mating system in which males, females and hermaphrodites co-exist, is a useful system to investigate the origin and maintenance of alternative mating strategies. In the trioecious nematode Auanema rhodensis, males have one X chromosome (XO), whereas females and hermaphrodites have two (XX). The female vs. hermaphrodite sex determination mechanisms have remained elusive. In this study, RNA-seq analyses show a 20% difference between the L2 hermaphrodite and female gene expression profiles. RNAi experiments targeting the DM (doublesex/mab-3) domain transcription factor dmd-10/11 suggest that the hermaphrodite sexual fate requires the upregulation of this gene. The genetic linkage map (GLM) shows that there is chromosome-wide heterozygosity for the X chromosome in F2 hermaphrodite-derived lines originated from crosses between two parental inbred strains. These results confirm the lack of recombination of the X chromosome in hermaphrodites, as previously reported. We also describe conserved chromosome elements (Nigon elements), which have been mostly maintained throughout the evolution of Rhabditina nematodes. The seven-chromosome karyotype of A. rhodensis, instead of the typical six found in other rhabditine species, derives from fusion/rearrangement events involving three Nigon elements. The A. rhodensis X chromosome is the smallest and most polymorphic, with the smallest proportion of conserved genes. This may reflect its atypical mode of father-to-son transmission and its lack of recombination in hermaphrodites and males. In conclusion, this study provides a framework for studying the evolution of chromosomes in rhabditine nematodes, as well as possible mechanisms for the sex determination in a three-sexed species.
|
21
|
Young MK, Smith RJ, Pilgrim KL, Fairchild MP, Schwartz MK. Integrative taxonomy refutes a species hypothesis: The asymmetric hybrid origin of Arsapnia arapahoe (Plecoptera, Capniidae). Ecol Evol 2019; 9:1364-1377. [PMID: 30805166 PMCID: PMC6374720 DOI: 10.1002/ece3.4852] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 05/14/2018] [Revised: 11/02/2018] [Accepted: 11/29/2018] [Indexed: 11/23/2022] Open
Abstract
Molecular tools are commonly directed at refining taxonomies and the species that constitute their fundamental units. This has been especially insightful for groups for which species hypotheses are ambiguous and have largely been based on morphological differences between certain life stages or sexes, and has added importance when taxa are a focus of conservation efforts. Here, we examine the taxonomic status of Arsapnia arapahoe, a winter stonefly in the family Capniidae that is a species of conservation concern because of its limited abundance and restricted range in northern Colorado, USA. Phylogenetic analyses of sequences of mitochondrial and nuclear genes of this and other capniid stoneflies from this region and elsewhere in western North America indicated extensive haplotype sharing, limited genetic differences, and a lack of reciprocal monophyly between A. arapahoe and the sympatric A. decepta, despite distinctive and consistent morphological differences in the sexual apparatus of males of both species. Analyses of autosomal and sex-linked single nucleotide polymorphisms detected using genotyping by sequencing indicated that all individuals of A. arapahoe consisted of F1 hybrids between female A. decepta and males of another sympatric stonefly, Capnia gracilaria. Rather than constitute a self-sustaining evolutionary lineage, A. arapahoe appears to represent the product of nonintrogressive hybridization in the limited area of syntopy between two widely distributed taxa. This offers a cautionary tale for taxonomists and conservation biologists working on the less-studied components of the global fauna.
Affiliation(s)
- Michael K. Young
  - U.S. Forest Service, Rocky Mountain Research Station, National Genomics Center for Wildlife and Fish Conservation, Missoula, Montana
- Rebecca J. Smith
  - U.S. Forest Service, Rocky Mountain Research Station, National Genomics Center for Wildlife and Fish Conservation, Missoula, Montana
- Kristine L. Pilgrim
  - U.S. Forest Service, Rocky Mountain Research Station, National Genomics Center for Wildlife and Fish Conservation, Missoula, Montana
- Michael K. Schwartz
  - U.S. Forest Service, Rocky Mountain Research Station, National Genomics Center for Wildlife and Fish Conservation, Missoula, Montana
|
22
|
Ershov V, Tarasov A, Lapidus A, Korobeynikov A. IonHammer: Homopolymer-Space Hamming Clustering for IonTorrent Read Error Correction. J Comput Biol 2019; 26:124-127. [DOI: 10.1089/cmb.2018.0152] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022] Open
Affiliation(s)
- Vasily Ershov
  - Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia
- Artem Tarasov
  - European Molecular Biology Laboratory, Heidelberg, Germany
- Alla Lapidus
  - Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Anton Korobeynikov
  - Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia
  - Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
|
23
|
Zhao L, Xie J, Bai L, Chen W, Wang M, Zhang Z, Wang Y, Zhao Z, Li J. Mining statistically-solid k-mers for accurate NGS error correction. BMC Genomics 2018; 19:912. [PMID: 30598110 PMCID: PMC6311904 DOI: 10.1186/s12864-018-5272-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND NGS data contain many machine-induced errors. The most advanced methods for error correction depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer frequently occurring in NGS reads. The other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors into the remaining set of solid k-mers. Identification of these two subsets of k-mers can improve the correction performance. RESULTS We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and combine them to determine f0. To identify the two special subsets of k-mers, we use the z-score of k-mers, which measures the number of standard deviations a k-mer's frequency is from the mean. Then these statistically-solid k-mers are used to construct a Bloom filter for error correction. Our method is markedly superior to the state-of-the-art methods, tested on both real and synthetic NGS data sets. CONCLUSION The z-score is adequate to distinguish solid k-mers from weak k-mers, and is particularly useful for pinpointing solid k-mers that have very low frequency. Applying the z-score to k-mers can markedly improve the error correction accuracy.
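The z-score described in this abstract can be sketched in Python as follows. This is only an illustration of the arithmetic: the abstract does not say over which population the mean and standard deviation are taken, so the sketch assumes they are computed over the counts of all distinct k-mers, and the cutoff in `split_solid_weak` is a hypothetical parameter. The Gamma and Gaussian mixture fit for f0 and the Bloom filter are not shown.

```python
from collections import Counter
from statistics import mean, pstdev


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_zscores(counts):
    """z = (frequency - mean) / standard deviation, over all distinct k-mers (assumption)."""
    freqs = list(counts.values())
    mu, sigma = mean(freqs), pstdev(freqs)
    if sigma == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: (f - mu) / sigma for kmer, f in counts.items()}


def split_solid_weak(counts, z_cutoff=0.0):
    """Label k-mers solid or weak by comparing their z-score to a hypothetical cutoff."""
    z = kmer_zscores(counts)
    solid = {kmer for kmer, score in z.items() if score >= z_cutoff}
    weak = set(counts) - solid
    return solid, weak
```

In a toy dataset where one read carries a single substitution, the k-mers spanning the error occur once and fall well below the mean, so they land in the weak set.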
Affiliation(s)
- Liang Zhao
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Jin Xie
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Lin Bai
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Wen Chen
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Mingju Wang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Zhonglei Zhang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Yiqi Wang
  - Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China
- Zhe Zhao
  - School of Computing and Electronic Information, Guangxi University, Nanning, China
- Jinyan Li
  - Advanced Analytics Institute, Faculty of Engineering & IT, University of Technology Sydney, NSW 2007, Australia
|
24
|
Moreno R, Castro P, Vrána J, Kubaláková M, Cápal P, García V, Gil J, Millán T, Doležel J. Integration of Genetic and Cytogenetic Maps and Identification of Sex Chromosome in Garden Asparagus (Asparagus officinalis L.). FRONTIERS IN PLANT SCIENCE 2018; 9:1068. [PMID: 30108600 PMCID: PMC6079222 DOI: 10.3389/fpls.2018.01068] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 04/24/2018] [Accepted: 07/02/2018] [Indexed: 05/30/2023]
Abstract
A genetic linkage map of dioecious garden asparagus (Asparagus officinalis L., 2n = 2x = 20) was constructed using an F1 population and simple sequence repeat (SSR) and single nucleotide polymorphism (SNP) markers. In total, 1376 SNPs and 27 SSRs were used for genetic mapping. Two resulting parental maps contained 907 and 678 markers spanning 1947 and 1814 cM, for the female and male parents, respectively, over ten linkage groups representing ten haploid chromosomes of the species. With the aim of anchoring the ten genetic linkage groups to individual chromosomes and developing a tool to facilitate genome analysis and gene cloning, we have optimized a protocol for flow cytometric chromosome analysis and sorting in asparagus. The analysis of DAPI-stained suspensions of intact mitotic chromosomes by flow cytometry resulted in histograms of relative fluorescence intensity (flow karyotypes) comprising eight major peaks. The analysis of chromosome morphology and localization of 5S and 45S rDNA by FISH on flow-sorted chromosomes revealed that four chromosomes (IV, V, VI, VIII) could be discriminated and sorted. Seventy-two SSR markers were used to characterize chromosome content of individual peaks on the flow karyotype. Of these, 27 were included in the genetic linkage map and anchored genetic linkage groups to chromosomes. The sex determining locus was located on LG5, which was associated with peak V representing a chromosome with 5S rDNA locus. The results obtained in this study will support asparagus improvement by facilitating targeted marker development and gene isolation using flow-sorted chromosomes.
Affiliation(s)
- Roberto Moreno
  - Department of Genetics-ETSIAM, University of Córdoba, Córdoba, Spain
- Patricia Castro
  - Department of Genetics-ETSIAM, University of Córdoba, Córdoba, Spain
- Jan Vrána
  - Institute of Experimental Botany, Centre of the Region Haná for Biotechnological and Agricultural Research, Olomouc, Czechia
- Marie Kubaláková
  - Institute of Experimental Botany, Centre of the Region Haná for Biotechnological and Agricultural Research, Olomouc, Czechia
- Petr Cápal
  - Institute of Experimental Botany, Centre of the Region Haná for Biotechnological and Agricultural Research, Olomouc, Czechia
- Verónica García
  - Department of Genetics-ETSIAM, University of Córdoba, Córdoba, Spain
- Juan Gil
  - Department of Genetics-ETSIAM, University of Córdoba, Córdoba, Spain
- Teresa Millán
  - Department of Genetics-ETSIAM, University of Córdoba, Córdoba, Spain
- Jaroslav Doležel
  - Institute of Experimental Botany, Centre of the Region Haná for Biotechnological and Agricultural Research, Olomouc, Czechia
|
25
|
Cheng H, Wu M, Xu Y. FMtree: a fast locating algorithm of FM-indexes for genomic data. Bioinformatics 2018; 34:416-424. [PMID: 28968761 DOI: 10.1093/bioinformatics/btx596] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 04/18/2017] [Accepted: 09/16/2017] [Indexed: 11/15/2022] Open
Abstract
Motivation As a fundamental task in bioinformatics, searching for massive short patterns over a long text has been accelerated by various compressed full-text indexes. These indexes are able to provide similar searching functionalities to classical indexes, e.g. suffix trees and suffix arrays, while requiring less space. For genomic data, a well-known family of compressed full-text indexes, called FM-indexes, presents unmatched performance in practice. One major drawback of FM-indexes is that their locating operations, which report all occurrence positions of patterns in a given text, are not efficient, especially for the patterns with many occurrences. Results In this paper, we introduce a novel locating algorithm, FMtree, to quickly retrieve all occurrence positions of any pattern via FM-indexes. When searching for a pattern over a given text, FMtree organizes the search space of the locating operation into a conceptual multiway tree. As a result, multiple occurrence positions of this pattern can be retrieved simultaneously by traversing the multiway tree. Compared with existing locating algorithms, our tree-based algorithm reduces large numbers of redundant operations and presents better data locality. Experimental results show that FMtree is usually one order of magnitude faster than the state-of-the-art algorithms, and still memory-efficient. Availability and implementation FMtree is freely available at https://github.com/chhylp123/FMtree. Contact xuyun@ustc.edu.cn. Supplementary information Supplementary data are available at Bioinformatics online.
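For orientation, the conventional locating operation that this abstract describes as inefficient can be sketched in Python. This is a textbook-style toy, not FMtree: the multiway-tree traversal is not reproduced, the index is built naively by sorting suffixes, and the names `build_fm_index`, `backward_search`, `locate`, and the `sample_rate` parameter are assumptions for illustration. Each occurrence is resolved separately by walking LF-mapping steps until a sampled suffix-array row is reached.

```python
def build_fm_index(text, sample_rate=4):
    """Naive FM-index: BWT, C table, occurrence table, sampled suffix array."""
    text += "$"  # sentinel; assumed absent from text and smaller than every symbol
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    C, total = {}, 0  # C[c] = number of symbols in text strictly smaller than c
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}  # occ[c][i] = count of c in bwt[:i]
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    # Keep only suffix-array entries whose text position is a multiple of sample_rate.
    sampled = {row: pos for row, pos in enumerate(sa) if pos % sample_rate == 0}
    return bwt, C, occ, sampled


def backward_search(pattern, index):
    """Return the half-open range of suffix-array rows prefixed by pattern."""
    bwt, C, occ, _ = index
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0, 0
        lo, hi = C[c] + occ[c][lo], C[c] + occ[c][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def locate(pattern, index):
    """Resolve each matching row to a text position, one LF-walk per occurrence."""
    bwt, C, occ, sampled = index
    lo, hi = backward_search(pattern, index)
    positions = []
    for row in range(lo, hi):
        r, steps = row, 0
        while r not in sampled:
            c = bwt[r]
            r = C[c] + occ[c][r]  # LF-mapping: row of the suffix starting one position earlier
            steps += 1
        positions.append(sampled[r] + steps)
    return sorted(positions)
```

Because every occurrence triggers its own walk, patterns with many occurrences repeat similar work, which is the redundancy the abstract says the tree-based organization reduces.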
Affiliation(s)
- Haoyu Cheng
  - School of Computer Science, University of Science and Technology of China, Hefei, Anhui 230027, China
  - Key Laboratory on High Performance Computing, Anhui Province
  - Collaborative Innovation Center of High Performance Computing, National University of Defense Technology, Changsha 410073, China
- Ming Wu
  - School of Computer Science, University of Science and Technology of China, Hefei, Anhui 230027, China
  - Key Laboratory on High Performance Computing, Anhui Province
- Yun Xu
  - School of Computer Science, University of Science and Technology of China, Hefei, Anhui 230027, China
  - Key Laboratory on High Performance Computing, Anhui Province
  - Collaborative Innovation Center of High Performance Computing, National University of Defense Technology, Changsha 410073, China
|
26
|
Huang YT, Huang YW. An efficient error correction algorithm using FM-index. BMC Bioinformatics 2017; 18:524. [PMID: 29179672 PMCID: PMC5704532 DOI: 10.1186/s12859-017-1940-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/09/2017] [Accepted: 11/14/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND High-throughput sequencing offers higher throughput and lower cost for sequencing a genome. However, sequencing errors, including mismatches and indels, may be produced during sequencing. Because errors may reduce the accuracy of subsequent de novo assembly, error correction is necessary prior to assembly. However, existing correction methods still face trade-offs among correction power, accuracy, and speed. RESULTS We develop a novel overlap-based error correction algorithm using FM-index (called FMOE). FMOE first identifies overlapping reads by aligning a query read simultaneously against multiple reads compressed by FM-index. Subsequently, sequencing errors are corrected by k-mer voting from overlapping reads only. The experimental results indicate that FMOE has the highest correction power with comparable accuracy and speed. Our algorithm performs better on long-read than on short-read datasets when compared with others. The assembly results indicate that each algorithm has its own strengths and weaknesses, whereas FMOE is good for long or good-quality reads. CONCLUSIONS FMOE is freely available at https://github.com/ythuang0522/FMOC.
Affiliation(s)
- Yao-Ting Huang
  - Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan
- Yu-Wen Huang
  - Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan
|
27
|
Reinert K, Dadi TH, Ehrhardt M, Hauswedell H, Mehringer S, Rahn R, Kim J, Pockrandt C, Winkler J, Siragusa E, Urgese G, Weese D. The SeqAn C++ template library for efficient sequence analysis: A resource for programmers. J Biotechnol 2017; 261:157-168. [PMID: 28888961 DOI: 10.1016/j.jbiotec.2017.07.017] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.0] [Received: 02/26/2017] [Revised: 07/17/2017] [Accepted: 07/19/2017] [Indexed: 11/27/2022]
Abstract
BACKGROUND The use of novel algorithmic techniques is pivotal to many important problems in life science. For example, the sequencing of the human genome (Venter et al., 2001) would not have been possible without advanced assembly algorithms, and the development of practical BWT-based read mappers has been instrumental for NGS analysis. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there was a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use. We previously addressed this by introducing the SeqAn library of efficient data types and algorithms in 2008 (Döring et al., 2008). RESULTS The SeqAn library has matured considerably since its first publication 9 years ago. In this article we review its status as an established resource for programmers in the field of sequence analysis and its contributions to many analysis tools. CONCLUSIONS We anticipate that SeqAn will continue to be a valuable resource, especially since it has started to actively support various hardware acceleration techniques in a systematic manner.
Affiliation(s)
- Knut Reinert
  - Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
- Temesgen Hailemariam Dadi
  - Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
- Marcel Ehrhardt
  - Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
- Hannes Hauswedell
  - Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
- Svenja Mehringer
  - Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
- René Rahn
  - Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
- Jongkyu Kim
  - Efficient Algorithms for -Omics Data, Max Planck Institute for Molecular Genetics, Ihnestrasse 62-73, 14195 Berlin, Germany
- Christopher Pockrandt
  - Efficient Algorithms for -Omics Data, Max Planck Institute for Molecular Genetics, Ihnestrasse 62-73, 14195 Berlin, Germany
- Jörg Winkler
  - Efficient Algorithms for -Omics Data, Max Planck Institute for Molecular Genetics, Ihnestrasse 62-73, 14195 Berlin, Germany
- Gianvito Urgese
  - Department of Control and Computer Engineering, Politecnico di Torino, Italy
|
28
|
Evaluation of the impact of Illumina error correction tools on de novo genome assembly. BMC Bioinformatics 2017; 18:374. [PMID: 28821237 PMCID: PMC5563063 DOI: 10.1186/s12859-017-1784-8] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Received: 03/21/2017] [Accepted: 08/11/2017] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods. RESULTS For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy. CONCLUSIONS We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.
|
29
|
Song L, Huang W, Kang J, Huang Y, Ren H, Ding K. Comparison of error correction algorithms for Ion Torrent PGM data: application to hepatitis B virus. Sci Rep 2017; 7:8106. [PMID: 28808243 PMCID: PMC5556038 DOI: 10.1038/s41598-017-08139-y] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 01/05/2017] [Accepted: 07/05/2017] [Indexed: 01/26/2023] Open
Abstract
Ion Torrent Personal Genome Machine (PGM) technology is a mid-length read, low-cost and high-speed next-generation sequencing platform with a relatively high insertion and deletion (indel) error rate. A full systematic assessment of the effectiveness of various error correction algorithms in PGM viral datasets (e.g., hepatitis B virus (HBV)) has not been performed. We examined 19 quality-trimmed PGM datasets for the HBV reverse transcriptase (RT) region and found a total error rate of 0.48% ± 0.12%. Deletion errors were clearly present at the ends of homopolymer runs. Tests using both real and simulated data showed that the algorithms differed in their abilities to detect and correct errors and that the error rate and sequencing depth significantly affected the performance. Of the algorithms tested, Pollux showed a better overall performance but tended to over-correct 'genuine' substitution variants, whereas Fiona proved to be better at distinguishing these variants from sequencing errors. We found that the combined use of Pollux and Fiona gave the best results when error-correcting Ion Torrent PGM viral data.
Affiliation(s)
- Liting Song
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, Department of Infectious Diseases, The Second Affiliated Hospital, Chongqing Medical University, Chongqing, 400010, P.R. China
- Wenxun Huang
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, Department of Infectious Diseases, The Second Affiliated Hospital, Chongqing Medical University, Chongqing, 400010, P.R. China
- Juan Kang
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, Department of Infectious Diseases, The Second Affiliated Hospital, Chongqing Medical University, Chongqing, 400010, P.R. China
- Yuan Huang
  - Center for Hepatobiliary and Pancreatic Diseases, Beijing Tsinghua Changgung Hospital, Medical Center, Tsinghua University, Beijing, 100044, P.R. China
- Hong Ren
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, Department of Infectious Diseases, The Second Affiliated Hospital, Chongqing Medical University, Chongqing, 400010, P.R. China
- Keyue Ding
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, Department of Infectious Diseases, The Second Affiliated Hospital, Chongqing Medical University, Chongqing, 400010, P.R. China
|
30
|
Yin Z, Lan H, Tan G, Lu M, Vasilakos AV, Liu W. Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges. Comput Struct Biotechnol J 2017; 15:403-411. [PMID: 28883909 PMCID: PMC5581845 DOI: 10.1016/j.csbj.2017.07.004] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 03/11/2017] [Revised: 06/30/2017] [Accepted: 07/28/2017] [Indexed: 12/25/2022] Open
Abstract
The last decade has witnessed an explosion in the amount of available biological sequence data, due to the rapid progress of high-throughput sequencing projects. However, the volume of biological data is becoming so great that traditional data analysis platforms and methods can no longer meet the need to rapidly perform data analysis tasks in life sciences. As a result, both biologists and computer scientists are facing the challenge of gaining a profound insight into the deepest biological functions from big biological data. This in turn requires massive computational resources. Therefore, high performance computing (HPC) platforms are needed, as well as efficient and scalable algorithms that can take advantage of these platforms. In this paper, we survey the state-of-the-art HPC platforms for big biological data analytics. We first list the characteristics of big biological data and popular computing platforms. Then we provide a taxonomy of different biological data analysis applications and a survey of the way they have been mapped onto various computing platforms. After that, we present a case study to compare the efficiency of different computing platforms for handling the classical biological sequence alignment problem. Finally, we discuss the open issues in big biological data analytics.
Affiliation(s)
- Zekun Yin
  - Shandong University, Jinan, Shandong, China
- Guangming Tan
  - Institute of Computing Technology, Chinese Academy of Sciences, China
- Mian Lu
  - Huawei Singapore Research Centre, Singapore
- Athanasios V Vasilakos
  - Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, Skellefteå SE-931 87, Sweden
- Weiguo Liu
  - Shandong University, Jinan, Shandong, China
|
31
|
Lee B, Moon T, Yoon S, Weissman T. DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing. PLoS One 2017; 12:e0181463. [PMID: 28749987 PMCID: PMC5531809 DOI: 10.1371/journal.pone.0181463] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 03/20/2017] [Accepted: 06/30/2017] [Indexed: 11/29/2022] Open
Abstract
We consider the correction of errors from nucleotide sequences produced by next-generation targeted amplicon sequencing. The next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq.
Affiliation(s)
- Byunghan Lee
  - Electrical and Computer Engineering, Seoul National University, Seoul, Korea
- Taesup Moon
  - College of Information and Communication Engineering, Sungkyunkwan University, Suwon, Korea
  - * E-mail: (TM); (SY)
- Sungroh Yoon
  - Electrical and Computer Engineering, Seoul National University, Seoul, Korea
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea
  - Neurology and Neurological Sciences, Stanford University, Stanford, California, United States of America
  - * E-mail: (TM); (SY)
- Tsachy Weissman
  - Electrical Engineering, Stanford University, Stanford, California, United States of America
|
32
|
Schmidt B, Hildebrandt A. Next-generation sequencing: big data meets high performance computing. Drug Discov Today 2017; 22:712-717. [DOI: 10.1016/j.drudis.2017.01.014] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Received: 08/17/2016] [Revised: 12/16/2016] [Accepted: 01/25/2017] [Indexed: 12/17/2022]
|
33
|
Tumber A, Nuzzi A, Hookway ES, Hatch SB, Velupillai S, Johansson C, Kawamura A, Savitsky P, Yapp C, Szykowska A, Wu N, Bountra C, Strain-Damerell C, Burgess-Brown NA, Ruda GF, Fedorov O, Munro S, England KS, Nowak RP, Schofield CJ, La Thangue NB, Pawlyn C, Davies F, Morgan G, Athanasou N, Müller S, Oppermann U, Brennan PE. Potent and Selective KDM5 Inhibitor Stops Cellular Demethylation of H3K4me3 at Transcription Start Sites and Proliferation of MM1S Myeloma Cells. Cell Chem Biol 2017; 24:371-380. [PMID: 28262558 PMCID: PMC5361737 DOI: 10.1016/j.chembiol.2017.02.006] [Citation(s) in RCA: 100] [Impact Index Per Article: 12.5] [Received: 04/02/2016] [Revised: 10/31/2016] [Accepted: 02/01/2017] [Indexed: 12/16/2022]
Abstract
Methylation of lysine residues on histone tail is a dynamic epigenetic modification that plays a key role in chromatin structure and gene regulation. Members of the KDM5 (also known as JARID1) sub-family are 2-oxoglutarate (2-OG) and Fe2+-dependent oxygenases acting as histone 3 lysine 4 trimethyl (H3K4me3) demethylases, regulating proliferation, stem cell self-renewal, and differentiation. Here we present the characterization of KDOAM-25, an inhibitor of KDM5 enzymes. KDOAM-25 shows biochemical half maximal inhibitory concentration values of <100 nM for KDM5A-D in vitro, high selectivity toward other 2-OG oxygenases sub-families, and no off-target activity on a panel of 55 receptors and enzymes. In human cell assay systems, KDOAM-25 has a half maximal effective concentration of ∼50 μM and good selectivity toward other demethylases. KDM5B is overexpressed in multiple myeloma and negatively correlated with the overall survival. Multiple myeloma MM1S cells treated with KDOAM-25 show increased global H3K4 methylation at transcriptional start sites and impaired proliferation.
Affiliation(s)
- Anthony Tumber
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK; Nuffield Department of Medicine, Target Discovery Institute, University of Oxford, Oxford OX3 7FZ, UK
- Andrea Nuzzi
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK; Nuffield Department of Medicine, Target Discovery Institute, University of Oxford, Oxford OX3 7FZ, UK
- Edward S Hookway
  - NIHR Oxford Biomedical Research Unit, Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, Botnar Research Centre, University of Oxford, Oxford OX3 7LD, UK
- Stephanie B Hatch
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK; Nuffield Department of Medicine, Target Discovery Institute, University of Oxford, Oxford OX3 7FZ, UK
- Srikannathasan Velupillai
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK; Nuffield Department of Medicine, Target Discovery Institute, University of Oxford, Oxford OX3 7FZ, UK
- Catrine Johansson
  - NIHR Oxford Biomedical Research Unit, Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, Botnar Research Centre, University of Oxford, Oxford OX3 7LD, UK; Chemistry Research Laboratory, University of Oxford, 12 Mansfield Road, Oxford OX1 3TA, UK
- Akane Kawamura
  - Chemistry Research Laboratory, University of Oxford, 12 Mansfield Road, Oxford OX1 3TA, UK; Division of Cardiovascular Medicine, Radcliffe Department of Medicine, University of Oxford, Oxford OX3 7BN, UK
- Pavel Savitsky
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK
- Clarence Yapp
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK; Nuffield Department of Medicine, Target Discovery Institute, University of Oxford, Oxford OX3 7FZ, UK
- Na Wu
  - NIHR Oxford Biomedical Research Unit, Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, Botnar Research Centre, University of Oxford, Oxford OX3 7LD, UK
- Chas Bountra
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK
- Gian Filippo Ruda
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK; Nuffield Department of Medicine, Target Discovery Institute, University of Oxford, Oxford OX3 7FZ, UK
- Oleg Fedorov
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK; Nuffield Department of Medicine, Target Discovery Institute, University of Oxford, Oxford OX3 7FZ, UK
- Shonagh Munro
  - Department of Oncology, University of Oxford, Oxford OX3 7DQ, UK
- Katherine S England
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK; Nuffield Department of Medicine, Target Discovery Institute, University of Oxford, Oxford OX3 7FZ, UK
- Radoslaw P Nowak
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK; NIHR Oxford Biomedical Research Unit, Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, Botnar Research Centre, University of Oxford, Oxford OX3 7LD, UK
- Charlotte Pawlyn
  - Division of Cancer Therapeutics, Institute of Cancer Research, Sutton, Surrey SM2 5NG, UK
- Faith Davies
  - Division of Cancer Therapeutics, Institute of Cancer Research, Sutton, Surrey SM2 5NG, UK; University of Arkansas for Medical Sciences, Myeloma Institute, 4301 W. Markham #816, Little Rock, AR 72205, USA
- Gareth Morgan
  - Division of Cancer Therapeutics, Institute of Cancer Research, Sutton, Surrey SM2 5NG, UK; University of Arkansas for Medical Sciences, Myeloma Institute, 4301 W. Markham #816, Little Rock, AR 72205, USA
- Nick Athanasou
  - NIHR Oxford Biomedical Research Unit, Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, Botnar Research Centre, University of Oxford, Oxford OX3 7LD, UK
- Susanne Müller
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK; Nuffield Department of Medicine, Target Discovery Institute, University of Oxford, Oxford OX3 7FZ, UK
- Udo Oppermann
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK; NIHR Oxford Biomedical Research Unit, Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, Botnar Research Centre, University of Oxford, Oxford OX3 7LD, UK
- Paul E Brennan
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK; Nuffield Department of Medicine, Target Discovery Institute, University of Oxford, Oxford OX3 7FZ, UK
|
34
|
From next-generation resequencing reads to a high-quality variant data set. Heredity (Edinb) 2016; 118:111-124. [PMID: 27759079 DOI: 10.1038/hdy.2016.102] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Received: 04/24/2016] [Revised: 09/03/2016] [Accepted: 09/06/2016] [Indexed: 12/11/2022] Open
Abstract
Sequencing has revolutionized biology by permitting the analysis of genomic variation at an unprecedented resolution. High-throughput sequencing is fast and inexpensive, making it accessible for a wide range of research topics. However, the produced data contain subtle but complex types of errors, biases and uncertainties that impose several statistical and computational challenges to the reliable detection of variants. To tap the full potential of high-throughput sequencing, a thorough understanding of the data produced as well as the available methodologies is required. Here, I review several commonly used methods for generating and processing next-generation resequencing data, discuss the influence of errors and biases together with their resulting implications for downstream analyses and provide general guidelines and recommendations for producing high-quality single-nucleotide polymorphism data sets from raw reads by highlighting several sophisticated reference-based methods representing the current state of the art.
|
35
|
Lavezzo E, Barzon L, Toppo S, Palù G. Third generation sequencing technologies applied to diagnostic microbiology: benefits and challenges in applications and data analysis. Expert Rev Mol Diagn 2016; 16:1011-23. [PMID: 27453996 DOI: 10.1080/14737159.2016.1217158] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Indexed: 01/07/2023]
Abstract
INTRODUCTION The diagnosis of infectious diseases is among the most successful areas of application of new generation sequencing technologies. The field has seen the development of numerous experimental and analytical approaches for the detection and the fine description of pathogenic and non-pathogenic microorganisms. AREAS COVERED Without claiming to be exhaustive with respect to all applications and methods developed over the years, this review focuses on the advantages and the issues brought by the new technologies, with an eye in particular to third generation sequencing methods. Both experimental procedures and algorithmic strategies are presented, following the most relevant publications which have led to progress in our ability to detect infectious agents. EXPERT COMMENTARY The technical advance brought by third generation sequencing platforms has the potential to significantly expand the range of diagnostic tools that will be available to clinicians. Nonetheless, the implementation of these technologies in clinical practice is still far from being actionable and will temporally follow the path undertaken by second generation methods, which still require the setup of standardized pipelines in both wet and dry laboratory procedures.
Affiliation(s)
- Enrico Lavezzo
  - Department of Molecular Medicine, University of Padova, Padova, Italy
- Luisa Barzon
  - Department of Molecular Medicine, University of Padova, Padova, Italy
- Stefano Toppo
  - Department of Molecular Medicine, University of Padova, Padova, Italy
- Giorgio Palù
  - Department of Molecular Medicine, University of Padova, Padova, Italy
|
36
|
Milicchio F, Rose R, Bian J, Min J, Prosperi M. Visual programming for next-generation sequencing data analytics. BioData Min 2016; 9:16. [PMID: 27127540 PMCID: PMC4848821 DOI: 10.1186/s13040-016-0095-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 01/22/2016] [Accepted: 04/21/2016] [Indexed: 11/10/2022] Open
Abstract
Background High-throughput or next-generation sequencing (NGS) technologies have become an established and affordable experimental framework in biological and medical sciences for all basic and translational research. Processing and analyzing NGS data is challenging. NGS data are big, heterogeneous, sparse, and error prone. Although a plethora of tools for NGS data analysis has emerged in the past decade, (i) software development is still lagging behind data generation capabilities, and (ii) there is a ‘cultural’ gap between the end user and the developer. Text Generic software template libraries specifically developed for NGS can help in dealing with the former problem, whilst coupling template libraries with visual programming may help with the latter. Here we scrutinize the state-of-the-art low-level software libraries implemented specifically for NGS and graphical tools for NGS analytics. An ideal developing environment for NGS should be modular (with a native library interface), scalable in computational methods (i.e. serial, multithread, distributed), transparent (platform-independent), interoperable (with external software interface), and usable (via an intuitive graphical user interface). These characteristics should facilitate both the run of standardized NGS pipelines and the development of new workflows based on technological advancements or users’ needs. We discuss in detail the potential of a computational framework blending generic template programming and visual programming that addresses all of the current limitations. Conclusion In the long term, a proper, well-developed (although not necessarily unique) software framework will bridge the current gap between data generation and hypothesis testing. This will eventually facilitate the development of novel diagnostic tools embedded in routine healthcare.
Affiliation(s)
- Jiang Bian
  - Department of Health Outcomes and Policy, University of Florida, Gainesville, FL USA
- Jae Min
  - Department of Epidemiology, College of Public Health and Health Professions & College of Medicine, University of Florida, 2004 Mowry Road, Gainesville, 32610-0231 FL USA
- Mattia Prosperi
  - Department of Epidemiology, College of Public Health and Health Professions & College of Medicine, University of Florida, 2004 Mowry Road, Gainesville, 32610-0231 FL USA
|
37
|
Durai DA, Schulz MH. Informed kmer selection for de novo transcriptome assembly. Bioinformatics 2016; 32:1670-7. [PMID: 27153653 PMCID: PMC4892416 DOI: 10.1093/bioinformatics/btw217] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 04/07/2015] [Accepted: 04/17/2016] [Indexed: 11/23/2022] Open
Abstract
Motivation: De novo transcriptome assembly is an integral part of many RNA-seq workflows. Common applications include sequencing of non-model organisms, cancer or meta transcriptomes. Most de novo transcriptome assemblers use the de Bruijn graph (DBG) as the underlying data structure. The quality of the assemblies produced by such assemblers is highly influenced by the exact word length k. As such, no single kmer value leads to optimal results. Instead, DBGs over different kmer values are built and the assemblies are merged to improve sensitivity. However, no studies have thoroughly investigated the problem of automatically learning at which kmer value to stop the assembly. Instead, a suboptimal selection of kmer values is often used in practice. Results: Here we investigate the contribution of a single kmer value in a multi-kmer based assembly approach. We find that a comparative clustering of related assemblies can be used to estimate the importance of an additional kmer assembly. Using a model fit based algorithm we predict the kmer value at which no further assemblies are necessary. Our approach is tested with different de novo assemblers for datasets with different coverage values and read lengths. Further, we suggest a simple post processing step that significantly improves the quality of multi-kmer assemblies. Conclusion: We provide an automatic method for limiting the number of kmer values without a significant loss in assembly quality but with savings in assembly time. This is a step forward to making multi-kmer methods more reliable and easier to use. Availability and Implementation: A general implementation of our approach can be found at https://github.com/SchulzLab/KREATION. Supplementary information: Supplementary data are available at Bioinformatics online. Contact: mschulz@mmci.uni-saarland.de
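The dependence of a de Bruijn graph on the word length k can be illustrated with a small Python sketch. This is not the KREATION method and performs no assembly or clustering; it only builds the graph for several k values and counts branching nodes, a rough proxy for how ambiguous the graph is. The function names and the toy reads are assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer successors seen in reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: its prefix -> its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def count_branching_nodes(graph):
    """Count nodes with more than one successor (ambiguous extensions)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


# Toy input: a read containing the short repeat "ACG".
reads = ["ACGTACGA"]
for k in (3, 5):
    graph = de_bruijn_graph(reads, k)
    print(k, len(graph), count_branching_nodes(graph))
```

In this toy input the repeat creates a branch at k = 3 that disappears at k = 5, while larger k values yield fewer k-mers per read and so need higher coverage. That trade-off is why multi-kmer approaches combine assemblies from several k values.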
Affiliation(s)
- Dilip A Durai
  - Cluster of Excellence on Multimodal Computing and Interaction, Saarland University, Saarbrücken, 66123, Germany
  - Department for Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, 66123, Germany
- Marcel H Schulz
  - Cluster of Excellence on Multimodal Computing and Interaction, Saarland University, Saarbrücken, 66123, Germany
  - Department for Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, 66123, Germany
|
38
|
Sameith K, Roscito JG, Hiller M. Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly. Brief Bioinform 2016; 18:1-8. [PMID: 26868358 PMCID: PMC5221426 DOI: 10.1093/bib/bbw003] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 10/22/2015] [Revised: 01/02/2016] [Indexed: 11/13/2022] Open
Abstract
Next-generation sequencers such as Illumina can now produce reads up to 300 bp with high throughput, which is attractive for genome assembly. A first step in genome assembly is to computationally correct sequencing errors. However, correcting all errors in these longer reads is challenging. Here, we show that reads with remaining errors after correction often overlap repeats, where short erroneous k-mers occur in other copies of the repeat. We developed an iterative error correction pipeline that runs the previously published String Graph Assembler (SGA) in multiple rounds of k-mer-based correction with an increasing k-mer size, followed by a final round of overlap-based correction. By combining the advantages of small and large k-mers, this approach corrects more errors in repeats and minimizes the total amount of erroneous reads. We show that higher read accuracy increases contig lengths two to three times. We provide SGA-Iteratively Correcting Errors (https://github.com/hillerlab/IterativeErrorCorrection/) that implements iterative error correction by using modules from SGA.
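The idea of running k-mer-based correction in several rounds with an increasing k-mer size can be illustrated with a toy sketch. This is not the SGA-based pipeline from the paper: the function names, the `min_count` threshold, the choice of k values, and the restriction to single-base substitutions are all assumptions made for illustration, and the final overlap-based round is omitted.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer of length k across all reads."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))


def correct_read(read, counts, k, min_count):
    """Fix single-base substitutions using a k-mer spectrum (toy heuristic).

    A k-mer is treated as trusted when it occurs at least min_count times
    (hypothetical threshold). A base is replaced only if exactly one
    alternative makes every k-mer covering that position trusted.
    """
    seq = list(read)
    n = len(seq)
    for pos in range(n):
        # Start offsets of all k-mers that overlap position pos.
        starts = range(max(0, pos - k + 1), min(pos, n - k) + 1)

        def untrusted(candidate):
            return sum(counts["".join(candidate[s:s + k])] < min_count
                       for s in starts)

        if untrusted(seq) == 0:
            continue
        fixes = []
        for base in "ACGT":
            if base == seq[pos]:
                continue
            candidate = seq[:pos] + [base] + seq[pos + 1:]
            if untrusted(candidate) == 0:
                fixes.append(base)
        if len(fixes) == 1:  # skip ambiguous positions
            seq[pos] = fixes[0]
    return "".join(seq)


def iterative_correct(reads, ks=(4, 6), min_count=2):
    """Run one correction round per k, from the smallest k to the largest."""
    for k in ks:
        counts = kmer_counts(reads, k)  # recount after each round
        reads = [correct_read(r, counts, k, min_count) for r in reads]
    return reads


reads = ["ACGTTGCAAGCT"] * 5 + ["ACGTTGCTAGCT"]  # last read has one error
corrected = iterative_correct(reads)
```

Recounting k-mers before each round means that later rounds with larger k work on reads already cleaned by the smaller k, which is the intuition behind combining small and large k-mers.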
Affiliation(s)
- Katrin Sameith
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
- Juliana G Roscito
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
- Michael Hiller
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
- Corresponding author. Michael Hiller. Max Planck Institute of Molecular Cell Biology and Genetics & Max Planck Institute for the Physics of Complex Systems, 01307 Dresden, Germany. E-mail:
|
39
|
Alic AS, Tomas A, Medina I, Blanquer I. MuffinEc: Error correction for de Novo assembly via greedy partitioning and sequence alignment. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2015.09.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/25/2022]
|
40
|
Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novo stand-alone error correction methods for NGS data. Wiley Interdisciplinary Reviews: Computational Molecular Science 2016. [DOI: 10.1002/wcms.1239] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Indexed: 12/14/2022]
Affiliation(s)
- Andy S. Alic
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
- David Ruzafa
- Departamento de Química Física e Instituto de Biotecnología, Facultad de Ciencias; Universidad de Granada; Granada Spain
- Joaquin Dopazo
- Department of Computational Genomics; Príncipe Felipe Research Centre (CIPF); Valencia Spain
- CIBER de Enfermedades Raras (CIBERER); Valencia Spain
- Functional Genomics Node (INB) at CIPF; Valencia Spain
- Ignacio Blanquer
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
- Biomedical Imaging Research Group GIBI 2; Polytechnic University Hospital La Fe; Valencia Spain
|
41
|
Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016; 17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 190] [Impact Index Per Article: 21.1] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022] Open
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
|
42
|
Kowalski T, Grabowski S, Deorowicz S. Indexing Arbitrary-Length k-Mers in Sequencing Reads. PLoS One 2015; 10:e0133198. [PMID: 26182400 PMCID: PMC4504488 DOI: 10.1371/journal.pone.0133198] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 02/13/2015] [Accepted: 06/24/2015] [Indexed: 11/25/2022] Open
Abstract
We propose a lightweight data structure for indexing and querying collections of NGS reads data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive to the existing algorithms in the space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments.
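The count and locate queries mentioned in this abstract can be illustrated with a deliberately naive index. The sketch below is not PgSA and makes no attempt at its space efficiency: it simply sorts every suffix of every read and answers queries by binary search. The class name `ReadKmerIndex` and its methods are hypothetical, and the sentinel character `~` assumes reads use only uppercase letters such as A, C, G, T and N.

```python
from bisect import bisect_left


class ReadKmerIndex:
    """Naive sorted-suffix index over a collection of reads (illustration only)."""

    def __init__(self, reads):
        # One entry per suffix: (suffix text, read id, offset within the read).
        entries = sorted(
            (read[i:], rid, i)
            for rid, read in enumerate(reads)
            for i in range(len(read))
        )
        self._keys = [suffix for suffix, _, _ in entries]
        self._hits = [(rid, off) for _, rid, off in entries]

    def _range(self, kmer):
        # Suffixes starting with kmer form one contiguous block in sorted order.
        # "~" sorts after every uppercase letter, so kmer + "~" bounds the block.
        lo = bisect_left(self._keys, kmer)
        hi = bisect_left(self._keys, kmer + "~")
        return lo, hi

    def count(self, kmer):
        """Number of occurrences of kmer across all reads."""
        lo, hi = self._range(kmer)
        return hi - lo

    def locate(self, kmer):
        """Sorted (read_id, offset) pairs for every occurrence of kmer."""
        lo, hi = self._range(kmer)
        return sorted(self._hits[lo:hi])


index = ReadKmerIndex(["ACGTAC", "GTACGT"])
ac_hits = index.locate("AC")
gtac_count = index.count("GTAC")
```

Because the query is a prefix search over suffixes, the same index serves any k-mer length without being rebuilt, at the cost of storing every suffix explicitly.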
Affiliation(s)
- Tomasz Kowalski
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland
- Szymon Grabowski
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland
- Sebastian Deorowicz
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
|
43
|
Allam A, Kalnis P, Solovyev V. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data. Bioinformatics 2015; 31:3421-8. [DOI: 10.1093/bioinformatics/btv415] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Received: 11/11/2014] [Accepted: 07/08/2015] [Indexed: 11/12/2022] Open
|
44
|
Sheikhizadeh S, de Ridder D. ACE: accurate correction of errors using K-mer tries. Bioinformatics 2015; 31:3216-8. [DOI: 10.1093/bioinformatics/btv332] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 11/23/2014] [Accepted: 05/22/2015] [Indexed: 11/13/2022] Open
|