

Cited by Other Article(s)

Dou C, Yang Y, Zhu F, Li B, Duan Y. Explorer: efficient DNA coding by De Bruijn graph toward arbitrary local and global biochemical constraints. Brief Bioinform 2024;25:bbae363. [PMID: 39073829 DOI: 10.1093/bib/bbae363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Revised: 06/25/2024] [Accepted: 07/13/2024] [Indexed: 07/30/2024]

Cao B, Zheng Y, Shao Q, Liu Z, Xie L, Zhao Y, Wang B, Zhang Q, Wei X. Efficient data reconstruction: The bottleneck of large-scale application of DNA storage. Cell Rep 2024;43:113699. [PMID: 38517891 DOI: 10.1016/j.celrep.2024.113699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Revised: 11/15/2023] [Accepted: 01/05/2024] [Indexed: 03/24/2024]

Sami A, El-Metwally S, Rashad MZ. MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads. BMC Bioinformatics 2024;25:61. [PMID: 38321434 PMCID: PMC10848413 DOI: 10.1186/s12859-024-05681-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 01/29/2024] [Indexed: 02/08/2024]

Abstract

BACKGROUND

The rapid advancement of next-generation sequencing (NGS) machines in speed and affordability has produced massive amounts of biological data, but at the expense of data quality, as errors have become more prevalent. This creates a need for approaches that detect and filter errors, shifting data quality assurance from the hardware to the software preprocessing stages.

RESULTS

We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads recasts the filtration of erroneous NGS reads as a binary classification task and employs five supervised machine learning algorithms. The models are trained on features extracted by computing Term Frequency-Inverse Document Frequency (TF-IDF) values from several datasets: E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1, and Metriaclima zebra. Notably, Naive Bayes (NB) performed robustly across these datasets, with high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed alignment rates above 99%, BFC and RECKONER exceeded 98%, and Fiona reached 95.78%. For Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads achieved alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For Metriaclima zebra, Pollux achieved a high alignment rate of 91.23% despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER achieved alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also mapping reasonable numbers of reads to the reference genome.
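To make the feature-extraction step concrete, the following is a minimal sketch of TF-IDF computed over k-mers. It assumes that each read is treated as a "document" and its overlapping k-mers as "terms"; the abstract does not state the tokenization, the value of k, or the exact TF-IDF variant, so the function names, `k=3`, and the unsmoothed `log(N / df)` weighting here are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter


def kmers(read, k):
    """Return the overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def tfidf_vectors(reads, k=3):
    """Compute one sparse TF-IDF vector (dict: k-mer -> weight) per read.

    Assumed variant: tf = count / k-mers in the read, idf = log(N / df).
    """
    counts = [Counter(kmers(r, k)) for r in reads]
    n_reads = len(reads)

    # Document frequency: in how many reads each k-mer occurs at least once.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            kmer: (n / total) * math.log(n_reads / df[kmer])
            for kmer, n in c.items()
        })
    return vectors


# Toy reads, not taken from any of the datasets named in the abstract.
reads = ["ACGTACGT", "ACGTACGA", "TTTTACGT"]
vectors = tfidf_vectors(reads, k=3)
```

Under this weighting, a k-mer that occurs in every read receives weight zero, while k-mers confined to few reads are weighted up. Vectors of this kind could then be passed to any binary classifier, such as the Naive Bayes model the abstract highlights.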

CONCLUSIONS

This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.


Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022;49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]

Khan J, Kokot M, Deorowicz S, Patro R. Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biol 2022;23:190. [PMID: 36076275 PMCID: PMC9454175 DOI: 10.1186/s13059-022-02743-6] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 01/17/2022] [Accepted: 08/01/2022] [Indexed: 11/13/2022]

Kallenborn F, Cascitti J, Schmidt B. CARE 2.0: reducing false-positive sequencing error corrections using machine learning. BMC Bioinformatics 2022;23:227. [PMID: 35698033 PMCID: PMC9195321 DOI: 10.1186/s12859-022-04754-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/03/2022] [Accepted: 05/30/2022] [Indexed: 11/10/2022]

Abstract

BACKGROUND

Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have a negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools.

RESULTS

We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment and targeting Illumina datasets. In addition to a number of newly introduced optimizations, its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders of magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 achieves high numbers of true-positive corrections, comparable to its competitors. On a simulated full human dataset with 914M reads, CARE 2.0 generates only 1.2M false positives (FPs) and 801.4M true positives (TPs) at a highly competitive runtime, while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data.
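The human-dataset figures above can be turned into correction precision, TP / (TP + FP). The short sketch below does only that arithmetic on the numbers quoted in the abstract. The comparison value is an upper bound, assuming a hypothetical competitor that combined the largest reported TP count with the smallest reported FP count; the abstract does not say that any single tool achieved both.

```python
def precision(tp, fp):
    """Fraction of applied corrections that were correct: TP / (TP + FP)."""
    return tp / (tp + fp)


# Figures quoted in the abstract, in millions of corrections.
care2 = precision(tp=801.4, fp=1.2)

# Upper bound for the other tools: largest TP count paired with the
# smallest FP count (an assumption; no single tool is said to reach both).
others_upper_bound = precision(tp=814.5, fp=3.9)

# Other tools produce at least this many times more false positives
# than CARE 2.0 on this dataset.
fp_ratio = 3.9 / 1.2

print(f"CARE 2.0 precision:       {care2:.4%}")
print(f"Other tools (upper bound): {others_upper_bound:.4%}")
print(f"FP ratio (lower bound):    {fp_ratio:.2f}x")
```

On this particular dataset the false-positive gap is a factor of at least 3.25; the "up to two orders of magnitude" figure in the abstract refers to the best case across the evaluated datasets.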

CONCLUSION

False-positive corrections can negatively influence downstream analysis. The precision of CARE 2.0 greatly reduces the number of such corrections compared to other state-of-the-art programs, including BFC, Karect, Musket, Bcool, SGA, and Lighter. The resulting higher-quality datasets improve k-mer analysis and de-novo assembly on real-world data, which demonstrates the applicability of machine learning techniques to sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE.


Shibuya Y, Belazzougui D, Kucherov G. Space-efficient representation of genomic k-mer count tables. Algorithms Mol Biol 2022;17:5. [PMID: 35317833 PMCID: PMC8939220 DOI: 10.1186/s13015-022-00212-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Accepted: 03/01/2022] [Indexed: 11/10/2022]

Simion P, Narayan J, Houtain A, Derzelle A, Baudry L, Nicolas E, Arora R, Cariou M, Cruaud C, Gaudray FR, Gilbert C, Guiglielmoni N, Hespeels B, Kozlowski DKL, Labadie K, Limasset A, Llirós M, Marbouty M, Terwagne M, Virgo J, Cordaux R, Danchin EGJ, Hallet B, Koszul R, Lenormand T, Flot JF, Van Doninck K. Chromosome-level genome assembly reveals homologous chromosomes and recombination in asexual rotifer Adineta vaga. Sci Adv 2021;7:eabg4216. [PMID: 34613768 PMCID: PMC8494291 DOI: 10.1126/sciadv.abg4216] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Indexed: 05/08/2023]

Affiliation(s)

Paul Simion (corresponding author): Research Unit in Environmental and Evolutionary Biology, Université de Namur, Namur 5000, Belgium
Jitendra Narayan: Research Unit in Environmental and Evolutionary Biology, Université de Namur, Namur 5000, Belgium
Antoine Houtain: Research Unit in Environmental and Evolutionary Biology, Université de Namur, Namur 5000, Belgium
Alessandro Derzelle: Research Unit in Environmental and Evolutionary Biology, Université de Namur, Namur 5000, Belgium
Lyam Baudry: Institut Pasteur, Unité Régulation Spatiale des Génomes, UMR 3525, CNRS, Paris F-75015, France; Collège Doctoral, Sorbonne Université, F-75005 Paris, France
Emilien Nicolas: Research Unit in Environmental and Evolutionary Biology, Université de Namur, Namur 5000, Belgium; Molecular Biology and Evolution, Université libre de Bruxelles (ULB), Brussels 1050, Belgium
Rohan Arora: Research Unit in Environmental and Evolutionary Biology, Université de Namur, Namur 5000, Belgium; Molecular Biology and Evolution, Université libre de Bruxelles (ULB), Brussels 1050, Belgium
Marie Cariou: Research Unit in Environmental and Evolutionary Biology, Université de Namur, Namur 5000, Belgium; CIRI, Centre International de Recherche en Infectiologie, Univ Lyon, Inserm, U1111, Université Claude Bernard Lyon 1, CNRS, UMR5308, ENS de Lyon, F-69007 Lyon, France
Corinne Cruaud: Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057 Evry, France
Florence Rodriguez Gaudray: Evolutionary Biology and Ecology, Université libre de Bruxelles (ULB), Brussels 1050, Belgium
Clément Gilbert: Évolution, Génomes, Comportement et Écologie, Université Paris-Saclay, CNRS, IRD, UMR, 91198 Gif-sur-Yvette, France
Nadège Guiglielmoni: Evolutionary Biology and Ecology, Université libre de Bruxelles (ULB), Brussels 1050, Belgium
Boris Hespeels: Research Unit in Environmental and Evolutionary Biology, Université de Namur, Namur 5000, Belgium
Djampa K. L. Kozlowski: INRAE, Université Côte-d’Azur, CNRS, Institut Sophia Agrobiotech, Sophia Antipolis 06903, France
Karine Labadie: Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057 Evry, France
Antoine Limasset: Université de Lille, CNRS, UMR 9189 - CRIStAL, 59655 Villeneuve-d’Ascq, France
Marc Llirós: Research Unit in Environmental and Evolutionary Biology, Université de Namur, Namur 5000, Belgium; Institut d’Investigació Biomédica de Girona, Malalties Digestives i Microbiota, 17190 Salt, Spain
Martial Marbouty: Institut Pasteur, Unité Régulation Spatiale des Génomes, UMR 3525, CNRS, Paris F-75015, France
Matthieu Terwagne: Research Unit in Environmental and Evolutionary Biology, Université de Namur, Namur 5000, Belgium
Julie Virgo: Research Unit in Environmental and Evolutionary Biology, Université de Namur, Namur 5000, Belgium
Richard Cordaux: Ecologie et Biologie des interactions, Université de Poitiers, UMR CNRS 7267, 5 rue Albert Turpain, 86073 Poitiers, France
Etienne G. J. Danchin: INRAE, Université Côte-d’Azur, CNRS, Institut Sophia Agrobiotech, Sophia Antipolis 06903, France
Bernard Hallet: LIBST, Université Catholique de Louvain (UCLouvain), Croix du Sud 4/5, Louvain-la-Neuve 1348, Belgium
Romain Koszul: Institut Pasteur, Unité Régulation Spatiale des Génomes, UMR 3525, CNRS, Paris F-75015, France
Thomas Lenormand: CEFE, Univ Montpellier, CNRS, Univ Paul Valéry Montpellier 3, EPHE, IRD, Montpellier, France
Jean-Francois Flot (corresponding author): Evolutionary Biology and Ecology, Université libre de Bruxelles (ULB), Brussels 1050, Belgium; Interuniversity Institute of Bioinformatics in Brussels - (IB), Brussels 1050, Belgium
Karine Van Doninck (corresponding author): Research Unit in Environmental and Evolutionary Biology, Université de Namur, Namur 5000, Belgium; Molecular Biology and Evolution, Université libre de Bruxelles (ULB), Brussels 1050, Belgium


Zhang X, Ping P, Hutvagner G, Blumenstein M, Li J. Aberration-corrected ultrafine analysis of miRNA reads at single-base resolution: a k-mer lattice approach. Nucleic Acids Res 2021;49:e106. [PMID: 34291293 PMCID: PMC8631080 DOI: 10.1093/nar/gkab610] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/21/2020] [Revised: 07/01/2021] [Accepted: 07/06/2021] [Indexed: 12/21/2022]

Kallenborn F, Hildebrandt A, Schmidt B. CARE: context-aware sequencing read error correction. Bioinformatics 2021;37:889-895. [PMID: 32818262 DOI: 10.1093/bioinformatics/btaa738] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/08/2020] [Revised: 07/14/2020] [Accepted: 08/14/2020] [Indexed: 11/14/2022]

Zhang X, Liu Y, Yu Z, Blumenstein M, Hutvagner G, Li J. Instance-based error correction for short reads of disease-associated genes. BMC Bioinformatics 2021;22:142. [PMID: 34078284 PMCID: PMC8170817 DOI: 10.1186/s12859-021-04058-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/16/2021] [Accepted: 03/02/2021] [Indexed: 12/12/2022]

Holley G, Beyter D, Ingimundardottir H, Møller PL, Kristmundsdottir S, Eggertsson HP, Halldorsson BV. Ratatosk: hybrid error correction of long reads enables accurate variant calling and assembly. Genome Biol 2021;22:28. [PMID: 33419473 PMCID: PMC7792008 DOI: 10.1186/s13059-020-02244-4] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 07/17/2020] [Accepted: 12/15/2020] [Indexed: 12/20/2022]

Steyaert A, Audenaert P, Fostier J. Accurate determination of node and arc multiplicities in de Bruijn graphs using conditional random fields. BMC Bioinformatics 2020;21:402. [PMID: 32928110 PMCID: PMC7491180 DOI: 10.1186/s12859-020-03740-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2020] [Accepted: 09/04/2020] [Indexed: 12/01/2022]

Holley G, Melsted P. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol 2020;21:249. [PMID: 32943081 PMCID: PMC7499882 DOI: 10.1186/s13059-020-02135-8] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Received: 08/13/2019] [Accepted: 08/06/2020] [Indexed: 02/07/2023]

Prezza N, Pisanti N, Sciortino M, Rosone G. Variable-order reference-free variant discovery with the Burrows-Wheeler Transform. BMC Bioinformatics 2020;21:260. [PMID: 32938358 PMCID: PMC7493873 DOI: 10.1186/s12859-020-03586-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 06/01/2020] [Accepted: 06/08/2020] [Indexed: 11/10/2022]

Abstract

BACKGROUND

In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves the sensitivity and precision of previous de Bruijn graph-based tools by overcoming several of their limitations, namely: (i) the need to fix a value, usually small, for the order k, (ii) the loss of important information such as k-mer coverage and the adjacency of k-mers within the same read, and (iii) poor performance in repeated regions longer than k bases. The preliminary tool, however, could identify only SNPs, and it was too slow and memory-consuming because it used additional heavy data structures (namely, the Suffix and LCP arrays) besides the BWT.
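For readers unfamiliar with the transform itself, here is a minimal textbook sketch of the BWT of a single string, built by sorting all rotations, together with its inverse. This is an illustration under simplifying assumptions, not the algorithm used by the tool described here: it handles one string rather than a collection of reads, it runs in quadratic time and space, and the function names and the `$` sentinel are choices made for this example.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations (illustrative, quadratic)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel")
    s = text + sentinel
    # Sort every rotation of the sentinel-terminated string, then read
    # off the last column of the sorted rotation matrix.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Invert the transform by repeatedly prepending the BWT column and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    # The original string is the row that ends with the sentinel.
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]


example = bwt("banana")
restored = inverse_bwt(bwt("ACGTACGA"))
```

Because the rotations are sorted, characters that precede the same context end up adjacent in the output. That grouping is the property BWT-based variant detection relies on, since bases preceding a shared context across many reads cluster together.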

RESULTS

In this paper, we introduce a new algorithm and the corresponding tool ebwt2InDel that (i) extends the framework of [Prezza et al., AMB 2019] to also detect INDELs, and (ii) implements recent algorithmic findings that make it possible to perform the whole analysis using just the BWT, thus reducing the working space by one order of magnitude and allowing the analysis of full genomes. Finally, we describe a simple strategy for effectively parallelizing our tool for SNP detection only. On a 24-core machine, the parallel version of our tool is one order of magnitude faster than the sequential one. The tool ebwt2InDel is available at github.com/nicolaprezza/ebwt2InDel.

CONCLUSIONS

Results on a synthetic dataset covered at 30x (Human chromosome 1) show that our tool finds up to 83% of the SNPs and 72% of the existing INDELs. These percentages considerably improve on the 71% of SNPs and 51% of INDELs found by the state-of-the-art tool based on de Bruijn graphs. We furthermore report results on larger (real) Human whole-genome sequencing experiments. In these cases as well, our tool exhibits a much higher sensitivity than the state-of-the-art tool.


Yanes L, Garcia Accinelli G, Wright J, Ward BJ, Clavijo BJ. A Sequence Distance Graph framework for genome assembly and analysis. F1000Res 2019;8:1490. [PMID: 31723420 PMCID: PMC6833988 DOI: 10.12688/f1000research.20233.1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Accepted: 08/12/2019] [Indexed: 11/20/2022]