1
Guo F, Li Y, Zhao H, Liu X, Mao J, Ma D, Liu S. GKNnet: an relational graph convolutional network-based method with knowledge-augmented activation layer for microbial structural variation detection. Brief Bioinform 2025; 26:bbaf200. [PMID: 40324334 PMCID: PMC12052243 DOI: 10.1093/bib/bbaf200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2024] [Revised: 03/09/2025] [Accepted: 04/10/2025] [Indexed: 05/07/2025]
Abstract
Structural variants (SVs) in microbial genomes play a critical role in phenotypic changes, environmental adaptation, and species evolution, with deletion variations particularly closely linked to phenotypic traits. Therefore, accurate and comprehensive identification of deletion variations is essential. Although long-read sequencing technology can detect more SVs, its high error rate introduces substantial noise, leading to high false-positive and low recall rates in existing SV detection algorithms. This paper presents an SV detection method based on graph convolutional networks (GCNs). The model first represents node features through a heterogeneous graph, leveraging the GCN to precisely identify variant regions. Additionally, a knowledge-augmented activation layer (KANLayer) with a learnable activation function is introduced to reduce noise around variant regions, thereby improving model precision and reducing false positives. A clustering algorithm then aggregates multiple overlapping regions near the variant center into a single accurate SV interval, further enhancing recall. Validation on both simulated and real datasets demonstrates that our method achieves superior F1 scores compared to benchmark methods (cuteSV, Sniffles, Svim, and Pbsv), highlighting its advantage and robustness in SV detection and offering an innovative solution for microbial genome structural variation research.
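The final aggregation step described in this abstract, where several overlapping candidate regions near a variant are combined into one SV interval, can be illustrated with a small sketch. The abstract does not say which clustering algorithm GKNnet uses, so the code below substitutes plain sorted interval merging as a stand-in; the function name `merge_candidate_regions` and the `max_gap` parameter are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# Half-open interval [start, end) on a single contig (an assumed convention).
Interval = Tuple[int, int]


def merge_candidate_regions(regions: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge regions that overlap or lie within max_gap bases of each other.

    This is simple interval merging, used here only as a stand-in for the
    unspecified clustering step mentioned in the abstract.
    """
    merged: List[Interval] = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + max_gap:
            # Extend the current cluster to cover this region.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # Start a new cluster.
            merged.append((start, end))
    return merged


candidates = [(100, 180), (150, 260), (240, 300), (1000, 1100)]
print(merge_candidate_regions(candidates))
```

Raising `max_gap` makes the merge more permissive, which trades fewer fragmented calls against the risk of joining two distinct nearby variants.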
Affiliation(s)
- Fengyi Guo
  - School of Artificial Intelligence and Computer Science, Jiangnan University, 1800 Lihu Avenue, Binhu District, Wuxi, Jiangsu 214122, China
- Yuanbo Li
  - School of Artificial Intelligence and Computer Science, Jiangnan University, 1800 Lihu Avenue, Binhu District, Wuxi, Jiangsu 214122, China
- Hongyuan Zhao
  - National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, State Key Laboratory of Food Science and Technology, School of Food Science and Technology, Jiangnan University, 1800 Lihu Avenue, Binhu District, Wuxi, Jiangsu 214122, China
- Xiaogang Liu
  - Luzhou Laojiao Group Co. Ltd, 157 Guojiao Road, Jiangyang District, Luzhou 646000, Sichuan, China
- Jian Mao
  - National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, State Key Laboratory of Food Science and Technology, School of Food Science and Technology, Jiangnan University, 1800 Lihu Avenue, Binhu District, Wuxi, Jiangsu 214122, China
  - Shaoxing Key Laboratory of Traditional Fermentation Food and Human Health, Jiangnan University (Shaoxing) Industrial Technology Research Institute, Keqiao District, Shaoxing 312000, Zhejiang, China
- Dongna Ma
  - National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, State Key Laboratory of Food Science and Technology, School of Food Science and Technology, Jiangnan University, 1800 Lihu Avenue, Binhu District, Wuxi, Jiangsu 214122, China
- Shuangping Liu
  - School of Artificial Intelligence and Computer Science, Jiangnan University, 1800 Lihu Avenue, Binhu District, Wuxi, Jiangsu 214122, China
  - National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, State Key Laboratory of Food Science and Technology, School of Food Science and Technology, Jiangnan University, 1800 Lihu Avenue, Binhu District, Wuxi, Jiangsu 214122, China
  - Luzhou Laojiao Group Co. Ltd, 157 Guojiao Road, Jiangyang District, Luzhou 646000, Sichuan, China
2
Mahmoud M, Agustinho DP, Sedlazeck FJ. A Hitchhiker's Guide to long-read genomic analysis. Genome Res 2025; 35:545-558. [PMID: 40228901 PMCID: PMC12047252 DOI: 10.1101/gr.279975.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/16/2025]
Abstract
Over the past decade, long-read sequencing has evolved into a pivotal technology for uncovering the hidden and complex regions of the genome. Significant cost efficiency, scalability, and accuracy advancements have driven this evolution. Concurrently, novel analytical methods have emerged to harness the full potential of long reads. These advancements have enabled milestones such as the first fully completed human genome, enhanced identification and understanding of complex genomic variants, and deeper insights into the interplay between epigenetics and genomic variation. This mini-review provides a comprehensive overview of the latest developments in long-read DNA sequencing analysis, encompassing reference-based and de novo assembly approaches. We explore the entire workflow, from initial data processing to variant calling and annotation, focusing on how these methods improve our ability to interpret a wide array of genomic variants. Additionally, we discuss the current challenges, limitations, and future directions in the field, offering a detailed examination of the state-of-the-art bioinformatics methods for long-read sequencing.
Affiliation(s)
- Medhat Mahmoud
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Daniel P Agustinho
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Fritz J Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
  - Department of Computer Science, Rice University, Houston, Texas 77005, USA
3
Wang J, Cheng K, Yan C, Luo H, Luo J. DconnLoop: a deep learning model for predicting chromatin loops based on multi-source data integration. BMC Bioinformatics 2025; 26:96. [PMID: 40170155 PMCID: PMC11959853 DOI: 10.1186/s12859-025-06092-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2024] [Accepted: 02/19/2025] [Indexed: 04/03/2025]
Abstract
BACKGROUND Chromatin loops are critical for the three-dimensional organization of the genome and gene regulation. Accurate identification of chromatin loops is essential for understanding the regulatory mechanisms in disease. However, current mainstream detection methods rely primarily on single-source data, such as Hi-C, which limits these methods' ability to capture the diverse features of chromatin loop structures. In contrast, multi-source data integration and deep learning approaches, though not yet widely applied, hold significant potential. RESULTS In this study, we developed a method called DconnLoop to integrate Hi-C, ChIP-seq, and ATAC-seq data to predict chromatin loops. This method achieves feature extraction and fusion of multi-source data by integrating residual mechanisms, directional connectivity excitation modules, and interactive feature space decoders. Finally, we apply density estimation and density clustering to the genome-wide prediction results to identify more representative loops. The code is available from https://github.com/kuikui-C/DconnLoop . CONCLUSIONS The results demonstrate that DconnLoop outperforms existing methods in both precision and recall. In various experiments, including Aggregate Peak Analysis and peak enrichment comparisons, DconnLoop consistently shows advantages. Extensive ablation studies and validation across different sequencing depths further confirm DconnLoop's robustness and generalizability.
Affiliation(s)
- Junfeng Wang
  - School of Physics and Electronic Information Engineering, Henan Polytechnic University, Jiaozuo, 454003, China
  - School of Software, Henan Polytechnic University, Jiaozuo, 454003, China
- Kuikui Cheng
  - School of Physics and Electronic Information Engineering, Henan Polytechnic University, Jiaozuo, 454003, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng, 475001, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng, 475001, China
- Junwei Luo
  - School of Software, Henan Polytechnic University, Jiaozuo, 454003, China
4
Qiu T, Li J, Guo Y, Jiang L, Tang J. SVEA: an accurate model for structural variation detection using multi-channel image encoding and enhanced AlexNet architecture. J Transl Med 2025; 23:221. [PMID: 39987107 PMCID: PMC11846410 DOI: 10.1186/s12967-025-06213-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2024] [Accepted: 02/06/2025] [Indexed: 02/24/2025]
Abstract
BACKGROUND Structural variations (SVs) are a pervasive and impactful class of genetic variation within the genome, significantly influencing gene function, impacting human health, and contributing to disease. Recent advances in deep learning have shown promise for SV detection; however, current methods still encounter key challenges in effective feature extraction and accurately predicting complex variations. METHODS We introduce SVEA, an advanced deep learning model designed to address these challenges. SVEA employs a novel multi-channel image encoding approach that transforms SVs into multi-dimensional image formats, improving the model's ability to capture subtle genomic variations. Additionally, SVEA integrates multi-head self-attention mechanisms and multi-scale convolution modules, enhancing its ability to capture global context and multi-scale features. The model was trained and tested on a diverse range of genomic datasets to evaluate its accuracy and generalizability. RESULTS SVEA demonstrated superior performance in detecting complex SVs compared to existing methods, with improved accuracy across various genomic regions. The multi-channel encoding and advanced feature extraction techniques contributed to the model's enhanced ability to predict subtle and complex variations. CONCLUSIONS This study presents SVEA, a deep learning model incorporating advanced encoding and feature extraction techniques to enhance structural variation prediction. The model demonstrates high accuracy, outperforming existing methods by approximately 4%, while also identifying areas for further optimization.
Affiliation(s)
- Taixing Qiu
  - College of Engineering, Southern University of Science and Technology, Shenzhen, 518055, China
  - Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, 518055, China
- Jiawei Li
  - Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, 518055, China
- Yan Guo
  - Department of Public Health Sciences, University of Miami, Miami, FL, 33136, USA
- Limin Jiang
  - Department of Public Health Sciences, University of Miami, Miami, FL, 33136, USA
- Jijun Tang
  - Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, 518055, China
5
Todd C, Jin L, McQuillan I. SV-JIM, detailed pairwise structural variant calling using long-reads and genome assemblies. Methods 2025; 234:305-313. [PMID: 39826659 DOI: 10.1016/j.ymeth.2024.12.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2024] [Revised: 12/21/2024] [Accepted: 12/30/2024] [Indexed: 01/22/2025]
Abstract
This paper proposes a detailed process for SV calling that uses both genome assemblies and long reads and permits a data-driven assessment of multiple SV callers. The process is implemented as a software pipeline named Structural Variant - Jaccard Index Measure, or SV-JIM, using the Snakemake [20] workflow management system. Like most state-of-the-art SV callers, SV-JIM detects variations between pairs of genomes, but it streamlines the numerous SV calling stages into a single process for user convenience and evaluates the multiple SV sets produced using the Jaccard index to identify those with the highest consistency among the included SV callers. SV-JIM then produces aggregated SV results based on how many callers supported the reported SVs. For validation, SV-JIM was assessed through three case studies on the Homo sapiens genome and two plant genomes, Brassica nigra and Arabidopsis thaliana. Executing SV-JIM identified a significant amount of inter-caller variance, which varied by tens of thousands of results on the larger Brassica nigra and Homo sapiens genomes. Further, aggregating the SV sets improved retention of the less frequently occurring SV types by requiring a minimum level of support rather than support from a specific SV caller combination. Finally, these case studies identified a potential for inflated precision reporting that can occur during evaluation. SV-JIM is available publicly under the MIT license at https://github.com/USask-BINFO/SV-JIM.
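The two ideas named in this abstract, scoring consistency between callers with the Jaccard index and aggregating calls by how many callers support them, can be sketched as follows. This is a minimal illustration and not SV-JIM's implementation: the abstract does not describe how two SV calls are judged to be the same, so the sketch assumes exact matching on a hypothetical `(chromosome, start, end, type)` tuple, and the function and caller names are invented for the example.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set, Tuple

# Hypothetical SV record: (chromosome, start, end, type). Exact-match
# comparison is an assumption made to keep the sketch short.
SV = Tuple[str, int, int, str]


def jaccard(a: Set[SV], b: Set[SV]) -> float:
    """Jaccard index: size of intersection over size of union.

    Two empty sets are treated as identical (1.0), which is a convention
    chosen for this sketch.
    """
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_jaccard(callsets: Dict[str, Set[SV]]) -> Dict[Tuple[str, str], float]:
    """Jaccard index for every unordered pair of callers, keyed by sorted names."""
    return {
        (x, y): jaccard(callsets[x], callsets[y])
        for x, y in combinations(sorted(callsets), 2)
    }


def filter_by_support(callsets: Dict[str, Set[SV]], min_support: int) -> Set[SV]:
    """Keep SVs reported by at least min_support callers."""
    counts = Counter(sv for calls in callsets.values() for sv in calls)
    return {sv for sv, n in counts.items() if n >= min_support}


callsets = {
    "caller_a": {("chr1", 100, 200, "DEL"), ("chr1", 500, 900, "DEL"), ("chr2", 50, 80, "INS")},
    "caller_b": {("chr1", 500, 900, "DEL"), ("chr2", 50, 80, "INS"), ("chr3", 10, 40, "DUP")},
    "caller_c": {("chr2", 50, 80, "INS")},
}
print(pairwise_jaccard(callsets))
print(sorted(filter_by_support(callsets, min_support=2)))
```

Filtering on a support count rather than on a fixed caller combination means a rare SV type is kept as long as any sufficiently large group of callers reports it, which matches the retention argument made in the abstract.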
Affiliation(s)
- Clarence Todd
  - Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada
- Lingling Jin
  - Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada
- Ian McQuillan
  - Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada
6
Zhai H, Dong C, Wang T, Luo J. HiSVision: A Method for Detecting Large-Scale Structural Variations Based on Hi-C Data and Detection Transformer. Interdiscip Sci 2024:10.1007/s12539-024-00677-0. [PMID: 39714580 DOI: 10.1007/s12539-024-00677-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2024] [Accepted: 11/17/2024] [Indexed: 12/24/2024]
Abstract
Structural variation (SV) is an important component of the diversity of the human genome. Many studies have shown that SV has a significant impact on human disease and is strongly associated with the development of cancer. In recent years, the Hi-C sequencing technique has been shown to be useful for detecting large-scale SVs, and several methods have been proposed for identifying SVs from Hi-C data. However, due to the complexity of the 3D genome structure, accurately identifying SVs from the Hi-C contact matrix remains a challenging task. Here, we present HiSVision, a method for identifying large-scale SVs from Hi-C data using a detection transformer framework. Inspired by object detection networks, we transform the Hi-C contact matrix into images, then identify candidate SV regions on the images with a detection transformer, and finally filter SVs based on features around the breakpoints. Experimental results show that HiSVision outperforms existing methods in terms of precision and F1 score on cancer cell lines and simulated datasets. The source code and data are available from https://github.com/dcy99/HiSVision.
Affiliation(s)
- Haixia Zhai
  - School of Software, Henan Polytechnic University, Jiaozuo, 454003, China
- Chengyao Dong
  - School of Software, Henan Polytechnic University, Jiaozuo, 454003, China
- Tao Wang
  - School of Software, Henan Polytechnic University, Jiaozuo, 454003, China
- Junwei Luo
  - School of Software, Henan Polytechnic University, Jiaozuo, 454003, China
7
Luo J, Zhang Z, Ma X, Yan C, Luo H. GTasm: a genome assembly method using graph transformers and HiFi reads. Front Genet 2024; 15:1495657. [PMID: 39525812 PMCID: PMC11543488 DOI: 10.3389/fgene.2024.1495657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2024] [Accepted: 10/14/2024] [Indexed: 11/16/2024]
Abstract
Motivation Genome assembly aims to reconstruct the whole chromosome-scale genome sequence. An accurate and complete chromosome-scale genome sequence serves as an indispensable foundation for downstream genomics analyses. Because of the complex repeat regions in genome sequences, assembly results are commonly fragmented. Highly accurate long reads can greatly improve the completeness of genome assembly results. Results Here we introduce GTasm, an assembly method that uses a graph transformer network to find optimal assembly results based on assembly graphs. From the assembly graph, GTasm first extracts features of vertices and edges. Then, GTasm scores the edges with a graph transformer model and adopts a heuristic algorithm to find optimal paths in the assembly graph, each path corresponding to a contig. The graph transformer model is trained using simulated HiFi reads from CHM13, and GTasm is compared with other assembly methods using a real HiFi read set. Experimental results show that GTasm produces good assembly results and performs well on the NA50 and NGA50 evaluation metrics. Applying deep learning models to genome assembly can improve the continuity and accuracy of assembly results. The code is available from https://github.com/chu-xuezhe/GTasm.
Affiliation(s)
- Junwei Luo
  - School of Software, Henan Polytechnic University, Jiaozuo, China
- Ziheng Zhang
  - School of Software, Henan Polytechnic University, Jiaozuo, China
- Xinliang Ma
  - School of Software, Henan Polytechnic University, Jiaozuo, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
8
Junjun R, Zhengqian Z, Ying W, Jialiang W, Yongzhuang L. A comprehensive review of deep learning-based variant calling methods. Brief Funct Genomics 2024; 23:303-313. [PMID: 38366908 DOI: 10.1093/bfgp/elae003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Revised: 01/14/2024] [Accepted: 01/18/2023] [Indexed: 02/18/2024]
Abstract
Genome sequencing data have become increasingly important in the field of personalized medicine and diagnosis. However, accurately detecting genomic variations remains a challenging task. Traditional variation detection methods rely on manual inspection or predefined rules, which can be time-consuming and prone to errors. Consequently, deep learning-based approaches for variation detection have gained attention due to their ability to automatically learn genomic features that distinguish between variants. In our review, we discuss the recent advancements in deep learning-based algorithms for detecting small variations and structural variations in genomic data, as well as their advantages and limitations.
Affiliation(s)
- Ren Junjun
  - Harbin Institute of Technology, School of Computer Science and Technology, Harbin 150001, China
- Zhang Zhengqian
  - Harbin Institute of Technology, School of Computer Science and Technology, Harbin 150001, China
- Wu Ying
  - Harbin Institute of Technology, School of Computer Science and Technology, Harbin 150001, China
- Wang Jialiang
  - Harbin Institute of Technology, School of Computer Science and Technology, Harbin 150001, China
- Liu Yongzhuang
  - Harbin Institute of Technology, School of Computer Science and Technology, Harbin 150001, China
9
Merkulov P, Serganova M, Petrov G, Mityukov V, Kirov I. Long-read sequencing of extrachromosomal circular DNA and genome assembly of a Solanum lycopersicum breeding line revealed active LTR retrotransposons originating from S. Peruvianum L. introgressions. BMC Genomics 2024; 25:404. [PMID: 38658857 PMCID: PMC11044480 DOI: 10.1186/s12864-024-10314-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Accepted: 04/15/2024] [Indexed: 04/26/2024]
Abstract
Transposable elements (TEs) are a major force in the evolution of plant genomes. The transposition activities and landscapes of TEs can vary substantially, even between closely related species. Interspecific hybridization, a widely employed technique in tomato breeding, results in the creation of novel combinations of TEs from distinct species. The implications of this process for TE transposition activity have not been studied in modern cultivars. In this study, we used nanopore sequencing of extrachromosomal circular DNA (eccDNA) and identified two highly active Ty1/Copia LTR retrotransposon families of tomato (Solanum lycopersicum), called Salsa and Ketchup. Elements of these families produce thousands of eccDNAs under controlled conditions and epigenetic stress. EccDNA sequence analysis revealed that most of the eccDNA produced by Ketchup and Salsa exhibited low similarity to the S. lycopersicum genomic sequence. To trace the origin of these TEs, whole-genome nanopore sequencing and de novo genome assembly were performed. We found that these TEs entered a tomato breeding line via interspecific introgression from S. peruvianum. Our findings collectively show that interspecific introgressions can contribute to both genetic and phenotypic diversity not only by introducing novel genetic variants, but also by importing active transposable elements from other species.
Affiliation(s)
- Pavel Merkulov
  - All-Russia Research Institute of Agricultural Biotechnology, 127550, Moscow, Russia
  - Moscow Institute of Physics and Technology, 141701, Dolgoprudny, Russia
- Melania Serganova
  - All-Russia Research Institute of Agricultural Biotechnology, 127550, Moscow, Russia
  - Moscow Institute of Physics and Technology, 141701, Dolgoprudny, Russia
- Georgy Petrov
  - Moscow Institute of Physics and Technology, 141701, Dolgoprudny, Russia
- Vladislav Mityukov
  - Skolkovo Institute of Science and Technology, 121205, Moscow, Russia
  - Institute for Information Transmission Problems (Kharkevich Institute), Russian Academy of Sciences, 127051, Moscow, Russia
- Ilya Kirov
  - All-Russia Research Institute of Agricultural Biotechnology, 127550, Moscow, Russia
  - Moscow Institute of Physics and Technology, 141701, Dolgoprudny, Russia
10
Luo J, Gao R, Chang W, Wang J. LSnet: detecting and genotyping deletions using deep learning network. Front Genet 2023; 14:1189775. [PMID: 37388936 PMCID: PMC10301831 DOI: 10.3389/fgene.2023.1189775] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Accepted: 06/05/2023] [Indexed: 07/01/2023]
Abstract
The role and biological impact of structural variation (SV) are increasingly evident. Deletions account for 40% of SVs and are an important SV type. Therefore, detecting and genotyping deletions is of great significance. At present, highly accurate long reads can be obtained as HiFi reads. Accurate long reads can also be obtained by combining error-prone long reads with highly accurate short reads. These accurate long reads are helpful for detecting and genotyping SVs. However, due to the complexity of the genome and of alignment information, detecting and genotyping SVs remains a challenging task. Here, we propose LSnet, an approach for detecting and genotyping deletions with a deep learning network. Because deep learning can learn complex features from labeled datasets, it is well suited to SV detection. First, LSnet divides the reference genome into continuous sub-regions. Based on the alignment between the sequencing data (either HiFi reads or the combination of error-prone long reads and short reads) and the reference genome, LSnet extracts nine features for each sub-region, and these features are treated as deletion signals. Second, LSnet uses a convolutional neural network and an attention mechanism to learn critical features in every sub-region. Next, in accordance with the relationship among the continuous sub-regions, LSnet uses a gated recurrent unit (GRU) network to further extract more important deletion signatures. Finally, a heuristic algorithm is presented to determine the location and length of deletions. Experimental results show that LSnet outperforms other methods in terms of the F1 score. The source code is available from GitHub at https://github.com/eioyuou/LSnet.
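The first step in this abstract, dividing the reference into continuous sub-regions and computing per-region deletion signals, can be pictured with a short sketch. The abstract gives neither the sub-region size nor the definition of the nine features, so the window length and the single toy feature below (deleted bases overlapping each window) are assumptions made for illustration, and both function names are hypothetical.

```python
from typing import List, Tuple

# Half-open interval [start, end) in reference coordinates (an assumed convention).
Interval = Tuple[int, int]


def split_into_subregions(contig_length: int, window: int) -> List[Interval]:
    """Tile a contig with consecutive, non-overlapping windows.

    The last window is shortened if the contig length is not a multiple of
    the window size.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [(s, min(s + window, contig_length)) for s in range(0, contig_length, window)]


def deleted_bases_per_subregion(subregions: List[Interval], deletions: List[Interval]) -> List[int]:
    """Toy feature: number of alignment-reported deleted bases overlapping each window.

    This stands in for one of the nine features, which the abstract does not define.
    """
    return [
        sum(max(0, min(end, d_end) - max(start, d_start)) for d_start, d_end in deletions)
        for start, end in subregions
    ]


windows = split_into_subregions(contig_length=450, window=200)
signal = deleted_bases_per_subregion(windows, deletions=[(150, 250), (390, 410)])
print(windows)
print(signal)
```

A per-window feature vector of this kind is what a downstream sequence model, such as the GRU mentioned in the abstract, would consume in window order.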
11
Ma H, Zhong C, Chen D, He H, Yang F. cnnLSV: detecting structural variants by encoding long-read alignment information and convolutional neural network. BMC Bioinformatics 2023; 24:119. [PMID: 36977976 PMCID: PMC10045035 DOI: 10.1186/s12859-023-05243-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/14/2022] [Accepted: 03/21/2023] [Indexed: 03/30/2023]
Abstract
BACKGROUND Genomic structural variant detection is a significant and challenging issue in genome analysis. Existing long-read-based structural variant detection methods still have room for improvement in detecting multiple types of structural variants. RESULTS In this paper, we propose a method called cnnLSV that obtains higher-quality detection results by eliminating false positives from the results merged from the callsets of existing methods. We design an encoding strategy for four types of structural variants that represents long-read alignment information around structural variants as images, input the images into a convolutional neural network to train a filter model, and load the trained model to remove false positives and improve detection performance. We also eliminate mislabeled training samples in the model training phase using principal component analysis and the unsupervised k-means clustering algorithm. Experimental results on both simulated and real datasets show that our proposed method outperforms existing methods overall in detecting insertions, deletions, inversions, and duplications. The cnnLSV program is available at https://github.com/mhuidong/cnnLSV. CONCLUSIONS The proposed cnnLSV detects structural variants using long-read alignment information and a convolutional neural network to achieve higher overall performance, and effectively eliminates incorrectly labeled samples using principal component analysis and k-means in the model training stage.
Affiliation(s)
- Huidong Ma
  - School of Computer, Electronics and Information, Guangxi University, Nanning, 530004, China
  - Key Laboratory of Parallel, Distributed and Intelligent Computing of Guangxi Universities and Colleges, Guangxi University, Nanning, 530004, China
- Cheng Zhong
  - School of Computer, Electronics and Information, Guangxi University, Nanning, 530004, China
  - Key Laboratory of Parallel, Distributed and Intelligent Computing of Guangxi Universities and Colleges, Guangxi University, Nanning, 530004, China
- Danyang Chen
  - School of Computer, Electronics and Information, Guangxi University, Nanning, 530004, China
  - Key Laboratory of Parallel, Distributed and Intelligent Computing of Guangxi Universities and Colleges, Guangxi University, Nanning, 530004, China
- Haofa He
  - School of Computer, Electronics and Information, Guangxi University, Nanning, 530004, China
  - Key Laboratory of Parallel, Distributed and Intelligent Computing of Guangxi Universities and Colleges, Guangxi University, Nanning, 530004, China
- Feng Yang
  - School of Computer, Electronics and Information, Guangxi University, Nanning, 530004, China
  - Key Laboratory of Parallel, Distributed and Intelligent Computing of Guangxi Universities and Colleges, Guangxi University, Nanning, 530004, China
12
Evaluation of the Available Variant Calling Tools for Oxford Nanopore Sequencing in Breast Cancer. Genes (Basel) 2022; 13:genes13091583. [PMID: 36140751 PMCID: PMC9498802 DOI: 10.3390/genes13091583] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/10/2022] [Revised: 08/30/2022] [Accepted: 08/31/2022] [Indexed: 11/23/2022]
Abstract
The goal of biomarker testing in personalized medicine is to guide treatment to achieve the best possible result for each patient. Accurate and reliable identification of each individual's genomic variants is essential for the success of clinical genomics employing third-generation sequencing. Different variant calling techniques have been used and recommended by both Oxford Nanopore Technologies (ONT) and the Nanopore community. A thorough examination of the variant callers might give critical guidance for third-generation sequencing-based clinical genomics. In this study, two reference genome sample datasets (NA12878 and NA24385) and the set of high-confidence variant calls provided by the Genome in a Bottle (GIAB) consortium were used to evaluate the performance of six variant calling tools (Human-SNP-wf, Clair3, Clair, NanoCaller, Longshot, and Medaka) as an integral step in the in-house variant detection workflow. Of the six variant callers under study, Clair3 and Human-SNP-wf, which incorporates Clair3, achieved the highest performance. Results for each tool were expressed in terms of precision, recall, and F1-score, using hap.py for the comparison. In conclusion, our findings give important insights for identifying accurate variants from third-generation sequencing of personal genomes using different variant detection tools available for long-read sequencing.
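The three metrics named in this abstract follow directly from counts of true positives, false positives, and false negatives. The sketch below shows the standard definitions only; it is not hap.py's code, the function name is hypothetical, and the counts in the example are made up for illustration.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard benchmarking metrics from variant-comparison counts.

    precision = TP / (TP + FP), recall = TP / (TP + FN),
    F1 = harmonic mean of precision and recall.
    Undefined ratios (zero denominator) are reported as 0.0, a convention
    chosen for this sketch.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Illustrative counts only, not results from the study.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, a caller with high precision but low recall (or the reverse) scores lower than one that balances the two.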