1
Zhou M, Dong J, Jiang H, Zhao Z, Yuan T. A copy number variation detection method based on OCSVM algorithm using multi strategies integration. Sci Rep 2025; 15:3526. [PMID: 39875521 PMCID: PMC11775105 DOI: 10.1038/s41598-025-88143-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2024] [Accepted: 01/24/2025] [Indexed: 01/30/2025]
Abstract
Copy number variation (CNV) is an important class of human genetic variation and is associated with many diseases. To tackle the limitations of traditional CNV detection methods, such as restricted detection types, high error rates, and difficulty in precisely locating variant breakpoints, a new method called MSCNV (copy number variations detection method for multi-strategies integration based on a one-class support vector machine model) is proposed. MSCNV establishes a multi-signal channel that integrates three strategies: read depth, split read, and read pair. First, a one-class support vector machine algorithm is used to detect abnormal signals in read depth and mapping quality values to determine the rough CNV region. Then, the rough CNV region is filtered using paired-read signals to improve the precision of the MSCNV method. Finally, MSCNV explores and recognizes tandem duplication regions, interspersed duplication regions, and loss regions, using split-read signals to determine the precise location of mutation points and the type of variation. Compared with Manta, FREEC, GROM-RD, Rsicnv, and CNVkit, MSCNV significantly improves the sensitivity, precision, F1-score, and overlap density score of CNV detection while reducing the boundary bias of the detection results.
Affiliation(s)
- Mengjiao Zhou
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, 252000, Shandong, P.R. China
  - Shandong Provincial Academy of Educational Recruitment and Examination, Jinan, 250011, Shandong, P.R. China
- Jinxin Dong
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, 252000, Shandong, P.R. China
- Hua Jiang
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, 252000, Shandong, P.R. China
- Zuyao Zhao
  - Orthopedics Department, Liaocheng People's Hospital, Liaocheng, 252000, P.R. China
- Tianting Yuan
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, 252000, Shandong, P.R. China
2
Yu X, Qin F, Liu S, Brown NJ, Lu Q, Cai G, Guler JL, Xiao F. HapCNV: A Comprehensive Framework for CNV Detection in Low-input DNA Sequencing Data. bioRxiv 2025:2024.12.19.629494. [PMID: 39763944 PMCID: PMC11702719 DOI: 10.1101/2024.12.19.629494] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/15/2025]
Abstract
Copy number variants (CNVs) are prevalent in both diploid and haploid genomes, with the latter containing a single copy of each gene. Studying CNVs in genomes from single or few cells is significantly advancing our knowledge in human disorders and disease susceptibility. Low-input including low-cell and single-cell sequencing data for haploid and diploid organisms generally displays shallow and highly non-uniform read counts resulting from the whole genome amplification steps that introduce amplification biases. In addition, haploid organisms typically possess relatively short genomes and require a higher degree of DNA amplification compared to diploid organisms. However, most CNV detection methods are specifically developed for diploid genomes without specific consideration of effects on haploid genomes. Challenges also reside in reference samples or normal controls which are used to provide baseline signals for defining copy number losses or gains. In traditional methods, references are usually pre-specified from cells that are assumed to be normal or disease-free. However, the use of pre-defined reference cells can bias results if common CNVs are present. Here, we present the development of a comprehensive statistical framework for data normalization and CNV detection in haploid single- or low-cell DNA sequencing data called HapCNV. The prominent advancement is the construction of a novel genomic location specific pseudo-reference that selects unbiased references using a preliminary cell clustering method. This approach effectively preserves common CNVs. Using simulations, we demonstrated that HapCNV outperformed existing methods by generating more accurate CNV detection, especially for short CNVs. Superior performance of HapCNV was also validated in detecting known CNVs in a real P. falciparum parasite dataset. 
In conclusion, HapCNV provides a novel and useful approach for CNV detection in haploid low-input sequencing datasets, with easy applicability to diploids.
Affiliation(s)
- Xuanxuan Yu
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC, USA
- Fei Qin
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, 9609 Medical Center Drive, Rockville, MD, 20850, USA
- Shiwei Liu
  - Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Noah J. Brown
  - Department of Biology, University of Virginia, Charlottesville, VA, USA
- Qing Lu
  - Department of Biostatistics, College of Public Health and Health Promotions & College of Medicine, University of Florida, Gainesville, FL, USA
- Guoshuai Cai
  - Department of Surgery, College of Medicine, University of Florida, Gainesville, FL, USA
- Jennifer L. Guler
  - Department of Biology, University of Virginia, Charlottesville, VA, USA
- Feifei Xiao
  - Department of Biostatistics, College of Public Health and Health Promotions & College of Medicine, University of Florida, Gainesville, FL, USA
3
Sinha R, Pal RK, De RK. A novel method addressing NGS-based mappability bias for sensitive detection of DNA alterations. J Bioinform Comput Biol 2024; 22:2450009. [PMID: 39030667 DOI: 10.1142/s0219720024500094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/21/2024]
Abstract
A turning point in cancer research was the introduction of massively parallel sequencing technology, which greatly reduced the cost and time of genome sequencing. This widened the scope for detecting and analyzing the role of structural alterations in cancer. However, NGS-based approaches carry certain biases that badly affect CNV identification. Moreover, DNA repeats in CNV regions need special attention, as they degrade the performance of the majority of existing CNV detection tools even after a generalized bias correction method is applied. This motivated the present work, in which a novel method is designed to address DNA repeats, and thereby mappability bias, in CNV regions. The method consists of three phases. The first phase computes the alignment information of uniquely mapped DNA reads, considering the base quality and base mismatch parameters at nucleotide-level precision. The second and third phases use a novel approach to allocate the non-uniquely mapped reads to an optimal region of the DNA repeats based on a probabilistic membership model. The proposed method can identify CNVs in both coding and non-coding regions of the DNA, including CNVs in DNA repeat regions. The methodology achieves a sensitivity greater than [Formula: see text] in the simulations performed; on real data, the detected variants are validated against the Database of Genomic Variants, where the percentage overlap is also greater than 95%, and breakpoint prediction is much better than that of other popular bias-correcting CNV detection methods.
Affiliation(s)
- Rituparna Sinha
  - Information Technology, Heritage Institute of Technology, Anandapur Kolkata, West Bengal, India
- Rajat Kumar Pal
  - Computer Science and Engineering Department, University of Calcutta, Kolkata, India
- Rajat Kumar De
  - Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
4
Kurt S, Chen M, Toosi H, Chen X, Engblom C, Mold J, Hartman J, Lagergren J. CopyVAE: a variational autoencoder-based approach for copy number variation inference using single-cell transcriptomics. Bioinformatics 2024; 40:btae284. [PMID: 38676578 PMCID: PMC11087824 DOI: 10.1093/bioinformatics/btae284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Revised: 03/06/2024] [Accepted: 04/25/2024] [Indexed: 04/29/2024]
Abstract
MOTIVATION: Copy number variations (CNVs) are common genetic alterations in tumour cells. The delineation of CNVs holds promise for enhancing our comprehension of cancer progression. Moreover, accurate inference of CNVs from single-cell sequencing data is essential for unravelling intratumoral heterogeneity. However, existing inference methods face limitations in resolution and sensitivity.
RESULTS: To address these challenges, we present CopyVAE, a deep learning framework based on a variational autoencoder architecture. Through experiments, we demonstrated that CopyVAE can accurately and reliably detect CNVs from data obtained using single-cell RNA sequencing. CopyVAE surpasses existing methods in terms of sensitivity and specificity. We also discussed CopyVAE's potential to advance our understanding of genetic alterations and their impact on disease advancement.
AVAILABILITY AND IMPLEMENTATION: CopyVAE is implemented and freely available under MIT license at https://github.com/kurtsemih/copyVAE.
Affiliation(s)
- Semih Kurt
  - School of EECS and SciLifeLab, KTH Royal Institute of Technology, Stockholm, 100 44, Sweden
- Mandi Chen
  - School of EECS and SciLifeLab, KTH Royal Institute of Technology, Stockholm, 100 44, Sweden
- Hosein Toosi
  - School of EECS and SciLifeLab, KTH Royal Institute of Technology, Stockholm, 100 44, Sweden
- Xinsong Chen
  - Department of Oncology and Pathology, Karolinska Institutet, Solna, 171 77, Sweden
- Camilla Engblom
  - Department of Cell and Molecular Biology, Karolinska Institutet, Solna, 171 77, Sweden
- Jeff Mold
  - Department of Cell and Molecular Biology, Karolinska Institutet, Solna, 171 77, Sweden
- Johan Hartman
  - Department of Oncology and Pathology, Karolinska Institutet, Solna, 171 77, Sweden
  - Department of Clinical Pathology and Cytology, Karolinska University Laboratory, Solna, 171 76, Sweden
- Jens Lagergren
  - School of EECS and SciLifeLab, KTH Royal Institute of Technology, Stockholm, 100 44, Sweden
5
Xie K, Ge X, Alvi HAK, Liu K, Song J, Yu Q. OTSUCNV: an adaptive segmentation and OTSU-based anomaly classification method for CNV detection using NGS data. BMC Genomics 2024; 25:126. [PMID: 38291375 PMCID: PMC10826217 DOI: 10.1186/s12864-024-10018-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Accepted: 01/15/2024] [Indexed: 02/01/2024]
Abstract
Copy-number variations (CNVs), which refer to deletions and duplications of chromosomal segments, represent a significant source of variation among individuals, contributing to human evolution and being implicated in various diseases ranging from mental illness and developmental disorders to cancer. Despite the development of several methods for detecting copy number variations based on next-generation sequencing (NGS) data, achieving robust detection performance for CNVs with arbitrary coverage and amplitude remains challenging due to the inherent complexity of sequencing samples. In this paper, we propose an alternative method called OTSUCNV for CNV detection on whole genome sequencing (WGS) data. This method utilizes a newly designed adaptive sequence segmentation algorithm and an OTSU-based CNV prediction algorithm, which does not rely on any distribution assumptions or involve complex outlier factor calculations. As a result, the effective detection of CNVs is achieved with lower computational complexity. The experimental results indicate that the proposed method demonstrates outstanding performance, and hence it may be used as an effective tool for CNV detection.
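As a rough illustration of the Otsu idea mentioned in this abstract, the sketch below picks the cutoff that maximizes between-class variance over a histogram of read-depth deviations. It is not the OTSUCNV implementation: the function names, the 64-bin histogram, and the use of absolute deviation from the median as the input signal are all assumptions made for the example, and the adaptive segmentation step is not shown.

```python
from statistics import median


def otsu_threshold(values, bins=64):
    """Return the cutoff that maximizes between-class variance (Otsu's criterion)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo  # no spread, nothing to separate
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    sum_all = sum(c * h for c, h in zip(centers, hist))

    best_var, best_cut = -1.0, lo
    w0, sum0 = 0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]
        sum0 += centers[i] * hist[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2  # proportional to between-class variance
        if between > best_var:
            best_var, best_cut = between, lo + (i + 1) * width
    return best_cut


def flag_abnormal_bins(depths):
    """Hypothetical helper: flag bins whose deviation from the median exceeds the Otsu cutoff."""
    baseline = median(depths)
    deviations = [abs(d - baseline) for d in depths]
    cutoff = otsu_threshold(deviations)
    return [i for i, dev in enumerate(deviations) if dev > cutoff]
```

Because the cutoff comes from the data's own histogram, no distribution has to be assumed for the read depths, which matches the motivation the abstract gives for using Otsu.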
Affiliation(s)
- Kun Xie
  - School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
- Xiaojun Ge
  - School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
- Haque A K Alvi
  - School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
- Kang Liu
  - School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
- Jianfeng Song
  - School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
- Qiang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
  - Hangzhou Institute of Technology, Xidian University, Hangzhou, 311200, China
6
Zhang Y, Liu W, Duan J. On the core segmentation algorithms of copy number variation detection tools. Brief Bioinform 2024; 25:bbae022. [PMID: 38340093 PMCID: PMC10858679 DOI: 10.1093/bib/bbae022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Revised: 10/26/2023] [Indexed: 02/12/2024]
Abstract
Shotgun sequencing is a high-throughput method used to detect copy number variants (CNVs). Although there are numerous CNV detection tools based on shotgun sequencing, their quality varies significantly, leading to performance discrepancies. Therefore, we conducted a comprehensive analysis of next-generation sequencing-based CNV detection tools over the past decade. Our findings revealed that the majority of mainstream tools employ a similar detection rationale: they calculate the so-called read depth signal from aligned sequencing reads and then segment the signal using either circular binary segmentation (CBS) or a hidden Markov model (HMM). Hence, we compared the performance of those two core segmentation algorithms in CNV detection, considering varying sequencing depths, segment lengths and complex types of CNVs. To ensure a fair comparison, we designed a parametric model using mainstream statistical distributions, which allows for pre-excluding bias correction such as guanine-cytosine (GC) content during the preprocessing step. The results indicate the following key points: (1) Under ideal conditions, CBS demonstrates high precision, while HMM exhibits a high recall rate. (2) Under practical conditions, HMM is advantageous at lower sequencing depths, while CBS is more competitive than HMM in detecting small variant segments. (3) In cases involving complex CNVs resembling real sequencing, HMM is more robust than CBS. (4) On large-scale sequencing data, HMM takes less time than CBS, while their memory usage is approximately equal. These results can provide important guidance and a reference for researchers developing new tools for CNV detection.
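To make the HMM side of this comparison concrete, here is a minimal sketch of segmenting a normalized read-depth signal by Viterbi decoding. It is an illustration only, not code from any of the tools surveyed: the three states, the Gaussian emission means and standard deviation, and the self-transition probability are all assumed values chosen for the example.

```python
import math

# Assumed model: three copy-number states with Gaussian emissions on the
# depth ratio (sample depth / expected depth). All numbers are illustrative.
STATES = ("loss", "neutral", "gain")
MEANS = {"loss": 0.5, "neutral": 1.0, "gain": 1.5}
SIGMA = 0.15
STAY = 0.98  # probability of remaining in the same state between adjacent bins


def log_emission(x, state):
    z = (x - MEANS[state]) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(ratios):
    """Return the most likely state label for each bin."""
    if not ratios:
        return []
    score = {s: math.log(1 / len(STATES)) + log_emission(ratios[0], s) for s in STATES}
    backpointers = []
    for x in ratios[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_transition(p, s))
            new_score[s] = score[prev] + log_transition(prev, s) + log_emission(x, s)
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

The high self-transition probability is what turns per-bin classification into segmentation: an isolated noisy bin is cheaper to explain as noise than as two state changes, while a run of shifted bins is not.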
Affiliation(s)
- Yibo Zhang
  - Key Laboratory of Biomedical Information Engineering of Ministry of Education and Department of Biomedical Engineering, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
- Wenyu Liu
  - Key Laboratory of Biomedical Information Engineering of Ministry of Education and Department of Biomedical Engineering, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
- Junbo Duan
  - Key Laboratory of Biomedical Information Engineering of Ministry of Education and Department of Biomedical Engineering, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
7
Li C, Fan S, Zhao H, Liu X. CNV-FB: A Feature bagging strategy-based approach to detect copy number variants from NGS data. J Bioinform Comput Biol 2023; 21:2350026. [PMID: 38212874 DOI: 10.1142/s0219720023500269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/13/2024]
Abstract
Copy number variation (CNV), as a type of genomic structural variation, accounts for a large proportion of structural variation and is related to the pathogenesis and susceptibility to some human diseases, playing an important role in the development and change of human diseases. The development of next-generation sequencing technology (NGS) provides strong support for the design of CNV detection algorithms. Although a large number of methods have been developed to detect CNVs using NGS data, it is still considered a difficult problem to detect CNVs with low purity and coverage. In this paper, a new calculation method CNV-FB is proposed to detect CNVs from NGS data. The core idea of CNV-FB is to randomly sample the read depth values of the genome fragment, and then each sample is individually detected for outliers, and finally combined into a final outlier score. The CNV-FB method was applied to simulation data and real data experiments and compared with the other five methods of the same type. The results show that the CNV-FB method has a better detection effect than other methods. Therefore, the CNV-FB method may be an effective algorithm for detecting genomic mutations.
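The sample-score-combine pattern described in this abstract can be sketched as follows. This is only loosely modelled on the description and is not the CNV-FB detector: the per-round outlier measure (a robust z-score against the subsample's median and median absolute deviation), the number of rounds, the subsample size, and the simple averaging of scores are all assumptions made for the example.

```python
import random
from statistics import median


def bagged_outlier_scores(depths, rounds=25, sample_size=6, seed=0, eps=1e-9):
    """Average a robust z-score over several random subsamples of the read depths."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    totals = [0.0] * len(depths)
    for _ in range(rounds):
        sample = rng.sample(depths, min(sample_size, len(depths)))
        center = median(sample)
        mad = median(abs(v - center) for v in sample)  # median absolute deviation
        for i, rd in enumerate(depths):
            totals[i] += abs(rd - center) / (mad + eps)
    return [t / rounds for t in totals]
```

Bins with a consistently high combined score across rounds are the candidates for copy number change; a real pipeline would still need a rule for turning scores into calls.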
Affiliation(s)
- Chengyou Li
  - School of Computer Science, Liaocheng University, Liaocheng 252000, P. R. China
- Shiqiang Fan
  - School of Computer Science, Liaocheng University, Liaocheng 252000, P. R. China
- Haiyong Zhao
  - School of Computer Science, Liaocheng University, Liaocheng 252000, P. R. China
- Xiaotong Liu
  - School of Agronomy and Agricultural Engineering, Liaocheng University, Liaocheng 252000, P. R. China
8
Liu G, Yang H, He Z. Detection of copy number variations based on a local distance using next-generation sequencing data. Front Genet 2023; 14:1147761. [PMID: 37811148 PMCID: PMC10556732 DOI: 10.3389/fgene.2023.1147761] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/01/2023] [Accepted: 09/14/2023] [Indexed: 10/10/2023]
Abstract
As one of the main types of structural variation in the human genome, copy number variation (CNV) plays an important role in the occurrence and development of human cancers. Next-generation sequencing (NGS) technology can provide base-level resolution, which provides favorable conditions for the accurate detection of CNVs. However, it is still very challenging to accurately detect CNVs from cancer samples with different purity and low sequencing coverage. Local distance-based CNV detection (LDCNV), an innovative computational approach to predict CNVs using NGS data, is proposed in this work. For each read depth (RD), LDCNV calculates the average distance between the RD and its k nearest neighbors (KNNs), which defines the distance of KNNs of that RD, and the average distance among those KNNs, which defines their internal distance. Based on these definitions, a local distance score is constructed for each RD as the ratio between the distance of KNNs and the internal distance of KNNs. A normal distribution is fitted to the local distance scores to evaluate the significance level of each RDS, and a hypothesis test is then used to predict the CNVs. The performance of the proposed method is verified with simulated and real data and compared with several popular methods. The experimental results show that the proposed method is superior to various other techniques. Therefore, the proposed method can be helpful for cancer diagnosis and targeted drug development.
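The local distance score described in this abstract can be sketched in a few lines. This is an illustration under assumptions, not the LDCNV code: read depths are treated as scalars compared by absolute difference, `k` defaults to 3, and a small `eps` is added to the denominator to avoid division by zero when the neighbors are identical. The normal-distribution fit and hypothesis test are not shown.

```python
from itertools import combinations


def local_distance_scores(depths, k=3, eps=1e-9):
    """Ratio of (mean distance to the k nearest neighbors) to (mean distance among them).

    Assumes k >= 2 and len(depths) > k.
    """
    scores = []
    for i, rd in enumerate(depths):
        others = [d for j, d in enumerate(depths) if j != i]
        neighbors = sorted(others, key=lambda d: abs(d - rd))[:k]
        knn_distance = sum(abs(d - rd) for d in neighbors) / k
        pairs = list(combinations(neighbors, 2))
        internal_distance = sum(abs(a - b) for a, b in pairs) / len(pairs)
        scores.append(knn_distance / (internal_distance + eps))
    return scores
```

A read depth that sits far from a tight cluster of neighbors gets a large ratio, whereas one inside the cluster scores close to or below 1, which is what makes the score usable as an outlier measure.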
Affiliation(s)
- Guojun Liu
  - School of Mathematics, Xi’an University of Finance and Economics, Xi’an, China
- Hongzhi Yang
  - Department of Radiology, XD Group Hospital, Xi’an, China
- Zongzhen He
  - School of Mathematics, Xi’an University of Finance and Economics, Xi’an, China
9
Kosugi S, Kamatani Y, Harada K, Tomizuka K, Momozawa Y, Morisaki T, Terao C. Detection of trait-associated structural variations using short-read sequencing. Cell Genomics 2023; 3:100328. [PMID: 37388916 PMCID: PMC10300613 DOI: 10.1016/j.xgen.2023.100328] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 05/19/2022] [Revised: 02/17/2023] [Accepted: 04/25/2023] [Indexed: 07/01/2023]
Abstract
Genomic structural variation (SV) affects genetic and phenotypic characteristics in diverse organisms, but the lack of reliable methods to detect SV has hindered genetic analysis. We developed a computational algorithm (MOPline) that includes missing call recovery combined with high-confidence SV call selection and genotyping using short-read whole-genome sequencing (WGS) data. Using 3,672 high-coverage WGS datasets, MOPline stably detected ∼16,000 SVs per individual, which is over ∼1.7-3.3-fold higher than previous large-scale projects while exhibiting a comparable level of statistical quality metrics. We imputed SVs from 181,622 Japanese individuals for 42 diseases and 60 quantitative traits. A genome-wide association study with the imputed SVs revealed 41 top-ranked or nearly top-ranked genome-wide significant SVs, including 8 exonic SVs with 5 novel associations and enriched mobile element insertions. This study demonstrates that short-read WGS data can be used to identify rare and common SVs associated with a variety of traits.
Affiliation(s)
- Shunichi Kosugi
  - Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
  - Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan
- Yoichiro Kamatani
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5, Kashiwanoha, Kashiwa-shi, Chiba 277-8562, Japan
- Katsutoshi Harada
  - Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
- Kohei Tomizuka
  - Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
- Yukihide Momozawa
  - Laboratory for Genotyping Development, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama City, Kanagawa 230-0045, Japan
- Takayuki Morisaki
  - Division of Molecular Pathology, Institute of Medical Science, The University of Tokyo, 4-6-1, Shirokane-dai, Minato-ku, Tokyo 108-8639, Japan
- Chikashi Terao
  - Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
  - Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan
  - The Department of Applied Genetics, The School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka, Japan
10
Adaptive Savitzky–Golay Filters for Analysis of Copy Number Variation Peaks from Whole-Exome Sequencing Data. Information 2023. [DOI: 10.3390/info14020128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/18/2023]
Abstract
Copy number variation (CNV) is a form of structural variation in the human genome that provides medical insight into complex human diseases; while whole-genome sequencing is becoming more affordable, whole-exome sequencing (WES) remains an important tool in clinical diagnostics. Because of its discontinuous nature and unique characteristics of sparse target-enrichment-based WES data, the analysis and detection of CNV peaks remain difficult tasks. The Savitzky–Golay (SG) smoothing is well known as a fast and efficient smoothing method. However, no study has documented the use of this technique for CNV peak detection. It is well known that the effectiveness of the classical SG filter depends on the proper selection of the window length and polynomial degree, which should correspond with the scale of the peak because, in the case of peaks with a high rate of change, the effectiveness of the filter could be restricted. Based on the Savitzky–Golay algorithm, this paper introduces a novel adaptive method to smooth irregular peak distributions. The proposed method ensures high-precision noise reduction by dynamically modifying the results of the prior smoothing to automatically adjust parameters. Our method offers an additional feature extraction technique based on density and Euclidean distance. In comparison to classical Savitzky–Golay filtering and other peer filtering methods, the performance evaluation demonstrates that adaptive Savitzky–Golay filtering performs better. According to experimental results, our method effectively detects CNV peaks across all genomic segments for both short and long tags, with minimal peak height fidelity values (i.e., low estimation bias). As a result, we clearly demonstrate how well the adaptive Savitzky–Golay filtering method works and how its use in the detection of CNV peaks can complement the existing techniques used in CNV peak analysis.
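For readers unfamiliar with the classical filter that this abstract builds on, the sketch below applies a fixed Savitzky–Golay smoother with a 5-point window and a quadratic fit, using the standard convolution coefficients (-3, 12, 17, 12, -3)/35. It shows only the classical, non-adaptive filter; the adaptive parameter selection proposed in the paper is not reproduced here, and leaving the two points at each edge unsmoothed is a simplification chosen for the example.

```python
# Standard Savitzky-Golay coefficients for window length 5, polynomial degree 2.
SG5_QUADRATIC = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol5(signal):
    """Smooth a sequence with a 5-point quadratic Savitzky-Golay filter.

    Edge points (two at each end) are returned unchanged.
    """
    half = len(SG5_QUADRATIC) // 2
    data = list(signal)
    smoothed = list(data)
    for i in range(half, len(data) - half):
        window = data[i - half:i + half + 1]
        smoothed[i] = sum(c * v for c, v in zip(SG5_QUADRATIC, window))
    return smoothed
```

The example also shows the limitation the abstract points to: a fixed 5-point window spreads a one-bin spike of height 35 into values of 12, 17, 12, so narrow peaks lose height unless the window and degree are matched to the peak's scale.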
11
Dharanipragada P, Parekh N. In Silico Identification and Functional Characterization of Genetic Variations across DLBCL Cell Lines. Cells 2023; 12:cells12040596. [PMID: 36831263 PMCID: PMC9954129 DOI: 10.3390/cells12040596] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2022] [Revised: 01/12/2023] [Accepted: 01/29/2023] [Indexed: 02/15/2023]
Abstract
Diffuse large B-cell lymphoma (DLBCL) is the most common form of non-Hodgkin lymphoma and frequently develops through the accumulation of several genetic variations. With the advancement in high-throughput techniques, in addition to mutations and copy number variations, structural variations have gained importance for their role in genome instability leading to tumorigenesis. In this study, in order to understand the genetics of DLBCL pathogenesis, we carried out a whole-genome mutation profile analysis of eleven human cell lines from germinal-center B-cell-like (GCB-7) and activated B-cell-like (ABC-4) subtypes of DLBCL. Analysis of genetic variations including small sequence variants and large structural variations across the cell lines revealed distinct variation profiles indicating the heterogeneous nature of DLBCL and the need for novel patient stratification methods to design potential intervention strategies. Validation and prognostic significance of the variants was assessed using annotations provided for DLBCL samples in cBioPortal for Cancer Genomics. Combining genetic variations revealed new subgroups between the subtypes and associated enriched pathways, viz., PI3K-AKT signaling, cell cycle, TGF-beta signaling, and WNT signaling. Mutation landscape analysis also revealed drug-variant associations and possible effectiveness of known and novel DLBCL treatments. From the whole-genome-based mutation analysis, our findings suggest putative molecular genetics of DLBCL lymphomagenesis and potential genomics-driven precision treatments.
12
Liu G, Yang H, Yuan X. A shortest path-based approach for copy number variation detection from next-generation sequencing data. Front Genet 2023; 13:1084974. [PMID: 36733945 PMCID: PMC9887524 DOI: 10.3389/fgene.2022.1084974] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2022] [Accepted: 12/27/2022] [Indexed: 01/18/2023]
Abstract
Copy number variation (CNV) is one of the main structural variations in the human genome and accounts for a considerable proportion of variations. As CNVs can directly or indirectly cause cancer, mental illness, and genetic disease in humans, their effective detection is of great interest in the fields of oncogene discovery, clinical decision-making, bioinformatics, and drug discovery. The advent of next-generation sequencing data makes CNV detection possible, and a large number of CNV detection tools are based on next-generation sequencing data. Due to the complexity (e.g., bias, noise, alignment errors) of next-generation sequencing data and CNV structures, the accuracy of existing methods in detecting CNVs remains low. In this work, we design a new CNV detection approach, called shortest path-based copy number variation (SPCNV), to improve the detection accuracy of CNVs. SPCNV calculates the k nearest neighbors of each read depth and defines the shortest path, shortest path relation, and shortest path cost sets, from which it calculates the mean shortest path cost of each read depth and its k nearest neighbors. We use the ratio between the mean shortest path cost of each read depth and the mean of the mean shortest path costs of its k nearest neighbors to construct a relative shortest path score for each read depth. Based on the score profile, a boxplot is then applied to predict CNVs. The performance of the proposed method is verified by simulation data experiments and compared against several popular methods of the same type. Experimental results show that the proposed method achieves the best balance between recall and precision in each set of simulated samples. To further verify the performance of the proposed method in real application scenarios, we then select real sample data from the 1000 Genomes Project to conduct experiments. The proposed method achieves the best F1-scores in almost all samples. Therefore, the proposed method can be used as a more reliable tool for the routine detection of CNVs.
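The final step named in this abstract, applying a boxplot to a score profile, can be illustrated with the conventional interquartile-range fences. The sketch covers only that step and takes the scores as given; it does not compute the relative shortest path score. The whisker factor of 1.5 and the inclusive quartile method are conventional choices assumed for the example, not values taken from the paper.

```python
from statistics import quantiles


def boxplot_outliers(scores, whisker=1.5):
    """Return indices of scores outside the boxplot fences Q1 - w*IQR and Q3 + w*IQR."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [i for i, s in enumerate(scores) if s < lower or s > upper]
```

Because the fences are derived from the quartiles of the scores themselves, this rule needs no assumption about how the scores are distributed.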
Affiliation(s)
- Guojun Liu
  - School of Statistics, Xi’an University of Finance and Economics, Xi’an, China (correspondence: Guojun Liu; Xiguo Yuan)
- Hongzhi Yang
  - Medical Imaging Center, Xidian Group Hospital, Xi’an, China
- Xiguo Yuan
  - Hangzhou Institute of Technology, Xidian University, Hangzhou, China (correspondence: Guojun Liu; Xiguo Yuan)
13
Kim H, Shim Y, Lee TG, Won D, Choi JR, Shin S, Lee ST. Copy-number analysis by base-level normalization: An intuitive visualization tool for evaluating copy number variations. Clin Genet 2023; 103:35-44. [PMID: 36152294 DOI: 10.1111/cge.14236] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 07/08/2022] [Revised: 09/19/2022] [Accepted: 09/20/2022] [Indexed: 12/13/2022]
Abstract
Next-generation sequencing (NGS) facilitates comprehensive molecular analyses that help with diagnosing unsolved disorders. In addition to detecting single-nucleotide variations and small insertions/deletions, bioinformatics tools can identify copy number variations (CNVs) in NGS data, which improves the diagnostic yield. However, due to the possibility of false positives, subsequent confirmation tests are generally performed. Here, we introduce Copy-number Analysis by BAse-level NormAlization (CABANA), a visualization tool that allows users to intuitively identify candidate CNVs using the normalized single-base-level read depth calculated from NGS data. To demonstrate how CABANA works, NGS data were obtained from 474 patients with neuromuscular disorders. CNVs were screened using a conventional bioinformatics tool, ExomeDepth, and then we normalized and visualized those data at the single-base level using CABANA, followed by manual inspection by geneticists to filter out false positives and determine candidate CNVs. In doing so, we identified 31 candidate CNVs (7%) in 474 patients and subsequently confirmed all of them to be true using multiplex ligation-dependent probe amplification. The performance of CABANA was deemed acceptable by comparing its diagnostic yield with previous data about neuromuscular disorders. Despite some limitations, we expect CABANA to help researchers accurately identify CNVs and reduce the need for subsequent confirmation testing.
Affiliation(s)
- Hongkyung Kim
  - Department of Laboratory Medicine, Yonsei University College of Medicine, Severance Hospital, Seoul, Republic of Korea
- Yeeun Shim
  - Brain Korea 21 PLUS Project for Medical Science, Yonsei University, Seoul, Republic of Korea
- Taek Gyu Lee
  - Brain Korea 21 PLUS Project for Medical Science, Yonsei University, Seoul, Republic of Korea
- Dongju Won
  - Department of Laboratory Medicine, Yonsei University College of Medicine, Severance Hospital, Seoul, Republic of Korea
- Jong Rak Choi
  - Department of Laboratory Medicine, Yonsei University College of Medicine, Severance Hospital, Seoul, Republic of Korea
  - Dxome Co. Ltd, Seongnam-si, Gyeonggi-do, Republic of Korea
- Saeam Shin
  - Department of Laboratory Medicine, Yonsei University College of Medicine, Severance Hospital, Seoul, Republic of Korea
- Seung-Tae Lee
  - Department of Laboratory Medicine, Yonsei University College of Medicine, Severance Hospital, Seoul, Republic of Korea
  - Dxome Co. Ltd, Seongnam-si, Gyeonggi-do, Republic of Korea
14. Zhang T, Dong J, Jiang H, Zhao Z, Zhou M, Yuan T. CNV-PCC: An efficient method for detecting copy number variations from next-generation sequencing data. Front Bioeng Biotechnol 2022; 10:1000638. [DOI: 10.3389/fbioe.2022.1000638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2022] [Accepted: 11/18/2022] [Indexed: 12/03/2022] Open
Abstract
Copy number variations (CNVs) significantly influence the diversity of the human genome and the occurrence of many complex diseases. Next-generation sequencing (NGS) technology provides rich data for detecting CNVs, and the read depth (RD)-based approach is widely used. However, low CN (copy number of 3–4) duplication events are challenging to identify with existing methods, especially when the size of CNVs is small. In addition, the RD-based approach can only obtain rough breakpoints. We propose a new method, CNV-PCC (detection of CNVs based on Principal Component Classifier), to identify CNVs in whole genome sequencing data. CNV-PCC first uses the split read signal to search for potential breakpoints. A two-stage segmentation strategy is then implemented to enhance the identification of low CN duplications and small CNVs. Next, the outlier scores are calculated for each segment by PCC (Principal Component Classifier). Finally, the OTSU algorithm calculates the threshold that determines the CNV regions. The analysis of simulated data indicates that CNV-PCC outperforms the other methods in sensitivity and F1-score and improves breakpoint accuracy. Furthermore, CNV-PCC shows high consistency with other methods on real sequencing samples. This study demonstrates that CNV-PCC is an effective method for detecting CNVs, even for low CN duplications and small CNVs.
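The thresholding step can be illustrated with a small sketch of Otsu's criterion applied to one-dimensional outlier scores. It searches every cut between sorted values and keeps the one that maximizes the between-class variance. This is an assumption-laden illustration, not the CNV-PCC code: the function name `otsu_threshold`, the exhaustive search over raw values (instead of a histogram), and the toy scores are hypothetical.

```python
def otsu_threshold(values):
    """Return the cut that maximizes the between-class variance (Otsu's criterion)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_cut, best_var = xs[0], -1.0
    left_sum = 0.0
    for i in range(1, n):                 # candidate split: xs[:i] versus xs[i:]
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue                      # equal values cannot be separated
        w0, w1 = i / n, (n - i) / n       # class weights
        mu0 = left_sum / i                # mean of the low-score class
        mu1 = (total - left_sum) / (n - i)  # mean of the high-score class
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var = between
            best_cut = (xs[i - 1] + xs[i]) / 2
    return best_cut

# Hypothetical per-segment outlier scores; three segments score much higher.
scores = [0.1, 0.2, 0.15, 0.12, 0.18, 2.5, 2.8, 0.11, 3.0, 0.16]
cut = otsu_threshold(scores)
cnv_segments = [i for i, s in enumerate(scores) if s > cut]
print(cut, cnv_segments)
```

Because the cut is derived from the scores themselves, no user-defined threshold is needed.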
15. Wang X, Junqing L, Huang T. CNVABNN: An AdaBoost algorithm and neural networks-based detection of copy number variations from NGS data. Comput Biol Chem 2022; 99:107720. [DOI: 10.1016/j.compbiolchem.2022.107720] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/28/2022] [Revised: 06/22/2022] [Accepted: 06/23/2022] [Indexed: 11/03/2022]
16. Lei Y, Meng Y, Guo X, Ning K, Bian Y, Li L, Hu Z, Anashkina AA, Jiang Q, Dong Y, Zhu X. Overview of structural variation calling: Simulation, identification, and visualization. Comput Biol Med 2022; 145:105534. [DOI: 10.1016/j.compbiomed.2022.105534] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/20/2022] [Revised: 04/09/2022] [Accepted: 04/14/2022] [Indexed: 12/11/2022]
17. svBreak: A New Approach for the Detection of Structural Variant Breakpoints Based on Convolutional Neural Network. Biomed Res Int 2022; 2022:7196040. [PMID: 35345526 PMCID: PMC8957449 DOI: 10.1155/2022/7196040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2021] [Revised: 01/04/2022] [Accepted: 01/27/2022] [Indexed: 12/01/2022]
Abstract
Structural variation (SV) is an important type of genome variation and confers susceptibility to human cancer diseases. Systematic analysis of SVs has become a crucial step for the exploration of mechanisms and precision diagnosis of cancers. The central point is how to accurately detect SV breakpoints by using next-generation sequencing (NGS) data. Due to the cooccurrence of multiple types of SVs in the human genome and the intrinsic complexity of SVs, the discrimination of SV breakpoint types is a challenging task. In this paper, we propose a convolutional neural network- (CNN-) based approach, called svBreak, for the detection and discrimination of common types of SV breakpoints. The principle of svBreak is that it extracts a set of SV-related features for each genome site from the sequencing reads aligned to the reference genome and establishes a data matrix where each row represents one site and each column represents one feature and then adopts a CNN model to analyze such data matrix for the prediction of SV breakpoints. The performance of the proposed approach is tested via simulation studies and application to a real sequencing sample. The experimental results demonstrate the merits of the proposed approach when compared with existing methods. Thus, svBreak can be expected to be a supplementary approach in the field of SV analysis in human tumor genomes.
18. Gordeeva V, Sharova E, Arapidi G. Progress in Methods for Copy Number Variation Profiling. Int J Mol Sci 2022; 23:2143. [PMID: 35216262 PMCID: PMC8879278 DOI: 10.3390/ijms23042143] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/17/2022] [Revised: 02/09/2022] [Accepted: 02/11/2022] [Indexed: 02/04/2023] Open
Abstract
Copy number variations (CNVs) are the predominant class of structural genomic variations involved in the processes of evolutionary adaptation, genomic disorders, and disease progression. Compared with single-nucleotide variants, there have been challenges associated with the detection of CNVs owing to their diverse sizes. However, the field has seen significant progress in the past 20–30 years. This has been made possible due to the rapid development of molecular diagnostic methods which ensure a more detailed view of the genome structure, further complemented by recent advances in computational methods. Here, we review the major approaches that have been used to routinely detect CNVs, ranging from cytogenetics to the latest sequencing technologies, and then cover their specific features.
Affiliation(s)
- Veronika Gordeeva (corresponding author)
  - Center for Precision Genome Editing and Genetic Technologies for Biomedicine, Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia
  - Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia
  - Moscow Institute of Physics and Technology, National Research University, Moscow Oblast, 141701 Moscow, Russia
- Elena Sharova
  - Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia
- Georgij Arapidi
  - Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia
  - Moscow Institute of Physics and Technology, National Research University, Moscow Oblast, 141701 Moscow, Russia
  - Shemyakin–Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, 117997 Moscow, Russia
19. Lee WP, Zhu Q, Yang X, Liu S, Cerveira E, Ryan M, Mil-Homens A, Bellfy L, Ye K, Lee C, Zhang C. JAX-CNV: A Whole-genome Sequencing-based Algorithm for Copy Number Detection at Clinical Grade Level. Genomics Proteomics Bioinformatics 2022; 20:1197-1206. [PMID: 35085778 DOI: 10.1016/j.gpb.2021.06.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/20/2021] [Revised: 04/30/2021] [Accepted: 09/06/2021] [Indexed: 10/19/2022]
Abstract
We aimed to develop a whole-genome sequencing (WGS)-based copy number variant (CNV) calling algorithm with the potential to replace the chromosomal microarray assay (CMA) for clinical diagnosis. JAX-CNV was thus developed for CNV detection from WGS data. The performance of this CNV calling algorithm was evaluated in a blinded manner on 31 samples and compared to the 112 CNVs reported by clinically validated CMAs for these 31 samples. The result showed that JAX-CNV recalled 100% of these CNVs. In addition, JAX-CNV identified an average of 30 CNVs per individual, an approximately seven-fold increase compared to the calls of clinically validated CMAs. Experimental validation of 24 randomly selected CNVs showed one false positive, i.e., a false discovery rate (FDR) of 4.17%. A robustness test on lower-coverage data revealed 100% sensitivity for CNVs larger than 300 kb (the current threshold of the College of American Pathologists) down to 10× coverage. For CNVs larger than 50 kb, sensitivities were 100% for coverages deeper than 20×, 97% for 15×, and 95% for 10×. We developed a WGS-based CNV pipeline, including this newly developed CNV caller JAX-CNV, and found it capable of detecting CMA-reported CNVs at a sensitivity of 100% with an FDR of about 4%. We propose that JAX-CNV could be further examined in a multi-institutional study to justify the transition of first-tier genetic testing from CMAs to WGS. JAX-CNV is available at https://github.com/TheJacksonLaboratory/JAX-CNV.
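The reported false discovery rate follows directly from the validation counts given in the abstract (24 validated calls, of which one was a false positive and, by implication, 23 were true positives):

```latex
\mathrm{FDR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TP}} = \frac{1}{1 + 23} = \frac{1}{24} \approx 0.0417 = 4.17\%
```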
Affiliation(s)
- Wan-Ping Lee
  - Precision Medicine Center, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an 710061, China
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
  - School of Cyber Science and Engineering, Xi'an Jiaotong University, Xi'an 710049, China
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Qihui Zhu
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Xiaofei Yang
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
  - School of Computer Science and Technology, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- Silvia Liu
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Eliza Cerveira
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Mallory Ryan
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Adam Mil-Homens
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Lauren Bellfy
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Kai Ye
  - Precision Medicine Center, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an 710061, China
  - MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- Charles Lee
  - Precision Medicine Center, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an 710061, China
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
  - Department of Life Sciences, Ewha Womans University, Seoul 03760, South Korea
- Chengsheng Zhang
  - Precision Medicine Center, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an 710061, China
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
20. Xie K, Liu K, Alvi HAK, Chen Y, Wang S, Yuan X. KNNCNV: A K-Nearest Neighbor Based Method for Detection of Copy Number Variations Using NGS Data. Front Cell Dev Biol 2022; 9:796249. [PMID: 35004691 PMCID: PMC8728060 DOI: 10.3389/fcell.2021.796249] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 10/16/2021] [Accepted: 11/23/2021] [Indexed: 11/19/2022] Open
Abstract
Copy number variation (CNV) is a well-known type of genomic mutation that is associated with the development of human cancers. Detection of CNVs in the human genome is a crucial step in the pipeline from mutation analysis to cancer diagnosis and treatment. Next-generation sequencing (NGS) data provides an unprecedented opportunity for CNV detection at base-level resolution, and many methods have been developed for CNV detection using NGS data. However, due to the intrinsic complexity of CNV structures and of NGS data itself, accurate detection of CNVs still faces many challenges. In this paper, we present an alternative method, called KNNCNV (K-Nearest Neighbor based CNV detection), for the detection of CNVs using NGS data. Compared to current methods, KNNCNV has several distinctive features: 1) it assigns an outlier score to each genome segment based solely on its first k nearest-neighbor distances, which is not only easy to extend to other data types but also improves the power of discovering CNVs, especially the local CNVs that are likely to be masked by their surrounding regions; 2) it employs the variational Bayesian Gaussian mixture model (VBGMM) to transform these scores into a series of binary labels without a user-defined threshold. To evaluate the performance of KNNCNV, we conduct both simulation and real sequencing data experiments and make comparisons with peer methods. The experimental results show that KNNCNV achieves better performance than the other methods in terms of F1-score.
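The first feature, an outlier score built only from the k nearest-neighbor distances, can be sketched on a one-dimensional read depth profile. This is a hedged illustration rather than the KNNCNV implementation: the function name `knn_outlier_scores`, the use of the mean of the k distances as the score, and the toy read depths are assumptions, and the VBGMM labelling step is not shown.

```python
def knn_outlier_scores(read_depths, k=3):
    """Score each segment by the mean distance to its k nearest neighbours (1-D)."""
    if k >= len(read_depths):
        raise ValueError("k must be smaller than the number of segments")
    scores = []
    for i, x in enumerate(read_depths):
        # Distances from segment i to every other segment, smallest first.
        dists = sorted(abs(x - y) for j, y in enumerate(read_depths) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical per-segment read depths: one gain (60) and one loss (12).
read_depths = [30, 31, 29, 30, 32, 60, 30, 28, 31, 12]
scores = knn_outlier_scores(read_depths, k=3)
ranked = sorted(range(len(scores)), key=lambda i: scores[i])
print(ranked[-2:])  # the two highest-scoring segments
```

Segments whose read depth resembles that of at least k others receive a score near zero, while isolated segments score high regardless of the global distribution.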
Affiliation(s)
- Kun Xie
  - School of Computer Science and Technology, Xidian University, Xi'an, China
  - Hangzhou Institute of Technology, Xidian University, Hangzhou, China
- Kang Liu
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Haque A K Alvi
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Yuehui Chen
  - Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan, China
- Shuzhen Wang
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Xiguo Yuan
  - School of Computer Science and Technology, Xidian University, Xi'an, China
  - Hangzhou Institute of Technology, Xidian University, Hangzhou, China
21. Sinha R, Pal RK, De RK. GenSeg and MR-GenSeg: A Novel Segmentation Algorithm and its Parallel MapReduce Based Approach for Identifying Genomic Regions With Copy Number Variations. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:443-454. [PMID: 32750860 DOI: 10.1109/tcbb.2020.3000661] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/11/2023]
Abstract
Identifying intragenic as well as intergenic DNA sequences that carry structural alterations is an important research area, since such alterations may be the root cause of many neurological and autoimmune diseases, as well as cancer. Working with whole genome NGS data has provided new insight in this regard, but has led to an explosion of data that is growing exponentially. Hence, the challenge lies in efficiently storing and processing this big data. In this study, we have developed a novel segmentation algorithm, called GenSeg, and its parallel MapReduce based algorithm, called MR-GenSeg, for detecting copy number variations. In order to annotate CNVs (variants), segments formed by GenSeg/MR-GenSeg have been represented in a novel way using a binary tree, where each node is a CNV event. GenSeg considers the position-specific data of the whole genome DNA sequence, so that precise identification of breakpoints is possible. GenSeg/MR-GenSeg has been compared with twelve popular CNV detection algorithms, where it has outperformed the others in terms of sensitivity and has achieved a good F-score value. MR-GenSeg has excelled in terms of SpeedUp when compared with these algorithms. The effect of CNVs on immunoglobulin (IG) genes has also been analysed in this study. Availability: The source codes are available at https://github.com/rituparna-sinha/MapReduce-GENSEG.
22. Huang T, Li J, Jia B, Sang H. CNV-MEANN: A Neural Network and Mind Evolutionary Algorithm-Based Detection of Copy Number Variations From Next-Generation Sequencing Data. Front Genet 2021; 12:700874. [PMID: 34484298 PMCID: PMC8415314 DOI: 10.3389/fgene.2021.700874] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/27/2021] [Accepted: 07/19/2021] [Indexed: 11/20/2022] Open
Abstract
Copy number variation (CNV) is defined as repetitions or deletions of genomic segments of 1 Kb to 5 Mb, and is a major trigger for human disease. The high-throughput and low-cost characteristics of next-generation sequencing (NGS) technology make it possible to detect CNVs across the whole genome, and also greatly improve the clinical practicability of NGS testing. However, current methods for the detection of CNVs are easily affected by sequencing and mapping errors and by the uneven distribution of reads. In this paper, we propose an improved approach, CNV-MEANN, for the detection of CNVs, which changes the structure of the neural network used in the MFCNV method. This method has three differences relative to the MFCNV method: (1) it utilizes a new feature, mapping quality, to replace two features in MFCNV, (2) it considers the influence of the loss categories of CNV on disease prediction, and refines the output structure, and (3) it uses a mind evolutionary algorithm to optimize the backpropagation neural network model, and calculates individual scores for each genome bin to predict CNVs. Using both simulated and real datasets, we tested the performance of CNV-MEANN and compared it with that of seven widely used CNV detection methods. Experimental results demonstrated that the CNV-MEANN approach outperformed other methods with respect to sensitivity, precision, and F1-score. The proposed method was able to detect many CNVs that other approaches could not, and it reduced the boundary bias. CNV-MEANN is expected to be an effective method for the analysis of changes in CNVs in the genome.
Affiliation(s)
- Tihao Huang
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, China
- Junqing Li
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, China
- Baoxian Jia
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, China
- Hongyan Sang
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, China
23. Yuan X, Li J, Bai J, Xi J. A Local Outlier Factor-Based Detection of Copy Number Variations From NGS Data. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:1811-1820. [PMID: 31880558 DOI: 10.1109/tcbb.2019.2961886] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Indexed: 06/10/2023]
Abstract
Copy number variation (CNV) is a major type of genomic structural variation that plays an important role in human disorders. Next generation sequencing (NGS) has fueled the advancement in algorithm design to detect CNVs at base-pair resolution. However, accurate detection of CNVs of low amplitudes remains a challenging task. This paper proposes a new computational method, CNV-LOF, to identify CNVs of full-range amplitudes from NGS data. CNV-LOF is distinctly different from traditional methods, which mainly consider aberrations from a global perspective and rely on some assumed distribution of NGS read depths. In contrast, CNV-LOF takes a local view of the read depths and assigns an outlier factor to each genome segment. With the outlier factor profile, CNV-LOF uses a boxplot procedure to declare CNVs without relying on any distribution assumptions. Simulation experiments indicate that CNV-LOF outperforms five existing methods with respect to F1-measure, sensitivity, and precision. CNV-LOF is further validated on real sequencing samples, yielding highly consistent results with peer methods. CNV-LOF is able to detect CNVs of low and moderate amplitudes where the other existing methods fail, and it is expected to become a routine approach for the discovery of novel CNVs in whole-genome sequencing data.
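The "local view" can be made concrete with a plain implementation of the standard local outlier factor on one-dimensional read depths. This is a simplified sketch and not the CNV-LOF code: the function name `local_outlier_factors`, the choice k=3, the handling of distance ties (the neighbourhood is cut at exactly k points), and the toy data are assumptions.

```python
def local_outlier_factors(values, k=3):
    """Plain 1-D local outlier factor; ties beyond the k-th neighbour are ignored."""
    n = len(values)
    neighbours, k_dist = [], []
    for i in range(n):
        order = sorted((abs(values[i] - values[j]), j) for j in range(n) if j != i)
        neighbours.append([j for _, j in order[:k]])  # indices of the k nearest points
        k_dist.append(order[k - 1][0])                # distance to the k-th neighbour

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], abs(values[i] - values[j])) for j in neighbours[i]]
        lrd.append(1.0 / (sum(reach) / k + 1e-10))    # small constant avoids division by zero

    # LOF: average density of the neighbours relative to the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical per-segment read depths: one gain (60) and one loss (12).
read_depths = [30, 31, 29, 30, 32, 60, 30, 28, 31, 12]
lof = local_outlier_factors(read_depths, k=3)
print([round(v, 2) for v in lof])
```

Values close to 1 mean a segment is about as densely surrounded as its neighbours; values well above 1 mark segments in a much sparser region, which is what makes them CNV candidates.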
24. Zhao HY, Li Q, Tian Y, Chen YH, Alvi HAK, Yuan XG. CIRCNV: Detection of CNVs Based on a Circular Profile of Read Depth from Sequencing Data. Biology (Basel) 2021; 10:584. [PMID: 34202028 PMCID: PMC8301091 DOI: 10.3390/biology10070584] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/29/2021] [Revised: 06/10/2021] [Accepted: 06/21/2021] [Indexed: 12/29/2022]
Abstract
Simple Summary: In this study, we propose a copy number variation (CNV) detection method called CIRCNV, which is based on a circular profile of the read depth from sequencing data. The proposed method is an extended version of our previously developed method CNV-LOF. CIRCNV differs from CNV-LOF in two new features: (1) it transfers the read depth profile from a line shape to a circular shape via a polar coordinate transformation to generate a meaningful two-dimensional dataset for CNV analysis and promote fairness between the ends and middle part of the genome, and (2) it performs two rounds of CNV declaration via estimating tumor purity and recovering the true circular RD profile. We test and evaluate the performance of CIRCNV via simulation studies and applications to real sequencing tumor samples. The experimental results show that CIRCNV outperforms peer methods with respect to sensitivity, precision, and the F1-score. The experiments indicate that the proposed method is a reliable and effective tool in the field of variation analysis of tumor genomes.
Abstract: Copy number variation (CNV) is a common type of structural variation in the human genome. Accurate detection of CNVs from tumor genomes can provide crucial information for the study of tumor genesis and cancer precision diagnosis. However, the contamination of normal genomes in tumor genomes and the crude profiles of the read depth make such a task difficult. In this paper, we propose an alternative approach, called CIRCNV, for the detection of CNVs from sequencing data. CIRCNV is an extension of our previously developed method CNV-LOF, which uses local outlier factors to predict CNVs. Comparatively, CIRCNV can be performed on individual tumor samples and has the following two new features: (1) it transfers the read depth profile from a line shape to a circular shape via a polar coordinate transformation, in order to improve the efficiency of the read depth (RD) profile for the detection of CNVs; and (2) it performs a second round of CNV declaration based on the true circular RD profile, which is recovered by estimating tumor purity. We test and validate the performance of CIRCNV based on simulation and real sequencing data and perform comparisons with several peer methods. The results demonstrate that CIRCNV can obtain superior performance in terms of sensitivity and precision. We expect that our proposed method will be a supplement to existing methods and become a routine tool in the field of variation analysis of tumor genomes.
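One way to picture the line-to-circle transfer is to place bin i at angle θ_i = 2πi/n with radius equal to its read depth, so that the two ends of the profile become neighbours on the circle. The sketch below assumes exactly this mapping; the abstract does not specify the angle assignment or any scaling, so the function name `circular_profile` and the formula are illustrative assumptions rather than the CIRCNV definition.

```python
import math

def circular_profile(read_depths):
    """Map bin i with depth r_i to the 2-D point (r_i*cos(t_i), r_i*sin(t_i)), t_i = 2*pi*i/n."""
    n = len(read_depths)
    points = []
    for i, r in enumerate(read_depths):
        theta = 2.0 * math.pi * i / n   # assumed: bins spread evenly around the circle
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

# Hypothetical profile of four bins; the third bin has doubled depth.
read_depths = [30.0, 30.0, 60.0, 30.0]
points = circular_profile(read_depths)
for (x, y), r in zip(points, read_depths):
    print(round(x, 6), round(y, 6), r)
```

Each bin keeps its read depth as the distance from the origin, so a two-dimensional outlier detector applied to the points still sees gains and losses as radial deviations.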
Affiliation(s)
- Hai-Yong Zhao
  - School of Computer Science and Technology, Liaocheng University, Liaocheng 252000, China
- Qi Li
  - School of Computer Science and Technology, Xidian University, Xi’an 710071, China
- Ye Tian
  - School of Computer Science and Technology, Xidian University, Xi’an 710071, China
- Yue-Hui Chen
  - Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Ji’nan 250022, China
- Haque A. K. Alvi
  - School of Computer Science and Technology, Xidian University, Xi’an 710071, China
- Xi-Guo Yuan (corresponding author)
  - School of Computer Science and Technology, Xidian University, Xi’an 710071, China
25. Guo Y, Wang S, Yuan X. HBOS-CNV: A New Approach to Detect Copy Number Variations From Next-Generation Sequencing Data. Front Genet 2021; 12:642473. [PMID: 34163521 PMCID: PMC8215577 DOI: 10.3389/fgene.2021.642473] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/07/2021] [Accepted: 05/05/2021] [Indexed: 11/13/2022] Open
Abstract
Copy number variation (CNV) is a genomic mutation that plays an important role in tumor evolution and tumorigenesis. Accurate detection of CNVs from next-generation sequencing (NGS) data is still a challenging task due to artifacts such as uneven mapped reads and unbalanced amplitudes of gains and losses. This study proposes a new approach called HBOS-CNV to detect CNVs from NGS data. The central point of HBOS-CNV is that it uses a new statistic, the histogram-based outlier score (HBOS), to evaluate the fluctuation of genome bins and determine those with changed copy numbers. In comparison with existing statistics for the evaluation of CNVs, HBOS is a non-linearly transformed value of the observed read depth (RD) of each genome bin, which has the potential to relieve the effects resulting from the above artifacts. In the calculation of HBOS values, a dynamic width histogram is utilized to depict the density of bins on the genome being analyzed, which can reduce the effects of noise partially contributed by mapping and sequencing errors. Evaluating genome bins with this new statistic gives CNVs of less extreme significance a high probability of detection. We evaluated this method using a large number of simulation datasets and compared it with four existing methods (CNVnator, CNV-IFTV, CNV-LOF, and iCopyDav). The results demonstrated that our proposed method outperforms the others in terms of sensitivity, precision, and F1-measure. Furthermore, we applied the proposed method to a set of real sequencing samples from the 1000 Genomes Project and determined a number of CNVs with biological meanings. Thus, the proposed method can be regarded as a routine approach in the field of genome mutation analysis for cancer samples.
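The idea of a histogram-based outlier score can be sketched for a single feature (read depth): bins of the histogram that hold few genome bins get a high score, log(1/height), with heights normalised so the fullest bin is 1. Note the swap made here for brevity: the abstract describes a dynamic width histogram, whereas this sketch uses equal-width bins. The function name `hbos_scores`, the bin count, and the toy data are assumptions.

```python
import math

def hbos_scores(values, n_bins=4):
    """Histogram-based outlier score for 1-D data using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0          # fall back to 1.0 if all values are equal

    def bin_of(v):
        return min(int((v - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for v in values:
        counts[bin_of(v)] += 1
    peak = max(counts)
    # Height of a bin is counts/peak, so the score log(1/height) = log(peak/counts).
    return [math.log(peak / counts[bin_of(v)]) for v in values]

# Hypothetical per-bin read depths: one gain (60) and one loss (12).
read_depths = [30, 31, 29, 30, 32, 60, 30, 28, 31, 12]
scores = hbos_scores(read_depths)
print([round(s, 3) for s in scores])
```

The logarithm is what makes the score a non-linear transform of the read depth: gains and losses of different raw amplitude can receive the same score if they are equally rare.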
Affiliation(s)
- Yang Guo
  - The School of Computer Science and Technology, Xidian University, Xi'an, China
- Shuzhen Wang
  - The School of Computer Science and Technology, Xidian University, Xi'an, China
- Xiguo Yuan
  - The School of Computer Science and Technology, Xidian University, Xi'an, China
26. Liu Y, Ye X, Zhan X, Yu CY, Zhang J, Huang K. TPQCI: A topology potential-based method to quantify functional influence of copy number variations. Methods 2021; 192:46-56. [PMID: 33894380 DOI: 10.1016/j.ymeth.2021.04.015] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/23/2020] [Revised: 04/18/2021] [Accepted: 04/19/2021] [Indexed: 12/21/2022] Open
Abstract
Copy number variation (CNV) is a major type of chromosomal structural variation that plays important roles in many diseases, including cancers. Due to genome instability, a large number of CNV events can be detected in diseases such as cancer. Therefore, it is important to identify the functionally important CNVs in diseases, which still poses a challenge in genomics. One of the critical steps in solving the problem is to define the influence of a CNV. In this paper, we provide a topology potential based method, TPQCI, to quantify this kind of influence by integrating statistics, gene regulatory associations, and biological function information. We used this metric to detect functionally enriched genes on genomic segments with CNV in breast cancer and multiple myeloma and discovered biological functions influenced by CNV. Our results demonstrate that, by using our proposed TPQCI metric, we can detect disease-specific genes that are influenced by CNVs. Source codes of TPQCI are provided on GitHub (https://github.com/usos/TPQCI).
Collapse
Affiliation(s)
- Yusong Liu
- Collage of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin, Heilongjiang 150001, China; Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Xiufen Ye
- Collage of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin, Heilongjiang 150001, China
| | - Xiaohui Zhan
- Indiana University School of Medicine, Indianapolis, IN 46202, USA; National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen, Guangdong 518037, China; Department of Bioinformatics, School of Basic Medicine, Chongqing Medical University, Chongqing 400016, China
| | - Christina Y Yu
- Indiana University School of Medicine, Indianapolis, IN 46202, USA; Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
| | - Jie Zhang
- Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Kun Huang
- Indiana University School of Medicine, Indianapolis, IN 46202, USA; Regenstrief Institute, Indianapolis, IN 46202, USA.
| |
|
27
|
Copy Number Variant Detection with Low-Coverage Whole-Genome Sequencing Represents a Viable Alternative to the Conventional Array-CGH. Diagnostics (Basel) 2021; 11:diagnostics11040708. [PMID: 33920867 PMCID: PMC8071346 DOI: 10.3390/diagnostics11040708] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/20/2021] [Revised: 04/09/2021] [Accepted: 04/13/2021] [Indexed: 12/13/2022] Open
Abstract
Copy number variations (CNVs) represent a type of structural variant involving alterations in the number of copies of specific regions of DNA that can either be deleted or duplicated. CNVs contribute substantially to normal population variability; however, abnormal CNVs cause numerous genetic disorders. At present, several methods for CNV detection are applied, ranging from conventional cytogenetic analysis, through microarray-based methods (aCGH), to next-generation sequencing (NGS). In this paper, we present GenomeScreen, an NGS-based CNV detection method for low-coverage, whole-genome sequencing. We determined the theoretical limits of its accuracy and obtained confirmation in an extensive in silico study and in real patient samples with known genotypes. In theory, at least 6 M uniquely mapped reads are required to detect a CNV with a length of 100 kilobases (kb) or more with high confidence (Z-score > 7). In practice, the in silico analysis required at least 8 M reads to obtain >99% accuracy (for 100 kb deviations). We compared GenomeScreen with one of the aCGH methods currently used in diagnostic laboratories, which has a mean resolution of 200 kb. GenomeScreen and aCGH both detected 59 deviations, while GenomeScreen additionally detected 134 other, usually smaller, variations. When compared to aCGH, the overall performance of the proposed GenomeScreen tool is comparable or superior in terms of accuracy, turn-around time, and cost-effectiveness, thus providing reasonable benefits, particularly in a prenatal diagnosis setting.
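Illustrative sketch (not taken from the paper): the abstract above scores deviations with a Z-score on read counts. The Python below shows one generic way such a score could be computed per bin, assuming fixed-width bins, normalization to total read count, and a small panel of reference samples. The function names, the bin layout, and the reference-panel design are assumptions made for illustration and do not describe GenomeScreen's actual implementation.

```python
from statistics import mean, stdev

def bin_counts(read_starts, bin_size, n_bins):
    """Count reads per fixed-width bin from 0-based read start positions."""
    counts = [0] * n_bins
    for pos in read_starts:
        idx = pos // bin_size
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

def normalize(counts):
    """Scale counts to fractions of the total, removing depth differences."""
    total = sum(counts)
    return [c / total for c in counts]

def z_scores(sample, references):
    """Per-bin Z-score of a normalized sample against normalized references."""
    scores = []
    for i, value in enumerate(sample):
        ref = [r[i] for r in references]
        mu, sigma = mean(ref), stdev(ref)
        # A bin with zero reference spread cannot be scored; report 0.0.
        scores.append((value - mu) / sigma if sigma > 0 else 0.0)
    return scores
```

A bin whose absolute Z-score exceeds a chosen cutoff (the abstract mentions Z-score > 7 as high confidence) would be flagged as a candidate deviation. With very few bins, a large loss in one bin inflates the normalized values of the others, so a realistic setup would use many bins.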
|
28
|
Yuan X, Yu J, Xi J, Yang L, Shang J, Li Z, Duan J. CNV_IFTV: An Isolation Forest and Total Variation-Based Detection of CNVs from Short-Read Sequencing Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:539-549. [PMID: 31180897 DOI: 10.1109/tcbb.2019.2920889] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Indexed: 06/09/2023]
Abstract
Accurate detection of copy number variations (CNVs) from short-read sequencing data is challenging due to the uneven distribution of reads and the unbalanced amplitudes of gains and losses. The direct use of read depths to measure CNVs tends to limit performance. Thus, robust computational approaches equipped with appropriate statistics are required to detect CNV regions and boundaries. This study proposes a new method called CNV_IFTV to address this need. CNV_IFTV assigns an anomaly score to each genome bin through a collection of isolation trees. The trees are trained with the isolation forest algorithm by subsampling the measured read depths. With the anomaly scores, CNV_IFTV uses a total variation model to smooth adjacent bins, leading to a denoised score profile. Finally, a statistical model is established to test the denoised scores for calling CNVs. CNV_IFTV is tested on both simulated and real data in comparison to several peer methods. The results indicate that the proposed method outperforms the peer methods. CNV_IFTV is a reliable tool for detecting CNVs from short-read sequencing data, even at low coverage and low tumor purity. The detection results on tumor samples can help to evaluate known cancer genes and to predict target drugs for disease diagnosis.
|
29
|
Abstract
Gains and losses of large segments of genomic DNA, known as copy number variants (CNVs), have recently gained considerable interest in clinical diagnostics, as particular forms may lead to inherited genetic diseases. In recent decades, researchers have developed a wide variety of cytogenetic and molecular methods with different detection capabilities to detect clinically relevant CNVs. In this review, we summarize methodological progress from conventional approaches to current state-of-the-art techniques capable of detecting CNVs from a few bases up to several megabases. Although the recent rapid progress of sequencing methods has enabled precise detection of CNVs, determining their functional effect on cellular and whole-body physiology remains a challenge. Here, we provide a comprehensive list of databases and bioinformatics tools that may serve as useful assets for researchers, laboratory diagnosticians, and clinical geneticists facing the challenge of CNV detection and interpretation.
|
30
|
Xie K, Tian Y, Yuan X. A Density Peak-Based Method to Detect Copy Number Variations From Next-Generation Sequencing Data. Front Genet 2021; 11:632311. [PMID: 33519925 PMCID: PMC7838601 DOI: 10.3389/fgene.2020.632311] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/23/2020] [Accepted: 12/21/2020] [Indexed: 11/29/2022] Open
Abstract
Copy number variation (CNV) is a common type of structural variation in the human genome and is biologically relevant to complex human diseases. Detection of CNVs is an important step for a systematic analysis of CNVs in medical research on complex diseases. The recent development of next-generation sequencing (NGS) platforms provides unprecedented opportunities for the detection of CNVs at base-level resolution. However, due to the intrinsic characteristics of NGS data, accurate detection of CNVs is still a challenging task. In this article, we propose a new density peak-based method, called dpCNV, for the detection of CNVs from NGS data. The dpCNV algorithm is based on the density peak clustering algorithm. It extracts two features, i.e., local density and minimum distance, from the sequencing read depth (RD) profile and generates a two-dimensional dataset. Based on the generated data, a two-dimensional null distribution is constructed to test the significance of each genome bin, and the significant genome bins are then declared as CNVs. We test the performance of the dpCNV method on a number of simulated datasets and compare it with several existing methods. The experimental results demonstrate that our proposed method outperforms the others in terms of sensitivity and F1-score. We further apply it to a set of real sequencing samples, and the results demonstrate the validity of dpCNV. Therefore, we expect that dpCNV can be used as a supplement to existing methods and may become a routine tool in the field of genome mutation analysis.
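Illustrative sketch (not taken from the paper): the two features named above, local density and minimum distance, come from density peak clustering. The Python below computes the textbook versions of both features for a one-dimensional read depth profile. The cutoff parameter and the use of absolute depth difference as the distance are assumptions for illustration, and dpCNV's exact definitions and its significance test may differ.

```python
def density_peak_features(depths, cutoff):
    """Return (local_density, min_distance) lists for a 1-D read depth profile.

    local_density[i]: number of other bins whose depth is within `cutoff` of bin i.
    min_distance[i]:  distance to the nearest bin with a higher local density;
                      for the densest bins, the largest distance to any bin.
    """
    n = len(depths)
    dist = [[abs(depths[i] - depths[j]) for j in range(n)] for i in range(n)]
    density = [
        sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
        for i in range(n)
    ]
    min_distance = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if density[j] > density[i]]
        min_distance.append(min(higher) if higher else max(dist[i]))
    return density, min_distance
```

In this two-dimensional feature space, a bin with low local density and a large minimum distance lies far from the bulk of normal-depth bins, which is the kind of bin a significance test would be expected to flag.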
Affiliation(s)
- Kun Xie
- The School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Ye Tian
- The School of Computer Science and Technology, Xidian University, Xi'an, China; Xi'an Key Laboratory of Computational Bioinformatics, The School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Xiguo Yuan
- The School of Computer Science and Technology, Xidian University, Xi'an, China; Xi'an Key Laboratory of Computational Bioinformatics, The School of Computer Science and Technology, Xidian University, Xi'an, China
| |
|
31
|
Liu G, Zhang J, Yuan X, Wei C. RKDOSCNV: A Local Kernel Density-Based Approach to the Detection of Copy Number Variations by Using Next-Generation Sequencing Data. Front Genet 2020; 11:569227. [PMID: 33329705 PMCID: PMC7673372 DOI: 10.3389/fgene.2020.569227] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 06/03/2020] [Accepted: 09/04/2020] [Indexed: 12/04/2022] Open
Abstract
Copy number variations (CNVs) are significant causes of many human cancers and genetic diseases. The detection of CNVs has become a common method by which to analyze human diseases using next-generation sequencing (NGS) data. However, effective detection of insignificant CNVs is still a challenging task. In this study, we propose a new detection method, RKDOSCNV, to meet this need. RKDOSCNV uses a kernel density estimation method to evaluate the local kernel density distribution of each read depth segment (RDS) based on an expanded nearest neighbor data set (the k-nearest neighbors, reverse nearest neighbors, and shared nearest neighbors of each RDS), and assigns a relative kernel density outlier score (RKDOS) to each RDS. According to the RKDOS profile, RKDOSCNV predicts the candidate CNVs by choosing a reasonable threshold, and then uses a split-read approach to correct the boundaries of the candidate CNVs. The performance of RKDOSCNV is assessed by comparing it with several currently popular methods via experiments with simulated and real data at different tumor purity levels. The experimental results verify that the performance of RKDOSCNV is superior to that of several other methods. In summary, RKDOSCNV is a simple and effective method for the detection of CNVs from whole genome sequencing (WGS) data, especially for samples with low tumor purity.
Affiliation(s)
- Guojun Liu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Junying Zhang
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Xiguo Yuan
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Chao Wei
- School of Computer Science and Technology, Xidian University, Xi'an, China
| |
|
32
|
Chanwigoon S, Piwluang S, Wichadakul D. inCNV: An Integrated Analysis Tool for Copy Number Variation on Whole Exome Sequencing. Evol Bioinform Online 2020; 16:1176934320956577. [PMID: 33029071 PMCID: PMC7520931 DOI: 10.1177/1176934320956577] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/24/2020] [Accepted: 08/13/2020] [Indexed: 12/13/2022] Open
Abstract
The detection of copy number variations (CNVs) from whole-exome sequencing (WES) represents a cost-effective technique for the study of genetic variants. This approach, however, suffers from high false-positive rates due to biases from exome sequencing capture kits and GC content. Although plenty of CNV detection tools have been developed, they do not perform well with all types of CNVs. In addition, most tools lack features for genetic annotation, CNV visualization, and flexible installation, requiring users to put much effort into CNV interpretation. Here, we present "inCNV," a web-based application that can accept multiple CNV-tool results, then integrate and prioritize them with user-friendly interfaces. This application helps users analyze the importance of called CNVs by generating CNV annotations from Ensembl, Database of Genomic Variants (DGV), ClinVar, and Online Mendelian Inheritance in Man (OMIM). Moreover, users can select and export CNVs of interest, including their flanking sequences, for primer design and experimental verification. We demonstrated how inCNV could help users filter and narrow down the called CNVs to a potentially novel CNV, a common CNV within a group of samples of the same disease, or a de novo CNV of a sample within the same family. In addition, we have provided inCNV as a Docker image for ease of installation (https://github.com/saowwapark/inCNV).
Affiliation(s)
- Saowwapark Chanwigoon
- Software Engineering Program, Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand
| | - Sakkayaphab Piwluang
- Software Engineering Program, Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand
| | - Duangdao Wichadakul
- Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand
| |
|
33
|
Non-invasive prenatal testing (NIPT) by low coverage genomic sequencing: Detection limits of screened chromosomal microdeletions. PLoS One 2020; 15:e0238245. [PMID: 32845907 PMCID: PMC7449492 DOI: 10.1371/journal.pone.0238245] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 03/17/2020] [Accepted: 08/12/2020] [Indexed: 12/21/2022] Open
Abstract
This study examined the detection limits of chromosomal microaberrations in non-invasive prenatal testing, targeting five microdeletion syndromes: DiGeorge, Prader-Willi/Angelman, 1p36, Cri-Du-Chat, and Wolf-Hirschhorn. We used known cases of pathogenic deletions from the ISCA database to define the regions critical for the target syndromes. Our approach to detecting microdeletions from whole-genome sequencing data is based on sample normalization and read counting for individual bins. We performed both an in silico study using artificially created data sets and a laboratory test on mixed DNA samples with known microdeletions to assess the sensitivity of prediction for varying fetal fractions, deletion lengths, and sequencing read counts. The in silico study showed a sensitivity of 79.3% for a 10% fetal fraction with a 20M read count, which increased to 98.4% when we searched only for deletions longer than 3 Mb. The test on laboratory-prepared mixed samples agreed with the in silico results: we correctly detected 24 out of 29 control samples. Our results suggest that it is possible to incorporate microaberration detection into basic NIPT as part of the offered screening/diagnostics procedure; however, accuracy and reliability depend on several specific factors.
|
34
|
Zhao H, Huang T, Li J, Liu G, Yuan X. MFCNV: A New Method to Detect Copy Number Variations From Next-Generation Sequencing Data. Front Genet 2020; 11:434. [PMID: 32499814 PMCID: PMC7243272 DOI: 10.3389/fgene.2020.00434] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/19/2020] [Accepted: 04/08/2020] [Indexed: 11/13/2022] Open
Abstract
Copy number variation (CNV) is a very important phenomenon in tumor genomes and plays a significant role in tumorigenesis. Accurate detection of CNVs has become a routine and necessary procedure for in-depth investigation of tumor cells and diagnosis of tumor patients. Next-generation sequencing (NGS) has provided a wealth of data for the detection of CNVs at base-pair resolution. However, this task is usually influenced by a number of factors, including GC-content bias, sequencing errors, and correlations among adjacent positions within CNVs. Although many existing methods have dealt with some of these artifacts by designing their own strategies, there is still a lack of comprehensive consideration of all the factors. In this paper, we propose a new method, MFCNV, for accurate detection of CNVs from NGS data. Compared with existing methods, the characteristics of the proposed method include the following: (1) it fully considers the intrinsic correlations among adjacent positions in the genome to be analyzed, (2) it calculates read depth, GC-content bias, base quality, and correlation value for each genome bin and combines them as multiple features for the evaluation of genome bins, and (3) it addresses the joint effect among the factors by training a neural network for the prediction of CNVs. We test the performance of the MFCNV method using simulated and real sequencing data and make comparisons with several peer methods. The results demonstrate that our method is superior to other methods in terms of sensitivity, precision, and F1-score and can detect many CNVs that other methods have not discovered. MFCNV is expected to be a complementary tool in the analysis of mutations in tumor genomes and can be extended to the analysis of single-cell sequencing data.
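Illustrative sketch (not taken from the paper): the abstract above combines several per-bin quantities into a feature vector. The Python below computes two of them, mean read depth and GC fraction, for fixed-width bins. The function name, the input layout (a reference sequence string plus a per-base depth list), and the choice to drop a trailing partial bin are assumptions for illustration. Base quality, the correlation value, and the neural network itself are not sketched here.

```python
def bin_features(sequence, depth, bin_size):
    """Return one (mean_depth, gc_fraction) tuple per complete bin.

    sequence: reference bases as a string.
    depth:    per-base read depth, same length as `sequence`.
    A trailing partial bin is dropped.
    """
    features = []
    for start in range(0, len(sequence) - bin_size + 1, bin_size):
        window = sequence[start:start + bin_size].upper()
        gc_fraction = sum(1 for base in window if base in "GC") / bin_size
        mean_depth = sum(depth[start:start + bin_size]) / bin_size
        features.append((mean_depth, gc_fraction))
    return features
```

Each tuple could then be extended with further features and passed to a classifier, which is the role the abstract assigns to the trained neural network.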
Affiliation(s)
- Haiyong Zhao
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China; The School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Tihao Huang
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| | - Junqing Li
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| | - Guojun Liu
- The School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Xiguo Yuan
- The School of Computer Science and Technology, Xidian University, Xi'an, China
| |
|
35
|
Yokoyama TT, Kasahara M. Visualization tools for human structural variations identified by whole-genome sequencing. J Hum Genet 2020; 65:49-60. [PMID: 31666648 PMCID: PMC8075883 DOI: 10.1038/s10038-019-0687-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 08/05/2019] [Revised: 09/27/2019] [Accepted: 10/02/2019] [Indexed: 01/02/2023]
Abstract
Visualizing structural variations (SVs) is a critical step for finding associations between SVs and human traits or diseases. Given that there are many sequencing platforms used for SV identification and given that how best to visualize SVs together with other data, such as read alignments and annotations, depends on research goals, there are dozens of SV visualization tools designed for different research goals and sequencing platforms. Here, we provide a comprehensive survey of over 30 SV visualization tools to help users choose which tools to use. This review targets users who wish to visualize a set of SVs identified from the massively parallel sequencing reads of an individual human genome. We first categorize the ways in which SV visualization tools display SVs into ten major categories, which we denote as view modules. View modules allow readers to understand the features of each SV visualization tool quickly. Next, we introduce the features of individual SV visualization tools from several aspects, including whether SV views are integrated with annotations, whether long-read alignment is displayed, whether underlying data structures are graph-based, the type of SVs shown, whether auditing is possible, whether bird's eye view is available, sequencing platforms, and the number of samples. We hope that this review will serve as a guide for readers on the currently available SV visualization tools and lead to the development of new SV visualization tools in the near future.
Affiliation(s)
- Toshiyuki T Yokoyama
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
| | - Masahiro Kasahara
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan.
| |
|
36
|
Zelenova MA, Yurov YB, Vorsanova SG, Iourov IY. Laundering CNV data for candidate process prioritization in brain disorders. Mol Cytogenet 2019; 12:54. [PMID: 31890034 PMCID: PMC6933640 DOI: 10.1186/s13039-019-0468-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 10/14/2019] [Accepted: 12/17/2019] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND Prioritization of genomic data has become a useful tool for uncovering the phenotypic effect of genetic variations (e.g. copy number variations or CNV) and disease mechanisms. Due to the complexity, brain disorders represent a major focus of genomic research aimed at revealing pathologic significance of genomic changes leading to brain dysfunction. Here, we propose a "CNV data laundering" algorithm based on filtering and prioritizing of genomic pathways retrieved from available databases for uncovering altered molecular pathways in brain disorders. The algorithm comprises seven consecutive steps of processing individual CNV data sets. First, the data are compared to in-house and web databases to discriminate recurrent non-pathogenic variants. Second, the CNV pool is confined to the genes predominantly expressed in the brain. Third, intergenic interactions are used for filtering causative CNV. Fourth, a network of interconnected elements specific for an individual genome variation set is created. Fifth, ontologic data (pathways/functions) are attributed to clusters of network elements. Sixth, the pathways are prioritized according to the significance of elements affected by CNV. Seventh, prioritized pathways are clustered according to the ontologies. RESULTS The algorithm was applied to 191 CNV data sets obtained from children with brain disorders (intellectual disability and autism spectrum disorders) by SNP array molecular karyotyping. "CNV data laundering" has identified 13 pathway clusters (39 processes/475 genes) implicated in the phenotypic manifestations. CONCLUSIONS Elucidating altered molecular pathways in brain disorders, the algorithm may be used for uncovering disease mechanisms and genotype-phenotype correlations. These opportunities are strongly required for developing therapeutic strategies in devastating neuropsychiatric diseases.
Affiliation(s)
- Maria A. Zelenova
- Mental Health Research Center, Russia Moscow, 115522
- Academician Yu.E. Veltishchev Research Clinical Institute of Pediatrics, N.I. Pirogov Russian National Research Medical University, Ministry of Health of the Russian Federation, Russia Moscow, 125635
| | - Yuri B. Yurov
- Mental Health Research Center, Russia Moscow, 115522
- Academician Yu.E. Veltishchev Research Clinical Institute of Pediatrics, N.I. Pirogov Russian National Research Medical University, Ministry of Health of the Russian Federation, Russia Moscow, 125635
| | - Svetlana G. Vorsanova
- Mental Health Research Center, Russia Moscow, 115522
- Academician Yu.E. Veltishchev Research Clinical Institute of Pediatrics, N.I. Pirogov Russian National Research Medical University, Ministry of Health of the Russian Federation, Russia Moscow, 125635
| | - Ivan Y. Iourov
- Mental Health Research Center, Russia Moscow, 115522
- Academician Yu.E. Veltishchev Research Clinical Institute of Pediatrics, N.I. Pirogov Russian National Research Medical University, Ministry of Health of the Russian Federation, Russia Moscow, 125635
| |
|
37
|
Dharanipragada P, Parekh N. Genome-wide characterization of copy number variations in diffuse large B-cell lymphoma with implications in targeted therapy. Precision Clinical Medicine 2019; 2:246-258. [PMID: 35693879 PMCID: PMC8985800 DOI: 10.1093/pcmedi/pbz024] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/17/2019] [Revised: 11/12/2019] [Accepted: 11/17/2019] [Indexed: 12/12/2022] Open
Abstract
Diffuse large B-cell lymphoma (DLBCL) is an aggressive form of haematological malignancy, with relapsed/refractory disease in ~40% of cases. It mostly develops due to the accumulation of various genetic and epigenetic variations that contribute to its aggressiveness. Though large-scale structural alterations have been reported in DLBCL, their functional role in pathogenesis and as potential targets for therapy is not yet well understood. In this study we performed detection and analysis of copy number variations (CNVs) in 11 human DLBCL cell lines (4 activated B-cell–like [ABC] and 7 germinal-centre B-cell–like [GCB]) that serve as model systems for DLBCL cancer cell biology. The significant heterogeneity observed in the CNV profiles of these cell lines and the poor prognosis associated with the ABC subtype indicate the importance of individualized screening for diagnostic and prognostic targets. Functional analysis of key cancer genes exhibiting copy alterations across the cell lines revealed activation/disruption of ten potentially targetable immuno-oncogenic pathways. Genome-guided in silico therapy that putatively targets these pathways is elucidated. Based on our analysis, five CNV-genes associated with the worst survival prognosis are proposed as potential prognostic markers of DLBCL.
Affiliation(s)
- Prashanthi Dharanipragada
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, Telangana 500 032, India
| | - Nita Parekh
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, Telangana 500 032, India
| |
|
38
|
A Deep Learning Approach for Detecting Copy Number Variation in Next-Generation Sequencing Data. G3: Genes, Genomes, Genetics 2019; 9:3575-3582. [PMID: 31455677 PMCID: PMC6829143 DOI: 10.1534/g3.119.400596] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 01/13/2023]
Abstract
Copy number variants (CNV) are associated with phenotypic variation in several species. However, properly detecting changes in copy numbers of sequences remains a difficult problem, especially in lower quality or lower coverage next-generation sequencing data. Here, inspired by recent applications of machine learning in genomics, we describe a method to detect duplications and deletions in short-read sequencing data. In low coverage data, machine learning appears to be more powerful in the detection of CNVs than the gold-standard methods of coverage estimation alone, and of equal power in high coverage data. We also demonstrate how replicating training sets allows a more precise detection of CNVs, even identifying novel CNVs in two genomes previously surveyed thoroughly for CNVs using long read data.
|
39
|
Kosugi S, Momozawa Y, Liu X, Terao C, Kubo M, Kamatani Y. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol 2019; 20:117. [PMID: 31159850 PMCID: PMC6547561 DOI: 10.1186/s13059-019-1720-5] [Citation(s) in RCA: 283] [Impact Index Per Article: 47.2] [Received: 10/15/2018] [Accepted: 05/20/2019] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Structural variations (SVs) or copy number variations (CNVs) greatly impact the functions of the genes encoded in the genome and are responsible for diverse human diseases. Although a number of existing SV detection algorithms can detect many types of SVs using whole genome sequencing (WGS) data, no single algorithm can call every type of SV with high precision and high recall. RESULTS We comprehensively evaluate the performance of 69 existing SV detection algorithms using multiple simulated and real WGS datasets. The results highlight a subset of algorithms that accurately call SVs depending on specific types and size ranges of the SVs and that accurately determine breakpoints, sizes, and genotypes of the SVs. We enumerate potentially good algorithms for each SV category, among which GRIDSS, Lumpy, SVseq2, SoftSV, Manta, and Wham perform better in the deletion or duplication categories. To improve the accuracy of SV calling, we systematically evaluate the accuracy of overlapping calls between possible combinations of algorithms for every type and size range of SVs. The results demonstrate that both the precision and recall for overlapping calls vary depending on the combinations of specific algorithms rather than the combinations of methods used in the algorithms. CONCLUSION These results suggest that careful selection of the algorithms for each type and size range of SVs is required for accurate calling of SVs. The selection of specific pairs of algorithms for overlapping calls promises to effectively improve the SV detection accuracy.
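Illustrative sketch (not taken from the paper): evaluating "overlapping calls" between two algorithms requires a rule for when two calls count as the same event. The Python below uses reciprocal overlap, a common convention, to intersect two call sets and to score a call set against a truth set. The 50% threshold, the (chrom, start, end) tuple layout, and all function names are assumptions for illustration, and the paper's own matching criteria may differ.

```python
def reciprocal_overlap(a, b):
    """Shared length divided by the longer of two (chrom, start, end) calls."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return shared / max(a[2] - a[1], b[2] - b[1])

def overlapping_calls(calls_a, calls_b, min_overlap=0.5):
    """Keep the calls in calls_a that at least one call in calls_b supports."""
    return [
        a for a in calls_a
        if any(reciprocal_overlap(a, b) >= min_overlap for b in calls_b)
    ]

def precision_recall(calls, truth, min_overlap=0.5):
    """Precision: matched calls / calls. Recall: matched truth / truth."""
    def matched(items, others):
        return sum(
            1 for x in items
            if any(reciprocal_overlap(x, y) >= min_overlap for y in others)
        )
    precision = matched(calls, truth) / len(calls) if calls else 0.0
    recall = matched(truth, calls) / len(truth) if truth else 0.0
    return precision, recall
```

Intersecting two call sets typically trades recall for precision, which is consistent with the abstract's point that the benefit depends on which pair of algorithms is combined.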
Affiliation(s)
- Shunichi Kosugi
- Laboratory for Statistical Analysis, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Yukihide Momozawa
- Laboratory for Genotyping Development, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Xiaoxi Liu
- Laboratory for Genotyping Development, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Chikashi Terao
- Laboratory for Statistical Analysis, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Michiaki Kubo
- RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Yoichiro Kamatani
- Laboratory for Statistical Analysis, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| |
|
40
|
Zhang L, Bai W, Yuan N, Du Z. Comprehensively benchmarking applications for detecting copy number variation. PLoS Comput Biol 2019; 15:e1007069. [PMID: 31136576 PMCID: PMC6555534 DOI: 10.1371/journal.pcbi.1007069] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Received: 01/23/2019] [Revised: 06/07/2019] [Accepted: 05/06/2019] [Indexed: 12/15/2022] Open
Abstract
Motivation: Recently, copy number variation (CNV) has gained considerable interest as a type of genomic variation that plays an important role in complex phenotypes and disease susceptibility. Since a number of CNV detection methods have recently been developed, it is necessary to help investigators choose suitable methods for CNV detection depending on their objectives. For this reason, this study compared ten commonly used CNV detection applications, including CNVnator, ReadDepth, RDXplorer, LUMPY and Control-FREEC, benchmarking the applications by sensitivity, specificity and computational demands. Taking the DGV gold standard variants as a standard dataset, we evaluated the ten applications with real sequencing data at sequencing depths from 5X to 50X. Among the ten methods benchmarked, LUMPY performs the best for both high sensitivity and specificity at each sequencing depth. For the purpose of high specificity, Canvas is also a good choice. If high sensitivity is preferred, CNVnator and RDXplorer are better choices. Additionally, CNVnator and GROM-RD perform well for low-depth sequencing data. Our results provide a comprehensive performance evaluation for these selected CNV detection methods and facilitate future development and improvement in CNV prediction methods. As an important type of genomic structural variation, CNVs are associated with complex phenotypes because they change the number of copies of genes in cells, affecting coding sequences and playing an important role in the susceptibility or resistance to human diseases. To identify CNVs, several experimental methods have been developed, but their resolution is very low, and the detection of short CNVs presents a bottleneck. In recent years, the advancement of high-throughput sequencing techniques has made it possible to precisely detect CNVs, especially short ones. Many CNV detection applications were developed based on the availability of high-throughput sequencing data. 
Due to different CNV detection algorithms, the CNVs identified by different applications vary greatly. Therefore, it is necessary to help investigators choose suitable applications for CNV detection depending upon their objectives. For this reason, we not only compared ten commonly used CNV detection applications but also benchmarked the applications by sensitivity, specificity and computational demands. Our results show that the sequencing depth can strongly affect CNV detection. Among the ten applications benchmarked, LUMPY performs best for both high sensitivity and specificity for each sequencing depth. We also give recommended applications for specific purposes, for example, CNVnator and RDXplorer for high sensitivity and CNVnator and GROM-RD for low-depth sequencing data.
Affiliation(s)
- Le Zhang
- College of Computer Science, Sichuan University, Chengdu, China
- Medical Big Data Center, Sichuan University, Chengdu, China
- Zdmedical, Information polytron Technologies Inc. Chongqing, Chongqing, China
| | - Wanyu Bai
- College of Computer Science, Sichuan University, Chengdu, China
| | - Na Yuan
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, PR China
| | - Zhenglin Du
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, PR China
| |
|