1
Zhou M, Dong J, Jiang H, Zhao Z, Yuan T. A copy number variation detection method based on OCSVM algorithm using multi strategies integration. Sci Rep 2025; 15:3526. [PMID: 39875521 PMCID: PMC11775105 DOI: 10.1038/s41598-025-88143-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2024] [Accepted: 01/24/2025] [Indexed: 01/30/2025]
Abstract
Copy number variation (CNV) is an important part of human genetic variation and is associated with various kinds of diseases. To tackle the limitations of traditional CNV detection methods, such as restricted detection types, high error rates, and challenges in precisely identifying the location of variant breakpoints, a new method called MSCNV (copy number variations detection method for multi-strategies integration based on a one-class support vector machine model) is proposed. MSCNV establishes a multi-signal channel that integrates three strategies: read depth, split read, and read pair. First, a one-class support vector machine algorithm is used to detect abnormal signals in read depth and mapping quality values to determine the rough CNV region. Then, the rough CNV region is filtered by using paired read signals to improve the precision of the MSCNV method. Finally, MSCNV explores and recognizes tandem duplication regions, interspersed duplication regions, and loss regions. It uses split read signals to determine the precise location of mutation points and the type of variation. Compared with Manta, FREEC, GROM-RD, Rsicnv, and CNVkit, MSCNV significantly improves the sensitivity, precision, F1-score, and overlap density score of CNV detection while reducing the boundary bias of the detection results.
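The evaluation measures named in this abstract (sensitivity, precision, F1-score, boundary bias) can be illustrated with a minimal sketch. The matching rule (50% reciprocal overlap), the boundary-bias definition (summed absolute start and end offsets, averaged over matched calls), and the function names `overlap` and `evaluate` are assumptions made for illustration; they are not taken from the MSCNV paper.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, same chromosome assumed


def overlap(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def evaluate(calls: List[Interval], truth: List[Interval], min_frac: float = 0.5):
    """Return (precision, sensitivity, F1, mean boundary bias).

    Assumed rule: a call matches a true CNV when the overlap covers at
    least `min_frac` of both intervals (reciprocal overlap).
    """
    matched_truth = set()
    matched_calls = 0
    biases = []
    for c in calls:
        for i, t in enumerate(truth):
            ov = overlap(c, t)
            if ov >= min_frac * (c[1] - c[0]) and ov >= min_frac * (t[1] - t[0]):
                matched_calls += 1
                matched_truth.add(i)
                # Assumed bias definition: start offset plus end offset.
                biases.append(abs(c[0] - t[0]) + abs(c[1] - t[1]))
                break
    precision = matched_calls / len(calls) if calls else 0.0
    sensitivity = len(matched_truth) / len(truth) if truth else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    boundary_bias = sum(biases) / len(biases) if biases else 0.0
    return precision, sensitivity, f1, boundary_bias
```

With three true CNVs and three calls of which two match, precision and sensitivity are both 2/3, so F1 is also 2/3. Published benchmarks may use a different matching threshold or bias definition.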
Affiliation(s)
- Mengjiao Zhou
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, 252000, Shandong, P.R. China
  - Shandong Provincial Academy of Educational Recruitment and Examination, Jinan, 250011, Shandong, P.R. China
- Jinxin Dong
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, 252000, Shandong, P.R. China
- Hua Jiang
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, 252000, Shandong, P.R. China
- Zuyao Zhao
  - Orthopedics Department, Liaocheng People's Hospital, Liaocheng, 252000, P.R. China
- Tianting Yuan
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, 252000, Shandong, P.R. China
2
Zhang Y, Liu W, Duan J. On the core segmentation algorithms of copy number variation detection tools. Brief Bioinform 2024; 25:bbae022. [PMID: 38340093 PMCID: PMC10858679 DOI: 10.1093/bib/bbae022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Revised: 10/26/2023] [Indexed: 02/12/2024]
Abstract
Shotgun sequencing is a high-throughput method used to detect copy number variants (CNVs). Although there are numerous CNV detection tools based on shotgun sequencing, their quality varies significantly, leading to performance discrepancies. Therefore, we conducted a comprehensive analysis of next-generation sequencing-based CNV detection tools over the past decade. Our findings revealed that the majority of mainstream tools employ a similar detection rationale: they calculate the so-called read depth signal from aligned sequencing reads and then segment the signal using either circular binary segmentation (CBS) or a hidden Markov model (HMM). Hence, we compared the performance of those two core segmentation algorithms in CNV detection, considering varying sequencing depths, segment lengths and complex types of CNVs. To ensure a fair comparison, we designed a parametrical model using mainstream statistical distributions, which allows for pre-excluding bias correction such as guanine-cytosine (GC) content during the preprocessing step. The results indicate the following key points: (1) Under ideal conditions, CBS demonstrates high precision, while HMM exhibits a high recall rate. (2) Under practical conditions, HMM is advantageous at lower sequencing depths, while CBS is more competitive than HMM in detecting small variant segments. (3) In cases involving complex CNVs resembling real sequencing, HMM is more robust than CBS. (4) When facing large-scale sequencing data, HMM takes less time than CBS, while their memory usage is approximately equal. These results can provide important guidance and reference for researchers developing new tools for CNV detection.
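To make the HMM side of this comparison concrete, the following is a minimal sketch of Viterbi decoding over a read depth signal expressed as log2 ratios. The three states, their mean values, the noise level `SIGMA` and the self-transition probability `STAY` are illustrative assumptions, not parameters from the paper, and real tools estimate such values from data.

```python
import math
from typing import List, Sequence

STATES = ("loss", "neutral", "gain")
MEANS = (-1.0, 0.0, 0.58)  # assumed log2 ratios for 1, 2 and 3 copies
SIGMA = 0.3                # assumed standard deviation of the noise
STAY = 0.98                # assumed probability of keeping the same state


def log_gauss(x: float, mu: float, sigma: float) -> float:
    """Log density of a normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def viterbi(signal: Sequence[float]) -> List[str]:
    """Most likely state sequence for a per-bin log2-ratio signal."""
    if not signal:
        return []
    n = len(STATES)
    log_stay = math.log(STAY)
    log_move = math.log((1 - STAY) / (n - 1))
    # Uniform initial distribution over states.
    score = [math.log(1 / n) + log_gauss(signal[0], MEANS[s], SIGMA) for s in range(n)]
    back = []
    for x in signal[1:]:
        new_score, pointers = [], []
        for s in range(n):
            # Best predecessor for state s, including the transition cost.
            cand = [score[p] + (log_stay if p == s else log_move) for p in range(n)]
            best_prev = max(range(n), key=lambda p: cand[p])
            new_score.append(cand[best_prev] + log_gauss(x, MEANS[s], SIGMA))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]
```

Because leaving a state is penalised, a single outlying bin is absorbed into its neighbours, while a sustained shift in the signal is reported as a separate segment.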
Affiliation(s)
- Yibo Zhang
  - Key Laboratory of Biomedical Information Engineering of Ministry of Education and Department of Biomedical Engineering, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
- Wenyu Liu
  - Key Laboratory of Biomedical Information Engineering of Ministry of Education and Department of Biomedical Engineering, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
- Junbo Duan
  - Key Laboratory of Biomedical Information Engineering of Ministry of Education and Department of Biomedical Engineering, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
3
Raman L, Baetens M, De Smet M, Dheedene A, Van Dorpe J, Menten B. PREFACE: In silico pipeline for accurate cell-free fetal DNA fraction prediction. Prenat Diagn 2019; 39:925-933. [PMID: 31219182 PMCID: PMC6771918 DOI: 10.1002/pd.5508] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 02/11/2019] [Revised: 05/16/2019] [Accepted: 06/15/2019] [Indexed: 12/12/2022]
Abstract
Objective: During routine noninvasive prenatal testing (NIPT), cell-free fetal DNA fraction is ideally derived from shallow-depth whole-genome sequencing data, preventing the need for additional experimental assays. The fraction of aligned reads to chromosome Y enables proper quantification for male fetuses, unlike for females, where advanced predictive procedures are required. This study introduces PREdict FetAl ComponEnt (PREFACE), a novel bioinformatics pipeline to establish fetal fraction in a gender-independent manner.
Methods: PREFACE combines the strengths of principal component analysis and neural networks to model copy number profiles.
Results: For sets of roughly 1100 male NIPT samples, a cross-validated Pearson correlation of 0.9 between predictions and fetal fractions according to Y chromosomal read counts was noted. PREFACE enables training with both male and unlabeled female fetuses. Using our complete cohort (n_female = 2468, n_male = 2723), the correlation metric reached 0.94.
Conclusions: Allowing individual institutions to generate optimized models sidelines between-laboratory bias, as PREFACE enables user-friendly training with a limited amount of retrospective data. In addition, our software provides the fetal fraction based on the copy number state of chromosome X. We show that these measures can predict mixed multiple pregnancies, sex chromosomal aneuploidies, and the source of observed aberrations.
What's already known about this topic?
Cell-free fetal DNA fraction is an important estimate during noninvasive prenatal testing (NIPT). Most techniques to establish fetal fraction require experimental procedures, which impede routine execution.
What does this study add?
PREFACE is novel software that accurately predicts fetal fraction based solely on shallow-depth whole-genome sequencing data, the fundamental base of a default NIPT assay. In contrast to previous efforts, PREFACE enables user-friendly model training with a limited amount of retrospective data.
Affiliation(s)
- Lennart Raman
  - Department of Pathology, Ghent University, Ghent University Hospital, Ghent, Belgium
  - Center for Medical Genetics, Ghent University, Ghent University Hospital, Ghent, Belgium
- Machteld Baetens
  - Center for Medical Genetics, Ghent University, Ghent University Hospital, Ghent, Belgium
- Matthias De Smet
  - Center for Medical Genetics, Ghent University, Ghent University Hospital, Ghent, Belgium
- Annelies Dheedene
  - Center for Medical Genetics, Ghent University, Ghent University Hospital, Ghent, Belgium
- Jo Van Dorpe
  - Department of Pathology, Ghent University, Ghent University Hospital, Ghent, Belgium
- Björn Menten
  - Center for Medical Genetics, Ghent University, Ghent University Hospital, Ghent, Belgium
4
Zhang L, Bai W, Yuan N, Du Z. Comprehensively benchmarking applications for detecting copy number variation. PLoS Comput Biol 2019; 15:e1007069. [PMID: 31136576 PMCID: PMC6555534 DOI: 10.1371/journal.pcbi.1007069] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Received: 01/23/2019] [Revised: 06/07/2019] [Accepted: 05/06/2019] [Indexed: 12/15/2022]
Abstract
Motivation: Recently, copy number variation (CNV) has gained considerable interest as a type of genomic variation that plays an important role in complex phenotypes and disease susceptibility. Since a number of CNV detection methods have recently been developed, it is necessary to help investigators choose suitable methods for CNV detection depending on their objectives. For this reason, this study compared ten commonly used CNV detection applications, including CNVnator, ReadDepth, RDXplorer, LUMPY and Control-FREEC, benchmarking the applications by sensitivity, specificity and computational demands. Taking the DGV gold standard variants as a standard dataset, we evaluated the ten applications with real sequencing data at sequencing depths from 5X to 50X. Among the ten methods benchmarked, LUMPY performs the best for both high sensitivity and specificity at each sequencing depth. For the purpose of high specificity, Canvas is also a good choice. If high sensitivity is preferred, CNVnator and RDXplorer are better choices. Additionally, CNVnator and GROM-RD perform well for low-depth sequencing data. Our results provide a comprehensive performance evaluation for these selected CNV detection methods and facilitate future development and improvement in CNV prediction methods.
As an important type of genomic structural variation, CNVs are associated with complex phenotypes because they change the number of copies of genes in cells, affecting coding sequences and playing an important role in the susceptibility or resistance to human diseases. To identify CNVs, several experimental methods have been developed, but their resolution is very low, and the detection of short CNVs presents a bottleneck. In recent years, the advancement of high-throughput sequencing techniques has made it possible to precisely detect CNVs, especially short ones. Many CNV detection applications were developed based on the availability of high-throughput sequencing data.
Due to different CNV detection algorithms, the CNVs identified by different applications vary greatly. Therefore, it is necessary to help investigators choose suitable applications for CNV detection depending upon their objectives. For this reason, we not only compared ten commonly used CNV detection applications but also benchmarked the applications by sensitivity, specificity and computational demands. Our results show that the sequencing depth can strongly affect CNV detection. Among the ten applications benchmarked, LUMPY performs best for both high sensitivity and specificity at each sequencing depth. We also give recommended applications for specific purposes, for example, CNVnator and RDXplorer for high sensitivity and CNVnator and GROM-RD for low-depth sequencing data.
Affiliation(s)
- Le Zhang
  - College of Computer Science, Sichuan University, Chengdu, China
  - Medical Big Data Center, Sichuan University, Chengdu, China
  - Zdmedical, Information polytron Technologies Inc. Chongqing, Chongqing, China
  - * E-mail: (LZ); (ZD)
- Wanyu Bai
  - College of Computer Science, Sichuan University, Chengdu, China
- Na Yuan
  - BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, PR China
- Zhenglin Du
  - BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, PR China
  - * E-mail: (LZ); (ZD)
5
Raman L, Dheedene A, De Smet M, Van Dorpe J, Menten B. WisecondorX: improved copy number detection for routine shallow whole-genome sequencing. Nucleic Acids Res 2019; 47:1605-1614. [PMID: 30566647 PMCID: PMC6393301 DOI: 10.1093/nar/gky1263] [Citation(s) in RCA: 77] [Impact Index Per Article: 12.8] [Received: 09/11/2018] [Revised: 11/09/2018] [Accepted: 12/06/2018] [Indexed: 12/16/2022]
Abstract
Shallow whole-genome sequencing to infer copy number alterations (CNAs) in the human genome is rapidly becoming the method par excellence for routine diagnostic use. Numerous tools exist to deduce aberrations from massive parallel sequencing data, yet most are optimized for research and often fail to meet key needs in a clinical setting. Optimally, read depth-based analytical software should be able to deal with single-end and low-coverage data, which keeps sequencing costs feasible. Other important factors include runtime, applicability to a variety of analyses and overall performance. We compared the most important aspect, normalization, across six different CNA tools, selected for their assumed ability to satisfy these needs. In conclusion, WISECONDOR, which uses a within-sample normalization technique, undoubtedly produced the best results concerning variance, distributional assumptions and basic ability to detect true variations. Nonetheless, as is the case with every tool, WISECONDOR has limitations, which arise from its exclusive focus on non-invasive prenatal testing. Therefore, this work additionally presents WisecondorX, an improved WISECONDOR that enables its use for varying types of applications. WisecondorX is freely available at https://github.com/CenterForMedicalGeneticsGhent/WisecondorX.
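The idea of within-sample normalization can be sketched as follows: each target bin is compared with a set of reference bins from the same sample rather than with the same bin in other samples. The function name `within_sample_z` and the precomputed `reference_bins` mapping are hypothetical and serve only to illustrate the principle; this is not the WISECONDOR or WisecondorX implementation.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def within_sample_z(
    counts: Sequence[float], reference_bins: Dict[int, List[int]]
) -> Dict[int, float]:
    """Z-score of each target bin against its reference bins in one sample.

    `reference_bins` maps a target bin index to the indices of bins that
    are assumed to behave similarly (at least two per target).
    """
    z = {}
    for target, refs in reference_bins.items():
        ref_counts = [counts[r] for r in refs]
        mu, sd = mean(ref_counts), stdev(ref_counts)
        # A zero spread gives no scale to compare against; report 0.
        z[target] = (counts[target] - mu) / sd if sd > 0 else 0.0
    return z
```

A bin whose count lies far outside the spread of its reference bins receives a large absolute z-score and becomes a candidate aberration. How the reference bins are chosen is the substantive part of the method and is not shown here.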
Affiliation(s)
- Lennart Raman
  - Department of Pathology, Ghent University, Ghent University Hospital, Ghent, Belgium
  - Center for Medical Genetics Ghent, Ghent University, Ghent University Hospital, Ghent, Belgium
- Annelies Dheedene
  - Center for Medical Genetics Ghent, Ghent University, Ghent University Hospital, Ghent, Belgium
- Matthias De Smet
  - Center for Medical Genetics Ghent, Ghent University, Ghent University Hospital, Ghent, Belgium
- Jo Van Dorpe
  - Department of Pathology, Ghent University, Ghent University Hospital, Ghent, Belgium
- Björn Menten
  - Center for Medical Genetics Ghent, Ghent University, Ghent University Hospital, Ghent, Belgium
6
Malekpour SA, Pezeshk H, Sadeghi M. MGP-HMM: Detecting genome-wide CNVs using an HMM for modeling mate pair insertion sizes and read counts. Math Biosci 2016; 279:53-62. [PMID: 27424951 DOI: 10.1016/j.mbs.2016.07.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/26/2016] [Revised: 06/12/2016] [Accepted: 07/10/2016] [Indexed: 01/02/2023]
Abstract
MOTIVATION: The association of copy number variation (CNV) with schizophrenia, autism, developmental disabilities and fatal diseases such as cancer has been verified. Recent developments in next-generation sequencing (NGS) have facilitated CNV studies. However, many of the current CNV detection tools are not capable of discriminating tandem duplications from non-tandem duplications. RESULTS: In this study, we propose MGP-HMM, a tool that, besides detecting genome-wide deletions, discriminates tandem duplications from non-tandem duplications. MGP-HMM takes mate pair abnormalities into account and predicts the digitized number of tandem or non-tandem copies. Abnormalities in the mate pair directions and insertion sizes, after being mapped to the reference genome, are elucidated using a hidden Markov model (HMM). For this purpose, a Gaussian mixture density with time-dependent parameters is applied for emitting mate pair insertion sizes from HMM states. Depending on the observed abnormalities in mate pair insertion size or orientation, each component in the mixture density has different parameters. MGP-HMM also applies a Poisson distribution for modeling read depth data. This parametric modeling of the mate pair reads enables us to estimate the length of CNVs precisely, which is an advantage over methods that rely only on the read depth approach for CNV detection. The hidden state of the proposed HMM is the digitized copy number of a genomic segment, and states correspond to the multipliers of the Gaussian mixture components. The accuracy of our model is validated on a set of real and simulated next-generation sequencing data and is compared to other tools.
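The Poisson read depth component mentioned in this abstract can be illustrated in isolation. The sketch below picks the copy number that maximises the Poisson likelihood of a read count, assuming the expected count scales linearly with copy number relative to a diploid baseline. The function names, the `floor` used for zero copies and the `max_copies` limit are assumptions for illustration; the mate pair mixture model and the HMM itself are not reproduced.

```python
import math


def poisson_log_pmf(k: int, lam: float) -> float:
    """Log probability of observing k events under Poisson(lam), lam > 0."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def most_likely_copy_number(
    read_count: int, diploid_mean: float, max_copies: int = 6, floor: float = 0.01
) -> int:
    """Copy number in 0..max_copies with the highest Poisson likelihood.

    Assumes the expected read count is (copies / 2) * diploid_mean; `floor`
    replaces 0 copies with a small rate so the logarithm stays defined.
    """
    best_c, best_ll = 0, -math.inf
    for c in range(max_copies + 1):
        lam = max(c, floor) * diploid_mean / 2
        ll = poisson_log_pmf(read_count, lam)
        if ll > best_ll:
            best_c, best_ll = c, ll
    return best_c
```

With a diploid baseline of 100 reads per segment, a count near 50 is assigned one copy and a count near 150 is assigned three. In a full HMM this likelihood would be one emission term, combined with transition probabilities across neighbouring segments.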
Affiliation(s)
- Seyed Amir Malekpour
  - School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran
- Hamid Pezeshk
  - School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran
  - School of Biological Sciences, Institute for Research in Fundamental Sciences, Tehran, Iran
- Mehdi Sadeghi
  - National Institute of Genetic Engineering and Biotechnology, Tehran, Iran