1
Zhang Y, Liu W, Duan J. On the core segmentation algorithms of copy number variation detection tools. Brief Bioinform 2024; 25:bbae022. [PMID: 38340093 PMCID: PMC10858679 DOI: 10.1093/bib/bbae022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Revised: 10/26/2023] [Indexed: 02/12/2024]
Abstract
Shotgun sequencing is a high-throughput method used to detect copy number variants (CNVs). Although there are numerous CNV detection tools based on shotgun sequencing, their quality varies significantly, leading to performance discrepancies. Therefore, we conducted a comprehensive analysis of next-generation sequencing-based CNV detection tools from the past decade. Our findings revealed that the majority of mainstream tools employ a similar detection rationale: they calculate the so-called read depth signal from aligned sequencing reads and then segment the signal using either circular binary segmentation (CBS) or a hidden Markov model (HMM). Hence, we compared the performance of these two core segmentation algorithms in CNV detection, considering varying sequencing depths, segment lengths and complex types of CNVs. To ensure a fair comparison, we designed a parametric model using mainstream statistical distributions, which allows bias correction, such as for guanine-cytosine (GC) content, to be excluded in advance during the preprocessing step. The results indicate the following key points: (1) Under ideal conditions, CBS demonstrates high precision, while HMM exhibits a high recall rate. (2) Under practical conditions, HMM is advantageous at lower sequencing depths, while CBS is more competitive than HMM in detecting small variant segments. (3) In cases involving complex CNVs resembling real sequencing, HMM is more robust than CBS. (4) When facing large-scale sequencing data, HMM takes less time than CBS, while their memory usage is approximately equal. These findings can provide important guidance and reference for researchers developing new tools for CNV detection.
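The segmentation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' method and not CBS itself: it shows only a single plain binary-segmentation split on a toy read depth signal, choosing the breakpoint that maximizes the between-segment sum of squares. The function name `best_split` and the toy depth values are assumptions made for illustration.

```python
def best_split(depth):
    """Find the one breakpoint that best separates a read depth signal
    into two segments with different means (hypothetical helper).

    Returns (index, score), where depth[:index] and depth[index:] are the
    two segments and score is the between-segment sum of squares.
    """
    n = len(depth)
    total = sum(depth)
    best_i, best_score = None, 0.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += depth[i - 1]
        mean_left = left_sum / i
        mean_right = (total - left_sum) / (n - i)
        # Between-segment sum of squares for a split at position i.
        score = i * (n - i) / n * (mean_left - mean_right) ** 2
        if score > best_score:
            best_i, best_score = i, score
    return best_i, best_score


# Toy read depth per bin: the depth doubles from the fifth bin onward,
# as it might for a duplicated region (made-up values).
toy_depth = [30, 31, 29, 30, 60, 61, 59, 60]
split_index, split_score = best_split(toy_depth)
```

CBS, as generally described, instead tests circular segments (two breakpoints at once) and assesses significance before recursing; this sketch omits both steps.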
Affiliation(s)
- Yibo Zhang
  - Key Laboratory of Biomedical Information Engineering of Ministry of Education and Department of Biomedical Engineering, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
- Wenyu Liu
  - Key Laboratory of Biomedical Information Engineering of Ministry of Education and Department of Biomedical Engineering, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
- Junbo Duan
  - Key Laboratory of Biomedical Information Engineering of Ministry of Education and Department of Biomedical Engineering, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
2
Zhou M, Zhang C, Chen M, Hu Z, Li M, Li Z, Wu L, Liang D. A protospacer adjacent motif-free, multiplexed, and quantitative nucleic acid detection platform with barcode-based Cas12a activity. MedComm (Beijing) 2023; 4:e310. [PMID: 37405277 PMCID: PMC10315165 DOI: 10.1002/mco2.310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Revised: 05/20/2023] [Accepted: 05/22/2023] [Indexed: 07/06/2023]
Abstract
Clustered regularly interspaced short palindromic repeat (CRISPR)-based biosensors have been developed to facilitate the rapid and sensitive detection of nucleic acids. However, most approaches using CRISPR-based detection have disadvantages associated with the limitations of CRISPR RNA (crRNA), protospacer adjacent motif (PAM) or protospacer flanking sequence restriction, single-channel detection, and difficulty in quantitative detection, resulting in only some target sites being detected, and only qualitatively. Here, we aimed to develop a barcode-based Cas12a-mediated DNA detection (BCDetection) strategy, which overcomes the aforementioned drawbacks and enables (1) detection with a universal PAM and crRNA without PAM or crRNA restriction, (2) simultaneous detection of multiple targets in a single reaction, and (3) quantitative detection, which can significantly distinguish copy number differences as small as two-fold. We could efficiently and simultaneously detect three β-thalassemia mutations in a single reaction using BCDetection. Notably, samples from normal individuals, spinal muscular atrophy (SMA) carriers, and SMA patients were significantly and accurately distinguished using the quantitative detection ability of BCDetection, indicating its potential application in β-thalassemia and SMA carrier screening. Therefore, our findings demonstrate that BCDetection provides a new platform for accurate and efficient quantitative detection using CRISPR/Cas12a, highlighting its bioanalytical applications.
Affiliation(s)
- Miaojin Zhou
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha, Hunan, China
- Chunhua Zhang
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha, Hunan, China
  - Department of Medical Genetics, Yunnan Maternal and Child Health Care Hospital, Kunming, Yunnan, China
- Miaomiao Chen
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha, Hunan, China
- Zhiqing Hu
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha, Hunan, China
- Menglin Li
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha, Hunan, China
- Zhuo Li
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha, Hunan, China
- Lingqian Wu
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha, Hunan, China
- Desheng Liang
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha, Hunan, China
3
Wen S, Wang M, Qian X, Li Y, Wang K, Choi J, Pennesi ME, Yang P, Marra M, Koenekoop RK, Lopez I, Matynia A, Gorin M, Sui R, Yao F, Goetz K, Porto FBO, Chen R. Systematic assessment of the contribution of structural variants to inherited retinal diseases. Hum Mol Genet 2023; 32:2005-2015. [PMID: 36811936 PMCID: PMC10244226 DOI: 10.1093/hmg/ddad032] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/24/2022] [Revised: 01/03/2023] [Accepted: 02/11/2023] [Indexed: 02/24/2023]
Abstract
Despite increasing success in determining genetic diagnosis for patients with inherited retinal diseases (IRDs), mutations in about 30% of the IRD cases remain unclear or unsettled after targeted gene panel or whole exome sequencing. In this study, we aimed to investigate the contributions of structural variants (SVs) to settling the molecular diagnosis of IRD with whole-genome sequencing (WGS). A cohort of 755 IRD patients whose pathogenic mutations remain undefined was subjected to WGS. Four SV calling algorithms, MANTA, DELLY, LUMPY and CNVnator, were used to detect SVs throughout the genome. All SVs identified by any one of these four algorithms were included for further analysis. AnnotSV was used to annotate these SVs. SVs that overlap with known IRD-associated genes were examined with sequencing coverage, junction reads and discordant read pairs. Polymerase chain reaction (PCR) followed by Sanger sequencing was used to further confirm the SVs and identify the breakpoints. Segregation analysis of the candidate pathogenic alleles with the disease was performed when possible. A total of 16 candidate pathogenic SVs were identified in 16 families, including deletions and inversions, representing 2.1% of patients with previously unsolved IRDs. Autosomal dominant, autosomal recessive and X-linked inheritance of disease-causing SVs were observed in 12 different genes. Among these, SVs in CLN3, EYS and PRPF31 were found in multiple families. Our study suggests that the contribution of SVs detected by short-read WGS is about 0.25% of our IRD patient cohort and is significantly lower than that of single nucleotide changes and small insertions and deletions.
Affiliation(s)
- Shu Wen
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Meng Wang
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Xinye Qian
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Yumei Li
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Keqing Wang
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Jongsu Choi
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Mark E Pennesi
  - Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR 97239, USA
- Paul Yang
  - Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR 97239, USA
- Molly Marra
  - Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR 97239, USA
- Robert K Koenekoop
  - McGill Ocular Genetics Laboratory and Centre, Department of Paediatric Surgery, Human Genetics, and Ophthalmology, McGill University Health Centre, Montreal, Quebec, H4A 3S5, Canada
- Irma Lopez
  - McGill Ocular Genetics Laboratory and Centre, Department of Paediatric Surgery, Human Genetics, and Ophthalmology, McGill University Health Centre, Montreal, Quebec, H4A 3S5, Canada
- Anna Matynia
  - Jules Stein Eye Institute, Los Angeles, CA 90095, USA
  - Ophthalmology, University of California Los Angeles David Geffen School of Medicine, Los Angeles, CA 90095, USA
- Michael Gorin
  - Jules Stein Eye Institute, Los Angeles, CA 90095, USA
  - Ophthalmology, University of California Los Angeles David Geffen School of Medicine, Los Angeles, CA 90095, USA
- Ruifang Sui
  - Department of Ophthalmology, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, 100005, China
- Fengxia Yao
  - Medical Research Center, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, 100005, China
- Kerry Goetz
  - Office of the Director, National Eye Institute/National Institutes of Health, Bethesda, MD 20892, USA
- Fernanda Belga Ottoni Porto
  - INRET Clínica e Centro de Pesquisa, Belo Horizonte, Minas Gerais, 30150270, Brazil
  - Department of Ophthalmology, Santa Casa de Misericórdia de Belo Horizonte, Belo Horizonte, Minas Gerais, 30150221, Brazil
  - Centro Oftalmológico de Minas Gerais, Belo Horizonte, Minas Gerais, 30180070, Brazil
- Rui Chen
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
4
Ding Y, Liao Y, He J, Ma J, Wei X, Liu X, Zhang G, Wang J. Enhancing genomic mutation data storage optimization based on the compression of asymmetry of sparsity. Front Genet 2023; 14:1213907. [PMID: 37323665 PMCID: PMC10267386 DOI: 10.3389/fgene.2023.1213907] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2023] [Accepted: 05/24/2023] [Indexed: 06/17/2023]
Abstract
Background: With the rapid development of high-throughput sequencing technology and the explosive growth of genomic data, storing, transmitting and processing massive amounts of data has become a new challenge. Achieving fast lossless compression and decompression that exploits the characteristics of the data, in order to speed up data transmission and processing, requires research on relevant compression algorithms. Methods: In this paper, a compression algorithm for sparse asymmetric gene mutations (CA_SAGM), based on the characteristics of sparse genomic mutation data, was proposed. The data were first sorted on a row-first basis so that neighboring non-zero elements were as close as possible to each other. The data were then renumbered using the reverse Cuthill-McKee sorting technique. Finally, the data were compressed into compressed sparse row (CSR) format and stored. We analyzed and compared the results of the CA_SAGM, coordinate format (COO) and compressed sparse column format (CSC) algorithms for sparse asymmetric genomic data. Nine types of single-nucleotide variation (SNV) data and six types of copy number variation (CNV) data from the TCGA database were used as the subjects of this study. Compression and decompression time, compression and decompression rate, compression memory and compression ratio were used as evaluation metrics. The correlation between each metric and the basic characteristics of the original data was further investigated. Results: The experimental results showed that the COO method had the shortest compression time, the fastest compression rate and the largest compression ratio, and had the best compression performance. CSC compression performance was the worst, and CA_SAGM compression performance was between the two. When decompressing the data, CA_SAGM performed the best, with the shortest decompression time and the fastest decompression rate. COO decompression performance was the worst. With increasing sparsity, the COO, CSC and CA_SAGM algorithms all exhibited longer compression and decompression times, lower compression and decompression rates, larger compression memory and lower compression ratios. When the sparsity was large, the compression memory and compression ratio of the three algorithms showed no differences, but the remaining metrics still differed. Conclusion: CA_SAGM was an efficient compression algorithm that combines compression and decompression performance for sparse genomic mutation data.
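The storage formats compared in this abstract can be illustrated with a minimal pure-Python sketch of COO and CSR encoding, including a lossless round trip. It is not the CA_SAGM implementation: the row-first sorting and reverse Cuthill-McKee renumbering steps are omitted, and the helper names (`to_coo`, `to_csr`, `from_csr`) and the toy matrix are assumptions made for illustration.

```python
def to_coo(dense):
    """Coordinate format: parallel lists of row, column and value
    for every non-zero entry, in row-major order."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return rows, cols, vals


def to_csr(dense):
    """Compressed sparse row: indptr[i]:indptr[i + 1] delimits the
    column indices and values that belong to row i."""
    indptr, indices, data = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data


def from_csr(indptr, indices, data, n_cols):
    """Rebuild the dense matrix from its CSR arrays (lossless)."""
    dense = []
    for i in range(len(indptr) - 1):
        row = [0] * n_cols
        for k in range(indptr[i], indptr[i + 1]):
            row[indices[k]] = data[k]
        dense.append(row)
    return dense


# Toy mutation matrix (samples x loci); the values are made up.
toy = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [2, 0, 0, 3],
]
coo = to_coo(toy)
csr = to_csr(toy)
restored = from_csr(*csr, 4)
```

COO stores one row index per non-zero entry, whereas CSR stores one row pointer per row plus a final one, so which format is smaller depends on the number of rows relative to the number of non-zero entries.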
Affiliation(s)
- Youde Ding
  - The Sixth Affiliated Hospital of Guangzhou Medical University, Qingyuan People’s Hospital, Qingyuan, China
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou, China
- Yuan Liao
  - The Sixth Affiliated Hospital of Guangzhou Medical University, Qingyuan People’s Hospital, Qingyuan, China
- Ji He
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou, China
- Jianfeng Ma
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou, China
- Xu Wei
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou, China
- Xuemei Liu
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou, China
- Guiying Zhang
  - The Sixth Affiliated Hospital of Guangzhou Medical University, Qingyuan People’s Hospital, Qingyuan, China
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou, China
- Jing Wang
  - The Sixth Affiliated Hospital of Guangzhou Medical University, Qingyuan People’s Hospital, Qingyuan, China
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou, China