1
Du H, Dardas Z, Jolly A, Grochowski CM, Jhangiani SN, Li H, Muzny D, Fatih JM, Yesil G, Elçioglu NH, Gezdirici A, Marafi D, Pehlivan D, Calame DG, Carvalho CMB, Posey JE, Gambin T, Coban-Akdemir Z, Lupski JR. HMZDupFinder: a robust computational approach for detecting intragenic homozygous duplications from exome sequencing data. Nucleic Acids Res 2024; 52:e18. [PMID: 38153174 PMCID: PMC10899794 DOI: 10.1093/nar/gkad1223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Revised: 11/18/2023] [Accepted: 12/13/2023] [Indexed: 12/29/2023] Open
Abstract
Homozygous duplications contribute to genetic disease by altering gene dosage or disrupting gene regulation and can be more deleterious to organismal biology than heterozygous duplications. Intragenic exonic duplications can result in loss-of-function (LoF) or gain-of-function (GoF) alleles that, when homozygosed, i.e. brought to a homozygous state at a locus by identity by descent or state, could potentially result in autosomal recessive (AR) rare disease traits. However, the detection and functional interpretation of homozygous duplications from exome sequencing data remains a challenge. We developed a framework algorithm, HMZDupFinder, that is designed to detect exonic homozygous duplications from exome sequencing (ES) data. The HMZDupFinder algorithm can efficiently process large datasets and accurately identifies small intragenic duplications, including those associated with rare disease traits. HMZDupFinder called 965 homozygous duplications with three or fewer exons from 8,707 ES with a recall rate of 70.9% and a precision of 16.1%. We experimentally confirmed 8/10 rare homozygous duplications. Pathogenicity assessment of these copy number variant alleles allowed clinical genomics contextualization for three homozygous duplication alleles, including two affecting known OMIM disease genes, EDAR (MIM# 224900) and TNNT1 (MIM# 605355), and one variant in a novel candidate disease gene: PAAF1.
Affiliation(s)
- Haowei Du
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Zain Dardas
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Angad Jolly
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Shalini N Jhangiani
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- He Li
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Donna Muzny
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Jawid M Fatih
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Gozde Yesil
  - Department of Medical Genetics, Istanbul Medical Faculty, Istanbul 34093, Turkey
- Nursel H Elçioglu
  - Department of Pediatric Genetics, Marmara University Medical Faculty, Istanbul and Eastern Mediterranean University Faculty of Medicine, Mersin 10, Turkey
- Alper Gezdirici
  - Department of Medical Genetics, University of Health Sciences, Basaksehir Cam and Sakura City Hospital, 34480 Istanbul, Turkey
- Dana Marafi
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Pediatrics, Faculty of Medicine, Kuwait University, Kuwait
- Davut Pehlivan
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Section of Pediatric Neurology and Developmental Neuroscience, Department of Pediatrics, Baylor College of Medicine, Houston, TX, USA
  - Texas Children's Hospital, Houston, TX 77030, USA
- Daniel G Calame
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Section of Pediatric Neurology and Developmental Neuroscience, Department of Pediatrics, Baylor College of Medicine, Houston, TX, USA
  - Texas Children's Hospital, Houston, TX 77030, USA
- Claudia M B Carvalho
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Pacific Northwest Research Institute, Seattle, WA 98122, USA
- Jennifer E Posey
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Tomasz Gambin
  - Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland
  - Department of Medical Genetics, Institute of Mother and Child, Warsaw, Poland
- Zeynep Coban-Akdemir
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- James R Lupski
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Texas Children's Hospital, Houston, TX 77030, USA
  - Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA
2
Kuśmirek W. Different Strategies for Counting the Depth of Coverage in Copy Number Variation Calling Tools. Bioinform Biol Insights 2022; 16:11779322221115534. [PMID: 35935530 PMCID: PMC9354125 DOI: 10.1177/11779322221115534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2022] [Accepted: 07/02/2022] [Indexed: 12/04/2022] Open
Abstract
There are many copy number variation (CNV) detection tools based on the depth of coverage. A characteristic feature of all tools based on the depth of coverage is the first stage of data processing: counting the depth of coverage in the investigated sequencing regions. However, each tool implements this stage in a slightly different way. Herein, we used data from the 1000 Genomes Project to present the impact of different depth of coverage counting strategies on the results of the CNV detection process. In the study, we used 7 CNV calling tools: CODEX, CANOES, exomeCopy, ExomeDepth, CLAMMS, CNVkit, and CNVind; from each of these applications, we separated the process of counting the depth of coverage into independent modules. Then, we counted the depth of coverage with each of these modules, and finally, the resulting depth of coverage tables were used as the input data set to the other CNV calling tools. The performed experiments showed that the best methods of counting the depth of coverage are the algorithms implemented in the CLAMMS and CNVkit applications. Both yield much better sets of detected CNVs than the depth of coverage counting implemented in the other tools. Moreover, some CNV detection tools are reasonably resistant to changing the input depth of coverage table. In this study, we proved that the exomeCopy application gives an approximately similar set of resulting rare CNVs, regardless of the method used to count the depth of coverage table.
Affiliation(s)
- Wiktor Kuśmirek
  - Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland
3
CNVind: an open source cloud-based pipeline for rare CNVs detection in whole exome sequencing data based on the depth of coverage. BMC Bioinformatics 2022; 23:85. [PMID: 35247967 PMCID: PMC8897915 DOI: 10.1186/s12859-022-04617-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/17/2021] [Accepted: 02/22/2022] [Indexed: 11/16/2022] Open
Abstract
Background: A typical copy number variation (CNV) detection process based on the depth of coverage in whole exome sequencing (WES) data consists of several steps: (I) calculating the depth of coverage in sequencing regions, (II) quality control, (III) normalizing the depth of coverage, (IV) calling CNVs. Previous tools performed one normalization process for each chromosome: all the coverage depths in the sequencing regions from a given chromosome were normalized in a single run. Methods: Herein, we present the new CNVind tool for calling CNVs, where the normalization process is conducted separately for each of the sequencing regions. The total number of normalizations is equal to the number of sequencing regions in the investigated dataset. For example, when analyzing a dataset composed of n sequencing regions, CNVind performs n independent depth of coverage normalizations. Before each normalization, the application selects the k most correlated sequencing regions, with Pearson's correlation of the depth of coverage as the distance metric. Then, the resulting subgroup of k+1 sequencing regions is normalized, and the results of all n independent normalizations are combined; finally, the segmentation and CNV calling process is performed on the resultant dataset. Results and conclusions: We used WES data from the 1000 Genomes project to evaluate the impact of independent normalization on CNV calling performance and compared the results with state-of-the-art tools: CODEX and exomeCopy. The results proved that independent normalization significantly improves the specificity of rare CNV detection. For example, for the investigated dataset, we reduced the number of FP calls from over 15,000 to around 5000 while maintaining a constant number of TP calls equal to about 150 CNVs. However, independent normalization of each sequencing region is a computationally expensive process; therefore, our pipeline is customized and can be easily run in a cloud computing environment, on a computer cluster, or on a single CPU server. To our knowledge, the presented application is the first attempt to implement an innovative approach to independent normalization of the depth of WES data coverage. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04617-x.
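The per-region selection step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the CNVind implementation: the function names, the dictionary layout (region name mapped to per-sample depths), and in particular the median-ratio normalization are hypothetical stand-ins, since the abstract does not say which normalization is applied to each group of k+1 regions.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length depth vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant depth: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def top_k_correlated(depth, target, k):
    """Return the k regions whose depth correlates best with `target`."""
    scored = [(pearson(depth[target], values), region)
              for region, values in depth.items() if region != target]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [region for _, region in scored[:k]]


def normalize_region(depth, target, k):
    """Normalize one region against its own group of k+1 regions.

    ASSUMPTION: the per-sample median ratio below is a placeholder;
    the abstract does not specify the normalization model.
    """
    group = [target] + top_k_correlated(depth, target, k)
    ratios = []
    for j in range(len(depth[target])):
        scale = statistics.median(depth[region][j] for region in group)
        ratios.append(depth[target][j] / scale if scale else 0.0)
    return ratios


def normalize_all(depth, k):
    """Run one independent normalization per region (n runs for n regions)."""
    return {region: normalize_region(depth, region, k) for region in depth}
```

Because every region triggers its own correlation search and normalization, the cost grows with the number of regions, which is consistent with the abstract's remark that the approach is computationally expensive.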
4
Gordeeva V, Sharova E, Arapidi G. Progress in Methods for Copy Number Variation Profiling. Int J Mol Sci 2022; 23:ijms23042143. [PMID: 35216262 PMCID: PMC8879278 DOI: 10.3390/ijms23042143] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/17/2022] [Revised: 02/09/2022] [Accepted: 02/11/2022] [Indexed: 02/04/2023] Open
Abstract
Copy number variations (CNVs) are the predominant class of structural genomic variations involved in the processes of evolutionary adaptation, genomic disorders, and disease progression. Compared with single-nucleotide variants, there have been challenges associated with the detection of CNVs owing to their diverse sizes. However, the field has seen significant progress in the past 20–30 years. This has been made possible due to the rapid development of molecular diagnostic methods which ensure a more detailed view of the genome structure, further complemented by recent advances in computational methods. Here, we review the major approaches that have been used to routinely detect CNVs, ranging from cytogenetics to the latest sequencing technologies, and then cover their specific features.
Affiliation(s)
- Veronika Gordeeva
  - Center for Precision Genome Editing and Genetic Technologies for Biomedicine, Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia
  - Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia; (E.S.); (G.A.)
  - Moscow Institute of Physics and Technology, National Research University, Moscow Oblast, 141701 Moscow, Russia
  - Correspondence:
- Elena Sharova
  - Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia; (E.S.); (G.A.)
- Georgij Arapidi
  - Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia; (E.S.); (G.A.)
  - Moscow Institute of Physics and Technology, National Research University, Moscow Oblast, 141701 Moscow, Russia
  - Shemyakin–Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, 117997 Moscow, Russia
5
Markiewicz M, Koperwas J. Evaluation Platform for DDM Algorithms With the Usage of Non-Uniform Data Distribution Strategies. International Journal of Information Technologies and Systems Approach 2022. [DOI: 10.4018/ijitsa.290000] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Huge amounts of data are collected in numerous independent data storage facilities around the world. However, how the data is distributed between physical locations remains unspecified. Downloading all of the data for the purpose of processing it is undesirable and sometimes even impossible. Various methods have been proposed for performing data mining tasks, but the main problem is the lack of an objective strategy for comparing them. The authors present current research on a novel evaluation platform for distributed data mining (DDM) algorithms. The proposed platform opens up a new field to evaluate algorithms in terms of the quality of the results, transfer used, and speed, but also for the use of a non-uniform data distribution among independent nodes during algorithm evaluation. This work introduces a ‘data partitioning strategy’ term referring to a specific, not necessarily uniform data distribution. A brief evaluation for three clustering algorithms is also reported, showing the usability and simplicity of identifying differences in processing with the use of the platform.
6
Bigio B, Seeleuthner Y, Kerner G, Migaud M, Rosain J, Boisson B, Nasca C, Puel A, Bustamante J, Casanova JL, Abel L, Cobat A. Detection of homozygous and hemizygous complete or partial exon deletions by whole-exome sequencing. NAR Genom Bioinform 2021; 3:lqab037. [PMID: 34046589 PMCID: PMC8140739 DOI: 10.1093/nargab/lqab037] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/20/2020] [Revised: 03/19/2021] [Accepted: 05/03/2021] [Indexed: 12/11/2022] Open
Abstract
The detection of copy number variations (CNVs) in whole-exome sequencing (WES) data is important, as CNVs may underlie a number of human genetic disorders. The recently developed HMZDelFinder algorithm can detect rare homozygous and hemizygous (HMZ) deletions in WES data more effectively than other widely used tools. Here, we present HMZDelFinder_opt, an approach that outperforms HMZDelFinder for the detection of HMZ deletions, including partial exon deletions in particular, in WES data from laboratory patient collections that were generated over time in different experimental conditions. We show that using an optimized reference control set of WES data, based on a PCA-derived Euclidean distance for coverage, strongly improves the detection of HMZ complete exon deletions both in real patients carrying validated disease-causing deletions and in simulated data. Furthermore, we develop a sliding window approach enabling HMZDelFinder_opt to identify HMZ partial deletions of exons that are undiscovered by HMZDelFinder. HMZDelFinder_opt is a timely and powerful approach for detecting HMZ deletions, particularly partial exon deletions, in WES data from inherently heterogeneous laboratory patient collections.
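The two ideas named in this abstract, choosing reference controls by Euclidean distance in PCA space and scanning exons with a sliding window, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the principal-component coordinates are taken as already computed, and the function names, window size, step, and coverage cutoff are hypothetical rather than taken from HMZDelFinder_opt.

```python
import math


def nearest_controls(pcs, sample_id, n_controls):
    """Pick the n_controls samples closest to `sample_id` in PCA space.

    `pcs` maps sample id -> tuple of principal-component coordinates
    (assumed to be precomputed from the coverage profiles).
    """
    query = pcs[sample_id]
    scored = [(math.dist(query, coords), other)
              for other, coords in pcs.items() if other != sample_id]
    scored.sort()
    return [other for _, other in scored[:n_controls]]


def low_coverage_windows(coverage, size, step, max_mean):
    """Return (start, end) windows whose mean coverage is <= max_mean.

    `coverage` is per-base depth across one exon; flagged windows are
    candidate partial deletions under this simplified rule.
    """
    flagged = []
    for start in range(0, len(coverage) - size + 1, step):
        window = coverage[start:start + size]
        if sum(window) / size <= max_mean:
            flagged.append((start, start + size))
    return flagged
```

In this sketch a window is flagged purely on its own mean depth; a fuller version would presumably compare each window against the selected controls, which is the reason for choosing a closely matched reference set in the first place.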
Affiliation(s)
- Benedetta Bigio
  - St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, NY 10065, USA
- Yoann Seeleuthner
  - Laboratory of Human Genetics of Infectious Diseases, Necker Branch, INSERM U1163, Necker Hospital for Sick Children, 75015 Paris, France
- Gaspard Kerner
  - Laboratory of Human Genetics of Infectious Diseases, Necker Branch, INSERM U1163, Necker Hospital for Sick Children, 75015 Paris, France
- Mélanie Migaud
  - Laboratory of Human Genetics of Infectious Diseases, Necker Branch, INSERM U1163, Necker Hospital for Sick Children, 75015 Paris, France
- Jérémie Rosain
  - Laboratory of Human Genetics of Infectious Diseases, Necker Branch, INSERM U1163, Necker Hospital for Sick Children, 75015 Paris, France
- Bertrand Boisson
  - St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, NY 10065, USA
- Carla Nasca
  - Laboratory of Neuroendocrinology, The Rockefeller University, New York, NY 10065, USA
- Anne Puel
  - St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, NY 10065, USA
- Jacinta Bustamante
  - St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, NY 10065, USA
- Jean-Laurent Casanova
  - St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, NY 10065, USA
- Laurent Abel
  - St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, NY 10065, USA
- Aurelie Cobat
  - Laboratory of Human Genetics of Infectious Diseases, Necker Branch, INSERM U1163, Necker Hospital for Sick Children, 75015 Paris, France
7
Ehsani R, Drabløs F. Robust Distance Measures for kNN Classification of Cancer Data. Cancer Inform 2020; 19:1176935120965542. [PMID: 33116353 PMCID: PMC7573750 DOI: 10.1177/1176935120965542] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 05/04/2020] [Accepted: 09/19/2020] [Indexed: 11/23/2022] Open
Abstract
The k-Nearest Neighbor (kNN) classifier represents a simple and very general approach to classification. Still, the performance of kNN classifiers can often compete with more complex machine-learning algorithms. The core of kNN depends on a "guilt by association" principle where classification is performed by measuring the similarity between a query and a set of training patterns, often computed as distances. The relative performance of kNN classifiers is closely linked to the choice of distance or similarity measure, and it is therefore relevant to investigate the effect of using different distance measures when comparing biomedical data. In this study on classification of cancer data sets, we have used both common and novel distance measures, including the novel distance measures Sobolev and Fisher, and we have evaluated the performance of kNN with these distances on 4 cancer data sets of different type. We find that the performance when using the novel distance measures is comparable to the performance with more well-established measures, in particular for the Sobolev distance. We define a robust ranking of all the distance measures according to overall performance. Several distance measures show robust performance in kNN over several data sets, in particular the Hassanat, Sobolev, and Manhattan measures. Some of the other measures show good performance on selected data sets but seem to be more sensitive to the nature of the classification data. It is therefore important to benchmark distance measures on similar data prior to classification to identify the most suitable measure in each case.
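To make the role of the distance measure concrete, here is a minimal kNN sketch in which the distance function is a swappable argument. It is an illustration only: the function names and the toy labels are hypothetical, and only the familiar Manhattan and Euclidean distances are shown, not the Sobolev, Fisher, or Hassanat measures evaluated in the study.

```python
import math
from collections import Counter


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k, distance):
    """Classify `query` by majority vote among its k nearest neighbours.

    `train` is a list of (feature_vector, label) pairs; `distance` is any
    function taking two feature vectors and returning a number.
    """
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Passing the distance as a parameter is what makes benchmarking straightforward: the same training data and k can be rerun with each candidate measure, which mirrors the abstract's recommendation to compare measures on similar data before settling on one.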
Affiliation(s)
- Rezvan Ehsani
  - Department of Mathematics, University of Zabol, Zabol, Iran
  - Department of Bioinformatics, University of Zabol, Zabol, Iran
- Finn Drabløs
  - Department of Clinical and Molecular Medicine, NTNU – Norwegian University of Science and Technology, Trondheim, Norway
8
Detection of copy-number variations from NGS data using read depth information: a diagnostic performance evaluation. Eur J Hum Genet 2020; 29:99-109. [PMID: 32591635 DOI: 10.1038/s41431-020-0672-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 11/22/2019] [Revised: 05/20/2020] [Accepted: 06/09/2020] [Indexed: 12/30/2022] Open
Abstract
The detection of copy-number variations (CNVs) from NGS data is underexploited as chip-based or targeted techniques are still commonly used. We assessed the performance of a workflow centered on CANOES, a bioinformatics tool based on read depth information. We applied our workflow to gene panel (GP) and whole-exome sequencing (WES) data, and compared CNV calls to quantitative multiplex PCR of short fluorescent fragments (QMPSF) or array comparative genomic hybridization (aCGH) results. From GP data of 3776 samples, we reached an overall positive predictive value (PPV) of 87.8%. This dataset included a complete comprehensive QMPSF comparison of four genes (60 exons) on which we obtained 100% sensitivity and specificity. From WES data, we first compared 137 samples with aCGH and filtered comparable events (exonic CNVs encompassing enough aCGH probes) and obtained an 87.25% sensitivity. The overall PPV was 86.4% following the targeted confirmation of candidate CNVs from 1056 additional WES. In addition, our CANOES-centered workflow on WES data allowed the detection of CNVs with a resolution of single exons, allowing the detection of CNVs that were missed by aCGH. Overall, switching to an NGS-only approach should be cost-effective as it allows a reduction in overall costs together with likely stable diagnostic yields. Our bioinformatics pipeline is available at: https://gitlab.bioinfo-diag.fr/nc4gpm/canoes-centered-workflow .
9
Fontanilles M, Marguet F, Ruminy P, Basset C, Noel A, Beaussire L, Viennot M, Viailly PJ, Cassinari K, Chambon P, Richard D, Alexandru C, Tennevet I, Langlois O, Di Fiore F, Laquerrière A, Clatot F, Sarafan-Vasseur N. Simultaneous detection of EGFR amplification and EGFRvIII variant using digital PCR-based method in glioblastoma. Acta Neuropathol Commun 2020; 8:52. [PMID: 32303258 PMCID: PMC7165387 DOI: 10.1186/s40478-020-00917-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 01/29/2020] [Accepted: 03/13/2020] [Indexed: 12/20/2022] Open
Abstract
Epidermal growth factor receptor (EGFR) amplification and EGFR variant III (EGFRvIII, deletion of exons 2-7) are of clinical interest for glioblastoma. The aim was to develop a digital PCR (dPCR)-based method using locked nucleic acid (LNA)-based hydrolysis probes, allowing the simultaneous detection of the EGFR amplification and EGFRvIII variant. Sixty-two patients were included. An exploratory cohort (n = 19) was used to develop the dPCR assay using three selected amplicons within the EGFR gene, targeting intron 1 (EGFR1), junction of exon 3 and intron 3 (EGFR2) and intron 22 (EGFR3). The copy number of EGFR was estimated by the relative quantification of EGFR1, EGFR2 and EGFR3 amplicon droplets compared to the droplets of a reference gene. EGFRvIII was identified by comparing the copy number of the EGFR2 amplicon to either the EGFR1 or EGFR3 amplicon. dPCR results were compared to fluorescence in situ hybridization (FISH) and next-generation sequencing for amplification, and to an RT-PCR-based method for EGFRvIII. The dPCR assay was then tested in a validation cohort (n = 43). A total of 8/19 EGFR-amplified and 5/19 EGFRvIII-positive tumors were identified in the exploratory cohort. Compared to FISH, the EGFR3 dPCR assay detected all EGFR-amplified tumors (8/8, 100%) and had the highest concordance with the copy number estimation by NGS. The concordance between RT-PCR and dPCR was also 100% for detecting EGFRvIII using an absolute difference of 10.8 for the copy number between EGFR2 and EGFR3 probes. In the validation cohort, the sensitivity and specificity of dPCR using EGFR3 probes were 100% for the EGFR amplification detection compared to FISH (19/19). EGFRvIII was detected by dPCR in 8 EGFR-amplified patients and confirmed by RT-PCR. Compared to FISH, the EGFR2/EGFR3 dPCR assay was estimated to cost half as much. These results highlight that dPCR allowed the simultaneous detection of EGFR amplification and EGFRvIII for glioblastoma.
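The arithmetic behind the two calls in this abstract can be sketched as below. This is a simplified illustration, not the published assay: it assumes the reference gene is present at two copies, uses raw positive-droplet counts without any Poisson correction, and treats the 10.8 absolute difference as an inclusive cutoff; the function names are hypothetical.

```python
# Absolute EGFR2 vs EGFR3 copy-number difference reported in the abstract.
EGFRVIII_MIN_ABS_DIFF = 10.8


def copy_number(target_droplets, reference_droplets, reference_copies=2.0):
    """Estimate copy number from target vs reference positive droplets.

    ASSUMPTION: the reference gene is diploid (two copies) and droplet
    counts are used directly, without Poisson correction.
    """
    if reference_droplets <= 0:
        raise ValueError("reference droplet count must be positive")
    return reference_copies * target_droplets / reference_droplets


def is_egfrviii(cn_egfr2, cn_egfr3, threshold=EGFRVIII_MIN_ABS_DIFF):
    """Flag EGFRvIII when EGFR2 and EGFR3 copy numbers diverge enough.

    ASSUMPTION: the cutoff is inclusive; the abstract does not say.
    """
    return abs(cn_egfr2 - cn_egfr3) >= threshold
```

The comparison works because EGFR2 sits at the exon 3/intron 3 junction, inside the exon 2-7 region deleted in EGFRvIII, whereas EGFR3 in intron 22 lies outside it, so the deletion lowers the EGFR2 signal relative to EGFR3.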