1. Ahmad RM, Ali BR, Al-Jasmi F, Sinnott RO, Al Dhaheri N, Mohamad MS. A review of genetic variant databases and machine learning tools for predicting the pathogenicity of breast cancer. Brief Bioinform 2023; 25:bbad479. [PMID: 38149678 PMCID: PMC10782903 DOI: 10.1093/bib/bbad479] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2023] [Revised: 09/22/2023] [Accepted: 12/04/2023] [Indexed: 12/28/2023] Open
Abstract
Studies continue to uncover contributing risk factors for breast cancer (BC) development, including genetic variants. Advances in machine learning and big data generated from genetic sequencing can now be used for predicting BC pathogenicity. However, it is unclear which tool developed for pathogenicity prediction is most suited for predicting the impact and pathogenicity of variant effects. A significant challenge is to determine the most suitable data source for each tool, since different tools can yield different prediction results with different data inputs. To this end, this work reviews genetic variant databases and tools used specifically for the prediction of BC pathogenicity. We provide a description of existing genetic variant databases and, where appropriate, the diseases for which they have been established. Through examples, we illustrate how they can be used for prediction of BC pathogenicity and discuss their associated advantages and disadvantages. We conclude that tools specialized by training on multiple diverse datasets from different databases for the same disease have enhanced accuracy and specificity and are thereby more helpful to clinicians in predicting and diagnosing BC as early as possible.
Affiliation(s)
- Rahaf M Ahmad
  - Health Data Science Lab, Department of Genetics and Genomics, College of Medical and Health Sciences, United Arab Emirates University, Tawam road, Al Maqam district, Al Ain, Abu Dhabi, United Arab Emirates
- Bassam R Ali
  - Health Data Science Lab, Department of Genetics and Genomics, College of Medical and Health Sciences, United Arab Emirates University, Tawam road, Al Maqam district, Al Ain, Abu Dhabi, United Arab Emirates
- Fatma Al-Jasmi
  - Health Data Science Lab, Department of Genetics and Genomics, College of Medical and Health Sciences, United Arab Emirates University, Tawam road, Al Maqam district, Al Ain, Abu Dhabi, United Arab Emirates
  - Division of Metabolic Genetics, Department of Pediatrics, Tawam Hospital, Al Ain, United Arab Emirates
- Richard O Sinnott
  - School of Computing and Information System, Faculty of Engineering and Information Technology, The University of Melbourne, Melbourne, Victoria, Australia
- Noura Al Dhaheri
  - Health Data Science Lab, Department of Genetics and Genomics, College of Medical and Health Sciences, United Arab Emirates University, Tawam road, Al Maqam district, Al Ain, Abu Dhabi, United Arab Emirates
  - Division of Metabolic Genetics, Department of Pediatrics, Tawam Hospital, Al Ain, United Arab Emirates
- Mohd Saberi Mohamad
  - Health Data Science Lab, Department of Genetics and Genomics, College of Medical and Health Sciences, United Arab Emirates University, Tawam road, Al Maqam district, Al Ain, Abu Dhabi, United Arab Emirates
2. Akula S, Mullaguri SC, Melton NM, Katta A, Naga VSGR, Kandula S, Pedada RK, Subramanian J, Kancha RK. Large-scale pathogenicity prediction analysis of cancer-associated kinase mutations reveals variability in sensitivity and specificity of computational methods. Cancer Med 2023; 12:17468-17474. [PMID: 37409618 PMCID: PMC10501281 DOI: 10.1002/cam4.6324] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Revised: 05/26/2023] [Accepted: 06/27/2023] [Indexed: 07/07/2023] Open
Abstract
BACKGROUND: Mutations in kinases are the most frequent genetic alterations in cancer; however, experimental evidence establishing their cancerous nature is available only for a small fraction of these mutants. AIMS: Prediction analysis of kinome mutations is the primary aim of this study. A further objective is to compare the performance of various software tools in pathogenicity prediction of kinase mutations. MATERIALS AND METHODS: We employed a set of computational tools to predict the pathogenicity of over forty-two thousand mutations and deposited the kinase-wise data in the Mendeley database (Estimated Pathogenicity of Kinase Mutants [EPKiMu]). RESULTS: Mutations are more likely to be drivers when present in the kinase domain (vs. non-kinase domain) and when belonging to hotspot residues (vs. non-hotspot residues). We identified that, while predictive tools have low specificity in general, PolyPhen-2 had the best accuracy. Further efforts to combine all four tools by consensus, voting, or other simple methods did not significantly improve accuracy. DISCUSSION: The study provides a large dataset of kinase mutations along with their predicted pathogenicity that can be used as a training set for future studies. Furthermore, a comparison of the sensitivity and specificity of commonly used computational tools is presented. CONCLUSION: Primary-structure-based in silico tools identified more cancerous/deleterious mutations in the kinase domains and at the hotspot residues while having higher sensitivity than specificity in detecting deleterious mutations.
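The abstract above compares tools by sensitivity, specificity, and accuracy, and mentions combining four tools by voting. The following is a minimal sketch of how such a comparison could be computed, not the authors' pipeline: the tool names, the calls, and the tie rule (a variant is called deleterious only on a strict majority) are all assumptions made for illustration, and the data are invented toy values.

```python
from typing import Dict, List


def confusion_metrics(truth: List[bool], calls: List[bool]) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy for binary calls (True = deleterious)."""
    tp = sum(t and c for t, c in zip(truth, calls))
    tn = sum((not t) and (not c) for t, c in zip(truth, calls))
    fp = sum((not t) and c for t, c in zip(truth, calls))
    fn = sum(t and (not c) for t, c in zip(truth, calls))
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / len(truth),
    }


def majority_vote(per_tool_calls: List[List[bool]]) -> List[bool]:
    """Call a variant deleterious only if more than half of the tools say so.

    Assumption: ties (e.g. 2 of 4 tools) are resolved as benign.
    """
    n_tools = len(per_tool_calls)
    return [sum(votes) * 2 > n_tools for votes in zip(*per_tool_calls)]


# Hypothetical toy data: six variants, four hypothetical tools.
truth = [True, True, True, False, False, False]
tool_calls = {
    "tool_a": [True, True, True, True, True, False],
    "tool_b": [True, True, False, True, False, False],
    "tool_c": [True, False, True, True, False, True],
    "tool_d": [True, True, True, False, True, False],
}

consensus = majority_vote(list(tool_calls.values()))
consensus_metrics = confusion_metrics(truth, consensus)

for name, calls in tool_calls.items():
    print(name, confusion_metrics(truth, calls))
print("consensus", consensus_metrics)
```

On this toy input the consensus keeps every true positive but still inherits a false positive shared by three tools, which illustrates why simple voting need not improve specificity when the individual tools err in the same direction.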
Affiliation(s)
- Sravani Akula
  - Molecular Medicine and Therapeutics Laboratory, CPMB, Osmania University, Hyderabad, India
- Niklas Max Melton
  - Thoracic Oncology, Inova Schar Cancer Institute, Fairfax, Virginia, USA
  - Applied Computational Intelligence Lab, Missouri University of Science and Technology, Rolla, Missouri, USA
- Archana Katta
  - Molecular Medicine and Therapeutics Laboratory, CPMB, Osmania University, Hyderabad, India
- Shyamson Kandula
  - Molecular Medicine and Therapeutics Laboratory, CPMB, Osmania University, Hyderabad, India
- Raj Kumar Pedada
  - Molecular Medicine and Therapeutics Laboratory, CPMB, Osmania University, Hyderabad, India
- Rama Krishna Kancha
  - Molecular Medicine and Therapeutics Laboratory, CPMB, Osmania University, Hyderabad, India
3. Aguirre J, Padilla N, Özkan S, Riera C, Feliubadaló L, de la Cruz X. Choosing Variant Interpretation Tools for Clinical Applications: Context Matters. Int J Mol Sci 2023; 24:11872. [PMID: 37511631 PMCID: PMC10380979 DOI: 10.3390/ijms241411872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 07/10/2023] [Accepted: 07/20/2023] [Indexed: 07/30/2023] Open
Abstract
Pathogenicity predictors are computational tools that classify genetic variants as benign or pathogenic; this is currently a major challenge in genomic medicine. With more than fifty such predictors available, selecting the most suitable tool for clinical applications like genetic screening, molecular diagnostics, and companion diagnostics has become increasingly challenging. To address this issue, we have developed a cost-based framework that naturally considers the various components of the problem. This framework encodes clinical scenarios using a minimal set of parameters and treats pathogenicity predictors as rejection classifiers, a common practice in clinical applications where low-confidence predictions are routinely rejected. We illustrate our approach in four examples where we compare different numbers of pathogenicity predictors for missense variants. Our results show that no single predictor is optimal for all clinical scenarios and that considering rejection yields a different perspective on classifiers.
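The idea of treating a predictor as a rejection classifier and scoring it by cost can be illustrated with a small sketch. This is not the framework from the paper: the two score thresholds, the three cost values, and the toy scores are hypothetical parameters chosen only to show how rejecting low-confidence predictions changes the average cost.

```python
from typing import List, Optional


def classify_with_rejection(score: float, low: float, high: float) -> Optional[bool]:
    """Return True (pathogenic), False (benign), or None (prediction rejected).

    Scores at or above `high` are called pathogenic, scores at or below `low`
    are called benign, and anything strictly in between is rejected.
    """
    if score >= high:
        return True
    if score <= low:
        return False
    return None


def average_cost(
    scores: List[float],
    labels: List[bool],
    low: float,
    high: float,
    cost_fp: float,
    cost_fn: float,
    cost_reject: float,
) -> float:
    """Mean cost per variant; correct calls cost nothing (an assumption)."""
    total = 0.0
    for score, is_pathogenic in zip(scores, labels):
        call = classify_with_rejection(score, low, high)
        if call is None:
            total += cost_reject
        elif call and not is_pathogenic:
            total += cost_fp
        elif (not call) and is_pathogenic:
            total += cost_fn
    return total / len(scores)


# Hypothetical predictor scores in [0, 1] and true labels (True = pathogenic).
scores = [0.95, 0.80, 0.55, 0.45, 0.20, 0.05]
labels = [True, True, False, True, False, False]

# Scenario where a missed pathogenic variant is five times worse than a false alarm.
costs = dict(cost_fp=1.0, cost_fn=5.0, cost_reject=0.5)

cost_no_rejection = average_cost(scores, labels, low=0.5, high=0.5, **costs)
cost_with_rejection = average_cost(scores, labels, low=0.3, high=0.7, **costs)

print("no rejection:  ", cost_no_rejection)
print("with rejection:", cost_with_rejection)
```

Changing the cost values models a different clinical scenario, and the thresholds that minimize the average cost generally change with them, which is one way to read the abstract's point that no single predictor or setting is optimal everywhere.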
Affiliation(s)
- Josu Aguirre
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, P/Vall d'Hebron, 119-129, 08035 Barcelona, Spain
- Natàlia Padilla
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, P/Vall d'Hebron, 119-129, 08035 Barcelona, Spain
- Selen Özkan
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, P/Vall d'Hebron, 119-129, 08035 Barcelona, Spain
- Casandra Riera
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, P/Vall d'Hebron, 119-129, 08035 Barcelona, Spain
- Lídia Feliubadaló
  - Hereditary Cancer Program, Program in Molecular Mechanisms and Experimental Therapy in Oncology (Oncobell), IDIBELL, Catalan Institute of Oncology, 08908 L'Hospitalet de Llobregat, Spain
  - Centro de Investigación Biomédica en Red de Cáncer (CIBERONC), 28929 Madrid, Spain
- Xavier de la Cruz
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, P/Vall d'Hebron, 119-129, 08035 Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain
4. Cevik S, Biswas SB, Biswas-Fiss EE. Structural and Pathogenic Impacts of ABCA4 Variants in Retinal Degenerations: An In-Silico Study. Int J Mol Sci 2023; 24:7280. [PMID: 37108442 PMCID: PMC10138569 DOI: 10.3390/ijms24087280] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/24/2023] [Revised: 04/11/2023] [Accepted: 04/12/2023] [Indexed: 04/29/2023] Open
Abstract
The retina-specific ATP-binding cassette transporter protein ABCA4 is responsible for properly continuing the visual cycle by removing toxic retinoid byproducts of phototransduction. Functional impairment caused by ABCA4 sequence variations is the leading cause of autosomal recessive inherited retinal disorders, including Stargardt disease, retinitis pigmentosa, and cone-rod dystrophy. To date, more than 3000 ABCA4 genetic variants have been identified, approximately 40 percent of which could not be classified for pathogenicity. This study examined 30 missense ABCA4 variants using AlphaFold2 protein modeling and computational structure analysis for pathogenicity prediction. All variants classified as pathogenic (n = 10) were found to have deleterious structural consequences. Eight of the ten benign variants were structurally neutral, while the remaining two resulted in mild structural changes. This study's results provided multiple lines of computational pathogenicity evidence for eight ABCA4 variants of uncertain clinical significance. Overall, in silico analyses of ABCA4 can provide a valuable tool for understanding the molecular mechanisms of retinal degeneration and their pathogenic impact.
Affiliation(s)
- Senem Cevik
  - Department of Medical and Molecular Sciences, College of Health Sciences, University of Delaware, 16 West Main Street, Suite 302 WHL, Newark, DE 19716, USA
  - Ammon Pinizzotto Biopharmaceutical Innovation Center, 590 Avenue 1743, Newark, DE 19713, USA
- Subhasis B Biswas
  - Department of Medical and Molecular Sciences, College of Health Sciences, University of Delaware, 16 West Main Street, Suite 302 WHL, Newark, DE 19716, USA
  - Ammon Pinizzotto Biopharmaceutical Innovation Center, 590 Avenue 1743, Newark, DE 19713, USA
- Esther E Biswas-Fiss
  - Department of Medical and Molecular Sciences, College of Health Sciences, University of Delaware, 16 West Main Street, Suite 302 WHL, Newark, DE 19716, USA
  - Ammon Pinizzotto Biopharmaceutical Innovation Center, 590 Avenue 1743, Newark, DE 19713, USA
5. Garcia FADO, de Andrade ES, Palmero EI. Insights on variant analysis in silico tools for pathogenicity prediction. Front Genet 2022; 13:1010327. [PMID: 36568376 PMCID: PMC9774026 DOI: 10.3389/fgene.2022.1010327] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 08/02/2022] [Accepted: 11/14/2022] [Indexed: 12/03/2022] Open
Abstract
Molecular biology is currently a fast-advancing science. Sequencing techniques are getting cheaper, but the interpretation of genetic variants requires expertise and computational power and therefore remains a challenge. Next-generation sequencing yields thousands of variants, and to classify them researchers propose protocols with several parameters. Here we present a review of several in silico pathogenicity prediction tools involved in the variant prioritization/classification process used by some international protocols for variant analysis, and of studies evaluating their efficiency.
Affiliation(s)
- Edenir Inez Palmero
  - Molecular Oncology Research Center, Barretos Cancer Hospital, Barretos, Brazil
  - National Institute of Cancer, Rio de Janeiro, Brazil
6. Quinodoz M, Peter VG, Cisarova K, Royer-Bertrand B, Stenson PD, Cooper DN, Unger S, Superti-Furga A, Rivolta C. Analysis of missense variants in the human genome reveals widespread gene-specific clustering and improves prediction of pathogenicity. Am J Hum Genet 2022; 109:457-470. [PMID: 35120630 DOI: 10.1016/j.ajhg.2022.01.006] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 08/20/2021] [Accepted: 01/11/2022] [Indexed: 12/11/2022] Open
Abstract
We used a machine learning approach to analyze the within-gene distribution of missense variants observed in hereditary conditions and cancer. When applied to 840 genes from the ClinVar database, this approach detected a significant non-random distribution of pathogenic and benign variants in 387 (46%) and 172 (20%) genes, respectively, revealing that variant clustering is widespread across the human exome. This clustering likely occurs as a consequence of mechanisms shaping pathogenicity at the protein level, as illustrated by the overlap of some clusters with known functional domains. We then took advantage of these findings to develop a pathogenicity predictor, MutScore, that integrates qualitative features of DNA substitutions with the new additional information derived from this positional clustering. Using a random forest approach, MutScore was able to identify pathogenic missense mutations with very high accuracy, outperforming existing predictive tools, especially for variants associated with autosomal-dominant disease and cancer. Thus, the within-gene clustering of pathogenic and benign DNA changes is an important and previously underappreciated feature of the human exome, which can be harnessed to improve the prediction of pathogenicity and disambiguation of DNA variants of uncertain significance.
7. Mahecha D, Nuñez H, Lattig MC, Duitama J. Machine Learning Models for Accurate Prioritization of Variants of Uncertain Significance. Hum Mutat 2022; 43:449-460. [PMID: 35143088 DOI: 10.1002/humu.24339] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/24/2020] [Revised: 01/04/2022] [Accepted: 01/23/2022] [Indexed: 11/08/2022]
Abstract
The growing use of next-generation sequencing technologies in genetic diagnosis has produced an exponential increase in the number of Variants of Uncertain Significance (VUS). In this manuscript we compare three machine learning methods to classify VUS as Pathogenic or Non-pathogenic, implementing a Random Forest (RF), a Support Vector Machine (SVM), and a Multilayer Perceptron (MLP). To train the models, we extracted high-quality variants from ClinVar that were previously classified as VUS. For each variant, we retrieved 9 conservation scores, the loss of function tool and allele frequencies. For the RF and SVM models, hyperparameters were tuned using cross-validation with a grid search. The three models were tested on a non-overlapping set of variants that had been classified as VUS at any time during the previous three years but had been reclassified in August 2020. The three models yielded superior accuracy on this set compared to the benchmarked tools. The RF-based model yielded the best performance across different variant types and was used to create VusPrize, an open source software tool for prioritization of variants of uncertain significance. We believe that our model can improve the process of genetic diagnosis in research and clinical settings.
Affiliation(s)
- Daniel Mahecha
  - SIGEN, Alianza Universidad de los Andes - Fundación Santa Fe de Bogota, Colombia
  - Systems and Computing Engineering Department, Universidad de los Andes, Colombia
- Haydemar Nuñez
  - Systems and Computing Engineering Department, Universidad de los Andes, Colombia
- Maria C Lattig
  - SIGEN, Alianza Universidad de los Andes - Fundación Santa Fe de Bogota, Colombia
  - Facultad de Ciencias, Universidad de los Andes
- Jorge Duitama
  - Systems and Computing Engineering Department, Universidad de los Andes, Colombia
8. Chen HC, Wang J, Liu Q, Shyr Y. A domain damage index to prioritizing the pathogenicity of missense variants. Hum Mutat 2021; 42:1503-1517. [PMID: 34350656 PMCID: PMC8511099 DOI: 10.1002/humu.24269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2020] [Revised: 07/08/2021] [Accepted: 07/30/2021] [Indexed: 11/09/2022]
Abstract
Prioritizing causal variants is one major challenge for the clinical application of sequencing data. Prompted by the observation that 74.3% of missense pathogenic variants locate in protein domains, we developed an approach named domain damage index (DDI). DDI identifies protein domains depleted of rare missense variations in the general population, which can be further used as a metric to prioritize variants. DDI is significantly correlated with phylogenetic conservation, variant-level metrics, and reported pathogenicity. DDI achieved great performance for distinguishing pathogenic variants from benign ones in three benchmark datasets. The combination of DDI with the other two best approaches improved the performance of each individual method considerably, suggesting DDI provides a powerful and complementary way of variant prioritization.
Affiliation(s)
- Hua-Chang Chen
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
  - Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Jing Wang
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
  - Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Qi Liu
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
  - Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Yu Shyr
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
  - Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
9. Blake S, Hemming I, Heng JIT, Agostino M. Structure-Based Approaches to Classify the Functional Impact of ZBTB18 Missense Variants in Health and Disease. ACS Chem Neurosci 2021; 12:979-989. [PMID: 33621064 DOI: 10.1021/acschemneuro.0c00758] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 01/18/2023] Open
Abstract
The Cys2His2 type zinc finger is a motif found in many eukaryotic transcription factor proteins that facilitates binding to genomic DNA so as to influence cellular gene expression. One such transcription factor is ZBTB18, characterized as a repressor that orchestrates the development of mammalian tissues including skeletal muscle and brain during embryogenesis. In humans, it has been recognized that disease-associated ZBTB18 missense variants mapping to the coding sequence of the zinc finger domain influence sequence-specific DNA binding, disrupt transcriptional regulation, and impair neural circuit formation in the brain. Furthermore, general population ZBTB18 missense variants that influence DNA binding and transcriptional regulation have also been documented within this domain; however, the molecular traits that explain why some variants cause disease while others do not are poorly understood. Here, we have applied five structure-based approaches to evaluate their ability to discriminate between disease-associated and general population ZBTB18 missense variants. We found that thermodynamic integration and Residue Scanning in the Schrodinger Biologics Suite were the best approaches for distinguishing disease-associated variants from general population variants. Our results demonstrate the effectiveness of structure-based approaches for the functional characterization of missense alleles to DNA binding, zinc finger transcription factor protein-coding genes that underlie human health and disease.
Affiliation(s)
- Steven Blake
  - Curtin Health Innovation Research Institute, Curtin University, Bentley, Western Australia 6102, Australia
  - Ralph and Patricia Sarich Neuroscience Research Institute, Nedlands, Western Australia 6009, Australia
  - School of Pharmacy and Biomedical Sciences, Curtin University, Bentley, Western Australia 6845, Australia
- Isabel Hemming
  - Curtin Health Innovation Research Institute, Curtin University, Bentley, Western Australia 6102, Australia
  - Ralph and Patricia Sarich Neuroscience Research Institute, Nedlands, Western Australia 6009, Australia
  - The Faculty of Health and Medical Sciences, Medical School, The University of Western Australia, Crawley, Western Australia 6009, Australia
- Julian Ik-Tsen Heng
  - Curtin Health Innovation Research Institute, Curtin University, Bentley, Western Australia 6102, Australia
  - Ralph and Patricia Sarich Neuroscience Research Institute, Nedlands, Western Australia 6009, Australia
- Mark Agostino
  - Curtin Health Innovation Research Institute, Curtin University, Bentley, Western Australia 6102, Australia
  - School of Pharmacy and Biomedical Sciences, Curtin University, Bentley, Western Australia 6845, Australia
  - Curtin Institute for Computation, Curtin University, Bentley, Western Australia, Australia
10. Woodard J, Zhang C, Zhang Y. ADDRESS: A Database of Disease-associated Human Variants Incorporating Protein Structure and Folding Stabilities. J Mol Biol 2021; 433:166840. [PMID: 33539887 DOI: 10.1016/j.jmb.2021.166840] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/01/2020] [Revised: 01/17/2021] [Accepted: 01/20/2021] [Indexed: 11/22/2022]
Abstract
Numerous human diseases are caused by mutations in genomic sequences. Since amino acid changes affect protein function through mechanisms often predictable from protein structure, the integration of structural and sequence data enables us to estimate with greater accuracy whether and how a given mutation will lead to disease. Publicly available annotated databases enable hypothesis assessment and benchmarking of prediction tools. However, the results are often presented as summary statistics or black box predictors, without providing full descriptive information. We developed a new semi-manually curated human variant database presenting information on the protein contact-map, sequence-to-structure mapping, amino acid identity change, and stability prediction for the popular UniProt database. We found that the profiles of pathogenic and benign missense polymorphisms can be effectively deduced using decision trees and comparative analyses based on the presented dataset. The database is made publicly available through https://zhanglab.ccmb.med.umich.edu/ADDRESS.
11. Zhang X, Walsh R, Whiffin N, Buchan R, Midwinter W, Wilk A, Govind R, Li N, Ahmad M, Mazzarotto F, Roberts A, Theotokis PI, Mazaika E, Allouba M, de Marvao A, Pua CJ, Day SM, Ashley E, Colan SD, Michels M, Pereira AC, Jacoby D, Ho CY, Olivotto I, Gunnarsson GT, Jefferies JL, Semsarian C, Ingles J, O'Regan DP, Aguib Y, Yacoub MH, Cook SA, Barton PJR, Bottolo L, Ware JS. Disease-specific variant pathogenicity prediction significantly improves variant interpretation in inherited cardiac conditions. Genet Med 2021; 23:69-79. [PMID: 33046849 PMCID: PMC7790749 DOI: 10.1038/s41436-020-00972-3] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 06/20/2020] [Revised: 09/08/2020] [Accepted: 09/09/2020] [Indexed: 11/18/2022] Open
Abstract
PURPOSE: Accurate discrimination of benign and pathogenic rare variation remains a priority for clinical genome interpretation. State-of-the-art machine learning variant prioritization tools are imprecise and ignore important parameters defining gene-disease relationships, e.g., distinct consequences of gain-of-function versus loss-of-function variants. We hypothesized that incorporating disease-specific information would improve tool performance. METHODS: We developed a disease-specific variant classifier, CardioBoost, that estimates the probability of pathogenicity for rare missense variants in inherited cardiomyopathies and arrhythmias. We assessed CardioBoost's ability to discriminate known pathogenic from benign variants, prioritize disease-associated variants, and stratify patient outcomes. RESULTS: CardioBoost has high global discrimination accuracy (precision recall area under the curve [AUC] 0.91 for cardiomyopathies; 0.96 for arrhythmias), outperforming existing tools (4-24% improvement). CardioBoost obtains excellent accuracy (cardiomyopathies 90.2%; arrhythmias 91.9%) for variants classified with >90% confidence, and increases the proportion of variants classified with high confidence more than twofold compared with existing tools. Variants classified as disease-causing are associated with both disease status and clinical severity, including a 21% increased risk (95% confidence interval [CI] 11-29%) of severe adverse outcomes by age 60 in patients with hypertrophic cardiomyopathy. CONCLUSIONS: A disease-specific variant classifier outperforms state-of-the-art genome-wide tools for rare missense variants in inherited cardiac conditions (https://www.cardiodb.org/cardioboost/), highlighting broad opportunities for improved pathogenicity prediction through disease specificity.
Collapse
Affiliation(s)
- Xiaolei Zhang
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS, Foundation Trust London, London, United Kingdom
| | - Roddy Walsh
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS, Foundation Trust London, London, United Kingdom
| | - Nicola Whiffin
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS, Foundation Trust London, London, United Kingdom
| | - Rachel Buchan
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS, Foundation Trust London, London, United Kingdom
| | - William Midwinter
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS, Foundation Trust London, London, United Kingdom
| | - Alicja Wilk
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS, Foundation Trust London, London, United Kingdom
| | - Risha Govind
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS, Foundation Trust London, London, United Kingdom
| | - Nicholas Li
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS, Foundation Trust London, London, United Kingdom
- MRC London Institute of Medical Sciences, Imperial College London, London, United Kingdom
| | - Mian Ahmad
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS, Foundation Trust London, London, United Kingdom
| | - Francesco Mazzarotto
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiomyopathy Unit, Careggi University Hospital, Florence, Italy
- Department of Clinical and Experimental Medicine, University of Florence, Florence, Italy
| | - Angharad Roberts
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS, Foundation Trust London, London, United Kingdom
| | - Pantazis I Theotokis
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS, Foundation Trust London, London, United Kingdom
| | - Erica Mazaika
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS, Foundation Trust London, London, United Kingdom
| | - Mona Allouba
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Aswan Heart Centre, Magdi Yacoub Heart Foundation, Aswan, Egypt
| | - Antonio de Marvao
- MRC London Institute of Medical Sciences, Imperial College London, London, United Kingdom
| | | | - Sharlene M Day
- Division of Cardiovascular Medicine and Penn Cardiovascular Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
| | - Euan Ashley
- Division of Cardiovascular Medicine, Stanford University Medical Center, Stanford, CA, USA
| | - Steven D Colan
- Department of Cardiology, Boston Children's Hospital, Boston, MA, USA
| | - Michelle Michels
- Department of Cardiology, Thoraxcenter, Erasmus MC Rotterdam, Rotterdam, Netherlands
| | - Alexandre C Pereira
- Heart Institute (InCor), University of Sao Paulo Medical School, Sao Paulo, Brazil
| | - Daniel Jacoby
- Department of Internal Medicine, Yale University, New Haven, CT, USA
| | - Carolyn Y Ho
- Cardiovascular Division, Brigham and Women's Hospital, Boston, MA, USA
| | - Iacopo Olivotto
- Cardiomyopathy Unit, Careggi University Hospital, Florence, Italy
| | | | - John L Jefferies
- The Cardiovascular Institute, University of Tennessee, Memphis, TN, USA
| | - Chris Semsarian
- Centenary Institute, The University of Sydney, Sydney, Australia
- Department of Cardiology, Royal Prince Alfred Hospital, Sydney, Australia
| | - Jodie Ingles
- Centenary Institute, The University of Sydney, Sydney, Australia
| | - Declan P O'Regan
- MRC London Institute of Medical Sciences, Imperial College London, London, United Kingdom
| | - Yasmine Aguib
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Aswan Heart Centre, Magdi Yacoub Heart Foundation, Aswan, Egypt
| | - Magdi H Yacoub
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Aswan Heart Centre, Magdi Yacoub Heart Foundation, Aswan, Egypt
| | - Stuart A Cook
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS Foundation Trust, London, United Kingdom
- National Heart Centre, Singapore, Singapore
- Duke-National University of Singapore, Singapore, Singapore
| | - Paul J R Barton
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS Foundation Trust, London, United Kingdom
| | - Leonardo Bottolo
- Department of Medical Genetics, University of Cambridge, Cambridge, United Kingdom.
- Alan Turing Institute, London, United Kingdom.
- MRC Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom.
| | - James S Ware
- National Heart and Lung Institute, Imperial College London, London, United Kingdom.
- Cardiovascular Research Centre, Royal Brompton and Harefield NHS Foundation Trust, London, United Kingdom.
- MRC London Institute of Medical Sciences, Imperial College London, London, United Kingdom.
| |
|
12
|
Ranganathan Ganakammal S, Alexov E. An Ensemble Approach to Predict the Pathogenicity of Synonymous Variants. Genes (Basel) 2020; 11:E1102. [PMID: 32967157 DOI: 10.3390/genes11091102] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/14/2020] [Revised: 09/08/2020] [Accepted: 09/17/2020] [Indexed: 12/18/2022]
Abstract
Single-nucleotide variants (SNVs) are a major form of genetic variation in the human genome that contribute to various disorders. There are two types of SNVs, namely non-synonymous (missense) variants (nsSNVs) and synonymous variants (sSNVs), predominantly involved in RNA processing or gene regulation. sSNVs, unlike missense or nsSNVs, do not alter the amino acid sequences, thereby making challenging candidates for downstream functional studies. Numerous computational methods have been developed to evaluate the clinical impact of nsSNVs, but very few methods are available for understanding the effects of sSNVs. For this analysis, we have downloaded sSNVs from the ClinVar database with various features such as conservation, DNA-RNA, and splicing properties. We performed feature selection and implemented an ensemble random forest (RF) classification algorithm to build a classifier to predict the pathogenicity of the sSNVs. We demonstrate that the ensemble predictor with selected features (20 features) enhances the classification of sSNVs into two categories, pathogenic and benign, with high accuracy (87%), precision (79%), and recall (91%). Furthermore, we used this prediction model to reclassify sSNVs with unknown clinical significance. Finally, the method is very robust and can be used to predict the effect of other unknown sSNVs.
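The accuracy, precision, and recall figures quoted in this abstract are standard confusion-matrix ratios. The sketch below shows how such figures are computed for a binary pathogenic/benign classifier. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from binary confusion-matrix counts.

    tp/fn: pathogenic variants predicted pathogenic/benign.
    tn/fp: benign variants predicted benign/pathogenic.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of the variants called pathogenic, the fraction that truly are.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly pathogenic variants, the fraction that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts for illustration only (not the study's data).
acc, prec, rec = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

A classifier with higher recall than precision, as reported above, misses few pathogenic variants at the cost of more false alarms among benign ones.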
|
13
|
Guéguen P, Dupuis A, Py JY, Desprès A, Masson E, Le Marechal C, Cooper DN, Gachet C, Chen JM, Férec C. Pathogenic and likely pathogenic variants in at least five genes account for approximately 3% of mild isolated nonsyndromic thrombocytopenia. Transfusion 2020; 60:2419-2431. [PMID: 32757236 DOI: 10.1111/trf.15992] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 06/05/2019] [Revised: 06/12/2020] [Accepted: 06/15/2020] [Indexed: 12/12/2022]
Abstract
BACKGROUND Thrombocytopenia has a variety of different etiologies, both acquired and hereditary. Inherited thrombocytopenia may be associated with other symptoms (syndromic forms) or may be strictly isolated. To date, only about half of all the familial forms of thrombocytopenia have been accounted for in terms of well-defined genetic abnormalities. However, data are limited on the nature and frequency of the underlying causative genetic variants in individuals with mild isolated nonsyndromic thrombocytopenia. STUDY DESIGN AND METHODS Thirteen known or candidate genes for isolated thrombocytopenia were included in a gene panel analysis in which targeted next-generation sequencing was performed on 448 French blood donors with mild isolated nonsyndromic thrombocytopenia. RESULTS A total of 68 rare variants, including missense, splice site, frameshift, nonsense, and in-frame variants (all heterozygous) were identified in 11 of the 13 genes screened. Twenty-nine percent (N = 20) of the variants detected were absent from both the French Exome Project and gnomAD exome databases. Using stringent criteria and an unbiased approach, we classified seven predicted loss-of-function variants (three in ITGA2B and four in TUBB1) and four missense variants (one in GP1BA, two in ITGB3 and one in ACTN1) as being pathogenic or likely pathogenic. Altogether, they were found in 13 members (approx. 3%) of our studied cohort. CONCLUSION We present the results of gene panel sequencing of known and candidate thrombocytopenia genes in mild isolated nonsyndromic thrombocytopenia. Pathogenic and likely pathogenic variants in five known thrombocytopenia genes were identified, accounting for approximately 3% of individuals with the condition.
Affiliation(s)
- Paul Guéguen
- CHRU Brest, Brest, France; EFS, Univ Brest, Inserm, UMR 1078, GGB, Brest, France
| | - Arnaud Dupuis
- Université de Strasbourg, Institut National de la Santé et de la Recherche Médicale, Etablissement Français du Sang Grand Est, Unité Mixte de Recherche-S 1255, Fédération de Médecine Translationnelle de Strasbourg, Strasbourg, France
| | - Jean-Yves Py
- EFS Centre-Pays de la Loire, Site d'Orléans, Orléans, France
| | | | - Emmanuelle Masson
- CHRU Brest, Brest, France; EFS, Univ Brest, Inserm, UMR 1078, GGB, Brest, France
| | - Cédric Le Marechal
- CHRU Brest, Brest, France; EFS, Univ Brest, Inserm, UMR 1078, GGB, Brest, France
| | - David N Cooper
- Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff, UK
| | - Christian Gachet
- Université de Strasbourg, Institut National de la Santé et de la Recherche Médicale, Etablissement Français du Sang Grand Est, Unité Mixte de Recherche-S 1255, Fédération de Médecine Translationnelle de Strasbourg, Strasbourg, France
| | | | - Claude Férec
- CHRU Brest, Brest, France; EFS, Univ Brest, Inserm, UMR 1078, GGB, Brest, France
| |
|
14
|
Turinsky AL, Choufani S, Lu K, Liu D, Mashouri P, Min D, Weksberg R, Brudno M. EpigenCentral: Portal for DNA methylation data analysis and classification in rare diseases. Hum Mutat 2020; 41:1722-1733. [PMID: 32623772 DOI: 10.1002/humu.24076] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 11/18/2019] [Revised: 06/12/2020] [Accepted: 07/02/2020] [Indexed: 01/09/2023]
Abstract
Epigenetic processes play a key role in regulating gene expression. Genetic variants that disrupt chromatin-modifying proteins are associated with a broad range of diseases, some of which have specific epigenetic patterns, such as aberrant DNA methylation (DNAm), which may be used as disease biomarkers. While much of the epigenetic research has focused on cancer, there is a paucity of resources devoted to neurodevelopmental disorders (NDDs), which include autism spectrum disorder and many rare, clinically overlapping syndromes. To address this challenge, we created EpigenCentral, a free web resource for biomedical researchers, molecular diagnostic laboratories, and clinical practitioners to perform the interactive classification and analysis of DNAm data related to NDDs. It allows users to search for known disease-associated patterns in their DNAm data, classify genetic variants as pathogenic or benign to assist in molecular diagnostics, or analyze patterns of differential methylation in their data through a simple web form. EpigenCentral is freely available at http://epigen.ccm.sickkids.ca/.
Affiliation(s)
- Andrei L Turinsky
- Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada; Centre for Computational Medicine, The Hospital for Sick Children, Toronto, Ontario, Canada
| | - Sanaa Choufani
- Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
| | - Kevin Lu
- Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada; Centre for Computational Medicine, The Hospital for Sick Children, Toronto, Ontario, Canada
| | - Da Liu
- Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada; Centre for Computational Medicine, The Hospital for Sick Children, Toronto, Ontario, Canada
| | - Pouria Mashouri
- Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada; Centre for Computational Medicine, The Hospital for Sick Children, Toronto, Ontario, Canada
| | - Daniel Min
- Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada; Centre for Computational Medicine, The Hospital for Sick Children, Toronto, Ontario, Canada
| | - Rosanna Weksberg
- Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada; Division of Clinical and Metabolic Genetics, The Hospital for Sick Children, Toronto, Ontario, Canada; Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada; Department of Pediatrics, University of Toronto, Toronto, Ontario, Canada; Institute of Medical Science, School of Graduate Studies, University of Toronto, Toronto, Ontario, Canada
| | - Michael Brudno
- Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada; Centre for Computational Medicine, The Hospital for Sick Children, Toronto, Ontario, Canada; Department of Computer Science, University of Toronto, Toronto, Ontario, Canada; Techna Institute, University Health Network, Toronto, Ontario, Canada
| |
|
15
|
Vihinen M. Problems in variation interpretation guidelines and in their implementation in computational tools. Mol Genet Genomic Med 2020; 8:e1206. [PMID: 32160417 PMCID: PMC7507483 DOI: 10.1002/mgg3.1206] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/14/2020] [Accepted: 02/23/2020] [Indexed: 12/12/2022]
Abstract
Background: ACMG/AMP and AMP/ASCO/CAP have released guidelines for variation interpretation, and ESHG for diagnostic sequencing. These guidelines contain recommendations including the use of computational prediction methods. The guidelines per se and the way they are implemented cause some problems. Methods: Logical reasoning based on domain knowledge. Results: According to the guidelines, several methods have to be used and they have to agree. This means that the methods with the poorest performance overrule the better ones. The choice of the prediction method(s) should be made by experts based on systematic benchmarking studies reporting all the relevant performance measures. Currently variation interpretation methods have been applied mainly to amino acid substitutions and splice site variants; however, predictors for some other types of variations are available and there will be tools for new application areas in the near future. Common problems in prediction method usage are discussed. The number of features used for method training or the number of variation types predicted by a tool are not indicators of method performance. Many published gene, protein or disease‐specific benchmark studies suffer from datasets that are too small, rendering the results useless. In the case of binary predictors, an equal number of positive and negative cases is beneficial for training; any imbalance has to be corrected for in performance assessment. Predictors cannot be better than the data they are based on and used for training and testing. Minor allele frequency (MAF) can help to detect likely benign cases, but the recommended MAF threshold is apparently too high. The fact that many rare variants are disease‐causing or ‐related does not mean that rare variants in general would be harmful. How large a portion of the tested variants a tool can predict (coverage) is not a quality measure. Conclusion: Methods used for variation interpretation have to be carefully selected. It should be possible to use only one predictor with proven good performance, or a limited number of complementary predictors with state‐of‐the‐art performance. Bear in mind that diseases and pathogenicity form a continuum, and variants are not dichotomous, i.e. simply either pathogenic or benign.
Affiliation(s)
- Mauno Vihinen
- Department of Experimental Medical Science, Lund University, Lund, Sweden
| |
|
16
|
Alirezaie N, Kernohan KD, Hartley T, Majewski J, Hocking TD. ClinPred: Prediction Tool to Identify Disease-Relevant Nonsynonymous Single-Nucleotide Variants. Am J Hum Genet 2018; 103:474-83. [PMID: 30220433 DOI: 10.1016/j.ajhg.2018.08.005] [Citation(s) in RCA: 108] [Impact Index Per Article: 18.0] [Received: 03/20/2018] [Accepted: 08/08/2018] [Indexed: 02/08/2023]
Abstract
Advances in high-throughput DNA sequencing have revolutionized the discovery of variants in the human genome; however, interpreting the phenotypic effects of those variants is still a challenge. While several computational approaches to predict variant impact are available, their accuracy is limited and further improvement is needed. Here, we introduce ClinPred, an efficient tool for identifying disease-relevant nonsynonymous variants. Our predictor incorporates two machine learning algorithms that use existing pathogenicity scores and, notably, benefits from inclusion of normal population allele frequency from the gnomAD database as an input feature. Another major strength of our approach is the use of ClinVar-a rapidly growing database that allows selection of confidently annotated disease-causing variants-as a training set. Compared to other methods, ClinPred showed superior accuracy for predicting pathogenicity, achieving the highest area under the curve (AUC) score and increasing both the specificity and sensitivity in different test datasets. It also obtained the best performance according to various other metrics. Moreover, ClinPred performance remained robust with respect to disease type (cancer or rare disease) and mechanism (gain or loss of function). Importantly, we observed that adding allele frequency as a predictive feature-as opposed to setting fixed allele frequency cutoffs-boosts the performance of prediction. We provide pre-computed ClinPred scores for all possible human missense variants in the exome to facilitate its use by the community.
|
17
|
Abstract
Molecular genetic analysis of inherited bleeding disorders has been practised for over 30 years. Technological changes have enabled advances, from analyses using extragenic linked markers to next-generation DNA sequencing and microarray analysis. Two approaches for genetic analysis are described, each suiting their environment. The Christian Medical Centre in Vellore, India, uses conformation-sensitive gel electrophoresis mutation screening of multiplexed PCR products to identify candidate mutations, followed by Sanger sequencing confirmation of variants identified. Specific analyses for F8 intron 1 and 22 inversions are also undertaken. The MyLifeOurFuture US project between the American Thrombosis and Hemostasis Network, the National Hemophilia Foundation, Bloodworks Northwest and Biogen uses molecular inversion probes (MIP) to capture target exons, splice sites plus 5' and 3' sequences and to detect F8 intron 1 and 22 inversions. This allows screening for all F8 and F9 variants in one sequencing run of multiple samples (196 or 392). Sequence variants identified are subsequently confirmed by a diagnostic laboratory. After having identified variants in genes of interest through these processes, a systematic procedure determining their likely pathogenicity should be applied. Several scientific societies have prepared guidelines. Systematic analysis of the available evidence facilitates reproducible scoring of likely pathogenicity. Documentation of frequency in population databases of variant prevalence and in locus-specific mutation databases can provide initial information on likely pathogenicity. Whereas null mutations are often pathogenic, missense and splice site variants often require in silico analyses to predict likely pathogenicity and using an accepted suite of tools can help standardize their documentation.
Affiliation(s)
- E Edison
- Department of Haematology, Christian Medical College, Vellore, India
| | - B A Konkle
- Bloodworks Northwest and University of Washington, Seattle, WA, USA
| | - A C Goodeve
- Sheffield Diagnostic Genetics Service, Sheffield Children's NHS Foundation Trust, Sheffield, UK; Department of Infection, Immunity and Cardiovascular Disease, University of Sheffield, Sheffield, UK
| |
|
18
|
Salgado D, Desvignes JP, Rai G, Blanchard A, Miltgen M, Pinard A, Lévy N, Collod-Béroud G, Béroud C. UMD-Predictor: A High-Throughput Sequencing Compliant System for Pathogenicity Prediction of any Human cDNA Substitution. Hum Mutat 2016; 37:439-46. [PMID: 26842889 PMCID: PMC5067603 DOI: 10.1002/humu.22965] [Citation(s) in RCA: 90] [Impact Index Per Article: 11.3] [Received: 07/09/2015] [Accepted: 01/11/2016] [Indexed: 01/18/2023]
Abstract
Whole‐exome sequencing (WES) is increasingly applied to research and clinical diagnosis of human diseases. It typically results in large amounts of genetic variations. Depending on the mode of inheritance, only one or two correspond to pathogenic mutations responsible for the disease and present in affected individuals. Therefore, it is crucial to filter out nonpathogenic variants and limit downstream analysis to a handful of candidate mutations. We have developed a new computational combinatorial system UMD‐Predictor (http://umd‐predictor.eu) to efficiently annotate cDNA substitutions of all human transcripts for their potential pathogenicity. It combines biochemical properties, impact on splicing signals, localization in protein domains, variation frequency in the global population, and conservation through the BLOSUM62 global substitution matrix and a protein‐specific conservation among 100 species. We compared its accuracy with the seven most used and reliable prediction tools, using the largest reference variation datasets including more than 140,000 annotated variations. This system consistently demonstrated a better accuracy, specificity, Matthews correlation coefficient, diagnostic odds ratio, speed, and provided the shortest list of candidate mutations for WES. Webservices allow its implementation in any bioinformatics pipeline for next‐generation sequencing analysis. It could benefit to a wide range of users and applications varying from gene discovery to clinical diagnosis.
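Two of the comparison measures named in this abstract, the Matthews correlation coefficient (MCC) and the diagnostic odds ratio (DOR), can be computed directly from a binary confusion matrix. The sketch below uses the standard textbook definitions; the function names and counts are hypothetical and are not drawn from the UMD-Predictor benchmark.

```python
from math import sqrt


def matthews_corrcoef(tp, fp, tn, fn):
    """MCC: ranges from -1 to 1 and uses all four confusion-matrix cells."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By common convention, MCC is reported as 0 when a marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def diagnostic_odds_ratio(tp, fp, tn, fn):
    """DOR: odds of a positive call in true positives over odds in true negatives."""
    if fp == 0 or fn == 0:
        # Infinite or undefined in this case; this sketch reports it as infinity.
        return float("inf")
    return (tp * tn) / (fp * fn)


# Hypothetical counts for illustration only.
counts = dict(tp=40, fp=10, tn=45, fn=5)
print(f"MCC={matthews_corrcoef(**counts):.3f}  DOR={diagnostic_odds_ratio(**counts):.1f}")
```

Unlike plain accuracy, MCC stays informative when pathogenic and benign variants are present in unequal numbers, which is one reason it is often reported alongside specificity in tool comparisons.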
Affiliation(s)
- David Salgado
- Aix-Marseille Université, GMGF, Marseille 13385, France; Inserm, UMR_S 910, Marseille 13385, France
| | - Jean-Pierre Desvignes
- Aix-Marseille Université, GMGF, Marseille 13385, France; Inserm, UMR_S 910, Marseille 13385, France
| | - Ghadi Rai
- Aix-Marseille Université, GMGF, Marseille 13385, France; Inserm, UMR_S 910, Marseille 13385, France
| | - Arnaud Blanchard
- Aix-Marseille Université, GMGF, Marseille 13385, France; Inserm, UMR_S 910, Marseille 13385, France
| | - Morgane Miltgen
- Aix-Marseille Université, GMGF, Marseille 13385, France; Inserm, UMR_S 910, Marseille 13385, France
| | - Amélie Pinard
- Aix-Marseille Université, GMGF, Marseille 13385, France; Inserm, UMR_S 910, Marseille 13385, France
| | - Nicolas Lévy
- Aix-Marseille Université, GMGF, Marseille 13385, France; Inserm, UMR_S 910, Marseille 13385, France; APHM, Hôpital TIMONE Enfants, Laboratoire de Génétique Moléculaire, Marseille 13385, France
| | - Gwenaëlle Collod-Béroud
- Aix-Marseille Université, GMGF, Marseille 13385, France; Inserm, UMR_S 910, Marseille 13385, France
| | - Christophe Béroud
- Aix-Marseille Université, GMGF, Marseille 13385, France; Inserm, UMR_S 910, Marseille 13385, France; APHM, Hôpital TIMONE Enfants, Laboratoire de Génétique Moléculaire, Marseille 13385, France
| |
|
19
|
Vazquez M, Pons T, Brunak S, Valencia A, Izarzugaza JMG. wKinMut-2: Identification and Interpretation of Pathogenic Variants in Human Protein Kinases. Hum Mutat 2015; 37:36-42. [PMID: 26443060 DOI: 10.1002/humu.22914] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 04/02/2015] [Accepted: 09/22/2015] [Indexed: 12/31/2022]
Abstract
Most genomic alterations are tolerated while only a minor fraction disrupts molecular function sufficiently to drive disease. Protein kinases play a central biological function and the functional consequences of their variants are abundantly characterized. However, this heterogeneous information is often scattered across different sources, which makes the integrative analysis complex and laborious. wKinMut-2 constitutes a solution to facilitate the interpretation of the consequences of human protein kinase variation. Nine methods predict their pathogenicity, including a kinase-specific random forest approach. To understand the biological mechanisms causative of human diseases and cancer, information from pertinent reference knowledge bases and the literature is automatically mined, digested, and homogenized. Variants are visualized in their structural contexts and residues affecting catalytic and drug binding are identified. Known protein-protein interactions are reported. Altogether, this information is intended to assist the generation of new working hypothesis to be corroborated with ulterior experimental work. The wKinMut-2 system, along with a user manual and examples, is freely accessible at http://kinmut2.bioinfo.cnio.es, the code for local installations can be downloaded from https://github.com/Rbbt-Workflows/KinMut2.
Affiliation(s)
- Miguel Vazquez
- Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
| | - Tirso Pons
- Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
| | - Søren Brunak
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen 2200, Denmark; Center for Biological Sequence Analysis (CBS), Systems Biology Department, Technical University of Denmark (DTU), Kongens Lyngby 2800, Denmark
| | - Alfonso Valencia
- Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
| | - Jose M G Izarzugaza
- Center for Biological Sequence Analysis (CBS), Systems Biology Department, Technical University of Denmark (DTU), Kongens Lyngby 2800, Denmark
| |
|
20
|
van der Velde KJ, Kuiper J, Thompson BA, Plazzer J, van Valkenhoef G, de Haan M, Jongbloed JD, Wijmenga C, de Koning TJ, Abbott KM, Sinke R, Spurdle AB, Macrae F, Genuardi M, Sijmons RH, Swertz MA. Evaluation of CADD Scores in Curated Mismatch Repair Gene Variants Yields a Model for Clinical Validation and Prioritization. Hum Mutat 2015; 36:712-9. [PMID: 25871441 PMCID: PMC4973827 DOI: 10.1002/humu.22798] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 01/09/2015] [Accepted: 03/30/2015] [Indexed: 12/02/2022]
Abstract
Next-generation sequencing in clinical diagnostics is providing valuable genomic variant data, which can be used to support healthcare decisions. In silico tools to predict pathogenicity are crucial to assess such variants and we have evaluated a new tool, Combined Annotation Dependent Depletion (CADD), and its classification of gene variants in Lynch syndrome by using a set of 2,210 DNA mismatch repair gene variants. These had already been classified by experts from InSiGHT's Variant Interpretation Committee. Overall, we found CADD scores do predict pathogenicity (Spearman's ρ = 0.595, P < 0.001). However, we discovered 31 major discrepancies between the InSiGHT classification and the CADD scores; these were explained in favor of the expert classification using population allele frequencies, cosegregation analyses, disease association studies, or a second-tier test. Of 751 variants that could not be clinically classified by InSiGHT, CADD indicated that 47 variants were worth further study to confirm their putative pathogenicity. We demonstrate CADD is valuable in prioritizing variants in clinically relevant genes for further assessment by expert classification teams.
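The agreement between an ordinal expert classification and a continuous score, reported above as Spearman's ρ, is a rank correlation: both variables are converted to ranks (ties share the average rank) and the Pearson correlation of the ranks is taken. A minimal pure-Python sketch follows. The five-level class codes and the scores are hypothetical illustration values, not data from the study.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical data: expert class (1 = benign ... 5 = pathogenic) and a score.
expert_class = [1, 2, 3, 4, 5, 5]
score = [3.1, 8.0, 12.5, 11.0, 24.0, 30.2]
rho = spearman_rho(expert_class, score)
print(f"Spearman rho = {rho:.3f}")
```

Because only ranks matter, the measure is insensitive to how the score is scaled, which suits a comparison between categorical classes and a numeric score.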
Affiliation(s)
- K. Joeri van der Velde
- Genomics Coordination Center, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Joël Kuiper
- Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Department of Epidemiology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Bryony A. Thompson
- Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, Australia
| | - John‐Paul Plazzer
- Department of Colorectal Medicine and Genetics, Royal Melbourne Hospital, Melbourne, Australia
| | - Gert van Valkenhoef
- Department of Epidemiology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Mark de Haan
- Genomics Coordination Center, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Jan D.H. Jongbloed
- Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Cisca Wijmenga
- Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Tom J. de Koning
- Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Kristin M. Abbott
- Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Richard Sinke
- Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Amanda B. Spurdle
- Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, Australia
| | - Finlay Macrae
- Department of Colorectal Medicine and Genetics, Royal Melbourne Hospital, Melbourne, Australia
- Department of Medicine, The Royal Melbourne Hospital, University of Melbourne, Melbourne, Australia
| | - Maurizio Genuardi
- Institute of Medical Genetics, “A. Gemelli” School of Medicine, Catholic University of the Sacred Heart, Rome, Italy
| | - Rolf H. Sijmons
- Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Morris A. Swertz
- Genomics Coordination Center, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - InSiGHT Group
- Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, Australia
- Department of Colorectal Medicine and Genetics, Royal Melbourne Hospital, Melbourne, Australia
- Department of Medicine, The Royal Melbourne Hospital, University of Melbourne, Melbourne, Australia
- Institute of Medical Genetics, “A. Gemelli” School of Medicine, Catholic University of the Sacred Heart, Rome, Italy
| |
|
21
|
Mueller SC, Backes C, Haas J, Katus HA, Meder B, Meese E, Keller A. Pathogenicity prediction of non-synonymous single nucleotide variants in dilated cardiomyopathy. Brief Bioinform 2015; 16:769-79. [PMID: 25638801 DOI: 10.1093/bib/bbu054] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 10/17/2014] [Indexed: 02/03/2023]
Abstract
Non-synonymous single nucleotide variants (nsSNVs) in coding DNA regions can result in phenotypic differences between individuals; however, only some nsSNVs are causative for a certain disease. As just a fraction of respective nsSNVs is annotated in databases, computational biology tools are applied to predict the pathogenicity in silico. In addition to applications in oncology, novel molecular diagnostic tests have been developed for cardiovascular disorders as a leading cause of morbidity and mortality in industrialized nations. We explored the concordance and performance of 13 nsSNV pathogenicity prediction tools on panel sequencing results of dilated cardiomyopathy. The analyzed data set from the INHERITANCE study contained 842 nsSNVs discovered in 639 patients, screened for the full sequence of 76 genes related to cardiomyopathies. The single tools prediction revealed a surprisingly high heterogeneity and discordance based on the implemented prediction method. Known disease associations were not reported by the tools, limiting usability in clinics. Because different tools have different advantages, we combined their results. By clustering of correlated methods using similar prediction strategies and calculating a majority vote-based consensus, we found that the prediction accuracy and sensitivity can be further improved. Although challenges remain, different in silico tools bear the potential to predict the malignancy of nsSNVs, especially if different algorithms are combined. Most tools rely mainly on sequence features; beyond these, structural information is important to analyze the relationship of nsSNVs with disease phenotypes. Likewise, current tools consider single nsSNVs, which may, however, show a cumulative effect and turn neutral mutations in an ensemble into pathogenic variants.
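The consensus idea described in this abstract, grouping correlated tools and then taking a majority vote, can be sketched as a two-level vote: each cluster of similar tools casts one vote, and the final call is the majority across clusters. This is one plausible reading of the approach, not the authors' implementation. The tool names, cluster assignments, and labels below are hypothetical.

```python
from collections import Counter


def majority(labels):
    """Return the most frequent label, or None on an empty input or a tie."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def clustered_consensus(predictions, clusters):
    """Vote within each cluster of correlated tools, then across clusters."""
    cluster_votes = []
    for tools in clusters.values():
        vote = majority([predictions[t] for t in tools if t in predictions])
        if vote is not None:  # clusters without a clear call abstain
            cluster_votes.append(vote)
    return majority(cluster_votes)


# Hypothetical predictions for one variant from five made-up tools.
predictions = {
    "tool_a": "benign", "tool_b": "benign", "tool_c": "benign",
    "tool_d": "pathogenic", "tool_e": "pathogenic",
}
# Hypothetical clusters: tools a-c share a strategy and tend to agree.
clusters = {
    "conservation": ["tool_a", "tool_b", "tool_c"],
    "structure": ["tool_d"],
    "ensemble": ["tool_e"],
}
print("flat vote:     ", majority(list(predictions.values())))
print("clustered vote:", clustered_consensus(predictions, clusters))
```

In this example the flat vote is dominated by three tools that share one strategy, while the clustered vote counts that strategy once. That is the motivation for clustering before voting.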
|
22
|
Castellana S, Rónai J, Mazza T. MitImpact: an exhaustive collection of pre-computed pathogenicity predictions of human mitochondrial non-synonymous variants. Hum Mutat 2014; 36:E2413-22. [PMID: 25516408 DOI: 10.1002/humu.22720] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.0] [Indexed: 11/11/2022]
Abstract
Mitochondrial DNA carries a tiny, but fundamental portion of the eukaryotic genetic code. As its nuclear counterpart, it is susceptible to point mutations. Their level of pathogenicity has been assessed for the newly discovered mutations only, leaving some degree of uncertainty on the potential impact of the unknown mutations. Here we present Mitochondrial mutation Impact (MitImpact), a queryable lightweight web interface to a reasoned collection of structurally and evolutionary annotated pathogenicity predictions, obtained by assembling pre-computed with on-the-fly-computed sets of pathogenicity estimations, for all the possible mitochondrial missense variants. It presents itself as a resource for fast and reliable evaluation of gene-specific susceptibility of unknown and verified amino acid changes. MitImpact is freely available at http://bioinformatics.css-mendel.it/ (tools section). ©2014 Wiley Periodicals, Inc.
Affiliation(s)
- Stefano Castellana
- IRCCS Casa Sollievo della Sofferenza, Istituto Mendel, Bioinformatics Unit. Viale Regina Margherita, 261. 00198, Roma, Italy
|
23
|
Kassahn KS, Scott HS, Caramins MC. Integrating massively parallel sequencing into diagnostic workflows and managing the annotation and clinical interpretation challenge. Hum Mutat 2014; 35:413-23. [PMID: 24510514 DOI: 10.1002/humu.22525] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Received: 06/26/2013] [Accepted: 01/30/2014] [Indexed: 11/07/2022]
Abstract
Massively parallel sequencing has become a powerful tool for the clinical management of patients with applications in diagnosis, guidance of treatment, prediction of drug response, and carrier screening. A considerable challenge for the clinical implementation of these technologies is the management of the vast amount of sequence data generated, in particular the annotation and clinical interpretation of genomic variants. Here, we describe annotation steps that can be automated and common strategies employed for variant prioritization. The definition of best practice standards for variant annotation and prioritization is still ongoing; at present, there is limited consensus regarding an optimal clinical sequencing pipeline. We provide considerations to help define these. For the first time, clinical genetics and genomics is not limited by our ability to sequence, but our ability to clinically interpret and use genomic information in health management. We argue that the development of standardized variant annotation and interpretation approaches and software tools implementing these warrants further support. As we gain a better understanding of the significance of genomic variation through research, patients will be able to benefit from the full scope that these technologies offer.
Affiliation(s)
- Karin S Kassahn
- Genetic and Molecular Pathology, SA Pathology, Women's and Children's Hospital, North Adelaide, South Australia, 5006, Australia; School of Molecular and Biomedical Science, University of Adelaide, Adelaide, South Australia, 5000, Australia
|
24
|
Izarzugaza JMG, Krallinger M, Valencia A. Interpretation of the consequences of mutations in protein kinases: combined use of bioinformatics and text mining. Front Physiol 2012; 3:323. [PMID: 23055974 PMCID: PMC3449330 DOI: 10.3389/fphys.2012.00323] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 05/23/2012] [Accepted: 07/23/2012] [Indexed: 11/30/2022] Open
Abstract
Protein kinases play a crucial role in a plethora of significant physiological functions and a number of mutations in this superfamily have been reported in the literature to disrupt protein structure and/or function. Computational and experimental research aims to discover the mechanistic connection between mutations in protein kinases and disease with the final aim of predicting the consequences of mutations on protein function and the subsequent phenotypic alterations. In this article, we will review the possibilities and limitations of current computational methods for the prediction of the pathogenicity of mutations in the protein kinase superfamily. In particular we will focus on the problem of benchmarking the predictions with independent gold standard datasets. We will propose a pipeline for the curation of mutations automatically extracted from the literature. Since many of these mutations are not included in the databases that are commonly used to train the computational methods to predict the pathogenicity of protein kinase mutations we propose them to build a valuable gold standard dataset in the benchmarking of a number of these predictors. Finally, we will discuss how text mining approaches constitute a powerful tool for the interpretation of the consequences of mutations in the context of disease genome analysis with particular focus on cancer.
Affiliation(s)
- Jose M G Izarzugaza
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre Madrid, Spain
|
25
|
Gunay-Aygun M, Tuchman M, Font-Montgomery E, Lukose L, Edwards H, Garcia A, Ausavarat S, Ziegler SG, Piwnica-Worms K, Bryant J, Bernardini I, Fischer R, Huizing M, Guay-Woodford L, Gahl WA. PKHD1 sequence variations in 78 children and adults with autosomal recessive polycystic kidney disease and congenital hepatic fibrosis. Mol Genet Metab 2010; 99:160-73. [PMID: 19914852 PMCID: PMC2818513 DOI: 10.1016/j.ymgme.2009.10.010] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.1] [Received: 09/07/2009] [Revised: 10/14/2009] [Accepted: 10/14/2009] [Indexed: 01/15/2023]
Abstract
PKHD1, the gene mutated in autosomal recessive polycystic kidney disease (ARPKD)/congenital hepatic fibrosis (CHF), is an exceptionally large and complicated gene that consists of 86 exons and has a number of alternatively spliced transcripts. Its longest open reading frame contains 67 exons that encode a 4074 amino acid protein called fibrocystin or polyductin. The phenotypes caused by PKHD1 mutations are similarly complicated, ranging from perinatally-fatal PKD to CHF presenting in adulthood with mild kidney disease. To date, more than 300 mutations have been described throughout PKHD1. Most reported cohorts include a large proportion of perinatal-onset ARPKD patients; mutation detection rates vary between 42% and 87%. Here we report PKHD1 sequencing results on 78 ARPKD/CHF patients from 68 families. Differing from previous investigations, our study required survival beyond 6 months and included many adults with a CHF-predominant phenotype. We identified 77 PKHD1 variants (41 novel) including 19 truncating, 55 missense, 2 splice, and 1 small in-frame deletion. Using computer-based prediction tools (GVGD, PolyPhen, SNAP), we achieved a mutation detection rate of 79%, ranging from 63% in the CHF-predominant group to 82% in the remaining families. Prediction of the pathogenicity of missense variants will remain challenging until a functional assay is available. In the meantime, use of PKHD1 sequencing data for clinical decisions requires caution, especially when only novel or rare missense variants are identified.
Affiliation(s)
- Meral Gunay-Aygun
- Medical Genetics Branch, National Human Genome Research Institute, Bethesda, MD 20892, USA.
|