1
Banerjee A, Bogetti AT, Bahar I. Accurate identification and mechanistic evaluation of pathogenic missense variants with Rhapsody-2. Proc Natl Acad Sci U S A 2025; 122:e2418100122. [PMID: 40314982 PMCID: PMC12067267 DOI: 10.1073/pnas.2418100122] [Received: 09/05/2024] [Accepted: 04/06/2025] [Indexed: 05/03/2025] Open
Abstract
Understanding the effects of missense mutations or single amino acid variants (SAVs) on protein function is crucial for elucidating the molecular basis of diseases/disorders and designing rational therapies. We introduce here Rhapsody-2, a machine learning tool for discriminating pathogenic and neutral SAVs, significantly expanding on a precursor limited by the availability of structural data. With the advent of AlphaFold2 as a powerful tool for structure prediction, Rhapsody-2 is trained on a significantly expanded dataset of 117,525 SAVs corresponding to 12,094 human proteins reported in the ClinVar database. Adopting a broad set of descriptors composed of sequence evolutionary, structural, dynamic, and energetics features in the training algorithm, Rhapsody-2 achieved an AUROC of 0.94 in 10-fold cross-validation when all SAVs of a particular test protein (mutant) were excluded from the training set. Benchmarking against a variety of testing datasets demonstrated the high performance of Rhapsody-2. While sequence evolutionary descriptors play a dominant role in pathogenicity prediction, those based on structural dynamics provide a mechanistic interpretation. Notably, residues involved in allosteric communication and those distinguished by pronounced fluctuations in the high-frequency modes of motion or subject to spatial constraints in soft modes usually give rise to pathogenicity when mutated. Overall, Rhapsody-2 provides an efficient and transparent tool for accurately predicting the pathogenicity of SAVs and unraveling the mechanistic basis of the observed behavior, thus advancing our understanding of genotype-to-phenotype relations.
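The protein-level hold-out described in this abstract (all SAVs of a test protein excluded from training) can be illustrated with a grouped fold assignment. The following is a minimal sketch and not the Rhapsody-2 implementation: the function names, the greedy balancing rule, and the toy protein identifiers are all assumptions made for illustration.

```python
from collections import defaultdict


def protein_grouped_folds(protein_ids, n_folds=10):
    """Assign variant indices to folds so that every variant of a given
    protein lands in the same fold (hypothetical helper, for illustration)."""
    by_protein = defaultdict(list)
    for idx, pid in enumerate(protein_ids):
        by_protein[pid].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest proteins first, each into the
    # fold that currently holds the fewest variants.
    for pid in sorted(by_protein, key=lambda p: (-len(by_protein[p]), p)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_protein[pid])
    return folds


def train_test_splits(protein_ids, n_folds=10):
    """Yield (train_indices, test_indices) pairs with no protein shared
    between the two sides of any split."""
    for test_idx in protein_grouped_folds(protein_ids, n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(protein_ids)) if i not in held_out]
        yield train_idx, sorted(test_idx)


# Toy data: one protein label per variant (labels are made up).
toy_proteins = ["P1", "P1", "P2", "P3", "P3", "P3", "P4"]
splits = list(train_test_splits(toy_proteins, n_folds=3))
for train_idx, test_idx in splits:
    print("test proteins:", sorted({toy_proteins[i] for i in test_idx}))
```

Splitting by protein rather than by variant avoids the situation where a model is scored on variants of a protein it has already seen during training, which is the leakage the abstract's evaluation protocol is designed to rule out.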
Affiliation(s)
- Anupam Banerjee
  - Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY 11794
  - Department of Biochemistry and Cell Biology, Renaissance School of Medicine, Stony Brook University, Stony Brook, NY 11794
- Anthony T. Bogetti
  - Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY 11794
  - Department of Biochemistry and Cell Biology, Renaissance School of Medicine, Stony Brook University, Stony Brook, NY 11794
- Ivet Bahar
  - Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY 11794
  - Department of Biochemistry and Cell Biology, Renaissance School of Medicine, Stony Brook University, Stony Brook, NY 11794
2
Tekpinar M, David L, Henry T, Carbone A. PRESCOTT: a population aware, epistatic, and structural model accurately predicts missense effects. Genome Biol 2025; 26:113. [PMID: 40329382 PMCID: PMC12054230 DOI: 10.1186/s13059-025-03581-y] [Received: 03/24/2024] [Accepted: 04/17/2025] [Indexed: 05/08/2025] Open
Abstract
Predicting the functional impact of point mutations is a critical challenge in genomics. PRESCOTT reconstructs complete mutational landscapes, identifies mutation-sensitive regions, and categorizes missense variants as benign, pathogenic, or variants of uncertain significance. Leveraging protein sequences, structural models, and population-specific allele frequencies, PRESCOTT surpasses existing methods in classifying ClinVar variants, the ACMG dataset, and over 1800 proteins from the Human Protein Dataset. Its online server facilitates mutation effect predictions for any protein and variant, and includes a database of over 19,000 human proteins, ready for population-specific analyses. Open access to residue-specific scores offers transparency and valuable insights for genomic medicine.
Affiliation(s)
- Mustafa Tekpinar
  - Department of Computational, Quantitative and Synthetic Biology (CQSB), Sorbonne Université, CNRS, IBPS, UMR 7238, Paris, 75005, France
- Laurent David
  - Department of Computational, Quantitative and Synthetic Biology (CQSB), Sorbonne Université, CNRS, IBPS, UMR 7238, Paris, 75005, France
- Thomas Henry
  - Centre International de Recherche en Infectiologie (CIRI), Inserm U1111, Université Claude Bernard Lyon 1, CNRS, UMR5308, ENS de Lyon, Univ Lyon, Lyon, 69007, France
- Alessandra Carbone
  - Department of Computational, Quantitative and Synthetic Biology (CQSB), Sorbonne Université, CNRS, IBPS, UMR 7238, Paris, 75005, France
  - Institut Universitaire de France (IUF), Paris, France
3
Zhou K, Gheybi K, Soh PXY, Hayes VM. Evaluating variant pathogenicity prediction tools to establish African inclusive guidelines for germline genetic testing. Communications Medicine 2025; 5:157. [PMID: 40328947 PMCID: PMC12056225 DOI: 10.1038/s43856-025-00883-x] [Received: 10/06/2024] [Accepted: 04/24/2025] [Indexed: 05/08/2025] Open
Abstract
BACKGROUND Genetic germline testing is restricted for African patients. Lack of ancestrally relevant genomic data perpetuated by African diversity has resulted in European-biased curated clinical variant databases and pathogenic prediction guidelines. While numerous variant pathogenicity prediction tools (VPPTs) exist, their performance has yet to be established within the context of African diversity. METHODS To address this limitation, we assessed 54 VPPTs for predictive performance (sensitivity, specificity, false positive and negative rates) across 145,291 known pathogenic or benign variants derived from 50 Southern African and 50 European men matched for advanced prostate cancer. Prioritising VPPTs for optimal ancestral performance, we screened 5.3 million variants of unknown significance for predicted functional and oncogenic potential. RESULTS We observe a 2.1- and 4.1-fold increase in the number of known and predicted rare pathogenic or benign variants, respectively, against a 1.6-fold decrease in the number of available interrogated variants in our European over African data. Although sensitivity was significantly lower for our African data overall (0.66 vs 0.71, p = 9.86E-06), MetaSVM, CADD, Eigen-raw, BayesDel-noAF, phyloP100way-vertebrate and MVP outperformed irrespective of ancestry. Conversely, MutationTaster, DANN, LRT and GERP-RS were African-specific top performers, while MutationAssessor, PROVEAN, LIST-S2 and REVEL are European-specific. Using these pathogenic prediction workflows, we narrow the ancestral gap for potentially deleterious and oncogenic variant prediction in favour of our African data by 1.15- and 1.1-fold, respectively. CONCLUSION Although VPPT sensitivity favours European data, our findings provide guidelines for VPPT selection to maximise rare pathogenic variant prediction for African disease studies.
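The four performance measures named in the METHODS section (sensitivity, specificity, false positive rate, false negative rate) all follow from one confusion matrix. The sketch below shows the arithmetic on made-up labels; the function name and the toy data are assumptions, and nothing here reproduces the study's pipeline or its numbers.

```python
def confusion_rates(y_true, y_pred):
    """Compute sensitivity, specificity, and the two error rates for binary
    labels, where 1 means pathogenic and 0 means benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one pathogenic and one benign variant")
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity
        "false_negative_rate": fn / (fn + tp),  # 1 - sensitivity
    }


# Toy labels for eight variants (illustrative only).
known = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
rates = confusion_rates(known, predicted)
print(rates)
```

Computing these rates separately for each ancestry group, using the same tool and threshold, is the kind of comparison the abstract summarises when it reports a sensitivity of 0.66 versus 0.71.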
Affiliation(s)
- Kangping Zhou
  - Ancestry and Health Genomics Laboratory, Charles Perkins Centre, School of Medical Sciences, Faculty of Medicine and Health, University of Sydney, Camperdown, Sydney, NSW, Australia
- Kazzem Gheybi
  - Ancestry and Health Genomics Laboratory, Charles Perkins Centre, School of Medical Sciences, Faculty of Medicine and Health, University of Sydney, Camperdown, Sydney, NSW, Australia
- Pamela X Y Soh
  - Ancestry and Health Genomics Laboratory, Charles Perkins Centre, School of Medical Sciences, Faculty of Medicine and Health, University of Sydney, Camperdown, Sydney, NSW, Australia
- Vanessa M Hayes
  - Ancestry and Health Genomics Laboratory, Charles Perkins Centre, School of Medical Sciences, Faculty of Medicine and Health, University of Sydney, Camperdown, Sydney, NSW, Australia
  - Manchester Cancer Research Centre, University of Manchester, Manchester, UK
  - School of Health Systems and Public Health, Faculty of Health Sciences, University of Pretoria, Pretoria, South Africa
4
Livesey BJ, Marsh JA. Variant effect predictor correlation with functional assays is reflective of clinical classification performance. Genome Biol 2025; 26:104. [PMID: 40264194 PMCID: PMC12016141 DOI: 10.1186/s13059-025-03575-w] [Received: 05/12/2024] [Accepted: 04/11/2025] [Indexed: 04/24/2025] Open
Abstract
BACKGROUND Understanding the relationship between protein sequence and function is crucial for accurate classification of missense variants. Variant effect predictors (VEPs) play a vital role in deciphering this complex relationship, yet evaluating their performance remains challenging for several reasons, including data circularity, where the same or related data is used for training and assessment. High-throughput experimental strategies like deep mutational scanning (DMS) offer a promising solution. RESULTS In this study, we extend upon our previous benchmarking approach, assessing the performance of 97 VEPs using missense DMS measurements from 36 different human proteins. In addition, a new pairwise, VEP-centric approach mitigates the impact of missing predictions on overall performance comparison. We observe a strong correspondence between VEP performance in DMS-based benchmarks and clinical variant classification, especially for predictors that have not been directly trained on human clinical variants. CONCLUSIONS Our results suggest that comparing VEP performance against diverse functional assays represents a reliable strategy for assessing their relative performance in clinical variant classification. However, major challenges in clinical interpretation of VEP scores persist, highlighting the need for further research to fully leverage computational predictors for genetic diagnosis. We also address practical considerations for end users in terms of choice of methodology.
Affiliation(s)
- Benjamin J Livesey
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Joseph A Marsh
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
5
Zhao Y, Lan T, Zhong G, Hagen J, Pan H, Chung WK, Shen Y. A probabilistic graphical model for estimating selection coefficient of nonsynonymous variants from human population sequence data. medRxiv 2025:2023.12.11.23299809. [PMID: 38168397 PMCID: PMC10760286 DOI: 10.1101/2023.12.11.23299809] [Indexed: 01/05/2024]
Abstract
Accurately predicting the effect of missense variants is important in discovering disease risk genes and clinical genetic diagnostics. Commonly used computational methods predict pathogenicity, which does not capture the quantitative impact on fitness in humans. We developed a method, MisFit, to estimate missense fitness effect using a graphical model. MisFit jointly models the effect at a molecular level (𝑑) and a population level (selection coefficient, 𝑠), assuming that in the same gene, missense variants with similar 𝑑 have similar 𝑠. We trained it by maximizing probability of observed allele counts in 236,017 European individuals. We show that 𝑠 is informative in predicting allele frequency across ancestries and consistent with the fraction of de novo mutations in sites under strong selection. Further, 𝑠 outperforms previous methods in prioritizing de novo missense variants in individuals with neurodevelopmental disorders. In conclusion, MisFit accurately predicts 𝑠 and yields new insights from genomic data.
Affiliation(s)
- Yige Zhao
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
  - The Integrated Program in Cellular, Molecular, and Biomedical Studies, Columbia University, New York, NY 10032
- Tian Lan
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Guojie Zhong
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
  - The Integrated Program in Cellular, Molecular, and Biomedical Studies, Columbia University, New York, NY 10032
- Jake Hagen
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
  - Department of Pediatrics, Boston Children's Hospital and Harvard Medical School, Boston, MA 02115
- Hongbing Pan
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032
- Wendy K. Chung
  - Department of Pediatrics, Boston Children's Hospital and Harvard Medical School, Boston, MA 02115
- Yufeng Shen
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032
  - JP Sulzberger Columbia Genome Center, Columbia University, New York, NY 10032
6
Hamelin D, Scicluna M, Saadie I, Mostefai F, Grenier J, Baron C, Caron E, Hussin J. Predicting pathogen evolution and immune evasion in the age of artificial intelligence. Comput Struct Biotechnol J 2025; 27:1370-1382. [PMID: 40235636 PMCID: PMC11999473 DOI: 10.1016/j.csbj.2025.03.044] [Received: 12/06/2024] [Revised: 03/21/2025] [Accepted: 03/26/2025] [Indexed: 04/17/2025] Open
Abstract
The genomic diversification of viral pathogens during viral epidemics and pandemics represents a major adaptive route for infectious agents to circumvent therapeutic and public health initiatives. Historically, strategies to address viral evolution have relied on responding to emerging variants after their detection, leading to delays in effective public health responses. Because of this, a long-standing yet challenging objective has been to forecast viral evolution by predicting potentially harmful viral mutations prior to their emergence. The promises of artificial intelligence (AI) coupled with the exponential growth of viral data collection infrastructures spurred by the COVID-19 pandemic, have resulted in a research ecosystem highly conducive to this objective. Due to the COVID-19 pandemic accelerating the development of pandemic mitigation and preparedness strategies, many of the methods discussed here were designed in the context of SARS-CoV-2 evolution. However, most of these pipelines were intentionally designed to be adaptable across RNA viruses, with several strategies already applied to multiple viral species. In this review, we explore recent breakthroughs that have facilitated the forecasting of viral evolution in the context of an ongoing pandemic, with particular emphasis on deep learning architectures, including the promising potential of language models (LM). The approaches discussed here employ strategies that leverage genomic, epidemiologic, immunologic and biological information.
Affiliation(s)
- D.J. Hamelin
  - Montreal Heart Institute, Université de Montréal, Montréal, Quebec, Canada
  - Mila - Quebec AI Institute, Montréal, Quebec, Canada
  - Department of Biochemistry and Molecular Medicine, Faculty of Medicine, Université de Montréal, Montréal, Quebec, Canada
- M. Scicluna
  - Montreal Heart Institute, Université de Montréal, Montréal, Quebec, Canada
  - Mila - Quebec AI Institute, Montréal, Quebec, Canada
  - Department of Biochemistry and Molecular Medicine, Faculty of Medicine, Université de Montréal, Montréal, Quebec, Canada
- I. Saadie
  - Montreal Heart Institute, Université de Montréal, Montréal, Quebec, Canada
  - Department of Biochemistry and Molecular Medicine, Faculty of Medicine, Université de Montréal, Montréal, Quebec, Canada
- F. Mostefai
  - Montreal Heart Institute, Université de Montréal, Montréal, Quebec, Canada
  - Mila - Quebec AI Institute, Montréal, Quebec, Canada
  - Department of Biochemistry and Molecular Medicine, Faculty of Medicine, Université de Montréal, Montréal, Quebec, Canada
- J.C. Grenier
  - Montreal Heart Institute, Université de Montréal, Montréal, Quebec, Canada
- C. Baron
  - Montreal Heart Institute, Université de Montréal, Montréal, Quebec, Canada
  - Mila - Quebec AI Institute, Montréal, Quebec, Canada
  - Department of Biochemistry and Molecular Medicine, Faculty of Medicine, Université de Montréal, Montréal, Quebec, Canada
- E. Caron
  - CHU Sainte-Justine Research Center, Université de Montréal, Montréal, Quebec, Canada
  - Yale Center for Immuno-Oncology, Yale Center for Systems and Engineering Immunology, Yale Center for Infection and Immunity, Yale School of Medicine, New Haven, CT, USA
- J.G. Hussin
  - Montreal Heart Institute, Université de Montréal, Montréal, Quebec, Canada
  - Mila - Quebec AI Institute, Montréal, Quebec, Canada
  - Department of Biochemistry and Molecular Medicine, Faculty of Medicine, Université de Montréal, Montréal, Quebec, Canada
  - Department of Medicine, Faculty of Medicine, Université de Montréal, Montréal, Quebec, Canada
7
Radjasandirane R, Diharce J, Gelly JC, de Brevern AG. Insights for variant clinical interpretation based on a benchmark of 65 variant effect predictors. Genomics 2025; 117:111036. [PMID: 40127826 DOI: 10.1016/j.ygeno.2025.111036] [Received: 12/05/2024] [Revised: 02/20/2025] [Accepted: 03/20/2025] [Indexed: 03/26/2025]
Abstract
Single amino acid substitutions in protein sequences are generally harmless, but a certain number of these changes can lead to disease. Accurately predicting the effect of genetic variants is crucial for clinicians as it accelerates the diagnosis of patients with missense variants associated with health problems. Many computational tools have been developed to predict the pathogenicity of genetic variants with various approaches. Analysing the performance of these different computational tools is crucial to provide guidance to both future users and especially clinicians. In this study, a large-scale investigation of 65 tools was conducted. Variants from both clinical and functional contexts were used, incorporating data from the ClinVar database and bibliographic sources. The analysis showed that AlphaMissense often performed very well and was in fact one of the best options among the existing tools. In addition, as expected, meta-predictors perform well on average. Tools using evolutionary information showed the best performance for functional variants. These results also highlighted some heterogeneity in the difficulty of predicting some specific variants while others are always well categorized. Strikingly, the majority of variants from the ClinVar database appear to be easy to predict, while variants from other sources of data are more challenging. This raises questions about the use of ClinVar and the dataset used to validate tools accuracy. In addition, these results show that this variant predictability can be divided into three distinct classes: easy, moderate and hard to predict. We analyzed the parameters leading to these differences and showed that the classes are related to structural and functional information.
Affiliation(s)
- Ragousandirane Radjasandirane
  - Université Paris Cité and Université de la Réunion, INSERM, EFS, BIGR U1134, DSIMB Bioinformatics team, F-75015 Paris, France
- Julien Diharce
  - Université Paris Cité and Université de la Réunion, INSERM, EFS, BIGR U1134, DSIMB Bioinformatics team, F-75015 Paris, France
- Jean-Christophe Gelly
  - Université Paris Cité and Université de la Réunion, INSERM, EFS, BIGR U1134, DSIMB Bioinformatics team, F-75015 Paris, France
- Alexandre G de Brevern
  - Université Paris Cité and Université de la Réunion, INSERM, EFS, BIGR U1134, DSIMB Bioinformatics team, F-75015 Paris, France
8
Banerjee A, Bogetti A, Bahar I. Accurate Identification and Mechanistic Evaluation of Pathogenic Missense Variants with Rhapsody-2. bioRxiv 2025:2025.02.17.638727. [PMID: 40027614 PMCID: PMC11870481 DOI: 10.1101/2025.02.17.638727] [Indexed: 03/05/2025]
Abstract
Understanding the effects of missense mutations or single amino acid variants (SAVs) on protein function is crucial for elucidating the molecular basis of diseases/disorders and designing rational therapies. We introduce here Rhapsody-2, a machine learning tool for discriminating pathogenic and neutral SAVs, significantly expanding on a precursor limited by the availability of structural data. With the advent of AlphaFold2 as a powerful tool for structure prediction, Rhapsody-2 is trained on a significantly expanded dataset of 117,525 SAVs corresponding to 12,094 human proteins reported in the ClinVar database. Adopting a broad set of descriptors composed of sequence evolutionary, structural, dynamic, and energetics features in the training algorithm, Rhapsody-2 achieved an AUROC of 0.94 in 10-fold cross-validation when all SAVs of a particular test protein (mutant) were excluded from the training set. Benchmarking against a variety of testing datasets demonstrated the high performance of Rhapsody-2. While sequence evolutionary descriptors play a dominant role in pathogenicity prediction, those based on structural dynamics provide a mechanistic interpretation. Notably, residues involved in allosteric communication, and those distinguished by pronounced fluctuations in the high frequency modes of motion or subject to spatial constraints in soft modes usually give rise to pathogenicity when mutated. Overall, Rhapsody-2 provides an efficient and transparent tool for accurately predicting the pathogenicity of SAVs and unraveling the mechanistic basis of the observed behavior, thus advancing our understanding of genotype-to-phenotype relations.
Affiliation(s)
- Anupam Banerjee
  - Laufer Center for Physical and Quantitative Biology, Stony Brook University, New York 11794, USA
- Anthony Bogetti
  - Laufer Center for Physical and Quantitative Biology, Stony Brook University, New York 11794, USA
- Ivet Bahar
  - Laufer Center for Physical and Quantitative Biology, Stony Brook University, New York 11794, USA
  - Department of Biochemistry and Cell Biology, Renaissance School of Medicine, Stony Brook University, New York 11794, USA
9
Ozkan S, Padilla N, de la Cruz X. QAFI: a novel method for quantitative estimation of missense variant impact using protein-specific predictors and ensemble learning. Hum Genet 2025; 144:191-208. [PMID: 39048855 PMCID: PMC11976337 DOI: 10.1007/s00439-024-02692-z] [Received: 04/30/2024] [Accepted: 07/14/2024] [Indexed: 07/27/2024]
Abstract
Next-generation sequencing (NGS) has revolutionized genetic diagnostics, yet its application in precision medicine remains incomplete, despite significant advances in computational tools for variant annotation. Many variants remain unannotated, and existing tools often fail to accurately predict the range of impacts that variants have on protein function. This limitation restricts their utility in relevant applications such as predicting disease severity and onset age. In response to these challenges, a new generation of computational models is emerging, aimed at producing quantitative predictions of genetic variant impacts. However, the field is still in its early stages, and several issues need to be addressed, including improved performance and better interpretability. This study introduces QAFI, a novel methodology that integrates protein-specific regression models within an ensemble learning framework, utilizing conservation-based and structure-related features derived from AlphaFold models. Our findings indicate that QAFI significantly enhances the accuracy of quantitative predictions across various proteins. The approach has been rigorously validated through its application in the CAGI6 contest, focusing on ARSA protein variants, and further tested on a comprehensive set of clinically labeled variants, demonstrating its generalizability and robust predictive power. The straightforward nature of our models may also contribute to better interpretability of the results.
Affiliation(s)
- Selen Ozkan
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Natàlia Padilla
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Xavier de la Cruz
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
10
Zhang Y, Leung AK, Kang JJ, Sun Y, Wu G, Li L, Sun J, Cheng L, Qiu T, Zhang J, Wierbowski SD, Gupta S, Booth JG, Yu H. A multiscale functional map of somatic mutations in cancer integrating protein structure and network topology. Nat Commun 2025; 16:975. [PMID: 39856048 PMCID: PMC11760531 DOI: 10.1038/s41467-024-54176-3] [Received: 03/11/2024] [Accepted: 11/04/2024] [Indexed: 01/27/2025] Open
Abstract
A major goal of cancer biology is to understand the mechanisms driven by somatically acquired mutations. Two distinct methodologies (one analyzing mutation clustering within protein sequences and 3D structures, the other leveraging protein-protein interaction network topology) offer complementary strengths. We present NetFlow3D, a unified, end-to-end 3D structurally-informed protein interaction network propagation framework that maps the multiscale mechanistic effects of mutations. Built upon the Human Protein Structurome, which incorporates the 3D structures of every protein and the binding interfaces of all known protein interactions, NetFlow3D integrates atomic, residue, protein and network-level information: it clusters mutations on 3D protein structures to identify driver mutations and propagates their impacts anisotropically across the protein interaction network, guided by the involved interaction interfaces, to reveal systems-level impacts. Applied to 33 cancer types, NetFlow3D identifies 2 times more 3D clusters and incorporates 8 times more proteins in significantly interconnected network modules compared to traditional methods.
Affiliation(s)
- Yingying Zhang
  - Department of Computational Biology, Cornell University, Ithaca, 14853, NY, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, 14853, NY, USA
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, 14853, NY, USA
- Alden K Leung
  - Department of Computational Biology, Cornell University, Ithaca, 14853, NY, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, 14853, NY, USA
- Jin Joo Kang
  - Department of Computational Biology, Cornell University, Ithaca, 14853, NY, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, 14853, NY, USA
- Yu Sun
  - Department of Computational Biology, Cornell University, Ithaca, 14853, NY, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, 14853, NY, USA
- Guanxi Wu
  - College of Agriculture and Life Sciences, Cornell University, Ithaca, 14853, NY, USA
- Le Li
  - Department of Computational Biology, Cornell University, Ithaca, 14853, NY, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, 14853, NY, USA
- Jiayang Sun
  - Department of Computational Biology, Cornell University, Ithaca, 14853, NY, USA
- Lily Cheng
  - Department of Science and Technology Studies, Cornell University, Ithaca, 14853, NY, USA
- Tian Qiu
  - School of Electrical and Computer Engineering, Cornell University, Ithaca, 14853, NY, USA
- Junke Zhang
  - Department of Computational Biology, Cornell University, Ithaca, 14853, NY, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, 14853, NY, USA
- Shayne D Wierbowski
  - Department of Computational Biology, Cornell University, Ithaca, 14853, NY, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, 14853, NY, USA
- Shagun Gupta
  - Department of Computational Biology, Cornell University, Ithaca, 14853, NY, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, 14853, NY, USA
- James G Booth
  - Department of Computational Biology, Cornell University, Ithaca, 14853, NY, USA
  - Department of Statistics and Data Science, Cornell University, Ithaca, 14853, NY, USA
- Haiyuan Yu
  - Department of Computational Biology, Cornell University, Ithaca, 14853, NY, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, 14853, NY, USA
11
Das S, Patel V, Chakravarty S, Ghosh A, Mukhopadhyay A, Biswas NK. An ensemble machine learning-based performance evaluation identifies top In-Silico pathogenicity prediction methods that best classify driver mutations in cancer. BioData Min 2025; 18:7. [PMID: 39833905 PMCID: PMC11744934 DOI: 10.1186/s13040-024-00420-x] [Received: 10/04/2024] [Accepted: 12/26/2024] [Indexed: 01/22/2025] Open
Abstract
BACKGROUND AND OBJECTIVE Accurate identification and prioritization of driver mutations in cancer is critical for effective patient management. Despite the presence of numerous bioinformatic algorithms for estimating mutation pathogenicity, there is significant variation in their assessments. This inconsistency is evident even for well-established cancer driver mutations. This study aims to develop an ensemble machine learning approach to evaluate the performance (rank) of pathogenic and conservation scoring algorithms (PCSAs) based on their ability to distinguish pathogenic driver mutations from benign passenger (non-driver) mutations in head and neck squamous cell carcinoma (HNSC). METHODS The study used a dataset from 502 HNSC patients, classifying mutations based on 299 known high-confidence cancer driver genes. Missense somatic mutations in driver genes were treated as driver mutations, while non-driver mutations were randomly selected from other genes. Each mutation was annotated with 41 PCSAs. Three machine learning algorithms (logistic regression, random forest, and support vector machine), along with recursive feature elimination, were used to rank these PCSAs. The final ranking of the PCSAs was determined using rank-average-sort and rank-sum-sort methods. RESULTS The random forest algorithm emerged as the top performer among the three tested ML algorithms, with an AUC-ROC of 0.89, compared to 0.83 for the other two, in distinguishing pathogenic driver mutations from benign passenger mutations using all 41 PCSAs. The top 11 PCSAs were selected based on the first quintile cut-off from the final rank-sum distribution. Classifiers built using these top 11 PCSAs (DEOGEN2, Integrated_fitCons, MVP, etc.) demonstrated significantly higher performance (p-value < 2.22e-16) compared to those using the remaining 30 PCSAs across all three ML algorithms, in separating pathogenic driver from benign passenger mutations. The top PCSAs demonstrated strong performance on a validation cohort including independent HNSC and other cancer types (breast, lung, and colorectal), reflecting their consistency, robustness and generalizability. CONCLUSIONS The ensemble machine learning approach effectively evaluates the performance of PCSAs based on their ability to differentiate pathogenic drivers from benign passenger mutations in HNSC and other cancer types. Notably, some well-known PCSAs performed poorly, underscoring the importance of data-driven selection over relying solely on popularity.
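The rank-aggregation step mentioned in this abstract can be illustrated with a minimal sketch. It assumes each classifier yields an ordered list of PCSAs (best first) and combines them by summing or averaging rank positions. The function names, the toy scores `s1` to `s5`, and the count-based cut-off are hypothetical stand-ins for details the abstract does not give, so this is not the authors' implementation.

```python
from collections import defaultdict


def aggregate_ranks(rankings):
    """Combine per-model rankings (best first) into rank-sum and rank-average scores."""
    rank_sum = defaultdict(int)
    for ordered_features in rankings.values():
        for position, feature in enumerate(ordered_features, start=1):
            rank_sum[feature] += position
    n_models = len(rankings)
    rank_avg = {f: s / n_models for f, s in rank_sum.items()}
    # Lower totals mean a feature was ranked near the top by more models.
    by_sum = sorted(rank_sum, key=lambda f: (rank_sum[f], f))
    return by_sum, dict(rank_sum), rank_avg


def top_fraction(ordered, fraction=0.2):
    """Keep the leading fraction of an ordered list (at least one item).

    Assumption: a simple count-based cut-off; the study's quintile cut-off
    on the rank-sum distribution may be defined differently.
    """
    k = max(1, round(len(ordered) * fraction))
    return ordered[:k]


# Hypothetical rankings of five made-up scores from three classifiers.
toy_rankings = {
    "logistic_regression": ["s1", "s2", "s3", "s4", "s5"],
    "random_forest": ["s2", "s1", "s3", "s5", "s4"],
    "svm": ["s1", "s3", "s2", "s4", "s5"],
}
order, sums, averages = aggregate_ranks(toy_rankings)
selected = top_fraction(order)
```

Sorting by the summed ranks and by the averaged ranks gives the same order whenever every model ranks every feature, which is the case in this toy input.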
Affiliation(s)
- Subrata Das
- Biotechnology Research and Innovation Council-National Institute of Biomedical Genomics (BRIC-NIBMG), National Institute of Biomedical Genomics, Kalyani, West Bengal, India
- Vatsal Patel
- Biotechnology Research and Innovation Council-National Institute of Biomedical Genomics (BRIC-NIBMG), National Institute of Biomedical Genomics, Kalyani, West Bengal, India
- Shouvik Chakravarty
- Biotechnology Research and Innovation Council-National Institute of Biomedical Genomics (BRIC-NIBMG), National Institute of Biomedical Genomics, Kalyani, West Bengal, India
- Biotechnology Research and Innovation Council-Regional Centre for Biotechnology (BRIC-RCB), Faridabad, India
- Arnab Ghosh
- Biotechnology Research and Innovation Council-National Institute of Biomedical Genomics (BRIC-NIBMG), National Institute of Biomedical Genomics, Kalyani, West Bengal, India
- Biotechnology Research and Innovation Council-Regional Centre for Biotechnology (BRIC-RCB), Faridabad, India
- Anirban Mukhopadhyay
- Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal, 741235, India.
- Nidhan K Biswas
- Biotechnology Research and Innovation Council-National Institute of Biomedical Genomics (BRIC-NIBMG), National Institute of Biomedical Genomics, Kalyani, West Bengal, India.
12
Zhao W, Tao Y, Xiong J, Liu L, Wang Z, Shao C, Shang L, Hu Y, Xu Y, Su Y, Yu J, Feng T, Xie J, Xu H, Zhang Z, Peng J, Wu J, Zhang Y, Zhu S, Xia K, Tang B, Zhao G, Li J, Li B. GoFCards: an integrated database and analytic platform for gain of function variants in humans. Nucleic Acids Res 2025; 53:D976-D988. [PMID: 39578693 PMCID: PMC11701611 DOI: 10.1093/nar/gkae1079] [Received: 08/15/2024] [Revised: 10/20/2024] [Accepted: 10/28/2024] [Indexed: 11/24/2024]
Abstract
Gain-of-function (GOF) variants, which introduce new or amplify protein functions, are essential for understanding disease mechanisms. Despite advances in genomics and functional research, identifying and analyzing pathogenic GOF variants remains challenging owing to fragmented data and database limitations, underscoring the difficulty in accessing critical genetic information. To address this challenge, we manually reviewed the literature, pinpointing 3089 single-nucleotide variants and 72 insertions and deletions in 579 genes associated with 1299 diseases from 2069 studies, and integrated these with the 3.5 million predicted GOF variants. Our approach is complemented by a proprietary scoring system that prioritizes GOF variants on the basis of the evidence supporting their GOF effects and provides predictive scores for variants that lack existing documentation. We then developed a database named GoFCards for general geneticists and clinicians to easily obtain GOF variants in humans (http://www.genemed.tech/gofcards). This database also contains data from >150 sources and offers comprehensive variant-level and gene-level annotations, with the aim of providing users with convenient access to detailed and relevant genetic information. Furthermore, GoFCards empowers users with limited bioinformatic skills to analyze and annotate genetic data, and prioritize GOF variants. GoFCards offers an efficient platform for interpreting GOF variants and thereby advancing genetic research.
Affiliation(s)
- Wenjing Zhao
- National Clinical Research Center for Geriatric Disorders, Department of Geriatrics, Xiangya Hospital & Center for Medical Genetics, School of Life Sciences, Central South University, No. 87 Xiangya Road, Furong District, Changsha, Hunan 410008, China
- Department of Medical Genetics, NHC Key Laboratory of Healthy Birth and Birth Defect Prevention in Western China, The First People's Hospital of Yunnan Province, No. 157 Jinbi Road, Xishan District, Kunming, Yunnan 650000, China
- School of Medicine, Kunming University of Science and Technology, No. 727 Jingming South Road, Chenggong District, Kunming, Yunnan 650000, China
- Youfu Tao
- Xiangya School of Medicine, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Jiayi Xiong
- National Clinical Research Center for Geriatric Disorders, Department of Geriatrics, Xiangya Hospital & Center for Medical Genetics, School of Life Sciences, Central South University, No. 87 Xiangya Road, Furong District, Changsha, Hunan 410008, China
- Lei Liu
- School of Life Science, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Zhongqing Wang
- School of Medicine, Kunming University of Science and Technology, No. 727 Jingming South Road, Chenggong District, Kunming, Yunnan 650000, China
- Chuhan Shao
- Xiangya School of Medicine, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Ling Shang
- Xiangya School of Medicine, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Yue Hu
- Xiangya School of Medicine, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Yishu Xu
- Xiangya School of Medicine, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Yingluo Su
- Xiangya School of Medicine, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Jiahui Yu
- Xiangya School of Medicine, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Tianyi Feng
- Xiangya School of Medicine, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Junyi Xie
- School of Life Science, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Huijuan Xu
- School of Life Science, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Zijun Zhang
- School of Life Science, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Jiayi Peng
- School of Life Science, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Jianbin Wu
- School of Life Science, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Yuchang Zhang
- School of Life Science, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Shaobo Zhu
- School of Life Science, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha, Hunan 410008, China
- Kun Xia
- MOE Key Laboratory of Pediatric Rare Diseases & Hunan Key Laboratory of Medical Genetics, Central South University, No. 110 Xiangya Road, Furong District, Changsha, Hunan 410008, China
- Beisha Tang
- National Clinical Research Center for Geriatric Disorders, Department of Geriatrics, Xiangya Hospital & Center for Medical Genetics, School of Life Sciences, Central South University, No. 87 Xiangya Road, Furong District, Changsha, Hunan 410008, China
- Department of Neurology & Multi-omics Research Center for Brain Disorders, The First Affiliated Hospital University of South China, 69 Chuan Shan Road, Shi Gu District, Hengyang, Hunan 421000, China
- Key Laboratory of Hunan Province in Neurodegenerative Disorders, Department of Neurology, Xiangya Hospital, Central South University, No. 87 Xiangya Road, Furong District, Changsha, Hunan 410008, China
- Guihu Zhao
- National Clinical Research Center for Geriatric Disorders, Department of Geriatrics, Xiangya Hospital & Center for Medical Genetics, School of Life Sciences, Central South University, No. 87 Xiangya Road, Furong District, Changsha, Hunan 410008, China
- Jinchen Li
- National Clinical Research Center for Geriatric Disorders, Department of Geriatrics, Xiangya Hospital & Center for Medical Genetics, School of Life Sciences, Central South University, No. 87 Xiangya Road, Furong District, Changsha, Hunan 410008, China
- Key Laboratory of Hunan Province in Neurodegenerative Disorders, Department of Neurology, Xiangya Hospital, Central South University, No. 87 Xiangya Road, Furong District, Changsha, Hunan 410008, China
- Bioinformatics Center, Furong Laboratory & Xiangya Hospital, Central South University, No. 87 Xiangya Road, Furong District, Changsha, Hunan 410008, China
- Bin Li
- National Clinical Research Center for Geriatric Disorders, Department of Geriatrics, Xiangya Hospital & Center for Medical Genetics, School of Life Sciences, Central South University, No. 87 Xiangya Road, Furong District, Changsha, Hunan 410008, China
13
Li C, Luo Y, Xie Y, Zhang Z, Liu Y, Zou L, Xiao F. Structural and functional prediction, evaluation, and validation in the post-sequencing era. Comput Struct Biotechnol J 2024; 23:446-451. [PMID: 38223342 PMCID: PMC10787220 DOI: 10.1016/j.csbj.2023.12.031] [Received: 10/20/2023] [Revised: 12/20/2023] [Accepted: 12/22/2023] [Indexed: 01/16/2024]
Abstract
The surge of genome sequencing data has revealed a substantial number of genetic variants of uncertain significance (VUS). Deciphering the VUS discovered by sequencing poses a major challenge in the post-sequencing era. Although experimental assays have progressed in classifying VUS, only a tiny fraction of human genes have been explored experimentally. Thus, state-of-the-art in silico functional predictors of VUS are urgently needed. Artificial intelligence (AI) is an invaluable tool to assist in the identification of VUS with high efficiency and accuracy. An increasing number of studies indicate that AI has markedly accelerated the interpretation of VUS, and our group has already used AI to develop protein structure-based prediction models. In this review, we provide an overview of previous research on AI-based prediction of missense variants and elucidate the challenges and opportunities for protein structure-based variant prediction in the post-sequencing era.
Affiliation(s)
- Chang Li
- Clinical Biobank, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Yixuan Luo
- Beijing Normal University, Beijing, China
- Yibo Xie
- Information Center, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Zaifeng Zhang
- The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Ye Liu
- The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Lihui Zou
- The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Fei Xiao
- Clinical Biobank, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Beijing Normal University, Beijing, China
14
Hou C, Shen Y. SeqDance: A Protein Language Model for Representing Protein Dynamic Properties. bioRxiv 2024:2024.10.11.617911. [PMID: 39464109 PMCID: PMC11507661 DOI: 10.1101/2024.10.11.617911] [Indexed: 10/29/2024]
Abstract
Proteins perform their functions by folding amino acid sequences into dynamic structural ensembles. Despite the important role of protein dynamics, their complexity and the absence of efficient representation methods have limited their integration into studies on protein function and mutation fitness, especially in deep learning applications. To address this, we present SeqDance, a protein language model designed to learn representation of protein dynamic properties directly from sequence alone. SeqDance is pre-trained on dynamic biophysical properties derived from over 30,400 molecular dynamics trajectories and 28,600 normal mode analyses. Our results show that SeqDance effectively captures local dynamic interactions, co-movement patterns, and global conformational features, even for proteins lacking homologs in the pre-training set. Additionally, we showed that SeqDance enhances the prediction of protein fitness landscapes, disorder-to-order transition binding regions, and phase-separating proteins. By learning dynamic properties from sequence, SeqDance complements conventional evolution- and static structure-based methods, offering new insights into protein behavior and function.
Affiliation(s)
- Chao Hou
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Yufeng Shen
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032
- JP Sulzberger Columbia Genome Center, Columbia University, New York, NY 10032
15
Cheng P, Mao C, Tang J, Yang S, Cheng Y, Wang W, Gu Q, Han W, Chen H, Li S, Chen Y, Zhou J, Li W, Pan A, Zhao S, Huang X, Zhu S, Zhang J, Shu W, Wang S. Zero-shot prediction of mutation effects with multimodal deep representation learning guides protein engineering. Cell Res 2024; 34:630-647. [PMID: 38969803 PMCID: PMC11369238 DOI: 10.1038/s41422-024-00989-2] [Received: 03/13/2024] [Accepted: 06/03/2024] [Indexed: 07/07/2024]
Abstract
Mutations in amino acid sequences can provoke changes in protein function. Accurate and unsupervised prediction of mutation effects is critical in biotechnology and biomedicine, but remains a fundamental challenge. To resolve this challenge, here we present Protein Mutational Effect Predictor (ProMEP), a general and multiple sequence alignment-free method that enables zero-shot prediction of mutation effects. A multimodal deep representation learning model embedded in ProMEP was developed to comprehensively learn both sequence and structure contexts from ~160 million proteins. ProMEP achieves state-of-the-art performance in mutational effect prediction and accomplishes a tremendous improvement in speed, enabling efficient and intelligent protein engineering. Specifically, ProMEP accurately forecasts mutational consequences on the gene-editing enzymes TnpB and TadA, and successfully guides the development of high-performance gene-editing tools with their engineered variants. The gene-editing efficiency of a 5-site mutant of TnpB reaches up to 74.04% (vs 24.66% for the wild type); and the base editing tool developed on the basis of a TadA 15-site mutant (in addition to the A106V/D108N double mutation that renders deoxyadenosine deaminase activity to TadA) exhibits an A-to-G conversion frequency of up to 77.27% (vs 69.80% for ABE8e, a previous TadA-based adenine base editor) with significantly reduced bystander and off-target effects compared to ABE8e. ProMEP not only showcases superior performance in predicting mutational effects on proteins but also demonstrates a great capability to guide protein engineering. Therefore, ProMEP enables efficient exploration of the gigantic protein space and facilitates practical design of proteins, thereby advancing studies in biomedicine and synthetic biology.
Affiliation(s)
- Peng Cheng
- Bioinformatics Center of AMMS, Beijing, China
- Cong Mao
- State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
- Jin Tang
- Zhejiang Lab, Hangzhou, Zhejiang, China
- Sen Yang
- Bioinformatics Center of AMMS, Beijing, China
- Yu Cheng
- State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
- Wuke Wang
- Zhejiang Lab, Hangzhou, Zhejiang, China
- Qiuxi Gu
- State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
- Wei Han
- Zhejiang Lab, Hangzhou, Zhejiang, China
- Hao Chen
- State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
- Sihan Li
- State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
- Wuju Li
- Bioinformatics Center of AMMS, Beijing, China
- Aimin Pan
- Zhejiang Lab, Hangzhou, Zhejiang, China
- Suwen Zhao
- iHuman Institute, ShanghaiTech University, Shanghai, China
- School of Life Science and Technology, ShanghaiTech University, Shanghai, China
- Xingxu Huang
- Zhejiang Lab, Hangzhou, Zhejiang, China
- School of Life Science and Technology, ShanghaiTech University, Shanghai, China
- Jun Zhang
- State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China.
- Wenjie Shu
- Bioinformatics Center of AMMS, Beijing, China.
16
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: trends from three decades of genetic variant impact predictors. Hum Genomics 2024; 18:90. [PMID: 39198917 PMCID: PMC11360829 DOI: 10.1186/s40246-024-00663-z] [Received: 06/22/2024] [Accepted: 08/19/2024] [Indexed: 09/01/2024]
Abstract
BACKGROUND Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). RESULTS The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past three decades, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 190 VIPs, resulting in a total of 407 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. CONCLUSIONS VIPdb version 2 summarizes 407 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. VIPdb is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
- Arul S Menon
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA
- Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA
- Illumina, Foster City, CA, 94404, USA
- Steven E Brenner
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA.
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA.
- College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA.
- Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA.
17
Zhang Y, Leung AK, Kang JJ, Sun Y, Wu G, Li L, Sun J, Cheng L, Qiu T, Zhang J, Wierbowski S, Gupta S, Booth J, Yu H. A multiscale functional map of somatic mutations in cancer integrating protein structure and network topology. bioRxiv 2024:2023.03.06.531441. [PMID: 36945530 PMCID: PMC10028849 DOI: 10.1101/2023.03.06.531441] [Indexed: 03/09/2023]
Abstract
A major goal of cancer biology is to understand the mechanisms underlying tumorigenesis driven by somatically acquired mutations. Two distinct types of computational methodologies have emerged: one focuses on analyzing clustering of mutations within protein sequences and 3D structures, while the other characterizes mutations by leveraging the topology of protein-protein interaction network. Their insights are largely non-overlapping, offering complementary strengths. Here, we established a unified, end-to-end 3D structurally-informed protein interaction network propagation framework, NetFlow3D, that systematically maps the multiscale mechanistic effects of somatic mutations in cancer. The establishment of NetFlow3D hinges upon the Human Protein Structurome, a comprehensive repository we compiled that incorporates the 3D structures of every single protein as well as the binding interfaces of all known protein interactions in humans. NetFlow3D leverages the Structurome to integrate information across atomic, residue, protein and network levels: It conducts 3D clustering of mutations across atomic and residue levels on protein structures to identify potential driver mutations. It then anisotropically propagates their impacts across the protein interaction network, with propagation guided by the specific 3D structural interfaces involved, to identify significantly interconnected network "modules", thereby uncovering key biological processes underlying disease etiology. Applied to 1,038,899 somatic protein-altering mutations in 9,946 TCGA tumors across 33 cancer types, NetFlow3D identified 1,4444 significant 3D clusters throughout the Human Protein Structurome, of which ~55% would not have been found if using only experimentally-determined structures. It then identified 26 significantly interconnected modules that encompass ~8-fold more proteins than applying standard network analyses. NetFlow3D and our pan-cancer results can be accessed from http://netflow3d.yulab.org.
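The idea of propagating mutation impact over an interaction network can be illustrated with a generic random walk with restart. This is a plain, isotropic propagation on a toy graph, named here as a stand-in: NetFlow3D's propagation is described as anisotropic and guided by 3D interfaces, which this sketch does not model. The function name, the four-node network, and the restart probability are all hypothetical.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Propagate seed scores over an undirected graph given as {node: set(neighbours)}."""
    nodes = sorted(adjacency)
    # Start with all probability mass spread evenly over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Each step returns a fraction `restart` of the mass to the seeds.
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            degree = len(adjacency[n])
            if degree == 0:
                # Isolated node: its remaining mass stays in place.
                nxt[n] += (1.0 - restart) * p[n]
                continue
            share = (1.0 - restart) * p[n] / degree
            for m in adjacency[n]:
                nxt[m] += share
        if max(abs(nxt[n] - p[n]) for n in nodes) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical toy network: node "A" carries the mutation signal.
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
scores = random_walk_with_restart(toy_network, seeds={"A"})
```

The scores sum to one, and nodes closer to the seed receive more of the propagated signal than distant ones.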
Affiliation(s)
- Yingying Zhang
- Department of Computational Biology, Cornell University; Ithaca, 14853, USA
- Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
- Department of Molecular Biology and Genetics, Cornell University; Ithaca, 14853, USA
- Alden K. Leung
- Department of Computational Biology, Cornell University; Ithaca, 14853, USA
- Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
- Jin Joo Kang
- Department of Computational Biology, Cornell University; Ithaca, 14853, USA
- Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
- Yu Sun
- Department of Computational Biology, Cornell University; Ithaca, 14853, USA
- Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
- Guanxi Wu
- College of Agriculture and Life Sciences, Cornell University; Ithaca, 14853, USA
- Le Li
- Department of Computational Biology, Cornell University; Ithaca, 14853, USA
- Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
- Jiayang Sun
- Department of Computational Biology, Cornell University; Ithaca, 14853, USA
- Lily Cheng
- Department of Science and Technology Studies, Cornell University; Ithaca, 14853, USA
- Tian Qiu
- School of Electrical and Computer Engineering, Cornell University; Ithaca, 14853, USA
- Junke Zhang
- Department of Computational Biology, Cornell University; Ithaca, 14853, USA
- Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
- Shayne Wierbowski
- Department of Computational Biology, Cornell University; Ithaca, 14853, USA
- Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
- Shagun Gupta
- Department of Computational Biology, Cornell University; Ithaca, 14853, USA
- Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
- James Booth
- Department of Computational Biology, Cornell University; Ithaca, 14853, USA
- Department of Statistics and Data Science, Cornell University; Ithaca, 14853, USA
- Haiyuan Yu
- Department of Computational Biology, Cornell University; Ithaca, 14853, USA
- Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
18
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: Trends from 25 years of genetic variant impact predictors. bioRxiv 2024:2024.06.25.600283. [PMID: 38979289 PMCID: PMC11230257 DOI: 10.1101/2024.06.25.600283] [Indexed: 07/10/2024]
Abstract
Background Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). Results The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past 25 years, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 186 VIPs, resulting in a total of 403 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. Conclusions VIPdb version 2 summarizes 403 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. Availability VIPdb version 2 is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
- Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
- Center for Computational Biology, University of California, Berkeley, California 94720, USA
- Arul S. Menon
- Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
- Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Currently at: Illumina, Foster City, California 94404, USA
- Steven E. Brenner
- Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
- Center for Computational Biology, University of California, Berkeley, California 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
19
Ashayeri H, Sobhi N, Pławiak P, Pedrammehr S, Alizadehsani R, Jafarizadeh A. Transfer Learning in Cancer Genetics, Mutation Detection, Gene Expression Analysis, and Syndrome Recognition. Cancers (Basel) 2024; 16:2138. [PMID: 38893257 PMCID: PMC11171544 DOI: 10.3390/cancers16112138] [Received: 05/05/2024] [Revised: 05/30/2024] [Accepted: 06/01/2024] [Indexed: 06/21/2024]
Abstract
Artificial intelligence (AI), encompassing machine learning (ML) and deep learning (DL), has revolutionized medical research, facilitating advancements in drug discovery and cancer diagnosis. ML identifies patterns in data, while DL employs neural networks for intricate processing. Predictive modeling challenges, such as data labeling, are addressed by transfer learning (TL), leveraging pre-existing models for faster training. TL shows potential in genetic research, improving tasks like gene expression analysis, mutation detection, genetic syndrome recognition, and genotype-phenotype association. This review explores the role of TL in overcoming challenges in mutation detection, genetic syndrome detection, gene expression, or phenotype-genotype association. TL has shown effectiveness in various aspects of genetic research. TL enhances the accuracy and efficiency of mutation detection, aiding in the identification of genetic abnormalities. TL can improve the diagnostic accuracy of syndrome-related genetic patterns. Moreover, TL plays a crucial role in gene expression analysis in order to accurately predict gene expression levels and their interactions. Additionally, TL enhances phenotype-genotype association studies by leveraging pre-trained models. In conclusion, TL enhances AI efficiency by improving mutation prediction, gene expression analysis, and genetic syndrome detection. Future studies should focus on increasing domain similarities, expanding databases, and incorporating clinical data for better predictions.
Affiliation(s)
- Hamidreza Ashayeri
  - Student Research Committee, Tabriz University of Medical Sciences, Tabriz 5165665811, Iran
- Navid Sobhi
  - Nikookari Eye Center, Tabriz University of Medical Sciences, Tabriz 5165665811, Iran
- Paweł Pławiak
  - Department of Computer Science, Faculty of Computer Science and Telecommunications, Cracow University of Technology, Warszawska 24, 31-155 Krakow, Poland
  - Institute of Theoretical and Applied Informatics, Polish Academy of Sciences, Bałtycka 5, 44-100 Gliwice, Poland
- Siamak Pedrammehr
  - Faculty of Design, Tabriz Islamic Art University, Tabriz 5164736931, Iran
  - Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Burwood, VIC 3216, Australia
- Roohallah Alizadehsani
  - Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Burwood, VIC 3216, Australia
- Ali Jafarizadeh
  - Nikookari Eye Center, Tabriz University of Medical Sciences, Tabriz 5165665811, Iran
  - Immunology Research Center, Tabriz University of Medical Sciences, Tabriz 5165665811, Iran
|
20
|
Zhong G, Zhao Y, Zhuang D, Chung WK, Shen Y. PreMode predicts mode-of-action of missense variants by deep graph representation learning of protein sequence and structural context. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.20.581321. [PMID: 38746140 PMCID: PMC11092447 DOI: 10.1101/2024.02.20.581321] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
Accurate prediction of the functional impact of missense variants is important for disease gene discovery, clinical genetic diagnostics, therapeutic strategies, and protein engineering. Previous efforts have focused on predicting a binary pathogenicity classification, but the functional impact of missense variants is multi-dimensional. Pathogenic missense variants in the same gene may act through different modes of action (i.e., gain/loss-of-function) by affecting different aspects of protein function. They may result in distinct clinical conditions that require different treatments. We developed a new method, PreMode, to perform gene-specific mode-of-action predictions. PreMode models effects of coding sequence variants using SE(3)-equivariant graph neural networks on protein sequences and structures. Using the largest-to-date set of missense variants with known modes of action, we showed that PreMode reached state-of-the-art performance in multiple types of mode-of-action predictions by efficient transfer-learning. Additionally, PreMode's prediction of G/LoF variants in a kinase is consistent with inactive-active conformation transition energy changes. Finally, we show that PreMode enables efficient study design of deep mutational scans and optimization in protein engineering.
|
21
|
Zhao H, Wang L, Zhang M, Wang H, Zhang S, Wu J, Tang Y. Identification and characterization of novel genetic variants in the first Chinese family of mucopolysaccharidosis IIIC (Sanfilippo C syndrome). J Cell Mol Med 2024; 28:e18307. [PMID: 38613342 PMCID: PMC11015392 DOI: 10.1111/jcmm.18307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Revised: 02/23/2024] [Accepted: 03/26/2024] [Indexed: 04/14/2024] Open
Abstract
Mucopolysaccharidosis type IIIC (MPS IIIC) is one of the inherited lysosomal storage disorders caused by deficiencies in lysosomal hydrolases that degrade acidic mucopolysaccharides. The gene responsible for MPS IIIC is HGSNAT, which encodes an enzyme that catalyses the acetylation of the terminal glucosamine residues of heparan sulfate. So far, few studies have focused on the genetic landscape of MPS IIIC in China, where IIIA and IIIB are the major subtypes. In this study, we utilized whole-exome sequencing (WES) to identify novel compound heterozygous variants in the HGSNAT gene from a Chinese patient with typical MPS IIIC symptoms: c.743G>A; p.Gly248Glu and c.1030C>T; p.Arg344Cys. We performed in silico analysis and experimental validation, which confirmed the pathogenic nature of both variants, as evidenced by the loss of HGSNAT activity and failure of lysosomal localization. To the best of our knowledge, this is the first case of MPS IIIC in China confirmed by clinical, biochemical and molecular genetic findings. Our study thus expands the spectrum of MPS IIIC pathogenic variants, which is important for dissecting the pathogenesis and for the clinical diagnosis of MPS IIIC. Moreover, this study helps to depict the natural history of Chinese MPS IIIC populations.
Affiliation(s)
- Hongjun Zhao
  - Department of Rheumatology and Immunology, Xiangya Hospital, Central South University, Changsha, China
  - Provincial Clinical Research Center for Rheumatic and Immunologic Diseases, Xiangya Hospital, Central South University, Changsha, China
  - National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China
- Lijing Wang
  - National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China
  - Department of Geriatrics, Aging Research Center, Xiangya Hospital, Central South University, Changsha, China
- Mengfei Zhang
  - National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China
  - Department of Geriatrics, Aging Research Center, Xiangya Hospital, Central South University, Changsha, China
- Huakun Wang
  - National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China
  - Department of Geriatrics, Aging Research Center, Xiangya Hospital, Central South University, Changsha, China
- Sizhe Zhang
  - National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China
  - Department of Neurology, Xiangya Hospital, Central South University, Changsha, Hunan, China
- Junjiao Wu
  - Department of Rheumatology and Immunology, Xiangya Hospital, Central South University, Changsha, China
  - Provincial Clinical Research Center for Rheumatic and Immunologic Diseases, Xiangya Hospital, Central South University, Changsha, China
  - National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China
- Yu Tang
  - National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China
  - Department of Geriatrics, Aging Research Center, Xiangya Hospital, Central South University, Changsha, China
  - Department of Neurology, Xiangya Hospital, Central South University, Changsha, Hunan, China
|
22
|
Chafai N, Bonizzi L, Botti S, Badaoui B. Emerging applications of machine learning in genomic medicine and healthcare. Crit Rev Clin Lab Sci 2024; 61:140-163. [PMID: 37815417 DOI: 10.1080/10408363.2023.2259466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Accepted: 09/12/2023] [Indexed: 10/11/2023]
Abstract
The integration of artificial intelligence technologies has propelled the progress of clinical and genomic medicine in recent years. The significant increase in computing power has facilitated the ability of artificial intelligence models to analyze and extract features from extensive medical data and images, thereby contributing to the advancement of intelligent diagnostic tools. Artificial intelligence (AI) models have been utilized in the field of personalized medicine to integrate clinical data and genomic information of patients. This integration allows for the identification of customized treatment recommendations, ultimately leading to enhanced patient outcomes. Notwithstanding the notable advancements, the application of artificial intelligence (AI) in the field of medicine is impeded by various obstacles such as the limited availability of clinical and genomic data, the diversity of datasets, ethical implications, and the inconclusive interpretation of AI models' results. In this review, a comprehensive evaluation of multiple machine learning algorithms utilized in the fields of clinical and genomic medicine is conducted. Furthermore, we present an overview of the implementation of artificial intelligence (AI) in the fields of clinical medicine, drug discovery, and genomic medicine. Finally, a number of constraints pertaining to the implementation of artificial intelligence within the healthcare industry are examined.
Affiliation(s)
- Narjice Chafai
  - Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
- Luigi Bonizzi
  - Department of Biomedical, Surgical and Dental Science, University of Milan, Milan, Italy
- Sara Botti
  - PTP Science Park, Via Einstein - Loc. Cascina Codazza, Lodi, Italy
- Bouabid Badaoui
  - Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
  - African Sustainable Agriculture Research Institute (ASARI), Mohammed VI Polytechnic University (UM6P), Laâyoune, Morocco
|
23
|
Nourbakhsh M, Degn K, Saksager A, Tiberti M, Papaleo E. Prediction of cancer driver genes and mutations: the potential of integrative computational frameworks. Brief Bioinform 2024; 25:bbad519. [PMID: 38261338 PMCID: PMC10805075 DOI: 10.1093/bib/bbad519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 11/27/2023] [Accepted: 12/11/2023] [Indexed: 01/24/2024] Open
Abstract
The vast amount of available sequencing data allows the scientific community to explore different genetic alterations that may drive cancer or favor cancer progression. Software developers have proposed a myriad of predictive tools, allowing researchers and clinicians to compare and prioritize driver genes and mutations and their relative pathogenicity. However, there is little consensus on the computational approach or a gold standard for comparison. Hence, benchmarking the different tools depends highly on the input data, indicating that overfitting is still a massive problem. One of the solutions is to limit the scope and usage of specific tools. However, such limitations force researchers to walk a tightrope between creating and using high-quality tools for a specific purpose and describing the complex alterations driving cancer. While the knowledge of cancer development increases daily, many bioinformatic pipelines rely on single nucleotide variants or alterations in a vacuum without accounting for cellular compartments, mutational burden or disease progression. Even within bioinformatics and computational cancer biology, the research fields work in silos, risking overlooking potential synergies or breakthroughs. Here, we provide an overview of databases and datasets for building or testing predictive cancer driver tools. Furthermore, we introduce predictive tools for driver genes, driver mutations, and the impact of these based on structural analysis. Additionally, we suggest and recommend directions in the field to avoid silo-research, moving towards integrative frameworks.
Affiliation(s)
- Mona Nourbakhsh
  - Cancer Systems Biology, Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, 2800 Lyngby, Denmark
- Kristine Degn
  - Cancer Systems Biology, Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, 2800 Lyngby, Denmark
- Astrid Saksager
  - Cancer Systems Biology, Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, 2800 Lyngby, Denmark
- Matteo Tiberti
  - Cancer Structural Biology, Danish Cancer Institute, 2100 Copenhagen, Denmark
- Elena Papaleo
  - Cancer Systems Biology, Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, 2800 Lyngby, Denmark
  - Cancer Structural Biology, Danish Cancer Institute, 2100 Copenhagen, Denmark
|
24
|
Murtas G, Zerbini E, Rabattoni V, Motta Z, Caldinelli L, Orlando M, Marchesani F, Campanini B, Sacchi S, Pollegioni L. Biochemical and cellular studies of three human 3-phosphoglycerate dehydrogenase variants responsible for pathological reduced L-serine levels. Biofactors 2024; 50:181-200. [PMID: 37650587 DOI: 10.1002/biof.2002] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/05/2023] [Accepted: 08/12/2023] [Indexed: 09/01/2023]
Abstract
In the brain, the non-essential amino acid L-serine is produced through the phosphorylated pathway (PP) starting from the glycolytic intermediate 3-phosphoglycerate: among the different roles played by this amino acid, it can be converted into D-serine and glycine, the two main co-agonists of NMDA receptors. In humans, the enzymes of the PP, namely phosphoglycerate dehydrogenase (hPHGDH, which catalyzes the first and rate-limiting step of this pathway), 3-phosphoserine aminotransferase, and 3-phosphoserine phosphatase are likely organized in the cytosol as a metabolic assembly (a "serinosome"). hPHGDH deficiency is a pathological condition biochemically characterized by reduced levels of L-serine in plasma and cerebrospinal fluid and clinically identified by severe neurological impairment. Here, three single-point variants responsible for hPHGDH deficiency and Neu-Laxova syndrome have been studied. Their biochemical characterization shows that the V261M, V425M, and V490M substitutions alter both the kinetic properties (maximal activity and Km for 3-phosphoglycerate in the physiological direction) and the structural properties (secondary, tertiary, and quaternary structure, favoring aggregation) of hPHGDH. All three variants have been successfully ectopically expressed in U251 cells, thus the pathological effect is not due to a hindered expression level. At the cellular level, mistargeting and aggregation phenomena have been observed in cells transiently expressing the pathological protein variants, as well as a reduced L-serine cellular level. Previous studies demonstrated that the pharmacological supplementation of L-serine in hPHGDH deficiencies could ameliorate some of the related symptoms: our results now suggest the use of additional and alternative therapeutic approaches.
Affiliation(s)
- Giulia Murtas
  - Department of Biotechnology and Life Sciences, University of Insubria, Varese, Italy
- Elena Zerbini
  - Department of Biotechnology and Life Sciences, University of Insubria, Varese, Italy
- Valentina Rabattoni
  - Department of Biotechnology and Life Sciences, University of Insubria, Varese, Italy
- Zoraide Motta
  - Department of Biotechnology and Life Sciences, University of Insubria, Varese, Italy
- Laura Caldinelli
  - Department of Biotechnology and Life Sciences, University of Insubria, Varese, Italy
- Marco Orlando
  - Department of Biotechnology and Biosciences, University of Milano-Bicocca, Milan, Italy
- Silvia Sacchi
  - Department of Biotechnology and Life Sciences, University of Insubria, Varese, Italy
- Loredano Pollegioni
  - Department of Biotechnology and Life Sciences, University of Insubria, Varese, Italy
|
25
|
Notin P, Kollasch AW, Ritter D, van Niekerk L, Paul S, Spinner H, Rollins N, Shaw A, Weitzman R, Frazer J, Dias M, Franceschi D, Orenbuch R, Gal Y, Marks DS. ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.12.07.570727. [PMID: 38106144 PMCID: PMC10723403 DOI: 10.1101/2023.12.07.570727] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Indexed: 12/19/2023]
Abstract
Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins that can address our most pressing challenges in climate, agriculture and healthcare. Despite a surge in machine learning-based protein models to tackle these questions, an assessment of their respective benefits is challenging due to the use of distinct, often contrived, experimental datasets, and the variable performance of models across different protein families. Addressing these challenges requires scale. To that end we introduce ProteinGym, a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design. It encompasses both a broad collection of over 250 standardized deep mutational scanning assays, spanning millions of mutated sequences, as well as curated clinical datasets providing high-quality expert annotations about mutation effects. We devise a robust evaluation framework that combines metrics for both fitness prediction and design, factors in known limitations of the underlying experimental methods, and covers both zero-shot and supervised settings. We report the performance of a diverse set of over 70 high-performing models from various subfields (e.g., alignment-based, inverse folding) in a unified benchmark suite. We open source the corresponding codebase, datasets, MSAs, structures, and model predictions, and develop a user-friendly website that facilitates data access and analysis.
Affiliation(s)
- Ada Shaw
  - Applied Mathematics, Harvard University
- Mafalda Dias
  - Centre for Genomic Regulation, Universitat Pompeu Fabra
- Yarin Gal
  - Computer Science, University of Oxford
|
26
|
Ge F, Arif M, Yan Z, Alahmadi H, Worachartcheewan A, Yu DJ, Shoombuatong W. MMPatho: Leveraging Multilevel Consensus and Evolutionary Information for Enhanced Missense Mutation Pathogenic Prediction. J Chem Inf Model 2023; 63:7239-7257. [PMID: 37947586 PMCID: PMC10685454 DOI: 10.1021/acs.jcim.3c00950] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2023] [Revised: 10/21/2023] [Accepted: 10/23/2023] [Indexed: 11/12/2023]
Abstract
Understanding the pathogenicity of missense mutations (MMs) is essential for shedding light on genetic diseases, gene functions, and individual variations. In this study, we propose a novel computational approach, called MMPatho, for enhancing missense mutation pathogenic prediction. First, we established a large-scale nonredundant MM benchmark data set based on the entire Ensembl database, complemented by a focused blind test set specifically for pathogenic GOF/LOF MMs. Based on this data set, for each mutation, we utilized Ensembl VEP v104 and dbNSFP v4.1a to extract variant-level, amino acid-level, individuals' outputs, and genome-level features. Additionally, protein sequences were generated using ENSP identifiers with the Ensembl API, and then encoded. The mutant sites' ESM-1b and ProtTrans-T5 embeddings were subsequently extracted. Then, our model group (MMPatho) was developed by leveraging these efforts, and comprises ConsMM and EvoIndMM. To be specific, ConsMM employs individuals' outputs and XGBoost with SHAP explanation analysis, while EvoIndMM investigates the potential enhancement of predictive capability by incorporating evolutionary information from ESM-1b and ProtT5-XL-U50, large protein language embeddings. Through rigorous comparative experiments, both ConsMM and EvoIndMM were capable of achieving remarkable AUROC (0.9836 and 0.9854) and AUPR (0.9852 and 0.9902) values on the blind test set devoid of overlapping variations and proteins from the training data, thus highlighting the superiority of our computational approach in the prediction of MM pathogenicity. Our Web server, available at http://csbio.njust.edu.cn/bioinf/mmpatho/, allows researchers to predict the pathogenicity (alongside the reliability index score) of MMs using the ConsMM and EvoIndMM models and provides extensive annotations for user input. Additionally, the newly constructed benchmark data set and blind test set can be accessed via the data page of our web server.
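For readers less familiar with the AUROC figures quoted in this abstract, the metric can be sketched as below. This is not MMPatho code: the function name `auroc` and the toy labels and scores are illustrative assumptions, and the pairwise formulation is chosen for clarity rather than speed.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen pathogenic variant
    (label 1) receives a higher score than a randomly chosen neutral
    variant (label 0); tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = pathogenic, 0 = neutral.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.1]
print(auroc(toy_labels, toy_scores))
```

A value of 1.0 means every pathogenic variant outranks every neutral one, and 0.5 corresponds to random ranking. The quadratic loop is acceptable for small sets; larger benchmarks would normally use a rank-based implementation.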
Affiliation(s)
- Fang Ge
  - School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, 9 Wenyuanlu, Nanjing 210023, China
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Muhammad Arif
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
  - Department of Community Medical Technology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Zihao Yan
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
- Hanin Alahmadi
  - College of Computer Science and Engineering, Taibah University, Madinah 344, Saudi Arabia
- Apilak Worachartcheewan
  - Department of Community Medical Technology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
- Watshara Shoombuatong
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
|
27
|
Zeibich R, Kwan P, O’Brien TJ, Perucca P, Ge Z, Anderson A. Applications for Deep Learning in Epilepsy Genetic Research. Int J Mol Sci 2023; 24:14645. [PMID: 37834093 PMCID: PMC10572791 DOI: 10.3390/ijms241914645] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Revised: 09/11/2023] [Accepted: 09/21/2023] [Indexed: 10/15/2023] Open
Abstract
Epilepsy is a group of brain disorders characterised by an enduring predisposition to generate unprovoked seizures. Fuelled by advances in sequencing technologies and computational approaches, more than 900 genes have now been implicated in epilepsy. The development and optimisation of tools and methods for analysing the vast quantity of genomic data is a rapidly evolving area of research. Deep learning (DL) is a subset of machine learning (ML) that brings opportunity for novel investigative strategies that can be harnessed to gain new insights into the genomic risk of people with epilepsy. DL is being harnessed to address limitations in accuracy of long-read sequencing technologies, which improve on short-read methods. Tools that predict the functional consequence of genetic variation can break new ground in addressing critical knowledge gaps, while methods that integrate independent but complementary data enhance the predictive power of genetic data. We provide an overview of these DL tools and discuss how they may be applied to the analysis of genetic data for epilepsy research.
Affiliation(s)
- Robert Zeibich
  - Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia
- Patrick Kwan
  - Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia
  - Department of Neurology, Alfred Health, Melbourne, VIC 3004, Australia
  - Department of Neurology, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
  - Department of Medicine, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
- Terence J. O’Brien
  - Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia
  - Department of Neurology, Alfred Health, Melbourne, VIC 3004, Australia
  - Department of Neurology, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
  - Department of Medicine, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
- Piero Perucca
  - Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia
  - Department of Neurology, Alfred Health, Melbourne, VIC 3004, Australia
  - Department of Neurology, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
  - Epilepsy Research Centre, Department of Medicine, Austin Health, The University of Melbourne, Melbourne, VIC 3084, Australia
  - Bladin-Berkovic Comprehensive Epilepsy Program, Department of Neurology, Austin Health, The University of Melbourne, Melbourne, VIC 3084, Australia
- Zongyuan Ge
  - Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia
  - Monash-Airdoc Research, Monash University, Melbourne, VIC 3800, Australia
- Alison Anderson
  - Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia
  - Department of Medicine, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
|
28
|
Cheng J, Novati G, Pan J, Bycroft C, Žemgulytė A, Applebaum T, Pritzel A, Wong LH, Zielinski M, Sargeant T, Schneider RG, Senior AW, Jumper J, Hassabis D, Kohli P, Avsec Ž. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 2023; 381:eadg7492. [PMID: 37733863 DOI: 10.1126/science.adg7492] [Citation(s) in RCA: 665] [Impact Index Per Article: 332.5] [Received: 01/19/2023] [Accepted: 08/23/2023] [Indexed: 09/23/2023]
Abstract
The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.
|
29
|
David A, Sternberg MJE. Protein structure-based evaluation of missense variants: Resources, challenges and future directions. Curr Opin Struct Biol 2023; 80:102600. [PMID: 37126977 DOI: 10.1016/j.sbi.2023.102600] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/30/2023] [Revised: 03/30/2023] [Accepted: 03/31/2023] [Indexed: 05/03/2023]
Abstract
We provide an overview of the methods that can be used for protein structure-based evaluation of missense variants. The algorithms can be broadly divided into those that calculate the difference in free energy (ΔΔG) between the wild type and variant structures and those that use structural features to predict the damaging effect of a variant without providing a ΔΔG. A wide range of machine learning approaches have been employed to develop those algorithms. We also discuss challenges and opportunities for variant interpretation in view of the recent breakthrough in three-dimensional structural modelling using deep learning.
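As a notational sketch of the ΔΔG quantity mentioned in this abstract, one common way to write it is shown below. The symbols and the sign convention are assumptions introduced here for illustration and are not taken from the cited review; some tools report the opposite sign.

```latex
\Delta\Delta G \;=\; \Delta G_{\mathrm{fold}}^{\mathrm{variant}} \;-\; \Delta G_{\mathrm{fold}}^{\mathrm{wild\ type}},
\qquad
\Delta G_{\mathrm{fold}} \;=\; G_{\mathrm{folded}} - G_{\mathrm{unfolded}}
```

Under this convention a stable protein has a negative folding free energy, so a positive ΔΔG indicates that the variant is less stable than the wild type.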
Affiliation(s)
- Alessia David
  - Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, London, SW7 2AZ, UK
- Michael J E Sternberg
  - Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, London, SW7 2AZ, UK
|