1
Zhao Y, Lan T, Zhong G, Hagen J, Pan H, Chung WK, Shen Y. A probabilistic graphical model for estimating selection coefficients of nonsynonymous variants from human population sequence data. Nat Commun 2025; 16:4670. [PMID: 40393980 DOI: 10.1038/s41467-025-59937-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2024] [Accepted: 05/06/2025] [Indexed: 05/22/2025] Open
Abstract
Accurately predicting the effect of missense variants is important in discovering disease risk genes and clinical genetic diagnostics. Commonly used computational methods predict pathogenicity, which does not capture the quantitative impact on fitness in humans. We develop a method, MisFit, to estimate missense fitness effect using a graphical model. MisFit jointly models the effect at a molecular level (d) and a population level (selection coefficient, s), assuming that in the same gene, missense variants with similar d have similar s. We train it by maximizing probability of observed allele counts in 236,017 individuals of European ancestry. We show that s is informative in predicting allele frequency across ancestries and consistent with the fraction of de novo mutations in sites under strong selection. Further, s outperforms previous methods in prioritizing de novo missense variants in individuals with neurodevelopmental disorders. In conclusion, MisFit accurately predicts s and yields new insights from genomic data.
Affiliation(s)
- Yige Zhao
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- The Integrated Program in Cellular, Molecular, and Biomedical Studies, Columbia University, New York, NY, USA
- Tian Lan
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Guojie Zhong
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- The Integrated Program in Cellular, Molecular, and Biomedical Studies, Columbia University, New York, NY, USA
- Jake Hagen
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Department of Pediatrics, Boston Children's Hospital and Harvard Medical School, Boston, MA, USA
- Hongbing Pan
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA
- Wendy K Chung
- Department of Pediatrics, Boston Children's Hospital and Harvard Medical School, Boston, MA, USA
- Yufeng Shen
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA.
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA.
- JP Sulzberger Columbia Genome Center, Columbia University, New York, NY, USA.
2
Ramasamy R, Raveendran M, Harris RA, Le HD, Mure LS, Benegiamo G, Dkhissi-Benyahya O, Cooper H, Rogers J, Panda S. Genome-wide allele-specific expression in multi-tissue samples from healthy male baboons reveals the transcriptional complexity of mammals. Cell Genomics 2025; 5:100823. [PMID: 40187355 DOI: 10.1016/j.xgen.2025.100823] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2024] [Revised: 12/13/2024] [Accepted: 03/06/2025] [Indexed: 04/07/2025]
Abstract
Allele-specific expression (ASE) is pivotal in understanding the genetic underpinnings of phenotypic variation within species, differences in disease susceptibility, and responses to environmental factors. We processed 11 different tissue types collected from 12 age-matched healthy olive baboons (Papio anubis) for genome-wide ASE analysis. By sequencing their genomes at a minimum depth of 30×, we identified over 16 million single-nucleotide variants (SNVs). We also generated long-read sequencing data, enabling the phasing of all variants present within the coding regions of 96.5% of assayable protein-coding genes as a single haplotype block. Given the extensive heterozygosity of baboons relative to humans, we could quantify ASE across 72% of the total annotated protein-coding gene set. We identified genes that exhibit ASE and affect specific tissues and genotypes. We discovered ASE SNVs that also exist in human populations with identical alleles and that are designated as pathogenic by both the PrimateAI-3D and AlphaMissense models.
Affiliation(s)
- Ramesh Ramasamy
- Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, CA 92037, USA
- Muthuswamy Raveendran
- Human Genome Sequencing Center and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- R Alan Harris
- Human Genome Sequencing Center and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Hiep D Le
- Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, CA 92037, USA
- Ludovic S Mure
- Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, CA 92037, USA
- Giorgia Benegiamo
- Laboratory of Integrative Systems Physiology, Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
- Ouria Dkhissi-Benyahya
- Univ Lyon, Université Claude Bernard Lyon 1, INSERM, Stem Cell and Brain Research Institute U1208, 69500 Bron, France
- Howard Cooper
- Univ Lyon, Université Claude Bernard Lyon 1, INSERM, Stem Cell and Brain Research Institute U1208, 69500 Bron, France
- Jeffrey Rogers
- Human Genome Sequencing Center and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA.
- Satchidananda Panda
- Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, CA 92037, USA.
3
Tekpinar M, David L, Henry T, Carbone A. PRESCOTT: a population aware, epistatic, and structural model accurately predicts missense effects. Genome Biol 2025; 26:113. [PMID: 40329382 PMCID: PMC12054230 DOI: 10.1186/s13059-025-03581-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2024] [Accepted: 04/17/2025] [Indexed: 05/08/2025] Open
Abstract
Predicting the functional impact of point mutations is a critical challenge in genomics. PRESCOTT reconstructs complete mutational landscapes, identifies mutation-sensitive regions, and categorizes missense variants as benign, pathogenic, or variants of uncertain significance. Leveraging protein sequences, structural models, and population-specific allele frequencies, PRESCOTT surpasses existing methods in classifying ClinVar variants, the ACMG dataset, and over 1800 proteins from the Human Protein Dataset. Its online server facilitates mutation effect predictions for any protein and variant, and includes a database of over 19,000 human proteins, ready for population-specific analyses. Open access to residue-specific scores offers transparency and valuable insights for genomic medicine.
Affiliation(s)
- Mustafa Tekpinar
- Department of Computational, Quantitative and Synthetic Biology (CQSB), Sorbonne Université, CNRS, IBPS, UMR 7238, Paris, 75005, France
- Laurent David
- Department of Computational, Quantitative and Synthetic Biology (CQSB), Sorbonne Université, CNRS, IBPS, UMR 7238, Paris, 75005, France
- Thomas Henry
- Centre International de Recherche en Infectiologie (CIRI), Inserm U1111, Université Claude Bernard Lyon 1, CNRS, UMR5308, ENS de Lyon, Univ Lyon, Lyon, 69007, France
- Alessandra Carbone
- Department of Computational, Quantitative and Synthetic Biology (CQSB), Sorbonne Université, CNRS, IBPS, UMR 7238, Paris, 75005, France.
- Institut Universitaire de France (IUF), Paris, France.
4
Buckley RM, Bilgen N, Harris AC, Savolainen P, Tepeli C, Erdoğan M, Serres Armero A, Dreger DL, van Steenbeek FG, Hytönen MK, Parker HG, Hale J, Lohi H, Çınar Kul B, Boyko AR, Ostrander EA. Analysis of canine gene constraint identifies new variants for orofacial clefts and stature. Genome Res 2025; 35:1080-1093. [PMID: 40127928 PMCID: PMC12047267 DOI: 10.1101/gr.280092.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2024] [Accepted: 03/10/2025] [Indexed: 03/26/2025]
Abstract
Dog breeding promotes within-group homogeneity through conformation to strict breed standards, while simultaneously driving between-group heterogeneity. There are over 350 recognized dog breeds that provide the foundation for investigating the genetic basis of phenotypic diversity. Typically, breed standard phenotypes such as stature, pelage, and craniofacial structure are analyzed through genetic association studies. However, such analyses are limited to assayed phenotypes only, leaving difficult-to-measure phenotypic subtleties easily overlooked. We investigated coding variation from over 2000 dogs, leading to discoveries of variants related to craniofacial morphology and stature. Breed-enriched variants were prioritized according to gene constraint, which was calculated using a mutation model derived from trinucleotide substitution probabilities. Among the newly found variants is a splice-acceptor variant in PDGFRA associated with bifid nose, a characteristic trait of Çatalburun dogs, implicating the gene's role in midline closure. Two additional LCORL variants, both associated with canine body size, are also discovered: a frameshift that causes a premature stop in large breeds (>25 kg) and an intronic substitution found in small breeds (<10 kg), thus highlighting the importance of allelic heterogeneity in selection for breed traits. Most variants prioritized in this analysis are not associated with genomic signatures for breed differentiation, as these regions are enriched for constrained genes intolerant to nonsynonymous variation. This indicates trait selection in dogs is likely a balancing act between preserving essential gene functions and maximizing regulatory variation to drive phenotypic extremes.
Affiliation(s)
- Reuben M Buckley
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Nüket Bilgen
- Department of Animal Genetics, Faculty of Veterinary Medicine, University of Ankara, Ankara 06110, Türkiye
- Alexander C Harris
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Peter Savolainen
- KTH Royal Institute of Technology, School of Chemistry, Biotechnology and Health, Science for Life Laboratory, SE-100 44 Stockholm, Sweden
- Cafer Tepeli
- Department of Animal Science, Faculty of Veterinary Medicine, University of Selcuk, Konya 42100, Türkiye
- Metin Erdoğan
- Department of Veterinary Biology and Genetics, Faculty of Veterinary Medicine, Afyon Kocatepe University, Afyonkarahisar 03200, Türkiye
- Aitor Serres Armero
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Dayna L Dreger
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Frank G van Steenbeek
- Department of Clinical Sciences, Faculty of Veterinary Medicine, Utrecht University, 3584 CM Utrecht, The Netherlands
- Marjo K Hytönen
- Department of Medical and Clinical Genetics, University of Helsinki, 00014 Helsinki, Finland
- Department of Veterinary Biosciences, University of Helsinki, 00014 Helsinki, Finland
- Folkhälsan Research Center, 00290 Helsinki, Finland
- Heidi G Parker
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Jessica Hale
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Hannes Lohi
- Department of Medical and Clinical Genetics, University of Helsinki, 00014 Helsinki, Finland
- Department of Veterinary Biosciences, University of Helsinki, 00014 Helsinki, Finland
- Folkhälsan Research Center, 00290 Helsinki, Finland
- Bengi Çınar Kul
- Department of Animal Genetics, Faculty of Veterinary Medicine, University of Ankara, Ankara 06110, Türkiye
- Adam R Boyko
- Department of Biomedical Sciences, College of Veterinary Medicine, Cornell University, Ithaca, New York 14853, USA
- Embark Veterinary, Inc., Boston, Massachusetts 02210, USA
- Elaine A Ostrander
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
5
Saparov A, Zech M. Big data and transformative bioinformatics in genomic diagnostics and beyond. Parkinsonism Relat Disord 2025; 134:107311. [PMID: 39924354 DOI: 10.1016/j.parkreldis.2025.107311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2024] [Revised: 01/23/2025] [Accepted: 01/25/2025] [Indexed: 02/11/2025]
Abstract
The current era of high-throughput analysis-driven research offers invaluable insights into disease etiologies, accurate diagnostics, pathogenesis, and personalized therapy. In the field of movement disorders, investigators are facing an increasing growth in the volume of produced patient-derived datasets, providing substantial opportunities for precision medicine approaches based on extensive information accessibility and advanced annotation practices. Integrating data from multiple sources, including phenomics, genomics, and multi-omics, is crucial for comprehensively understanding different types of movement disorders. Here, we explore formats and analytics of big data generated for patients with movement disorders, including strategies to meaningfully share the data for optimized patient benefit. We review computational methods that are essential to accelerate the process of evaluating the increasing amounts of specialized data collected. Based on concrete examples, we highlight how bioinformatic approaches facilitate the translation of multidimensional biological information into clinically relevant knowledge. Moreover, we outline the feasibility of computer-aided therapeutic target evaluation, and we discuss the importance of expanding the focus of big data research to understudied phenotypes such as dystonia.
Affiliation(s)
- Alice Saparov
- Institute of Human Genetics, Technical University of Munich, School of Medicine and Health, Munich, Germany; Institute of Neurogenomics, Helmholtz Munich, Neuherberg, Germany; Institute for Advanced Study, Technical University of Munich, Garching, Germany
- Michael Zech
- Institute of Human Genetics, Technical University of Munich, School of Medicine and Health, Munich, Germany; Institute of Neurogenomics, Helmholtz Munich, Neuherberg, Germany; Institute for Advanced Study, Technical University of Munich, Garching, Germany.
6
Kimura H, Lahouel K, Tomasetti C, Roberts NJ. Functional characterization of all CDKN2A missense variants and comparison to in silico models of pathogenicity. eLife 2025; 13:RP95347. [PMID: 40238651 PMCID: PMC12002794 DOI: 10.7554/elife.95347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/18/2025] Open
Abstract
Interpretation of variants identified during genetic testing is a significant clinical challenge. In this study, we developed a high-throughput CDKN2A functional assay and characterized all possible human CDKN2A missense variants. We found that 17.7% of all missense variants were functionally deleterious. We also used our functional classifications to assess the performance of in silico models that predict the effect of variants, including recently reported models based on machine learning. Notably, we found that all in silico models performed similarly when compared to our functional classifications with accuracies of 39.5-85.4%. Furthermore, while we found that functionally deleterious variants were enriched within ankyrin repeats, we did not identify any residues where all missense variants were functionally deleterious. Our functional classifications are a resource to aid the interpretation of CDKN2A variants and have important implications for the application of variant interpretation guidelines, particularly the use of in silico models for clinical variant interpretation.
Affiliation(s)
- Hirokazu Kimura
- Department of Pathology, the Johns Hopkins University School of Medicine, Baltimore, United States
- Kamel Lahouel
- Division of Integrated Genomics, Translational Genomics Research Institute, Phoenix, United States
- Department of Computational and Quantitative Medicine, Beckman Research Institute, City of Hope, Duarte, United States
- Cristian Tomasetti
- Division of Integrated Genomics, Translational Genomics Research Institute, Phoenix, United States
- Department of Computational and Quantitative Medicine, Beckman Research Institute, City of Hope, Duarte, United States
- Nicholas Jason Roberts
- Department of Pathology, the Johns Hopkins University School of Medicine, Baltimore, United States
- Department of Oncology, the Johns Hopkins University School of Medicine, Baltimore, United States
7
Landry-Voyer AM, Holling T, Mis EK, Mir Hassani Z, Alawi M, Ji W, Jeffries L, Kutsche K, Bachand F, Lakhani SA. Biallelic variants in the conserved ribosomal protein chaperone gene PDCD2 are associated with hydrops fetalis and early pregnancy loss. Proc Natl Acad Sci U S A 2025; 122:e2426078122. [PMID: 40208938 PMCID: PMC12012559 DOI: 10.1073/pnas.2426078122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2024] [Accepted: 03/10/2025] [Indexed: 04/12/2025] Open
Abstract
Pregnancy loss is a major problem in clinical medicine with devastating consequences for families. Next generation sequencing has improved our ability to identify underlying molecular causes, though over half of all cases lack a clear etiology. Here, we began with clinical evaluation combined with exome sequencing across independent families to identify bi-allelic candidate genetic variants in the Programmed Cell Death 2 (PDCD2) gene in multiple fetuses with nonimmune hydrops fetalis (NIHF). PDCD2 is an evolutionarily conserved protein with no prior association with monogenic disorders. PDCD2 is known to act as a molecular chaperone for the ribosomal protein uS5, and this complex formation is important for incorporation of uS5 into the 40S subunit, a crucial step in ribosome biogenesis. Primary fibroblasts from an affected fetus and cell lines expressing PDCD2 patient variants demonstrated reduced levels of PDCD2, reduced PDCD2 binding to uS5, and altered ribosomal RNA processing. Xenopus tadpoles with Pdcd2 knockdown demonstrated developmental defects and edema, reminiscent of the NIHF seen in affected fetuses, and showed altered ribosomal RNA processing. Through genetic, biochemical, and in vivo approaches, we provide evidence that bi-allelic PDCD2 variants cause an autosomal recessive ribosomal biogenesis disorder resulting in pregnancy loss.
Affiliation(s)
- Anne-Marie Landry-Voyer
- Department of Biochemistry and Functional Genomics, Université de Sherbrooke, Sherbrooke J1E 4K8, Canada
- Tess Holling
- Institute of Human Genetics, University Medical Center Hamburg-Eppendorf, Hamburg 20246, Germany
- Emily K. Mis
- Pediatric Genomics Discovery Program, Department of Pediatrics, Yale University School of Medicine, New Haven, CT 06510
- Zabih Mir Hassani
- Department of Biochemistry and Functional Genomics, Université de Sherbrooke, Sherbrooke J1E 4K8, Canada
- Malik Alawi
- Bioinformatics Core, University Medical Center Hamburg-Eppendorf, Hamburg 20246, Germany
- Weizhen Ji
- Pediatric Genomics Discovery Program, Department of Pediatrics, Yale University School of Medicine, New Haven, CT 06510
- Lauren Jeffries
- Pediatric Genomics Discovery Program, Department of Pediatrics, Yale University School of Medicine, New Haven, CT 06510
- Kerstin Kutsche
- Institute of Human Genetics, University Medical Center Hamburg-Eppendorf, Hamburg 20246, Germany
- German Center for Child and Adolescent Health, partner site Hamburg, Hamburg 20246, Germany
- François Bachand
- Department of Biochemistry and Functional Genomics, Université de Sherbrooke, Sherbrooke J1E 4K8, Canada
- Saquib A. Lakhani
- Pediatric Genomics Discovery Program, Department of Pediatrics, Yale University School of Medicine, New Haven, CT 06510
8
Livesey BJ, Badonyi M, Dias M, Frazer J, Kumar S, Lindorff-Larsen K, McCandlish DM, Orenbuch R, Shearer CA, Muffley L, Foreman J, Glazer AM, Lehner B, Marks DS, Roth FP, Rubin AF, Starita LM, Marsh JA. Guidelines for releasing a variant effect predictor. Genome Biol 2025; 26:97. [PMID: 40234898 PMCID: PMC11998465 DOI: 10.1186/s13059-025-03572-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2024] [Accepted: 04/08/2025] [Indexed: 04/17/2025] Open
Abstract
Computational methods for assessing the likely impacts of mutations, known as variant effect predictors (VEPs), are widely used in the assessment and interpretation of human genetic variation, as well as in other applications like protein engineering. Many different VEPs have been released, and there is tremendous variability in their underlying algorithms, outputs, and the ways in which the methodologies and predictions are shared. This leads to considerable difficulties for users trying to navigate the selection and application of VEPs. Here, to address these issues, we provide guidelines and recommendations for the release of novel VEPs.
Affiliation(s)
- Benjamin J Livesey
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Mihaly Badonyi
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Mafalda Dias
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Jonathan Frazer
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Sushant Kumar
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Kresten Lindorff-Larsen
- Department of Biology, Linderstrøm-Lang Centre for Protein Science, University of Copenhagen, Copenhagen, Denmark
- David M McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, NY, USA
- Rose Orenbuch
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Lara Muffley
- Department of Genome Sciences, University of Washington and the Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
- Julia Foreman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Ben Lehner
- Wellcome Sanger Institute, Cambridge, UK
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Institució Catalana de Recerca I Estudis Avançats (ICREA), Barcelona, Spain
- Debora S Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Broad Institute of MIT and Harvard, Boston, MA, USA
- Frederick P Roth
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Alan F Rubin
- Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
- Department of Medical Biology, University of Melbourne, Parkville, Australia
- Lea M Starita
- Department of Genome Sciences, University of Washington and the Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
- Joseph A Marsh
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK.
9
Zhao Y, Lan T, Zhong G, Hagen J, Pan H, Chung WK, Shen Y. A probabilistic graphical model for estimating selection coefficient of nonsynonymous variants from human population sequence data. medRxiv 2025:2023.12.11.23299809. [PMID: 38168397 PMCID: PMC10760286 DOI: 10.1101/2023.12.11.23299809] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2024]
Abstract
Accurately predicting the effect of missense variants is important in discovering disease risk genes and clinical genetic diagnostics. Commonly used computational methods predict pathogenicity, which does not capture the quantitative impact on fitness in humans. We developed a method, MisFit, to estimate missense fitness effect using a graphical model. MisFit jointly models the effect at a molecular level (𝑑) and a population level (selection coefficient, 𝑠), assuming that in the same gene, missense variants with similar 𝑑 have similar 𝑠. We trained it by maximizing probability of observed allele counts in 236,017 European individuals. We show that 𝑠 is informative in predicting allele frequency across ancestries and consistent with the fraction of de novo mutations in sites under strong selection. Further, 𝑠 outperforms previous methods in prioritizing de novo missense variants in individuals with neurodevelopmental disorders. In conclusion, MisFit accurately predicts 𝑠 and yields new insights from genomic data.
Affiliation(s)
- Yige Zhao
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- The Integrated Program in Cellular, Molecular, and Biomedical Studies, Columbia University, New York, NY 10032
- Tian Lan
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Guojie Zhong
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- The Integrated Program in Cellular, Molecular, and Biomedical Studies, Columbia University, New York, NY 10032
- Jake Hagen
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Department of Pediatrics, Boston Children's Hospital and Harvard Medical School, Boston, MA 02115
- Hongbing Pan
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032
- Wendy K. Chung
- Department of Pediatrics, Boston Children's Hospital and Harvard Medical School, Boston, MA 02115
- Yufeng Shen
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032
- JP Sulzberger Columbia Genome Center, Columbia University, New York, NY 10032
10
Herger M, Kajba CM, Buckley M, Cunha A, Strom M, Findlay GM. High-throughput screening of human genetic variants by pooled prime editing. Cell Genomics 2025; 5:100814. [PMID: 40120586 PMCID: PMC12008803 DOI: 10.1016/j.xgen.2025.100814] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2024] [Revised: 01/10/2025] [Accepted: 02/13/2025] [Indexed: 03/25/2025]
Abstract
Multiplexed assays of variant effect (MAVEs) enable scalable functional assessment of human genetic variants. However, established MAVEs are limited by exogenous expression of variants or constraints of genome editing. Here, we introduce a pooled prime editing (PE) platform to scalably assay variants in their endogenous context. We first improve efficiency of PE in HAP1 cells, defining optimal prime editing guide RNA (pegRNA) designs and establishing enrichment of edited cells via co-selection. We next demonstrate negative selection screening by testing over 7,500 pegRNAs targeting SMARCB1 and observing depletion of efficiently installed loss-of-function (LoF) variants. We then screen for LoF variants in MLH1 via 6-thioguanine selection, testing 65.3% of all possible SNVs in a 200-bp region including exon 10 and 362 non-coding variants from ClinVar spanning a 60-kb region. The platform's overall accuracy for discriminating pathogenic variants indicates that it will be highly valuable for identifying new variants underlying diverse human phenotypes across large genomic regions.
Affiliation(s)
- Michael Herger
- The Genome Function Laboratory, The Francis Crick Institute, London NW1 1AT, UK
- Christina M Kajba
- The Genome Function Laboratory, The Francis Crick Institute, London NW1 1AT, UK
- Megan Buckley
- The Genome Function Laboratory, The Francis Crick Institute, London NW1 1AT, UK
- Ana Cunha
- Viral Vector Core, Human Biology Facility, The Francis Crick Institute, London NW1 1AT, UK
- Molly Strom
- Viral Vector Core, Human Biology Facility, The Francis Crick Institute, London NW1 1AT, UK
- Gregory M Findlay
- The Genome Function Laboratory, The Francis Crick Institute, London NW1 1AT, UK.
11
Orenbuch R, Shearer CA, Kollasch AW, Spinner HD, Hopf TA, van Niekerk L, Franceschi D, Dias M, Frazer J, Marks DS. Proteome-wide model for human disease genetics. medRxiv 2025:2023.11.27.23299062. [PMID: 38076790 PMCID: PMC10705666 DOI: 10.1101/2023.11.27.23299062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]
Abstract
Identifying variants driving disease accelerates both genetic diagnosis and therapeutic development, but missense variants still present a bottleneck as their effects are less straightforward than truncations or nonsense mutations. While computational prediction methods are sufficiently accurate to be of clinical value for variants in known disease genes, they do not generalize well to other genes as the scores are not calibrated across the proteome [1-6]. To address this, we developed a deep generative model, popEVE, that combines evolutionary information with population sequence data [7] and achieves state-of-the-art performance on a suite of proteome-wide prediction tasks, without overestimating the prevalence of deleterious variants in the population. popEVE identifies 442 genes in a developmental disorder cohort [8], including evidence of 123 novel candidates, many without the need for cohort-wide enrichment. Candidate genes are functionally similar to known developmental disorder genes and case variants tend to fall in functionally important regions of these genes. Finally, we show that these findings can be reproduced from analysis of the patient exomes alone, demonstrating that popEVE provides a new avenue for genetic analysis in situations where traditional methods fail, including genetic diagnosis of rare-as-one diseases, even in the absence of parent sequencing.
Collapse
|
12
|
LaFlam TN, Billesbølle CB, Dinh T, Wolfreys FD, Lu E, Matteson T, An J, Xu Y, Singhal A, Brandes N, Ntranos V, Manglik A, Cyster JG, Ye CJ. Phenotypic pleiotropy of missense variants in human B cell-confinement receptor P2RY8. bioRxiv 2025:2025.02.28.640567. [PMID: 40093123 PMCID: PMC11908195 DOI: 10.1101/2025.02.28.640567] [Indexed: 03/19/2025]
Abstract
Missense variants can have pleiotropic effects on protein function and predicting these effects can be difficult. We performed near-saturation deep mutational scanning of P2RY8, a G-protein-coupled receptor that promotes germinal center B cell confinement. We assayed the effect of each variant on surface expression, migration, and proliferation. We delineated variants that affected both expression and function, affected function independently of expression, and discrepantly affected migration and proliferation. We also used cryo-electron microscopy to determine the structure of activated, ligand-bound P2RY8, providing structural insights into the effects of variants on ligand binding and signal transmission. We applied the deep mutational scanning results to both improve computational variant effect predictions and to characterize the phenotype of germline variants and lymphoma-associated variants. Together, our results demonstrate the power of integrating deep mutational scanning, structure determination, and in silico prediction to advance the understanding of a receptor important in human health.
Affiliation(s)
- Taylor N. LaFlam
- Division of Pediatric Rheumatology, Department of Pediatrics, University of California, San Francisco, San Francisco, CA, USA
- Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA, USA
- Gladstone-UCSF Institute of Genomic Immunology, San Francisco, CA, USA
| | - Christian B. Billesbølle
- Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA, USA
| | - Tuan Dinh
- Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA
| | - Finn D. Wolfreys
- Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA, USA
- Howard Hughes Medical Institute, University of California, San Francisco, San Francisco, CA, USA
| | - Erick Lu
- Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA, USA
- Howard Hughes Medical Institute, University of California, San Francisco, San Francisco, CA, USA
| | - Tomas Matteson
- Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
| | - Jinping An
- Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA, USA
- Howard Hughes Medical Institute, University of California, San Francisco, San Francisco, CA, USA
| | - Ying Xu
- Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA, USA
- Howard Hughes Medical Institute, University of California, San Francisco, San Francisco, CA, USA
| | - Arushi Singhal
- Gladstone-UCSF Institute of Genomic Immunology, San Francisco, CA, USA
| | - Nadav Brandes
- Department of Biochemistry and Molecular Pharmacology, New York University, New York, NY, USA
| | - Vasilis Ntranos
- Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Diabetes Center, University of California, San Francisco, CA, USA
| | - Aashish Manglik
- Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
- Quantitative Biosciences Institute, San Francisco, CA, USA
- Department of Anesthesia and Perioperative Care, University of California, San Francisco, San Francisco, CA, USA
| | - Jason G. Cyster
- Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA, USA
- Howard Hughes Medical Institute, University of California, San Francisco, San Francisco, CA, USA
| | - Chun Jimmie Ye
- Gladstone-UCSF Institute of Genomic Immunology, San Francisco, CA, USA
- Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA
- Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
- Parker Institute for Cancer Immunotherapy, University of California, San Francisco, San Francisco, CA, USA
- Arc Institute, Palo Alto, CA, USA
|
13
|
Albors C, Li JC, Benegas G, Ye C, Song YS. A Phylogenetic Approach to Genomic Language Modeling. arXiv 2025:arXiv:2503.03773v1. [PMID: 40093357 PMCID: PMC11908359] [Indexed: 03/19/2025]
Abstract
Genomic language models (gLMs) have shown mostly modest success in identifying evolutionarily constrained elements in mammalian genomes. To address this issue, we introduce a novel framework for training gLMs that explicitly models nucleotide evolution on phylogenetic trees using multispecies whole-genome alignments. Our approach integrates an alignment into the loss function during training but does not require it for making predictions, thereby enhancing the model's applicability. We applied this framework to train PhyloGPN, a model that excels at predicting functionally disruptive variants from a single sequence alone and demonstrates strong transfer learning capabilities.
Affiliation(s)
- Carlos Albors
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA
| | - Jianan Canal Li
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA
| | - Gonzalo Benegas
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA
| | - Chengzhong Ye
- Department of Statistics, University of California, Berkeley, CA 94720, USA
| | - Yun S Song
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA
- Department of Statistics, University of California, Berkeley, CA 94720, USA
|
14
|
Ma K, Yang X, Mao Y. Advancing evolutionary medicine with complete primate genomes and advanced biotechnologies. Trends Genet 2025; 41:201-217. [PMID: 39627062 DOI: 10.1016/j.tig.2024.11.001] [Received: 09/26/2024] [Revised: 11/03/2024] [Accepted: 11/06/2024] [Indexed: 03/06/2025]
Abstract
Evolutionary medicine, which integrates evolutionary biology and medicine, significantly enhances our understanding of human traits and disease susceptibility. However, previous studies in this field have often focused on single-nucleotide variants due to technological limitations in characterizing complex genomic regions, hindering comprehensive analyses of their evolutionary origins and clinical significance. In this review, we summarize recent advancements in complete telomere-to-telomere (T2T) primate genomes and other primate resources, and illustrate how these resources facilitate research on complex regions. We focus on several biomedically relevant regions to examine the relationship between primate genome evolution and human diseases. We also highlight the potential of high-throughput functional genomic technologies for assessing candidate loci. Finally, we discuss future directions for primate research within the context of evolutionary medicine.
Affiliation(s)
- Kaiyue Ma
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Xiangyu Yang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Yafei Mao
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China; Center for Genomic Research, International Institutes of Medicine, Fourth Affiliated Hospital, Zhejiang University, Yiwu, Zhejiang, China.
|
15
|
Rastogi R, Chung R, Li S, Li C, Lee K, Woo J, Kim DW, Keum C, Babbi G, Martelli PL, Savojardo C, Casadio R, Chennen K, Weber T, Poch O, Ancien F, Cia G, Pucci F, Raimondi D, Vranken W, Rooman M, Marquet C, Olenyi T, Rost B, Andreoletti G, Kamandula A, Peng Y, Bakolitsa C, Mort M, Cooper DN, Bergquist T, Pejaver V, Liu X, Radivojac P, Brenner SE, Ioannidis NM. Critical assessment of missense variant effect predictors on disease-relevant variant data. Hum Genet 2025; 144:281-293. [PMID: 40113603 PMCID: PMC11976771 DOI: 10.1007/s00439-025-02732-2] [Received: 06/06/2024] [Accepted: 02/07/2025] [Indexed: 03/22/2025]
Abstract
Regular, systematic, and independent assessments of computational tools that are used to predict the pathogenicity of missense variants are necessary to evaluate their clinical and research utility and guide future improvements. The Critical Assessment of Genome Interpretation (CAGI) conducts the ongoing Annotate-All-Missense (Missense Marathon) challenge, in which missense variant effect predictors (also called variant impact predictors) are evaluated on missense variants added to disease-relevant databases following the prediction submission deadline. Here we assess predictors submitted to the CAGI 6 Annotate-All-Missense challenge, predictors commonly used in clinical genetics, and recently developed deep learning methods. We examine performance across a range of settings relevant for clinical and research applications, focusing on different subsets of the evaluation data as well as high-specificity and high-sensitivity regimes. Our evaluations reveal notable advances in current methods relative to older, well-cited tools in the field. While meta-predictors tend to outperform their constituent individual predictors, several newer individual predictors perform comparably to commonly used meta-predictors. Predictor performance varies between high-specificity and high-sensitivity regimes, highlighting that different methods may be optimal for different use cases. We also characterize two potential sources of bias. Predictors that incorporate allele frequency as a predictive feature tend to have reduced performance when distinguishing pathogenic variants from very rare benign variants, and predictors trained on pathogenicity labels from curated variant databases often inherit gene-level label imbalances. Our findings help illuminate the clinical and research utility of modern missense variant effect predictors and identify potential areas for future development.
Affiliation(s)
- Ruchir Rastogi
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA.
| | - Ryan Chung
- Center for Computational Biology, University of California, Berkeley, CA, USA
| | - Sindy Li
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
| | - Chang Li
- USF Genomics, College of Public Health, University of South Florida, Tampa, FL, USA
| | - Giulia Babbi
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Pier Luigi Martelli
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Castrense Savojardo
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Rita Casadio
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - François Ancien
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
| | - Gabriel Cia
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
| | - Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
| | - Daniele Raimondi
- ESAT-STADIUS, KU Leuven, Leuven, Belgium
- Institut de Génétique Moléculaire de Montpellier, Université de Montpellier, Montpellier, France
| | - Wim Vranken
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
| | - Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
| | - Céline Marquet
- Department of Informatics, Bioinformatics and Computational Biology, Technical University of Munich, Munich, Germany
| | - Tobias Olenyi
- Department of Informatics, Bioinformatics and Computational Biology, Technical University of Munich, Munich, Germany
| | - Burkhard Rost
- Department of Informatics, Bioinformatics and Computational Biology, Technical University of Munich, Munich, Germany
| | - Gaia Andreoletti
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
- Sage Bionetworks, Seattle, WA, USA
| | - Akash Kamandula
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
| | - Yisu Peng
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
| | - Constantina Bakolitsa
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
| | - Matthew Mort
- Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff, UK
| | - David N Cooper
- Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff, UK
| | - Timothy Bergquist
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Vikas Pejaver
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Xiaoming Liu
- USF Genomics, College of Public Health, University of South Florida, Tampa, FL, USA
| | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
| | - Steven E Brenner
- Center for Computational Biology, University of California, Berkeley, CA, USA.
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA.
| | - Nilah M Ioannidis
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA.
- Center for Computational Biology, University of California, Berkeley, CA, USA.
- Chan Zuckerberg Biohub, San Francisco, CA, USA.
|
16
|
Kimura H, Lahouel K, Tomasetti C, Roberts NJ. Functional characterization of all CDKN2A missense variants and comparison to in silico models of pathogenicity. bioRxiv 2025:2023.12.28.573507. [PMID: 38234851 PMCID: PMC10793438 DOI: 10.1101/2023.12.28.573507] [Indexed: 01/19/2024]
Abstract
Interpretation of variants identified during genetic testing is a significant clinical challenge. In this study, we developed a high-throughput CDKN2A functional assay and characterized all possible CDKN2A missense variants. We found that 17.7% of all missense variants were functionally deleterious. We also used our functional classifications to assess the performance of in silico models that predict the effect of variants, including recently reported models based on machine learning. Notably, we found that all in silico models performed similarly when compared to our functional classifications with accuracies of 39.5-85.4%. Furthermore, while we found that functionally deleterious variants were enriched within ankyrin repeats, we did not identify any residues where all missense variants were functionally deleterious. Our functional classifications are a resource to aid the interpretation of CDKN2A variants and have important implications for the application of variant interpretation guidelines, particularly the use of in silico models for clinical variant interpretation.
Affiliation(s)
- Hirokazu Kimura
- Department of Pathology, the Johns Hopkins University School of Medicine; Baltimore, 21287, USA
| | - Kamel Lahouel
- Division of Integrated Genomics, Translational Genomics Research Institute; Phoenix, 85004, USA
- Department of Computational and Quantitative Medicine, Beckman Research Institute, City of Hope; Duarte, 91010, USA
| | - Cristian Tomasetti
- Division of Integrated Genomics, Translational Genomics Research Institute; Phoenix, 85004, USA
- Department of Computational and Quantitative Medicine, Beckman Research Institute, City of Hope; Duarte, 91010, USA
| | - Nicholas J. Roberts
- Department of Pathology, the Johns Hopkins University School of Medicine; Baltimore, 21287, USA
- Department of Oncology, the Johns Hopkins University School of Medicine; Baltimore, 21287, USA
|
17
|
Chen YM, Hsiao TH, Lin CH, Fann YC. Unlocking precision medicine: clinical applications of integrating health records, genetics, and immunology through artificial intelligence. J Biomed Sci 2025; 32:16. [PMID: 39915780 PMCID: PMC11804102 DOI: 10.1186/s12929-024-01110-w] [Received: 06/28/2024] [Accepted: 12/02/2024] [Indexed: 02/09/2025] Open
Abstract
Artificial intelligence (AI) has emerged as a transformative force in precision medicine, revolutionizing the integration and analysis of health records, genetics, and immunology data. This comprehensive review explores the clinical applications of AI-driven analytics in unlocking personalized insights for patients with autoimmune rheumatic diseases. Through the synergistic approach of integrating AI across diverse data sets, clinicians gain a holistic view of patient health and potential risks. Machine learning models excel at identifying high-risk patients, predicting disease activity, and optimizing therapeutic strategies based on clinical, genomic, and immunological profiles. Deep learning techniques have significantly advanced variant calling, pathogenicity prediction, splicing analysis, and MHC-peptide binding predictions in genetics. AI-enabled immunology data analysis, including dimensionality reduction, cell population identification, and sample classification, provides unprecedented insights into complex immune responses. The review highlights real-world examples of AI-driven precision medicine platforms and clinical decision support tools in rheumatology. Evaluation of outcomes demonstrates the clinical benefits and impact of these approaches in revolutionizing patient care. However, challenges such as data quality, privacy, and clinician trust must be navigated for successful implementation. The future of precision medicine lies in the continued research, development, and clinical integration of AI-driven strategies to unlock personalized patient care and drive innovation in rheumatology.
Affiliation(s)
- Yi-Ming Chen
- Division of Allergy, Immunology and Rheumatology, Department of Internal Medicine, Taichung Veterans General Hospital, Taichung, 40705, Taiwan
- School of Medicine, National Yang Ming Chiao Tung University, Taipei, 11221, Taiwan
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, 40705, Taiwan
- Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taipei, 112304, Taiwan
- Graduate Institute of Clinical Medicine, College of Medicine, National Chung Hsing University, Taichung, 402202, Taiwan
- Precision Medicine Research Center, College of Medicine, National Chung Hsing University, Taichung, 402202, Taiwan
| | - Tzu-Hung Hsiao
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, 40705, Taiwan
- Department of Public Health, College of Medicine, Fu Jen Catholic University, New Taipei City, 242062, Taiwan
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, 402202, Taiwan
| | - Ching-Heng Lin
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, 40705, Taiwan.
- Department of Public Health, College of Medicine, Fu Jen Catholic University, New Taipei City, 242062, Taiwan.
- Department of Industrial Engineering and Enterprise Information, Tunghai University, Taichung, 407224, Taiwan.
- Institute of Public Health and Community Medicine Research Center, National Yang Ming Chiao Tung University, Taipei, 11221, Taiwan.
| | - Yang C Fann
- Division of Intramural Research, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, 20892, USA.
|
18
|
Trivedi M, Arekar K, Manu S, Kuderna LFK, Rogers J, Farh KK, Bonet TM, Umapathy G. Historical Demography and Species Distribution Models Shed Light on Speciation in Primates of Northeast India. Ecol Evol 2025; 15:e70968. [PMID: 40008062 PMCID: PMC11850985 DOI: 10.1002/ece3.70968] [Received: 05/17/2024] [Revised: 01/21/2025] [Accepted: 01/22/2025] [Indexed: 02/27/2025] Open
Abstract
Past climate change is one of the important factors influencing primate speciation. Populations of various species could have risen or declined in response to these climatic fluctuations. Northeast India harbors a rich diversity of primates, where such fluctuations can be implicated. Recent advances in climate modeling and genomic data analysis have paved the way for understanding how species accumulate in a particular geographic region. We utilized these methods to explore the primate diversity in this unique region in relation to past climate change. To ascertain population-level changes, we inferred the demographic history of nine species of primates found in Northeast India and compared it with species distribution models of the Pliocene and Pleistocene periods. Through this study, we are able to provide a detailed picture of how past climatic changes have resulted in the present species diversity, and of how this mixture of species has either originated in the region or dispersed from mainland Southeast Asia. We observe that effective population size has decreased for all the species, but distributions differ among the four genera: Macaca, Trachypithecus, Hoolock and Nycticebus. The study also gives an idea of how each species is affected differently by climate change, and why this should be given emphasis in framing species-wise conservation models for future climate change.
Affiliation(s)
- Mihir Trivedi
- Laboratory for the Conservation of Endangered Species, CSIR‐Centre for Cellular and Molecular Biology, Hyderabad, India
| | - Kunal Arekar
- Centre for Ecological Sciences, Indian Institute of Science, Bangalore, India
| | - Shivakumara Manu
- Laboratory for the Conservation of Endangered Species, CSIR‐Centre for Cellular and Molecular Biology, Hyderabad, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
| | - Jeffrey Rogers
- Department of Molecular and Human Genetics, Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
| | - Tomas Marques Bonet
- Institute of Evolutionary Biology (UPF‐CSIC), PRBB, Barcelona, Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA) and Universitat Pompeu Fabra, Barcelona, Spain
| | - Govindhaswamy Umapathy
- Laboratory for the Conservation of Endangered Species, CSIR‐Centre for Cellular and Molecular Biology, Hyderabad, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
|
19
|
Lauer L, Rivas MA. Unified meta regression models for rare variant association studies. bioRxiv 2025:2025.01.23.634522. [PMID: 39896616 PMCID: PMC11785203 DOI: 10.1101/2025.01.23.634522] [Indexed: 02/04/2025]
Abstract
Rare variant association studies (RVAS) of complex traits have emerged as a powerful approach to advance drug discovery and diagnostics. Missense pathogenicity predictions from AlphaMissense based on structural context and protein language models improve the differentiation between benign and deleterious variants. Constraint metrics, on the other hand, allow researchers to pinpoint genomic regions under selective pressure that may not directly impact protein structure, but are more likely to contain functionally important mutations. Loss-of-function (LoF) variants, which result in the complete or partial loss of protein function, are particularly informative, as it is more straightforward to assess their downstream functional consequences. In this study, we present a unified meta regression model approach that incorporates the probability of pathogenicity, the probability of constraint, and an indicator of whether a variant is a predicted loss-of-function or missense variant as features to model the observed effect size and the uncertainty of the effect size obtained from single-variant genetic analysis. We applied the unified meta regression model to 1,144 continuous phenotypes from UK Biobank using single-variant summary statistics obtained from Genebass. We replicated our findings using the All of Us cohort. For each gene discovery, we make available a characterization of whether constrained sites are associated with the phenotype, whether pathogenic sites determined by structure-based predictions are associated with the phenotype, and whether broader loss-of-function or missense variant annotation better explains the summary statistics observed. Our results are publicly available at Global Biobank Engine (https://biobankengine.shinyapps.io/phenome-wide-unified-model/).
Affiliation(s)
| | - Manuel A. Rivas
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA 94305
|
20
|
Kikuchi Y, Uddin M, Veltman JA, Wells S, Morris C, Woodbury-Smith M. Evolutionary constrained genes associated with autism spectrum disorder across 2,054 nonhuman primate genomes. Mol Autism 2025; 16:5. [PMID: 39849619 PMCID: PMC11755938 DOI: 10.1186/s13229-024-00633-1] [Received: 06/21/2024] [Accepted: 12/11/2024] [Indexed: 01/25/2025] Open
Abstract
BACKGROUND Significant progress has been made in elucidating the genetic underpinnings of Autism Spectrum Disorder (ASD). However, there are still significant gaps in our understanding of the link between genomics, neurobiology and clinical phenotype in scientific discovery. New models are therefore needed to address these gaps. Rhesus macaques (Macaca mulatta) have been extensively used for preclinical neurobiological research because of remarkable similarities to humans across biology and behaviour that cannot be captured by other experimental animals. METHODS We used the macaque Genotype and Phenotype (mGAP) resource consisting of 2,054 macaque genomes to examine patterns of evolutionary constraint in known human neurodevelopmental genes. Residual variation intolerance scores (RVIS) were calculated for all annotated autosomal genes (N = 18,168) and Gene Set Enrichment Analysis (GSEA) was used to examine patterns of constraint across ASD genes and related neurodevelopmental genes. RESULTS We demonstrated that patterns of constraint across autosomal genes are correlated in humans and macaques, and that ASD-associated genes exhibit significant constraint in macaques (p = 9.4 × 10⁻²⁷). Among macaques, many key ASD-implicated genes were observed to harbour predicted damaging mutations. A small number of key ASD-implicated genes that are highly intolerant to mutation in humans, however, showed no evidence of similar intolerance in macaques (CACNA1D, MBD5, AUTS2 and NRXN1). Constraint was also observed across genes associated with intellectual disability (p = 1.1 × 10⁻⁴⁶), epilepsy (p = 2.1 × 10⁻³³) and schizophrenia (p = 4.2 × 10⁻⁴⁵), and for an overlapping neurodevelopmental gene set (p = 4.0 × 10⁻¹⁰). LIMITATIONS The lack of behavioural phenotypes among the macaques whose genotypes were studied means that we are unable to further investigate whether genetic variants have similar phenotypic consequences among nonhuman primates.
CONCLUSION The presence of pathological mutations in ASD genes among macaques, along with evidence of similar genetic constraints to those in humans, provides a strong rationale for further investigation of genotype-phenotype relationships in macaques. This highlights the importance of developing primate models of ASD to elucidate the neurobiological underpinnings and advance approaches for precision medicine and therapeutic interventions.
Affiliation(s)
- Yukiko Kikuchi
- Biosciences Institute, Newcastle University, Newcastle upon Tyne, UK.
| | - Mohammed Uddin
- Center for Applied and Translational Genomics (CATG), Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, UAE
- GenomeArc Inc, Mississauga, ON, Canada
| | - Joris A Veltman
- Biosciences Institute, Newcastle University, Newcastle upon Tyne, UK
| | - Sara Wells
- MRC Centre for Macaques, Salisbury, UK
- Mary Lyon Centre at MRC Harwell, Oxfordshire, UK
| | - Christopher Morris
- Translational and Clinical Research Institute, Newcastle University, Newcastle upon Tyne, UK
| | - Marc Woodbury-Smith
- Biosciences Institute, Newcastle University, Newcastle upon Tyne, UK.
- Department of Psychiatry, Queen's University, Kingston, ON, Canada.
| |
|
21
|
López-Cortegano E, Chebib J, Jonas A, Vock A, Künzel S, Keightley PD, Tautz D. The rate and spectrum of new mutations in mice inferred by long-read sequencing. Genome Res 2025; 35:43-54. [PMID: 39622636 PMCID: PMC11789640 DOI: 10.1101/gr.279982.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2024] [Accepted: 11/26/2024] [Indexed: 01/12/2025]
Abstract
All forms of genetic variation originate from new mutations, making it crucial to understand their rates and mechanisms. Here, we use long-read sequencing from Pacific Biosciences (PacBio) to investigate de novo mutations that accumulated in 12 inbred mouse lines derived from three commonly used inbred strains (C3H, C57BL/6, and FVB) maintained for 8 to 15 generations in a mutation accumulation (MA) experiment. We built chromosome-level genome assemblies based on the MA line founders' genomes and then employed a combination of read and assembly-based methods to call the complete spectrum of new mutations. On average, there are about 45 mutations per haploid genome per generation, about half of which (54%) are insertions and deletions shorter than 50 bp (indels). The remainder are single-nucleotide mutations (SNMs; 44%) and large structural mutations (SMs; 2%). We found that the degree of DNA repetitiveness is positively correlated with SNM and indel rates and that a substantial fraction of SMs can be explained by homology-dependent mechanisms associated with repeat sequences. Most (90%) indels can be attributed to microsatellite contractions and expansions, and there is a marked bias toward 4 bp indels. Among the different types of SMs, tandem repeat mutations have the highest mutation rate, followed by insertions of transposable elements (TEs). We uncover a rich landscape of active TEs, notable differences in their spectrum among MA lines and strains, and a high rate of gene retroposition. Our study offers novel insights into mammalian genome evolution and highlights the importance of repetitive elements in shaping genomic diversity.
Affiliation(s)
- Eugenio López-Cortegano
- Institute of Ecology and Evolution, University of Edinburgh, Edinburgh EH9 3FL, United Kingdom
| | - Jobran Chebib
- Institute of Ecology and Evolution, University of Edinburgh, Edinburgh EH9 3FL, United Kingdom
| | - Anika Jonas
- Department for Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, 24306 Plön, Germany
| | - Anastasia Vock
- Department for Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, 24306 Plön, Germany
| | - Sven Künzel
- Department for Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, 24306 Plön, Germany
| | - Peter D Keightley
- Institute of Ecology and Evolution, University of Edinburgh, Edinburgh EH9 3FL, United Kingdom
| | - Diethard Tautz
- Department for Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, 24306 Plön, Germany
| |
|
22
|
Benegas G, Albors C, Aw AJ, Ye C, Song YS. A DNA language model based on multispecies alignment predicts the effects of genome-wide variants. Nat Biotechnol 2025:10.1038/s41587-024-02511-w. [PMID: 39747647 DOI: 10.1038/s41587-024-02511-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 11/20/2024] [Indexed: 01/04/2025]
Abstract
Protein language models have demonstrated remarkable performance in predicting the effects of missense variants but DNA language models have not yet shown a competitive edge for complex genomes such as that of humans. This limitation is particularly evident when dealing with the vast complexity of noncoding regions that comprise approximately 98% of the human genome. To tackle this challenge, we introduce GPN-MSA (genomic pretrained network with multiple-sequence alignment), a framework that leverages whole-genome alignments across multiple species while taking only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC and OMIM), experimental functional assays (deep mutational scanning and DepMap) and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and noncoding variants. We provide precomputed scores for all ~9 billion possible single-nucleotide variants in the human genome. We anticipate that our advances in genome-wide variant effect prediction will enable more accurate rare disease diagnosis and improve rare variant burden testing.
Affiliation(s)
- Gonzalo Benegas
- Graduate Group in Computational Biology, University of California, Berkeley, CA, US
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, US
| | - Carlos Albors
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, US
| | - Alan J Aw
- Department of Statistics, University of California, Berkeley, CA, US
| | - Chengzhong Ye
- Department of Statistics, University of California, Berkeley, CA, US
| | - Yun S Song
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, US.
- Department of Statistics, University of California, Berkeley, CA, US.
- Center for Computational Biology, University of California, Berkeley, CA, US.
| |
|
23
|
Petrazzini BO, Balick DJ, Forrest IS, Cho J, Rocheleau G, Jordan DM, Do R. Ensemble and consensus approaches to prediction of recessive inheritance for missense variants in human disease. CELL REPORTS METHODS 2024; 4:100914. [PMID: 39657681 PMCID: PMC11704621 DOI: 10.1016/j.crmeth.2024.100914] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 09/19/2024] [Accepted: 11/13/2024] [Indexed: 12/12/2024]
Abstract
Mode of inheritance (MOI) is necessary for clinical interpretation of pathogenic variants; however, the majority of variants lack this information. Furthermore, variant effect predictors are fundamentally insensitive to recessive-acting diseases. Here, we present MOI-Pred, a variant pathogenicity prediction tool that accounts for MOI, and ConMOI, a consensus method that integrates variant MOI predictions from three independent tools. MOI-Pred integrates evolutionary and functional annotations to produce variant-level predictions that are sensitive to both dominant-acting and recessive-acting pathogenic variants. Both MOI-Pred and ConMOI show state-of-the-art performance on standard benchmarks. Importantly, dominant and recessive predictions from both tools are enriched in individuals with pathogenic variants for dominant- and recessive-acting diseases, respectively, in a real-world electronic health record (EHR)-based validation approach of 29,981 individuals. ConMOI outperforms its component methods in benchmarking and validation, demonstrating the value of consensus among multiple prediction methods. Predictions for all possible missense variants are provided in the "Data and code availability" section.
Affiliation(s)
- Ben O Petrazzini
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Daniel J Balick
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Iain S Forrest
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Medical Scientist Training Program, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Judy Cho
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Ghislain Rocheleau
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Daniel M Jordan
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Ron Do
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
| |
|
24
|
Lehr AW, McDaniel KF, Roche KW. Analyses of Human Genetic Data to Identify Clinically Relevant Domains of Neuroligins. Genes (Basel) 2024; 15:1601. [PMID: 39766868 PMCID: PMC11675371 DOI: 10.3390/genes15121601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2024] [Revised: 12/03/2024] [Accepted: 12/11/2024] [Indexed: 01/30/2025] Open
Abstract
Background/Objectives: Neuroligins (NLGNs) are postsynaptic adhesion molecules critical for neuronal development that are highly associated with autism spectrum disorder (ASD). Here, we provide an overview of the literature on NLGN rare variants. In addition, we introduce a new approach to analyze human variation within NLGN genes to identify sensitive regions that have an increased frequency of ASD-associated variants to better understand NLGN function. Methods: To identify critical protein subdomains within the NLGN gene family, we developed an algorithm that assesses tolerance to missense mutations in human genetic variation by comparing clinical variants from ClinVar to reference variants from gnomAD. This approach provides tolerance values to subdomains within the protein. Results: Our algorithm identified several critical regions that were conserved across multiple NLGN isoforms. Importantly, this approach also identified a previously reported cluster of pathogenic variants in NLGN4X (also conserved in NLGN1 and NLGN3) as well as a region around the highly characterized NLGN3 R451C ASD-associated mutation. Additionally, we highlighted other, as of yet, uncharacterized regions enriched with mutations. Conclusions: The systematic analysis of NLGN ASD-associated variants compared to variants identified in the unaffected population (gnomAD) reveals conserved domains in NLGN isoforms that are tolerant to variation or are enriched in clinically relevant variants. Examination of databases also allows for predictions of the presumed tolerance to loss of an allele. The use of the algorithm we developed effectively allowed the evaluation of subdomains of NLGNs and can be used to examine other ASD-associated genes.
Affiliation(s)
- Alexander W. Lehr
- Receptor Biology Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA
- Department of Neuroscience, Brown University, Providence, RI 02906, USA
| | - Kathryn F. McDaniel
- Receptor Biology Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA
- Department of Neuroscience, Brown University, Providence, RI 02906, USA
| | - Katherine W. Roche
- Receptor Biology Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA
| |
|
25
|
Dias M, Orenbuch R, Marks DS, Frazer J. Toward trustable use of machine learning models of variant effects in the clinic. Am J Hum Genet 2024; 111:2589-2593. [PMID: 39561772 PMCID: PMC11639075 DOI: 10.1016/j.ajhg.2024.10.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2024] [Revised: 10/17/2024] [Accepted: 10/17/2024] [Indexed: 11/21/2024] Open
Abstract
There has been considerable progress in building models to predict the effect of missense substitutions in protein-coding genes, fueled in large part by progress in applying deep learning methods to sequence data. These models have the potential to enable clinical variant annotation on a large scale and hence increase the impact of patient sequencing in guiding diagnosis and treatment. To realize this potential, it is essential to provide reliable assessments of model performance, scope of applicability, and robustness. As a response to this need, the ClinGen Sequence Variant Interpretation Working Group, Pejaver et al., recently proposed a strategy for validation and calibration of in-silico predictions in the context of guidelines for variant annotation. While this work marks an important step forward, the strategy presented still has important limitations. We propose core principles and recommendations to overcome these limitations that can enable both more reliable and more impactful use of variant effect prediction models in the future.
Affiliation(s)
- Mafalda Dias
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; University Pompeu Fabra, Barcelona, Spain
| | - Rose Orenbuch
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Debora S Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA; Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Jonathan Frazer
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; University Pompeu Fabra, Barcelona, Spain.
| |
|
26
|
Hong J, Lee D, Hwang A, Kim T, Ryu HY, Choi J. Rare disease genomics and precision medicine. Genomics Inform 2024; 22:28. [PMID: 39627904 PMCID: PMC11616305 DOI: 10.1186/s44342-024-00032-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2024] [Accepted: 11/16/2024] [Indexed: 12/06/2024] Open
Abstract
Rare diseases, though individually uncommon, collectively affect millions worldwide. Genomic technologies and big data analytics have revolutionized diagnosing and understanding these conditions. This review explores the role of genomics in rare disease research, the impact of large consortium initiatives, advancements in extensive data analysis, the integration of artificial intelligence (AI) and machine learning (ML), and the therapeutic implications in precision medicine. We also discuss the challenges of data sharing and privacy concerns, emphasizing the need for collaborative efforts and secure data practices to advance rare disease research.
Affiliation(s)
- Juhyeon Hong
- Department of Biomedical Sciences, Korea University College of Medicine, Seoul, 02841, Republic of Korea
| | - Dajun Lee
- Department of Biomedical Sciences, Korea University College of Medicine, Seoul, 02841, Republic of Korea
| | - Ayoung Hwang
- Department of Biomedical Sciences, Korea University College of Medicine, Seoul, 02841, Republic of Korea
| | - Taekeun Kim
- Department of Biomedical Sciences, Korea University College of Medicine, Seoul, 02841, Republic of Korea
| | - Hong-Yeoul Ryu
- School of Life Sciences, BK21 FOUR KNU Creative BioResearch Group, College of Natural Sciences, Kyungpook National University, Daegu, 41566, Republic of Korea
| | - Jungmin Choi
- Department of Biomedical Sciences, Korea University College of Medicine, Seoul, 02841, Republic of Korea.
| |
|
27
|
Williams JPC, Mouilleron S, Trapero RH, Bertran MT, Marsh JA, Walport LJ. Structural insight into the function of human peptidyl arginine deiminase 6. Comput Struct Biotechnol J 2024; 23:3258-3269. [PMID: 39286527 PMCID: PMC11402830 DOI: 10.1016/j.csbj.2024.08.019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2024] [Revised: 08/15/2024] [Accepted: 08/15/2024] [Indexed: 09/19/2024] Open
Abstract
Peptidyl arginine deiminase 6 (PADI6 or PAD6) is vital for early embryonic development in mice and humans, yet its function remains elusive. PADI6 is less conserved than other PADIs and it is currently unknown whether it has a catalytic function. Here we show that human PADI6 dimerises like hPADIs 2-4 but does not bind Ca²⁺ and is inactive in in vitro assays against standard PADI substrates. By determining the crystal structure of hPADI6, we show that hPADI6 is structured in the absence of Ca²⁺, where hPADI2 and hPADI4 are not, and that the Ca-binding sites are not conserved. Moreover, we show that whilst the key catalytic aspartic acid and histidine residues are structurally conserved, the cysteine is displaced far from the active site centre and the hPADI6 active site pocket appears closed through a unique evolved mechanism in hPADI6, not present in the other PADIs. Taken together, these findings provide insight into how the function of hPADI6 may differ from the other PADIs based on its structure and provide a resource for characterising the damaging effect of clinically significant PADI6 variants.
Affiliation(s)
- Jack P C Williams
- Department of Chemistry, Imperial College London, London, United Kingdom
- Protein-Protein Interaction Laboratory, The Francis Crick Institute, London, United Kingdom
| | - Stephane Mouilleron
- Structural Biology Science Technology Platform, The Francis Crick Institute, London, United Kingdom
| | - Rolando Hernandez Trapero
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, United Kingdom
| | - M Teresa Bertran
- Protein-Protein Interaction Laboratory, The Francis Crick Institute, London, United Kingdom
| | - Joseph A Marsh
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, United Kingdom
| | - Louise J Walport
- Department of Chemistry, Imperial College London, London, United Kingdom
- Protein-Protein Interaction Laboratory, The Francis Crick Institute, London, United Kingdom
| |
|
28
|
Zhu Z, Han C, Huang S. New insights shed light on the enigma of genetic diversity and species complexity. SCIENCE CHINA. LIFE SCIENCES 2024; 67:2774-2776. [PMID: 39167323 DOI: 10.1007/s11427-023-2610-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Accepted: 05/04/2024] [Indexed: 08/23/2024]
Affiliation(s)
- Zuobin Zhu
- Xuzhou Engineering Research Center of Medical Genetics and Transformation, Key Laboratory of Genetic Foundation and Clinical Application, Xuzhou Medical University, Xuzhou, 221004, China.
| | - Conghui Han
- Department of Urology, Xuzhou Clinical School of Xuzhou Medical University, Xuzhou Central Hospital, Xuzhou, 221009, China.
| | - Shi Huang
- Xuzhou Engineering Research Center of Medical Genetics and Transformation, Key Laboratory of Genetic Foundation and Clinical Application, Xuzhou Medical University, Xuzhou, 221004, China.
- Center for Medical Genetics, School of Life Sciences, Central South University, Changsha, 410078, China.
| |
|
29
|
Mascher M, Jayakodi M, Shim H, Stein N. Promises and challenges of crop translational genomics. Nature 2024; 636:585-593. [PMID: 39313530 PMCID: PMC7616746 DOI: 10.1038/s41586-024-07713-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2023] [Accepted: 06/13/2024] [Indexed: 09/25/2024]
Abstract
Crop translational genomics applies breeding techniques based on genomic datasets to improve crops. Technological breakthroughs in the past ten years have made it possible to sequence the genomes of increasing numbers of crop varieties and have assisted in the genetic dissection of crop performance. However, translating research findings to breeding applications remains challenging. Here we review recent progress and future prospects for crop translational genomics in bringing results from the laboratory to the field. Genetic mapping, genomic selection and sequence-assisted characterization and deployment of plant genetic resources utilize rapid genotyping of large populations. These approaches have all had an impact on breeding for qualitative traits, where single genes with large phenotypic effects exert their influence. Characterization of the complex genetic architectures that underlie quantitative traits such as yield and flowering time, especially in newly domesticated crops, will require further basic research, including research into regulation and interactions of genes and the integration of genomic approaches and high-throughput phenotyping, before targeted interventions can be designed. Future priorities for translation include supporting genomics-assisted breeding in low-income countries and adaptation of crops to changing environments.
Affiliation(s)
- Martin Mascher
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany.
| | - Murukarthick Jayakodi
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany
| | - Hyeonah Shim
- Department of Agriculture, Forestry and Bioresources, Plant Genomics and Breeding Institute, Research Institute of Agriculture and Life Sciences, College of Agriculture and Life Sciences, Seoul National University, Seoul, Korea
| | - Nils Stein
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.
- Martin Luther University Halle-Wittenberg, Halle, Germany.
| |
|
30
|
Ma K, Huang S, Ng KK, Lake NJ, Joseph S, Xu J, Lek A, Ge L, Woodman KG, Koczwara KE, Cohen J, Ho V, O'Connor CL, Brindley MA, Campbell KP, Lek M. Saturation mutagenesis-reinforced functional assays for disease-related genes. Cell 2024; 187:6707-6724.e22. [PMID: 39326416 PMCID: PMC11568926 DOI: 10.1016/j.cell.2024.08.047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Revised: 07/29/2024] [Accepted: 08/23/2024] [Indexed: 09/28/2024]
Abstract
Interpretation of disease-causing genetic variants remains a challenge in human genetics. Current costs and complexity of deep mutational scanning methods are obstacles for achieving genome-wide resolution of variants in disease-related genes. Our framework, saturation mutagenesis-reinforced functional assays (SMuRF), offers simple and cost-effective saturation mutagenesis paired with streamlined functional assays to enhance the interpretation of unresolved variants. Applying SMuRF to neuromuscular disease genes FKRP and LARGE1, we generated functional scores for all possible coding single-nucleotide variants, which aid in resolving clinically reported variants of uncertain significance. SMuRF also demonstrates utility in predicting disease severity, resolving critical structural regions, and providing training datasets for the development of computational predictors. Overall, our approach enables variant-to-function insights for disease genes in a cost-effective manner that can be broadly implemented by standard research laboratories.
Affiliation(s)
- Kaiyue Ma
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA; Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China.
| | - Shushu Huang
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA
| | - Kenneth K Ng
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA
| | - Nicole J Lake
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA
| | - Soumya Joseph
- Howard Hughes Medical Institute, Senator Paul D. Wellstone Muscular Dystrophy Specialized Research Center, Department of Molecular Physiology and Biophysics and Department of Neurology, Roy J. and Lucille A. Carver College of Medicine, The University of Iowa, Iowa City, IA, USA
| | - Jenny Xu
- Yale University, New Haven, CT, USA
| | - Angela Lek
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA; Muscular Dystrophy Association, Chicago, IL, USA
| | - Lin Ge
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA; Department of Neurology, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing, China
| | - Keryn G Woodman
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA
| | - Justin Cohen
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA
| | - Vincent Ho
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA
| | - Melinda A Brindley
- Department of Infectious Diseases, Department of Population Health, University of Georgia, Athens, GA, USA
| | - Kevin P Campbell
- Howard Hughes Medical Institute, Senator Paul D. Wellstone Muscular Dystrophy Specialized Research Center, Department of Molecular Physiology and Biophysics and Department of Neurology, Roy J. and Lucille A. Carver College of Medicine, The University of Iowa, Iowa City, IA, USA
| | - Monkol Lek
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA.
| |
|
31
|
Hou C, Shen Y. SeqDance: A Protein Language Model for Representing Protein Dynamic Properties. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.10.11.617911. [PMID: 39464109 PMCID: PMC11507661 DOI: 10.1101/2024.10.11.617911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/29/2024]
Abstract
Proteins perform their functions by folding amino acid sequences into dynamic structural ensembles. Despite the important role of protein dynamics, their complexity and the absence of efficient representation methods have limited their integration into studies on protein function and mutation fitness, especially in deep learning applications. To address this, we present SeqDance, a protein language model designed to learn representation of protein dynamic properties directly from sequence alone. SeqDance is pre-trained on dynamic biophysical properties derived from over 30,400 molecular dynamics trajectories and 28,600 normal mode analyses. Our results show that SeqDance effectively captures local dynamic interactions, co-movement patterns, and global conformational features, even for proteins lacking homologs in the pre-training set. Additionally, we showed that SeqDance enhances the prediction of protein fitness landscapes, disorder-to-order transition binding regions, and phase-separating proteins. By learning dynamic properties from sequence, SeqDance complements conventional evolution- and static structure-based methods, offering new insights into protein behavior and function.
Affiliation(s)
- Chao Hou
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
| | - Yufeng Shen
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032
- JP Sulzberger Columbia Genome Center, Columbia University, New York, NY 10032
| |
|
32
|
Boos J, van der Made CI, Ramakrishnan G, Coughlan E, Asselta R, Löscher BS, Valenti LVC, de Cid R, Bujanda L, Julià A, Pairo-Castineira E, Baillie JK, May S, Zametica B, Heggemann J, Albillos A, Banales JM, Barretina J, Blay N, Bonfanti P, Buti M, Fernandez J, Marsal S, Prati D, Ronzoni L, Sacchi N, Schultze JL, Riess O, Franke A, Rawlik K, Ellinghaus D, Hoischen A, Schmidt A, Ludwig KU. Stratified analyses refine association between TLR7 rare variants and severe COVID-19. HGG ADVANCES 2024; 5:100323. [PMID: 38944683 PMCID: PMC11320601 DOI: 10.1016/j.xhgg.2024.100323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Revised: 06/26/2024] [Accepted: 06/25/2024] [Indexed: 07/01/2024] Open
Abstract
Despite extensive global research into genetic predisposition for severe COVID-19, knowledge on the role of rare host genetic variants and their relation to other risk factors remains limited. Here, 52 genes with prior etiological evidence were sequenced in 1,772 severe COVID-19 cases and 5,347 population-based controls from Spain/Italy. Rare deleterious TLR7 variants were present in 2.4% of young (<60 years) cases with no reported clinical risk factors (n = 378), compared to 0.24% of controls (odds ratio [OR] = 12.3, p = 1.27 × 10⁻¹⁰). Incorporation of the results of either functional assays or protein modeling led to a pronounced increase in effect size (ORmax = 46.5, p = 1.74 × 10⁻¹⁵). Association signals for the X-chromosomal gene TLR7 were also detected in the female-only subgroup, suggesting the existence of additional mechanisms beyond X-linked recessive inheritance in males. Additionally, supporting evidence was generated for a contribution to severe COVID-19 of the previously implicated genes IFNAR2, IFIH1, and TBK1. Our results refine the genetic contribution of rare TLR7 variants to severe COVID-19 and strengthen evidence for the etiological relevance of genes in the interferon signaling pathway.
Affiliation(s)
- Jannik Boos
- Institute of Human Genetics, University of Bonn School of Medicine and University Hospital Bonn, Bonn, Germany
| | - Caspar I van der Made
- Department of Human Genetics, Department of Internal Medicine, Radboudumc Research Institute for Medical Innovation, Radboud Center for Infectious Diseases (RCI), Radboud University Medical Center, Nijmegen, the Netherlands
| | - Gayatri Ramakrishnan
- Department of Medical Biosciences, Radboud University Medical Center, Nijmegen, the Netherlands
| | - Eamon Coughlan
- Baillie Gifford Pandemic Science Hub, Centre for Inflammation Research, Institute for Regeneration and Repair, University of Edinburgh, Edinburgh, UK; Roslin Institute, University of Edinburgh, Edinburgh, UK
| | - Rosanna Asselta
- Department of Biomedical Sciences, Humanitas University, Via Rita Levi Montalcini 4, 20090 Pieve Emanuele, Milan, Italy; IRCCS Humanitas Research Hospital - via Manzoni 56, 20089 Rozzano, Milan, Italy
| | - Britt-Sabina Löscher
- Institute of Clinical Molecular Biology, Kiel University and University Medical Center, Kiel, Germany
| | - Luca V C Valenti
- Department of Pathophysiology and Transplantation, Università degli Studi di Milano, Milan, Italy; Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milan, Italy
| | - Rafael de Cid
- Genomes for Life-GCAT Lab, CORE Program. Germans Trias i Pujol Research Institute (IGTP), 08916 Badalona, Spain; Grup de Recerca en Impacte de les Malalties Cròniques i les seves Trajectòries (GRIMTra) (IGTP), Badalona, Spain
| | - Luis Bujanda
- Department of Liver and Gastrointestinal Diseases, Biodonostia Health Research Institute, Donostia University Hospital, University of the Basque Country (UPV/EHU), San Sebastian, Spain; Centre for Biomedical Network Research on Hepatic and Digestive Diseases (CIBEREHD), Instituto de Salud Carlos III, 28029 Madrid, Spain
| | - Antonio Julià
- Vall d'Hebron Hospital Research Institute, Barcelona, Spain
| | - Erola Pairo-Castineira
- Baillie Gifford Pandemic Science Hub, Centre for Inflammation Research, Institute for Regeneration and Repair, University of Edinburgh, Edinburgh, UK; Roslin Institute, University of Edinburgh, Edinburgh, UK
| | - J Kenneth Baillie
- Baillie Gifford Pandemic Science Hub, Centre for Inflammation Research, Institute for Regeneration and Repair, University of Edinburgh, Edinburgh, UK; Roslin Institute, University of Edinburgh, Edinburgh, UK
| | - Sandra May
- Institute of Clinical Molecular Biology, Kiel University and University Medical Center, Kiel, Germany
| | - Berina Zametica
- Institute of Human Genetics, University of Bonn School of Medicine and University Hospital Bonn, Bonn, Germany
| | - Julia Heggemann
- Institute of Human Genetics, University of Bonn School of Medicine and University Hospital Bonn, Bonn, Germany
| | - Agustín Albillos
- Centre for Biomedical Network Research on Hepatic and Digestive Diseases (CIBEREHD), Instituto de Salud Carlos III, 28029 Madrid, Spain; Department of Gastroenterology, Hospital Universitario Ramón y Cajal, Instituto Ramón y Cajal de Investigación Sanitaria (IRYCIS), University of Alcalá, Madrid, Spain
| | - Jesus M Banales
- Department of Liver and Gastrointestinal Diseases, Biodonostia Health Research Institute, Donostia University Hospital, University of the Basque Country (UPV/EHU), San Sebastian, Spain; Centre for Biomedical Network Research on Hepatic and Digestive Diseases (CIBEREHD), Instituto de Salud Carlos III, 28029 Madrid, Spain; IKERBASQUE, Basque Foundation for Science, Bilbao, Spain; Department of Biochemistry and Genetics, School of Sciences, University of Navarra, Pamplona, Spain
| | - Jordi Barretina
- Genomes for Life-GCAT Lab, CORE Program. Germans Trias i Pujol Research Institute (IGTP), 08916 Badalona, Spain
| | - Natalia Blay
- Genomes for Life-GCAT Lab, CORE Program. Germans Trias i Pujol Research Institute (IGTP), 08916 Badalona, Spain; Grup de Recerca en Impacte de les Malalties Cròniques i les seves Trajectòries (GRIMTra) (IGTP), Badalona, Spain
| | - Paolo Bonfanti
- Division of Infectious Diseases, Università degli Studi di Milano Bicocca, Fondazione San Gerardo dei Tintori, Monza, Italy
| | - Maria Buti
- Centre for Biomedical Network Research on Hepatic and Digestive Diseases (CIBEREHD), Instituto de Salud Carlos III, 28029 Madrid, Spain
| | - Javier Fernandez
- Hospital Clinic, University of Barcelona, Barcelona, Spain; European Foundation for the Study of Chronic Liver Failure (EF CLif), Barcelona, Spain
| | - Sara Marsal
- Vall d'Hebron Hospital Research Institute, Barcelona, Spain
| | - Daniele Prati
- Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milan, Italy
| | - Luisa Ronzoni
- Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milan, Italy
| | - Joachim L Schultze
- Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE) e.V., Bonn, Germany; Genomics and Immunoregulation, Life and Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany; PRECISE Platform for Genomics and Epigenomics, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE) e.V. and University of Bonn, Bonn, Germany
| | - Olaf Riess
- Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany; DFG NGS Competence Center Tübingen (NCCT), University of Tübingen, Tübingen, Germany
| | - Andre Franke
- Institute of Clinical Molecular Biology, Kiel University and University Medical Center, Kiel, Germany
| | - Konrad Rawlik
- Baillie Gifford Pandemic Science Hub, Centre for Inflammation Research, Institute for Regeneration and Repair, University of Edinburgh, Edinburgh, UK
| | - David Ellinghaus
- Institute of Clinical Molecular Biology, Kiel University and University Medical Center, Kiel, Germany
| | - Alexander Hoischen
- Department of Human Genetics, Department of Internal Medicine, Radboudumc Research Institute for Medical Innovation, Radboud Center for Infectious Diseases (RCI), Radboud University Medical Center, Nijmegen, the Netherlands
| | - Axel Schmidt
- Institute of Human Genetics, University of Bonn School of Medicine and University Hospital Bonn, Bonn, Germany
| | - Kerstin U Ludwig
- Institute of Human Genetics, University of Bonn School of Medicine and University Hospital Bonn, Bonn, Germany.
| |
|
33
|
Schraiber JG, Edge MD, Pennell M. Unifying approaches from statistical genetics and phylogenetics for mapping phenotypes in structured populations. PLoS Biol 2024; 22:e3002847. [PMID: 39383205 PMCID: PMC11493298 DOI: 10.1371/journal.pbio.3002847] [Received: 03/07/2024] [Revised: 10/21/2024] [Accepted: 09/17/2024] [Indexed: 10/11/2024] Open
Abstract
In both statistical genetics and phylogenetics, a major goal is to identify correlations between genetic loci or other aspects of the phenotype or environment and a focal trait. In these 2 fields, there are sophisticated but disparate statistical traditions aimed at these tasks. The disconnect between their respective approaches is becoming untenable as questions in medicine, conservation biology, and evolutionary biology increasingly rely on integrating data from within and among species, and once-clear conceptual divisions are becoming increasingly blurred. To help bridge this divide, we lay out a general model describing the covariance between the genetic contributions to the quantitative phenotypes of different individuals. Taking this approach shows that standard models in both statistical genetics (e.g., genome-wide association studies; GWAS) and phylogenetic comparative biology (e.g., phylogenetic regression) can be interpreted as special cases of this more general quantitative-genetic model. The fact that these models share the same core architecture means that we can build a unified understanding of the strengths and limitations of different methods for controlling for genetic structure when testing for associations. We develop intuition for why and when spurious correlations may occur analytically and conduct population-genetic and phylogenetic simulations of quantitative traits. The structural similarity of problems in statistical genetics and phylogenetics enables us to take methodological advances from one field and apply them in the other. We demonstrate by showing how a standard GWAS technique (including both the genetic relatedness matrix (GRM) and its leading eigenvectors, corresponding to the principal components of the genotype matrix, in a regression model) can mitigate spurious correlations in phylogenetic analyses. As a case study, we re-examine an analysis testing for coevolution of expression levels between genes across a fungal phylogeny and show that including eigenvectors of the covariance matrix as covariates decreases the false positive rate while simultaneously increasing the true positive rate. More generally, this work provides a foundation for more integrative approaches for understanding the genetic architecture of phenotypes and how evolutionary processes shape it.
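The eigenvector adjustment described in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes a two-population toy dataset, covers only the fixed-effect part of the technique (leading principal components as covariates) and omits the GRM random-effect term of a full mixed model. All names (`top_pcs`, `association_beta`) and all simulation parameters are hypothetical.

```python
import numpy as np

def top_pcs(genotypes, k):
    """Leading k principal-component scores (one row per individual)."""
    centered = genotypes - genotypes.mean(axis=0)
    # Left singular vectors of the centered matrix are eigenvectors of the
    # individual-by-individual covariance matrix.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]

def association_beta(y, x, covariates=None):
    """OLS slope of y on x, with an intercept and optional covariates."""
    columns = [np.ones(len(y)), x]
    if covariates is not None:
        columns.append(covariates)
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(0)
n_per, m = 200, 200                      # individuals per population, SNPs
freq_a = rng.uniform(0.1, 0.9, m)        # allele frequencies, population A
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, m), 0.05, 0.95)
genotypes = np.vstack([
    rng.binomial(2, freq_a, size=(n_per, m)),
    rng.binomial(2, freq_b, size=(n_per, m)),
]).astype(float)
population = np.repeat([0.0, 1.0], n_per)

# The trait differs between populations but has no causal SNP at all.
y = 2.0 * population + rng.normal(0.0, 1.0, 2 * n_per)
# A non-causal test SNP whose frequency differs strongly between populations.
x = np.concatenate([
    rng.binomial(2, 0.2, n_per),
    rng.binomial(2, 0.8, n_per),
]).astype(float)

naive = association_beta(y, x)
adjusted = association_beta(y, x, top_pcs(genotypes, k=2))
print(f"naive slope: {naive:.3f}, PC-adjusted slope: {adjusted:.3f}")
```

In this toy setting the unadjusted slope reflects population structure alone, and adding the leading components as covariates should pull it toward zero. How many components to include is a modelling choice that the sketch does not address.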
Affiliation(s)
- Joshua G. Schraiber
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, United States of America
| | - Michael D. Edge
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, United States of America
| | - Matt Pennell
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, United States of America
- Department of Biological Sciences, University of Southern California, Los Angeles, California, United States of America
| |
|
34
|
Nebenführ M, Prochotta D, Ben Hamadou A, Janke A, Gerheim C, Betz C, Greve C, Bolz HJ. High-speed whole-genome sequencing of a Whippet: Rapid chromosome-level assembly and annotation of an extremely fast dog's genome. GIGABYTE 2024; 2024:gigabyte134. [PMID: 39314919 PMCID: PMC11418881 DOI: 10.46471/gigabyte.134] [Received: 05/22/2024] [Accepted: 09/09/2024] [Indexed: 09/25/2024] Open
Abstract
The time required for genome sequencing and de novo assembly depends on the interaction between laboratory work, sequencing capacity, and the bioinformatics workflow, often constrained by external sequencing services. Bringing together academic biodiversity institutes and a medical diagnostics company with extensive sequencing capabilities, we aimed at generating a high-quality mammalian de novo genome in minimal time. We present the first chromosome-level genome assembly of the Whippet, using PacBio long-read high-fidelity sequencing and reference-guided scaffolding. The final assembly has a contig N50 of 55 Mbp and a scaffold N50 of 65.7 Mbp. The total assembly length is 2.47 Gbp, of which 2.43 Gbp were scaffolded into 39 chromosome-length scaffolds. Annotation using mammalian genomes and transcriptome data yielded 28,383 transcripts, 90.9% complete BUSCO genes, and identified 36.5% repeat content. Sequencing, assembling, and scaffolding the chromosome-level genome of the Whippet took less than a week, adding another high-quality reference genome to the available sequences of domestic dog breeds.
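The contig and scaffold N50 values quoted in this abstract follow the usual definition: the length L such that sequences of length L or longer together cover at least half of the total assembly. A minimal sketch of that calculation is given below. The function name `n50` and the example lengths are illustrative and are not taken from the Whippet assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downward until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Made-up lengths in arbitrary units, for illustration only.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
example_n50 = n50(example_lengths)
print(example_n50)
```

Because the statistic depends only on the length distribution, a few very long sequences can dominate it, which is why it is normally reported alongside total assembly length and a completeness measure such as BUSCO.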
Affiliation(s)
- Marcel Nebenführ
- Senckenberg Biodiversity and Climate Research Centre (BiK-F), Frankfurt am Main, Germany
- Institute for Ecology, Evolution, and Diversity, Goethe University, Frankfurt am Main, Germany
- LOEWE-Centre for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany
| | - David Prochotta
- Senckenberg Biodiversity and Climate Research Centre (BiK-F), Frankfurt am Main, Germany
- Institute for Ecology, Evolution, and Diversity, Goethe University, Frankfurt am Main, Germany
- LOEWE-Centre for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany
| | - Alexander Ben Hamadou
- Senckenberg Biodiversity and Climate Research Centre (BiK-F), Frankfurt am Main, Germany
- LOEWE-Centre for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany
| | - Axel Janke
- Senckenberg Biodiversity and Climate Research Centre (BiK-F), Frankfurt am Main, Germany
- Institute for Ecology, Evolution, and Diversity, Goethe University, Frankfurt am Main, Germany
- LOEWE-Centre for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany
| | - Charlotte Gerheim
- Senckenberg Biodiversity and Climate Research Centre (BiK-F), Frankfurt am Main, Germany
- LOEWE-Centre for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany
| | - Christian Betz
- Bioscientia Human Genetics, Institute for Medical Diagnostics GmbH, Ingelheim, Germany
| | - Carola Greve
- Senckenberg Biodiversity and Climate Research Centre (BiK-F), Frankfurt am Main, Germany
- LOEWE-Centre for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany
| | - Hanno Jörn Bolz
- Bioscientia Human Genetics, Institute for Medical Diagnostics GmbH, Ingelheim, Germany
| |
|
35
|
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: trends from three decades of genetic variant impact predictors. Hum Genomics 2024; 18:90. [PMID: 39198917 PMCID: PMC11360829 DOI: 10.1186/s40246-024-00663-z] [Received: 06/22/2024] [Accepted: 08/19/2024] [Indexed: 09/01/2024] Open
Abstract
BACKGROUND Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). RESULTS The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past three decades, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 190 VIPs, resulting in a total of 407 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. CONCLUSIONS VIPdb version 2 summarizes 407 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. VIPdb is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
| | - Arul S Menon
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA
| | - Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA
- Illumina, Foster City, CA, 94404, USA
| | - Steven E Brenner
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA.
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA.
- College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA.
- Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA.
| |
|
36
|
Zhou H, Gelernter J. Human genetics and epigenetics of alcohol use disorder. J Clin Invest 2024; 134:e172885. [PMID: 39145449 PMCID: PMC11324314 DOI: 10.1172/jci172885] [Indexed: 08/16/2024] Open
Abstract
Alcohol use disorder (AUD) is a prominent contributor to global morbidity and mortality. Its complex etiology involves genetics, epigenetics, and environmental factors. We review progress in understanding the genetics and epigenetics of AUD, summarizing the key findings. Advancements in technology over the decades have elevated research from early candidate gene studies to present-day genome-wide scans, unveiling numerous genetic and epigenetic risk factors for AUD. The latest GWAS on more than one million participants identified more than 100 genetic variants, and the largest epigenome-wide association studies (EWAS) in blood and brain samples have revealed tissue-specific epigenetic changes. Downstream analyses revealed enriched pathways, genetic correlations with other traits, transcriptome-wide association in brain tissues, and drug-gene interactions for AUD. We also discuss limitations and future directions, including increasing the power of GWAS and EWAS studies as well as expanding the diversity of populations included in these analyses. Larger samples, novel technologies, and analytic approaches are essential; these include whole-genome sequencing, multiomics, single-cell sequencing, spatial transcriptomics, deep-learning prediction of variant function, and integrated methods for disease risk prediction.
Affiliation(s)
- Hang Zhou
- Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, USA
- Veterans Affairs Connecticut Healthcare System, West Haven, Connecticut, USA
- Department of Biomedical Informatics and Data Science
- Center for Brain and Mind Health
| | - Joel Gelernter
- Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, USA
- Veterans Affairs Connecticut Healthcare System, West Haven, Connecticut, USA
- Department of Genetics, and
- Department of Neuroscience, Yale School of Medicine, New Haven, Connecticut, USA
| |
|
37
|
Du D, Zhong F, Liu L. Enhancing recognition and interpretation of functional phenotypic sequences through fine-tuning pre-trained genomic models. J Transl Med 2024; 22:756. [PMID: 39135093 PMCID: PMC11318145 DOI: 10.1186/s12967-024-05567-z] [Received: 12/27/2023] [Accepted: 08/03/2024] [Indexed: 08/16/2024] Open
Abstract
BACKGROUND Decoding human genomic sequences requires comprehensive analysis of DNA sequence functionality. Through computational and experimental approaches, researchers have studied the genotype-phenotype relationship and generated important datasets that help unravel complicated genetic blueprints. Thus, the recently developed artificial intelligence methods can be used to interpret the functions of those DNA sequences. METHODS This study explores the use of deep learning, particularly pre-trained genomic models like DNA_bert_6 and human_gpt2-v1, in interpreting and representing human genome sequences. Initially, we meticulously constructed multiple datasets linking genotypes and phenotypes to fine-tune those models for precise DNA sequence classification. Additionally, we evaluate the influence of sequence length on classification results and analyze the impact of feature extraction in the hidden layers of our model using the HERV dataset. To enhance our understanding of phenotype-specific patterns recognized by the model, we perform enrichment, pathogenicity and conservation analyses of specific motifs in the human endogenous retrovirus (HERV) sequence with high average local representation weight (ALRW) scores. RESULTS We have constructed multiple genotype-phenotype datasets displaying commendable classification performance in comparison with random genomic sequences, particularly in the HERV dataset, which achieved binary and multi-classification accuracies and F1 values exceeding 0.935 and 0.888, respectively. Notably, the fine-tuning of the HERV dataset not only improved our ability to identify and distinguish diverse information types within DNA sequences but also successfully identified specific motifs associated with neurological disorders and cancers in regions with high ALRW scores. Subsequent analysis of these motifs shed light on the adaptive responses of species to environmental pressures and their co-evolution with pathogens.
CONCLUSIONS These findings highlight the potential of pre-trained genomic models in learning DNA sequence representations, particularly when utilizing the HERV dataset, and provide valuable insights for future research endeavors. This study represents an innovative strategy that combines pre-trained genomic model representations with classical methods for analyzing the functionality of genome sequences, thereby promoting cross-fertilization between genomics and artificial intelligence.
Affiliation(s)
- Duo Du
- School of Basic Medical Sciences and Intelligent Medicine Institute, Fudan University, Shanghai, 200032, China
| | - Fan Zhong
- School of Basic Medical Sciences and Intelligent Medicine Institute, Fudan University, Shanghai, 200032, China.
| | - Lei Liu
- School of Basic Medical Sciences and Intelligent Medicine Institute, Fudan University, Shanghai, 200032, China.
- Shanghai Institute of Stem Cell Research and Clinical Translation, Shanghai, 200120, China.
| |
|
38
|
Fu X, Rabadan R. Understanding variants of unknown significance: the computational frontier. Oncologist 2024; 29:653-657. [PMID: 38848164 PMCID: PMC11299926 DOI: 10.1093/oncolo/oyae103] [Received: 04/07/2024] [Accepted: 04/16/2024] [Indexed: 06/09/2024] Open
Abstract
The rapid advancement of sequencing technologies has led to the identification of numerous mutations in cancer genomes, many of which are variants of unknown significance (VUS). Computational models are increasingly being used to predict the functional impact of these mutations, in both coding and noncoding regions. Integration of these models with emerging genomic datasets will refine our understanding of mutation effects and guide clinical decision making. Future advancements in modeling protein interactions and transcriptional regulation will further enhance our ability to interpret VUS. Periodic incorporation of these developments into VUS reclassification practice has the potential to significantly improve personalized cancer care.
Affiliation(s)
- Xi Fu
- Columbia University Irving Medical Center, New York, NY, USA
| | - Raul Rabadan
- Columbia University Irving Medical Center, New York, NY, USA
| |
|
39
|
Federico CA, Trotsyuk AA. Biomedical Data Science, Artificial Intelligence, and Ethics: Navigating Challenges in the Face of Explosive Growth. Annu Rev Biomed Data Sci 2024; 7:1-14. [PMID: 38598860 DOI: 10.1146/annurev-biodatasci-102623-104553] [Indexed: 04/12/2024]
Abstract
Advances in biomedical data science and artificial intelligence (AI) are profoundly changing the landscape of healthcare. This article reviews the ethical issues that arise with the development of AI technologies, including threats to privacy, data security, consent, and justice, as they relate to donors of tissue and data. It also considers broader societal obligations, including the importance of assessing the unintended consequences of AI research in biomedicine. In addition, this article highlights the challenge of rapid AI development against the backdrop of disparate regulatory frameworks, calling for a global approach to address concerns around data misuse, unintended surveillance, and the equitable distribution of AI's benefits and burdens. Finally, a number of potential solutions to these ethical quandaries are offered. Namely, the merits of advocating for a collaborative, informed, and flexible regulatory approach that balances innovation with individual rights and public welfare, fostering a trustworthy AI-driven healthcare ecosystem, are discussed.
Affiliation(s)
- Carole A Federico
- Center for Biomedical Ethics, Stanford University School of Medicine, Stanford, California, USA
| | - Artem A Trotsyuk
- Center for Biomedical Ethics, Stanford University School of Medicine, Stanford, California, USA
| |
|
40
|
Ishida Y, Matsushita M, Yoneshiro T, Saito M, Fuse S, Hamaoka T, Kuroiwa M, Tanaka R, Kurosawa Y, Nishimura T, Motoi M, Maeda T, Nakayama K. Genetic evidence for involvement of β2-adrenergic receptor in brown adipose tissue thermogenesis in humans. Int J Obes (Lond) 2024; 48:1110-1117. [PMID: 38632325 PMCID: PMC11281906 DOI: 10.1038/s41366-024-01522-6] [Received: 12/02/2023] [Revised: 04/05/2024] [Accepted: 04/08/2024] [Indexed: 04/19/2024]
Abstract
BACKGROUND Sympathetic activation of brown adipose tissue (BAT) thermogenesis can ameliorate obesity and related metabolic abnormalities. However, crucial subtypes of the β-adrenergic receptor (AR), as well as effects of its genetic variants on functions of BAT, remain unclear in humans. We conducted association analyses of genes encoding β-ARs and BAT activity in human adults. METHODS Single nucleotide polymorphisms (SNPs) in β1-, β2-, and β3-AR genes (ADRB1, ADRB2, and ADRB3) were tested for the association with BAT activity under mild cold exposure (19 °C, 2 h) in 399 healthy Japanese adults. BAT activity was measured using fluorodeoxyglucose-positron emission tomography and computed tomography (FDG-PET/CT). To validate the results, we assessed the effects of SNPs in the two independent populations comprising 277 healthy East Asian adults using near-infrared time-resolved spectroscopy (NIRTRS) or infrared thermography (IRT). Effects of SNPs on physiological responses to intensive cold exposure were tested in 42 healthy Japanese adult males using an artificial climate chamber. RESULTS We found a significant association between a functional SNP (rs1042718) in ADRB2 and BAT activity assessed with FDG-PET/CT (p < 0.001). This SNP also showed an association with cold-induced thermogenesis in the population subset. Furthermore, the association was replicated in the two other independent populations; BAT activity was evaluated by NIRTRS or IRT (p < 0.05). This SNP did not show associations with oxygen consumption and cold-induced thermogenesis under intensive cold exposure, suggesting the irrelevance of shivering thermogenesis. The SNPs of ADRB1 and ADRB3 were not associated with these BAT-related traits. CONCLUSIONS The present study supports the importance of β2-AR in the sympathetic regulation of BAT thermogenesis in humans. The present collection of DNA samples is the largest to which information on the donor's BAT activity has been assigned and can serve as a reference for further in-depth understanding of human BAT function.
MESH Headings
- Adult
- Female
- Humans
- Male
- Middle Aged
- Adipose Tissue, Brown/metabolism
- Japan
- Polymorphism, Single Nucleotide
- Positron Emission Tomography Computed Tomography
- Receptors, Adrenergic, beta-2/genetics
- Receptors, Adrenergic, beta-2/metabolism
- Receptors, Adrenergic, beta-3/genetics
- Receptors, Adrenergic, beta-3/metabolism
- Thermogenesis
- East Asian People/genetics
Affiliation(s)
- Yuka Ishida
- Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, 277-8562, Japan
| | - Mami Matsushita
- Department of Nutrition, School of Nursing and Nutrition, Tenshi College, Sapporo, Hokkaido, 065-0013, Japan
| | - Takeshi Yoneshiro
- Research Center for Advanced Science and Technology, The University of Tokyo, Meguro-ku, Tokyo, 153-8904, Japan
- Department of Molecular Metabolism and Physiology, Graduate School of Medicine, Tohoku University, Aoba-ku, Sendai, 980-8575, Japan
| | - Masayuki Saito
- Department of Nutrition, School of Nursing and Nutrition, Tenshi College, Sapporo, Hokkaido, 065-0013, Japan
- Laboratory of Biochemistry, Faculty of Veterinary Medicine, Hokkaido University, Sapporo, Hokkaido, 060-0818, Japan
| | - Sayuri Fuse
- Department of Sports Medicine for Health Promotion, Tokyo Medical University, Shinjuku-ku, Tokyo, 160-8402, Japan
| | - Takafumi Hamaoka
- Department of Sports Medicine for Health Promotion, Tokyo Medical University, Shinjuku-ku, Tokyo, 160-8402, Japan
| | - Miyuki Kuroiwa
- Department of Sports Medicine for Health Promotion, Tokyo Medical University, Shinjuku-ku, Tokyo, 160-8402, Japan
| | - Riki Tanaka
- Faculty of Sports and Health Science, Fukuoka University, Fukuoka, Fukuoka, 814-0180, Japan
| | - Yuko Kurosawa
- Department of Sports Medicine for Health Promotion, Tokyo Medical University, Shinjuku-ku, Tokyo, 160-8402, Japan
| | - Takayuki Nishimura
- Department of Human Life Design and Science, Faculty of Design, Kyushu University, Fukuoka, Fukuoka, 815-8540, Japan
- Physiological Anthropology Research Center, Faculty of Design, Kyushu University, Fukuoka, Fukuoka, 815-8540, Japan
| | - Midori Motoi
- Department of Human Life Design and Science, Faculty of Design, Kyushu University, Fukuoka, Fukuoka, 815-8540, Japan
| | - Takafumi Maeda
- Department of Human Life Design and Science, Faculty of Design, Kyushu University, Fukuoka, Fukuoka, 815-8540, Japan
- Physiological Anthropology Research Center, Faculty of Design, Kyushu University, Fukuoka, Fukuoka, 815-8540, Japan
| | - Kazuhiro Nakayama
- Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, 277-8562, Japan.
| |
|
41
|
Ding K, Chin M, Zhao Y, Huang W, Mai BK, Wang H, Liu P, Yang Y, Luo Y. Machine learning-guided co-optimization of fitness and diversity facilitates combinatorial library design in enzyme engineering. Nat Commun 2024; 15:6392. [PMID: 39080249 PMCID: PMC11289365 DOI: 10.1038/s41467-024-50698-y] [Received: 05/29/2024] [Accepted: 07/19/2024] [Indexed: 08/02/2024] Open
Abstract
The effective design of combinatorial libraries to balance fitness and diversity facilitates the engineering of useful enzyme functions, particularly those that are poorly characterized or unknown in biology. We introduce MODIFY, a machine learning (ML) algorithm that learns from natural protein sequences to infer evolutionarily plausible mutations and predict enzyme fitness. MODIFY co-optimizes predicted fitness and sequence diversity of starting libraries, prioritizing high-fitness variants while ensuring broad sequence coverage. In silico evaluation shows that MODIFY outperforms state-of-the-art unsupervised methods in zero-shot fitness prediction and enables ML-guided directed evolution with enhanced efficiency. Using MODIFY, we engineer generalist biocatalysts derived from a thermostable cytochrome c to achieve enantioselective C-B and C-Si bond formation via a new-to-nature carbene transfer mechanism, leading to biocatalysts six mutations away from previously developed enzymes while exhibiting superior or comparable activities. These results demonstrate MODIFY's potential in solving challenging enzyme engineering problems beyond the reach of classic directed evolution.
Affiliation(s)
- Kerr Ding
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
| | - Michael Chin
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
| | - Yunlong Zhao
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
| | - Wei Huang
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
| | - Binh Khanh Mai
- Department of Chemistry, University of Pittsburgh, Pittsburgh, PA, 15260, USA
| | - Huanan Wang
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
| | - Peng Liu
- Department of Chemistry, University of Pittsburgh, Pittsburgh, PA, 15260, USA.
| | - Yang Yang
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA.
- Biomolecular Science and Engineering (BMSE) Program, University of California, Santa Barbara, CA, 93106, USA.
| | - Yunan Luo
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA.
| |
|
42
|
Buckley RM, Ostrander EA. Large-scale genomic analysis of the domestic dog informs biological discovery. Genome Res 2024; 34:811-821. [PMID: 38955465 PMCID: PMC11293549 DOI: 10.1101/gr.278569.123] [Indexed: 07/04/2024]
Abstract
Recent advances in genomics, coupled with a unique population structure and remarkable levels of variation, have propelled the domestic dog to new levels as a system for understanding fundamental principles in mammalian biology. Central to this advance are more than 350 recognized breeds, each a closed population that has undergone selection for unique features. Genetic variation in the domestic dog is particularly well characterized compared with other domestic mammals, with almost 3000 high-coverage genomes publicly available. Importantly, as the number of sequenced genomes increases, new avenues for analysis are becoming available. Herein, we discuss recent discoveries in canine genomics regarding behavior, morphology, and disease susceptibility. We explore the limitations of current data sets for variant interpretation, tradeoffs between sequencing strategies, and the burgeoning role of long-read genomes for capturing structural variants. In addition, we consider how large-scale collections of whole-genome sequence data drive rare variant discovery and assess the geographic distribution of canine diversity, which identifies Asia as a major source of missing variation. Finally, we review recent comparative genomic analyses that will facilitate annotation of the noncoding genome in dogs.
Affiliation(s)
- Reuben M Buckley
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Elaine A Ostrander
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| |
|
43
|
Haque B, Guirguis G, Curtis M, Mohsin H, Walker S, Morrow MM, Costain G. A comparative medical genomics approach may facilitate the interpretation of rare missense variation. J Med Genet 2024; 61:817-821. [PMID: 38508706 PMCID: PMC11287553 DOI: 10.1136/jmg-2023-109760] [Received: 11/13/2023] [Accepted: 03/12/2024] [Indexed: 03/22/2024]
Abstract
Purpose: To determine the degree to which likely causal missense variants of single-locus traits in domesticated species have features suggestive of pathogenicity in a human genomic context. Methods: We extracted missense variants from the Online Mendelian Inheritance in Animals database for nine animals (cat, cattle, chicken, dog, goat, horse, pig, rabbit and sheep), mapped coordinates to the human reference genome and annotated variants using genome analysis tools. We also searched a private commercial laboratory database of genetic testing results from >400 000 individuals with suspected rare disorders. Results: Of 339 variants that were mappable to the same residue and gene in the human genome, 56 had been previously classified with respect to pathogenicity: 31 (55.4%) pathogenic/likely pathogenic, 1 (1.8%) benign/likely benign and 24 (42.9%) uncertain/other. The odds ratio for a pathogenic/likely pathogenic classification in ClinVar was 7.0 (95% CI 4.1 to 12.0, p<0.0001), compared with all other germline missense variants in these same 220 genes. The remaining 283 variants disproportionately had allele frequencies and REVEL scores that supported pathogenicity. Conclusion: Cross-species comparisons could facilitate the interpretation of rare missense variation. These results provide further support for comparative medical genomics approaches that connect big data initiatives in human and veterinary genetics.
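The counts reported in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration of how the percentages and the "remaining 283" figure follow from the stated numbers; the variable and function names are illustrative and not taken from the paper, and the odds ratio is not recomputed because the abstract does not give the underlying contingency table.

```python
# Counts as stated in the abstract; names here are illustrative only.
mapped_variants = 339
classified = {
    "pathogenic/likely pathogenic": 31,
    "benign/likely benign": 1,
    "uncertain/other": 24,
}


def percentages(counts):
    """Return each category's share of the total, in percent, to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


total_classified = sum(classified.values())        # previously classified variants
unclassified = mapped_variants - total_classified  # variants without a classification
shares = percentages(classified)

print(total_classified, unclassified)
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

The three shares are computed against the 56 previously classified variants, not against all 339 mapped variants, which is the denominator the abstract uses.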
Affiliation(s)
- Bushra Haque
- Program in Genetics and Genome Biology, SickKids Research Institute, Toronto, Ontario, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| | - George Guirguis
- Program in Genetics and Genome Biology, SickKids Research Institute, Toronto, Ontario, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| | - Meredith Curtis
- Program in Genetics and Genome Biology, SickKids Research Institute, Toronto, Ontario, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| | - Hera Mohsin
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| | - Susan Walker
- The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, Ontario, Canada
| | | | - Gregory Costain
- Program in Genetics and Genome Biology, SickKids Research Institute, Toronto, Ontario, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, Ontario, Canada
- Division of Clinical and Metabolic Genetics, The Hospital for Sick Children, Toronto, Ontario, Canada
- Department of Paediatrics, University of Toronto, Toronto, Ontario, Canada
| |
|
44
|
Mendoza-Revilla J, Trop E, Gonzalez L, Roller M, Dalla-Torre H, de Almeida BP, Richard G, Caton J, Lopez Carranza N, Skwark M, Laterre A, Beguir K, Pierrot T, Lopez M. A foundational large language model for edible plant genomes. Commun Biol 2024; 7:835. [PMID: 38982288 PMCID: PMC11233511 DOI: 10.1038/s42003-024-06465-2] [Received: 11/21/2023] [Accepted: 06/17/2024] [Indexed: 07/11/2024] Open
Abstract
Significant progress has been made in the field of plant genomics, as demonstrated by the increased use of high-throughput methodologies that enable the characterization of multiple genome-wide molecular phenotypes. These findings have provided valuable insights into plant traits and their underlying genetic mechanisms, particularly in model plant species. Nonetheless, effectively leveraging them to make accurate predictions represents a critical step in crop genomic improvement. We present AgroNT, a foundational large language model trained on genomes from 48 plant species with a predominant focus on crop species. We show that AgroNT can obtain state-of-the-art predictions for regulatory annotations, promoter/terminator strength, tissue-specific gene expression, and prioritize functional variants. We conduct a large-scale in silico saturation mutagenesis analysis on cassava to evaluate the regulatory impact of over 10 million mutations and provide their predicted effects as a resource for variant characterization. Finally, we propose the use of the diverse datasets compiled here as the Plants Genomic Benchmark (PGB), providing a comprehensive benchmark for deep learning-based methods in plant genomic research. The pre-trained AgroNT model is publicly available on HuggingFace at https://huggingface.co/InstaDeepAI/agro-nucleotide-transformer-1b for future research purposes.
|
45
|
Dereli O, Kuru N, Akkoyun E, Bircan A, Tastan O, Adebali O. PHACTboost: A Phylogeny-Aware Pathogenicity Predictor for Missense Mutations via Boosting. Mol Biol Evol 2024; 41:msae136. [PMID: 38934805 PMCID: PMC11251492 DOI: 10.1093/molbev/msae136] [Received: 12/12/2023] [Revised: 05/30/2024] [Accepted: 06/24/2024] [Indexed: 06/28/2024] Open
Abstract
Most algorithms that are used to predict the effects of variants rely on evolutionary conservation. However, most such techniques compute evolutionary conservation solely from multiple sequence alignments, overlooking the evolutionary context of substitution events. In a previous study, we introduced PHACT, a scoring-based pathogenicity predictor for missense mutations that leverages phylogenetic trees. Building on this foundation, we now propose PHACTboost, a gradient boosting tree-based classifier that combines PHACT scores with information from multiple sequence alignments, phylogenetic trees, and ancestral reconstruction. By learning from data, PHACTboost outperforms PHACT. Furthermore, comprehensive experiments on carefully constructed sets of variants demonstrated that PHACTboost can outperform 40 prevalent pathogenicity predictors reported in dbNSFP, including conventional tools, metapredictors, and deep learning-based approaches, as well as more recent tools such as AlphaMissense, EVE, and CPT-1. The superiority of PHACTboost over these methods was particularly evident in the case of hard variants, for which different pathogenicity predictors offered conflicting results. We provide predictions of 215 million amino acid alterations over 20,191 proteins. PHACTboost is available at https://github.com/CompGenomeLab/PHACTboost. PHACTboost can improve our understanding of genetic diseases and facilitate more accurate diagnoses.
Affiliation(s)
- Onur Dereli
- Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul 34956, Turkey
| | - Nurdan Kuru
- Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul 34956, Turkey
| | - Emrah Akkoyun
- Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul 34956, Turkey
- Network Technologies Department, TÜBİTAK-ULAKBİM Turkish Academic Network and Information Center, Ankara 06530, Turkey
| | - Aylin Bircan
- Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul 34956, Turkey
| | - Oznur Tastan
- Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul 34956, Turkey
| | - Ogün Adebali
- Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul 34956, Turkey
- Biological Sciences, TÜBİTAK Research Institute for Fundamental Sciences, Gebze 41470, Turkey
| |
|
46
|
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: Trends from 25 years of genetic variant impact predictors. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.25.600283. [PMID: 38979289 PMCID: PMC11230257 DOI: 10.1101/2024.06.25.600283] [Indexed: 07/10/2024]
Abstract
Background: Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). Results: The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past 25 years, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 186 VIPs, resulting in a total of 403 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. Conclusions: VIPdb version 2 summarizes 403 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. Availability: VIPdb version 2 is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
- Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
- Center for Computational Biology, University of California, Berkeley, California 94720, USA
| | - Arul S. Menon
- Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
| | - Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Currently at: Illumina, Foster City, California 94404, USA
| | - Steven E. Brenner
- Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
- Center for Computational Biology, University of California, Berkeley, California 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
| |
|
47
|
Rastogi R, Chung R, Li S, Li C, Lee K, Woo J, Kim DW, Keum C, Babbi G, Martelli PL, Savojardo C, Casadio R, Chennen K, Weber T, Poch O, Ancien F, Cia G, Pucci F, Raimondi D, Vranken W, Rooman M, Marquet C, Olenyi T, Rost B, Andreoletti G, Kamandula A, Peng Y, Bakolitsa C, Mort M, Cooper DN, Bergquist T, Pejaver V, Liu X, Radivojac P, Brenner SE, Ioannidis NM. Critical assessment of missense variant effect predictors on disease-relevant variant data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.06.597828. [PMID: 38895200 PMCID: PMC11185644 DOI: 10.1101/2024.06.06.597828] [Indexed: 06/21/2024]
Abstract
Regular, systematic, and independent assessment of computational tools used to predict the pathogenicity of missense variants is necessary to evaluate their clinical and research utility and suggest directions for future improvement. Here, as part of the sixth edition of the Critical Assessment of Genome Interpretation (CAGI) challenge, we assess missense variant effect predictors (or variant impact predictors) on an evaluation dataset of rare missense variants from disease-relevant databases. Our assessment evaluates predictors submitted to the CAGI6 Annotate-All-Missense challenge, predictors commonly used by the clinical genetics community, and recently developed deep learning methods for variant effect prediction. To explore a variety of settings that are relevant for different clinical and research applications, we assess performance within different subsets of the evaluation data and within high-specificity and high-sensitivity regimes. We find strong performance of many predictors across multiple settings. Meta-predictors tend to outperform their constituent individual predictors; however, several individual predictors have performance similar to that of commonly used meta-predictors. The relative performance of predictors differs in high-specificity and high-sensitivity regimes, suggesting that different methods may be best suited to different use cases. We also characterize two potential sources of bias. Predictors that incorporate allele frequency as a predictive feature tend to have reduced performance when distinguishing pathogenic variants from very rare benign variants, and predictors supervised on pathogenicity labels from curated variant databases often learn label imbalances within genes. 
Overall, we find notable advances over the oldest and most cited missense variant effect predictors and continued improvements among the most recently developed tools, and the CAGI Annotate-All-Missense challenge (also termed the Missense Marathon) will continue to assess state-of-the-art methods as the field progresses. Together, our results help illuminate the current clinical and research utility of missense variant effect predictors and identify potential areas for future development.
|
48
|
Shorbaji A, Pushparaj PN, Bakhashab S, Al-Ghafari AB, Al-Rasheed RR, Siraj Mira L, Basabrain MA, Alsulami M, Abu Zeid IM, Naseer MI, Rasool M. Current genetic models for studying congenital heart diseases: Advantages and disadvantages. Bioinformation 2024; 20:415-429. [PMID: 39132229 PMCID: PMC11309114 DOI: 10.6026/973206300200415] [Received: 05/01/2024] [Revised: 05/31/2024] [Accepted: 05/31/2024] [Indexed: 08/13/2024] Open
Abstract
Congenital heart disease (CHD) encompasses a diverse range of structural and functional anomalies that affect the heart and the major blood vessels. Epidemiological studies have documented a global increase in CHD prevalence, which can be attributed to advancements in diagnostic technologies. Extensive research has identified a plethora of CHD-related genes, providing insights into the biochemical pathways and molecular mechanisms underlying this pathological state. In this review, we discuss the advantages and challenges of various in vitro and in vivo CHD models, including primates, canines, Xenopus frogs, rabbits, chicks, mice, Drosophila, zebrafish, and induced pluripotent stem cells (iPSCs). Primates are closely related to humans but are rare and expensive. Canine models are costly but structurally comparable to humans. Xenopus frogs are advantageous because of their generation of many embryos, ease of genetic modification, and cardiac similarity. Rabbits mimic human physiology but are challenging to genetically control. Chicks are inexpensive and simple to handle; however, cardiac events can vary among humans. Mice differ physiologically, while being evolutionarily close and well-resourced. Drosophila has genes similar to those of humans but different heart structures. Zebrafish have several advantages, including high gene conservation in humans and physiological cardiac similarities but limitations in cross-reactivity with mammalian antibodies, gene duplication, and limited embryonic stem cells for reverse genetic methods. iPSCs have the potential for gene editing, but face challenges in terms of 2D structure and genomic stability. CRISPR-Cas9 allows for genetic correction but requires high technical skills and resources. These models have provided valuable knowledge regarding cardiac development, disease simulation, and the verification of genetic factors.
This review highlights the distinct features of various models with respect to their biological characteristics, vulnerability to developing specific heart diseases, approaches employed to induce particular conditions, and the comparability of these species to humans. Therefore, the selection of appropriate models is based on research objectives, ultimately leading to an enhanced comprehension of disease pathology and therapy.
Affiliation(s)
- Ayat Shorbaji
- Biochemistry Department, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Peter Natesan Pushparaj
- Center of Excellence in Genomic Medicine Research, Department of Medical Laboratory Technology, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Sherin Bakhashab
- Biochemistry Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Center of Excellence in Genomic Medicine Research, Department of Medical Laboratory Technology, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Ayat B Al-Ghafari
- Biochemistry Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Experimental Biochemistry Unit, King Fahd Medical Research Center, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Rana R Al-Rasheed
- Experimental Biochemistry Unit, King Fahad Research Center, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Loubna Siraj Mira
- Center of Excellence in Genomic Medicine Research, Department of Medical Laboratory Technology, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Mohammad Abdullah Basabrain
- Center of Excellence in Genomic Medicine Research, Department of Medical Laboratory Technology, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Majed Alsulami
- Center of Excellence in Genomic Medicine Research, Department of Medical Laboratory Technology, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
- Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Isam M Abu Zeid
- Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Muhammad Imran Naseer
- Center of Excellence in Genomic Medicine Research, Department of Medical Laboratory Technology, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Mahmood Rasool
- Center of Excellence in Genomic Medicine Research, Department of Medical Laboratory Technology, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
| |
|
49
|
Chao KR, Wang L, Panchal R, Liao C, Abderrazzaq H, Ye R, Schultz P, Compitello J, Grant RH, Kosmicki JA, Weisburd B, Phu W, Wilson MW, Laricchia KM, Goodrich JK, Goldstein D, Goldstein JI, Vittal C, Poterba T, Baxter S, Watts NA, Solomonson M, Tiao G, Rehm HL, Neale BM, Talkowski ME, MacArthur DG, O'Donnell-Luria A, Karczewski KJ, Radivojac P, Daly MJ, Samocha KE. The landscape of regional missense mutational intolerance quantified from 125,748 exomes. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.04.11.588920. [PMID: 38645134 PMCID: PMC11030311 DOI: 10.1101/2024.04.11.588920] [Indexed: 04/23/2024]
Abstract
Missense variants can have a range of functional impacts depending on factors such as the specific amino acid substitution and location within the gene. To interpret their deleteriousness, studies have sought to identify regions within genes that are specifically intolerant of missense variation [1-12]. Here, we leverage the patterns of rare missense variation in 125,748 individuals in the Genome Aggregation Database (gnomAD) [13] against a null mutational model to identify transcripts that display regional differences in missense constraint. Missense-depleted regions are enriched for ClinVar [14] pathogenic variants, de novo missense variants from individuals with neurodevelopmental disorders (NDDs) [15,16], and complex trait heritability. Following ClinGen calibration recommendations for the ACMG/AMP guidelines, we establish that regions with less than 20% of their expected missense variation achieve moderate support for pathogenicity. We create a missense deleteriousness metric (MPC) that incorporates regional constraint and outperforms other deleteriousness scores at stratifying case and control de novo missense variation, with a strong enrichment in NDDs. These results provide additional tools to aid in missense variant interpretation.
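The "less than 20% of expected missense variation" criterion in this abstract can be read as a threshold on an observed/expected ratio. The following is a minimal sketch under that reading only: the function names, the example counts, and the treatment of the threshold as a simple ratio cutoff are assumptions for illustration, and it does not reproduce the null mutational model or the statistical procedure used in the paper.

```python
def missense_oe_ratio(observed: int, expected: float) -> float:
    """Observed/expected missense ratio for a region (hypothetical helper)."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


def below_constraint_threshold(observed: int, expected: float,
                               threshold: float = 0.2) -> bool:
    """True if the region shows less than `threshold` of its expected variation.

    The 0.2 default mirrors the 20% figure quoted in the abstract.
    """
    return missense_oe_ratio(observed, expected) < threshold


# Made-up example regions: (observed missense count, expected missense count).
constrained_region = (3, 25.0)
tolerant_region = (12, 25.0)

print(below_constraint_threshold(*constrained_region))
print(below_constraint_threshold(*tolerant_region))
```

In practice the expected count would come from a mutational model, and a region boundary would be chosen by a statistical test rather than supplied by hand as it is here.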
|
50
|
Zhong G, Zhao Y, Zhuang D, Chung WK, Shen Y. PreMode predicts mode-of-action of missense variants by deep graph representation learning of protein sequence and structural context. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.20.581321. [PMID: 38746140 PMCID: PMC11092447 DOI: 10.1101/2024.02.20.581321] [Indexed: 05/16/2024]
Abstract
Accurate prediction of the functional impact of missense variants is important for disease gene discovery, clinical genetic diagnostics, therapeutic strategies, and protein engineering. Previous efforts have focused on predicting a binary pathogenicity classification, but the functional impact of missense variants is multi-dimensional. Pathogenic missense variants in the same gene may act through different modes of action (i.e., gain/loss-of-function) by affecting different aspects of protein function. They may result in distinct clinical conditions that require different treatments. We developed a new method, PreMode, to perform gene-specific mode-of-action predictions. PreMode models effects of coding sequence variants using SE(3)-equivariant graph neural networks on protein sequences and structures. Using the largest-to-date set of missense variants with known modes of action, we showed that PreMode reached state-of-the-art performance in multiple types of mode-of-action predictions by efficient transfer-learning. Additionally, PreMode's prediction of G/LoF variants in a kinase is consistent with inactive-active conformation transition energy changes. Finally, we show that PreMode enables efficient study design of deep mutational scans and optimization in protein engineering.
|