1
|
Zhang J, Kinch L, Katsonis P, Lichtarge O, Jagota M, Song YS, Sun Y, Shen Y, Kuru N, Dereli O, Adebali O, Alladin MA, Pal D, Capriotti E, Turina MP, Savojardo C, Martelli PL, Babbi G, Casadio R, Pucci F, Rooman M, Cia G, Tsishyn M, Strokach A, Hu Z, van Loggerenberg W, Roth FP, Radivojac P, Brenner SE, Cong Q, Grishin NV. Assessing predictions on fitness effects of missense variants in HMBS in CAGI6. Hum Genet 2025; 144:173-189. [PMID: 39110250 PMCID: PMC12085147 DOI: 10.1007/s00439-024-02680-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2023] [Accepted: 05/17/2024] [Indexed: 02/21/2025]
Abstract
This paper presents an evaluation of predictions submitted for the "HMBS" challenge, a component of the sixth round of the Critical Assessment of Genome Interpretation held in 2021. The challenge required participants to predict the effects of missense variants of the human HMBS gene on yeast growth. The HMBS enzyme, critical for the biosynthesis of heme in eukaryotic cells, is highly conserved among eukaryotes. Despite the application of a variety of algorithms and methods, the performance of predictors was relatively similar, with Kendall's tau correlation coefficients between predictions and experimental scores around 0.3 for a majority of submissions. Notably, the median correlation (≥ 0.34) observed among these predictors, especially the top predictions from different groups, was greater than the correlation observed between their predictions and the actual experimental results. Most predictors were moderately successful in distinguishing between deleterious and benign variants, as evidenced by an area under the receiver operating characteristic (ROC) curve (AUC) of approximately 0.7 respectively. Compared with the recent two rounds of CAGI competitions, we noticed more predictors outperformed the baseline predictor, which is solely based on the amino acid frequencies. Nevertheless, the overall accuracy of predictions is still far short of positive control, which is derived from experimental scores, indicating the necessity for considerable improvements in the field. The most inaccurately predicted variants in this round were associated with the insertion loop, which is absent in many orthologs, suggesting the predictors still heavily rely on the information from multiple sequence alignment.
Collapse
Affiliation(s)
- Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Lisa Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Milind Jagota
- Computer Science Division, University of California, Berkeley, CA, 94720, USA
| | - Yun S Song
- Computer Science Division, University of California, Berkeley, CA, 94720, USA
- Department of Statistics, University of California, Berkeley, Berkeley, CA, 94720, USA
| | - Yuanfei Sun
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
| | - Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
| | - Nurdan Kuru
- Faculty of Engineering and Natural Sciences, Sabanci University, Tuzla, Turkey
| | - Onur Dereli
- Faculty of Engineering and Natural Sciences, Sabanci University, Tuzla, Turkey
| | - Ogun Adebali
- Faculty of Engineering and Natural Sciences, Sabanci University, Tuzla, Turkey
| | - Muttaqi Ahmad Alladin
- Department of Computational and Data Sciences, Indian Institute of Science, Bangaluru, 560012, India
| | - Debnath Pal
- Department of Computational and Data Sciences, Indian Institute of Science, Bangaluru, 560012, India
| | - Emidio Capriotti
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Maria Paola Turina
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Castrense Savojardo
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Pier Luigi Martelli
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Giulia Babbi
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Rita Casadio
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Gabriel Cia
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Matsvei Tsishyn
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Alexey Strokach
- Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada
| | - Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, 94720, USA
| | - Warren van Loggerenberg
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15213, USA
- Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, M5G 1X5, Canada
| | - Frederick P Roth
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15213, USA
- Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, M5G 1X5, Canada
| | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, 02115, USA
| | - Steven E Brenner
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, 94720, USA
- Biophysics Graduate Group, University of California, Berkeley, Berkeley, CA, 94720, USA
| | - Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| | - Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| |
Collapse
|
2
|
Hermans P, Tsishyn M, Schwersensky M, Rooman M, Pucci F. Exploring Evolution to Uncover Insights Into Protein Mutational Stability. Mol Biol Evol 2025; 42:msae267. [PMID: 39786559 PMCID: PMC11721782 DOI: 10.1093/molbev/msae267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2024] [Revised: 11/27/2024] [Accepted: 11/28/2024] [Indexed: 01/12/2025] Open
Abstract
Determining the impact of mutations on the thermodynamic stability of proteins is essential for a wide range of applications such as rational protein design and genetic variant interpretation. Since protein stability is a major driver of evolution, evolutionary data are often used to guide stability predictions. Many state-of-the-art stability predictors extract evolutionary information from multiple sequence alignments of proteins homologous to a query protein, and leverage it to predict the effects of mutations on protein stability. To evaluate the power and the limitations of such methods, we used the massive amount of stability data recently obtained by deep mutational scanning to study how best to construct multiple sequence alignments and optimally extract evolutionary information from them. We tested different evolutionary models and found that, unexpectedly, independent-site models achieve similar accuracy to more complex epistatic models. A detailed analysis of the latter models suggests that their inference often results in noisy couplings, which do not appear to add predictive power over the independent-site contribution, at least in the context of stability prediction. Interestingly, by combining any of the evolutionary features with a simple structural feature, the relative solvent accessibility of the mutated residue, we achieved similar prediction accuracy to supervised, machine learning-based, protein stability change predictors. Our results provide new insights into the relationship between protein evolution and stability, and show how evolutionary information can be exploited to improve the performance of mutational stability prediction.
Collapse
Affiliation(s)
- Pauline Hermans
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
| | - Matsvei Tsishyn
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
| | - Martin Schwersensky
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
| | - Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
| | - Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
| |
Collapse
|
3
|
Dietler N, Abbara A, Choudhury S, Bitbol AF. Impact of phylogeny on the inference of functional sectors from protein sequence data. PLoS Comput Biol 2024; 20:e1012091. [PMID: 39312591 PMCID: PMC11449291 DOI: 10.1371/journal.pcbi.1012091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2024] [Revised: 10/03/2024] [Accepted: 09/10/2024] [Indexed: 09/25/2024] Open
Abstract
Statistical analysis of multiple sequence alignments of homologous proteins has revealed groups of coevolving amino acids called sectors. These groups of amino-acid sites feature collective correlations in their amino-acid usage, and they are associated to functional properties. Modeling showed that nonlinear selection on an additive functional trait of a protein is generically expected to give rise to a functional sector. These modeling results motivated a principled method, called ICOD, which is designed to identify functional sectors, as well as mutational effects, from sequence data. However, a challenge for all methods aiming to identify sectors from multiple sequence alignments is that correlations in amino-acid usage can also arise from the mere fact that homologous sequences share common ancestry, i.e. from phylogeny. Here, we generate controlled synthetic data from a minimal model comprising both phylogeny and functional sectors. We use this data to dissect the impact of phylogeny on sector identification and on mutational effect inference by different methods. We find that ICOD is most robust to phylogeny, but that conservation is also quite robust. Next, we consider natural multiple sequence alignments of protein families for which deep mutational scan experimental data is available. We show that in this natural data, conservation and ICOD best identify sites with strong functional roles, in agreement with our results on synthetic data. Importantly, these two methods have different premises, since they respectively focus on conservation and on correlations. Thus, their joint use can reveal complementary information.
Collapse
Affiliation(s)
- Nicola Dietler
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Alia Abbara
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Subham Choudhury
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| |
Collapse
|
4
|
Tsishyn M, Cia G, Hermans P, Kwasigroch J, Rooman M, Pucci F. FiTMuSiC: leveraging structural and (co)evolutionary data for protein fitness prediction. Hum Genomics 2024; 18:36. [PMID: 38627807 PMCID: PMC11020440 DOI: 10.1186/s40246-024-00605-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2023] [Accepted: 04/02/2024] [Indexed: 04/19/2024] Open
Abstract
Systematically predicting the effects of mutations on protein fitness is essential for the understanding of genetic diseases. Indeed, predictions complement experimental efforts in analyzing how variants lead to dysfunctional proteins that in turn can cause diseases. Here we present our new fitness predictor, FiTMuSiC, which leverages structural, evolutionary and coevolutionary information. We show that FiTMuSiC predicts fitness with high accuracy despite the simplicity of its underlying model: it was among the top predictors on the hydroxymethylbilane synthase (HMBS) target of the sixth round of the Critical Assessment of Genome Interpretation challenge (CAGI6) and performs as well as much more complex deep learning models such as AlphaMissense. To further demonstrate FiTMuSiC's robustness, we compared its predictions with in vitro activity data on HMBS, variant fitness data on human glucokinase (GCK), and variant deleteriousness data on HMBS and GCK. These analyses further confirm FiTMuSiC's qualities and accuracy, which compare favorably with those of other predictors. Additionally, FiTMuSiC returns two scores that separately describe the functional and structural effects of the variant, thus providing mechanistic insight into why the variant leads to fitness loss or gain. We also provide an easy-to-use webserver at https://babylone.ulb.ac.be/FiTMuSiC , which is freely available for academic use and does not require any bioinformatics expertise, which simplifies the accessibility of our tool for the entire scientific community.
Collapse
Affiliation(s)
- Matsvei Tsishyn
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Triumph Bvd, 1050, Brussels, Belgium
| | - Gabriel Cia
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Triumph Bvd, 1050, Brussels, Belgium
| | - Pauline Hermans
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Triumph Bvd, 1050, Brussels, Belgium
| | - Jean Kwasigroch
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Triumph Bvd, 1050, Brussels, Belgium
| | - Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Triumph Bvd, 1050, Brussels, Belgium
| | - Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium.
- Interuniversity Institute of Bioinformatics in Brussels, Triumph Bvd, 1050, Brussels, Belgium.
| |
Collapse
|