26
Aljarf R, Tang S, Pires DEV, Ascher DB. embryoTox: Using Graph-Based Signatures to Predict the Teratogenicity of Small Molecules. J Chem Inf Model 2023; 63:432-441. [PMID: 36595441 DOI: 10.1021/acs.jcim.2c00824] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 01/04/2023]
Abstract
Teratogenic drugs can cause severe fetal malformations and critically affect fetal health, yet the teratogenic risks associated with most approved drugs are unknown. Here, we propose a novel predictive tool, embryoTox, which utilizes a graph-based signature representation of the chemical structure of a small molecule to predict and classify molecules likely to be safe during pregnancy. embryoTox was trained and validated using in vitro bioactivity data of over 700 small molecules with characterized teratogenicity effects. Our final model achieved an area under the receiver operating characteristic curve (AUC) of up to 0.96 on 10-fold cross-validation and 0.82 on nonredundant blind tests, outperforming alternative approaches. We believe that our predictive tool will provide a practical resource for optimizing screening libraries to determine effective and safe molecules to use during pregnancy. To provide a simple and integrated platform to rapidly screen for potentially safe molecules and their risk factors, we made embryoTox freely available online at https://biosig.lab.uq.edu.au/embryotox/.
27
Ascher DB, Kaminskas LM, Myung Y, Pires DEV. Using Graph-Based Signatures to Guide Rational Antibody Engineering. Methods Mol Biol 2023; 2552:375-397. [PMID: 36346604 DOI: 10.1007/978-1-0716-2609-2_21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Antibodies are essential experimental and diagnostic tools and as biotherapeutics have significantly advanced our ability to treat a range of diseases. With recent innovations in computational tools to guide protein engineering, we can now rationally design better antibodies with improved efficacy, stability, and pharmacokinetics. Here, we describe the use of the mCSM web-based in silico suite, which uses graph-based signatures to rapidly identify the structural and functional consequences of mutations, to guide rational antibody engineering to improve stability, affinity, and specificity.
28
Boer JC, Pan Q, Holien JK, Nguyen TB, Ascher DB, Plebanski M. A bias of Asparagine to Lysine mutations in SARS-CoV-2 outside the receptor binding domain affects protein flexibility. Front Immunol 2022; 13:954435. [PMID: 36569921 PMCID: PMC9788125 DOI: 10.3389/fimmu.2022.954435] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2022] [Accepted: 11/14/2022] [Indexed: 12/14/2022] Open
Abstract
Introduction: The COVID-19 pandemic has been threatening public health and economic development worldwide for over two years. Compared with the original SARS-CoV-2 strain reported in 2019, the Omicron variant (B.1.1.529.1) is more transmissible. This variant has 34 mutations in its Spike protein, 15 of which are present in the Receptor Binding Domain (RBD), facilitating viral internalization via binding to the angiotensin-converting enzyme 2 (ACE2) receptor on endothelial cells as well as promoting increased immune evasion capacity. Methods: Herein we compared SARS-CoV-2 proteins (including ORF3a, ORF7, ORF8, Nucleoprotein (N), membrane protein (M) and Spike (S) proteins) from multiple ancestral strains. We included the currently designated original Variant of Concern (VOC) Omicron, its subsequently emerged variants BA.1, BA.2, BA.3, BA.4 and BA.5, and the two currently emerging variants BQ.1 and BBX.1, and compared these with the previously circulating VOCs Alpha, Beta, Gamma, and Delta, to better understand the nature and potential impact of Omicron-specific mutations. Results: Only in Omicron and its subvariants was a bias toward an Asparagine to Lysine (N to K) mutation evident within the Spike protein, including regions outside the RBD, while no regions outside the Spike protein were characterized by this mutational bias. Computational structural analysis revealed that three of these specific mutations, located in the central core region, contribute to a preference for the alteration of conformations of the Spike protein. Several mutations in the RBD which have circulated across most Omicron subvariants were also analysed, and these showed more potential for immune escape. Conclusion: This study emphasizes the importance of understanding how specific N to K mutations outside of the RBD region affect SARS-CoV-2 conformational changes and the need for neutralizing antibodies for Omicron to target a subset of conformationally dependent B cell epitopes.
29
Williams NP, Rodrigues CHM, Truong J, Ascher DB, Holien JK. DockNet: high-throughput protein-protein interface contact prediction. Bioinformatics 2022; 39:6885444. [PMID: 36484688 PMCID: PMC9825772 DOI: 10.1093/bioinformatics/btac797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2022] [Revised: 10/27/2022] [Accepted: 12/08/2022] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION: Over 300 000 protein-protein interaction (PPI) pairs have been identified in the human proteome and targeting these is fast becoming the next frontier in drug design. Predicting PPI sites, however, is a challenging task that traditionally requires computationally expensive and time-consuming docking simulations. A major weakness of modern protein docking algorithms is the inability to account for protein flexibility, which ultimately leads to relatively poor results. RESULTS: Here, we propose DockNet, an efficient Siamese graph-based neural network method which predicts contact residues between two interacting proteins. Unlike other methods that only utilize a protein's surface or treat the protein structure as a rigid body, DockNet incorporates the entire protein structure and places no limits on protein flexibility during an interaction. Predictions are modeled at the residue level, based on a diverse set of input node features including residue type, surface accessibility, residue depth, secondary structure, pharmacophore and torsional angles. DockNet is comparable to current state-of-the-art methods, achieving an area under the curve (AUC) value of up to 0.84 on an independent test set (DB5), can be applied to a variety of different protein structures and can be utilized in situations where accurate unbound protein structures cannot be obtained. AVAILABILITY AND IMPLEMENTATION: DockNet is available at https://github.com/npwilliams09/docknet and an easy-to-use webserver at https://biosig.lab.uq.edu.au/docknet. All other data underlying this article are available in the article and in its online supplementary material. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
30
Parthasarathy S, Ruggiero SM, Gelot A, Soardi FC, Ribeiro BFR, Pires DEV, Ascher DB, Schmitt A, Rambaud C, Represa A, Xie HM, Lusk L, Wilmarth O, McDonnell PP, Juarez OA, Grace AN, Buratti J, Mignot C, Gras D, Nava C, Pierce SR, Keren B, Kennedy BC, Pena SDJ, Helbig I, Cuddapah VA. A recurrent de novo splice site variant involving DNM1 exon 10a causes developmental and epileptic encephalopathy through a dominant-negative mechanism. Am J Hum Genet 2022; 109:2253-2269. [PMID: 36413998 PMCID: PMC9748255 DOI: 10.1016/j.ajhg.2022.11.002] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 05/31/2022] [Accepted: 11/01/2022] [Indexed: 11/23/2022] Open
Abstract
Heterozygous pathogenic variants in DNM1 cause developmental and epileptic encephalopathy (DEE) as a result of a dominant-negative mechanism impeding vesicular fission. Thus far, pathogenic variants in DNM1 have been studied with a canonical transcript that includes the alternatively spliced exon 10b. However, after performing RNA sequencing in 39 pediatric brain samples, we find the primary transcript expressed in the brain includes the downstream exon 10a instead. Using this information, we evaluated genotype-phenotype correlations of variants affecting exon 10a and identified a cohort of eleven previously unreported individuals. Eight individuals harbor a recurrent de novo splice site variant, c.1197-8G>A (GenBank: NM_001288739.1), which affects exon 10a and leads to DEE consistent with the classical DNM1 phenotype. We find this splice site variant leads to disease through an unexpected dominant-negative mechanism. Functional testing reveals an in-frame upstream splice acceptor causing insertion of two amino acids predicted to impair oligomerization-dependent activity. This is supported by neuropathological samples showing accumulation of enlarged synaptic vesicles adherent to the plasma membrane consistent with impaired vesicular fission. Two additional individuals with missense variants affecting exon 10a, p.Arg399Trp and p.Gly401Asp, had a similar DEE phenotype. In contrast, one individual with a missense variant affecting exon 10b, p.Pro405Leu, which is less expressed in the brain, had a correspondingly less severe presentation. Thus, we implicate variants affecting exon 10a as causing the severe DEE typically associated with DNM1-related disorders. We highlight the importance of considering relevant isoforms for disease-causing variants as well as the possibility of splice site variants acting through a dominant-negative mechanism.
31
Zhou Y, Al‐Jarf R, Alavi A, Nguyen TB, Rodrigues CHM, Pires DEV, Ascher DB. kinCSM: Using graph-based signatures to predict small molecule CDK2 inhibitors. Protein Sci 2022; 31:e4453. [PMID: 36305769 PMCID: PMC9597374 DOI: 10.1002/pro.4453] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/31/2022] [Revised: 09/14/2022] [Accepted: 09/15/2022] [Indexed: 11/20/2022]
Abstract
Protein phosphorylation acts as an essential on/off switch in many cellular signaling pathways. This has led to ongoing interest in targeting kinases for therapeutic intervention. Computer‐aided drug discovery has proven a useful and cost‐effective approach for facilitating prioritization and enrichment of screening libraries, but limited effort has been devoted to providing insights into what makes a potent kinase inhibitor. To fill this gap, here we developed kinCSM, an integrative computational tool capable of accurately identifying potent cyclin‐dependent kinase 2 (CDK2) inhibitors, quantitatively predicting CDK2 ligand–kinase inhibition constants (pKi) and classifying different types of inhibitors based on their favorable binding modes. kinCSM predictive models were built using supervised learning and leveraged the concept of graph‐based signatures to capture both physicochemical and geometric properties of small molecules. CDK2 inhibitors were accurately identified with Matthews correlation coefficients (MCC) of up to 0.74, and inhibition constants predicted with Pearson's correlation of up to 0.76, both with consistent performances of 0.66 and 0.68 on a nonredundant blind test, respectively. kinCSM was also able to identify the potential type of inhibition for a given molecule, achieving MCC of up to 0.80 on cross‐validation and 0.73 on the blind test. Analyzing molecular composition revealed enriched chemical fragments in CDK2 inhibitors and in different types of inhibitors, which provides insights into the molecular mechanisms behind ligand–kinase interactions. kinCSM will be an invaluable tool to guide future kinase drug discovery. To aid the fast and accurate screening of CDK2 inhibitors, kinCSM is freely available at https://biosig.lab.uq.edu.au/kin_csm/.
32
Akdel M, Pires DEV, Pardo EP, Jänes J, Zalevsky AO, Mészáros B, Bryant P, Good LL, Laskowski RA, Pozzati G, Shenoy A, Zhu W, Kundrotas P, Serra VR, Rodrigues CHM, Dunham AS, Burke D, Borkakoti N, Velankar S, Frost A, Basquin J, Lindorff-Larsen K, Bateman A, Kajava AV, Valencia A, Ovchinnikov S, Durairaj J, Ascher DB, Thornton JM, Davey NE, Stein A, Elofsson A, Croll TI, Beltrao P. A structural biology community assessment of AlphaFold2 applications. Nat Struct Mol Biol 2022; 29:1056-1067. [PMID: 36344848 PMCID: PMC9663297 DOI: 10.1038/s41594-022-00849-w] [Citation(s) in RCA: 193] [Impact Index Per Article: 96.5] [Received: 10/25/2021] [Accepted: 09/20/2022] [Indexed: 11/09/2022]
Abstract
Most proteins fold into 3D structures that determine how they function and orchestrate the biological processes of the cell. Recent developments in computational methods for protein structure predictions have reached the accuracy of experimentally determined models. Although this has been independently verified, the implementation of these methods across structural-biology applications remains to be tested. Here, we evaluate the use of AlphaFold2 (AF2) predictions in the study of characteristic structural elements; the impact of missense variants; function and ligand binding site predictions; modeling of interactions; and modeling of experimental structural data. For 11 proteomes, an average of 25% additional residues can be confidently modeled when compared with homology modeling, identifying structural features rarely seen in the Protein Data Bank. AF2-based predictions of protein disorder and complexes surpass dedicated tools, and AF2 models can be used across diverse applications equally well compared with experimentally determined structures, when the confidence metrics are critically considered. In summary, we find that these advances are likely to have a transformative impact in structural biology and broader life-science research.
33
Iftkhar S, de Sá AGC, Velloso JPL, Aljarf R, Pires DEV, Ascher DB. cardioToxCSM: A Web Server for Predicting Cardiotoxicity of Small Molecules. J Chem Inf Model 2022; 62:4827-4836. [PMID: 36219164 DOI: 10.1021/acs.jcim.2c00822] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 11/28/2022]
Abstract
The design of novel, safe, and effective drugs to treat human diseases is a challenging venture, with toxicity being one of the main sources of attrition at later stages of development. Failure due to toxicity incurs a significant increase in costs and time to market, with multiple drugs being withdrawn from the market due to their adverse effects. Cardiotoxicity, for instance, was responsible for the failure of drugs such as fenspiride, propoxyphene, and valdecoxib. While significant effort has been dedicated to mitigating this issue by developing computational approaches that aim to identify molecules likely to be toxic, including quantitative structure-activity relationship models and machine learning methods, current approaches present limited performance and interpretability. To overcome these limitations, we propose a new web-based computational method, cardioToxCSM, which can predict six types of cardiac toxicity outcomes, including arrhythmia, cardiac failure, heart block, hERG toxicity, hypertension, and myocardial infarction, efficiently and accurately. cardioToxCSM was developed using the concept of graph-based signatures, molecular descriptors, toxicophore matchings, and molecular fingerprints, leveraging explainable machine learning, and was validated internally via different cross-validation schemes and externally via low-redundancy blind sets. The models presented robust performances with areas under ROC curves of up to 0.898 on 5-fold cross-validation, consistent with metrics on blind tests. Additionally, our models provide interpretation of the predictions by identifying whether substructures that are commonly enriched in toxic compounds were present. We believe cardioToxCSM will provide valuable insight into the potential cardiotoxicity of small molecules early in drug screening efforts. The method is made freely available as a web server at https://biosig.lab.uq.edu.au/cardiotoxcsm.
34
Rodrigues CHM, Garg A, Keizer D, Pires DEV, Ascher DB. CSM-peptides: A computational approach to rapid identification of therapeutic peptides. Protein Sci 2022; 31:e4442. [PMID: 36173168 PMCID: PMC9518225 DOI: 10.1002/pro.4442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2022] [Revised: 08/29/2022] [Accepted: 08/30/2022] [Indexed: 11/25/2022]
Abstract
Peptides are attractive alternatives for the development of new therapeutic strategies due to their versatility and low complexity of synthesis. Increasing interest in these molecules has led to the creation of large collections of experimentally characterized therapeutic peptides, which greatly contributes to the development of data‐driven computational approaches. Here we propose CSM‐peptides, a novel machine learning method for rapid identification of eight different types of therapeutic peptides: anti‐angiogenic, anti‐bacterial, anti‐cancer, anti‐inflammatory, anti‐viral, cell‐penetrating, quorum sensing, and surface binding. Our method has been shown to outperform existing approaches, achieving an AUC of up to 0.92 on independent blind tests, and consistent performance on cross‐validation. We anticipate CSM‐peptides to be of great value in helping to screen large libraries to identify novel peptides with therapeutic potential and have made it freely available as a user‐friendly web server and Application Programming Interface at https://biosig.lab.uq.edu.au/csm_peptides.
35
Tichkule S, Myung Y, Naung MT, Ansell BRE, Guy AJ, Srivastava N, Mehra S, Cacciò SM, Mueller I, Barry AE, van Oosterhout C, Pope B, Ascher DB, Jex AR. VIVID: a web application for variant interpretation and visualisation in multidimensional analyses. Mol Biol Evol 2022; 39:6697981. [PMID: 36103257 PMCID: PMC9514033 DOI: 10.1093/molbev/msac196] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 11/13/2022] Open
Abstract
Large-scale comparative genomic and population genetic studies generate enormous amounts of polymorphism data in the form of DNA variants. Ultimately, the goal of many of these studies is to associate genetic variants with phenotypes or fitness. We introduce VIVID, an interactive, user-friendly web application that integrates a wide range of approaches for encoding genotypic to phenotypic information in any organism or disease, from an individual or population, in three-dimensional (3D) space. It allows mutation mapping and annotation, calculation of interactions and conservation scores, prediction of harmful effects, analysis of diversity and selection, and 3D visualization of genotypic information encoded in Variant Call Format on AlphaFold2 protein models. VIVID enables the rapid assessment of genes of interest in the study of adaptive evolution and the genetic load, and it helps prioritize targets for experimental validation. We demonstrate the utility of VIVID by exploring the evolutionary genetics of the parasitic protist Plasmodium falciparum, revealing geographic variation in the signature of balancing selection in potential targets of functional antibodies.
36
Ruff KM, Choi YH, Cox D, Ormsby AR, Myung Y, Ascher DB, Radford SE, Pappu RV, Hatters DM. Sequence grammar underlying the unfolding and phase separation of globular proteins. Mol Cell 2022; 82:3193-3208.e8. [PMID: 35853451 PMCID: PMC10846692 DOI: 10.1016/j.molcel.2022.06.024] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 08/19/2021] [Revised: 05/05/2022] [Accepted: 06/15/2022] [Indexed: 12/23/2022]
Abstract
Aberrant phase separation of globular proteins is associated with many diseases. Here, we use a model protein system to understand how the unfolded states of globular proteins drive phase separation and the formation of unfolded protein deposits (UPODs). We find that for UPODs to form, the concentrations of unfolded molecules must be above a threshold value. Additionally, unfolded molecules must possess appropriate sequence grammars to drive phase separation. While UPODs recruit molecular chaperones, their compositional profiles are also influenced by synergistic physicochemical interactions governed by the sequence grammars of unfolded proteins and cellular proteins. Overall, the driving forces for phase separation and the compositional profiles of UPODs are governed by the sequence grammars of unfolded proteins. Our studies highlight the need for uncovering the sequence grammars of unfolded proteins that drive UPOD formation and cause gain-of-function interactions whereby proteins are aberrantly recruited into UPODs.
37
de Sá AGC, Long Y, Portelli S, Pires DEV, Ascher DB. toxCSM: comprehensive prediction of small molecule toxicity profiles. Brief Bioinform 2022; 23:6673851. [PMID: 35998885 DOI: 10.1093/bib/bbac337] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 04/19/2022] [Revised: 07/17/2022] [Accepted: 07/23/2022] [Indexed: 01/29/2023] Open
Abstract
Drug discovery is a lengthy, costly and high-risk endeavour that is further complicated by high attrition rates in later development stages. Toxicity has been one of the main causes of failure during clinical trials, increasing drug development time and costs. To facilitate early identification and optimisation of toxicity profiles, several computational tools have emerged that aim to improve success rates through timely pre-screening of drug candidates. Despite these efforts, there is an increasing demand for platforms capable of assessing both environmental and human-based toxicity properties at large scale. Here, we present toxCSM, a comprehensive computational platform for the study and optimisation of toxicity profiles of small molecules. toxCSM leverages the well-established concepts of graph-based signatures, molecular descriptors and similarity scores to develop 36 models for predicting a range of toxicity properties, which can assist in developing safer drugs and agrochemicals. toxCSM achieved an Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) of up to 0.99 and Pearson's correlation coefficients of up to 0.94 on 10-fold cross-validation, with comparable performance on blind test sets, outperforming all alternative methods. toxCSM is freely available as a user-friendly web server and API at http://biosig.lab.uq.edu.au/toxcsm.
38
Rodrigues CHM, Pires DEV, Blundell TL, Ascher DB. Structural landscapes of PPI interfaces. Brief Bioinform 2022; 23:bbac165. [PMID: 35656714 PMCID: PMC9294409 DOI: 10.1093/bib/bbac165] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/22/2021] [Revised: 03/10/2022] [Accepted: 04/13/2022] [Indexed: 02/07/2023] Open
Abstract
Proteins are capable of highly specific interactions and are responsible for a wide range of functions, making them attractive in the pursuit of new therapeutic options. Previous studies focusing on overall geometry of protein-protein interfaces, however, concluded that PPI interfaces were generally flat. More recently, this idea has been challenged by their structural and thermodynamic characterisation, suggesting the existence of concave binding sites that are closer in character to traditional small-molecule binding sites, rather than exhibiting complete flatness. Here, we present a large-scale analysis of binding geometry and physicochemical properties of all protein-protein interfaces available in the Protein Data Bank. In this review, we provide a comprehensive overview of the protein-protein interface landscape, including evidence that even for overall larger, more flat interfaces that utilize discontinuous interacting regions, small and potentially druggable pockets are utilized at binding sites.
39
Aljarf R, Shen M, Pires DEV, Ascher DB. Understanding and predicting the functional consequences of missense mutations in BRCA1 and BRCA2. Sci Rep 2022; 12:10458. [PMID: 35729312 PMCID: PMC9213547 DOI: 10.1038/s41598-022-13508-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/05/2021] [Accepted: 05/25/2022] [Indexed: 11/21/2022] Open
Abstract
BRCA1 and BRCA2 are tumour suppressor genes that play a critical role in maintaining genomic stability via the DNA repair mechanism. DNA repair defects caused by BRCA1 and BRCA2 missense variants increase the risk of developing breast and ovarian cancers. Accurate identification of these variants is therefore clinically relevant as a means to guide personalized patient management and early detection. Next-generation sequencing efforts have significantly increased data availability but also the discovery of variants of uncertain significance that need interpretation. Experimental approaches used to measure the molecular consequences of these variants, however, are usually costly and time-consuming. Therefore, computational tools have emerged as faster alternatives for assisting in the interpretation of the clinical significance of newly discovered variants. To better understand and predict variant pathogenicity in BRCA1 and BRCA2, various machine learning algorithms have been proposed, but these have shown limited performance. Here we present BRCA1 and BRCA2 gene-specific models and a generic model for quantifying the functional impacts of single-point missense variants in these genes. Across tenfold cross-validation, our final models achieved a Matthews correlation coefficient (MCC) of up to 0.98 and comparable performance of up to 0.89 across independent, non-redundant blind tests, outperforming alternative approaches. We believe our predictive tool will be a valuable resource for providing insights into understanding and interpreting the functional consequences of missense variants in these genes and as a tool for guiding the interpretation of newly discovered variants and prioritizing mutations for experimental validation.
40
Rezende PM, Xavier JS, Ascher DB, Fernandes GR, Pires DEV. Evaluating hierarchical machine learning approaches to classify biological databases. Brief Bioinform 2022; 23:6611916. [PMID: 35724625 PMCID: PMC9310517 DOI: 10.1093/bib/bbac216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2021] [Revised: 04/29/2022] [Accepted: 05/09/2022] [Indexed: 12/04/2022] Open
Abstract
The rate of biological data generation has increased dramatically in recent years, which has driven the importance of databases as a resource to guide innovation and the generation of biological insights. Given the complexity and scale of these databases, automatic data classification is often required. Biological data sets are often hierarchical in nature, with varying degrees of complexity, imposing different challenges to train, test and validate accurate and generalizable classification models. While some approaches to classify hierarchical data have been proposed, no guidelines regarding their utility, applicability and limitations have been explored or implemented. These include ‘Local’ approaches considering the hierarchy, building models per level or node, and ‘Global’ hierarchical classification, using a flat classification approach. To fill this gap, here we have systematically contrasted the performance of ‘Local per Level’ and ‘Local per Node’ approaches with a ‘Global’ approach applied to two different hierarchical datasets: BioLip and CATH. The results show how different components of hierarchical data sets, such as variation coefficient and prediction by depth, can guide the choice of appropriate classification schemes. Finally, we provide guidelines to support this process when embarking on a hierarchical classification task, which will help optimize computational resources and predictive performance.
41
Rodrigues CHM, Ascher DB. CSM-Potential: mapping protein interactions and biological ligands in 3D space using geometric deep learning. Nucleic Acids Res 2022; 50:W204-W209. [PMID: 35609999 PMCID: PMC9252741 DOI: 10.1093/nar/gkac381] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/24/2022] [Revised: 04/19/2022] [Accepted: 05/05/2022] [Indexed: 11/13/2022] Open
Abstract
Recent advances in protein structural modelling have enabled the accurate prediction of the holo 3D structures of almost any protein; however, protein function is intrinsically linked to the interactions it makes. While a number of computational approaches have been proposed to explore potential biological interactions, they have been limited to specific interactions, and have not been readily accessible to non-experts or for use in bioinformatics pipelines. Here we present CSM-Potential, a geometric deep learning approach to identify regions of a protein surface that are likely to mediate protein-protein and protein-ligand interactions in order to provide a link between 3D structure and biological function. Our method has shown robust performance, outperforming existing methods for both predictive tasks. By assessing the performance of CSM-Potential on independent blind tests, we show that our method was able to achieve ROC AUC values of up to 0.81 for the identification of potential protein-protein binding sites, and up to 0.96 accuracy on biological ligand classification. Our method is freely available as a user-friendly web server and API at http://biosig.unimelb.edu.au/csm_potential.
42
Paiva VA, Mendonça MV, Silveira SA, Ascher DB, Pires DEV, Izidoro SC. GASS-Metal: identifying metal-binding sites on protein structures using genetic algorithms. Brief Bioinform 2022; 23:6590153. [PMID: 35595534 DOI: 10.1093/bib/bbac178] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/24/2022] [Revised: 04/18/2022] [Accepted: 04/20/2022] [Indexed: 12/12/2022] Open
Abstract
Metals are present in >30% of proteins found in nature and help them perform important biological functions, including storage, transport, signal transduction and enzymatic activity. Traditional and experimental techniques for metal-binding site prediction are usually costly and time-consuming, making computational tools that can assist in these predictions of significant importance. Here we present Genetic Active Site Search (GASS)-Metal, a new method for protein metal-binding site prediction. The method relies on a parallel genetic algorithm to find candidate metal-binding sites that are structurally similar to curated templates from M-CSA and MetalPDB. GASS-Metal was thoroughly validated using homologous proteins and conservative mutations of residues, showing robust performance. The ability of GASS-Metal to identify metal-binding sites was also compared with state-of-the-art methods, outperforming similar methods, achieving an MCC of up to 0.57 and detecting up to 96.1% of the sites correctly. GASS-Metal is freely available at https://gassmetal.unifei.edu.br. The GASS-Metal source code is available at https://github.com/sandroizidoro/gassmetal-local.
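The abstract above reports two different figures, a Matthews correlation coefficient (MCC) and a detection rate. As a rough illustration of why the two can diverge, the sketch below computes both from confusion-matrix counts. The counts are hypothetical and are not taken from the paper, and the paper's exact evaluation protocol is not described here; `detection_rate` is assumed to mean sensitivity (recall).

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def detection_rate(tp, fn):
    """Fraction of true sites that were found (sensitivity / recall)."""
    total = tp + fn
    return tp / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only: most true sites are
    # found, but false positives pull the MCC well below the recall.
    tp, tn, fp, fn = 49, 900, 60, 2
    print(f"detection rate: {detection_rate(tp, fn):.3f}")
    print(f"MCC:            {matthews_corrcoef(tp, tn, fp, fn):.3f}")
```

Because MCC uses all four cells of the confusion matrix, it penalises false positives that a detection rate alone does not reflect.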
|
43
|
Stephenson SE, Costain G, Blok LE, Silk MA, Nguyen TB, Dong X, Alhuzaimi DE, Dowling JJ, Walker S, Amburgey K, Hayeems RZ, Rodan LH, Schwartz MA, Picker J, Lynch SA, Gupta A, Rasmussen KJ, Schimmenti LA, Klee EW, Niu Z, Agre KE, Chilton I, Chung WK, Revah-Politi A, Au PB, Griffith C, Racobaldo M, Raas-Rothschild A, Ben Zeev B, Barel O, Moutton S, Morice-Picard F, Carmignac V, Cornaton J, Marle N, Devinsky O, Stimach C, Wechsler SB, Hainline BE, Sapp K, Willems M, Bruel AL, Dias KR, Evans CA, Roscioli T, Sachdev R, Temple SE, Zhu Y, Baker JJ, Scheffer IE, Gardiner FJ, Schneider AL, Muir AM, Mefford HC, Crunk A, Heise EM, Millan F, Monaghan KG, Person R, Rhodes L, Richards S, Wentzensen IM, Cogné B, Isidor B, Nizon M, Vincent M, Besnard T, Piton A, Marcelis C, Kato K, Koyama N, Ogi T, Goh ESY, Richmond C, Amor DJ, Boyce JO, Morgan AT, Hildebrand MS, Kaspi A, Bahlo M, Friðriksdóttir R, Katrínardóttir H, Sulem P, Stefánsson K, Björnsson HT, Mandelstam S, Morleo M, Mariani M, Scala M, Accogli A, Torella A, Capra V, Wallis M, Jansen S, Waisfisz Q, de Haan H, Sadedin S, Lim SC, White SM, Ascher DB, Schenck A, Lockhart PJ, Christodoulou J, Tan TY. Germline variants in tumor suppressor FBXW7 lead to impaired ubiquitination and a neurodevelopmental syndrome. Am J Hum Genet 2022; 109:601-617. [PMID: 35395208 DOI: 10.1016/j.ajhg.2022.03.002] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 10/28/2021] [Accepted: 02/28/2022] [Indexed: 11/01/2022]
Abstract
Neurodevelopmental disorders are highly heterogeneous conditions resulting from abnormalities of brain architecture and/or function. FBXW7 (F-box and WD-repeat-domain-containing 7), a recognized developmental regulator and tumor suppressor, has been shown to regulate cell-cycle progression and cell growth and survival by targeting substrates including CYCLIN E1/2 and NOTCH for degradation via the ubiquitin proteasome system. We used a genotype-first approach and global data-sharing platforms to identify 35 individuals harboring de novo and inherited FBXW7 germline monoallelic chromosomal deletions and nonsense, frameshift, splice-site, and missense variants associated with a neurodevelopmental syndrome. The FBXW7 neurodevelopmental syndrome is distinguished by global developmental delay, borderline to severe intellectual disability, hypotonia, and gastrointestinal issues. Brain imaging detailed variable underlying structural abnormalities affecting the cerebellum, corpus callosum, and white matter. A crystal-structure model of FBXW7 predicted that missense variants were clustered at the substrate-binding surface of the WD40 domain and that these might reduce FBXW7 substrate binding affinity. Expression of recombinant FBXW7 missense variants in cultured cells demonstrated impaired CYCLIN E1 and CYCLIN E2 turnover. Pan-neuronal knockdown of the Drosophila ortholog, archipelago, impaired learning and neuronal function. Collectively, the data presented herein provide compelling evidence of an F-Box protein-related, phenotypically variable neurodevelopmental disorder associated with monoallelic variants in FBXW7.
|
44
|
Pan Q, Nguyen TB, Ascher DB, Pires DEV. Systematic evaluation of computational tools to predict the effects of mutations on protein stability in the absence of experimental structures. Brief Bioinform 2022; 23:bbac025. [PMID: 35189634 PMCID: PMC9155634 DOI: 10.1093/bib/bbac025] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 07/19/2021] [Revised: 01/13/2022] [Accepted: 01/30/2022] [Indexed: 12/26/2022]
Abstract
Changes in protein sequence can have dramatic effects on how proteins fold, their stability and dynamics. Over the last 20 years, pioneering methods have been developed to try to estimate the effects of missense mutations on protein stability, leveraging the growing availability of protein 3D structures. These, however, have been developed and validated using experimentally derived structures and biophysical measurements. A large proportion of protein structures remain to be experimentally elucidated and, while many studies have based their conclusions on predictions made using homology models, there has been no systematic evaluation of the reliability of these tools in the absence of experimental structural data. We have, therefore, systematically investigated the performance and robustness of ten widely used structural methods when presented with homology models built using templates at a range of sequence identity levels (from 15% to 95%) and contrasted performance with sequence-based tools, as a baseline. We found there is indeed performance deterioration on homology models built using templates with sequence identity below 40%, where sequence-based tools might become preferable. This was most marked for mutations in solvent-exposed residues and stabilizing mutations. As structure prediction tools improve, the reliability of these predictors is expected to follow; however, we strongly suggest that these factors should be taken into consideration when interpreting results from structure-based predictors of mutation effects on protein stability.
|
45
|
Pires DEV, Stubbs KA, Mylne JS, Ascher DB. cropCSM: designing safe and potent herbicides with graph-based signatures. Brief Bioinform 2022; 23:6535680. [PMID: 35211724 PMCID: PMC9155605 DOI: 10.1093/bib/bbac042] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/09/2021] [Revised: 01/26/2022] [Accepted: 01/27/2022] [Indexed: 12/11/2022]
Abstract
Herbicides have revolutionised weed management, increased crop yields and improved profitability allowing for an increase in worldwide food security. Their widespread use, however, has also led to a rise in resistance and concerns about their environmental impact. Despite the need for potent and safe herbicidal molecules, no herbicide with a new mode of action has reached the market in 30 years. Although development of computational approaches has proven invaluable to guide rational drug discovery pipelines, leading to higher hit rates and lower attrition due to poor toxicity, little has been done in contrast for herbicide design. To fill this gap, we have developed cropCSM, a computational platform to help identify new, potent, nontoxic and environmentally safe herbicides. By using a knowledge-based approach, we identified physicochemical properties and substructures enriched in safe herbicides. By representing the small molecules as a graph, we leveraged these insights to guide the development of predictive models trained and tested on the largest collected data set of molecules with experimentally characterised herbicidal profiles to date (over 4500 compounds). In addition, we developed six new environmental and human toxicity predictors, spanning five different species to assist in molecule prioritisation. cropCSM was able to correctly identify 97% of herbicides currently available commercially, while predicting toxicity profiles with accuracies of up to 92%. We believe cropCSM will be an essential tool for the enrichment of screening libraries and to guide the development of potent and safe herbicides. We have made the method freely available through a user-friendly webserver at http://biosig.unimelb.edu.au/crop_csm.
|
46
|
Abrusán G, Ascher DB, Inouye M. Known allosteric proteins have central roles in genetic disease. PLoS Comput Biol 2022; 18:e1009806. [PMID: 35139069 PMCID: PMC10138267 DOI: 10.1371/journal.pcbi.1009806] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/30/2021] [Revised: 04/27/2023] [Accepted: 01/05/2022] [Indexed: 12/15/2022]
Abstract
Allostery is a form of protein regulation, where ligands that bind sites located apart from the active site can modify the activity of the protein. The molecular mechanisms of allostery have been extensively studied, because allosteric sites are less conserved than active sites, and drugs targeting them are more specific than drugs binding the active sites. Here we quantify the importance of allostery in genetic disease. We show that 1) known allosteric proteins are central in disease networks, contribute to genetic disease and comorbidities much more than non-allosteric proteins, and there is an association between being allosteric and involvement in disease; 2) they are enriched in many major disease types like hematopoietic diseases, cardiovascular diseases, cancers, diabetes, or diseases of the central nervous system; 3) variants from cancer genome-wide association studies are enriched near allosteric proteins, indicating their importance to polygenic traits; and 4) the importance of allosteric proteins in disease is due, at least partly, to their central positions in protein-protein interaction networks, and less due to their dynamical properties.
|
47
|
Myung Y, Pires DEV, Ascher DB. CSM-AB: graph-based antibody-antigen binding affinity prediction and docking scoring function. Bioinformatics 2022; 38:1141-1143. [PMID: 34734992 DOI: 10.1093/bioinformatics/btab762] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 05/03/2021] [Revised: 10/18/2021] [Accepted: 11/01/2021] [Indexed: 02/03/2023]
Abstract
MOTIVATION Understanding antibody-antigen interactions is key to improving their binding affinities and specificities. While experimental approaches are fundamental for developing new therapeutics, computational methods can provide quick assessment of binding landscapes, guiding experimental design. Despite this, little effort has been devoted to accurately predicting the binding affinity between antibodies and antigens and to developing tailored docking scoring functions for this type of interaction. Here, we developed CSM-AB, a machine learning method capable of predicting antibody-antigen binding affinity by modelling interaction interfaces as graph-based signatures. RESULTS CSM-AB outperformed alternative methods, achieving a Pearson's correlation of up to 0.64 on blind tests. We also show CSM-AB can accurately rank near-native poses, working effectively as a docking scoring function. We believe CSM-AB will be an invaluable tool to assist in the development of new immunotherapies. AVAILABILITY AND IMPLEMENTATION CSM-AB is freely available as a user-friendly web interface and API at http://biosig.unimelb.edu.au/csm_ab/datasets. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
48
|
Karmakar M, Ragonnet R, Ascher DB, Trauer JM, Denholm JT. Estimating tuberculosis drug resistance amplification rates in high-burden settings. BMC Infect Dis 2022; 22:82. [PMID: 35073862 PMCID: PMC8785585 DOI: 10.1186/s12879-022-07067-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2021] [Accepted: 01/11/2022] [Indexed: 11/20/2022]
Abstract
Background Antimicrobial resistance develops following the accrual of mutations in the bacterial genome, and may variably impact organism fitness and hence, transmission risk. Classical representation of tuberculosis (TB) dynamics using a single or two strain (DS/MDR-TB) model typically does not capture elements of this important aspect of TB epidemiology. To understand and estimate the likelihood of resistance spreading in high drug-resistant TB incidence settings, we used epidemiological data to develop a mathematical model of Mycobacterium tuberculosis (Mtb) transmission. Methods A four-strain (drug-susceptible (DS), isoniazid mono-resistant (INH-R), rifampicin mono-resistant (RIF-R) and multidrug-resistant (MDR)) compartmental deterministic Mtb transmission model was developed to explore the progression from DS- to MDR-TB in The Philippines and Viet Nam. The models were calibrated using data from national tuberculosis prevalence (NTP) surveys and drug resistance surveys (DRS). An adaptive Metropolis algorithm was used to estimate the risks of drug resistance amplification among unsuccessfully treated individuals. Results The estimated proportion of INH-R amplification among failing treatments was 0.84 (95% CI 0.79–0.89) for The Philippines and 0.77 (95% CI 0.71–0.84) for Viet Nam. The proportion of RIF-R amplification among failing treatments was 0.05 (95% CI 0.04–0.07) for The Philippines and 0.011 (95% CI 0.010–0.012) for Viet Nam. Conclusion The risk of resistance amplification due to treatment failure was dramatically higher for INH than for RIF. We observed that RIF-R strains were more likely to be transmitted than acquired through amplification, while both mechanisms of acquisition were important contributors in the case of INH-R. These findings highlight the complexity of drug resistance dynamics in high-incidence settings, and emphasize the importance of prioritizing testing algorithms that allow for early detection of INH-R.
Supplementary Information The online version contains supplementary material available at 10.1186/s12879-022-07067-1.
|
49
|
Karmakar M, Cicaloni V, Rodrigues CH, Spiga O, Santucci A, Ascher DB. HGDiscovery: An online tool providing functional and phenotypic information on novel variants of homogentisate 1,2- dioxigenase. Curr Res Struct Biol 2022; 4:271-277. [PMID: 36118553 PMCID: PMC9471331 DOI: 10.1016/j.crstbi.2022.08.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/29/2021] [Revised: 07/28/2022] [Accepted: 08/23/2022] [Indexed: 11/28/2022]
Abstract
Alkaptonuria (AKU), a rare genetic disorder, is characterized by the accumulation of homogentisic acid (HGA) in the body. Affected individuals lack functional levels of an enzyme required to break down HGA. Mutations in the homogentisate 1,2-dioxygenase (HGD) gene cause AKU and they are responsible for deficient levels of functional HGD, which, in turn, leads to excess levels of HGA. Although HGA is rapidly cleared from the body by the kidneys, in the long term it starts accumulating in various tissues, especially cartilage. Over time (rarely before adulthood), it eventually changes the color of affected tissue to slate blue or black. Here we report a comprehensive mutation analysis of 111 pathogenic and 190 non-pathogenic HGD missense mutations using protein structural information. Using our comprehensive suite of graph-based signature methods, mCSM, complemented with sequence-based tools, we studied the functional and molecular consequences of each mutation on protein stability, interaction and evolutionary conservation. The scores generated from the structure and sequence-based tools were used to train a supervised machine learning algorithm with 89% accuracy. The empirical classifier was used to generate the variant phenotype for novel HGD missense mutations. All this information is deployed as a user-friendly, freely available web server called HGDiscovery (https://biosig.lab.uq.edu.au/hgdiscovery/). Functional and phenotypic consequences of HGD non-synonymous variations. Biophysical, structural and evolutionary analysis of novel and known clinical variants. Pathogenic mutations affected protein stability and conformational flexibility. Pathogenic mutations associated with deleterious scores for sequence-based features. HGDiscovery (http://biosig.unimelb.edu.au/hgdiscovery/) – webserver.
|
50
|
Nguyen TB, Pires DEV, Ascher DB. CSM-carbohydrate: protein-carbohydrate binding affinity prediction and docking scoring function. Brief Bioinform 2021; 23:6457169. [PMID: 34882232 DOI: 10.1093/bib/bbab512] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/23/2021] [Revised: 11/06/2021] [Accepted: 11/08/2021] [Indexed: 12/29/2022]
Abstract
Protein-carbohydrate interactions are crucial for many cellular processes but can be challenging to biologically characterise. To improve our understanding and ability to model these molecular interactions, we used a carefully curated set of 370 protein-carbohydrate complexes with experimental structural and biophysical data in order to train and validate a new tool, cutoff scanning matrix (CSM)-carbohydrate, using machine learning algorithms to accurately predict their binding affinity and rank docking poses as a scoring function. Information on both protein and carbohydrate complementarity, in terms of shape and chemistry, was captured using graph-based structural signatures. Across both training and independent test sets, we achieved comparable Pearson's correlations of 0.72 under cross-validation [root mean square error (RMSE) of 1.58 kcal/mol] and 0.67 on the independent test (RMSE of 1.72 kcal/mol), providing confidence in the generalisability and robustness of the final model. Similar performance was obtained across mono-, di- and oligosaccharides, further highlighting the applicability of this approach to the study of larger complexes. We show CSM-carbohydrate significantly outperformed previous approaches and have implemented our method and make all data freely available through both a user-friendly web interface and application programming interface, to facilitate programmatic access at http://biosig.unimelb.edu.au/csm_carbohydrate/. We believe CSM-carbohydrate will be an invaluable tool for helping assess docking poses and the effects of mutations on protein-carbohydrate affinity, unravelling important aspects that drive binding recognition.
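The abstract above summarises regression performance with Pearson's correlation and RMSE. As a minimal sketch of how these two metrics are computed from predicted and experimental values, the functions below use only the standard library. The affinity values are made up for illustration and do not come from the CSM-carbohydrate data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def rmse(predicted, observed):
    """Root mean square error, in the same units as the inputs."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


if __name__ == "__main__":
    # Hypothetical binding free energies in kcal/mol, for illustration only.
    observed = [-4.1, -5.3, -6.0, -7.2, -8.5]
    predicted = [-4.8, -5.0, -6.9, -6.5, -7.9]
    print(f"Pearson r: {pearson_r(predicted, observed):.2f}")
    print(f"RMSE:      {rmse(predicted, observed):.2f} kcal/mol")
```

The two metrics are complementary: correlation measures how well the ranking and trend are captured, while RMSE measures the typical size of the error in absolute units.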
|