1
Maier BD, Petursson B, Lussana A, Petsalaki E. Data-driven extraction of human kinase-substrate relationships from omics datasets. Mol Cell Proteomics 2025:100994. [PMID: 40381888 DOI: 10.1016/j.mcpro.2025.100994] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2024] [Revised: 05/01/2025] [Accepted: 05/09/2025] [Indexed: 05/20/2025]
Abstract
Phosphorylation forms an important part of the signalling system that cells use for decision making and for regulating processes such as cell division and differentiation. In humans, >90% of identified phosphosites have no annotated upstream kinase. At the same time, around 30% of kinases (as annotated in UniProt) have no known target. This knowledge gap underscores the need for large-scale, data-driven computational predictions. In this study, we created a machine learning-based model to derive a probabilistic kinase-substrate network from omics datasets. Our methodology performs better than other state-of-the-art kinase-substrate prediction methods and provides predictions for more kinases. Importantly, it better captures new experimentally identified kinase-substrate relationships. It can therefore improve the prioritisation of kinase-substrate pairs for illuminating the dark human cell signalling space. Our model is integrated into a web server, SELPHI2.0, to allow unbiased analysis of phosphoproteomics data, facilitating the design of downstream experiments to uncover mechanisms of signal transduction across conditions and cellular contexts.
Affiliation(s)
- Benjamin Dominik Maier
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, United Kingdom
- Borgthor Petursson
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, United Kingdom
- Alessandro Lussana
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, United Kingdom
- Evangelia Petsalaki
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, United Kingdom.
2
Heerah S, Molinari R, Guerrier S, Marshall-Colon A. Granger-causal testing for irregularly sampled time series with application to nitrogen signalling in Arabidopsis. Bioinformatics 2021; 37:2450-2460. [PMID: 33693548 DOI: 10.1101/2020.06.15.152819] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/30/2020] [Revised: 02/18/2021] [Accepted: 03/03/2021] [Indexed: 05/27/2023]
Abstract
MOTIVATION: Identification of system-wide causal relationships can contribute to our understanding of long-distance, intercellular signalling in biological organisms. Dynamic transcriptome analysis holds great potential to uncover coordinated biological processes between organs. However, many existing dynamic transcriptome studies are characterized by sparse and often unevenly spaced time points that make the identification of causal relationships across organs analytically challenging. Application of existing statistical models, designed for regular time series with abundant time points, to sparse data may fail to reveal biologically significant, causal relationships. With increasing research interest in biological time series data, there is a need for new statistical methods that are able to determine causality within and between time series data sets. Here, a statistical framework was developed to identify (Granger) causal gene-gene relationships of unevenly spaced, multivariate time series data from two different tissues of Arabidopsis thaliana in response to a nitrogen signal.
RESULTS: This work delivers a statistical approach for modelling irregularly sampled bivariate signals. It embeds functions from the domain of engineering that allow the model's dependence structure to be adapted to the specific sampling time. Using maximum likelihood to estimate the parameters of this model for each bivariate time series, it is then possible to use bootstrap procedures for small samples (or asymptotics for large samples) to test for Granger causality. When applied to the A. thaliana data, the proposed approach produced 3078 significant interactions, of which 2012 have root causal genes and 1066 have shoot causal genes. Many of the predicted causal and target genes are known players in local and long-distance nitrogen signalling, including genes encoding transcription factors, hormones and signalling peptides. Of the 1007 total causal genes (either organ), 384 are either known or predicted mobile transcripts, suggesting that the identified causal genes may be directly involved in long-distance nitrogen signalling through intercellular interactions. The model predictions and subsequent network analysis identified nitrogen-responsive genes that can be further tested for their specific roles in long-distance nitrogen signalling.
AVAILABILITY AND IMPLEMENTATION: The method was developed with the R statistical software and is made available through the R package 'irg' hosted on the GitHub repository https://github.com/SMAC-Group/irg, where a running example vignette can also be found (https://smac-group.github.io/irg/articles/vignette.html). A few signals from the original data set are made available in the package as an example to apply the method, and the complete A. thaliana data can be found at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE97500.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
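To make the idea of a Granger-causality test concrete, the following is a minimal sketch of the classical version for regularly sampled series. It is not the irregular-sampling model of the paper or the API of the 'irg' package: it assumes evenly spaced observations and a plain least-squares autoregression, and the names `lagged_design` and `granger_f_statistic` are hypothetical. The statistic compares how well the effect series is predicted from its own past alone versus its own past plus the past of the candidate cause.

```python
import numpy as np


def lagged_design(series, lags):
    """Columns series[t-1], ..., series[t-lags] for t = lags .. n-1."""
    n = len(series)
    return np.column_stack([series[lags - k : n - k] for k in range(1, lags + 1)])


def granger_f_statistic(cause, effect, lags=1):
    """F statistic for 'cause Granger-causes effect' (regular sampling assumed)."""
    cause = np.asarray(cause, dtype=float)
    effect = np.asarray(effect, dtype=float)
    target = effect[lags:]
    intercept = np.ones((len(target), 1))

    # Restricted model: the effect is explained by its own past only.
    restricted = np.hstack([intercept, lagged_design(effect, lags)])
    # Full model: additionally uses the past of the candidate cause.
    full = np.hstack([restricted, lagged_design(cause, lags)])

    def residual_sum_of_squares(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_restricted = residual_sum_of_squares(restricted)
    rss_full = residual_sum_of_squares(full)
    dof = len(target) - full.shape[1]
    return ((rss_restricted - rss_full) / lags) / (rss_full / dof)


# Toy data (simulated, not from the study): y follows x with a one-step delay.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.zeros(200)
y[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=199)
print(granger_f_statistic(x, y), granger_f_statistic(y, x))
```

A large statistic in one direction and a small one in the other is the pattern a directional test looks for. With sparse, unevenly spaced time points, as the abstract notes, this regular-lag construction is not appropriate, which is the gap the paper's framework addresses.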
Affiliation(s)
- Sachin Heerah
  - Department of Plant Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Roberto Molinari
  - Department of Mathematics and Statistics, Auburn University, Auburn, AL 36849, USA
- Stéphane Guerrier
  - Faculty of Science & Geneva School of Economics and Management, University of Geneva, Geneva 1205, Switzerland
- Amy Marshall-Colon
  - Department of Plant Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
3
Heerah S, Molinari R, Guerrier S, Marshall-Colon A. Granger-Causal Testing for Irregularly Sampled Time Series with Application to Nitrogen Signaling in Arabidopsis. Bioinformatics 2021; 37:2450-2460. [PMID: 33693548 PMCID: PMC8388030 DOI: 10.1093/bioinformatics/btab126] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/30/2020] [Revised: 02/18/2021] [Accepted: 03/03/2021] [Indexed: 12/05/2022]
Abstract
Motivation: Identification of system-wide causal relationships can contribute to our understanding of long-distance, intercellular signalling in biological organisms. Dynamic transcriptome analysis holds great potential to uncover coordinated biological processes between organs. However, many existing dynamic transcriptome studies are characterized by sparse and often unevenly spaced time points that make the identification of causal relationships across organs analytically challenging. Application of existing statistical models, designed for regular time series with abundant time points, to sparse data may fail to reveal biologically significant, causal relationships. With increasing research interest in biological time series data, there is a need for new statistical methods that are able to determine causality within and between time series data sets. Here, a statistical framework was developed to identify (Granger) causal gene-gene relationships of unevenly spaced, multivariate time series data from two different tissues of Arabidopsis thaliana in response to a nitrogen signal.
Results: This work delivers a statistical approach for modelling irregularly sampled bivariate signals. It embeds functions from the domain of engineering that allow the model’s dependence structure to be adapted to the specific sampling time. Using maximum likelihood to estimate the parameters of this model for each bivariate time series, it is then possible to use bootstrap procedures for small samples (or asymptotics for large samples) to test for Granger causality. When applied to the A. thaliana data, the proposed approach produced 3078 significant interactions, of which 2012 have root causal genes and 1066 have shoot causal genes. Many of the predicted causal and target genes are known players in local and long-distance nitrogen signalling, including genes encoding transcription factors, hormones and signalling peptides. Of the 1007 total causal genes (either organ), 384 are either known or predicted mobile transcripts, suggesting that the identified causal genes may be directly involved in long-distance nitrogen signalling through intercellular interactions. The model predictions and subsequent network analysis identified nitrogen-responsive genes that can be further tested for their specific roles in long-distance nitrogen signalling.
Availability and implementation: The method was developed with the R statistical software and is made available through the R package ‘irg’ hosted on the GitHub repository https://github.com/SMAC-Group/irg, where a running example vignette can also be found (https://smac-group.github.io/irg/articles/vignette.html). A few signals from the original data set are made available in the package as an example to apply the method, and the complete A. thaliana data can be found at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE97500.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sachin Heerah
  - Department of Plant Biology, University of Illinois Urbana-Champaign, Urbana, IL, USA
- Roberto Molinari
  - Department of Mathematics and Statistics, Auburn University, Auburn, AL, USA
- Stéphane Guerrier
  - Faculty of Science & Geneva School of Economics and Management, University of Geneva, Geneva, Switzerland
- Amy Marshall-Colon
  - Department of Plant Biology, University of Illinois Urbana-Champaign, Urbana, IL, USA
4
Abstract
MOTIVATION: Cells regulate themselves via dizzyingly complex biochemical processes called signaling pathways. These are usually depicted as a network, where nodes represent proteins and edges indicate their influence on each other. In order to understand diseases and therapies at the cellular level, it is crucial to have an accurate understanding of the signaling pathways at work. Since signaling pathways can be modified by disease, the ability to infer signaling pathways from condition- or patient-specific data is highly valuable. A variety of techniques exist for inferring signaling pathways. We build on past works that formulate signaling pathway inference as a Dynamic Bayesian Network structure estimation problem on phosphoproteomic time course data. We take a Bayesian approach, using Markov Chain Monte Carlo to estimate a posterior distribution over possible Dynamic Bayesian Network structures. Our primary contributions are (i) a novel proposal distribution that efficiently samples sparse graphs and (ii) the relaxation of common restrictive modeling assumptions.
RESULTS: We implement our method, named Sparse Signaling Pathway Sampling, in Julia using the Gen probabilistic programming language. Probabilistic programming is a powerful methodology for building statistical models. The resulting code is modular, extensible and legible. The Gen language, in particular, allows us to customize our inference procedure for biological graphs and ensure efficient sampling. We evaluate our algorithm on simulated data and the HPN-DREAM pathway reconstruction challenge, comparing our performance against a variety of baseline methods. Our results demonstrate the vast potential for probabilistic programming, and Gen specifically, for biological network inference.
AVAILABILITY AND IMPLEMENTATION: Find the full codebase at https://github.com/gitter-lab/ssps.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- David Merrell
  - Department of Computer Sciences, University of Wisconsin–Madison, Madison, WI 53706, USA
  - Morgridge Institute for Research, Madison, WI 53715, USA
- Anthony Gitter
  - Department of Computer Sciences, University of Wisconsin–Madison, Madison, WI 53706, USA
  - Morgridge Institute for Research, Madison, WI 53715, USA
  - Department of Biostatistics and Medical Informatics, University of Wisconsin–Madison, Madison, WI 53726, USA
5
Mercatelli D, Scalambra L, Triboli L, Ray F, Giorgi FM. Gene regulatory network inference resources: A practical overview. Biochim Biophys Acta Gene Regul Mech 2019; 1863:194430. [PMID: 31678629 DOI: 10.1016/j.bbagrm.2019.194430] [Citation(s) in RCA: 75] [Impact Index Per Article: 12.5] [Received: 05/31/2019] [Revised: 09/06/2019] [Accepted: 09/09/2019] [Indexed: 02/08/2023]
Abstract
Transcriptional regulation is a fundamental molecular mechanism involved in almost every aspect of life, from homeostasis to development, from metabolism to behavior, from reaction to stimuli to disease progression. In recent years, the concept of Gene Regulatory Networks (GRNs) has grown popular as an effective applied biology approach for describing the complex and highly dynamic set of transcriptional interactions, due to its easy-to-interpret features. Since cataloguing, predicting and understanding every GRN connection in all species and cellular contexts remains a great challenge for biology, researchers have developed numerous tools and methods to infer regulatory processes. In this review, we catalogue these methods in six major areas, based on the dominant underlying information leveraged to infer GRNs: Coexpression, Sequence Motifs, Chromatin Immunoprecipitation (ChIP), Orthology, Literature and Protein-Protein Interaction (PPI) specifically focused on transcriptional complexes. The methods described here cover a wide range of user-friendliness: from web tools that require no prior computational expertise to command line programs and algorithms for large scale GRN inferences. Each method for GRN inference described herein effectively illustrates a type of transcriptional relationship, with many methods being complementary to others. While a truly holistic approach for inferring and displaying GRNs remains one of the greatest challenges in the field of systems biology, we believe that the integration of multiple methods described herein provides an effective means with which experimental and computational biologists alike may obtain the most complete pictures of transcriptional relationships. This article is part of a Special Issue entitled: Transcriptional Profiles and Regulatory Gene Networks edited by Dr. Federico Manuel Giorgi and Dr. Shaun Mahony.
Affiliation(s)
- Daniele Mercatelli
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Laura Scalambra
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Luca Triboli
  - Centre for Integrative Biology (CIBIO), University of Trento, Italy
- Forest Ray
  - Department of Systems Biology, Columbia University Medical Center, New York, NY, United States
- Federico M Giorgi
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy.
6
Glymour C, Zhang K, Spirtes P. Review of Causal Discovery Methods Based on Graphical Models. Front Genet 2019; 10:524. [PMID: 31214249 PMCID: PMC6558187 DOI: 10.3389/fgene.2019.00524] [Citation(s) in RCA: 167] [Impact Index Per Article: 27.8] [Received: 08/07/2018] [Accepted: 05/13/2019] [Indexed: 12/11/2022]
Abstract
A fundamental task in various disciplines of science, including biology, is to find underlying causal relations and make use of them. Causal relations can be seen if interventions are properly applied; however, in many cases interventions are difficult or even impossible to conduct. It is then necessary to discover causal relations by analyzing statistical properties of purely observational data, which is known as causal discovery or causal structure search. This paper aims to give an introduction to and a brief review of the computational methods for causal discovery that were developed in the past three decades, including constraint-based and score-based methods and those based on functional causal models, supplemented by some illustrations and applications.
Affiliation(s)
- Clark Glymour
  - Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA, United States
- Kun Zhang
  - Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA, United States
- Peter Spirtes
  - Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA, United States
7
Köksal AS, Beck K, Cronin DR, McKenna A, Camp ND, Srivastava S, MacGilvray ME, Bodík R, Wolf-Yadlin A, Fraenkel E, Fisher J, Gitter A. Synthesizing Signaling Pathways from Temporal Phosphoproteomic Data. Cell Rep 2018; 24:3607-3618. [PMID: 30257219 PMCID: PMC6295338 DOI: 10.1016/j.celrep.2018.08.085] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 07/18/2016] [Revised: 04/16/2018] [Accepted: 08/29/2018] [Indexed: 12/25/2022]
Abstract
We present a method for automatically discovering signaling pathways from time-resolved phosphoproteomic data. The Temporal Pathway Synthesizer (TPS) algorithm uses constraint-solving techniques first developed in the context of formal verification to explore paths in an interaction network. It systematically eliminates all candidate structures for a signaling pathway where a protein is activated or inactivated before its upstream regulators. The algorithm can model more than one hundred thousand dynamic phosphosites and can discover pathway members that are not differentially phosphorylated. By analyzing temporal data, TPS defines signaling cascades without needing to experimentally perturb individual proteins. It recovers known pathways and proposes pathway connections when applied to the human epidermal growth factor and yeast osmotic stress responses. Independent kinase mutant studies validate predicted substrates in the TPS osmotic stress pathway.
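The temporal rule described in this abstract, that a protein should not change state before its upstream regulators, can be illustrated with a small filter over candidate edges. This is only a sketch under stated assumptions: TPS itself uses constraint solving over whole pathway structures, whereas the hypothetical function `prune_by_timing` below checks one directed edge at a time, and the protein labels and activation times are made up for illustration. Proteins with no measured timing are left unconstrained, in the spirit of the abstract's remark that pathway members need not be differentially phosphorylated.

```python
def prune_by_timing(candidate_edges, activation_time):
    """Drop directed edges whose downstream protein changes before the upstream one.

    candidate_edges: iterable of (upstream, downstream) pairs.
    activation_time: dict mapping a protein to the first time point at which it
        is observed to change; proteins missing from the dict are unconstrained.
    """
    kept = []
    for upstream, downstream in candidate_edges:
        t_up = activation_time.get(upstream)
        t_down = activation_time.get(downstream)
        # Without timing for either endpoint there is nothing to contradict.
        if t_up is None or t_down is None or t_up <= t_down:
            kept.append((upstream, downstream))
    return kept


# Hypothetical proteins A-D with made-up activation times (minutes).
edges = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D")]
times = {"A": 2, "B": 5, "C": 1}
print(prune_by_timing(edges, times))
```

Here the edges B to A and A to C are discarded because the downstream protein changes first, while C to D survives because D has no timing information.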
Affiliation(s)
- Ali Sinan Köksal
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
- Kirsten Beck
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Dylan R Cronin
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Biological Sciences, Bowling Green State University, Bowling Green, OH, USA
- Aaron McKenna
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Nathan D Camp
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Saurabh Srivastava
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
- Rastislav Bodík
  - Paul G. Allen Center for Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Ernest Fraenkel
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Jasmin Fisher
  - Microsoft Research, Cambridge, UK
  - Department of Biochemistry, University of Cambridge, Cambridge, UK
- Anthony Gitter
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
  - Morgridge Institute for Research, Madison, WI, USA.
8
Abstract
The copycatLayout app is a network-based visual differential analysis tool that improves upon the existing layoutSaver app and is delivered pre-installed with Cytoscape, beginning with v3.6.0. LayoutSaver cloned a network layout by mapping node locations from one network to another based on node attribute values, but failed to clone view scale and location, and provided no means of identifying which nodes were successfully mapped between networks. Copycat addresses these issues and provides additional layout options. With the advent of Cytoscape Automation (packaged in Cytoscape v3.6.0), researchers can utilize the Copycat layout and its output in workflows written in their language of choice by using only a few simple REST calls. Copycat enables researchers to visually compare groups of homologous genes, generate network comparison images for publications, and quickly identify differences between similar networks at a glance without leaving their script. With a few extra REST calls, scripts can discover nodes present in one network but not in the other, which can feed into more complex analyses (e.g., modifying mismatched nodes based on new data, then re-running the layout to highlight additional network changes).
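The core operation described here, copying node positions from one network to another by matching a node attribute and reporting which nodes could not be matched, can be sketched without Cytoscape. The code below is an illustration only: it uses plain Python dictionaries rather than the Cytoscape or CyREST API, and the function name `copy_layout`, the `name` attribute, and the `x`/`y` keys are assumptions made for the example.

```python
def copy_layout(source_nodes, target_nodes, key="name"):
    """Copy (x, y) positions from source to target nodes that share `key`.

    Returns (mapped, unmapped): target nodes that received a position, and
    target nodes with no matching source node.
    """
    positions = {
        node[key]: (node["x"], node["y"]) for node in source_nodes if key in node
    }
    mapped, unmapped = [], []
    for node in target_nodes:
        position = positions.get(node.get(key))
        if position is None:
            unmapped.append(node)
        else:
            # Leave the input untouched; return an updated copy instead.
            mapped.append(dict(node, x=position[0], y=position[1]))
    return mapped, unmapped


# Hypothetical networks sharing one gene name.
source = [{"name": "g1", "x": 0.0, "y": 1.0}, {"name": "g2", "x": 2.0, "y": 3.0}]
target = [{"name": "g2"}, {"name": "g3"}]
mapped, unmapped = copy_layout(source, target)
print(mapped, unmapped)
```

Returning the unmapped nodes separately mirrors the abstract's point that scripts can discover nodes present in one network but not the other and feed them into further analysis.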
Affiliation(s)
- Brett Settle
  - Department of Medicine, University of California, San Diego, California, 92093-0688, USA
- David Otasek
  - Department of Medicine, University of California, San Diego, California, 92093-0688, USA
- John H Morris
  - University of California San Francisco, San Francisco, California, 94143, USA
- Barry Demchak
  - Department of Medicine, University of California, San Diego, California, 92093-0688, USA