1. Gogoshin G, Rodin AS. Minimum uncertainty as Bayesian network model selection principle. BMC Bioinformatics 2025; 26:100. [PMID: 40200184 PMCID: PMC11980298 DOI: 10.1186/s12859-025-06104-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2024] [Accepted: 03/05/2025] [Indexed: 04/10/2025] Open
Abstract
BACKGROUND: Bayesian Network (BN) modeling is a prominent methodology in computational systems biology. However, the incommensurability of datasets frequently encountered in life science domains gives rise to contextual dependence and numerical irregularities in the behavior of model selection criteria (such as MDL, Minimum Description Length) used in BN reconstruction. This renders model features, first and foremost dependency strengths, incomparable and difficult to interpret. In this study, we derive and evaluate a model selection principle that addresses these problems.
RESULTS: The objective of the study is attained by (i) approaching model evaluation as a misspecification problem, (ii) estimating the effect that sampling error has on the satisfiability of the conditional independence criterion, as reflected by Mutual Information, and (iii) utilizing this error estimate to penalize uncertainty with the novel Minimum Uncertainty (MU) model selection principle. We validate our findings numerically and demonstrate the performance advantages of the MU criterion. Finally, we illustrate the advantages of the new model evaluation framework on real data examples.
CONCLUSIONS: The new BN model selection principle successfully overcomes performance irregularities observed with MDL, offers a superior average convergence rate in BN reconstruction, and improves the interpretability and universality of resulting BNs, thus enabling direct inter-BN comparisons and evaluations.
Affiliation(s)
- Grigoriy Gogoshin
  - Department of Computational and Quantitative Medicine, Beckman Research Institute, and Diabetes and Metabolism Research Institute, City of Hope National Medical Center, 1500 East Duarte Road, Duarte, CA, 91010, USA
- Andrei S Rodin
  - Department of Computational and Quantitative Medicine, Beckman Research Institute, and Diabetes and Metabolism Research Institute, City of Hope National Medical Center, 1500 East Duarte Road, Duarte, CA, 91010, USA
2. Manookian B, Mukhaleva E, Gogoshin G, Bhattacharya S, Sivaramakrishnan S, Vaidehi N, Rodin AS, Branciamore S. Temporally Resolved and Interpretable Machine Learning Model of GPCR conformational transition. bioRxiv 2025:2025.03.17.643765. [PMID: 40166135 PMCID: PMC11957019 DOI: 10.1101/2025.03.17.643765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
Abstract
Identifying target-specific drugs remains a challenge in pharmacology, especially for highly homologous proteins such as dopamine receptors D2R and D3R. Differences in target-specific cryptic druggable sites for such receptors arise from the distinct conformational ensembles underlying their dynamic behavior. While Molecular Dynamics (MD) simulations have emerged as a powerful tool for dissecting protein dynamics, the sheer volume of MD data requires scalable and unbiased data analysis strategies to pinpoint residue communities regulating conformational state ensembles. We have developed the Dynamically Resolved Universal Model for BayEsiAn network Tracking (DRUMBEAT) interpretable machine learning algorithm and validated it by identifying residue communities that enable the deactivation of the β2-adrenergic receptor. Further, upon analyzing dopamine receptor dynamics, we identified distinct and non-conserved residue communities around the contacts F170^4.62_F172^ECL2 and S146^4.38_G141^34.56 that are specific to D3R conformational transitions compared to D2R. This information can be tapped to design subtype-specific drugs for neuropsychiatric and substance use disorders.
Affiliation(s)
- Babgen Manookian
  - Department of Computational and Quantitative Medicine, Beckman Research Institute of the City of Hope; Duarte, CA, USA
- Elizaveta Mukhaleva
  - Department of Computational and Quantitative Medicine, Beckman Research Institute of the City of Hope; Duarte, CA, USA
  - Irell and Manella Graduate School of Biological Sciences, Beckman Research Institute of the City of Hope; Duarte, CA, USA
- Grigoriy Gogoshin
  - Department of Computational and Quantitative Medicine, Beckman Research Institute of the City of Hope; Duarte, CA, USA
- Supriyo Bhattacharya
  - Department of Computational and Quantitative Medicine, Beckman Research Institute of the City of Hope; Duarte, CA, USA
- Sivaraj Sivaramakrishnan
  - Department of Genetics, Cell and Developmental Biology, University of Minnesota; Minneapolis, MN, USA
- Nagarajan Vaidehi
  - Department of Computational and Quantitative Medicine, Beckman Research Institute of the City of Hope; Duarte, CA, USA
  - Irell and Manella Graduate School of Biological Sciences, Beckman Research Institute of the City of Hope; Duarte, CA, USA
- Andrei S. Rodin
  - Department of Computational and Quantitative Medicine, Beckman Research Institute of the City of Hope; Duarte, CA, USA
  - Irell and Manella Graduate School of Biological Sciences, Beckman Research Institute of the City of Hope; Duarte, CA, USA
- Sergio Branciamore
  - Department of Computational and Quantitative Medicine, Beckman Research Institute of the City of Hope; Duarte, CA, USA
  - Irell and Manella Graduate School of Biological Sciences, Beckman Research Institute of the City of Hope; Duarte, CA, USA
3. Wang S, Hu H, Li X. A systematic study of motif pairs that may facilitate enhancer-promoter interactions. J Integr Bioinform 2022; 19:jib-2021-0038. [PMID: 35130376 PMCID: PMC9069648 DOI: 10.1515/jib-2021-0038] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/15/2021] [Accepted: 01/20/2022] [Indexed: 01/06/2023] Open
Abstract
Pairs of interacting transcription factors (TFs) have previously been shown to bind to enhancers and promoters and contribute to their physical interactions. However, to date, we have limited knowledge about such TF pairs. To fill this void, we systematically studied the co-occurrence of TF-binding motifs in interacting enhancer-promoter (EP) pairs in seven human cell lines. We discovered 423 motif pairs that significantly co-occur in enhancers and promoters of interacting EP pairs. We demonstrated that these motif pairs are biologically meaningful and significantly enriched with motif pairs of known interacting TF pairs. We also showed that the identified motif pairs facilitated the discovery of the interacting EP pairs. The developed pipeline, EPmotifPair, together with the predicted motifs and motif pairs, is available at https://doi.org/10.6084/m9.figshare.14192000. Our study provides a comprehensive list of motif pairs that may contribute to EP physical interactions, which facilitate generating meaningful hypotheses for experimental validation.
Affiliation(s)
- Saidi Wang
  - Department of Computer Science, University of Central Florida, Orlando, FL, 32816, USA
- Haiyan Hu
  - Department of Computer Science, University of Central Florida, Orlando, FL, 32816, USA
- Xiaoman Li
  - Burnett School of Biomedical Science, College of Medicine, University of Central Florida, Orlando, FL, 32816, USA
4. Gogoshin G, Branciamore S, Rodin AS. Synthetic data generation with probabilistic Bayesian Networks. Math Biosci Eng 2021; 18:8603-8621. [PMID: 34814315 PMCID: PMC8848551 DOI: 10.3934/mbe.2021426] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 05/13/2023]
Abstract
Bayesian Network (BN) modeling is a prominent and increasingly popular computational systems biology method. It aims to construct network graphs from large heterogeneous biological datasets that reflect the underlying biological relationships. Currently, a variety of strategies exist for evaluating BN methodology performance, ranging from utilizing artificial benchmark datasets and models, to specialized biological benchmark datasets, to simulation studies that generate synthetic data from predefined network models. The last is arguably the most comprehensive approach; however, existing implementations often rely on explicit and implicit assumptions that may be unrealistic in a typical biological data analysis scenario, or are poorly equipped for automated arbitrary model generation. In this study, we develop a purely probabilistic simulation framework that addresses the demands of statistically sound simulation studies in an unbiased fashion. Additionally, we expand on our current understanding of the theoretical notions of causality and dependence / conditional independence in BNs and the Markov Blankets within.
Affiliation(s)
- Grigoriy Gogoshin
  - Department of Computational and Quantitative Medicine, Beckman Research Institute, and Diabetes and Metabolism Research Institute, City of Hope National Medical Center, 1500 East Duarte Road, Duarte, CA 91010 USA
- Sergio Branciamore
  - Department of Computational and Quantitative Medicine, Beckman Research Institute, and Diabetes and Metabolism Research Institute, City of Hope National Medical Center, 1500 East Duarte Road, Duarte, CA 91010 USA
- Andrei S. Rodin
  - Department of Computational and Quantitative Medicine, Beckman Research Institute, and Diabetes and Metabolism Research Institute, City of Hope National Medical Center, 1500 East Duarte Road, Duarte, CA 91010 USA
5. Nobile MS, Cazzaniga P, Ramazzotti D. Investigating the performance of multi-objective optimization when learning Bayesian Networks. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.07.049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
6. Wang X, Branciamore S, Gogoshin G, Ding S, Rodin AS. New Analysis Framework Incorporating Mixed Mutual Information and Scalable Bayesian Networks for Multimodal High Dimensional Genomic and Epigenomic Cancer Data. Front Genet 2020; 11:648. [PMID: 32625238 PMCID: PMC7314938 DOI: 10.3389/fgene.2020.00648] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 02/27/2019] [Accepted: 05/28/2020] [Indexed: 12/14/2022] Open
Abstract
We propose a novel two-stage analysis strategy to discover candidate genes associated with particular cancer outcomes in large multimodal genomic cancer databases, such as The Cancer Genome Atlas (TCGA). During the first stage, we use mixed mutual information to perform variable selection; during the second stage, we use scalable Bayesian network (BN) modeling to identify candidate genes and their interactions. Two crucial features of the proposed approach are (i) the ability to handle mixed data types (continuous and discrete, genomic, epigenomic, etc.) and (ii) a flexible boundary between the variable selection and network modeling stages, which can be adjusted in accordance with the investigators' BN software scalability and hardware implementation. These two aspects result in high generalizability of the proposed analytical framework. We apply the above strategy to three different TCGA datasets (LGG, Brain Lower Grade Glioma; HNSC, Head and Neck Squamous Cell Carcinoma; STES, Stomach and Esophageal Carcinoma), linking multimodal molecular information (SNPs, mRNA expression, DNA methylation) to two clinical outcome variables (tumor status and patient survival). We identify 11 candidate genes, of which 6 have already been directly implicated in the cancer literature. One novel LGG prognostic factor suggested by our analysis, methylation of TMPRSS11F type II transmembrane serine protease, presents an intriguing direction for follow-up studies.
Affiliation(s)
- Xichun Wang
  - Department of Computational and Quantitative Medicine, Beckman Research Institute and Diabetes and Metabolism Research Institute of the City of Hope, Duarte, CA, United States
- Sergio Branciamore
  - Department of Computational and Quantitative Medicine, Beckman Research Institute and Diabetes and Metabolism Research Institute of the City of Hope, Duarte, CA, United States
- Grigoriy Gogoshin
  - Department of Computational and Quantitative Medicine, Beckman Research Institute and Diabetes and Metabolism Research Institute of the City of Hope, Duarte, CA, United States
- Shuyu Ding
  - Department of Computational and Quantitative Medicine, Beckman Research Institute and Diabetes and Metabolism Research Institute of the City of Hope, Duarte, CA, United States
- Andrei S Rodin
  - Department of Computational and Quantitative Medicine, Beckman Research Institute and Diabetes and Metabolism Research Institute of the City of Hope, Duarte, CA, United States
7. Jabbari K, Chakraborty M, Wiehe T. DNA sequence-dependent chromatin architecture and nuclear hubs formation. Sci Rep 2019; 9:14646. [PMID: 31601866 PMCID: PMC6787200 DOI: 10.1038/s41598-019-51036-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 06/20/2019] [Accepted: 09/18/2019] [Indexed: 02/08/2023] Open
Abstract
In this study, by exploring chromatin conformation capture data, we show that DNA sequence composition contributes to the nuclear segregation of Topologically Associated Domains (TADs). GC-peaks and valleys of TADs strongly influence interchromosomal interactions and chromatin 3D structure. To gain insight into the compositional and functional constraints associated with chromatin interactions and TAD formation, we analysed intra-TAD and intra-loop GC variations. This led to the identification of clear GC-gradients, along which the density of genes, super-enhancers, transcriptional activity, and CTCF binding site occupancy co-vary non-randomly. Further, the analysis of DNA base composition of nucleolar aggregates and nuclear speckles showed strong sequence-dependent effects. We conjecture that dynamic DNA binding affinity and flexibility underlie the emergence of chromatin condensates, whose growth is likely promoted in mechanically soft (GC-rich) regions of the lowest chromatin and nucleosome densities. As a practical perspective, the strong linear association between sequence composition and interchromosomal contacts can help define consensus chromatin interactions, which in turn may be used to study alternative states of chromatin architecture.
Affiliation(s)
- Kamel Jabbari
  - Institute for Genetics, Biocenter Cologne, University of Cologne, Zülpicher Straße 47a, 50674, Köln, Germany
- Maharshi Chakraborty
  - Institute for Genetics, Biocenter Cologne, University of Cologne, Zülpicher Straße 47a, 50674, Köln, Germany
- Thomas Wiehe
  - Institute for Genetics, Biocenter Cologne, University of Cologne, Zülpicher Straße 47a, 50674, Köln, Germany
8. Branciamore S, Gogoshin G, Di Giulio M, Rodin AS. Intrinsic Properties of tRNA Molecules as Deciphered via Bayesian Network and Distribution Divergence Analysis. Life (Basel) 2018; 8:life8010005. [PMID: 29419741 PMCID: PMC5871937 DOI: 10.3390/life8010005] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 12/09/2017] [Revised: 01/22/2018] [Accepted: 01/23/2018] [Indexed: 12/27/2022] Open
Abstract
The identity/recognition of tRNAs, in the context of aminoacyl tRNA synthetases (and other molecules), is a complex phenomenon that has major implications ranging from the origins and evolution of translation machinery and genetic code to the evolution and speciation of tRNAs themselves to human mitochondrial diseases to artificial genetic code engineering. Deciphering it via laboratory experiments, however, is difficult and necessarily time- and resource-consuming. In this study, we propose a mathematically rigorous two-pronged in silico approach to identifying and classifying tRNA positions important for tRNA identity/recognition, rooted in machine learning and information-theoretic methodology. We apply Bayesian Network modeling to elucidate the structure of intra-tRNA-molecule relationships, and distribution divergence analysis to identify meaningful inter-molecule differences between various tRNA subclasses. We illustrate the complementary application of these two approaches using tRNA examples across the three domains of life, and identify and discuss important (informative) positions therein. In summary, we deliver to the tRNA research community a novel, comprehensive methodology for identifying the specific elements of interest in various tRNA molecules, which can be followed up by the corresponding experimental work and/or high-resolution position-specific statistical analyses.
Affiliation(s)
- Sergio Branciamore
  - Department of Diabetes Complications and Metabolism, Diabetes and Metabolism Research Institute, City of Hope, Duarte, 91010 CA, USA
- Grigoriy Gogoshin
  - Department of Diabetes Complications and Metabolism, Diabetes and Metabolism Research Institute, City of Hope, Duarte, 91010 CA, USA
- Massimo Di Giulio
  - Early Evolution of Life Laboratory, Institute of Biosciences and Bioresources, CNR, 80131 Naples, Italy
- Andrei S Rodin
  - Department of Diabetes Complications and Metabolism, Diabetes and Metabolism Research Institute, City of Hope, Duarte, 91010 CA, USA