1
|
Deyneko IV. BestCRM: An Exhaustive Search for Optimal Cis-Regulatory Modules in Promoters Accelerated by the Multidimensional Hash Function. Int J Mol Sci 2024; 25:1903. [PMID: 38339181 PMCID: PMC10856692 DOI: 10.3390/ijms25031903] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2023] [Revised: 01/24/2024] [Accepted: 01/26/2024] [Indexed: 02/12/2024] Open
Abstract
The concept of cis-regulatory modules located in gene promoters represents today's vision of the organization of gene transcriptional regulation. Such modules are a combination of two or more single, short DNA motifs. The bioinformatic identification of such modules belongs to so-called NP-hard problems with extreme computational complexity, and therefore, simplifications, assumptions, and heuristics are usually deployed to tackle the problem. In practice, this requires, first, many parameters to be set before the search, and second, it leads to the identification of locally optimal results. Here, a novel method is presented, aimed at identifying the cis-regulatory elements in gene promoters based on an exhaustive search of all the feasible modules' configurations. All required parameters are automatically estimated using positive and negative datasets. To be computationally efficient, the search is accelerated using a multidimensional hash function, allowing the search to complete in a few hours on a regular laptop (for example, a CPU Intel i7, 3.2 GH, 32 Gb RAM). Tests on an established benchmark and real data show better performance of BestCRM compared to the available methods according to several metrics like specificity, sensitivity, AUC, etc. A great practical advantage of the method is its minimum number of input parameters-apart from positive and negative promoters, only a desired level of module presence in promoters is required.
Collapse
Affiliation(s)
- Igor V Deyneko
- K.A. Timiryazev Institute of Plant Physiology RAS, 35 Botanicheskaya Str., Moscow 127276, Russia
| |
Collapse
|
2
|
Bentsen M, Heger V, Schultheis H, Kuenne C, Looso M. TF-COMB - discovering grammar of transcription factor binding sites. Comput Struct Biotechnol J 2022; 20:4040-4051. [PMID: 35983231 PMCID: PMC9358416 DOI: 10.1016/j.csbj.2022.07.025] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2022] [Accepted: 07/12/2022] [Indexed: 02/07/2023] Open
Abstract
Cooperativity between transcription factors is important to regulate target gene expression. In particular, the binding grammar of TFs in relation to each other, as well as in the context of other genomic elements, is crucial for TF functionality. However, tools to easily uncover co-occurrence between DNA-binding proteins, and investigate the regulatory modules of TFs, are limited. Here we present TF-COMB (Transcription Factor Co-Occurrence using Market Basket analysis) - a tool to investigate co-occurring TFs and binding grammar within regulatory regions. We found that TF-COMB can accurately identify known co-occurring TFs from ChIP-seq data, as well as uncover preferential localization to other genomic elements. With the use of ATAC-seq footprinting and TF motif locations, we found that TFs exhibit both preferred orientation and distance in relation to each other, and that these are biologically significant. Finally, we extended the analysis to not only investigate individual TF pairs, but also TF pairs in the context of networks, which enabled the investigation of TF complexes and TF hubs. In conclusion, TF-COMB is a flexible tool to investigate various aspects of TF binding grammar.
Collapse
Affiliation(s)
- Mette Bentsen
- Bioinformatics Core Unit (BCU), Max Planck Institute for Heart and Lung Research, Bad Nauheim, Germany
| | - Vanessa Heger
- Bioinformatics Core Unit (BCU), Max Planck Institute for Heart and Lung Research, Bad Nauheim, Germany
| | - Hendrik Schultheis
- Bioinformatics Core Unit (BCU), Max Planck Institute for Heart and Lung Research, Bad Nauheim, Germany
| | - Carsten Kuenne
- Bioinformatics Core Unit (BCU), Max Planck Institute for Heart and Lung Research, Bad Nauheim, Germany
| | - Mario Looso
- Bioinformatics Core Unit (BCU), Max Planck Institute for Heart and Lung Research, Bad Nauheim, Germany
- Cardio-Pulmonary Institute (CPI), Bad Nauheim, Germany
- Corresponding author at: Bioinformatics Core Unit (BCU), Max Planck Institute for Heart and Lung Research, Bad Nauheim, Germany.
| |
Collapse
|
3
|
Yang TH, Yang YC, Tu KC. regCNN: identifying Drosophila genome-wide cis-regulatory modules via integrating the local patterns in epigenetic marks and transcription factor binding motifs. Comput Struct Biotechnol J 2022; 20:296-308. [PMID: 35035784 PMCID: PMC8724954 DOI: 10.1016/j.csbj.2021.12.015] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2021] [Revised: 12/10/2021] [Accepted: 12/10/2021] [Indexed: 11/20/2022] Open
Abstract
Transcription regulation in metazoa is controlled by the binding events of transcription factors (TFs) or regulatory proteins on specific modular DNA regulatory sequences called cis-regulatory modules (CRMs). Understanding the distributions of CRMs on a genomic scale is essential for constructing the metazoan transcriptional regulatory networks that help diagnose genetic disorders. While traditional reporter-assay CRM identification approaches can provide an in-depth understanding of functions of some CRM, these methods are usually cost-inefficient and low-throughput. It is generally believed that by integrating diverse genomic data, reliable CRM predictions can be made. Hence, researchers often first resort to computational algorithms for genome-wide CRM screening before specific experiments. However, current existing in silico methods for searching potential CRMs were restricted by low sensitivity, poor prediction accuracy, or high computation time from TFBS composition combinatorial complexity. To overcome these obstacles, we designed a novel CRM identification pipeline called regCNN by considering the base-by-base local patterns in TF binding motifs and epigenetic profiles. On the test set, regCNN shows an accuracy/auROC of 84.5%/92.5% in CRM identification. And by further considering local patterns in epigenetic profiles and TF binding motifs, it can accomplish 4.7% (92.5%–87.8%) improvement in the auROC value over the average value-based pure multi-layer perceptron model. We also demonstrated that regCNN outperforms all currently available tools by at least 11.3% in auROC values. Finally, regCNN is verified to be robust against its resizing window hyperparameter in dealing with the variable lengths of CRMs. The model of regCNN can be downloaded athttp://cobisHSS0.im.nuk.edu.tw/regCNN/.
Collapse
Affiliation(s)
- Tzu-Hsien Yang
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan
| | - Ya-Chiao Yang
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan
| | - Kai-Chi Tu
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan
| |
Collapse
|
4
|
Yang TH. Transcription factor regulatory modules provide the molecular mechanisms for functional redundancy observed among transcription factors in yeast. BMC Bioinformatics 2019; 20:630. [PMID: 31881824 PMCID: PMC6933673 DOI: 10.1186/s12859-019-3212-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023] Open
Abstract
BACKGROUND Current technologies for understanding the transcriptional reprogramming in cells include the transcription factor (TF) chromatin immunoprecipitation (ChIP) experiments and the TF knockout experiments. The ChIP experiments show the binding targets of TFs against which the antibody directs while the knockout techniques find the regulatory gene targets of the knocked-out TFs. However, it was shown that these two complementary results contain few common targets. Researchers have used the concept of TF functional redundancy to explain the low overlap between these two techniques. But the detailed molecular mechanisms behind TF functional redundancy remain unknown. Without knowing the possible molecular mechanisms, it is hard for biologists to fully unravel the cause of TF functional redundancy. RESULTS To mine out the molecular mechanisms, a novel algorithm to extract TF regulatory modules that help explain the observed TF functional redundancy effect was devised and proposed in this research. The method first searched for candidate TF sets from the TF binding data. Then based on these candidate sets the method utilized the modified Steiner Tree construction algorithm to construct the possible TF regulatory modules from protein-protein interaction data and finally filtered out the noise-induced results by using confidence tests. The mined-out regulatory modules were shown to correlate to the concept of functional redundancy and provided testable hypotheses of the molecular mechanisms behind functional redundancy. And the biological significance of the mined-out results was demonstrated in three different biological aspects: ontology enrichment, protein interaction prevalence and expression coherence. About 23.5% of the mined-out TF regulatory modules were literature-verified. Finally, the biological applicability of the proposed method was shown in one detailed example of a verified TF regulatory module for pheromone response and filamentous growth in yeast. CONCLUSION In this research, a novel method that mined out the potential TF regulatory modules which elucidate the functional redundancy observed among TFs is proposed. The extracted TF regulatory modules not only correlate the molecular mechanisms to the observed functional redundancy among TFs, but also show biological significance in inferring TF functional binding target genes. The results provide testable hypotheses for biologists to further design subsequent research and experiments.
Collapse
Affiliation(s)
- Tzu-Hsien Yang
- Department of Information Management, National University of Kaohsiung, 700, Kaohsiung University Rd, Kaohsiung, 81148, Taiwan.
| |
Collapse
|
5
|
Meckbach C, Wingender E, Gültas M. Removing Background Co-occurrences of Transcription Factor Binding Sites Greatly Improves the Prediction of Specific Transcription Factor Cooperations. Front Genet 2018; 9:189. [PMID: 29896218 PMCID: PMC5986914 DOI: 10.3389/fgene.2018.00189] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2016] [Accepted: 05/08/2018] [Indexed: 12/17/2022] Open
Abstract
Today, it is well-known that in eukaryotic cells the complex interplay of transcription factors (TFs) bound to the DNA of promoters and enhancers is the basis for precise and specific control of transcription. Computational methods have been developed for the identification of potentially cooperating TFs through the co-occurrence of their binding sites (TFBSs). One challenge of these methods is the differentiation of TFBS pairs that are specific for a given sequence set from those that are ubiquitously appearing, rendering the results highly dependent on the choice of a proper background set. Here, we present an extension of our previous PC-TraFF approach that estimates the background co-occurrence of any TF pair by preserving the (oligo-) nucleotide composition and, thus, the core of TFBSs in the sequences of interest. Applying our approach to a simulated data set with implanted TFBS pairs, we could successfully identify them as sequence-set specific under a variety of conditions. When we analyzed the gene expression data sets of five breast cancer associated subtypes, the number of overlapping pairs could be dramatically reduced in comparison to our previous approach. As a result, we could identify potentially cooperating transcriptional regulators that are characteristic for each of the five breast cancer subtypes. This indicates that our approach is able to discriminate specific potential TF cooperations against ubiquitously occurring combinations. The results obtained with our method may help to understand the genetic programs governing specific biological processes such as the development of different tumor types.
Collapse
Affiliation(s)
- Cornelia Meckbach
- Institute of Bioinformatics, University Medical Center Göttingen, Georg-August-University Göttingen, Göttingen, Germany
| | - Edgar Wingender
- Institute of Bioinformatics, University Medical Center Göttingen, Georg-August-University Göttingen, Göttingen, Germany
| | - Mehmet Gültas
- Institute of Bioinformatics, University Medical Center Göttingen, Georg-August-University Göttingen, Göttingen, Germany.,Department of Breeding Informatics, Georg-August University Göttingen, Göttingen, Germany.,Center for Integrated Breeding Research (CiBreed), Georg-August University Göttingen, Göttingen, Germany
| |
Collapse
|
6
|
A New Algorithm for Identifying Cis-Regulatory Modules Based on Hidden Markov Model. BIOMED RESEARCH INTERNATIONAL 2018; 2017:6274513. [PMID: 28497059 PMCID: PMC5405574 DOI: 10.1155/2017/6274513] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/23/2016] [Revised: 03/06/2017] [Accepted: 03/23/2017] [Indexed: 11/24/2022]
Abstract
The discovery of cis-regulatory modules (CRMs) is the key to understanding mechanisms of transcription regulation. Since CRMs have specific regulatory structures that are the basis for the regulation of gene expression, how to model the regulatory structure of CRMs has a considerable impact on the performance of CRM identification. The paper proposes a CRM discovery algorithm called ComSPS. ComSPS builds a regulatory structure model of CRMs based on HMM by exploring the rules of CRM transcriptional grammar that governs the internal motif site arrangement of CRMs. We test ComSPS on three benchmark datasets and compare it with five existing methods. Experimental results show that ComSPS performs better than them.
Collapse
|
7
|
Huminiecki Ł, Horbańczuk J. Can We Predict Gene Expression by Understanding Proximal Promoter Architecture? Trends Biotechnol 2017; 35:530-546. [PMID: 28377102 DOI: 10.1016/j.tibtech.2017.03.007] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2016] [Revised: 02/14/2017] [Accepted: 03/09/2017] [Indexed: 10/19/2022]
Abstract
We review computational predictions of expression from the promoter architecture - the set of transcription factors that can bind the proximal promoter. We focus on spatial expression patterns in animals with complex body plans and many distinct tissue types. This field is ripe for change as functional genomics datasets accumulate for both expression and protein-DNA interactions. While there has been some success in predicting the breadth of expression (i.e., the fraction of tissue types a gene is expressed in), predicting tissue specificity remains challenging. We discuss how progress can be achieved through either machine learning or complementary combinatorial data mining. The likely impact of single-cell expression data is considered. Finally, we discuss the design of artificial promoters as a practical application.
Collapse
Affiliation(s)
- Łukasz Huminiecki
- Institute of Genetics and Animal Breeding, Polish Academy of Sciences, ul. Postępu 36A, Jastrzębiec, 05-552 Magdalenka, Poland.
| | - Jarosław Horbańczuk
- Institute of Genetics and Animal Breeding, Polish Academy of Sciences, ul. Postępu 36A, Jastrzębiec, 05-552 Magdalenka, Poland
| |
Collapse
|
8
|
Guo H, Huo H, Yu Q. SMCis: An Effective Algorithm for Discovery of Cis-Regulatory Modules. PLoS One 2016; 11:e0162968. [PMID: 27637070 PMCID: PMC5026350 DOI: 10.1371/journal.pone.0162968] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2016] [Accepted: 08/31/2016] [Indexed: 12/02/2022] Open
Abstract
The discovery of cis-regulatory modules (CRMs) is a challenging problem in computational biology. Limited by the difficulty of using an HMM to model dependent features in transcriptional regulatory sequences (TRSs), the probabilistic modeling methods based on HMMs cannot accurately represent the distance between regulatory elements in TRSs and are cumbersome to model the prevailing dependencies between motifs within CRMs. We propose a probabilistic modeling algorithm called SMCis, which builds a more powerful CRM discovery model based on a hidden semi-Markov model. Our model characterizes the regulatory structure of CRMs and effectively models dependencies between motifs at a higher level of abstraction based on segments rather than nucleotides. Experimental results on three benchmark datasets indicate that our method performs better than the compared algorithms.
Collapse
Affiliation(s)
- Haitao Guo
- School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China
| | - Hongwei Huo
- School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China
- * E-mail:
| | - Qiang Yu
- School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China
| |
Collapse
|
9
|
Meckbach C, Tacke R, Hua X, Waack S, Wingender E, Gültas M. PC-TraFF: identification of potentially collaborating transcription factors using pointwise mutual information. BMC Bioinformatics 2015; 16:400. [PMID: 26627005 PMCID: PMC4667426 DOI: 10.1186/s12859-015-0827-2] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2015] [Accepted: 11/17/2015] [Indexed: 01/06/2023] Open
Abstract
Background Transcription factors (TFs) are important regulatory proteins that govern transcriptional regulation. Today, it is known that in higher organisms different TFs have to cooperate rather than acting individually in order to control complex genetic programs. The identification of these interactions is an important challenge for understanding the molecular mechanisms of regulating biological processes. In this study, we present a new method based on pointwise mutual information, PC-TraFF, which considers the genome as a document, the sequences as sentences, and TF binding sites (TFBSs) as words to identify interacting TFs in a set of sequences. Results To demonstrate the effectiveness of PC-TraFF, we performed a genome-wide analysis and a breast cancer-associated sequence set analysis for protein coding and miRNA genes. Our results show that in any of these sequence sets, PC-TraFF is able to identify important interacting TF pairs, for most of which we found support by previously published experimental results. Further, we made a pairwise comparison between PC-TraFF and three conventional methods. The outcome of this comparison study strongly suggests that all these methods focus on different important aspects of interaction between TFs and thus the pairwise overlap between any of them is only marginal. Conclusions In this study, adopting the idea from the field of linguistics in the field of bioinformatics, we develop a new information theoretic method, PC-TraFF, for the identification of potentially collaborating transcription factors based on the idiosyncrasy of their binding site distributions on the genome. The results of our study show that PC-TraFF can succesfully identify known interacting TF pairs and thus its currently biologically uncorfirmed predictions could provide new hypotheses for further experimental validation. Additionally, the comparison of the results of PC-TraFF with the results of previous methods demonstrates that different methods with their specific scopes can perfectly supplement each other. Overall, our analyses indicate that PC-TraFF is a time-efficient method where its algorithm has a tractable computational time and memory consumption. The PC-TraFF server is freely accessible at http://pctraff.bioinf.med.uni-goettingen.de/ Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0827-2) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Cornelia Meckbach
- Institute of Bioinformatics, University of Göttingen, Goldschmidtstr. 1, Göttingen, 37077, Germany.
| | - Rebecca Tacke
- Institute of Bioinformatics, University of Göttingen, Goldschmidtstr. 1, Göttingen, 37077, Germany.
| | - Xu Hua
- Institute of Bioinformatics, University of Göttingen, Goldschmidtstr. 1, Göttingen, 37077, Germany.
| | - Stephan Waack
- Institute of Computer Science, University of Göttingen, Goldschmidtstr. 7, Göttingen, 37077, Germany.
| | - Edgar Wingender
- Institute of Bioinformatics, University of Göttingen, Goldschmidtstr. 1, Göttingen, 37077, Germany.
| | - Mehmet Gültas
- Institute of Bioinformatics, University of Göttingen, Goldschmidtstr. 1, Göttingen, 37077, Germany.
| |
Collapse
|