1
Schroeder JW, Wolfe MB, Freddolino L. ShapeME: A tool and web front-end for de novo discovery of structural motifs underpinning protein-DNA interactions. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.01.28.635290. [PMID: 39975017 PMCID: PMC11838363 DOI: 10.1101/2025.01.28.635290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
Determining where transcriptional regulators bind within a genome is paramount to understanding how gene expression is regulated. Historically, position weight matrices (PWMs) have been used to define the binding preferences of DNA binding proteins [1]. However, PWMs treat the identity of each base in a sequence as an independent and additive measure of binding preference, which can limit their utility [2]. Models that consider higher order interactions between nearby bases yield greater success in predicting proteins' binding to DNA, but for many proteins there is still substantial room for improvement in predicting and understanding the determinants of proteins' binding to DNA [3]. In addition to DNA sequence motifs, structural motifs (e.g., a narrow minor groove width) are important determinants of binding for some DNA-binding proteins [4]. Despite the initial success of algorithms using structural features of DNA to predict binding properties of proteins from either ChIP-seq or SELEX data [5-8], there remains a need for a de novo structural motif discovery framework which can be applied to data from a variety of experimental designs. Here, we present a unified workflow, capable of utilizing virtually any type of data representing sequence coverage or enrichment (e.g. ChIP-seq, RNA-seq, SELEX, etc.), to discover short structural motifs with explanatory power for a protein's DNA binding preference. We couple the DNAshapeR algorithm [9] with our own information-theoretic approach to de novo motif discovery, and wrap shape and sequence motif inference and model selection into a single tool called ShapeME. Application of our structural motif discovery algorithm to proteins with ChIP-seq data in ENCODE datasets reveals a subset of proteins where short structural motifs outperform the best PWM for that protein as determined from the JASPAR database, or as identified by the sequence motif elicitation tool STREME. Our approach offers a powerful and versatile framework for inferring structural DNA binding motifs, and will complement current sequence-based motif elicitation tools in discovery of protein-DNA interaction principles. A web-based interface to ShapeME is available at https://seq2fun.dcmb.med.umich.edu/shapeme, with full source code available at https://github.com/freddolino-lab/ShapeME.
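The abstract's point that a PWM treats each base as an independent, additive contribution can be illustrated with a small sketch. It is not taken from ShapeME: the count matrix `TOY_COUNTS`, the uniform background, the pseudocount, and all function names are assumptions made for illustration only.

```python
import math

BASES = "ACGT"

# Hypothetical counts of each base at four aligned positions of known binding sites.
TOY_COUNTS = [
    {"A": 9, "C": 0, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 0, "C": 1, "G": 9, "T": 0},
    {"A": 1, "C": 0, "G": 0, "T": 9},
]


def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2-odds weights against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + pseudocount * len(BASES)
        pwm.append(
            {b: math.log2((column[b] + pseudocount) / total / background) for b in BASES}
        )
    return pwm


def score_site(pwm, site):
    """Additive score: every position contributes independently of its neighbours."""
    return sum(weights[base] for weights, base in zip(pwm, site))


def best_window(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(
        (score_site(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


pwm = build_pwm(TOY_COUNTS)
top_score, top_start = best_window(pwm, "GGACGTGG")
print(top_start, round(top_score, 2))
```

Because the score is a plain sum over positions, no term can express a preference that depends on two bases jointly, which is the limitation the abstract attributes to PWMs.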
Affiliation(s)
- Jeremy W. Schroeder
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
- Michael B. Wolfe
  - Department of Biochemistry, University of Wisconsin - Madison, Madison, WI 53706, USA
- Lydia Freddolino
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
2
Lu Q, Xu J, Zhang R, Liu H, Wang M, Liu X, Yue Z, Gao Y. RiceSNP-ABST: a deep learning approach to identify abiotic stress-associated single nucleotide polymorphisms in rice. Brief Bioinform 2024; 26:bbae702. [PMID: 39757606 PMCID: PMC11962596 DOI: 10.1093/bib/bbae702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2024] [Revised: 11/16/2024] [Accepted: 12/23/2024] [Indexed: 01/07/2025] Open
Abstract
Given the adverse effects faced by rice due to abiotic stresses, the precise and rapid identification of single nucleotide polymorphisms (SNPs) associated with abiotic stress traits (ABST-SNPs) in rice is crucial for developing resistant rice varieties. The scarcity of high-quality data related to abiotic stress in rice has hindered the development of computational models and constrained research efforts aimed at rice improvement and breeding. Genome-wide association studies provide a better statistical power to consider ABST-SNPs in rice. Meanwhile, deep learning methods have shown their capability in predicting disease- or phenotype-associated loci, but have primarily focused on human species. Therefore, developing predictive models for identifying ABST-SNPs in rice is both urgent and valuable. In this paper, a model called RiceSNP-ABST is proposed for predicting ABST-SNPs in rice. Firstly, six training datasets were generated using a novel strategy for negative sample construction. Secondly, four feature encoding methods were proposed based on DNA sequence fragments, followed by feature selection. Finally, convolutional neural networks with residual connections were used to determine whether the sequences contained rice ABST-SNPs. RiceSNP-ABST outperformed traditional machine learning and state-of-the-art methods on the benchmark dataset and demonstrated consistent generalization on an independent dataset and cross-species datasets. Notably, multi-granularity causal structure learning was employed to elucidate the relationships among DNA structural features, aiming to identify key genetic variants more effectively. The web-based tool for the RiceSNP-ABST can be accessed at http://rice-snp-abst.aielab.cc.
Affiliation(s)
- Quan Lu
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Jiajun Xu
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Renyi Zhang
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Hangcheng Liu
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Meng Wang
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Xiaoshuang Liu
  - Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Zhenyu Yue
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
  - Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Yujia Gao
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
  - Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
3
Xu J, Gao Y, Lu Q, Zhang R, Gui J, Liu X, Yue Z. RiceSNP-BST: a deep learning framework for predicting biotic stress-associated SNPs in rice. Brief Bioinform 2024; 25:bbae599. [PMID: 39562160 PMCID: PMC11576077 DOI: 10.1093/bib/bbae599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2024] [Revised: 10/07/2024] [Accepted: 11/04/2024] [Indexed: 11/21/2024] Open
Abstract
Rice consistently faces significant threats from biotic stresses, such as fungi, bacteria, pests, and viruses. Consequently, accurately and rapidly identifying previously unknown single-nucleotide polymorphisms (SNPs) in the rice genome is a critical challenge for rice research and the development of resistant varieties. However, the limited availability of high-quality rice genotype data has hindered this research. Deep learning has transformed biological research by facilitating the prediction and analysis of SNPs in biological sequence data. Convolutional neural networks are especially effective in extracting structural and local features from DNA sequences, leading to significant advancements in genomics. Nevertheless, the expanding catalog of genome-wide association studies provides valuable biological insights for rice research. Expanding on this idea, we introduce RiceSNP-BST, an automatic architecture search framework designed to predict SNPs associated with rice biotic stress traits (BST-associated SNPs) by integrating multidimensional features. Notably, the model successfully innovates the datasets, offering more precision than state-of-the-art methods while demonstrating good performance on an independent test set and cross-species datasets. Additionally, we extracted features from the original DNA sequences and employed causal inference to enhance the biological interpretability of the model. This study highlights the potential of RiceSNP-BST in advancing genome prediction in rice. Furthermore, a user-friendly web server for RiceSNP-BST (http://rice-snp-bst.aielab.cc) has been developed to support broader genome research.
Affiliation(s)
- Jiajun Xu
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Yujia Gao
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Quan Lu
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Renyi Zhang
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Jianfeng Gui
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Xiaoshuang Liu
  - Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Zhenyu Yue
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
  - Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
4
Li J, Rohs R. Deep DNAshape webserver: prediction and real-time visualization of DNA shape considering extended k-mers. Nucleic Acids Res 2024; 52:W7-W12. [PMID: 38801070 PMCID: PMC11223853 DOI: 10.1093/nar/gkae433] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2024] [Revised: 04/30/2024] [Accepted: 05/08/2024] [Indexed: 05/29/2024] Open
Abstract
Sequence-dependent DNA shape plays an important role in understanding protein-DNA binding mechanisms. High-throughput prediction of DNA shape features has become a valuable tool in the field of protein-DNA recognition, transcription factor-DNA binding specificity, and gene regulation. However, our widely used webserver, DNAshape, relies on statistically summarized pentamer query tables to query DNA shape features. These query tables do not consider flanking regions longer than two base pairs, and acquiring a query table for hexamers or higher-order k-mers is currently still unrealistic due to limitations in achieving sufficient statistical coverage in molecular simulations or structural biology experiments. A recent deep-learning method, Deep DNAshape, can predict DNA shape features at the core of a DNA fragment considering flanking regions of up to seven base pairs, trained on limited simulation data. However, Deep DNAshape is rather complicated to install, and it must run locally compared to the pentamer-based DNAshape webserver, creating a barrier for users. Here, we present the Deep DNAshape webserver, which has the benefits of both methods while being accurate, fast, and accessible to all users. Additional improvements of the webserver include the detection of user input in real time, the ability of interactive visualization tools and different modes of analyses. URL: https://deepdnashape.usc.edu.
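The pentamer query-table lookup described above can be sketched as a sliding window over the sequence. This is only an illustration, assuming a shape feature defined per nucleotide; `TOY_TABLE` holds placeholder numbers rather than real DNAshape values, and `shape_profile` is a hypothetical helper, not part of any DNAshape API.

```python
# Placeholder values only: a real query table has one entry per pentamer (4**5 = 1024 keys).
TOY_TABLE = {"AATTC": 3.0, "ATTCG": 4.0, "ATTCC": 4.5}


def shape_profile(sequence, table, k=5):
    """Assign each position the table value of the k-mer centred on it.

    Positions too close to either end, or whose k-mer is missing from the
    table, are left as None.
    """
    flank = k // 2
    profile = [None] * len(sequence)
    for i in range(flank, len(sequence) - flank):
        profile[i] = table.get(sequence[i - flank:i + flank + 1])
    return profile


# Two sequences that share the central pentamer AATTC but differ in the outer flanks.
left = shape_profile("GAATTCG", TOY_TABLE)
right = shape_profile("CAATTCC", TOY_TABLE)
print(left[3], right[3])
```

Both sequences receive the same value at the central position because the lookup sees only two flanking bases on each side, which is the limitation that motivates the extended-flank predictions of Deep DNAshape.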
Affiliation(s)
- Jinsen Li
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Remo Rohs
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
  - Department of Chemistry, University of Southern California, Los Angeles, CA 90089, USA
  - Department of Physics and Astronomy, University of Southern California, Los Angeles, CA 90089, USA
  - Thomas Lord Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
5
Horton CA, Alexandari AM, Hayes MGB, Marklund E, Schaepe JM, Aditham AK, Shah N, Suzuki PH, Shrikumar A, Afek A, Greenleaf WJ, Gordân R, Zeitlinger J, Kundaje A, Fordyce PM. Short tandem repeats bind transcription factors to tune eukaryotic gene expression. Science 2023; 381:eadd1250. [PMID: 37733848 DOI: 10.1126/science.add1250] [Citation(s) in RCA: 72] [Impact Index Per Article: 36.0] [Received: 05/24/2022] [Accepted: 07/26/2023] [Indexed: 09/23/2023]
Abstract
Short tandem repeats (STRs) are enriched in eukaryotic cis-regulatory elements and alter gene expression, yet how they regulate transcription remains unknown. We found that STRs modulate transcription factor (TF)-DNA affinities and apparent on-rates by about 70-fold by directly binding TF DNA-binding domains, with energetic impacts exceeding many consensus motif mutations. STRs maximize the number of weakly preferred microstates near target sites, thereby increasing TF density, with impacts well predicted by statistical mechanics. Confirming that STRs also affect TF binding in cells, neural networks trained only on in vivo occupancies predicted effects identical to those observed in vitro. Approximately 90% of TFs preferentially bound STRs that need not resemble known motifs, providing a cis-regulatory mechanism to target TFs to genomic sites.
Affiliation(s)
- Connor A Horton
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Amr M Alexandari
  - Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Michael G B Hayes
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Emil Marklund
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Julia M Schaepe
  - Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
- Arjun K Aditham
  - Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
  - ChEM-H Institute, Stanford University, Stanford, CA 94305, USA
- Nilay Shah
  - Stowers Institute for Medical Research, Kansas City, MO 64110, USA
- Peter H Suzuki
  - Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
- Avanti Shrikumar
  - Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Ariel Afek
  - Center for Genomic and Computational Biology, Duke University School of Medicine, Durham, NC 27710, USA
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27710, USA
  - Department of Chemical and Structural Biology, Weizmann Institute of Science, Rehovot 7610001, Israel
- Raluca Gordân
  - Center for Genomic and Computational Biology, Duke University School of Medicine, Durham, NC 27710, USA
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27710, USA
  - Department of Computer Science, Duke University, Durham, NC 27708, USA
  - Department of Molecular Genetics and Microbiology, Duke University School of Medicine, Durham, NC 27710, USA
- Julia Zeitlinger
  - Stowers Institute for Medical Research, Kansas City, MO 64110, USA
  - The University of Kansas Medical Center, Kansas City, KS 66103, USA
- Anshul Kundaje
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
  - Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Polly M Fordyce
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
  - Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
  - ChEM-H Institute, Stanford University, Stanford, CA 94305, USA
  - Chan Zuckerberg Biohub, San Francisco, CA 94110, USA
6
Samee MAH. Noncanonical binding of transcription factors: time to revisit specificity? Mol Biol Cell 2023; 34:pe4. [PMID: 37486893 PMCID: PMC10398899 DOI: 10.1091/mbc.e22-08-0325] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 06/05/2023] [Accepted: 06/21/2023] [Indexed: 07/26/2023] Open
Abstract
Transcription factors (TFs) are one of the most studied classes of DNA-binding proteins that have a direct functional impact on gene transcription and thus, on human physiology and disease. The mechanisms that TFs use for recognizing target DNA binding sites have been studied for nearly five decades, yet they remain poorly understood. It is classically assumed that a TF recognizes a specific sequence pattern, or motif, as its binding sites. However, recent studies are consistently finding examples of noncanonical binding, that is, TFs binding at sites that do not resemble their sequence motifs. Here we review the current literature on four major types of noncanonical TF binding, namely binding based on DNA shape readout, at Guanine-quadruplex structures, at repeat sequences, and bispecific binding. These examples point to a critical need for studies to unify our current observations, many of which are at odds with the "one TF, one motif" view, into a more comprehensive definition of the DNA-binding specificity of TFs.
7
Boumpas P, Merabet S, Carnesecchi J. Integrating transcription and splicing into cell fate: Transcription factors on the block. WILEY INTERDISCIPLINARY REVIEWS. RNA 2023; 14:e1752. [PMID: 35899407 DOI: 10.1002/wrna.1752] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/31/2022] [Revised: 06/22/2022] [Accepted: 07/01/2022] [Indexed: 11/10/2022]
Abstract
Transcription factors (TFs) are present in all life forms and conserved across great evolutionary distances in eukaryotes. From yeast to complex multicellular organisms, they are pivotal players of cell fate decision by orchestrating gene expression at diverse molecular layers. Notably, TFs fine-tune gene expression by coordinating RNA fate at both the expression and splicing levels. They regulate alternative splicing, an essential mechanism for cell plasticity, allowing the production of many mRNA and protein isoforms in precise cell and tissue contexts. Despite this apparent role in splicing, how TFs integrate transcription and splicing to ultimately orchestrate diverse cell functions and cell fate decisions remains puzzling. We depict substantial studies in various model organisms underlining the key role of TFs in alternative splicing for promoting tissue-specific functions and cell fate. Furthermore, we emphasize recent advances describing the molecular link between the transcriptional and splicing activities of TFs. As TFs can bind both DNA and/or RNA to regulate transcription and splicing, we further discuss their flexibility and compatibility for DNA and RNA substrates. Finally, we propose several models integrating transcription and splicing activities of TFs in the coordination and diversification of cell and tissue identities. This article is categorized under: RNA Processing > Splicing Regulation/Alternative Splicing; RNA Interactions with Proteins and Other Molecules > Protein-RNA Interactions: Functional Implications; RNA Processing > Splicing Mechanisms.
Affiliation(s)
- Panagiotis Boumpas
  - Institut de Génomique Fonctionnelle de Lyon, UMR5242, Ecole Normale Supérieure de Lyon, Centre National de la Recherche Scientifique, Université Claude Bernard-Lyon 1, Lyon, France
- Samir Merabet
  - Institut de Génomique Fonctionnelle de Lyon, UMR5242, Ecole Normale Supérieure de Lyon, Centre National de la Recherche Scientifique, Université Claude Bernard-Lyon 1, Lyon, France
- Julie Carnesecchi
  - Institut de Génomique Fonctionnelle de Lyon, UMR5242, Ecole Normale Supérieure de Lyon, Centre National de la Recherche Scientifique, Université Claude Bernard-Lyon 1, Lyon, France
8
Zhang Q, Zhang Y, Wang S, Chen ZH, Gribova V, Filaretov VF, Huang DS. Predicting In-Vitro DNA-Protein Binding With a Spatially Aligned Fusion of Sequence and Shape. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3144-3153. [PMID: 34882561 DOI: 10.1109/tcbb.2021.3133869] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Discovery of transcription factor binding sites (TFBSs) is of primary importance for understanding the underlying binding mechanic and gene regulation process. Growing evidence indicates that apart from the primary DNA sequences, DNA shape landscape has a significant influence on transcription factor binding preference. To effectively model the co-influence of sequence and shape features, we emphasize the importance of position information of sequence motif and shape pattern. In this paper, we propose a novel deep learning-based architecture, named hybridShape eDeepCNN, for TFBS prediction which integrates DNA sequence and shape information in a spatially aligned manner. Our model utilizes the power of the multi-layer convolutional neural network and constructs an independent subnetwork to adapt for the distinct data distribution of heterogeneous features. Besides, we explore the usage of continuous embedding vectors as the representation of DNA sequences. Based on the experiments on 20 in-vitro datasets derived from universal protein binding microarrays (uPBMs), we demonstrate the superiority of our proposed method and validate the underlying design logic.
9
Zhang Y, Liu Y, Wang Z, Wang M, Xiong S, Huang G, Gong M. Uncovering the Relationship between Tissue-Specific TF-DNA Binding and Chromatin Features through a Transformer-Based Model. Genes (Basel) 2022; 13:1952. [PMID: 36360189 PMCID: PMC9690320 DOI: 10.3390/genes13111952] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/16/2022] [Revised: 10/19/2022] [Accepted: 10/23/2022] [Indexed: 09/08/2024] Open
Abstract
Chromatin features can reveal tissue-specific TF-DNA binding, which leads to a better understanding of many critical physiological processes. Accurately identifying TF-DNA bindings and constructing their relationships with chromatin features is a long-standing goal in the bioinformatic field. However, this has remained elusive due to the complex binding mechanisms and heterogeneity among inputs. Here, we have developed the GHTNet (General Hybrid Transformer Network), a transformer-based model to predict TF-DNA binding specificity. The GHTNet decodes the relationship between tissue-specific TF-DNA binding and chromatin features via a specific input scheme of alternative inputs and reveals important gene regions and tissue-specific motifs. Our experiments show that the GHTNet has excellent performance, achieving about a 5% absolute improvement over existing methods. The TF-DNA binding mechanism analysis shows that the importance of TF-DNA binding features varies across tissues. The best predictor is based on the DNA sequence, followed by epigenomics and shape. In addition, cross-species studies address the limited data, thus providing new ideas in this case. Moreover, the GHTNet is applied to interpret the relationship among TFs, chromatin features, and diseases associated with AD46 tissue. This paper demonstrates that the GHTNet is an accurate and robust framework for deciphering tissue-specific TF-DNA binding and interpreting non-coding regions.
Affiliation(s)
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Yuhang Liu
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Zixuan Wang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Maocheng Wang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Shuwen Xiong
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Guo Huang
  - School of Electronic Information and Artificial Intelligence, Leshan Normal University, Leshan 614000, China
- Meiqin Gong
  - West China Second University Hospital, Sichuan University, Chengdu 610041, China
10
Towards a better understanding of TF-DNA binding prediction from genomic features. Comput Biol Med 2022; 149:105993. [DOI: 10.1016/j.compbiomed.2022.105993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 07/12/2022] [Accepted: 08/14/2022] [Indexed: 11/17/2022]
11
Spirov AV, Myasnikova EM. Heuristic algorithms in evolutionary computation and modular organization of biological macromolecules: Applications to in vitro evolution. PLoS One 2022; 17:e0260497. [PMID: 35085255 PMCID: PMC8794168 DOI: 10.1371/journal.pone.0260497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2021] [Accepted: 11/10/2021] [Indexed: 11/19/2022] Open
Abstract
Evolutionary computing (EC) is an area of computer science and applied mathematics covering heuristic optimization algorithms inspired by evolution in Nature. EC extensively studies the full variety of methods that were originally based on the principles of selectionism. As a result, many new algorithms and approaches, significantly more efficient than classical selectionist schemes, were found. This is especially true for some families of special problems. There are strong arguments to believe that EC approaches are quite suitable for modeling and numerical analysis of those methods of synthetic biology and biotechnology that are known as in vitro evolution. Therefore, it is natural to expect that the new algorithms and approaches developed in EC can be effectively applied in experiments on the directed evolution of biological macromolecules. According to John Holland's Schema theorem, the effective evolutionary search in genetic algorithms (GA) is provided by identifying short schemata of high fitness which in the further search recombine into larger building blocks (BBs) with higher and higher fitness. The multimodularity of functional biological macromolecules and the preservation of already found modules in the evolutionary search have a clear analogy with the BBs in EC. It seems reasonable to try to transfer and introduce the methods of EC, preserving BBs and essentially accelerating the search, into experiments on in vitro evolution. We extend the key instrument of Holland's theory, the Royal Roads fitness function, to problems of in vitro evolution (Biological Royal Staircase, BioRS, functions). The specific version of BioRS developed in this publication arises from the realities of experimental evolutionary search for (DNA-) RNA-devices (aptazymes). Our numerical tests showed that for problems with the BioRS functions, simple heuristic algorithms, which turned out to be very effective for preserving BBs in GA, can be very effective in in vitro evolution approaches. We are convinced that such algorithms can be implemented in modern methods of in vitro evolution to achieve significant savings in time and resources and a significant increase in the efficiency of evolutionary search.
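For readers unfamiliar with the Royal Road idea the abstract builds on, here is a minimal sketch of the classical bit-string form, in which fitness is awarded only for fully completed blocks. It is not the BioRS function from the paper; the function name, the block size of 8, and the 64-bit example strings are illustrative assumptions.

```python
def royal_road(bits, block_size=8):
    """Classical Royal Road fitness: each block made entirely of ones adds block_size."""
    if len(bits) % block_size != 0:
        raise ValueError("length of bits must be a multiple of block_size")
    return sum(
        block_size
        for start in range(0, len(bits), block_size)
        if all(bits[start:start + block_size])
    )


complete = [1] * 64                 # all eight blocks complete
one_block = [1] * 8 + [0] * 56      # only the first block complete
almost = [1] * 7 + [0] * 57         # first block is one bit short, so it earns nothing
print(royal_road(complete), royal_road(one_block), royal_road(almost))
```

The step-like landscape, where a block one bit short of completion earns nothing, is what makes preserving already-found building blocks during search so valuable.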
Affiliation(s)
- Alexander V. Spirov
  - I. M. Sechenov Institute of Evolutionary Physiology and Biochemistry Russian Academy of Sciences, St. Petersburg, Russia
  - The Institute of Scientific Information for Social Sciences RAS, Moscow, Russia
12
Guy JL, Mor GG. Transcription Factor-Binding Site Identification and Enrichment Analysis. METHODS IN MOLECULAR BIOLOGY (CLIFTON, N.J.) 2021; 2255:241-261. [PMID: 34033108 DOI: 10.1007/978-1-0716-1162-3_20] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/22/2022]
Abstract
Transcription factors orchestrate complex regulatory networks of gene expression. A better understanding of the common transcription factors, and their shared interactions, among a set of coregulated or differentially expressed genes can provide powerful insights into the key pathways governing such expression patterns. Critically, such information must also be considered in the context of the frequency in which a transcription factor is present in a properly selected background, and in the context of existing evidence of gene and transcription factor interaction. Given the vast amount of publicly available gene expression data that can be further scrutinized by the user-friendly analysis tools described here, many useful insights are assuredly to be revealed. The proceeding methods for application of the analysis tool CiiiDER for transcription factor-binding site identification, enrichment analysis, and coregulatory factor identification should be applicable to any dataset comparing differential gene expression in response to various stimuli and gene coexpression datasets. These methods should assist the researcher in identifying the most relevant regulators within a gene set, and refining the list of targets for future study to those which may share biologically important regulatory networks.
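The need to judge a transcription factor's frequency against a properly selected background can be made concrete with a generic over-representation test. The sketch below uses a hypergeometric upper tail; this is a common choice, assumed here for illustration, and is not claimed to be the statistic CiiiDER itself computes. The function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population, with_site, sample, sample_hits):
    """Hypergeometric upper tail P(X >= sample_hits).

    population:  total number of genes considered (query set plus background)
    with_site:   genes in the population carrying a predicted binding site
    sample:      number of genes in the query set
    sample_hits: query-set genes carrying the site
    """
    upper = min(sample, with_site)
    total = comb(population, sample)
    return sum(
        comb(with_site, i) * comb(population - with_site, sample - i)
        for i in range(sample_hits, upper + 1)
    ) / total


# Toy numbers: 5 of 10 genes carry the site, and all 5 query genes are among them.
p = enrichment_p_value(population=10, with_site=5, sample=5, sample_hits=5)
print(p)
```

A small p-value here says only that the site is more frequent in the query set than chance draws from this particular background would suggest, so the result is only as meaningful as the background is well chosen.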
Affiliation(s)
- Joe L Guy
- Department of Obstetrics and Gynecology, C.S. Mott Center for Human Growth and Development, Wayne State University, Detroit, MI, USA
| | - Gil G Mor
- Department of Obstetrics and Gynecology, C.S. Mott Center for Human Growth and Development, Wayne State University, Detroit, MI, USA.
| |
|
13
|
Schnepf M, von Reutern M, Ludwig C, Jung C, Gaul U. Transcription Factor Binding Affinities and DNA Shape Readout. iScience 2020; 23:101694. [PMID: 33163946 PMCID: PMC7607496 DOI: 10.1016/j.isci.2020.101694] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 05/29/2020] [Revised: 09/30/2020] [Accepted: 10/13/2020] [Indexed: 12/16/2022] Open
Abstract
An essential event in gene regulation is the binding of a transcription factor (TF) to its target DNA. Models considering the interactions between the TF and the DNA geometry proved to be successful approaches to describe this binding event, while conserving data interpretability. However, a direct characterization of the DNA shape contribution to binding is still missing due to the lack of accurate and large-scale binding affinity data. Here, we use a binding assay we recently established to measure with high sensitivity the binding specificities of 13 Drosophila TFs, including dinucleotide dependencies to capture non-independent amino acid-base interactions. Correlating the binding affinities with all DNA shape features, we find that shape readout is widely used by these factors. A shape readout/TF-DNA complex structure analysis validates our approach while providing biological insights such as positively charged or highly polar amino acids often contact nucleotides that exhibit strong shape readout.
- The DNA shape contribution to Drosophila TFs-DNA binding is directly characterized
- Zeroth- and first-order TF-DNA binding specificities are measured with high accuracy
- DNA shape readout is widely used by these TFs
- A shape readout/structural correlation analysis provides biological insights
Affiliation(s)
- Max Schnepf
- Gene Center and Department of Biochemistry, Center for Protein Science Munich (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, 81377 München, Germany
| | - Marc von Reutern
- Gene Center and Department of Biochemistry, Center for Protein Science Munich (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, 81377 München, Germany
| | - Claudia Ludwig
- Gene Center and Department of Biochemistry, Center for Protein Science Munich (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, 81377 München, Germany
| | - Christophe Jung
- Gene Center and Department of Biochemistry, Center for Protein Science Munich (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, 81377 München, Germany
| | - Ulrike Gaul
- Gene Center and Department of Biochemistry, Center for Protein Science Munich (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, 81377 München, Germany
| |
|
14
|
Pal S, Przytycka TM. Bioinformatics pipeline using JUDI: Just Do It! Bioinformatics 2020; 36:2572-2574. [PMID: 31882996 DOI: 10.1093/bioinformatics/btz956] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/21/2019] [Revised: 12/05/2019] [Accepted: 12/24/2019] [Indexed: 12/22/2022] Open
Abstract
SUMMARY: Large-scale data analysis in bioinformatics requires pipelined execution of multiple software tools. Generally, each stage in a pipeline takes considerable computing resources, and several workflow management systems (WMS), e.g. Snakemake, Nextflow, Common Workflow Language, Galaxy, etc., have been developed to ensure optimum execution of the stages across two invocations of the pipeline. However, when the pipeline needs to be executed with different settings of parameters, e.g. thresholds, underlying algorithms, etc., these WMS require significant scripting to ensure an optimal execution. We developed JUDI on top of DoIt, a Python-based WMS, to systematically handle parameter settings based on the principles of database management systems. Using a novel modular approach that encapsulates a parameter database in each task and file associated with a pipeline stage, JUDI simplifies plug-and-play of the pipeline stages. For a typical pipeline with n parameters, JUDI reduces the number of lines of scripting required by a factor of O(n). With properly designed parameter databases, JUDI not only enables reproducing research under published values of parameters but also facilitates exploring newer results under novel parameter settings. AVAILABILITY AND IMPLEMENTATION: https://github.com/ncbi/JUDI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Soumitra Pal
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Teresa M Przytycka
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| |
|
15
|
Chiu TP, Xin B, Markarian N, Wang Y, Rohs R. TFBSshape: an expanded motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res 2020; 48:D246-D255. [PMID: 31665425 PMCID: PMC7145579 DOI: 10.1093/nar/gkz970] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 09/15/2019] [Revised: 10/08/2019] [Accepted: 10/11/2019] [Indexed: 12/31/2022] Open
Abstract
TFBSshape (https://tfbsshape.usc.edu) is a motif database for analyzing structural profiles of transcription factor binding sites (TFBSs). The main rationale for this database is to be able to derive mechanistic insights in protein-DNA readout modes from sequencing data without available structures. We extended the quantity and dimensionality of TFBSshape, from mostly in vitro to in vivo binding and from unmethylated to methylated DNA. This new release of TFBSshape improves its functionality and launches a responsive and user-friendly web interface for easy access to the data. The current expansion includes new entries from the most recent collections of transcription factors (TFs) from the JASPAR and UniPROBE databases, methylated TFBSs derived from in vitro high-throughput EpiSELEX-seq binding assays and in vivo methylated TFBSs from the MeDReaders database. TFBSshape content has increased to 2428 structural profiles for 1900 TFs from 39 different species. The structural profiles for each TFBS entry now include 13 shape features and minor groove electrostatic potential for standard DNA and four shape features for methylated DNA. We improved the flexibility and accuracy for the shape-based alignment of TFBSs and designed new tools to compare methylated and unmethylated structural profiles of TFs and methods to derive DNA shape-preserving nucleotide mutations in TFBSs.
Affiliation(s)
- Tsu-Pei Chiu
- Quantitative and Computational Biology, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Beibei Xin
- Quantitative and Computational Biology, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Nicholas Markarian
- Quantitative and Computational Biology, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Yingfei Wang
- Quantitative and Computational Biology, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Remo Rohs
- Quantitative and Computational Biology, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| |
|
16
|
Ambrosini G, Vorontsov I, Penzar D, Groux R, Fornes O, Nikolaeva DD, Ballester B, Grau J, Grosse I, Makeev V, Kulakovskiy I, Bucher P. Insights gained from a comprehensive all-against-all transcription factor binding motif benchmarking study. Genome Biol 2020; 21:114. [PMID: 32393327 PMCID: PMC7212583 DOI: 10.1186/s13059-020-01996-3] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 05/29/2019] [Accepted: 03/11/2020] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND: The positional weight matrix (PWM) is a de facto standard model to describe transcription factor (TF) DNA binding specificities. PWMs inferred from in vivo or in vitro data are stored in many databases and used in a plethora of biological applications. This calls for comprehensive benchmarking of public PWM models with large experimental reference sets. RESULTS: Here we report results from all-against-all benchmarking of PWM models for DNA binding sites of human TFs on a large compilation of in vitro (HT-SELEX, PBM) and in vivo (ChIP-seq) binding data. We observe that the best performing PWM for a given TF often belongs to another TF, usually from the same family. Occasionally, binding specificity is correlated with the structural class of the DNA binding domain, indicated by good cross-family performance measures. Benchmarking-based selection of family-representative motifs is more effective than motif clustering-based approaches. Overall, there is good agreement between in vitro and in vivo performance measures. However, for some in vivo experiments, the best performing PWM is assigned to an unrelated TF, indicating a binding mode involving protein-protein cooperativity. CONCLUSIONS: In an all-against-all setting, we compute more than 18 million performance measure values for different PWM-experiment combinations and offer these results as a public resource to the research community. The benchmarking protocols are provided via a web interface and as docker images. The methods and results from this study may help others make better use of public TF specificity models, as well as public TF binding data sets.
Affiliation(s)
- Giovanna Ambrosini
- School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
- Swiss Institute of Bioinformatics (SIB), CH-1015, Lausanne, Switzerland
| | - Ilya Vorontsov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Gubkina 3, Moscow, Russia, 119991
- Institute of Protein Research, Russian Academy of Sciences, Institutskaya 4, Pushchino, Russia, 142290
| | - Dmitry Penzar
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Gubkina 3, Moscow, Russia, 119991
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Leninskiye gory 1-73, Moscow, Russia, 119234
- Moscow Institute of Physics and Technology (State University), Institutskiy per. 9, Dolgoprudny, Russia, 141700
| | - Romain Groux
- School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
- Swiss Institute of Bioinformatics (SIB), CH-1015, Lausanne, Switzerland
| | - Oriol Fornes
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC V5Z 4H4, Canada
| | - Daria D Nikolaeva
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Leninskiye gory 1-73, Moscow, Russia, 119234
| | | | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| | - Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
| | - Vsevolod Makeev
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Gubkina 3, Moscow, Russia, 119991
- Moscow Institute of Physics and Technology (State University), Institutskiy per. 9, Dolgoprudny, Russia, 141700
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilova 32, Moscow, Russia, 119991
| | - Ivan Kulakovskiy
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Gubkina 3, Moscow, Russia, 119991
- Institute of Protein Research, Russian Academy of Sciences, Institutskaya 4, Pushchino, Russia, 142290
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilova 32, Moscow, Russia, 119991
| | - Philipp Bucher
- School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland.
- Swiss Institute of Bioinformatics (SIB), CH-1015, Lausanne, Switzerland.
| |
|
17
|
Yang J, Ma A, Hoppe AD, Wang C, Li Y, Zhang C, Wang Y, Liu B, Ma Q. Prediction of regulatory motifs from human Chip-sequencing data using a deep learning framework. Nucleic Acids Res 2019; 47:7809-7824. [PMID: 31372637 PMCID: PMC6735894 DOI: 10.1093/nar/gkz672] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Received: 07/12/2019] [Accepted: 07/23/2019] [Indexed: 11/24/2022] Open
Abstract
The identification of transcription factor binding sites and cis-regulatory motifs is a frontier on which the rules governing protein–DNA binding are being revealed. Here, we developed a new method (DEep Sequence and Shape mOtif or DESSO) for cis-regulatory motif prediction using deep neural networks and the binomial distribution model. DESSO outperformed existing tools, including DeepBind, in predicting motifs in 690 human ENCODE ChIP-sequencing datasets. Furthermore, the deep-learning framework of DESSO expanded motif discovery beyond the state-of-the-art by allowing the identification of known and new protein–protein–DNA tethering interactions in human transcription factors (TFs). Specifically, 61 putative tethering interactions were identified among the 100 TFs expressed in the K562 cell line. In this work, the power of DESSO was further expanded by integrating the detection of DNA shape features. We found that shape information has strong predictive power for TF–DNA binding and provides new putative shape motif information for human TFs. Thus, DESSO improves the identification and structural analysis of TF binding sites by integrating the complexities of DNA binding into a deep-learning framework.
Affiliation(s)
- Jinyu Yang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX 76010, USA
| | - Anjun Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
| | - Adam D Hoppe
- Department of Chemistry and Biochemistry, South Dakota State University, Brookings, SD 57007, USA
- BioSNTR, Brookings, SD 57007, USA
| | - Cankun Wang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
| | - Yang Li
- School of Mathematics, Shandong University, Jinan 250100, China
| | - Chi Zhang
- Department of Medical and Molecular Genetics, School of Medicine, Indiana University, Indianapolis, IN 46202, USA
| | - Yan Wang
- School of Artificial Intelligence, Jilin University, Changchun 130012, China
| | - Bingqiang Liu
- School of Mathematics, Shandong University, Jinan 250100, China
| | - Qin Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
| |
|