1
Le TT, Dang XT. Predicting TF-Target Gene Association Using a Heterogeneous Network and Enhanced Negative Sampling. Bioinform Biol Insights 2025; 19:11779322251316130. [PMID: 40012937 PMCID: PMC11863233 DOI: 10.1177/11779322251316130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2024] [Accepted: 01/10/2025] [Indexed: 02/28/2025]
Abstract
Identifying interactions between transcription factors (TFs) and target genes is crucial for understanding the molecular mechanisms involved in biological processes and diseases. Traditional biological experiments used to determine these interactions are often time-consuming, costly, and limited in scale. Current computational methods mainly predict binding sites rather than direct interactions. Although recent studies have achieved high performance in predicting TF-target gene associations, they still face a significant challenge related to constructing a robust dataset of positive and negative samples. Currently, methods do not adequately focus on selecting negative samples, resulting in incomplete coverage of potential TF-target gene relationships. This article proposes a method to select enhanced negative samples to improve the prediction performance of TF-target gene interactions. Experimental results show that the proposed method achieves an average area under the curve (AUC) value of 0.9024 ± 0.0008 through 5-fold cross-validation. These results demonstrate the model's high efficiency and accuracy, confirming its potential application in predicting TF-target gene interactions across various datasets and paving the way for large-scale biomedical research.
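The abstract does not describe how the enhanced negatives are chosen. For orientation only, the following is a minimal sketch of the common baseline that such methods improve on: drawing negatives uniformly at random from unobserved TF-gene pairs. The function name `sample_negatives` and the toy identifiers are hypothetical and are not taken from the paper.

```python
import random

def sample_negatives(tfs, genes, positives, n, seed=0):
    """Baseline (assumed, not the paper's method): pick n unobserved
    (TF, gene) pairs uniformly at random and treat them as negatives."""
    rng = random.Random(seed)
    positive_set = set(positives)
    # Every TF-gene pair that is not a known interaction is a candidate.
    candidates = [(t, g) for t in tfs for g in genes
                  if (t, g) not in positive_set]
    if n > len(candidates):
        raise ValueError("not enough unobserved pairs to sample from")
    return rng.sample(candidates, n)

# Toy data, purely illustrative.
tfs = ["TF1", "TF2"]
genes = ["G1", "G2", "G3"]
positives = [("TF1", "G1"), ("TF2", "G3")]

# A balanced dataset uses as many negatives as positives.
negatives = sample_negatives(tfs, genes, positives, n=len(positives))
```

Because unobserved pairs may include true but undiscovered interactions, random sampling can mislabel positives as negatives, which is the weakness that motivates more careful negative selection.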
Affiliation(s)
- Thanh Tuoi Le
  - Faculty of Information Technology, Hanoi National University of Education, Hanoi, Vietnam
  - Faculty of Information Technology, Vinh University of Technology Education, Vinh, Vietnam
- Xuan Tho Dang
  - Faculty of Digital Economics, Academy of Policy and Development, Hanoi, Vietnam
2
Wanniarachchi DV, Viswakula S, Wickramasuriya AM. The evaluation of transcription factor binding site prediction tools in human and Arabidopsis genomes. BMC Bioinformatics 2024; 25:371. [PMID: 39623329 PMCID: PMC11613939 DOI: 10.1186/s12859-024-05995-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2024] [Accepted: 11/21/2024] [Indexed: 12/06/2024]
Abstract
BACKGROUND The precise prediction of transcription factor binding sites (TFBSs) is pivotal for unraveling the gene regulatory networks underlying biological processes. While numerous tools have emerged for in silico TFBS prediction in recent years, the evolving landscape of computational biology necessitates thorough assessments of tool performance to ensure accuracy and reliability. Only a limited number of studies have been conducted to evaluate the performance of TFBS prediction tools comprehensively. Thus, the present study focused on assessing twelve widely used TFBS prediction tools and four de novo motif discovery tools using a benchmark dataset comprising real, generic, Markov, and negative sequences. TFBSs of Arabidopsis thaliana and Homo sapiens genomes downloaded from the JASPAR database were implanted in these sequences and the performance of tools was evaluated using several statistical parameters at different overlap percentages between the lengths of known and predicted binding sites. RESULTS Overall, the Multiple Cluster Alignment and Search Tool (MCAST) emerged as the best TFBS prediction tool, followed by Find Individual Motif Occurrences (FIMO) and MOtif Occurrence Detection Suite (MOODS). In addition, MotEvo and Dinucleotide Weight Tensor Toolbox (DWT-toolbox) demonstrated the highest sensitivity in identifying TFBSs at 90% and 80% overlap. Further, MCAST and DWT-toolbox demonstrated the highest sensitivity across all three data types: real, generic, and Markov. Among the de novo motif discovery tools, the Multiple Em for Motif Elicitation (MEME) emerged as the best performer. An analysis of the promoter regions of genes involved in the anthocyanin biosynthesis pathway in plants and the pentose phosphate pathway in humans, using the three best-performing tools, revealed considerable variation among the top 20 motifs identified by these tools.
CONCLUSION The findings of this study lay a robust groundwork for selecting optimal TFBS prediction tools for future research. Given the variability observed in tool performance, employing multiple tools for identifying TFBSs in a set of sequences is highly recommended. In addition, further studies are recommended to develop an integrated toolbox that incorporates TFBS prediction or motif discovery tools, aiming to streamline result precision and accuracy.
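The evaluation above scores predictions at different overlap percentages between known and predicted sites. A minimal sketch of one way such a criterion could be computed follows. It assumes half-open `(start, end)` coordinates and measures overlap relative to the length of the known site; the study's exact definition is not given in the abstract, and the function names are hypothetical.

```python
def overlap_fraction(known, predicted):
    """Fraction of the known site covered by the predicted site.

    Both sites are half-open (start, end) intervals on the same sequence.
    Normalising by the known site's length is an assumption made here.
    """
    k_start, k_end = known
    p_start, p_end = predicted
    # Length of the intersection, clamped at zero for disjoint intervals.
    overlap = max(0, min(k_end, p_end) - max(k_start, p_start))
    return overlap / (k_end - k_start)

def is_hit(known, predicted, threshold=0.9):
    """Count a prediction as correct when the overlap reaches the threshold."""
    return overlap_fraction(known, predicted) >= threshold

known_site = (100, 110)      # hypothetical implanted site, 10 bp long
predicted_site = (102, 112)  # hypothetical prediction shifted by 2 bp

fraction = overlap_fraction(known_site, predicted_site)
```

With these toy coordinates the prediction covers 8 of the 10 known positions, so it counts as a hit at an 80% threshold but not at 90%.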
Affiliation(s)
- Dinithi V Wanniarachchi
  - Department of Plant Sciences, Faculty of Science, University of Colombo, Colombo 03, Sri Lanka
- Sameera Viswakula
  - Department of Statistics, Faculty of Science, University of Colombo, Colombo 03, Sri Lanka
3
Luo H, Tang L, Zeng M, Yin R, Ding P, Luo L, Li M. BertSNR: an interpretable deep learning framework for single-nucleotide resolution identification of transcription factor binding sites based on DNA language model. Bioinformatics 2024; 40:btae461. [PMID: 39107889 PMCID: PMC11310455 DOI: 10.1093/bioinformatics/btae461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2024] [Revised: 06/07/2024] [Indexed: 08/10/2024]
Abstract
MOTIVATION Transcription factors are pivotal in the regulation of gene expression, and accurate identification of transcription factor binding sites (TFBSs) at high resolution is crucial for understanding the mechanisms underlying gene regulation. The task of identifying TFBSs from DNA sequences is a significant challenge in the field of computational biology today. To address this challenge, a variety of computational approaches have been developed. However, these methods face limitations in their ability to achieve high-resolution identification and often lack interpretability. RESULTS We propose BertSNR, an interpretable deep learning framework for identifying TFBSs at single-nucleotide resolution. BertSNR integrates sequence-level and token-level information by multi-task learning based on pre-trained DNA language models. Benchmarking comparisons show that our BertSNR outperforms the existing state-of-the-art methods in TFBS predictions. Importantly, we enhanced the interpretability of the model through attentional weight visualization and motif analysis, and discovered the subtle relationship between attention weight and motif. Moreover, BertSNR effectively identifies TFBSs in promoter regions, facilitating the study of intricate gene regulation. AVAILABILITY AND IMPLEMENTATION The BertSNR source code can be found at https://github.com/lhy0322/BertSNR.
Affiliation(s)
- Hanyu Luo
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
  - School of Computer Science, University of South China, Hengyang, Hunan 421001, China
- Li Tang
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Min Zeng
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Rui Yin
  - Department of Health Outcome and Biomedical Informatics, University of Florida, Gainesville, FL 32611, United States
- Pingjian Ding
  - Center for Artificial Intelligence in Drug Discovery, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, United States
- Lingyun Luo
  - School of Computer Science, University of South China, Hengyang, Hunan 421001, China
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
4
Ghosh N, Santoni D, Saha I, Felici G. Predicting Transcription Factor Binding Sites with Deep Learning. Int J Mol Sci 2024; 25:4990. [PMID: 38732207 PMCID: PMC11084193 DOI: 10.3390/ijms25094990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2024] [Accepted: 04/28/2024] [Indexed: 05/13/2024]
Abstract
Prediction of binding sites for transcription factors is important to understand how the latter regulate gene expression and how this regulation can be modulated for therapeutic purposes. A consistent number of references address this issue with different approaches, Machine Learning being one of the most successful. Nevertheless, we note that many such approaches fail to propose a robust and meaningful method to embed the genetic data under analysis. We try to overcome this problem by proposing a bidirectional transformer-based encoder, empowered by bidirectional long-short term memory layers and with a capsule layer responsible for the final prediction. To evaluate the efficiency of the proposed approach, we use benchmark ChIP-seq datasets of five cell lines available in the ENCODE repository (A549, GM12878, Hep-G2, H1-hESC, and Hela). The results show that the proposed method can predict TFBS within the five different cell lines very well; moreover, cross-cell predictions provide satisfactory results as well. Experiments conducted across cell lines are reinforced by the analysis of five additional lines used only to test the model trained using the others. The results confirm that prediction across cell lines remains very high, allowing an extensive cross-transcription factor analysis to be performed from which several indications of interest for molecular biology may be drawn.
Affiliation(s)
- Nimisha Ghosh
  - Department of Computer Science and Information Technology, Institute of Technical Education and Research, Siksha ’O’ Anusandhan (Deemed to be University), Bhubaneswar 751030, India
- Daniele Santoni
  - Institute for System Analysis and Computer Science “Antonio Ruberti”, National Research Council of Italy, 00185 Rome, Italy
- Indrajit Saha
  - Department of Computer Science and Engineering, National Institute of Technical Teachers’ Training and Research, Kolkata 700106, India
- Giovanni Felici
  - Institute for System Analysis and Computer Science “Antonio Ruberti”, National Research Council of Italy, 00185 Rome, Italy
5
Yan W, Li Z, Pian C, Wu Y. PlantBind: an attention-based multi-label neural network for predicting plant transcription factor binding sites. Brief Bioinform 2022; 23:6713513. [PMID: 36155619 DOI: 10.1093/bib/bbac425] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/18/2022] [Revised: 08/29/2022] [Accepted: 08/31/2022] [Indexed: 12/14/2022]
Abstract
Identification of transcription factor binding sites (TFBSs) is essential to understanding of gene regulation. Designing computational models for accurate prediction of TFBSs is crucial because it is not feasible to experimentally assay all transcription factors (TFs) in all sequenced eukaryotic genomes. Although many methods have been proposed for the identification of TFBSs in humans, methods designed for plants are comparatively underdeveloped. Here, we present PlantBind, a method for integrated prediction and interpretation of TFBSs based on DNA sequences and DNA shape profiles. Built on an attention-based multi-label deep learning framework, PlantBind not only simultaneously predicts the potential binding sites of 315 TFs, but also identifies the motifs bound by transcription factors. During the training process, this model revealed a strong similarity among TF family members with respect to target binding sequences. Trans-species prediction performance using four Zea mays TFs demonstrated the suitability of this model for transfer learning. Overall, this study provides an effective solution for identifying plant TFBSs, which will promote greater understanding of transcriptional regulatory mechanisms in plants.
Affiliation(s)
- Zutan Li
  - Nanjing Agricultural University
- Cong Pian
  - College of Sciences at Nanjing Agricultural University
- Yufeng Wu
  - State Key Laboratory for Crop Genetics and Germplasm Enhancement, Bioinformatics Center, College of Agriculture, Academy for Advanced Interdisciplinary Studies at Nanjing Agricultural University
6
Yi R, Cho K, Bonneau R. NetTIME: a Multitask and Base-pair Resolution Framework for Improved Transcription Factor Binding Site Prediction. Bioinformatics 2022; 38:4762-4770. [PMID: 35997560 PMCID: PMC9563695 DOI: 10.1093/bioinformatics/btac569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2021] [Revised: 08/16/2022] [Accepted: 08/20/2022] [Indexed: 12/05/2022]
Abstract
Motivation Machine learning models for predicting cell-type-specific transcription factor (TF) binding sites have become increasingly more accurate thanks to the increased availability of next-generation sequencing data and more standardized model evaluation criteria. However, knowledge transfer from data-rich to data-limited TFs and cell types remains crucial for improving TF binding prediction models because available binding labels are highly skewed towards a small collection of TFs and cell types. Transfer prediction of TF binding sites can potentially benefit from a multitask learning approach; however, existing methods typically use shallow single-task models to generate low-resolution predictions. Here, we propose NetTIME, a multitask learning framework for predicting cell-type-specific TF binding sites with base-pair resolution. Results We show that the multitask learning strategy for TF binding prediction is more efficient than the single-task approach due to the increased data availability. NetTIME trains high-dimensional embedding vectors to distinguish TF and cell-type identities. We show that this approach is critical for the success of the multitask learning strategy and allows our model to make accurate transfer predictions within and beyond the training panels of TFs and cell types. We additionally train a linear-chain conditional random field (CRF) to classify binding predictions and show that this CRF eliminates the need for setting a probability threshold and reduces classification noise. We compare our method’s predictive performance with two state-of-the-art methods, Catchitt and Leopard, and show that our method outperforms previous methods under both supervised and transfer learning settings. Availability and implementation NetTIME is freely available at https://github.com/ryi06/NetTIME and the code is also archived at https://doi.org/10.5281/zenodo.6994897. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ren Yi
  - Department of Computer Science, New York University, New York, NY, 10011, USA
- Kyunghyun Cho
  - Department of Computer Science, New York University, New York, NY, 10011, USA
  - Center for Data Science, New York University, New York, NY, 10011, USA
  - Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
- Richard Bonneau
  - Department of Computer Science, New York University, New York, NY, 10011, USA
  - Center for Data Science, New York University, New York, NY, 10011, USA
  - Department of Biology, New York University, New York, NY, 10003, USA
  - Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
7
Kim C, Wang X, Kültz D. Prediction and Experimental Validation of a New Salinity-Responsive Cis-Regulatory Element (CRE) in a Tilapia Cell Line. Life (Basel) 2022; 12:787. [PMID: 35743818 PMCID: PMC9225295 DOI: 10.3390/life12060787] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2022] [Revised: 05/12/2022] [Accepted: 05/16/2022] [Indexed: 11/16/2022]
Abstract
Transcriptional regulation is a major mechanism by which organisms integrate gene x environment interactions. It can be achieved by coordinated interplay between cis-regulatory elements (CREs) and transcription factors (TFs). Euryhaline tilapia (Oreochromis mossambicus) tolerate a wide range of salinity and thus are an appropriate model to examine transcriptional regulatory mechanisms during salinity stress in fish. Quantitative proteomics in combination with the transcription inhibitor actinomycin D revealed 19 proteins that are transcriptionally upregulated by hyperosmolality in tilapia brain (OmB) cells. We searched the extended proximal promoter up to intron1 of each corresponding gene for common motifs using motif discovery tools. The top-ranked motif identified (STREME1) represents a binding site for the Forkhead box TF L1 (FoxL1). STREME1 function during hyperosmolality was experimentally validated by choosing two of the 19 genes, chloride intracellular channel 2 (clic2) and uridine phosphorylase 1 (upp1), that are enriched in STREME1 in their extended promoters. Transcriptional induction of these genes during hyperosmolality requires STREME1, as evidenced by motif mutagenesis. We conclude that STREME1 represents a new functional CRE that contributes to gene x environment interactions during salinity stress in tilapia. Moreover, our results indicate that FoxL1 family TFs contribute to hyperosmotic induction of genes in euryhaline fish.
Affiliation(s)
- Chanhee Kim
  - Stress-Induced Evolution Laboratory, Department of Animal Sciences, University of California, Davis, CA 95616, USA
- Xiaodan Wang
  - Laboratory of Aquaculture Nutrition and Environmental Health, School of Life Sciences, East China Normal University, Shanghai 200241, China
- Dietmar Kültz
  - Stress-Induced Evolution Laboratory, Department of Animal Sciences, University of California, Davis, CA 95616, USA
8
Du ZH, Wu YH, Huang YA, Chen J, Pan GQ, Hu L, You ZH, Li JQ. GraphTGI: an attention-based graph embedding model for predicting TF-target gene interactions. Brief Bioinform 2022; 23:6576453. [PMID: 35511108 DOI: 10.1093/bib/bbac148] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/28/2021] [Revised: 03/25/2022] [Accepted: 03/31/2022] [Indexed: 12/26/2022]
Abstract
MOTIVATION Interaction between transcription factor (TF) and its target genes establishes the knowledge foundation for biological researches in transcriptional regulation, the number of which is, however, still limited by biological techniques. Existing computational methods relevant to the prediction of TF-target interactions are mostly proposed for predicting binding sites, rather than directly predicting the interactions. To this end, we propose here a graph attention-based autoencoder model to predict TF-target gene interactions using the information of the known TF-target gene interaction network combined with two sequential and chemical gene characters, considering that the unobserved interactions between transcription factors and target genes can be predicted by learning the pattern of the known ones. To the best of our knowledge, the proposed model is the first attempt to solve this problem by learning patterns from the known TF-target gene interaction network. RESULTS In this paper, we formulate the prediction task of TF-target gene interactions as a link prediction problem on a complex knowledge graph and propose a deep learning model called GraphTGI, which is composed of a graph attention-based encoder and a bilinear decoder. We evaluated the prediction performance of the proposed method on a real dataset, and the experimental results show that the proposed model yields outstanding performance with an average AUC value of 0.8864 +/- 0.0057 in the 5-fold cross-validation. It is anticipated that the GraphTGI model can effectively and efficiently predict TF-target gene interactions on a large scale. AVAILABILITY Python code and the datasets used in our studies are made available at https://github.com/YanghanWu/GraphTGI.
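The abstract names a bilinear decoder for link prediction without giving its form. As a rough illustration, a bilinear decoder in its usual formulation scores a TF embedding against a gene embedding through a weight matrix, and a sigmoid can map that score to a probability. The sketch below assumes that standard form. The embeddings, the matrix `W`, and the sigmoid output are illustrative assumptions, not details from GraphTGI.

```python
import math

def bilinear_score(z_tf, W, z_gene):
    """Return sigmoid(z_tf^T W z_gene) for plain-list vectors and matrix.

    The sigmoid on top of the bilinear form is an assumption made here
    so the output reads as an interaction probability.
    """
    s = sum(z_tf[i] * W[i][j] * z_gene[j]
            for i in range(len(z_tf))
            for j in range(len(z_gene)))
    return 1.0 / (1.0 + math.exp(-s))

# Hypothetical 2-dimensional embeddings, as an encoder might produce them.
z_tf = [1.0, 0.0]
z_gene = [0.0, 2.0]
# Hypothetical weight matrix; in a trained model this would be learned.
W = [[0.0, 1.0],
     [1.0, 0.0]]

probability = bilinear_score(z_tf, W, z_gene)
```

Here the bilinear form evaluates to 2, so the predicted probability is sigmoid(2), about 0.88. A zero weight matrix would give 0.5 for every pair.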
Affiliation(s)
- Zhi-Hua Du
  - College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
- Yang-Han Wu
  - College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
- Yu-An Huang
  - College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
- Jie Chen
  - College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
- Gui-Qing Pan
  - College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
- Lun Hu
  - College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
- Zhu-Hong You
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Jian-Qiang Li
  - College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
9
Phage_UniR_LGBM: Phage Virion Proteins Classification with UniRep Features and LightGBM Model. Computational and Mathematical Methods in Medicine 2022; 2022:9470683. [PMID: 35465015 PMCID: PMC9033350 DOI: 10.1155/2022/9470683] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 01/27/2022] [Accepted: 03/15/2022] [Indexed: 11/23/2022]
Abstract
Phage, the most prevalent creature on the planet, serves a variety of critical roles. Phage's primary role is to facilitate gene-to-gene communication. Phage proteins can be divided into virion proteins and nonvirion proteins. Experimental identification is currently a difficult process that requires a significant amount of laboratory time and expense. Given this situation, it is critical to design practical computational techniques and develop well-performing tools. In this work, Phage_UniR_LGBM is proposed to classify virion proteins. In detail, the model uses UniRep as the feature representation and the LightGBM algorithm as the classifier. The model is then trained on the training data and evaluated on the testing data with cross-validation. Phage_UniR_LGBM was compared with several state-of-the-art features and classification algorithms. Phage_UniR_LGBM achieves 88.51% in Sp, 89.89% in Sn, 89.18% in Acc, 0.7873 in MCC, and 0.8925 in F1 score.
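The five reported metrics all derive from the confusion matrix of a binary classifier. The sketch below computes them with their standard definitions, which the abstract is presumed to follow. The counts are hypothetical and are not the paper's data.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)  # accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)       # F1 score
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Sn": sn, "Sp": sp, "Acc": acc, "F1": f1, "MCC": mcc}

# Hypothetical counts: 100 virion and 100 nonvirion test proteins.
metrics = binary_metrics(tp=80, fn=20, tn=90, fp=10)
```

MCC is useful alongside accuracy because it stays informative when the two classes are imbalanced, and it ranges from -1 to 1 rather than 0 to 1.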
10
Zhang S, Ma A, Zhao J, Xu D, Ma Q, Wang Y. Assessing deep learning methods in cis-regulatory motif finding based on genomic sequencing data. Brief Bioinform 2022; 23:bbab374. [PMID: 34607350 PMCID: PMC8769700 DOI: 10.1093/bib/bbab374] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 05/19/2021] [Revised: 08/22/2021] [Accepted: 08/23/2021] [Indexed: 12/28/2022]
Abstract
Identifying cis-regulatory motifs from genomic sequencing data (e.g. ChIP-seq and CLIP-seq) is crucial in identifying transcription factor (TF) binding sites and inferring gene regulatory mechanisms for any organism. Since 2015, deep learning (DL) methods have been widely applied to identify TF binding sites and predict motif patterns, with the strengths of offering a scalable, flexible and unified computational approach for highly accurate predictions. As far as we know, 20 DL methods have been developed. However, without a clear and systematic assessment, users will struggle to choose the most appropriate tool for their specific studies. In this manuscript, we evaluated 20 DL methods for cis-regulatory motif prediction using 690 ENCODE ChIP-seq, 126 cancer ChIP-seq and 55 RNA CLIP-seq data. Four metrics were investigated, including the accuracy of motif finding, the performance of DNA/RNA sequence classification, algorithm scalability and tool usability. The assessment results demonstrated the high complementarity of the existing DL methods. It was determined that the most suitable model should primarily depend on the data size and type and the method's outputs.
Affiliation(s)
- Shuangquan Zhang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
- Anjun Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Jing Zhao
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Dong Xu
  - Department of Electrical Engineering and Computer Science, and Christopher S. Bond Life Science Center, University of Missouri, MO, 65211, USA
- Qin Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Yan Wang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
  - School of Artificial Intelligence, Jilin University, Changchun, 130012, China
11
Jiang Z, Xiao SR, Liu R. Dissecting and predicting different types of binding sites in nucleic acids based on structural information. Brief Bioinform 2021; 23:6384399. [PMID: 34624074 PMCID: PMC8769709 DOI: 10.1093/bib/bbab411] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/07/2021] [Revised: 08/26/2021] [Accepted: 09/07/2021] [Indexed: 12/16/2022]
Abstract
The biological functions of DNA and RNA generally depend on their interactions with other molecules, such as small ligands, proteins and nucleic acids. However, our knowledge of the nucleic acid binding sites for different interaction partners is very limited, and identification of these critical binding regions is not a trivial work. Herein, we performed a comprehensive comparison between binding and nonbinding sites and among different categories of binding sites in these two nucleic acid classes. From the structural perspective, RNA may interact with ligands through forming binding pockets and contact proteins and nucleic acids using protruding surfaces, while DNA may adopt regions closer to the middle of the chain to make contacts with other molecules. Based on structural information, we established a feature-based ensemble learning classifier to identify the binding sites by fully using the interplay among different machine learning algorithms, feature spaces and sample spaces. Meanwhile, we designed a template-based classifier by exploiting structural conservation. The complementarity between the two classifiers motivated us to build an integrative framework for improving prediction performance. Moreover, we utilized a post-processing procedure based on the random walk algorithm to further correct the integrative predictions. Our unified prediction framework yielded promising results for different binding sites and outperformed existing methods.
Affiliation(s)
- Zheng Jiang
  - College of Informatics, Huazhong Agricultural University, Wuhan, P. R. China
- Si-Rui Xiao
  - College of Informatics, Huazhong Agricultural University, Wuhan, P. R. China
- Rong Liu
  - College of Informatics, Huazhong Agricultural University, Wuhan, P. R. China
12
Zhang Y, Wang Z, Zeng Y, Zhou J, Zou Q. High-resolution transcription factor binding sites prediction improved performance and interpretability by deep learning method. Brief Bioinform 2021; 22:6322761. [PMID: 34272562 DOI: 10.1093/bib/bbab273] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 04/30/2021] [Revised: 06/19/2021] [Accepted: 06/25/2021] [Indexed: 11/14/2022]
Abstract
Transcription factors (TFs) are essential proteins in regulating the spatiotemporal expression of genes. It is crucial to infer the potential transcription factor binding sites (TFBSs) with high resolution to promote biology and realize precision medicine. Recently, deep learning-based models have shown exemplary performance in the prediction of TFBSs at the base-pair level. However, the previous models fail to integrate nucleotide position information and semantic information without noisy responses. Thus, there is still room for improvement. Moreover, both the inner mechanism and prediction results of these models are challenging to interpret. To this end, the Deep Attentive Encoder-Decoder Neural Network (D-AEDNet) is developed to identify the location of TFs-DNA binding sites in DNA sequences. In particular, our model adopts Skip Architecture to leverage the nucleotide position information in the encoder and removes noisy responses in the information fusion process by Attention Gate. Simultaneously, the Transcription Factor Motif Discovery based on Sliding Window (TF-MoDSW), an approach to discover TFs-DNA binding motifs by utilizing the output of neural networks, is proposed to understand the biological meaning of the predicted result. On ChIP-exo datasets, experimental results show that D-AEDNet has better performance than competing methods. Besides, we show through visualization analysis that Attention Gate can improve the interpretability of our model. Furthermore, by conducting experiments on ChIP-seq datasets, we confirm that D-AEDNet learns TFs-DNA binding motifs better than state-of-the-art methods and that TF-MoDSW can discover biological sequence motifs in TFs-DNA interactions.
Affiliation(s)
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Zixuan Wang
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Yuanqi Zeng
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Jiliu Zhou
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, China
13
Salekin S, Mostavi M, Chiu YC, Chen Y, Zhang J(M), Huang Y. Predicting sites of epitranscriptome modifications using unsupervised representation learning based on generative adversarial networks. Frontiers in Physics 2020; 8:196. [PMID: 33274189 PMCID: PMC7710330 DOI: 10.3389/fphy.2020.00196] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 06/12/2023]
Abstract
The epitranscriptome is an exciting area of study concerned with the different types of modifications in transcripts, and predicting such modification sites from the transcript sequence is of significant interest. However, the scarcity of positive sites for most modifications poses critical challenges for training robust algorithms. To circumvent this problem, we propose MR-GAN, a generative adversarial network (GAN)-based model that is trained in an unsupervised fashion on entire pre-mRNA sequences to learn a low-dimensional embedding of transcriptomic sequences. MR-GAN was then applied to extract embeddings of the sequences in a training dataset we created for eight epitranscriptome modifications, including m6A, m1A, m1G, m2G, m5C, m5U, 2'-O-Me, Pseudouridine (Ψ) and Dihydrouridine (D), of which the positive samples are very limited. Prediction models were trained on the embeddings extracted by MR-GAN. We compared the prediction performance with that of one-hot encoding of the training sequences and of SRAMP, a state-of-the-art m6A site prediction algorithm, and demonstrated that the learned embeddings outperform one-hot encoding by a significant margin, with up to 15% improvement. Using MR-GAN, we also investigated the sequence motifs for each modification type and uncovered known motifs as well as new motifs that could not be found from the sequences directly. The results demonstrated that transcriptome features extracted using unsupervised learning could lead to high precision for predicting multiple types of epitranscriptome modifications, even when the data size is small and extremely imbalanced.
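For readers unfamiliar with the one-hot baseline that the learned embeddings are compared against, a minimal sketch follows. The function name, the RNA alphabet order, and the all-zero row for unknown bases are assumptions made for illustration and are not taken from the paper.

```python
def one_hot_encode(sequence, alphabet="ACGU"):
    """Encode an RNA sequence as a list of one-hot rows.

    Each base becomes a row with a single 1 in the column of that base.
    Bases outside the alphabet (for example N) become an all-zero row,
    which is an illustrative convention, not the paper's.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded
```

A sequence of length n yields an n x 4 matrix, so the representation carries no information beyond base identity at each position; a learned embedding, by contrast, can encode context from the sequences it was trained on.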
Affiliation(s)
- Sirajul Salekin
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX, 78207, USA
- Milad Mostavi
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX, 78207, USA
- Yu-Chiao Chiu
- Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX, 78229, USA
- Yidong Chen
- Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX, 78229, USA
- Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, TX, 78229, USA
- Jianqiu (Michelle) Zhang
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX, 78207, USA
- Yufei Huang
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX, 78207, USA
- Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, TX, 78229, USA
|
14
|
Mostavi M, Salekin S, Huang Y. Deep-2'-O-Me: Predicting 2'-O-methylation sites by Convolutional Neural Networks. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2019; 2018:2394-2397. [PMID: 30440889 DOI: 10.1109/embc.2018.8512780] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 11/10/2022]
Abstract
2'-O-methylation (2'-O-me) of the ribose moiety is one of the significant and ubiquitous post-transcriptional RNA modifications and is vital for the metabolism and functions of RNA. Although the recent development of a new technology (Nm-seq) has enabled biologists to find the precise location of 2'-O-me in RNA sequences, there is still a lack of computational tools that can also provide high-resolution prediction of this RNA modification. In this paper, we propose a deep learning-based method that uses an embedding method to learn complex feature representations of pre-mRNA sequences and employs a Convolutional Neural Network (CNN) to fine-tune the features required for accurate prediction of this modification. Specifically, we adopted dna2vec, a biological sequence embedding method originally inspired by the word2vec model of text analysis, to yield embedded representations of sequences that may or may not contain 2'-O-me sites before feeding those features into the CNN for classification. Our model was trained using data collected from the Nm-seq experiment. The proposed method achieved AUC and auPRC scores of 90%, outperforming existing state-of-the-art algorithms by a significant margin in both balanced and unbalanced class testing scenarios.
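Word2vec-style sequence embeddings operate on tokens rather than raw characters, so a sequence is first split into k-mer "words". The sketch below shows only that tokenization step and a simple vocabulary lookup. It is not the dna2vec library's API: the function names, the fixed k of 3, and the stride of 1 are assumptions chosen for illustration.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a sequence into overlapping k-mers (hypothetical helper)."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(token_lists):
    """Assign each distinct k-mer an integer id in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_ids(tokens, vocab):
    """Map tokens to ids so they can index rows of an embedding matrix."""
    return [vocab[token] for token in tokens]
```

The resulting id sequence would index into a pretrained embedding matrix, and the stacked embedding vectors would form the input to a convolutional classifier.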
|