1
Li Y, Zou Q, Dai Q, Stalin A, Luo X. Identifying the DNA methylation preference of transcription factors using ProtBERT and SVM. PLoS Comput Biol 2025; 21:e1012513. [PMID: 40359430 DOI: 10.1371/journal.pcbi.1012513] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2024] [Accepted: 04/29/2025] [Indexed: 05/15/2025]
Abstract
Transcription factors (TFs) can affect gene expression by binding to specific DNA sequences. This binding process may be modulated by DNA methylation. A subset of TFs that serve as methylation readers preferentially binds to certain methylated DNA and is defined as TFPM. The identification of TFPMs enhances our understanding of DNA methylation's role in gene regulation. However, their experimental identification is resource-demanding. In this study, we propose a novel two-step computational approach to classify TFs and TFPMs. First, we employed a fine-tuned ProtBERT model to differentiate between the classes of TFs and non-TFs. Second, we combined the Reduced Amino Acid Category (RAAC) with K-mer and SVM to predict the potential of TFs to bind to methylated DNA. Comparative experiments demonstrate that our proposed methods outperform all existing approaches and emphasize the efficiency of our computational framework in classifying TFs and TFPMs. Cross-species validation on an independent mouse dataset further demonstrates the generalizability of our proposed framework. In addition, we conducted predictions on all human transcription factors and found that most of the top 20 proteins belong to the Krueppel C2H2-type Zinc-finger family. So far, some studies have demonstrated a partial correlation between this family and DNA methylation and confirmed the preference of some of its members, thereby showing the robustness of our approach.
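As a rough illustration of the second step described in this abstract, the sketch below maps a protein sequence onto a reduced amino acid alphabet and counts k-mers over it. The five-group clustering, the function names and the choice of k = 2 are assumptions made for illustration only; the abstract does not state which RAAC scheme or which k the authors used.

```python
from collections import Counter
from itertools import product

# Hypothetical 5-group reduced alphabet (illustrative; not taken from the paper).
GROUPS = {
    "a": "AVLIMC",  # aliphatic and sulfur-containing residues
    "r": "FWYH",    # aromatic residues
    "p": "STNQ",    # polar uncharged residues
    "k": "KR",      # positively charged residues
    "e": "DEGP",    # acidic plus small/special residues
}
REDUCE = {aa: label for label, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Replace each residue by its group label; unknown residues are dropped."""
    return "".join(REDUCE[aa] for aa in seq.upper() if aa in REDUCE)


def kmer_frequencies(seq, k=2):
    """Return a fixed-length vector of k-mer frequencies over the reduced alphabet."""
    reduced = reduce_sequence(seq)
    counts = Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    vocab = ["".join(p) for p in product(sorted(GROUPS), repeat=k)]
    return [counts[v] / total for v in vocab]
```

With five groups and k = 2 every protein becomes a 25-dimensional vector, which is the kind of fixed-length input an SVM classifier (for example scikit-learn's `SVC`) expects.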
Affiliation(s)
- Yanchao Li
  - School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Qi Dai
  - College of Life Science and Medicine, Zhejiang Sci-Tech University, Hangzhou, Zhejiang, China
- Antony Stalin
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Ximei Luo
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
2
Gaffar S, Chong KT, Tayara H. TFProtBert: Detection of Transcription Factors Binding to Methylated DNA Using ProtBert Latent Space Representation. Int J Mol Sci 2025; 26:4234. [PMID: 40362469 PMCID: PMC12071566 DOI: 10.3390/ijms26094234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2025] [Revised: 04/22/2025] [Accepted: 04/24/2025] [Indexed: 05/15/2025]
Abstract
Transcription factors (TFs) are fundamental regulators of gene expression and perform diverse functions in cellular processes. The management of 3-dimensional (3D) genome conformation and gene expression relies primarily on TFs. TFs are crucial regulators of gene expression, performing various roles in biological processes. They attract transcriptional machinery to the enhancers or promoters of specific genes, thereby activating or inhibiting transcription. Identifying these TFs is a significant step towards understanding cellular gene expression mechanisms. Due to the time-consuming and labor-intensive nature of experimental methods, the development of computational models is essential. In this work, we introduced a two-layer prediction framework based on a support vector machine (SVM) using the latent space representation of a protein language model, ProtBert. The first layer of the method reliably predicts and identifies transcription factors (TFs), and in the second layer, the proposed method predicts and identifies transcription factors that prefer binding to methylated deoxyribonucleic acid (TFPMs). In addition, we also tested the proposed method on an imbalanced database. In detecting TFs and TFPMs, the proposed model consistently outperformed state-of-the-art approaches, as demonstrated by performance comparisons via empirical cross-validation analysis and independent tests.
Affiliation(s)
- Saima Gaffar
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
  - Advances Electronics and Information Research Centre, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Republic of Korea
3
Luo X, Wang Y, Zou Q, Xu L. Recall DNA methylation levels at low coverage sites using a CNN model in WGBS. PLoS Comput Biol 2023; 19:e1011205. [PMID: 37315069 DOI: 10.1371/journal.pcbi.1011205] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/30/2022] [Accepted: 05/22/2023] [Indexed: 06/16/2023]
Abstract
DNA methylation is an important regulator of gene transcription. WGBS is the gold-standard approach for base-pair-resolution quantification of DNA methylation, but it requires high sequencing depth. Many CpG sites have insufficient coverage in WGBS data, resulting in inaccurate DNA methylation levels at individual sites. Many state-of-the-art computational methods have been proposed to predict the missing values. However, many of them require either other omics datasets or cross-sample data, and most of them only predict the state of DNA methylation. In this study, we proposed RcWGBS, which can impute the missing (or low-coverage) values from the DNA methylation levels of the adjacent sites. Deep learning techniques were employed for accurate prediction. The WGBS datasets of H1-hESC and GM12878 were down-sampled. The average difference between the DNA methylation level at 12× depth predicted by RcWGBS and that at >50× depth in the H1-hESC and GM12878 cells is less than 0.03 and 0.01, respectively. RcWGBS performed better than METHimpute even though the sequencing depth was as low as 12×. Our work would help to process methylation data of low sequencing depth. It is beneficial for researchers to save sequencing costs and improve data utilization through computational methods.
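To make the quantities in this abstract concrete, the sketch below computes a per-site methylation level from read counts, fills low-coverage sites from their nearest reliable neighbours, and measures the average absolute difference against a high-depth reference. The neighbour mean is a deliberately naive stand-in, not the CNN used by RcWGBS, and the function names and the `min_depth` threshold are assumptions for illustration.

```python
def methylation_level(methylated, total):
    """Fraction of reads supporting methylation at one CpG; None if uncovered."""
    return methylated / total if total > 0 else None


def impute_from_neighbours(levels, coverage, min_depth=5):
    """Replace low-coverage values with the mean of the nearest reliable site
    on each side (a naive baseline standing in for a learned model)."""
    reliable = [v is not None and c >= min_depth for v, c in zip(levels, coverage)]
    out = list(levels)
    for i, ok in enumerate(reliable):
        if ok:
            continue
        left = next((levels[j] for j in range(i - 1, -1, -1) if reliable[j]), None)
        right = next((levels[j] for j in range(i + 1, len(levels)) if reliable[j]), None)
        neighbours = [v for v in (left, right) if v is not None]
        out[i] = sum(neighbours) / len(neighbours) if neighbours else None
    return out


def mean_abs_difference(pred, truth):
    """Average |pred - truth| over sites where both values are available."""
    pairs = [(p, t) for p, t in zip(pred, truth) if p is not None and t is not None]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(abs(p - t) for p, t in pairs) / len(pairs)
```

A figure such as the "less than 0.03" quoted above would correspond to `mean_abs_difference` evaluated between down-sampled predictions and the >50× reference levels.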
Affiliation(s)
- Ximei Luo
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Yansu Wang
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
4
Ming Y, Liu H, Cui Y, Guo S, Ding Y, Liu R. Identification of DNA-binding proteins by Kernel Sparse Representation via L2,1-matrix norm. Comput Biol Med 2023; 159:106849. [PMID: 37060772 DOI: 10.1016/j.compbiomed.2023.106849] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2022] [Revised: 02/26/2023] [Accepted: 03/30/2023] [Indexed: 04/17/2023]
Abstract
An understanding of DNA-binding proteins is helpful in exploring the role that proteins play in cell biology. Furthermore, the prediction of DNA-binding proteins is essential for the chemical modification and structural composition of DNA, and is of great importance in protein functional analysis and drug design. In recent years, DNA-binding protein prediction has typically used machine learning-based methods. The prediction accuracy of various classifiers has improved considerably, but researchers continue to spend time and effort on improving prediction performance. In this paper, we combine protein sequence evolutionary information with a classification method based on kernel sparse representation for the prediction of DNA-binding proteins, and based on the field of machine learning, a model for the identification of DNA-binding proteins by sequence information was finally proposed. Based on the confirmation of the final experimental results, we achieved good prediction accuracy on both the PDB1075 and PDB186 datasets. Our training result for cross-validation on PDB1075 was 81.37%, and our independent test result on PDB186 was 83.9%, both of which outperformed the other methods to some extent. Therefore, the proposed method in this paper is proven to be effective and feasible for predicting DNA-binding proteins.
Affiliation(s)
- Yutong Ming
  - School of Computer Science and Engineering, Beijing Technology and Business University, China
- Hongzhi Liu
  - School of Computer Science and Engineering, Beijing Technology and Business University, China
- Yizhi Cui
  - School of Computer Science and Engineering, Beijing Technology and Business University, China
- Shaoyong Guo
  - Beijing University of Posts and Telecommunications, China
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Ruijun Liu
  - School of Computer Science and Engineering, Beijing Technology and Business University, China
5
Wang N, Zhang J, Liu B. iDRBP-EL: Identifying DNA- and RNA- Binding Proteins Based on Hierarchical Ensemble Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:432-441. [PMID: 34932484 DOI: 10.1109/tcbb.2021.3136905] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/04/2023]
Abstract
Identification of DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) from the primary sequences is essential for further exploring protein-nucleic acid interactions. Previous studies have shown that machine-learning-based methods can efficiently identify DBPs or RBPs. However, the information used in these methods is rather limited, and most of them can only predict either DBPs or RBPs. In this study, we proposed a computational predictor, iDRBP-EL, to identify DNA- and RNA-binding proteins, and introduced hierarchical ensemble learning to integrate three levels of information. The method can integrate the information of different features, machine learning algorithms and data into one multi-label model. The ablation experiment showed that the fusion of different information can improve the prediction performance and overcome the cross-prediction problem. Experimental results on the independent datasets showed that iDRBP-EL outperformed all the other competing methods. Moreover, we established a user-friendly webserver iDRBP-EL (http://bliulab.net/iDRBP-EL), which can predict both DBPs and RBPs based only on protein sequences.
6
Identification of DNA-binding proteins via Multi-view LSSVM with independence criterion. Methods 2022; 207:29-37. [PMID: 36087888 DOI: 10.1016/j.ymeth.2022.08.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 08/06/2022] [Accepted: 08/25/2022] [Indexed: 11/24/2022]
Abstract
DNA-binding proteins actively participate in life activities such as DNA replication, recombination, gene expression and regulation and play a prominent role in these processes. As DNA-binding proteins continue to be discovered and increase, it is imperative to design an efficient and accurate identification tool. Considering the time-consuming and expensive traditional experimental technology and the insufficient number of samples in the biological computing method based on structural information, we proposed a machine learning algorithm based on sequence information to identify DNA binding proteins, named multi-view Least Squares Support Vector Machine via Hilbert-Schmidt Independence Criterion (multi-view LSSVM via HSIC). This method took 6 feature sets as multi-view input and trains a single view through the LSSVM algorithm. Then, we integrated HSIC into LSSVM as a regular term to reduce the dependence between views and explored the complementary information of multiple views. Subsequently, we trained and coordinated the submodels and finally combined the submodels in the form of weights to obtain the final prediction model. On training set PDB1075, the prediction results of our model were better than those of most existing methods. Independent tests are conducted on the datasets PDB186 and PDB2272. The accuracy of the prediction results was 85.5% and 79.36%, respectively. This result exceeded the current state-of-the-art methods, which showed that the multi-view LSSVM via HSIC can be used as an efficient predictor.
7
Wang S, Xu D, Gao B, Yan S, Sun Y, Tang X, Jiao Y, Huang S, Zhang S. Heterogeneity Analysis of Bladder Cancer Based on DNA Methylation Molecular Profiling. Front Oncol 2022; 12:915542. [PMID: 35747826 PMCID: PMC9209659 DOI: 10.3389/fonc.2022.915542] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/08/2022] [Accepted: 05/13/2022] [Indexed: 11/13/2022]
Abstract
Bladder cancer is a highly complex and heterogeneous malignancy. Tumor heterogeneity is a barrier to effective diagnosis and treatment of bladder cancer. Human carcinogenesis is closely related to abnormal gene expression, and DNA methylation is an important regulatory factor of gene expression. Therefore, it is of great significance for bladder cancer research to characterize tumor heterogeneity by integrating genetic and epigenetic characteristics. This study explored specific molecular subtypes based on DNA methylation status and identified subtype-specific characteristics using patient samples from the TCGA database in which DNA methylation and gene expression were measured simultaneously. The results were validated using an independent cohort from the GEO database. Four DNA methylation molecular subtypes of bladder cancer were obtained with different prognostic states. In addition, subtype-specific DNA methylation markers were identified using an information entropy-based algorithm to represent the unique molecular characteristics of each subtype and were verified in the test set. The results of this study can provide an important reference for clinicians to make treatment decisions.
Affiliation(s)
- Shuyu Wang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Dali Xu
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Bo Gao
  - Department of Radiology, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Shuhan Yan
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Yiwei Sun
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Xinxing Tang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Yanjia Jiao
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Shan Huang
  - Department of Neurology, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
  - *Correspondence: Shumei Zhang; Shan Huang
- Shumei Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
  - *Correspondence: Shumei Zhang; Shan Huang
8
Chen Y, Gong Y, Dou L, Zhou X, Zhang Y. Bioinformatics analysis methods for cell-free DNA. Comput Biol Med 2022; 143:105283. [PMID: 35149459 DOI: 10.1016/j.compbiomed.2022.105283] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/06/2022] [Revised: 01/29/2022] [Accepted: 01/30/2022] [Indexed: 12/13/2022]
Abstract
As a kind of novel non-invasive marker for molecular detection, cell-free DNA (cfDNA) has potential value for the early diagnosis of diseases, prognosis assessment, and efficacy monitoring. The constant developments in molecular biology detection technologies have led to an increase in clinical studies on the use of cfDNA detection methods for patients, and many gratifying outcomes have been achieved. In this review, the contributions of bioinformatics tools to the study of cfDNA are well discussed. The focus of the review is on cfDNA identification signals, cfDNA identification methods, and the relationship of cfDNA with human diseases such as hepatic cancer, lung cancer, end-stage kidney disease, and ischemic stroke. The research significance and existing problems of using cfDNA as a biomarker for diseases are also discussed.
Affiliation(s)
- Yaojia Chen
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Yuxin Gong
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Lijun Dou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen, China
- Xun Zhou
  - Beidahuang Industry Group General Hospital, Harbin, China
- Ying Zhang
  - Department of Anesthesiology, Hospital (T.C.M) Affiliated to Southwest Medical University, Luzhou, China
9
Chen Z, Jiao S, Zhao D, Zou Q, Xu L, Zhang L, Su X. The Characterization of Structure and Prediction for Aquaporin in Tumour Progression by Machine Learning. Front Cell Dev Biol 2022; 10:845622. [PMID: 35178393 PMCID: PMC8844512 DOI: 10.3389/fcell.2022.845622] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/30/2021] [Accepted: 01/17/2022] [Indexed: 11/21/2022]
Abstract
Recurrence and new cases of cancer constitute a challenging human health problem. Aquaporins (AQPs) can be expressed in many types of tumours, including those of the brain, breast, pancreas, colon, skin, ovaries, and lungs, and the histological grade of cancer is positively correlated with AQP expression. Therefore, the identification of aquaporins is an area to explore. Computational tools play an important role in aquaporin identification. In this research, we propose a reliable, accurate and automated sequence predictor, iAQPs-RF, to identify AQPs. In this study, the feature extraction method was 188D (global protein sequence descriptor, GPSD). Six common classifiers, including random forest (RF), NaiveBayes (NB), support vector machine (SVM), XGBoost, logistic regression (LR) and decision tree (DT), were used for AQP classification. The classification results show that the random forest (RF) algorithm is the most suitable machine learning algorithm, and the accuracy was 97.689%. Analysis of Variance (ANOVA) was used to analyse these characteristics. Feature ranking based on the ANOVA method and the IFS strategy was applied to search for the optimal features. The classification results suggest that the 26th feature (neutral/hydrophobic) and 21st feature (hydrophobic) are the two most powerful and informative features that distinguish AQPs from non-AQPs. Previous studies reported that plasma membrane proteins have hydrophobic characteristics. Aquaporin subcellular localization prediction showed that all aquaporins were plasma membrane proteins with highly conserved transmembrane structures. In addition, the 3D structure of aquaporins was consistent with the localization results. Therefore, these studies confirmed that aquaporins possess hydrophobic properties. Although aquaporins are highly conserved transmembrane structures, the phylogenetic tree shows the diversity of aquaporins during evolution. The PCA showed that positive and negative samples were well separated by 54D features, indicating that the 54D feature can effectively classify aquaporins. The online prediction server is accessible at http://lab.malab.cn/~acy/iAQP.
Affiliation(s)
- Zheng Chen
  - School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, Shenzhen, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Shihu Jiao
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Da Zhao
  - School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, Shenzhen, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Lijun Zhang
  - School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, Shenzhen, China
- Xi Su
  - Foshan Maternal and Child Health Hospital, Foshan, China
10
Zhang S, Zhang J, Zhang Q, Liang Y, Du Y, Wang G. Identification of Prognostic Biomarkers for Bladder Cancer Based on DNA Methylation Profile. Front Cell Dev Biol 2022; 9:817086. [PMID: 35174173 PMCID: PMC8841402 DOI: 10.3389/fcell.2021.817086] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/17/2021] [Accepted: 12/22/2021] [Indexed: 12/14/2022]
Abstract
Background: DNA methylation is an important epigenetic modification that plays an important role in regulating gene expression at the transcriptional level. In tumor research, it has been found that changes in DNA methylation lead to abnormalities in gene structure and function, which can provide early warning of tumorigenesis. Our study aims to explore the relationship between the occurrence and development of tumors and the level of DNA methylation. Moreover, this study will provide a set of prognostic biomarkers that can more accurately predict the survival and health of patients after treatment. Methods: Datasets of bladder cancer patients and control samples were collected from the TCGA database, and differential analysis was employed to obtain genes with differential DNA methylation levels between tumor samples and normal samples. Then the protein-protein interaction network was constructed, and the potential tumor markers were further obtained by extracting hub genes from the subnetwork. A Cox proportional hazard regression model and survival analysis were used to construct the prognostic model and screen out the prognostic markers of bladder cancer, so as to provide a reference for tumor prognosis monitoring and improvement of treatment plans. Results: In this study, we found that DNA methylation was indeed related to the occurrence of bladder cancer. Genes with differential DNA methylation could serve as potential biomarkers for bladder cancer. Through univariate and multivariate Cox proportional hazard regression analysis, we concluded that FASLG and PRKCZ can be used as prognostic biomarkers for bladder cancer. Patients can be classified into high- or low-risk groups by using this two-gene prognostic model. By detecting the methylation status of these genes, we can evaluate the survival of patients. Conclusion: The analysis in our study indicates that the methylation status of tumor-related genes can be used as a prognostic biomarker of bladder cancer.
Affiliation(s)
- Shumei Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Jingyu Zhang
  - Department of Neurology, The Fourth Affiliated Hospital of Harbin Medical University, Harbin, China
- Qichao Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Yingjian Liang
  - Department of General Surgery, The First Affiliated Hospital of Harbin Medical University, Harbin, China
  - Key Laboratory of Hepatosplenic Surgery, Ministry of Education, The First Affiliated Hospital of Harbin Medical University, Harbin, China
- Youwen Du
  - School of Life Sciences, Anhui Medical University, Hefei, China
- Guohua Wang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
  - *Correspondence: Guohua Wang
11
Mouse4mC-BGRU: deep learning for predicting DNA N4-methylcytosine sites in mouse genome. Methods 2022; 204:258-262. [DOI: 10.1016/j.ymeth.2022.01.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/14/2021] [Revised: 01/14/2022] [Accepted: 01/24/2022] [Indexed: 12/12/2022]
12
Li H, Gong Y, Liu Y, Lin H, Wang G. Detection of transcription factors binding to methylated DNA by deep recurrent neural network. Brief Bioinform 2021; 23:6484512. [PMID: 34962264 DOI: 10.1093/bib/bbab533] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 09/22/2021] [Revised: 10/23/2021] [Accepted: 11/19/2021] [Indexed: 12/13/2022]
Abstract
Transcription factors (TFs) are proteins specifically involved in gene expression regulation. It is generally accepted in epigenetics that methylated nucleotides could prevent TFs from binding to DNA fragments. However, recent studies have confirmed that some TFs have the capability to interact with methylated DNA fragments to further regulate gene expression. Although biochemical experiments could recognize TFs binding to methylated DNA sequences, these wet experimental methods are time-consuming and expensive. Machine learning methods provide a good choice for quickly identifying these TFs without experimental materials. Thus, this study aims to design a robust predictor to detect methylated DNA-bound TFs. We first proposed using a tripeptide word vector feature to formulate protein samples. Subsequently, based on a recurrent neural network with long short-term memory, a two-step computational model was designed. The first-step predictor was utilized to discriminate transcription factors from non-transcription factors. Once proteins were predicted as TFs, the second-step predictor was employed to judge whether the TFs can bind to methylated DNA. In the independent dataset test, the accuracies of the first step and the second step are 86.63% and 73.59%, respectively. In addition, the statistical analysis of the distribution of tripeptides in training samples showed that the position and number of some tripeptides in the sequence could affect the binding of TFs to methylated DNA. Finally, a free web server based on the proposed model was established and is available at https://bioinfor.nefu.edu.cn/TFPM/.
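The two-step design described in this abstract can be sketched as a simple cascade in which the second predictor is consulted only for proteins the first one accepts. The function names, the 0.5 threshold, the output labels and the overlapping tripeptide tokenisation are assumptions for illustration; the abstract does not give these details, and the two scoring functions stand in for the trained LSTM models.

```python
from typing import Callable, List


def tripeptides(seq: str) -> List[str]:
    """Split a protein sequence into overlapping 3-residue 'words'."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def two_step_predict(seq: str,
                     tf_score: Callable[[List[str]], float],
                     methyl_score: Callable[[List[str]], float],
                     threshold: float = 0.5) -> str:
    """Run the second classifier only if the first one predicts a TF."""
    words = tripeptides(seq)
    if tf_score(words) < threshold:
        return "non-TF"
    if methyl_score(words) < threshold:
        return "TF"
    return "methylated-DNA-binding TF"
```

One consequence of this cascade is that errors compound: a protein rejected in the first step never reaches the second, so the overall accuracy on methylated-DNA-binding TFs is bounded by the recall of the first step.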
Affiliation(s)
- Hongfei Li
  - College of Information and Computer Engineering at Northeast Forestry University of China
- Yue Gong
  - College of Information and Computer Engineering at Northeast Forestry University of China
- Yifeng Liu
  - School of Management at Henan Institute of Technology of China
- Hao Lin
  - Center for Informational Biology at University of Electronic Science and Technology of China
- Guohua Wang
  - College of Information and Computer Engineering at Northeast Forestry University of China
13
Jia Y, Huang S, Zhang T. KK-DBP: A Multi-Feature Fusion Method for DNA-Binding Protein Identification Based on Random Forest. Front Genet 2021; 12:811158. [PMID: 34912382 PMCID: PMC8667860 DOI: 10.3389/fgene.2021.811158] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/08/2021] [Accepted: 11/15/2021] [Indexed: 02/04/2023]
Abstract
DNA-binding protein (DBP) is a protein with a special DNA binding domain that is associated with many important molecular biological mechanisms. Rapid development of computational methods has made it possible to predict DBP on a large scale; however, existing methods do not fully integrate DBP-related features, resulting in rough prediction results. In this article, we develop a DNA-binding protein identification method called KK-DBP. To improve prediction accuracy, we propose a feature extraction method that fuses multiple PSSM features. The experimental results show a prediction accuracy on the independent test dataset PDB186 of 81.22%, which is the highest of all existing methods.
Affiliation(s)
- Yuran Jia
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Shan Huang
  - Department of Neurology, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Tianjiao Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
14
Guo Y, Hou L, Zhu W, Wang P. Prediction of Hormone-Binding Proteins Based on K-mer Feature Representation and Naive Bayes. Front Genet 2021; 12:797641. [PMID: 34887905 PMCID: PMC8650314 DOI: 10.3389/fgene.2021.797641] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/19/2021] [Accepted: 11/05/2021] [Indexed: 11/29/2022]
Abstract
Hormone binding protein (HBP) is a soluble carrier protein that interacts selectively with different types of hormones and has various effects on the body's life activities. HBPs play an important role in the growth process of organisms, but their specific role is still unclear. Therefore, correctly identifying HBPs is the first step towards understanding and studying their biological function. However, due to their high cost and long experimental period, it is difficult for traditional biochemical experiments to correctly identify HBPs from an increasing number of proteins, so the real characterization of HBPs has become a challenging task for researchers. To measure the effectiveness of HBPs, an accurate and reliable prediction model for their identification is desirable. In this paper, we construct the prediction model HBP_NB. First, HBPs data were collected from the UniProt database, and a dataset was established. Then, based on the established high-quality dataset, the k-mer (K = 3) feature representation method was used to extract features. Second, the feature selection algorithm was used to reduce the dimensionality of the extracted features and select the appropriate optimal feature set. Finally, the selected features are input into Naive Bayes to construct the prediction model, and the model is evaluated by using 10-fold cross-validation. The final results were 95.45% accuracy, 94.17% sensitivity and 96.73% specificity. These results indicate that our model is feasible and effective.
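The three figures reported at the end of this abstract are standard confusion-matrix summaries. The sketch below shows how they are computed for binary labels; the function name and the 0/1 label encoding are assumptions for illustration and are not taken from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else float("nan"),
        # Sensitivity: share of true positives among all actual positives.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: share of true negatives among all actual negatives.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }
```

Under 10-fold cross-validation these values would typically be computed on each held-out fold and then averaged, or computed once over the pooled out-of-fold predictions.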
Affiliation(s)
- Yuxin Guo
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Yangtze Delta Region Institute, University of Electronic Science and Technology of China, Quzhou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Liping Hou
- Beidahuang Industry Group General Hospital, Harbin, China
- Wen Zhu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Peng Wang
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
|
15
|
Niu M, Ju Y, Lin C, Zou Q. Characterizing viral circRNAs and their application in identifying circRNAs in viruses. Brief Bioinform 2021; 23:6377516. [PMID: 34585234 DOI: 10.1093/bib/bbab404] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 07/14/2021] [Revised: 08/23/2021] [Accepted: 09/02/2021] [Indexed: 01/19/2023] Open
Abstract
Circular RNAs (circRNAs) are non-coding RNAs with a special circular structure formed by the reverse splicing mechanism, and they play an important role in a variety of biological activities. Viruses can encode circRNAs, and viral circRNAs have been found in multiple single-stranded and double-stranded viruses. However, the characteristics and functions of viral circRNAs remain unknown. Sequence alignment showed that viral circRNAs are less conserved than animal circRNAs, indicating that viral circRNAs may evolve rapidly. Analysis of the sequence characteristics of viral and animal circRNAs showed that they are similar in nucleic acid composition but differ clearly in secondary structure and autocorrelation characteristics. Based on these characteristics of viral circRNAs, machine learning algorithms were employed to construct a prediction model to identify viral circRNAs. Additionally, analysis of the interaction between viral circRNAs and miRNAs showed that viral circRNAs are expected to interact with 518 human miRNAs, and a preliminary analysis of the role of viral circRNAs was carried out. Viral circRNAs may also be involved in many KEGG pathways related to the nervous system and cancer. We curated an online server, and the data and code are available: http://server.malab.cn/viral-CircRNA/.
Affiliation(s)
- Mengting Niu
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Ying Ju
- School of Informatics, Xiamen University, Xiamen, China
- Chen Lin
- School of Informatics, Xiamen University, Xiamen, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
|
16
|
Zhou J, Bo S, Wang H, Zheng L, Liang P, Zuo Y. Identification of Disease-Related 2-Oxoglutarate/Fe (II)-Dependent Oxygenase Based on Reduced Amino Acid Cluster Strategy. Front Cell Dev Biol 2021; 9:707938. [PMID: 34336861 PMCID: PMC8323781 DOI: 10.3389/fcell.2021.707938] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/11/2021] [Accepted: 06/10/2021] [Indexed: 11/17/2022] Open
Abstract
The 2-oxoglutarate/Fe (II)-dependent (2OG) oxygenase superfamily is mainly responsible for protein modification, nucleic acid repair and/or modification, and fatty acid metabolism, and plays important roles in cancer, cardiovascular disease, and other diseases. Its members are likely to become new targets for the treatment of cancer and other diseases, so the accurate identification of 2OG oxygenases is of great significance. Many computational methods have been proposed to predict functional proteins to compensate for time-consuming and expensive experimental identification. However, machine learning has not been applied to the study of 2OG oxygenases. In this study, we developed OGFE_RAAC, a prediction model to identify whether a protein is a 2OG oxygenase. To improve the performance of OGFE_RAAC, 673 amino acid reduction alphabets were used to recode the protein sequences and determine the optimal feature representation scheme. The 10-fold cross-validation test showed that the accuracy of the model in identifying 2OG oxygenases is 91.04%. In addition, the independent dataset results showed that the model generalizes well and is robust. It is expected to become an effective tool for the identification of 2OG oxygenases. Further analysis suggested that the function of 2OG oxygenases may be related to their polarity and hydrophobicity, which will help follow-up studies on the catalytic mechanism of 2OG oxygenases and the way they interact with their substrates. Based on the model we built, a user-friendly web server was established and can be accessed at http://bioinfor.imu.edu.cn/ogferaac.
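The recoding idea behind a reduced amino acid alphabet can be sketched as follows. The six-cluster grouping below is a hypothetical illustration based loosely on physicochemical similarity; it is not one of the 673 reduction alphabets evaluated for OGFE_RAAC, and the function names are assumptions made for this example.

```python
# Hypothetical clustering of the 20 standard residues into 6 groups.
# Each cluster is represented by its first letter after recoding.
CLUSTERS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]


def build_mapping(clusters):
    """Map every residue to the representative letter of its cluster."""
    return {residue: cluster[0] for cluster in clusters for residue in cluster}


def reduce_sequence(sequence, mapping):
    """Recode a protein sequence; residues outside the mapping become 'X'."""
    return "".join(mapping.get(residue, "X") for residue in sequence)


mapping = build_mapping(CLUSTERS)
# Hypothetical toy peptide, used only to show the recoding.
print(reduce_sequence("MKVLDE", mapping))  # prints AKAADD
```

Recoding shrinks the alphabet from 20 letters to 6, so downstream k-mer features computed on the reduced sequence have far fewer dimensions (for example 6^3 = 216 instead of 20^3 = 8000 for 3-mers).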
Affiliation(s)
- Jian Zhou
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Suling Bo
- College of Computer and Information, Inner Mongolia Medical University, Hohhot, China
- Hao Wang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Lei Zheng
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Pengfei Liang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Yongchun Zuo
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
|
17
|
Zeng R, Cheng S, Liao M. 4mCPred-MTL: Accurate Identification of DNA 4mC Sites in Multiple Species Using Multi-Task Deep Learning Based on Multi-Head Attention Mechanism. Front Cell Dev Biol 2021; 9:664669. [PMID: 34041243 PMCID: PMC8141656 DOI: 10.3389/fcell.2021.664669] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 02/05/2021] [Accepted: 03/17/2021] [Indexed: 01/10/2023] Open
Abstract
DNA methylation is one of the most extensive epigenetic modifications. DNA 4mC modification plays a key role in regulating chromatin structure and gene expression. In this study, we propose a generic 4mC computational predictor, 4mCPred-MTL, which uses multi-task learning coupled with a Transformer to predict 4mC sites in multiple species. The predictor uses a multi-task learning framework in which each task trains a Transformer-based model on species-specific data. Extensive experimental results show that our multi-task predictive model significantly improves on the single-task model and outperforms existing methods in benchmarking comparisons. Moreover, we found that our model captures the characteristics of 4mC sites better than commonly used feature descriptors, demonstrating its strong feature learning ability. Based on these results, we expect 4mCPred-MTL to be a useful tool for the research community.
Affiliation(s)
- Rao Zeng
- Department of Software Engineering, School of Informatics, Xiamen University, Xiamen, China
- Song Cheng
- Department of Thoracic Surgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Minghong Liao
- Department of Software Engineering, School of Informatics, Xiamen University, Xiamen, China
|
18
|
Yang X, Ye X, Li X, Wei L. iDNA-MT: Identification DNA Modification Sites in Multiple Species by Using Multi-Task Learning Based a Neural Network Tool. Front Genet 2021; 12:663572. [PMID: 33868390 PMCID: PMC8044371 DOI: 10.3389/fgene.2021.663572] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/03/2021] [Accepted: 03/02/2021] [Indexed: 02/04/2023] Open
Abstract
Motivation: DNA N4-methylcytosine (4mC) and N6-methyladenine (6mA) are two important DNA modifications that play crucial roles in a variety of biological processes. Accurate identification of these modifications is essential to better understand their biological functions and mechanisms. However, existing methods to identify 4mC or 6mA sites are all single-task methods, meaning that they can identify only a certain modification in one species. Therefore, it is desirable to develop a novel computational method to identify the modification sites in multiple species simultaneously. Results: In this study, we proposed a computational method, called iDNA-MT, to identify 4mC sites and 6mA sites in multiple species. The proposed iDNA-MT mainly employs multi-task learning coupled with bidirectional gated recurrent units (BGRU) to capture the information shared among different species directly from DNA primary sequences. Comparative experimental results on two benchmark datasets, each containing different species, show that for identifying either 4mC or 6mA sites in multiple species, the proposed iDNA-MT outperforms other state-of-the-art single-task methods. These promising results demonstrate that iDNA-MT has great potential to be a powerful and practically useful tool for accurately identifying DNA modifications.
Affiliation(s)
- Xiao Yang
- School of Software, Shandong University, Jinan, China
- Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
- Xuehong Li
- Department of Rehabilitation, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Lesong Wei
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
|
19
|
Niu K, Luo X, Zhang S, Teng Z, Zhang T, Zhao Y. iEnhancer-EBLSTM: Identifying Enhancers and Strengths by Ensembles of Bidirectional Long Short-Term Memory. Front Genet 2021; 12:665498. [PMID: 33833783 PMCID: PMC8021722 DOI: 10.3389/fgene.2021.665498] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 02/08/2021] [Accepted: 03/01/2021] [Indexed: 12/26/2022] Open
Abstract
Enhancers are regulatory DNA sequences that can be bound by specific proteins named transcription factors (TFs). The interactions between enhancers and TFs regulate specific genes by increasing target gene expression. Therefore, enhancer identification and classification have been critical issues in the enhancer field. Unfortunately, so far there has been a lack of suitable methods to identify enhancers. Previous research has mainly focused on features of enhancer function and interactions and has ignored sequence information. Recurrent neural network (RNN) and long short-term memory (LSTM) models are currently the most common methods for processing time series data, and LSTM is more suitable than a plain RNN for handling DNA sequences. In this paper, we take advantage of LSTM to build a method named iEnhancer-EBLSTM to identify enhancers. iEnhancer-EBLSTM, which uses ensembles of bidirectional LSTM (EBLSTM), consists of two steps. First, we extract subsequences as features by sliding a 3-mer window along the DNA sequence. Second, the EBLSTM model is used to identify enhancers from the candidate input sequences. We use the dataset from the study of Quang H et al. as the benchmark. The experimental results on this dataset demonstrate the efficiency of our proposed model.
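The first step in the abstract above, sliding a 3-mer window along the DNA sequence, can be sketched as follows. This is an illustration under stated assumptions: a stride of 1, an integer vocabulary over the 64 possible 3-mers, and ID 0 for windows with unknown bases are choices made for this example, and the names `VOCAB`, `sliding_kmers`, and `encode` are hypothetical rather than taken from iEnhancer-EBLSTM.

```python
from itertools import product

# Map each of the 64 possible DNA 3-mers to an integer ID starting at 1.
# ID 0 is reserved here for windows containing unknown bases such as N
# (an assumption of this sketch).
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=3), start=1)}


def sliding_kmers(dna, k=3):
    """Return all overlapping k-mers obtained with a stride of 1."""
    dna = dna.upper()
    return [dna[i:i + k] for i in range(len(dna) - k + 1)]


def encode(dna):
    """Turn a DNA sequence into a list of 3-mer IDs for a sequence model."""
    return [VOCAB.get(kmer, 0) for kmer in sliding_kmers(dna, k=3)]


# Hypothetical toy sequence, not taken from the benchmark dataset.
print(sliding_kmers("ACGTAC"))  # prints ['ACG', 'CGT', 'GTA', 'TAC']
print(encode("ACGTAC"))         # prints [7, 28, 45, 50]
```

A sequence of length L yields L - 2 tokens, and the resulting ID list is the kind of input an embedding layer followed by a bidirectional LSTM would typically consume.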
Affiliation(s)
- Kun Niu
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Ximei Luo
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Shumei Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Zhixia Teng
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Tianjiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Yuming Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
|
20
|
Wang S, Zhang Q, Shen Z, He Y, Chen ZH, Li J, Huang DS. Predicting transcription factor binding sites using DNA shape features based on shared hybrid deep learning architecture. Mol Ther Nucleic Acids 2021; 24:154-163. [PMID: 33767912 PMCID: PMC7972936 DOI: 10.1016/j.omtn.2021.02.014] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 11/19/2020] [Accepted: 02/14/2021] [Indexed: 12/26/2022]
Abstract
The study of transcriptional regulation is fundamental to molecular biology research, yet it remains difficult. Recent research has shown that the double helix structure of DNA plays an important role in improving the accuracy and interpretability of transcription factor binding site (TFBS) prediction. Although several computational methods have been designed to take both DNA sequence and DNA shape features into consideration simultaneously, designing an efficient model remains an open problem. In this paper, we propose a hybrid convolutional recurrent neural network (CNN/RNN) architecture, CRPTS, to predict TFBSs by combining DNA sequence and DNA shape features. The novelty of our proposed method relies on three critical aspects: (1) a shared hybrid CNN and RNN efficiently extracts features from large-scale genomic sequences obtained by high-throughput technology; (2) common patterns are found in DNA sequences and their corresponding DNA shape features; (3) CRPTS can capture local structural information of DNA sequences without completely relying on DNA shape data. A series of comprehensive experiments on 66 in vitro datasets derived from universal protein binding microarrays (uPBMs) shows that CRPTS clearly outperforms the state-of-the-art methods.
Affiliation(s)
- Siguo Wang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
- Qinhu Zhang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Tongji University, Siping Road 1239, Shanghai 200092, China
- Zhen Shen
- School of Computer and Software, Nanyang Institute of Technology, Changjiang Road 80, Nanyang, Henan 473004, China
- Ying He
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
- Zhen-Heng Chen
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Jianqiang Li
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- De-Shuang Huang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
|