1
|
Huang R, Qiu W, Xiao X, Lin W. iProtDNA-SMOTE: Enhancing protein-DNA binding sites prediction through imbalanced graph neural networks. PLoS One 2025; 20:e0320817. [PMID: 40359455 PMCID: PMC12074593 DOI: 10.1371/journal.pone.0320817] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2024] [Accepted: 02/24/2025] [Indexed: 05/15/2025] Open
Abstract
Protein-DNA interactions play a crucial role in cellular biology, essential for maintaining life processes and regulating cellular functions. We propose a method called iProtDNA-SMOTE, which utilizes non-equilibrium graph neural networks along with pre-trained protein language models to predict DNA binding residues. This approach effectively addresses the class imbalance issue in predicting protein-DNA binding sites by leveraging unbalanced graph data, thus enhancing model's generalization and specificity. We trained the model on two datasets, TR646 and TR573, and conducted a series of experiments to evaluate its performance. The model achieved AUC values of 0.850, 0.896, and 0.858 on the independent test datasets TE46, TE129, and TE181, respectively. These results indicate that iProtDNA-SMOTE outperforms existing methods in terms of accuracy and generalization for predicting DNA binding sites, offering reliable and effective predictions to minimize errors. The model has been thoroughly validated for its ability to predict protein-DNA binding sites with high reliability and precision. For the convenience of the scientific community, the benchmark datasets and codes are publicly available at https://github.com/primrosehry/iProtDNA-SMOTE.
Collapse
Affiliation(s)
- Ruiyan Huang
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen Jiangxi, China
| | - Wangren Qiu
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen Jiangxi, China
| | - Xuan Xiao
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen Jiangxi, China
- School of Information Engineering, Jingxi Art & Ceramics Technology Institute, Jingdezhen Jiangxi, China
| | - Weizhong Lin
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen Jiangxi, China
| |
Collapse
|
2
|
Zankov D, Madzhidov T, Varnek A, Polishchuk P. Chemical complexity challenge: Is multi‐instance machine learning a solution? WIRES COMPUTATIONAL MOLECULAR SCIENCE 2024; 14. [DOI: 10.1002/wcms.1698] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/08/2023] [Accepted: 11/07/2023] [Indexed: 01/03/2025]
Abstract
AbstractMolecules are complex dynamic objects that can exist in different molecular forms (conformations, tautomers, stereoisomers, protonation states, etc.) and often it is not known which molecular form is responsible for observed physicochemical and biological properties of a given molecule. This raises the problem of the selection of the correct molecular form for machine learning modeling of target properties. The same problem is common to biological molecules (RNA, DNA, proteins)—long sequences where only key segments, which often cannot be located precisely, are involved in biological functions. Multi‐instance machine learning (MIL) is an efficient approach for solving problems where objects under study cannot be uniquely represented by a single instance, but rather by a set of multiple alternative instances. Multi‐instance learning was formalized in 1997 and motivated by the problem of conformation selection in drug activity prediction tasks. Since then MIL has found a lot of applications in various domains, such as information retrieval, computer vision, signal processing, bankruptcy prediction, and so on. In the given review we describe the MIL framework and its applications to the tasks associated with ambiguity in the representation of small and biological molecules in chemoinformatics and bioinformatics. We have collected examples that demonstrate the advantages of MIL over the traditional single‐instance learning (SIL) approach. Special attention was paid to the ability of MIL models to identify key instances responsible for a modeling property.This article is categorized under:Data Science > ChemoinformaticsData Science > Artificial Intelligence/Machine Learning
Collapse
Affiliation(s)
| | | | - Alexandre Varnek
- ICReDD Hokkaido University Sapporo Japan
- Laboratory of Chemoinformatics University of Strasbourg Strasbourg France
| | - Pavel Polishchuk
- Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry Palacky University Olomouc Olomouc Czech Republic
| |
Collapse
|
3
|
Zhou J, Wang X, Wei Z, Meng J, Huang D. 4acCPred: Weakly supervised prediction of N4-acetyldeoxycytosine DNA modification from sequences. MOLECULAR THERAPY - NUCLEIC ACIDS 2022; 30:337-345. [DOI: 10.1016/j.omtn.2022.10.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/09/2022] [Accepted: 10/12/2022] [Indexed: 11/06/2022]
|
4
|
Multiple instance neural networks based on sparse attention for cancer detection using T-cell receptor sequences. BMC Bioinformatics 2022; 23:469. [PMID: 36348271 PMCID: PMC9644450 DOI: 10.1186/s12859-022-05012-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2022] [Accepted: 10/26/2022] [Indexed: 11/11/2022] Open
Abstract
Early detection of cancers has been much explored due to its paramount importance in biomedical fields. Among different types of data used to answer this biological question, studies based on T cell receptors (TCRs) are under recent spotlight due to the growing appreciation of the roles of the host immunity system in tumor biology. However, the one-to-many correspondence between a patient and multiple TCR sequences hinders researchers from simply adopting classical statistical/machine learning methods. There were recent attempts to model this type of data in the context of multiple instance learning (MIL). Despite the novel application of MIL to cancer detection using TCR sequences and the demonstrated adequate performance in several tumor types, there is still room for improvement, especially for certain cancer types. Furthermore, explainable neural network models are not fully investigated for this application. In this article, we propose multiple instance neural networks based on sparse attention (MINN-SA) to enhance the performance in cancer detection and explainability. The sparse attention structure drops out uninformative instances in each bag, achieving both interpretability and better predictive performance in combination with the skip connection. Our experiments show that MINN-SA yields the highest area under the ROC curve scores on average measured across 10 different types of cancers, compared to existing MIL approaches. Moreover, we observe from the estimated attentions that MINN-SA can identify the TCRs that are specific for tumor antigens in the same T cell repertoire.
Collapse
|
5
|
Pan-cancer identification of the relationship of metabolism-related differentially expressed transcription regulation with non-differentially expressed target genes via a gated recurrent unit network. Comput Biol Med 2022; 148:105883. [PMID: 35878490 DOI: 10.1016/j.compbiomed.2022.105883] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2022] [Revised: 07/10/2022] [Accepted: 07/16/2022] [Indexed: 11/20/2022]
Abstract
The transcriptome describes the expression of all genes in a sample. Most studies have investigated the differential patterns or discrimination powers of transcript expression levels. In this study, we hypothesized that the quantitative correlations between the expression levels of transcription factors (TFs) and their regulated target genes (mRNAs) serve as a novel view of healthy status, and a disease sample exhibits a differential landscape (mqTrans) of transcription regulations compared with healthy status. We formulated quantitative transcription regulation relationships of metabolism-related genes as a multi-input multi-output regression model via a gated recurrent unit (GRU) network. The GRU model was trained using healthy blood transcriptomes and the expression levels of mRNAs were predicted by those of the TFs. The mqTrans feature of a gene was defined as the difference between its predicted and actual expression levels. A pan-cancer investigation of the differentially expressed mqTrans features was conducted between the early- and late-stage cancers in 26 cancer types of The Cancer Genome Atlas database. This study focused on the differentially expressed mqTrans features, that did not show differential expression in the actual expression levels. These genes could not be detected by conventional differential analysis. Such dark biomarkers are worthy of further wet-lab investigation. The experimental data also showed that the proposed mqTrans investigation improved the classification between early- and late-stage samples for some cancer types. Thus, the mqTrans features serve as a complementary view to transcriptomes, an OMIC type with mature high-throughput production technologies, and abundant public resources.
Collapse
|
6
|
Zhang Y, Bao W, Cao Y, Cong H, Chen B, Chen Y. A survey on protein–DNA-binding sites in computational biology. Brief Funct Genomics 2022; 21:357-375. [DOI: 10.1093/bfgp/elac009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2022] [Revised: 04/07/2022] [Accepted: 04/22/2022] [Indexed: 01/08/2023] Open
Abstract
Abstract
Transcription factors are important cellular components of the process of gene expression control. Transcription factor binding sites are locations where transcription factors specifically recognize DNA sequences, targeting gene-specific regions and recruiting transcription factors or chromatin regulators to fine-tune spatiotemporal gene regulation. As the common proteins, transcription factors play a meaningful role in life-related activities. In the face of the increase in the protein sequence, it is urgent how to predict the structure and function of the protein effectively. At present, protein–DNA-binding site prediction methods are based on traditional machine learning algorithms and deep learning algorithms. In the early stage, we usually used the development method based on traditional machine learning algorithm to predict protein–DNA-binding sites. In recent years, methods based on deep learning to predict protein–DNA-binding sites from sequence data have achieved remarkable success. Various statistical and machine learning methods used to predict the function of DNA-binding proteins have been proposed and continuously improved. Existing deep learning methods for predicting protein–DNA-binding sites can be roughly divided into three categories: convolutional neural network (CNN), recursive neural network (RNN) and hybrid neural network based on CNN–RNN. The purpose of this review is to provide an overview of the computational and experimental methods applied in the field of protein–DNA-binding site prediction today. This paper introduces the methods of traditional machine learning and deep learning in protein–DNA-binding site prediction from the aspects of data processing characteristics of existing learning frameworks and differences between basic learning model frameworks. Our existing methods are relatively simple compared with natural language processing, computational vision, computer graphics and other fields. Therefore, the summary of existing protein–DNA-binding site prediction methods will help researchers better understand this field.
Collapse
|
7
|
Huang D, Song B, Wei J, Su J, Coenen F, Meng J. Weakly supervised learning of RNA modifications from low-resolution epitranscriptome data. Bioinformatics 2021; 37:i222-i230. [PMID: 34252943 PMCID: PMC8336446 DOI: 10.1093/bioinformatics/btab278] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
Abstract
Motivation Increasing evidence suggests that post-transcriptional ribonucleic acid (RNA) modifications regulate essential biomolecular functions and are related to the pathogenesis of various diseases. Precise identification of RNA modification sites is essential for understanding the regulatory mechanisms of RNAs. To date, many computational approaches for predicting RNA modifications have been developed, most of which were based on strong supervision enabled by base-resolution epitranscriptome data. However, high-resolution data may not be available. Results We propose WeakRM, the first weakly supervised learning framework for predicting RNA modifications from low-resolution epitranscriptome datasets, such as those generated from acRIP-seq and hMeRIP-seq. Evaluations on three independent datasets (corresponding to three different RNA modification types and their respective sequencing technologies) demonstrated the effectiveness of our approach in predicting RNA modifications from low-resolution data. WeakRM outperformed state-of-the-art multi-instance learning methods for genomic sequences, such as WSCNN, which was originally designed for transcription factor binding site prediction. Additionally, our approach captured motifs that are consistent with existing knowledge, and visualization of the predicted modification-containing regions unveiled the potentials of detecting RNA modifications with improved resolution. Availability implementation The source code for the WeakRM algorithm, along with the datasets used, are freely accessible at: https://github.com/daiyun02211/WeakRM Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Daiyun Huang
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China.,Department of Computer Science, University of Liverpool, Liverpool L69 7ZB, UK
| | - Bowen Song
- Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China.,Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
| | - Jingjue Wei
- Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
| | - Jionglong Su
- School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China.,AI University Research Centre, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
| | - Frans Coenen
- Department of Computer Science, University of Liverpool, Liverpool L69 7ZB, UK
| | - Jia Meng
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China.,Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK.,AI University Research Centre, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
| |
Collapse
|
8
|
Prasasty VD, Hutagalung RA, Gunadi R, Sofia DY, Rosmalena R, Yazid F, Sinaga E. Prediction of human-Streptococcus pneumoniae protein-protein interactions using logistic regression. Comput Biol Chem 2021; 92:107492. [PMID: 33964803 DOI: 10.1016/j.compbiolchem.2021.107492] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2021] [Accepted: 04/21/2021] [Indexed: 02/07/2023]
Abstract
Streptococcus pneumoniae is a major cause of mortality in children under five years old. In recent years, the emergence of antibiotic-resistant strains of S. pneumoniae increases the threat level of this pathogen. For that reason, the exploration of S. pneumoniae protein virulence factors should be considered in developing new drugs or vaccines, for instance by the analysis of host-pathogen protein-protein interactions (HP-PPIs). In this research, prediction of protein-protein interactions was performed with a logistic regression model with the number of protein domain occurrences as features. By utilizing HP-PPIs of three different pathogens as training data, the model achieved 57-77 % precision, 64-75 % recall, and 96-98 % specificity. Prediction of human-S. pneumoniae protein-protein interactions using the model yielded 5823 interactions involving thirty S. pneumoniae proteins and 324 human proteins. Pathway enrichment analysis showed that most of the pathways involved in the predicted interactions are immune system pathways. Network topology analysis revealed β-galactosidase (BgaA) as the most central among the S. pneumoniae proteins in the predicted HP-PPI networks, with a degree centrality of 1.0 and a betweenness centrality of 0.451853. Further experimental studies are required to validate the predicted interactions and examine their roles in S. pneumoniae infection.
Collapse
Affiliation(s)
- Vivitri Dewi Prasasty
- Faculty of Biotechnology, Atma Jaya Catholic University of Indonesia, Jakarta, 12930, Indonesia.
| | - Rory Anthony Hutagalung
- Faculty of Biotechnology, Atma Jaya Catholic University of Indonesia, Jakarta, 12930, Indonesia
| | - Reinhart Gunadi
- Department of Biology, Faculty of Life Sciences, Universitas Surya, Tangerang, Banten, 15143, Indonesia
| | - Dewi Yustika Sofia
- Department of Biology, Faculty of Life Sciences, Universitas Surya, Tangerang, Banten, 15143, Indonesia
| | - Rosmalena Rosmalena
- Department of Medical Chemistry, Faculty of Medicine, Universitas Indonesia, Jakarta, 10430, Indonesia
| | - Fatmawaty Yazid
- Department of Medical Chemistry, Faculty of Medicine, Universitas Indonesia, Jakarta, 10430, Indonesia
| | - Ernawati Sinaga
- Faculty of Biology, Universitas Nasional, Jakarta, 12520, Indonesia.
| |
Collapse
|
9
|
Zhang Q, Zhu L, Bao W, Huang DS. Weakly-Supervised Convolutional Neural Network Architecture for Predicting Protein-DNA Binding. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:679-689. [PMID: 30106688 DOI: 10.1109/tcbb.2018.2864203] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Although convolutional neural networks (CNN) have outperformed conventional methods in predicting the sequence specificities of protein-DNA binding in recent years, they do not take full advantage of the intrinsic weakly-supervised information of DNA sequences that a bound sequence may contain multiple TFBS(s). Here, we propose a weakly-supervised convolutional neural network architecture (WSCNN), combining multiple-instance learning (MIL) with CNN, to further boost the performance of predicting protein-DNA binding. WSCNN first divides each DNA sequence into multiple overlapping subsequences (instances) with a sliding window, and then separately models each instance using CNN, and finally fuses the predicted scores of all instances in the same bag using four fusion methods, including Max, Average, Linear Regression, and Top-Bottom Instances. The experimental results on in vivo and in vitro datasets illustrate the performance of the proposed approach. Moreover, models built on in vitro data using WSCNN can predict in vivo protein-DNA binding with good accuracy. In addition, we give a quantitative analysis of the importance of the reverse-complement mode in predicting in vivo protein-DNA binding, and explain why not directly use advanced pooling layers to combine MIL with CNN, through a series of experiments.
Collapse
|
10
|
Wang W, Langlois R, Langlois M, Genchev GZ, Wang X, Lu H. Functional Site Discovery From Incomplete Training Data: A Case Study With Nucleic Acid-Binding Proteins. Front Genet 2019; 10:729. [PMID: 31543893 PMCID: PMC6729729 DOI: 10.3389/fgene.2019.00729] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2018] [Accepted: 07/11/2019] [Indexed: 12/27/2022] Open
Abstract
Function annotation efforts provide a foundation to our understanding of cellular processes and the functioning of the living cell. This motivates high-throughput computational methods to characterize new protein members of a particular function. Research work has focused on discriminative machine-learning methods, which promise to make efficient, de novo predictions of protein function. Furthermore, available function annotation exists predominantly for individual proteins rather than residues of which only a subset is necessary for the conveyance of a particular function. This limits discriminative approaches to predicting functions for which there is sufficient residue-level annotation, e.g., identification of DNA-binding proteins or where an excellent global representation can be divined. Complete understanding of the various functions of proteins requires discovery and functional annotation at the residue level. Herein, we cast this problem into the setting of multiple-instance learning, which only requires knowledge of the protein’s function yet identifies functionally relevant residues and need not rely on homology. We developed a new multiple-instance leaning algorithm derived from AdaBoost and benchmarked this algorithm against two well-studied protein function prediction tasks: annotating proteins that bind DNA and RNA. This algorithm outperforms certain previous approaches in annotating protein function while identifying functionally relevant residues involved in binding both DNA and RNA, and on one protein-DNA benchmark, it achieves near perfect classification.
Collapse
Affiliation(s)
- Wenchuan Wang
- SJTU-Yale Joint Center for Biostatistics and Data Science, Department of Bioinformatics and Biostatistics, College of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, Chinas
| | - Robert Langlois
- Department of Bioengineering and Department of Computer Science, University of Illinois at Chicago, Chicago, IL, United States
| | - Marina Langlois
- Department of Bioengineering and Department of Computer Science, University of Illinois at Chicago, Chicago, IL, United States
| | - Georgi Z Genchev
- SJTU-Yale Joint Center for Biostatistics and Data Science, Department of Bioinformatics and Biostatistics, College of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, Chinas.,Department of Bioengineering and Department of Computer Science, University of Illinois at Chicago, Chicago, IL, United States.,Bulgarian Institute for Genomics and Precision Medicine, Sofia, Bulgaria
| | - Xiaolei Wang
- SJTU-Yale Joint Center for Biostatistics and Data Science, Department of Bioinformatics and Biostatistics, College of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, Chinas.,Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
| | - Hui Lu
- SJTU-Yale Joint Center for Biostatistics and Data Science, Department of Bioinformatics and Biostatistics, College of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, Chinas.,Department of Bioengineering and Department of Computer Science, University of Illinois at Chicago, Chicago, IL, United States.,Center for Biomedical Informatics, Shanghai Children's Hospital, Shanghai, China
| |
Collapse
|
11
|
Zhang Q, Shen Z, Huang DS. Modeling in-vivo protein-DNA binding by combining multiple-instance learning with a hybrid deep neural network. Sci Rep 2019; 9:8484. [PMID: 31186519 PMCID: PMC6559991 DOI: 10.1038/s41598-019-44966-x] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2018] [Accepted: 05/15/2019] [Indexed: 01/26/2023] Open
Abstract
Modeling in-vivo protein-DNA binding is not only fundamental for further understanding of the regulatory mechanisms, but also a challenging task in computational biology. Deep-learning based methods have succeed in modeling in-vivo protein-DNA binding, but they often (1) follow the fully supervised learning framework and overlook the weakly supervised information of genomic sequences that a bound DNA sequence may has multiple TFBS(s), and, (2) use one-hot encoding to encode DNA sequences and ignore the dependencies among nucleotides. In this paper, we propose a weakly supervised framework, which combines multiple-instance learning with a hybrid deep neural network and uses k-mer encoding to transform DNA sequences, for modeling in-vivo protein-DNA binding. Firstly, this framework segments sequences into multiple overlapping instances using a sliding window, and then encodes all instances into image-like inputs of high-order dependencies using k-mer encoding. Secondly, it separately computes a score for all instances in the same bag using a hybrid deep neural network that integrates convolutional and recurrent neural networks. Finally, it integrates the predicted values of all instances as the final prediction of this bag using the Noisy-and method. The experimental results on in-vivo datasets demonstrate the superior performance of the proposed framework. In addition, we also explore the performance of the proposed framework when using k-mer encoding, and demonstrate the performance of the Noisy-and method by comparing it with other fusion methods, and find that adding recurrent layers can improve the performance of the proposed framework.
Collapse
Affiliation(s)
- Qinhu Zhang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - Zhen Shen
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China.
| |
Collapse
|
12
|
Hu J, Wang J, Lin J, Liu T, Zhong Y, Liu J, Zheng Y, Gao Y, He J, Shang X. MD-SVM: a novel SVM-based algorithm for the motif discovery of transcription factor binding sites. BMC Bioinformatics 2019; 20:200. [PMID: 31074373 PMCID: PMC6509868 DOI: 10.1186/s12859-019-2735-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND Transcription factors (TFs) play important roles in the regulation of gene expression. They can activate or block transcription of downstream genes in a manner of binding to specific genomic sequences. Therefore, motif discovery of these binding preference patterns is of central significance in the understanding of molecular regulation mechanism. Many algorithms have been proposed for the identification of transcription factor binding sites. However, it remains a challengeable problem. RESULTS Here, we proposed a novel motif discovery algorithm based on support vector machine (MD-SVM) to learn a discriminative model for TF binding sites. MD-SVM firstly obtains position weight matrix (PWM) from a set of training datasets. Then it translates the MD problem into a computational framework of multiple instance learning (MIL). It was applied to several real biological datasets. Results show that our algorithm outperforms MI-SVM in terms of both accuracy and specificity. CONCLUSIONS In this paper, we modeled the TF motif discovery problem as a MIL optimization problem. The SVM algorithm was adapted to discriminate positive and negative bags of instances. Compared to other svm-based algorithms, MD-SVM show its superiority over its competitors in term of ROC AUC. Hopefully, it could be of benefit to the research community in the understanding of molecular functions of DNA functional elements and transcription factors.
Collapse
Affiliation(s)
- Jialu Hu
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
- Centre of Multidisciplinary Convergence Computing, School of Computer Science, Northwestern Polytechnical University, 1 Dong Xiang Road, Xi’an, 710129 China
| | - Jingru Wang
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Jianan Lin
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Tianwei Liu
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Yuanke Zhong
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Jie Liu
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Yan Zheng
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Yiqun Gao
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Junhao He
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| |
Collapse
|
13
|
Abstract
Motivation The discovery of transcription factor binding site (TFBS) motifs is essential for untangling the complex mechanism of genetic variation under different developmental and environmental conditions. Among the huge amount of computational approaches for de novo identification of TFBS motifs, discriminative motif learning (DML) methods have been proven to be promising for harnessing the discovery power of accumulated huge amount of high-throughput binding data. However, they have to sacrifice accuracy for speed and could fail to fully utilize the information of the input sequences. Results We propose a novel algorithm called CDAUC for optimizing DML-learned motifs based on the area under the receiver-operating characteristic curve (AUC) criterion, which has been widely used in the literature to evaluate the significance of extracted motifs. We show that when the considered AUC loss function is optimized in a coordinate-wise manner, the cost function of each resultant sub-problem is a piece-wise constant function, whose optimal value can be found exactly and efficiently. Further, a key step of each iteration of CDAUC can be efficiently solved as a computational geometry problem. Experimental results on real world high-throughput datasets illustrate that CDAUC outperforms competing methods for refining DML motifs, while being one order of magnitude faster. Meanwhile, preliminary results also show that CDAUC may also be useful for improving the interpretability of convolutional kernels generated by the emerging deep learning approaches for predicting TF sequences specificities. Availability and Implementation CDAUC is available at: https://drive.google.com/drive/folders/0BxOW5MtIZbJjNFpCeHlBVWJHeW8. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Lin Zhu
- Institute of Machine Learning and Systems Biology, Department of College of Electronics and Information Engineering, Tongji University, Shanghai, China
| | - Hong-Bo Zhang
- Institute of Machine Learning and Systems Biology, Department of College of Electronics and Information Engineering, Tongji University, Shanghai, China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, Department of College of Electronics and Information Engineering, Tongji University, Shanghai, China
| |
Collapse
|