1
|
Wei X, Hu T, Wu D, Yang F, Zhao C, Lu Y. ECCT: Efficient Contrastive Clustering via Pseudo-Siamese Vision Transformer and Multi-view Augmentation. Neural Netw 2024; 180:106684. [PMID: 39243506 DOI: 10.1016/j.neunet.2024.106684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2024] [Revised: 07/30/2024] [Accepted: 08/30/2024] [Indexed: 09/09/2024]
Abstract
Image clustering aims to divide a set of unlabeled images into multiple clusters. Recently, clustering methods based on contrastive learning have attracted much attention due to their ability to learn discriminative feature representations. Nevertheless, existing clustering algorithms face challenges in capturing global information and preserving semantic continuity. Additionally, these methods often exhibit relatively singular feature distributions, limiting the full potential of contrastive learning in clustering. These problems can have a negative impact on the performance of image clustering. To address the above problems, we propose a deep clustering framework termed Efficient Contrastive Clustering via Pseudo-Siamese Vision Transformer and Multi-view Augmentation (ECCT). The core idea is to introduce Vision Transformer (ViT) to provide the global view, and improve it with Hilbert Patch Embedding (HPE) module to construct a new ViT branch. Finally, we fuse the features extracted from the two ViT branches to obtain both global view and semantic coherence. In addition, we employ multi-view random aggressive augmentation to broaden the feature distribution, enabling the model to learn more comprehensive and richer contrastive features. Our results on five datasets demonstrate that ECCT outperforms previous clustering methods. In particular, the ARI metric of ECCT on the STL-10 (ImageNet-Dogs) dataset is 0.852 (0.424), which is 10.3% (4.8%) higher than the best baseline.
Collapse
Affiliation(s)
- Xing Wei
- School of Computer and Information, Hefei University of Technology, Hefei, 230601, China; Intelligent Manufacturing Technology Research Institute, Hefei University of Technology, Hefei, 230601, China.
| | - Taizhang Hu
- School of Computer and Information, Hefei University of Technology, Hefei, 230601, China.
| | - Di Wu
- School of Computer and Artificial Intelligence, Hefei Normal University, Hefei, 230601, China.
| | - Fan Yang
- School of Computer and Information, Hefei University of Technology, Hefei, 230601, China; Intelligent Manufacturing Technology Research Institute, Hefei University of Technology, Hefei, 230601, China.
| | - Chong Zhao
- School of Computer and Information, Hefei University of Technology, Hefei, 230601, China; Intelligent Manufacturing Technology Research Institute, Hefei University of Technology, Hefei, 230601, China.
| | - Yang Lu
- School of Computer and Information, Hefei University of Technology, Hefei, 230601, China.
| |
Collapse
|
2
|
Zou HT, Ji BY, Xie XL. A multi-source molecular network representation model for protein-protein interactions prediction. Sci Rep 2024; 14:6184. [PMID: 38485942 PMCID: PMC10940665 DOI: 10.1038/s41598-024-56286-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2023] [Accepted: 03/05/2024] [Indexed: 03/18/2024] Open
Abstract
The prediction of potential protein-protein interactions (PPIs) is a critical step in decoding diseases and understanding cellular mechanisms. Traditional biological experiments have identified plenty of potential PPIs in recent years, but this problem is still far from being solved. Hence, there is urgent to develop computational models with good performance and high efficiency to predict potential PPIs. In this study, we propose a multi-source molecular network representation learning model (called MultiPPIs) to predict potential protein-protein interactions. Specifically, we first extract the protein sequence features according to the physicochemical properties of amino acids by utilizing the auto covariance method. Second, a multi-source association network is constructed by integrating the known associations among miRNAs, proteins, lncRNAs, drugs, and diseases. The graph representation learning method, DeepWalk, is adopted to extract the multisource association information of proteins with other biomolecules. In this way, the known protein-protein interaction pairs can be represented as a concatenation of the protein sequence and the multi-source association features of proteins. Finally, the Random Forest classifier and corresponding optimal parameters are used for training and prediction. In the results, MultiPPIs obtains an average 86.03% prediction accuracy with 82.69% sensitivity at the AUC of 93.03% under five-fold cross-validation. The experimental results indicate that MultiPPIs has a good prediction performance and provides valuable insights into the field of potential protein-protein interactions prediction. MultiPPIs is free available at https://github.com/jiboyalab/multiPPIs .
Collapse
Affiliation(s)
- Hai-Tao Zou
- College of Information Science and Engineering, Guilin University of Technology, Guilin, 541000, China
| | - Bo-Ya Ji
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410000, China.
| | - Xiao-Lan Xie
- College of Information Science and Engineering, Guilin University of Technology, Guilin, 541000, China.
| |
Collapse
|
3
|
Cheng X, Yan C, Jiang H, Qiu Y. scHOIS: Determining Cell Heterogeneity Through Hierarchical Clustering Based on Optimal Imputation Strategy. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1431-1444. [PMID: 37815942 DOI: 10.1109/tcbb.2022.3203592] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/12/2023]
Abstract
Advances in single-cell RNA sequencing (scRNA-seq) technology provide an unbiased and high-throughput analysis of each cell at single-cell resolution, and further facilitate the development of cellular heterogeneity analysis. Despite the promise of scRNA-seq, the data generated by this method are sparse and noisy because of the presence of dropout events, which can greatly impact downstream analyses such as differential gene expression, cell type annotation, and linage trajectory reconstruction. The development of effective and robust computational methods to address both dropout and clustering are thus urgently needed. In this study, we propose a flexible, accurate two-stage algorithm for single cell heterogeneity analysis via hierarchical clustering based on an optimal imputation strategy, called scHOIS. At the first stage, masked non-negative matrix factorization is applied to approximate the original observed scRNA-seq data, with optimal rank determined by variance analysis. At the second stage, hierarchical clustering is applied to group the imputed cells using Pearson correlation to measure similarity, with the optimal number of clusters determined by integrating three classical indexes. We performed extensive experiments on real-world datasets, which showed that scHOIS effectively and robustly distinguished cellular differences and that the clustering performance of this algorithm was superior to that of other state-of-the-art methods.
Collapse
|
4
|
Li X, Fan H, Liu J. Noise-aware clustering based on maximum correntropy criterion and adaptive graph regularization. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.01.024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023]
|
5
|
Huang W, Tan K, Zhang Z, Hu J, Dong S. A Review of Fusion Methods for Omics and Imaging Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:74-93. [PMID: 35044920 DOI: 10.1109/tcbb.2022.3143900] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
The development of omics data and biomedical images has greatly advanced the progress of precision medicine in diagnosis, treatment, and prognosis. The fusion of omics and imaging data, i.e., omics-imaging fusion, offers a new strategy for understanding complex diseases. However, due to a variety of issues such as the limited number of samples, high dimensionality of features, and heterogeneity of different data types, efficiently learning complementary or associated discriminative fusion information from omics and imaging data remains a challenge. Recently, numerous machine learning methods have been proposed to alleviate these problems. In this review, from the perspective of fusion levels and fusion methods, we first provide an overview of preprocessing and feature extraction methods for omics and imaging data, and comprehensively analyze and summarize the basic forms and variations of commonly used and newly emerging fusion methods, along with their advantages, disadvantages and the applicable scope. We then describe public datasets and compare experimental results of various fusion methods on the ADNI and TCGA datasets. Finally, we discuss future prospects and highlight remaining challenges in the field.
Collapse
|
6
|
Ma H. Achieving deep clustering through the use of variational autoencoders and similarity-based loss. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:10344-10360. [PMID: 36031997 DOI: 10.3934/mbe.2022484] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Clustering is an important and challenging research topic in many fields. Although various clustering algorithms have been developed in the past, traditional shallow clustering algorithms cannot mine the underlying structural information of the data. Recent advances have shown that deep clustering can achieve excellent performance on clustering tasks. In this work, a novel variational autoencoder-based deep clustering algorithm is proposed. It treats the Gaussian mixture model as the prior latent space and uses an additional classifier to distinguish different clusters in the latent space accurately. A similarity-based loss function is proposed consisting specifically of the cross-entropy of the predicted transition probabilities of clusters and the Wasserstein distance of the predicted posterior distributions. The new loss encourages the model to learn meaningful cluster-oriented representations to facilitate clustering tasks. The experimental results show that our method consistently achieves competitive results on various data sets.
Collapse
Affiliation(s)
- He Ma
- College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150000, China
| |
Collapse
|
7
|
Gao YL, Wu MJ, Liu JX, Zheng CH, Wang J. Robust Principal Component Analysis Based On Hypergraph Regularization for Sample Clustering and Co-Characteristic Gene Selection. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2420-2430. [PMID: 33690124 DOI: 10.1109/tcbb.2021.3065054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Extracting genes involved in cancer lesions from gene expression data is critical for cancer research and drug development. The method of feature selection has attracted much attention in the field of bioinformatics. Principal Component Analysis (PCA) is a widely used method for learning low-dimensional representation. Some variants of PCA have been proposed to improve the robustness and sparsity of the algorithm. However, the existing methods ignore the high-order relationships between data. In this paper, a new model named Robust Principal Component Analysis via Hypergraph Regularization (HRPCA) is proposed. In detail, HRPCA utilizes L2,1-norm to reduce the effect of outliers and make data sufficiently row-sparse. And the hypergraph regularization is introduced to consider the complex relationship among data. Important information hidden in the data are mined, and this method ensures the accuracy of the resulting data relationship information. Extensive experiments on multi-view biological data demonstrate that the feasible and effective of the proposed approach.
Collapse
|
8
|
Abdalrada AS, Abawajy J, Al-Quraishi T, Islam SMS. Machine learning models for prediction of co-occurrence of diabetes and cardiovascular diseases: a retrospective cohort study. J Diabetes Metab Disord 2022; 21:251-261. [PMID: 35673486 PMCID: PMC9167176 DOI: 10.1007/s40200-021-00968-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/25/2020] [Accepted: 12/29/2021] [Indexed: 12/15/2022]
Abstract
Background Diabetic mellitus (DM) and cardiovascular diseases (CVD) cause significant healthcare burden globally and often co-exists. Current approaches often fail to identify many people with co-occurrence of DM and CVD, leading to delay in healthcare seeking, increased complications and morbidity. In this paper, we aimed to develop and evaluate a two-stage machine learning (ML) model to predict the co-occurrence of DM and CVD. Methods We used the diabetes complications screening research initiative (DiScRi) dataset containing >200 variables from >2000 participants. In the first stage, we used two ML models (logistic regression and Evimp functions) implemented in multivariate adaptive regression splines model to infer the significant common risk factors for DM and CVD and applied the correlation matrix to reduce redundancy. In the second stage, we used classification and regression algorithm to develop our model. We evaluated the prediction models using prediction accuracy, sensitivity and specificity as performance metrics. Results Common risk factors for DM and CVD co-occurrence was family history of the diseases, gender, deep breathing heart rate change, lying to standing blood pressure change, HbA1c, HDL and TC\HDL ratio. The predictive model showed that the participants with HbA1c >6.45 and TC\HDL ratio > 5.5 were at risk of developing both diseases (97.9% probability). In contrast, participants with HbA1c >6.45 and TC\HDL ratio ≤ 5.5 were more likely to have only DM (84.5% probability) and those with HbA1c ≤5.45 and HDL >1.45 were likely to be healthy (82.4%. probability). Further, participants with HbA1c ≤5.45 and HDL <1.45 were at risk of only CVD (100% probability). The predictive accuracy of the ML model to detect co-occurrence of DM and CVD is 94.09%, sensitivity 93.5%, and specificity 95.8%. Conclusions Our ML model can significantly predict with high accuracy the co-occurrence of DM and CVD in people attending a screening program. This might help in early detection of patients with DM and CVD who could benefit from preventive treatment and reduce future healthcare burden.
Collapse
Affiliation(s)
- Ahmad Shaker Abdalrada
- Faculty of Computer Science and Information Technology, Wasit University, Al Kut, Iraq
- School of Information Technology, Deakin University, Melbourne, Victoria Australia
| | - Jemal Abawajy
- School of Information Technology, Deakin University, Melbourne, Victoria Australia
| | - Tahsien Al-Quraishi
- Faculty of Computer Science and Information Technology, Wasit University, Al Kut, Iraq
- School of Information Technology, Deakin University, Melbourne, Victoria Australia
| | - Sheikh Mohammed Shariful Islam
- Institute for Physical Activity and Nutrition, Deakin University, 221 Burwood Highway, Burwood, Melbourne, VIC 3125 Australia
| |
Collapse
|
9
|
Guo ZH, Chen ZH, You ZH, Wang YB, Yi HC, Wang MN. A learning-based method to predict LncRNA-disease associations by combining CNN and ELM. BMC Bioinformatics 2022; 22:622. [PMID: 35317723 PMCID: PMC8941737 DOI: 10.1186/s12859-022-04611-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2021] [Accepted: 10/07/2021] [Indexed: 11/10/2022] Open
Abstract
Background lncRNAs play a critical role in numerous biological processes and life activities, especially diseases. Considering that traditional wet experiments for identifying uncovered lncRNA-disease associations is limited in terms of time consumption and labor cost. It is imperative to construct reliable and efficient computational models as addition for practice. Deep learning technologies have been proved to make impressive contributions in many areas, but the feasibility of it in bioinformatics has not been adequately verified. Results In this paper, a machine learning-based model called LDACE was proposed to predict potential lncRNA-disease associations by combining Extreme Learning Machine (ELM) and Convolutional Neural Network (CNN). Specifically, the representation vectors are constructed by integrating multiple types of biology information including functional similarity and semantic similarity. Then, CNN is applied to mine both local and global features. Finally, ELM is chosen to carry out the prediction task to detect the potential lncRNA-disease associations. The proposed method achieved remarkable Area Under Receiver Operating Characteristic Curve of 0.9086 in Leave-one-out cross-validation and 0.8994 in fivefold cross-validation, respectively. In addition, 2 kinds of case studies based on lung cancer and endometrial cancer indicate the robustness and efficiency of LDACE even in a real environment. Conclusions Substantial results demonstrated that the proposed model is expected to be an auxiliary tool to guide and assist biomedical research, and the close integration of deep learning and biology big data will provide life sciences with novel insights.
Collapse
Affiliation(s)
- Zhen-Hao Guo
- School of Electronics and Information Engineering, Tongji University, No. 4800 Cao'an Road, Shanghai, 201804, China
| | - Zhan-Heng Chen
- College of Computer Science and Engineering, Shenzhen University, Shenzhen, 518060, China.
| | - Zhu-Hong You
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710129, China
| | - Yan-Bin Wang
- College of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277100, Shandong, China.
| | - Hai-Cheng Yi
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011, China.,University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Mei-Neng Wang
- School of Mathematics and Computer Science, Yichun University, Yichun, 336000, Jiangxi, China
| |
Collapse
|
10
|
Gao Z, Wang YT, Wu QW, Li L, Ni JC, Zheng CH. A New Method Based on Matrix Completion and Non-Negative Matrix Factorization for Predicting Disease-Associated miRNAs. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:763-772. [PMID: 32991287 DOI: 10.1109/tcbb.2020.3027444] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Numerous studies have shown that microRNAs are associated with the occurrence and development of human diseases. Thus, studying disease-associated miRNAs is significantly valuable to the prevention, diagnosis and treatment of diseases. In this paper, we proposed a novel method based on matrix completion and non-negative matrix factorization (MCNMF)for predicting disease-associated miRNAs. Due to the information inadequacy on miRNA similarities and disease similarities, we calculated the latter via two models, and introduced the Gaussian interaction profile kernel similarity. In addition, the matrix completion (MC)was employed to further replenish the miRNA and disease similarities to improve the prediction performance. And to reduce the sparsity of miRNA-disease association matrix, the method of weighted K nearest neighbor (WKNKN)was used, which is a pre-processing step. We also utilized non-negative matrix factorization (NMF)using dual L2,1-norm, graph Laplacian regularization, and Tikhonov regularization to effectively avoid the overfitting during the prediction. Finally, several experiments and a case study were implemented to evaluate the effectiveness and performance of the proposed MCNMF model. The results indicated that our method could reliably and effectively predict disease-associated miRNAs.
Collapse
|
11
|
Shen Z, Zhang Q, Han K, Huang DS. A Deep Learning Model for RNA-Protein Binding Preference Prediction Based on Hierarchical LSTM and Attention Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:753-762. [PMID: 32750884 DOI: 10.1109/tcbb.2020.3007544] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Attention mechanism has the ability to find important information in the sequence. The regions of the RNA sequence that can bind to proteins are more important than those that cannot bind to proteins. Neither conventional methods nor deep learning-based methods, they are not good at learning this information. In this study, LSTM is used to extract the correlation features between different sites in RNA sequence. We also use attention mechanism to evaluate the importance of different sites in RNA sequence. We get the optimal combination of k-mer length, k-mer stride window, k-mer sentence length, k-mer sentence stride window, and optimization function through hyper-parm experiments. The results show that the performance of our method is better than other methods. We tested the effects of changes in k-mer vector length on model performance. We show model performance changes under various k-mer related parameter settings. Furthermore, we investigate the effect of attention mechanism and RNA structure data on model performance.
Collapse
|
12
|
Wang J, Lu CH, Kong XZ, Dai LY, Yuan S, Zhang X. Multi-view manifold regularized compact low-rank representation for cancer samples clustering on multi-omics data. BMC Bioinformatics 2022; 22:334. [PMID: 35057729 PMCID: PMC8772048 DOI: 10.1186/s12859-021-04220-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2021] [Accepted: 05/27/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The identification of cancer types is of great significance for early diagnosis and clinical treatment of cancer. Clustering cancer samples is an important means to identify cancer types, which has been paid much attention in the field of bioinformatics. The purpose of cancer clustering is to find expression patterns of different cancer types, so that the samples with similar expression patterns can be gathered into the same type. In order to improve the accuracy and reliability of cancer clustering, many clustering methods begin to focus on the integration analysis of cancer multi-omics data. Obviously, the methods based on multi-omics data have more advantages than those using single omics data. However, the high heterogeneity and noise of cancer multi-omics data pose a great challenge to the multi-omics analysis method. RESULTS In this study, in order to extract more complementary information from cancer multi-omics data for cancer clustering, we propose a low-rank subspace clustering method called multi-view manifold regularized compact low-rank representation (MmCLRR). In MmCLRR, each omics data are regarded as a view, and it learns a consistent subspace representation by imposing a consistence constraint on the low-rank affinity matrix of each view to balance the agreement between different views. Moreover, the manifold regularization and concept factorization are introduced into our method. Relying on the concept factorization, the dictionary can be updated in the learning, which greatly improves the subspace learning ability of low-rank representation. We adopt linearized alternating direction method with adaptive penalty to solve the optimization problem of MmCLRR method. CONCLUSIONS Finally, we apply MmCLRR into the clustering of cancer samples based on multi-omics data, and the clustering results show that our method outperforms the existing multi-view methods.
Collapse
Affiliation(s)
- Juan Wang
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
| | - Cong-Hai Lu
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
| | - Xiang-Zhen Kong
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
| | - Ling-Yun Dai
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
| | - Shasha Yuan
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
| | - Xiaofeng Zhang
- School of Information and Electrical Engineering, Ludong University, Yantai, 264025 China
| |
Collapse
|
13
|
Oujja A, Abid MR, Boumhidi J, Bourhnane S, Mourhir A, Merchant F, Benhaddou D. High-performance computing for SARS-CoV-2 RNAs clustering: a data science‒based genomics approach. Genomics Inform 2022; 19:e49. [PMID: 35012291 PMCID: PMC8752974 DOI: 10.5808/gi.21056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2021] [Accepted: 12/08/2021] [Indexed: 11/20/2022] Open
Abstract
Nowadays, Genomic data constitutes one of the fastest growing datasets in the world. As of 2025, it is supposed to become the fourth largest source of Big Data, and thus mandating adequate high-performance computing (HPC) platform for processing. With the latest unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community is in crucial need for ICT tools to process SARS-CoV-2 RNA data, e.g., by classifying it (i.e., clustering) and thus assisting in tracking virus mutations and predict future ones. In this paper, we are presenting an HPC-based SARS-CoV-2 RNAs clustering tool. We are adopting a data science approach, from data collection, through analysis, to visualization. In the analysis step, we present how our clustering approach leverages on HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm in order to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences. The latter are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to work out existing clusters. In addition to that, we present a comparative study of the LCS algorithm performance based on variable workloads and different numbers of Hadoop worker nodes.
Collapse
Affiliation(s)
- Anas Oujja
- School of Science and Engineering, Al Akhawayn University in Ifrane, Ifrane 53000, Morocco.,Computer Science, Signals, Automation and Cognitivism Laboratory (LISAC), Computer Science Department, Faculty of Science Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco
| | - Mohamed Riduan Abid
- School of Science and Engineering, Al Akhawayn University in Ifrane, Ifrane 53000, Morocco
| | - Jaouad Boumhidi
- Computer Science, Signals, Automation and Cognitivism Laboratory (LISAC), Computer Science Department, Faculty of Science Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco
| | - Safae Bourhnane
- School of Science and Engineering, Al Akhawayn University in Ifrane, Ifrane 53000, Morocco.,Faculty of Sciences, Chouaib Doukkali University, El Jadida 24000, Morocco
| | - Asmaa Mourhir
- School of Science and Engineering, Al Akhawayn University in Ifrane, Ifrane 53000, Morocco
| | - Fatima Merchant
- Computer Engineering Technology Faculty, University of Houston, Houston, TX 77204, USA
| | - Driss Benhaddou
- Computer Engineering Technology Faculty, University of Houston, Houston, TX 77204, USA
| |
Collapse
|
14
|
Liu J, Cheng Y, Wang X, Ge S. One-Step Robust Low-Rank Subspace Segmentation for Tumor Sample Clustering. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2021; 2021:9990297. [PMID: 34925501 PMCID: PMC8674076 DOI: 10.1155/2021/9990297] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/23/2021] [Revised: 11/13/2021] [Accepted: 11/19/2021] [Indexed: 11/18/2022]
Abstract
Clustering of tumor samples can help identify cancer types and discover new cancer subtypes, which is essential for effective cancer treatment. Although many traditional clustering methods have been proposed for tumor sample clustering, advanced algorithms with better performance are still needed. Low-rank subspace clustering is a popular algorithm in recent years. In this paper, we propose a novel one-step robust low-rank subspace segmentation method (ORLRS) for clustering the tumor sample. For a gene expression data set, we seek its lowest rank representation matrix and the noise matrix. By imposing the discrete constraint on the low-rank matrix, without performing spectral clustering, ORLRS learns the cluster indicators of subspaces directly, i.e., performing the clustering task in one step. To improve the robustness of the method, capped norm is adopted to remove the extreme data outliers in the noise matrix. Furthermore, we conduct an efficient solution to solve the problem of ORLRS. Experiments on several tumor gene expression data demonstrate the effectiveness of ORLRS.
Collapse
Affiliation(s)
- Jian Liu
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
| | - Yuhu Cheng
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
| | - Xuesong Wang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
| | - Shuguang Ge
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
| |
Collapse
|
15
|
Tayara H, Chong KT. Improved Predicting of The Sequence Specificities of RNA Binding Proteins by Deep Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2526-2534. [PMID: 32191896 DOI: 10.1109/tcbb.2020.2981335] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
RNA-binding proteins (RBPs) have a significant role in various regulatory tasks. However, the mechanism by which RBPs identify the subsequence target RNAs is still not clear. In recent years, several machine and deep learning-based computational models have been proposed for understanding the binding preferences of RBPs. These methods required integrating multiple features with raw RNA sequences such as secondary structure and their performances can be further improved. In this paper, we propose an efficient and simple convolution neural network, RBPCNN, that relies on the combination of the raw RNA sequence and evolutionary information. We show that conservation scores (evolutionary information) for the RNA sequences can significantly improve the overall performance of the proposed predictor. In addition, the automatic extraction of the binding sequence motifs can enhance our understanding of the binding specificities of RBPs. The experimental results show that RBPCNN outperforms significantly the current state-of-the-art methods. More specifically, the average area under the receiver operator curve was improved by 2.67 percent and the mean average precision was improved by 8.03 percent. The datasets and results can be downloaded from https://home.jbnu.ac.kr/NSCL/RBPCNN.htm.
Collapse
|
16
|
Luo J, Shen C, Lai Z, Cai J, Ding P. Incorporating Clinical, Chemical and Biological Information for Predicting Small Molecule-microRNA Associations Based on Non-Negative Matrix Factorization. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2535-2545. [PMID: 32092012 DOI: 10.1109/tcbb.2020.2975780] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Small molecule(SM) drugs can affect the expression of miRNAs, which plays crucial roles in many important biological processes. The chemical structure and clinical information of small molecule can simultaneously incorporate information such as anatomical distribution, therapeutic effects and structural characteristics. It is necessary to develop a novel model that incorporates small molecule chemical structure and clinical information to reveal the unknown small molecule-miRNA associations. In this study, we developed a new framework based on non-negative matrix factorization, called SMANMF, to discover the potential small molecules-miRNAs associations. First, the functional similarity of two miRNAs can be obtained by computing the overlap of the target gene sets in which the miRNAs interact together, and we integrated two types of small molecule similarities, including chemical similarity and clinical similarity. Then, we utilized a non-negative matrix factorization model to discover the unknown relationship between small molecules and miRNAs. The evaluation results indicate that our model can achieve superior prediction performance compared with previous approaches in 5-fold cross-validation. At the same time, the results of case studies also reveal that the SMANMF model has good predictive performance for predicting the potential association between small molecules and miRNAs.
Collapse
|
17
|
Yang B, Bao W, Zhang W, Wang H, Song C, Chen Y, Jiang X. Reverse engineering gene regulatory network based on complex-valued ordinary differential equation model. BMC Bioinformatics 2021; 22:448. [PMID: 34544363 PMCID: PMC8451084 DOI: 10.1186/s12859-021-04367-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2021] [Accepted: 09/09/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The growing researches of molecular biology reveal that complex life phenomena have the ability to demonstrating various types of interactions in the level of genomics. To establish the interactions between genes or proteins and understand the intrinsic mechanisms of biological systems have become an urgent need and study hotspot. RESULTS In order to forecast gene expression data and identify more accurate gene regulatory network, complex-valued version of ordinary differential equation (CVODE) is proposed in this paper. In order to optimize CVODE model, a complex-valued hybrid evolutionary method based on Grammar-guided genetic programming and complex-valued firefly algorithm is presented. CONCLUSIONS When tested on three real gene expression datasets from E. coli and Human Cell, the experiment results suggest that CVODE model could improve 20-50% prediction accuracy of gene expression data, which could also infer more true-positive regulatory relationships and less false-positive regulations than ordinary differential equation.
Collapse
Affiliation(s)
- Bin Yang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
| | - Wenzheng Bao
- School of Information and Electrical Engineering, Xuzhou University of Technology, Xuzhou, 221018, China.
| | - Wei Zhang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
| | - Haifeng Wang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
| | - Chuandong Song
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
| | - Yuehui Chen
- School of Information Science and Engineering, University of Jinan, Jinan, 250022, China
| | - Xiuying Jiang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
| |
Collapse
|
18
|
Zhang Q, Yu W, Han K, Nandi AK, Huang DS. Multi-Scale Capsule Network for Predicting DNA-Protein Binding Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1793-1800. [PMID: 32960766 DOI: 10.1109/tcbb.2020.3025579] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Discovering DNA-protein binding sites, also known as motif discovery, is the foundation for further analysis of transcription factors (TFs). Deep learning algorithms such as convolutional neural networks (CNN) have been introduced to motif discovery task and have achieved state-of-art performance. However, due to the limitations of CNN, motif discovery methods based on CNN do not take full advantage of large-scale sequencing data generated by high-throughput sequencing technology. Hence, in this paper we propose multi-scale capsule network architecture (MSC) integrating multi-scale CNN, a variant of CNN able to extract motif features of different lengths, and capsule network, a novel type of artificial neural network architecture aimed at improving CNN. The proposed method is tested on real ChIP-seq datasets and the experimental results show a considerable improvement compared with two well-tested deep learning-based sequence model, DeepBind and Deepsea.
Collapse
|
19
|
Liu Y, Wang SL, Zhang JF, Zhang W, Zhou S, Li W. DMFMDA: Prediction of Microbe-Disease Associations Based on Deep Matrix Factorization Using Bayesian Personalized Ranking. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1763-1772. [PMID: 32816678 DOI: 10.1109/tcbb.2020.3018138] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Identifying the microbe-disease associations is conducive to understanding the pathogenesis of disease from the perspective of microbe. In this paper, we propose a deep matrix factorization prediction model (DMFMDA) based on deep neural network. First, the disease one-hot encoding is fed into neural network, which is transformed into a low-dimensional dense vector in implicit semantic space via embedding layer, and so is microbe. Then, matrix factorization is realized by neural network with embedding layer. Furthermore, our model synthesizes the non-linear modeling advantages of multi-layer perceptron based on the linear modeling advantages of matrix factorization. Finally, different from other methods using square error loss function, Bayesian Personalized Ranking optimizes the model from a ranking perspective to obtain the optimal model parameters, which makes full use of the unobserved data. Experiments show that DMFMDA reaches average AUCs of 0.9091 and 0.9103 in the framework of 5-fold cross validation and Leave-one-out cross validation, which is superior to three the-state-of-art methods. In case studies, 10, 9 and 9 out of top-10 candidate microbes are verified by recently published literature for asthma, inflammatory bowel disease and colon cancer, respectively. In conclusion, DMFMDA is successful application of deep learning in the prediction of microbe-disease association.
Collapse
|
20
|
Song J, Peng W, Wang F. Identifying cancer patient subgroups by finding co-modules from the driver mutation profiles and downstream gene expression profiles. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; PP:2863-2872. [PMID: 34415837 DOI: 10.1109/tcbb.2021.3106344] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Identifying cancer subtypes shed new light on effective personalized cancer medicine, future therapeutic strategies and minimizing treatment-related costs. Recently, there are many clustering methods have been proposed in categorizing cancer patients. However, these methods still fail to fully use the prior known biological information in the model designing process to improve precision and efficiency. It is acknowledged that the driver gene always regulates its downstream genes in the net-work to perform a certain function. By analyzing the known clinic cancer subtype data, we found some special co-pathways between the driver genes and the downstream genes in the cancer patients of the same subgroup. Hence, we proposed a novel model named DDCMNMF(Driver and Downstream gene Co-Module Assisted Multiple Non-negative Matrix Factorization model) that first stratify cancer sub-types by identifying co-modules of driver genes and downstream genes. We applied our model on lung and breast cancer datasets and compared it with the other four state-of-the-art models. The final results show that our model could identify the cancer subtypes with high compactness and separateness and achieve a high degree of consistency with the known cancer subtypes. The survival time analysis further proves the significant clinical characteristic of identified cancer subgroups by our model.
Collapse
|
21
|
Wang S, Zhang Q, Shen Z, He Y, Chen ZH, Li J, Huang DS. Predicting transcription factor binding sites using DNA shape features based on shared hybrid deep learning architecture. MOLECULAR THERAPY-NUCLEIC ACIDS 2021; 24:154-163. [PMID: 33767912 PMCID: PMC7972936 DOI: 10.1016/j.omtn.2021.02.014] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/19/2020] [Accepted: 02/14/2021] [Indexed: 12/26/2022]
Abstract
The study of transcriptional regulation is still difficult yet fundamental in molecular biology research. Recent research has shown that the double helix structure of nucleotides plays an important role in improving the accuracy and interpretability of transcription factor binding sites (TFBSs). Although several computational methods have been designed to take both DNA sequence and DNA shape features into consideration simultaneously, how to design an efficient model is still an intractable topic. In this paper, we proposed a hybrid convolutional recurrent neural network (CNN/RNN) architecture, CRPTS, to predict TFBSs by combining DNA sequence and DNA shape features. The novelty of our proposed method relies on three critical aspects: (1) the application of a shared hybrid CNN and RNN has the ability to efficiently extract features from large-scale genomic sequences obtained by high-throughput technology; (2) the common patterns were found from DNA sequences and their corresponding DNA shape features; (3) our proposed CRPTS can capture local structural information of DNA sequences without completely relying on DNA shape data. A series of comprehensive experiments on 66 in vitro datasets derived from universal protein binding microarrays (uPBMs) shows that our proposed method CRPTS obviously outperforms the state-of-the-art methods.
Collapse
Affiliation(s)
- Siguo Wang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
| | - Qinhu Zhang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China.,Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Tongji University, Siping Road 1239, Shanghai 200092, China
| | - Zhen Shen
- School of Computer and Software, Nanyang Institute of Technology, Changjiang Road 80, Nanyang, Henan 473004, China
| | - Ying He
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
| | - Zhen-Heng Chen
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - Jianqiang Li
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - De-Shuang Huang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
| |
Collapse
|
22
|
Sun P, Yang S, Cao Y, Cheng R, Han S. Prediction of Potential Associations Between miRNAs and Diseases Based on Matrix Decomposition. Front Genet 2020; 11:598185. [PMID: 33304393 PMCID: PMC7701300 DOI: 10.3389/fgene.2020.598185] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2020] [Accepted: 10/22/2020] [Indexed: 01/06/2023] Open
Abstract
It is known that miRNA plays an increasingly important role in many physiological processes. Disease-related miRNAs could be potential biomarkers for clinical diagnosis, prognosis, and treatment. Therefore, accurately inferring potential miRNAs related to diseases has become a hot topic in the bioinformatics community recently. In this study, we proposed a mathematical model based on matrix decomposition, named MFMDA, to identify potential miRNA-disease associations by integrating known miRNA and disease-related data, similarities between miRNAs and between diseases. We also compared MFMDA with some of the latest algorithms in several established miRNA disease databases. MFMDA reached an AUC of 0.9061 in the fivefold cross-validation. The experimental results show that MFMDA effectively infers novel miRNA-disease associations. In addition, we conducted case studies by applying MFMDA to three types of high-risk human cancers. While most predicted miRNAs are confirmed by external databases of experimental literature, we also identified a few novel disease-related miRNAs for further experimental validation.
Collapse
Affiliation(s)
- Pengcheng Sun
- Department of Obstetrics and Gynecology, The Fourth Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Shuyan Yang
- Department of Obstetrics and Gynecology, The Fourth Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Ye Cao
- Department of Obstetrics and Gynecology, The Fourth Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Rongjie Cheng
- Department of Obstetrics and Gynecology, The Fourth Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Shiyu Han
- Department of Obstetrics and Gynecology, The Fourth Affiliated Hospital of Harbin Medical University, Harbin, China
| |
Collapse
|
23
|
Hind J, Lisboa P, Hussain AJ, Al-Jumeily D. A Novel Approach to Detecting Epistasis using Random Sampling Regularisation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1535-1545. [PMID: 31634840 DOI: 10.1109/tcbb.2019.2948330] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Epistasis is a progressive approach that complements the 'common disease, common variant' hypothesis that highlights the potential for connected networks of genetic variants collaborating to produce a phenotypic expression. Epistasis is commonly performed as a pairwise or limitless-arity capacity that considers variant networks as either variant vs variant or as high order interactions. This type of analysis extends the number of tests that were previously performed in a standard approach such as Genome-Wide Association Study (GWAS), in which False Discovery Rate (FDR) is already an issue, therefore by multiplying the number of tests up to a factorial rate also increases the issue of FDR. Further to this, epistasis introduces its own limitations of computational complexity and intensity that are generated based on the analysis performed; to consider the most intense approach, a multivariate analysis introduces a time complexity of O(n!). Proposed in this paper is a novel methodology for the detection of epistasis using interpretable methods and best practice to outline interactions through filtering processes. Using a process of Random Sampling Regularisation which randomly splits and produces sample sets to conduct a voting system to regularise the significance and reliability of biological markers, SNPs. Preliminary results are promising, outlining a concise detection of interactions. Results for the detection of epistasis, in the classification of breast cancer patients, indicated eight outlined risk candidate interactions from five variants and a singular candidate variant with high protective association.
Collapse
|
24
|
Lin X, Zhang X, Xu X. Efficient Classification of Hot Spots and Hub Protein Interfaces by Recursive Feature Elimination and Gradient Boosting. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1525-1534. [PMID: 31380766 DOI: 10.1109/tcbb.2019.2931717] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Proteins are not isolated biological molecules, which have the specific three-dimensional structures and interact with other proteins to perform functions. A small number of residues (hot spots) in protein-protein interactions (PPIs) play the vital role in bioinformatics to influence and control of biological processes. This paper uses the boosting algorithm and gradient boosting algorithm based on two feature selection strategies to classify hot spots with three common datasets and two hub protein datasets. First, the correlation-based feature selection is used to remove the highly related features for improving accuracy of prediction. Then, the recursive feature elimination based on support vector machine (SVM-RFE) is adopted to select the optimal feature subset to improve the training performance. Finally, boosting and gradient boosting (G-boosting) methods are invoked to generate classification results. Gradient boosting is capable of obtaining an excellent model by reducing the loss function in the gradient direction to avoid overfitting. Five datasets from different protein databases are used to verify our models in the experiments. Experimental results show that our proposed classification models have the competitive performance compared with existing classification methods.
Collapse
|
25
|
Shen Z, Deng SP, Huang DS. RNA-Protein Binding Sites Prediction via Multi Scale Convolutional Gated Recurrent Unit Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1741-1750. [PMID: 30990191 DOI: 10.1109/tcbb.2019.2910513] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
RNA-Protein binding plays important roles in the field of gene expression. With the development of high throughput sequencing, several conventional methods and deep learning-based methods have been proposed to predict the binding preference of RNA-protein binding. These methods can hardly meet the need of consideration of the dependencies between subsequence and the various motif lengths of different translation factors (TFs). To overcome such limitations, we propose a predictive model that utilizes a combination of multi-scale convolutional layers and bidirectional gated recurrent unit (GRU) layer. Multi-scale convolution layer has the ability to capture the motif features of different lengths, and bidirectional GRU layer is able to capture the dependencies among subsequence. Experimental results show that the proposed method performs better than four state-of-the-art methods in this field. In addition, we investigate the effect of model structure on model performance by performing our proposed method with a different convolution layer and a different number of kernel size. We also demonstrate the effectiveness of bidirectional GRU in improving model performance through comparative experiments.
Collapse
|
26
|
Li J, Shi X, You ZH, Yi HC, Chen Z, Lin Q, Fang M. Using Weighted Extreme Learning Machine Combined With Scale-Invariant Feature Transform to Predict Protein-Protein Interactions From Protein Evolutionary Information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1546-1554. [PMID: 31940546 DOI: 10.1109/tcbb.2020.2965919] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Protein-Protein Interactions (PPIs) play an irreplaceable role in biological activities of organisms. Although many high-throughput methods are used to identify PPIs from different kinds of organisms, they have some shortcomings, such as high cost and time-consuming. To solve the above problems, computational methods are developed to predict PPIs. Thus, in this paper, we present a method to predict PPIs using protein sequences. First, protein sequences are transformed into Position Weight Matrix (PWM), in which Scale-Invariant Feature Transform (SIFT) algorithm is used to extract features. Then Principal Component Analysis (PCA) is applied to reduce the dimension of features. At last, Weighted Extreme Learning Machine (WELM) classifier is employed to predict PPIs and a series of evaluation results are obtained. In our method, since SIFT and WELM are used to extract features and classify respectively, we called the proposed method SIFT-WELM. When applying the proposed method on three well-known PPIs datasets of Yeast, Human and Helicobacter.pylori, the average accuracies of our method using five-fold cross validation are obtained as high as 94.83, 97.60 and 83.64 percent, respectively. In order to evaluate the proposed approach properly, we compare it with Support Vector Machine (SVM) classifier and other recent-developed methods in different aspects. Moreover, the training time of our method is greatly shortened, which is obviously superior to the previous methods, such as SVM, ACC, PCVMZM and so on.
Collapse
|
27
|
Peng C, Zheng Y, Huang DS. Capsule Network Based Modeling of Multi-omics Data for Discovery of Breast Cancer-Related Genes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1605-1612. [PMID: 30969931 DOI: 10.1109/tcbb.2019.2909905] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/23/2023]
Abstract
Breast cancer is one of the most common cancers all over the world, which bring about more than 450,000 deaths each year. Although this malignancy has been extensively studied by a large number of researchers, its prognosis is still poor. Since therapeutic advance can be obtained based on gene signatures, there is an urgent need to discover genes related to breast cancer that may help uncover the mechanisms in cancer progression. We propose a deep learning method for the discovery of breast cancer-related genes by using Capsule Network based Modeling of Multi-omics Data (CapsNetMMD). In CapsNetMMD, we make use of known breast cancer-related genes to transform the issue of gene identification into the issue of supervised classification. The features of genes are generated through comprehensive integration of multi-omics data, e.g., mRNA expression, z scores for mRNA expression, DNA methylation, and two forms of DNA copy-number alterations (CNAs). By modeling features based on the capsule network, we identify breast cancer-related genes with a significantly better performance than other existing machine learning methods. The predicted genes with prognostic values play potential important roles in breast cancer and may serve as candidates for biologists and medical scientists in the future studies of biomarkers.
Collapse
|
28
|
Shen Z, Deng SP, Huang DS. Capsule Network for Predicting RNA-Protein Binding Preferences Using Hybrid Feature. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1483-1492. [PMID: 31562101 DOI: 10.1109/tcbb.2019.2943465] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
RNA-Protein binding is involved in many different biological processes. With the progress of technology, more and more data are available for research. Based on these data, many prediction methods have been proposed to predict RNA-Protein binding preference. Some of these methods use only RNA sequence features for prediction, and some methods use multiple features for prediction. But, the performance of these methods is not satisfactory. In this study, we propose an improved capsule network to predict RNA-protein binding preferences, which can use both RNA sequence features and structure features. Experimental results show that our proposed method iCapsule performs better than three baseline methods in this field. We used both RNA sequence features and structure features in the model, so we tested the effect of primary capsule layer changes on model performance. In addition, we also studied the impact of model structure on model performance by performing our proposed method with different number of convolution layers and different kernel sizes.
Collapse
|
29
|
Abdulla M, Khasawneh MT. G-Forest: An ensemble method for cost-sensitive feature selection in gene expression microarrays. Artif Intell Med 2020; 108:101941. [DOI: 10.1016/j.artmed.2020.101941] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2020] [Revised: 06/27/2020] [Accepted: 08/07/2020] [Indexed: 12/27/2022]
|
30
|
Liu Y, Ye D, Li W, Wang H, Gao Y. Robust neighborhood embedding for unsupervised feature selection. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2019.105462] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
|
31
|
Gao Z, Wang YT, Wu QW, Ni JC, Zheng CH. Graph regularized L 2,1-nonnegative matrix factorization for miRNA-disease association prediction. BMC Bioinformatics 2020; 21:61. [PMID: 32070280 PMCID: PMC7029547 DOI: 10.1186/s12859-020-3409-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2019] [Accepted: 02/11/2020] [Indexed: 01/24/2023] Open
Abstract
BACKGROUND The aberrant expression of microRNAs is closely connected to the occurrence and development of a great deal of human diseases. To study human diseases, numerous effective computational models that are valuable and meaningful have been presented by researchers. RESULTS Here, we present a computational framework based on graph Laplacian regularized L2, 1-nonnegative matrix factorization (GRL2, 1-NMF) for inferring possible human disease-connected miRNAs. First, manually validated disease-connected microRNAs were integrated, and microRNA functional similarity information along with two kinds of disease semantic similarities were calculated. Next, we measured Gaussian interaction profile (GIP) kernel similarities for both diseases and microRNAs. Then, we adopted a preprocessing step, namely, weighted K nearest known neighbours (WKNKN), to decrease the sparsity of the miRNA-disease association matrix network. Finally, the GRL2,1-NMF framework was used to predict links between microRNAs and diseases. CONCLUSIONS The new method (GRL2, 1-NMF) achieved AUC values of 0.9280 and 0.9276 in global leave-one-out cross validation (global LOOCV) and five-fold cross validation (5-CV), respectively, showing that GRL2, 1-NMF can powerfully discover potential disease-related miRNAs, even if there is no known associated disease.
Collapse
Affiliation(s)
- Zhen Gao
- School of Software, Qufu Normal University, Qufu, 273165, China
| | - Yu-Tian Wang
- School of Software, Qufu Normal University, Qufu, 273165, China
| | - Qing-Wen Wu
- School of Software, Qufu Normal University, Qufu, 273165, China
| | - Jian-Cheng Ni
- School of Software, Qufu Normal University, Qufu, 273165, China.
| | - Chun-Hou Zheng
- School of Software, Qufu Normal University, Qufu, 273165, China.
| |
Collapse
|
32
|
Wang J, Lu CH, Liu JX, Dai LY, Kong XZ. Multi-cancer samples clustering via graph regularized low-rank representation method under sparse and symmetric constraints. BMC Bioinformatics 2019; 20:718. [PMID: 31888442 PMCID: PMC6936083 DOI: 10.1186/s12859-019-3231-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open
Abstract
Background Identifying different types of cancer based on gene expression data has become hotspot in bioinformatics research. Clustering cancer gene expression data from multiple cancers to their own class is a significance solution. However, the characteristics of high-dimensional and small samples of gene expression data and the noise of the data make data mining and research difficult. Although there are many effective and feasible methods to deal with this problem, the possibility remains that these methods are flawed. Results In this paper, we propose the graph regularized low-rank representation under symmetric and sparse constraints (sgLRR) method in which we introduce graph regularization based on manifold learning and symmetric sparse constraints into the traditional low-rank representation (LRR). For the sgLRR method, by means of symmetric constraint and sparse constraint, the effect of raw data noise on low-rank representation is alleviated. Further, sgLRR method preserves the important intrinsic local geometrical structures of the raw data by introducing graph regularization. We apply this method to cluster multi-cancer samples based on gene expression data, which improves the clustering quality. First, the gene expression data are decomposed by sgLRR method. And, a lowest rank representation matrix is obtained, which is symmetric and sparse. Then, an affinity matrix is constructed to perform the multi-cancer sample clustering by using a spectral clustering algorithm, i.e., normalized cuts (Ncuts). Finally, the multi-cancer samples clustering is completed. Conclusions A series of comparative experiments demonstrate that the sgLRR method based on low rank representation has a great advantage and remarkable performance in the clustering of multi-cancer samples.
Collapse
Affiliation(s)
- Juan Wang
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Cong-Hai Lu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China.
| | - Ling-Yun Dai
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Xiang-Zhen Kong
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| |
Collapse
|
33
|
Hu S, Zhang C, Chen P, Gu P, Zhang J, Wang B. Predicting drug-target interactions from drug structure and protein sequence using novel convolutional neural networks. BMC Bioinformatics 2019; 20:689. [PMID: 31874614 PMCID: PMC6929541 DOI: 10.1186/s12859-019-3263-x] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Background Accurate identification of potential interactions between drugs and protein targets is a critical step to accelerate drug discovery. Despite many relative experimental researches have been done in the past decades, detecting drug-target interactions (DTIs) remains to be extremely resource-intensive and time-consuming. Therefore, many computational approaches have been developed for predicting drug-target associations on a large scale. Results In this paper, we proposed an deep learning-based method to predict DTIs only using the information of drug structures and protein sequences. The final results showed that our method can achieve good performance with the accuracies up to 92.0%, 90.0%, 92.0% and 90.7% for the target families of enzymes, ion channels, GPCRs and nuclear receptors of our created dataset, respectively. Another dataset derived from DrugBank was used to further assess the generalization of the model, which yielded an accuracy of 0.9015 and an AUC value of 0.9557. Conclusion It was elucidated that our model shows improved performance in comparison with other state-of-the-art computational methods on the common benchmark datasets. Experimental results demonstrated that our model successfully extracted more nuanced yet useful features, and therefore can be used as a practical tool to discover new drugs. Availability http://deeplearner.ahu.edu.cn/web/CnnDTI.htm.
Collapse
Affiliation(s)
- ShanShan Hu
- School of Computer Science and Technology, Anhui University, Jiulong Road, Hefei, 230601, China
| | - Chenglin Zhang
- Institutes of Physical Science and Information Technology, Anhui University, Jiulong Road, Hefei, 230601, China
| | - Peng Chen
- School of Computer Science and Technology, Anhui University, Jiulong Road, Hefei, 230601, China. .,Institutes of Physical Science and Information Technology, Anhui University, Jiulong Road, Hefei, 230601, China. .,Cadre's Ward (South District), The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, 230001, China.
| | - Pengying Gu
- Cadre's Ward (South District), The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, 230001, China.
| | - Jun Zhang
- School of Electrical and Information Engineering, Anhui University, Hefei, 230601, China
| | - Bing Wang
- School of Electrical and Information Engineering, Anhui University of Technology, Ma'anshan, 243032, China
| |
Collapse
|
34
|
Abstract
Background Cost-sensitive algorithm is an effective strategy to solve imbalanced classification problem. However, the misclassification costs are usually determined empirically based on user expertise, which leads to unstable performance of cost-sensitive classification. Therefore, an efficient and accurate method is needed to calculate the optimal cost weights. Results In this paper, two approaches are proposed to search for the optimal cost weights, targeting at the highest weighted classification accuracy (WCA). One is the optimal cost weights grid searching and the other is the function fitting. Comparisons are made between these between the two algorithms above. In experiments, we classify imbalanced gene expression data using extreme learning machine to test the cost weights obtained by the two approaches. Conclusions Comprehensive experimental results show that the function fitting method is generally more efficient, which can well find the optimal cost weights with acceptable WCA.
Collapse
|
35
|
Cui Z, Liu JX, Gao YL, Zheng CH, Wang J. RCMF: a robust collaborative matrix factorization method to predict miRNA-disease associations. BMC Bioinformatics 2019; 20:686. [PMID: 31874608 PMCID: PMC6929455 DOI: 10.1186/s12859-019-3260-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
Background Predicting miRNA-disease associations (MDAs) is time-consuming and expensive. It is imminent to improve the accuracy of prediction results. So it is crucial to develop a novel computing technology to predict new MDAs. Although some existing methods can effectively predict novel MDAs, there are still some shortcomings. Especially when the disease matrix is processed, its sparsity is an important factor affecting the final results. Results A robust collaborative matrix factorization (RCMF) is proposed to predict novel MDAs. The L2,1-norm are introduced to our method to achieve the highest AUC value than other advanced methods. Conclusions 5-fold cross validation is used to evaluate our method, and simulation experiments are used to predict novel associations on Gold Standard Dataset. Finally, our prediction accuracy is better than other existing advanced methods. Therefore, our approach is effective and feasible in predicting novel MDAs.
Collapse
Affiliation(s)
- Zhen Cui
- School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826, China
| | - Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826, China. .,Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, 230601, China.
| | - Ying-Lian Gao
- Qufu Normal University Library, Qufu Normal University, Rizhao, 276826, China
| | - Chun-Hou Zheng
- Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, 230601, China
| | - Juan Wang
- School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826, China.
| |
Collapse
|
36
|
Inferring the Disease-Associated miRNAs Based on Network Representation Learning and Convolutional Neural Networks. Int J Mol Sci 2019; 20:ijms20153648. [PMID: 31349729 PMCID: PMC6696449 DOI: 10.3390/ijms20153648] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2019] [Revised: 07/17/2019] [Accepted: 07/18/2019] [Indexed: 02/06/2023] Open
Abstract
Identification of disease-associated miRNAs (disease miRNAs) are critical for understanding etiology and pathogenesis. Most previous methods focus on integrating similarities and associating information contained in heterogeneous miRNA-disease networks. However, these methods establish only shallow prediction models that fail to capture complex relationships among miRNA similarities, disease similarities, and miRNA-disease associations. We propose a prediction method on the basis of network representation learning and convolutional neural networks to predict disease miRNAs, called CNNMDA. CNNMDA deeply integrates the similarity information of miRNAs and diseases, miRNA-disease associations, and representations of miRNAs and diseases in low-dimensional feature space. The new framework based on deep learning was built to learn the original and global representation of a miRNA-disease pair. First, diverse biological premises about miRNAs and diseases were combined to construct the embedding layer in the left part of the framework, from a biological perspective. Second, the various connection edges in the miRNA-disease network, such as similarity and association connections, were dependent on each other. Therefore, it was necessary to learn the low-dimensional representations of the miRNA and disease nodes based on the entire network. The right part of the framework learnt the low-dimensional representation of each miRNA and disease node based on non-negative matrix factorization, and these representations were used to establish the corresponding embedding layer. Finally, the left and right embedding layers went through convolutional modules to deeply learn the complex and non-linear relationships among the similarities and associations between miRNAs and diseases. Experimental results based on cross validation indicated that CNNMDA yields superior performance compared to several state-of-the-art methods. Furthermore, case studies on lung, breast, and pancreatic neoplasms demonstrated the powerful ability of CNNMDA to discover potential disease miRNAs.
Collapse
|
37
|
Zhang Q, Zhu L, Huang DS. High-Order Convolutional Neural Network Architecture for Predicting DNA-Protein Binding Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1184-1192. [PMID: 29993783 DOI: 10.1109/tcbb.2018.2819660] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/15/2023]
Abstract
Although Deep learning algorithms have outperformed conventional methods in predicting the sequence specificities of DNA-protein binding, they lack to consider the dependencies among nucleotides and the diverse binding lengths for different transcription factors (TFs). To address the above two limitations simultaneously, in this paper, we propose a high-order convolutional neural network architecture (HOCNN), which employs a high-order encoding method to build high-order dependencies among nucleotides, and a multi-scale convolutional layer to capture the motif features of different length. The experimental results on real ChIP-seq datasets show that the proposed method outperforms the state-of-the-art deep learning method (DeepBind) in the motif discovery task. In addition, we provide further insights about the importance of introducing additional convolutional kernels and the degeneration problem of importing high-order in the motif discovery task.
Collapse
|
38
|
Gao YL, Cui Z, Liu JX, Wang J, Zheng CH. NPCMF: Nearest Profile-based Collaborative Matrix Factorization method for predicting miRNA-disease associations. BMC Bioinformatics 2019; 20:353. [PMID: 31234797 PMCID: PMC6591872 DOI: 10.1186/s12859-019-2956-5] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2019] [Accepted: 06/17/2019] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Predicting meaningful miRNA-disease associations (MDAs) is costly. Therefore, an increasing number of researchers are beginning to focus on methods to predict potential MDAs. Thus, prediction methods with improved accuracy are under development. An efficient computational method is proposed to be crucial for predicting novel MDAs. For improved experimental productivity, large biological datasets are used by researchers. Although there are many effective and feasible methods to predict potential MDAs, the possibility remains that these methods are flawed. RESULTS A simple and effective method, known as Nearest Profile-based Collaborative Matrix Factorization (NPCMF), is proposed to identify novel MDAs. The nearest profile is introduced to our method to achieve the highest AUC value compared with other advanced methods. For some miRNAs and diseases without any association, we use the nearest neighbour information to complete the prediction. CONCLUSIONS To evaluate the performance of our method, five-fold cross-validation is used to calculate the AUC value. At the same time, three disease cases, gastric neoplasms, rectal neoplasms and colonic neoplasms, are used to predict novel MDAs on a gold-standard dataset. We predict the vast majority of known MDAs and some novel MDAs. Finally, the prediction accuracy of our method is determined to be better than that of other existing methods. Thus, the proposed prediction model can obtain reliable experimental results.
Collapse
Affiliation(s)
- Ying-Lian Gao
- Library of Qufu Normal University, Qufu Normal University, Rizhao, China
| | - Zhen Cui
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China. .,Co-Innovation Center for Information Supply and Assurance Technology, Anhui University, Hefei, China.
| | - Juan Wang
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Chun-Hou Zheng
- Co-Innovation Center for Information Supply and Assurance Technology, Anhui University, Hefei, China
| |
Collapse
|
39
|
Zhang Q, Shen Z, Huang DS. Modeling in-vivo protein-DNA binding by combining multiple-instance learning with a hybrid deep neural network. Sci Rep 2019; 9:8484. [PMID: 31186519 PMCID: PMC6559991 DOI: 10.1038/s41598-019-44966-x] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2018] [Accepted: 05/15/2019] [Indexed: 01/26/2023] Open
Abstract
Modeling in-vivo protein-DNA binding is not only fundamental for further understanding of the regulatory mechanisms, but also a challenging task in computational biology. Deep-learning based methods have succeed in modeling in-vivo protein-DNA binding, but they often (1) follow the fully supervised learning framework and overlook the weakly supervised information of genomic sequences that a bound DNA sequence may has multiple TFBS(s), and, (2) use one-hot encoding to encode DNA sequences and ignore the dependencies among nucleotides. In this paper, we propose a weakly supervised framework, which combines multiple-instance learning with a hybrid deep neural network and uses k-mer encoding to transform DNA sequences, for modeling in-vivo protein-DNA binding. Firstly, this framework segments sequences into multiple overlapping instances using a sliding window, and then encodes all instances into image-like inputs of high-order dependencies using k-mer encoding. Secondly, it separately computes a score for all instances in the same bag using a hybrid deep neural network that integrates convolutional and recurrent neural networks. Finally, it integrates the predicted values of all instances as the final prediction of this bag using the Noisy-and method. The experimental results on in-vivo datasets demonstrate the superior performance of the proposed framework. In addition, we also explore the performance of the proposed framework when using k-mer encoding, and demonstrate the performance of the Noisy-and method by comparing it with other fusion methods, and find that adding recurrent layers can improve the performance of the proposed framework.
Collapse
Affiliation(s)
- Qinhu Zhang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - Zhen Shen
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China.
| |
Collapse
|
40
|
Zhang X, Lin X, Zhao J, Huang Q, Xu X. Efficiently Predicting Hot Spots in PPIs by Combining Random Forest and Synthetic Minority Over-Sampling Technique. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:774-781. [PMID: 33156780 DOI: 10.1109/tcbb.2018.2871674] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Hot spot residues bring into play the vital function in bioinformatics to find new medications such as drug design. However, current datasets are predominately composed of non-hot spots with merely a tiny percentage of hot spots. Conventional hot spots prediction methods may face great challenges towards the problem of imbalance training samples. This paper presents a classification method combining with random forest classification and oversampling strategy to improve the training performance. A strategy with an oversampling ability is used to generate hot spots data to balance the given training set. Random forest classification is then invoked to generate a set of forest trees for this oversampled training set. The final prediction performance can be computed recursively after the oversampling and training process. This proposed method is capable of randomly selecting features and constructing a robust random forest to avoid overfitting the training set. Experimental results from three data sets indicate that the performance of hot spots prediction has been significantly improved compared with existing classification methods.
Collapse
|
41
|
Yuan L, Huang DS. A Network-guided Association Mapping Approach from DNA Methylation to Disease. Sci Rep 2019; 9:5601. [PMID: 30944378 PMCID: PMC6447594 DOI: 10.1038/s41598-019-42010-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2018] [Accepted: 03/12/2019] [Indexed: 01/11/2023] Open
Abstract
Aberrant DNA methylation may contribute to development of cancer. However, understanding the associations between DNA methylation and cancer remains a challenge because of the complex mechanisms involved in the associations and insufficient sample sizes. The unprecedented wealth of DNA methylation, gene expression and disease status data give us a new opportunity to design machine learning methods to investigate the underlying associated mechanisms. In this paper, we propose a network-guided association mapping approach from DNA methylation to disease (NAMDD). Compared with existing methods, NAMDD finds methylation-disease path associations by integrating analysis of multiple data combined with a stability selection strategy, thereby mining more information in the datasets and improving the quality of resultant methylation sites. The experimental results on both synthetic and real ovarian cancer data show that NAMDD substantially outperforms former disease-related methylation site research methods (including NsRRR and PCLOGIT) under false positive control. Furthermore, we applied NAMDD to ovarian cancer data, identified significant path associations and provided hypothetical biological path associations to explain our findings.
Collapse
Affiliation(s)
- Lin Yuan
- Institute of Machine Learning and Systems Biology, College of Electronic and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, College of Electronic and Information Engineering, Tongji University, Shanghai, 201804, P.R. China.
| |
Collapse
|
42
|
Xu W, Zhu L, Huang DS. DCDE: An Efficient Deep Convolutional Divergence Encoding Method for Human Promoter Recognition. IEEE Trans Nanobioscience 2019; 18:136-145. [PMID: 30624223 DOI: 10.1109/tnb.2019.2891239] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Efficient human promoter feature extraction is still a major challenge in genome analysis as it can better understand human gene regulation and will be useful for experimental guidance. Although many machine learning algorithms have been developed for eukaryotic gene recognition, performance on promoters is unsatisfactory due to the diverse nature. To extract discriminative features from human promoters, an efficient deep convolutional divergence encoding method (DCDE) is proposed based on statistical divergence (SD) and convolutional neural network (CNN). SD can help optimize kmer feature extraction for human promoters. CNN can also be used to automatically extract features in gene analysis. In DCDE, we first perform informative kmers settlement to encode original gene sequences. A series of SD methods can optimize the most discriminative kmers distributions while maintaining important positional information. Then, CNN is utilized to extract lower dimensional deep features by secondary encoding. Finally, we construct a hybrid recognition architecture with multiple support vector machines and a bilayer decision method. It is flexible to add new features or new models and can be extended to identify other genomic functional elements. The extensive experiments demonstrate that DCDE is effective in promoter encoding and can significantly improve the performance of promoter recognition.
Collapse
|
43
|
Identification of cancer subtypes by integrating multiple types of transcriptomics data with deep learning in breast cancer. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.03.072] [Citation(s) in RCA: 51] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
|
44
|
Zhang L, Yu G, Xia D, Wang J. Protein–protein interactions prediction based on ensemble deep neural networks. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.02.097] [Citation(s) in RCA: 74] [Impact Index Per Article: 12.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
|
45
|
Wang J, Liu JX, Zheng CH, Wang YX, Kong XZ, Wen CG. A Mixed-Norm Laplacian Regularized Low-Rank Representation Method for Tumor Samples Clustering. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:172-182. [PMID: 29990217 DOI: 10.1109/tcbb.2017.2769647] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Tumor samples clustering based on biomolecular data is a hot issue of cancer classifications discovery. How to extract the valuable information from high dimensional genomic data is becoming an urgent problem in tumor samples clustering. In this paper, we introduce manifold regularization into low-rank representation model and present a novel method named Mixed-norm Laplacian regularized Low-Rank Representation (MLLRR) to identify the differentially expressed genes for tumor clustering based on gene expression data. Then, in order to advance the accuracy and stability of tumor clustering, we establish the clustering model based on Penalized Matrix Decomposition (PMD) and propose a novel cluster method named MLLRR-PMD. In this method, the cancer clustering research includes three steps. First, the matrix of gene expression data is decomposed into a low rank representation matrix and a sparse matrix by MLLRR. Second, the differentially expressed genes are identified based on the sparse matrix. Finally, the PMD is applied to cluster the samples based on the differentially expressed genes. The experiment results on simulation data and real genomic data illustrate that MLLRR method enhances the robustness to outliers and achieves remarkable performance in the extraction of differentially expressed genes.
Collapse
|
46
|
Hou MX, Gao YL, Liu JX, Dai LY, Kong XZ, Shang J. Network analysis based on low-rank method for mining information on integrated data of multi-cancers. Comput Biol Chem 2018; 78:468-473. [PMID: 30563751 DOI: 10.1016/j.compbiolchem.2018.11.027] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2018] [Revised: 11/30/2018] [Accepted: 11/30/2018] [Indexed: 02/01/2023]
Abstract
The noise problem of cancer sequencing data has been a problem that can't be ignored. Utilizing considerable way to reduce noise of these cancer data is an important issue in the analysis of gene co-expression network. In this paper, we apply a sparse and low-rank method which is Robust Principal Component Analysis (RPCA) to solve the noise problem for integrated data of multi-cancers from The Cancer Genome Atlas (TCGA). And then we build the gene co-expression network based on the integrated data after noise reduction. Finally, we perform nodes and pathways mining on the denoising networks. Experiments in this paper show that after denoising by RPCA, the gene expression data tend to be orderly and neat than before, and the constructed networks contain more pathway enrichment information than unprocessed data. Moreover, learning from the betweenness centrality of the nodes in the network, we find some abnormally expressed genes and pathways proven that are associated with many cancers from the denoised network. The experimental results indicate that our method is reasonable and effective, and we also find some candidate suspicious genes that may be linked to multi-cancers.
Collapse
Affiliation(s)
- Mi-Xiao Hou
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Ying-Lian Gao
- Library of Qufu Normal University, Qufu Normal University, Rizhao, China
| | - Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China; Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, China.
| | - Ling-Yun Dai
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Xiang-Zhen Kong
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Junliang Shang
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| |
Collapse
|
47
|
Laplacian regularized low-rank representation for cancer samples clustering. Comput Biol Chem 2018; 78:504-509. [PMID: 30528509 DOI: 10.1016/j.compbiolchem.2018.11.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2018] [Accepted: 11/07/2018] [Indexed: 12/18/2022]
Abstract
Cancer samples clustering based on biomolecular data has been becoming an important tool for cancer classification. The recognition of cancer types is of great importance for cancer treatment. In this paper, in order to improve the accuracy of cancer recognition, we propose to use Laplacian regularized Low-Rank Representation (LLRR) to cluster the cancer samples based on genomic data. In LLRR method, the high-dimensional genomic data are approximately treated as samples extracted from a combination of several low-rank subspaces. The purpose of LLRR method is to seek the lowest-rank representation matrix based on a dictionary. Because a Laplacian regularization based on manifold is introduced into LLRR, compared to the Low-Rank Representation (LRR) method, besides capturing the global geometric structure, LLRR can capture the intrinsic local structure of high-dimensional observation data well. And what is more, in LLRR, the original data themselves are selected as a dictionary, so the lowest-rank representation is actually a similar expression between the samples. Therefore, corresponding to the low-rank representation matrix, the samples with high similarity are considered to come from the same subspace and are grouped into a class. The experiment results on real genomic data illustrate that LLRR method, compared with LRR and MLLRR, is more robust to noise and has a better ability to learn the inherent subspace structure of data, and achieves remarkable performance in the clustering of cancer samples.
Collapse
|
48
|
Recurrent Neural Network for Predicting Transcription Factor Binding Sites. Sci Rep 2018; 8:15270. [PMID: 30323198 PMCID: PMC6189047 DOI: 10.1038/s41598-018-33321-1] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2018] [Accepted: 09/25/2018] [Indexed: 12/23/2022] Open
Abstract
It is well known that DNA sequence contains a certain amount of transcription factors (TF) binding sites, and only part of them are identified through biological experiments. However, these experiments are expensive and time-consuming. To overcome these problems, some computational methods, based on k-mer features or convolutional neural networks, have been proposed to identify TF binding sites from DNA sequences. Although these methods have good performance, the context information that relates to TF binding sites is still lacking. Research indicates that standard recurrent neural networks (RNN) and its variants have better performance in time-series data compared with other models. In this study, we propose a model, named KEGRU, to identify TF binding sites by combining Bidirectional Gated Recurrent Unit (GRU) network with k-mer embedding. Firstly, DNA sequences are divided into k-mer sequences with a specified length and stride window. And then, we treat each k-mer as a word and pre-trained word representation model though word2vec algorithm. Thirdly, we construct a deep bidirectional GRU model for feature learning and classification. Experimental results have shown that our method has better performance compared with some state-of-the-art methods. Additional experiments about embedding strategy show that k-mer embedding will be helpful to enhance model performance. The robustness of KEGRU is proved by experiments with different k-mer length, stride window and embedding vector dimension.
Collapse
|
49
|
Deng SP, Guo WL. Identifying Key Genes of Liver Cancer by Networking of Multiple Data Sets. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 16:792-800. [PMID: 30296239 DOI: 10.1109/tcbb.2018.2874238] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Liver cancer is one of the deadliest cancers in the world. To find effective therapies for this cancer, it is indispensable to identify key genes, which may play critical roles in the incidence of the liver cancer. To identify key genes of the liver cancer with high accuracy, we integrated multiple microarray gene expression data sets to compute common differentially expressed genes, which will result more accurate than those from individual data set. To find the main functions or pathways that these genes are involved in, some enrichment analyses were performed including functional enrichment analysis, pathway enrichment analysis, and disease association study. Based on these genes, a protein-protein interaction network was constructed and analyzed to identify key genes of the liver cancer by combining the local and global influence of nodes in the network. The identified key genes, such as TOP2A, ESR1, and KMO, have been demonstrated to be key biomarkers of the liver cancer in many publications. All the results suggest that our method can effectively identify key genes of the liver cancer. Moreover, our method can be applied to other types of data sets to select key genes of other complex diseases.
Collapse
|
50
|
Wu P, Wang D. Classification of a DNA Microarray for Diagnosing Cancer Using a Complex Network Based Method. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 16:801-808. [PMID: 30183642 DOI: 10.1109/tcbb.2018.2868341] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Applications that classify DNA microarray expression data are helpful for diagnosing cancer. Many attempts have been made to analyze these data; however, new methods are needed to obtain better results. In this study, a Complex Network (CN) classifier was exploited to implement the classification task. An algorithm was used to initialize the structure, which allowed input variables to be selected over layered connections and different activation functions for different nodes. Then, a hybrid method integrated the Genetic Programming and the Particle Swarm Optimization algorithms was used to identify an optimal structure with the parameters encoded in the classifier. The single CN classifier and an ensemble of CN classifiers were tested on four bench data sets. To ensure diversity of the ensemble classifiers, we constructed a base classifier using different feature sets, i.e., Pearson's correlation, Spearman's correlation, Euclidean distance, Cosine coefficient and the Fisher-ratio. The experimental results suggest that a single classifier can be used to obtain state-of-the-art results and the ensemble yielded better results.
Collapse
|