1
Li J, Zhang Q, Ma S, Fang K, Xu Y. Hierarchical Multi-Label Classification With Gene-Environment Interactions in Disease Modeling. Stat Med 2025; 44:e10330. [PMID: 39865593 DOI: 10.1002/sim.10330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2024] [Revised: 11/25/2024] [Accepted: 12/12/2024] [Indexed: 01/28/2025]
Abstract
In biomedical studies, gene-environment (G-E) interactions have been demonstrated to have important implications for analyzing disease outcomes beyond the main G and main E effects. Many approaches have been developed for G-E interaction analysis, yielding important findings. However, hierarchical multi-label classification, which provides insightful information on disease outcomes, remains unexplored in G-E analysis literature. Moreover, unlabeled data are commonly observed in practical settings but omitted by many existing methods of hierarchical multi-label classification. In this study, we consider a semi-supervised scenario and develop a novel approach for the two-layer hierarchical response with G-E interactions. A two-step penalized estimation is then proposed using an efficient expectation-maximization (EM) algorithm. Simulation shows that it has superior performance in classification and feature selection. The analysis of The Cancer Genome Atlas (TCGA) data on lung cancer demonstrates the practical utility of the proposed method. Overall, this study can fill the important knowledge gap in G-E interaction analysis by providing a widely applicable framework for hierarchical multi-label classification of complex disease outcomes.
Affiliation(s)
- Jingmao Li
  - Department of Statistics and Data Science, School of Economics, Xiamen University, Fujian, China
- Qingzhao Zhang
  - Department of Statistics and Data Science, School of Economics, Xiamen University, Fujian, China
  - The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut
- Kuangnan Fang
  - Department of Statistics and Data Science, School of Economics, Xiamen University, Fujian, China
- Yaqing Xu
  - School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
2
Hazan JM, Amador R, Ali-Nasser T, Lahav T, Shotan SR, Steinberg M, Cohen Z, Aran D, Meiri D, Assaraf YG, Guigó R, Bester AC. Integration of transcription regulation and functional genomic data reveals lncRNA SNHG6's role in hematopoietic differentiation and leukemia. J Biomed Sci 2024; 31:27. [PMID: 38419051 PMCID: PMC10900714 DOI: 10.1186/s12929-024-01015-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Accepted: 02/22/2024] [Indexed: 03/02/2024] Open
Abstract
BACKGROUND Long non-coding RNAs (lncRNAs) are pivotal players in cellular processes, and their unique cell-type specific expression patterns render them attractive biomarkers and therapeutic targets. Yet, the functional roles of most lncRNAs remain enigmatic. To address the need to identify new druggable lncRNAs, we developed a comprehensive approach integrating transcription factor binding data with other genetic features to generate a machine learning model, which we have called INFLAMeR (Identifying Novel Functional LncRNAs with Advanced Machine Learning Resources). METHODS INFLAMeR was trained on high-throughput CRISPR interference (CRISPRi) screens across seven cell lines, and the algorithm was based on 71 genetic features. To validate the predictions, we selected candidate lncRNAs in the human K562 leukemia cell line and determined the impact of their knockdown (KD) on cell proliferation and chemotherapeutic drug response. We further performed transcriptomic analysis for candidate genes. Based on these findings, we assessed the lncRNA small nucleolar RNA host gene 6 (SNHG6) for its role in myeloid differentiation. Finally, we established a mouse K562 leukemia xenograft model to determine whether SNHG6 KD attenuates tumor growth in vivo. RESULTS The INFLAMeR model successfully reconstituted CRISPRi screening data and predicted functional lncRNAs that were previously overlooked. Intensive cell-based and transcriptomic validation of nearly fifty genes in K562 revealed cell type-specific functionality for 85% of the predicted lncRNAs. In this respect, our cell-based and transcriptomic analyses predicted a role for SNHG6 in hematopoiesis and leukemia. Consistent with its predicted role in hematopoietic differentiation, SNHG6 transcription is regulated by hematopoiesis-associated transcription factors. SNHG6 KD reduced the proliferation of leukemia cells and sensitized them to differentiation. 
Treatment of K562 leukemic cells with hemin and PMA, respectively, demonstrated that SNHG6 inhibits red blood cell differentiation but strongly promotes megakaryocyte differentiation. Using a xenograft mouse model, we demonstrate that SNHG6 KD attenuated tumor growth in vivo. CONCLUSIONS Our approach not only improved the identification and characterization of functional lncRNAs through genomic approaches in a cell type-specific manner, but also identified new lncRNAs with roles in hematopoiesis and leukemia. Such approaches can be readily applied to identify novel targets for precision medicine.
Affiliation(s)
- Joshua M Hazan
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Raziel Amador
  - Centre for Genomic Regulation (CRG), Doctor Aiguader 88, 08003, Barcelona, Catalonia, Spain
  - Universitat de Barcelona (UB), Barcelona, Catalonia, Spain
- Tahleel Ali-Nasser
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Tamar Lahav
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Stav Roni Shotan
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Miryam Steinberg
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Ziv Cohen
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
  - The Taub Faculty of Computer Science, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Dvir Aran
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
  - The Taub Faculty of Computer Science, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- David Meiri
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Yehuda G Assaraf
  - The Fred Wyszkowski Cancer Research Laboratory, Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Roderic Guigó
  - Centre for Genomic Regulation (CRG), Doctor Aiguader 88, 08003, Barcelona, Catalonia, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Catalonia, Spain
- Assaf C Bester
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
3
Teng Z, Shi L, Yu H, Wu C, Tian Z. Measuring functional similarity of lncRNAs based on variable K-mer profiles of nucleotide sequences. Methods 2023; 212:21-30. [PMID: 36813016 DOI: 10.1016/j.ymeth.2023.02.009] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/30/2022] [Revised: 02/10/2023] [Accepted: 02/17/2023] [Indexed: 02/22/2023] Open
Abstract
Long non-coding RNAs (lncRNAs) are a class of essential non-coding RNAs longer than 200 nt. Recent studies have indicated that lncRNAs have various complex regulatory functions, which strongly affect many fundamental biological processes. However, measuring the functional similarity between lncRNAs with traditional wet-lab experiments is time-consuming and labor-intensive, so computational approaches have become an effective alternative. Meanwhile, most sequence-based computational methods measure the functional similarity of lncRNAs using fixed-length vector representations, which cannot capture the features of larger k-mers. Therefore, there is a pressing need to improve the prediction of the potential regulatory functions of lncRNAs. In this study, we propose a novel approach called MFSLNC to comprehensively measure the functional similarity of lncRNAs based on variable k-mer profiles of nucleotide sequences. MFSLNC employs dictionary tree storage, which can comprehensively represent lncRNAs with long k-mers. The functional similarity between lncRNAs is evaluated by the Jaccard similarity. MFSLNC verified the similarity between two lncRNAs with the same mechanism, detecting homologous sequence pairs between human and mouse. In addition, MFSLNC is applied to lncRNA-disease association prediction, combined with the association prediction model WKNKN. Moreover, a comparison with classical methods based on lncRNA-mRNA association data showed that our method calculates the similarity of lncRNAs more effectively. The AUC of the prediction is 0.867, a good result compared with similar models.
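The idea of comparing two sequences by the Jaccard similarity of their variable-length k-mer profiles can be illustrated with a minimal Python sketch. This is not the MFSLNC implementation: the abstract describes dictionary tree storage, whereas this sketch uses plain Python sets, and the function names and the default range of k (3 to 6) are assumptions chosen only for illustration.

```python
def kmer_profile(seq, k_min=3, k_max=6):
    """Collect every k-mer of seq for k_min <= k <= k_max (hypothetical defaults)."""
    seq = seq.upper()
    return {
        seq[i:i + k]
        for k in range(k_min, k_max + 1)
        for i in range(len(seq) - k + 1)  # empty when seq is shorter than k
    }


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def kmer_similarity(seq_a, seq_b, k_min=3, k_max=6):
    """Similarity of two nucleotide sequences based on their k-mer profiles."""
    return jaccard(
        kmer_profile(seq_a, k_min, k_max),
        kmer_profile(seq_b, k_min, k_max),
    )
```

A set-based profile ignores how often each k-mer occurs; a trie, as mentioned in the abstract, would mainly change how long k-mers are stored, not the Jaccard formula itself.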
Affiliation(s)
- Zhixia Teng
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
- Linyue Shi
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
- Haihao Yu
  - College of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin 150040, China
- Chengyan Wu
  - Baotou Teacher's College, Inner Mongolia University of Science and Technology, Baotou 014030, China
- Zhen Tian
  - College of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
4
Gao W, Li Y, Hu L. Multilabel Feature Selection With Constrained Latent Structure Shared Term. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:1253-1262. [PMID: 34437074 DOI: 10.1109/tnnls.2021.3105142] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 06/13/2023]
Abstract
High-dimensional multilabel data have increasingly emerged in many application areas, suffering from two noteworthy issues: instances with high-dimensional features and large-scale labels. Multilabel feature selection methods are widely studied to address the issues. Previous multilabel feature selection methods focus on exploring label correlations to guide the feature selection process, ignoring the impact of latent feature structure on label correlations. In addition, one encouraging property regarding correlations between features and labels is that similar features intend to share similar labels. To this end, a latent structure shared (LSS) term is designed, which shares and preserves both latent feature structure and latent label structure. Furthermore, we employ the graph regularization technique to guarantee the consistency between original feature space and latent feature structure space. Finally, we derive the shared latent feature and label structure feature selection (SSFS) method based on the constrained LSS term, and then, an effective optimization scheme with provable convergence is proposed to solve the SSFS method. Better experimental results on benchmark datasets are achieved in terms of multiple evaluation criteria.
5
Feng S, Li H, Qiao J. Hierarchical multi-label classification based on LSTM network and Bayesian decision theory for LncRNA function prediction. Sci Rep 2022; 12:5819. [PMID: 35388048 PMCID: PMC8986818 DOI: 10.1038/s41598-022-09672-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2021] [Accepted: 03/21/2022] [Indexed: 02/01/2023] Open
Abstract
Growing evidence shows that long noncoding RNAs (lncRNAs) play an important role in cellular biological processes at multiple levels, such as gene imprinting, immune response, and genetic regulation, and are closely related to diseases because of their complex and precise control. However, most functions of lncRNAs remain undiscovered. Current computational methods for exploring lncRNA functions can avoid high-throughput experiments, but they usually focus on the construction of similarity networks and ignore the certain directed acyclic graph (DAG) formed by gene ontology annotations. In this paper, we view the function annotation work as a hierarchical multilabel classification problem and design a method HLSTMBD for classification with DAG-structured labels. With the help of a mathematical model based on Bayesian decision theory, the HLSTMBD algorithm is implemented with the long-short term memory network and a hierarchical constraint method DAGLabel. Compared with other state-of-the-art algorithms, the results on GOA-lncRNA datasets show that the proposed method can efficiently and accurately complete the label prediction work.
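Hierarchical multilabel prediction over a DAG of Gene Ontology terms usually has to respect the true-path rule: a term should not be predicted more confidently than any of its parent terms. The sketch below shows one generic way to enforce that on raw scores. It is not the DAGLabel method or the Bayesian decision model from the abstract; the function name, the score dictionary, and the parent map are all hypothetical.

```python
def enforce_hierarchy(scores, parents):
    """Cap each label's score by the (already capped) scores of its parents.

    scores:  dict mapping label -> predicted probability
    parents: dict mapping label -> list of parent labels (assumed acyclic)
    """
    capped = {}

    def resolve(label):
        # Memoised recursion: a label is resolved after all of its ancestors.
        if label not in capped:
            value = scores[label]
            for parent in parents.get(label, []):
                value = min(value, resolve(parent))
            capped[label] = value
        return capped[label]

    for label in scores:
        resolve(label)
    return capped


# Hypothetical four-term DAG: "c" has two parents, "a" and "b".
example_scores = {"root": 0.9, "a": 0.95, "b": 0.4, "c": 0.8}
example_parents = {"a": ["root"], "b": ["root"], "c": ["a", "b"]}
consistent = enforce_hierarchy(example_scores, example_parents)
```

Taking the minimum over parents is only one possible correction; other schemes raise parent scores instead, and the choice affects precision and recall differently.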
Affiliation(s)
- Shou Feng
  - College of Information and Communication Engineering, Harbin Engineering University, Harbin, 150001, China
  - Ministry of Industry and Information Technology, Key Laboratory of Advanced Marine Communication and Information Technology, Harbin, 150001, China
- Huiying Li
  - Harbin Institute of Technology, School of Electronic and Information Engineering, Harbin, 150001, China
- Jiaqing Qiao
  - Harbin Institute of Technology, School of Electronic and Information Engineering, Harbin, 150001, China
6
V SKP, Thahsin A, M M, G G. A Heterogeneous Information Network Model for Long Non-Coding RNA Function Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:255-266. [PMID: 32750859 DOI: 10.1109/tcbb.2020.3000518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Exciting information on the functional roles played by long non-coding RNA (lncRNA) has drawn substantial research attention these days. With the advent of techniques such as RNA-Seq, thousands of lncRNAs are identified in very short time spans. However, due to the poor annotation rate, only a few of them are functionally characterised. The wet lab experiments to elucidate lncRNA functions are challenging, slow and sometimes prohibitively expensive. This work attempts to solve the crucial problem of developing computational methods to predict lncRNA functions. The model presented here predicts the functions of lncRNAs by making use of a meta-path based measure, AvgSim, on a Heterogeneous Information Network (HIN). The network is constructed from existing protein and function association data of lncRNAs, lncRNA co-expression data and protein-protein interaction data. Out of the 2,758 lncRNAs considered for the experiment, the proposed method predicts possible functions for 2,695 lncRNAs with an accuracy of 73.68 percent and is found to perform better than the other state-of-the-art approaches on an independent test set. A case study of two well-known lncRNAs (HOTAIR and H19) is conducted and the associated functions are identified. The results were validated using experimental evidence from the literature. The script and data used for the implementation of the model are freely available at: http://bdbl.nitc.ac.in/LncFunPred/index.html.
7
Deng L, Li W, Zhang J. LDAH2V: Exploring Meta-Paths Across Multiple Networks for lncRNA-Disease Association Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1572-1581. [PMID: 31725386 DOI: 10.1109/tcbb.2019.2946257] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Indexed: 06/10/2023]
Abstract
Accumulating evidence has demonstrated dysfunctions of long non-coding RNAs (lncRNAs) are involved in various complex human diseases. However, even today, the relationships between lncRNAs and diseases remain unknown in most cases. Developing effective computational approaches to identify potential lncRNA-disease associations has become a hot topic. Existing network-based approaches are usually focused on the intrinsic features of lncRNAs and diseases but ignore the heterogeneous information of biological networks. Considering the limitations in previous methods, we propose LDAH2V, an efficient computational framework for predicting potential lncRNA-disease associations. LDAH2V uses the HIN2Vec to calculate the meta-path and feature vector for each lncRNA-disease pair in the heterogeneous information network (HIN), which consists of lncRNA similarity network, disease similarity network, miRNA similarity network, and the associations between them. Then, a Gradient Boosting Tree (GBT) classifier to predict lncRNA-disease associations is built with the feature vectors. The results show that LDAH2V performs significantly better than the four existing state-of-the-art methods and gains an AUC of 0.97 in the 10-fold cross-validation test. Furthermore, case studies of colon cancer and ovarian cancer-related lncRNAs have been confirmed in related databases and medical literature.
8
GAPGOM-an R package for gene annotation prediction using GO Metrics. BMC Res Notes 2021; 14:162. [PMID: 33931103 PMCID: PMC8086094 DOI: 10.1186/s13104-021-05580-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/07/2020] [Accepted: 04/20/2021] [Indexed: 11/10/2022] Open
Abstract
OBJECTIVE Properties of gene products can be described or annotated with Gene Ontology (GO) terms. But for many genes we have limited information about their products, for example with respect to function. This is particularly true for long non-coding RNAs (lncRNAs), where the function in most cases is unknown. However, it has been shown that annotation as described by GO terms to some extent can be predicted by enrichment analysis on properties of co-expressed genes. RESULTS GAPGOM integrates two relevant algorithms, lncRNA2GOA and TopoICSim, into a user-friendly R package. Here lncRNA2GOA does annotation prediction by co-expression, whereas TopoICSim estimates similarity between GO graphs, which can be used for benchmarking of prediction performance, but also for comparison of GO graphs in general. The package provides an improved implementation of the original tools, with substantial improvements in performance and documentation, unified interfaces, and additional features.
9
Li GQ, Wang YK, Zhou H, Jin LG, Wang CY, Albahde M, Wu Y, Li HY, Zhang WK, Li BH, Ye ZM. Application of Immune Infiltration Signature and Machine Learning Model in the Differential Diagnosis and Prognosis of Bone-Related Malignancies. Front Cell Dev Biol 2021; 9:630355. [PMID: 33937231 PMCID: PMC8082117 DOI: 10.3389/fcell.2021.630355] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 11/17/2020] [Accepted: 03/15/2021] [Indexed: 01/16/2023] Open
Abstract
Bone-related malignancies, such as osteosarcoma, Ewing's sarcoma, multiple myeloma, and cancer bone metastases have similar histological context, but they are distinct in origin and biological behavior. We hypothesize that a distinct immune infiltrative microenvironment exists in these four most common malignant bone-associated tumors and can be used for tumor diagnosis and patient prognosis. After sample cleaning, data integration, and batch effect removal, we used 22 publicly available datasets to draw out the tumor immune microenvironment using the ssGSEA algorithm. The diagnostic model was developed using the random forest. Further statistical analysis of the immune microenvironment and clinical data of patients with osteosarcoma and Ewing's sarcoma was carried out. The results suggested significant differences in the microenvironment of bone-related tumors, and the diagnostic accuracy of the model was higher than 97%. Also, high infiltration of multiple immune cells in Ewing's sarcoma was suggestive of poor patient prognosis. Meanwhile, increased infiltration of macrophages and B cells suggested a better prognosis for patients with osteosarcoma, and effector memory CD8 T cells and type 2 T helper cells correlated with patients' chemotherapy responsiveness and tumor metastasis. Our study revealed that the random forest diagnostic model based on immune infiltration can accurately perform the differential diagnosis of bone-related malignancies. The immune microenvironment of osteosarcoma and Ewing's sarcoma has an important impact on patient prognosis. Suppressing the highly inflammatory environment of Ewing's sarcoma and promoting macrophage and B cell infiltration may have good potential to be a novel adjuvant treatment option for osteosarcoma and Ewing's sarcoma.
Affiliation(s)
- Guo-Qi Li
  - Department of Orthopedics, Musculoskeletal Tumor Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
  - Orthopedics Research Institute of Zhejiang University, Hangzhou, China
  - Key Laboratory of Motor System Disease Research and Precision Therapy of Zhejiang Province, Hangzhou, China
- Yi-Kai Wang
  - Department of Orthopedics, Musculoskeletal Tumor Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
  - Orthopedics Research Institute of Zhejiang University, Hangzhou, China
- Hao Zhou
  - Department of Orthopedics, Musculoskeletal Tumor Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
  - Orthopedics Research Institute of Zhejiang University, Hangzhou, China
- Lin-Guang Jin
  - Department of Orthopedics, Musculoskeletal Tumor Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
  - Orthopedics Research Institute of Zhejiang University, Hangzhou, China
- Chun-Yu Wang
  - Department of Orthopedics, Musculoskeletal Tumor Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
  - Orthopedics Research Institute of Zhejiang University, Hangzhou, China
- Mugahed Albahde
  - Department of Hepatobiliary and Pancreatic Surgery, School of Medicine, The Second Affiliated Hospital, Zhejiang University, Hangzhou, China
- Yan Wu
  - Department of Orthopedics, Musculoskeletal Tumor Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
  - Orthopedics Research Institute of Zhejiang University, Hangzhou, China
- Heng-Yuan Li
  - Department of Orthopedics, Musculoskeletal Tumor Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
  - Orthopedics Research Institute of Zhejiang University, Hangzhou, China
- Wen-Kan Zhang
  - Department of Orthopedics, Musculoskeletal Tumor Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
  - Orthopedics Research Institute of Zhejiang University, Hangzhou, China
- Bing-Hao Li
  - Department of Orthopedics, Musculoskeletal Tumor Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
  - Orthopedics Research Institute of Zhejiang University, Hangzhou, China
  - Key Laboratory of Motor System Disease Research and Precision Therapy of Zhejiang Province, Hangzhou, China
- Zhao-Ming Ye
  - Department of Orthopedics, Musculoskeletal Tumor Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
  - Orthopedics Research Institute of Zhejiang University, Hangzhou, China
  - Key Laboratory of Motor System Disease Research and Precision Therapy of Zhejiang Province, Hangzhou, China
10
Peng J, Lu G, Shang X. A Survey of Network Representation Learning Methods for Link Prediction in Biological Network. Curr Pharm Des 2021; 26:3076-3084. [PMID: 31951161 DOI: 10.2174/1381612826666200116145057] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/12/2019] [Accepted: 01/09/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND Networks are powerful resources for describing complex systems. Link prediction is an important issue in network analysis and has important practical application value. Network representation learning has proven to be useful for network analysis, especially for link prediction tasks. OBJECTIVE To review the application of network representation learning on link prediction in a biological network, we summarize recent methods for link prediction in a biological network and discuss the application and significance of network representation learning in link prediction task. METHOD & RESULTS We first introduce the widely used link prediction algorithms, then briefly introduce the development of network representation learning methods, focusing on a few widely used methods, and their application in biological network link prediction. Existing studies demonstrate that using network representation learning to predict links in biological networks can achieve better performance. In the end, some possible future directions have been discussed.
Affiliation(s)
- Jiajie Peng
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Guilin Lu
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Xuequn Shang
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
11
Deep neural networks for inferring binding sites of RNA-binding proteins by using distributed representations of RNA primary sequence and secondary structure. BMC Genomics 2020; 21:866. [PMID: 33334313 PMCID: PMC7745412 DOI: 10.1186/s12864-020-07239-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND RNA binding proteins (RBPs) play a vital role in post-transcriptional processes in all eukaryotes, such as splicing regulation, mRNA transport, and modulation of mRNA translation and decay. The identification of RBP binding sites is a crucial step in understanding the biological mechanism of post-transcriptional gene regulation. However, the determination of RBP binding sites on a large scale is a challenging task due to the high cost of biochemical assays. Quite a number of studies have exploited machine learning methods to predict binding sites. In particular, deep learning is increasingly used in the bioinformatics field by virtue of its ability to learn generalized representations from DNA and protein sequences. RESULTS In this paper, we implemented a novel deep neural network model, DeepRKE, which combines primary RNA sequence and secondary structure information to effectively predict RBP binding sites. Specifically, we used a word embedding algorithm to extract features of RNA sequences and secondary structures, i.e., distributed representations of k-mer sequences rather than traditional one-hot encoding. The distributed representations are taken as input to convolutional neural networks (CNN) and bidirectional long short-term memory networks (BiLSTM) to identify RBP binding sites. Our results show that DeepRKE outperforms existing counterpart methods on two large-scale benchmark datasets. CONCLUSIONS Our extensive experimental results show that DeepRKE is an efficacious tool for predicting RBP binding sites. The distributed representations of RNA sequences and secondary structures can effectively detect the latent relationship and similarity between k-mers, and thus improve the predictive performance. The source code of DeepRKE is available at https://github.com/youzhiliu/DeepRKE/. SUPPLEMENTARY INFORMATION The online version contains supplementary material available at (doi:10.1186/s12864-020-07239-w).
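The contrast the abstract draws between one-hot encoding and k-mer "words" for a word embedding model can be made concrete with a short Python sketch. It covers only the input preparation step, not DeepRKE itself: the function names, the k = 3 window, the stride, and the ACGU alphabet are assumptions for illustration, and the embedding, CNN, and BiLSTM stages are omitted.

```python
def one_hot(seq, alphabet="ACGU"):
    """Encode each nucleotide as a 0/1 vector over the alphabet.

    Raises KeyError for characters outside the assumed alphabet.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        row[index[base]] = 1
        rows.append(row)
    return rows


def kmer_tokens(seq, k=3, stride=1):
    """Split a sequence into overlapping k-mer 'words' (hypothetical k and stride)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# The token list plays the role of a sentence; a word embedding model
# trained on many such sentences would map each k-mer to a dense vector.
sentence = kmer_tokens("ACGUAC", k=3)
```

One-hot rows treat every nucleotide independently, while k-mer tokens let an embedding model place k-mers that occur in similar contexts close together, which is the property the abstract credits for the improved predictions.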
12
Yang B, Xin T, Han M, Zhao X, Chen J. Structured feature for multi-label learning. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.04.134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
13
14
Deng S, Sun Y, Zhao T, Hu Y, Zang T. A Review of Drug Side Effect Identification Methods. Curr Pharm Des 2020; 26:3096-3104. [PMID: 32532187 DOI: 10.2174/1381612826666200612163819] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/05/2020] [Accepted: 05/18/2020] [Indexed: 11/22/2022]
Abstract
Drug side effects have become an important indicator for evaluating the safety of drugs. There are two main factors in the frequent occurrence of drug safety problems; on the one hand, the clinical understanding of drug side effects is insufficient, leading to frequent adverse drug reactions, while on the other hand, due to the long-term period and complexity of clinical trials, side effects of approved drugs on the market cannot be reported in a timely manner. Therefore, many researchers have focused on developing methods to identify drug side effects. In this review, we summarize the methods of identifying drug side effects and common databases in this field. We classified methods of identifying side effects into four categories: biological experimental, machine learning, text mining and network methods. We point out the key points of each kind of method. In addition, we also explain the advantages and disadvantages of each method. Finally, we propose future research directions.
Affiliation(s)
- Shuai Deng
  - College of Science, Beijing Forestry University, Beijing, China
- Yige Sun
  - Microbiology Department, Harbin Medical University, Harbin, 150081, China
- Tianyi Zhao
  - School of Life Science and Technology, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Yang Hu
  - School of Life Science and Technology, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Tianyi Zang
  - School of Life Science and Technology, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
15
16
Wang C, Zhang J, Wang X, Han K, Guo M. Pathogenic Gene Prediction Algorithm Based on Heterogeneous Information Fusion. Front Genet 2020; 11:5. [PMID: 32117433 PMCID: PMC7010852 DOI: 10.3389/fgene.2020.00005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 11/26/2019] [Accepted: 01/06/2020] [Indexed: 12/23/2022] Open
Abstract
Complex diseases seriously affect people's physical and mental health. The discovery of disease-causing genes has become a target of research. With the emergence of bioinformatics and the rapid development of biotechnology, to overcome the inherent difficulties of the long experimental period and high cost of traditional biomedical methods, researchers have proposed many gene prioritization algorithms that use a large amount of biological data to mine pathogenic genes. However, because the currently known gene-disease association matrix is still very sparse and lacks evidence that genes and diseases are unrelated, there are limits to the predictive performance of gene prioritization algorithms. Based on the hypothesis that functionally related gene mutations may lead to similar disease phenotypes, this paper proposes a PU induction matrix completion algorithm based on heterogeneous information fusion (PUIMCHIF) to predict candidate genes involved in the pathogenicity of human diseases. On the one hand, PUIMCHIF uses different compact feature learning methods to extract features of genes and diseases from multiple data sources, making up for the lack of sparse data. On the other hand, based on the prior knowledge that most of the unknown gene-disease associations are unrelated, we use the PU-Learning strategy to treat the unknown unlabeled data as negative examples for biased learning. The experimental results of the PUIMCHIF algorithm regarding the three indexes of precision, recall, and mean percentile ranking (MPR) were significantly better than those of other algorithms. In the top 100 global prediction analysis of multiple genes and multiple diseases, the probability of recovering true gene associations using PUIMCHIF reached 50% and the MPR value was 10.94%. The PUIMCHIF algorithm has higher priority than those from other methods, such as IMC and CATAPULT.
Affiliation(s)
- Chunyu Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Jie Zhang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Xueping Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Ke Han
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
  - Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing University of Civil Engineering and Architecture, Beijing, China
|
17
|
Abstract
Background: A collection of disease-associated data contributes to study the association between diseases. Discovering closely related diseases plays a crucial role in revealing their common pathogenic mechanisms. This might further imply treatment that can be appropriated from one disease to another. During the past decades, a number of approaches for calculating disease similarity have been developed. However, most of them are designed to take advantage of single or few data sources, which results in their low accuracy.
Methods: In this paper, we propose a novel method, called MultiSourcDSim, to calculate disease similarity by integrating multiple data sources, namely, gene-disease associations, GO biological process-disease associations and symptom-disease associations. Firstly, we establish three disease similarity networks according to the three disease-related data sources respectively. Secondly, the representation of each node is obtained by integrating the three small disease similarity networks. In the end, the learned representations are applied to calculate the similarity between diseases.
Results: Our approach shows the best performance compared to the other three popular methods. Besides, the similarity network built by MultiSourcDSim suggests that our method can also uncover the latent relationships between diseases.
Conclusions: MultiSourcDSim is an efficient approach to predict similarity between diseases.
Affiliation(s)
- Lei Deng
  - School of Computer Science and Engineering, Central South University, Changsha, 410075 China
- Danyi Ye
  - School of Computer Science and Engineering, Central South University, Changsha, 410075 China
- Junmin Zhao
  - School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan, 467000 China
- Jingpu Zhang
  - School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan, 467000 China
|
18
|
Zheng N, Wang K, Zhan W, Deng L. Targeting Virus-host Protein Interactions: Feature Extraction and Machine Learning Approaches. Curr Drug Metab 2019; 20:177-184. [PMID: 30156155 DOI: 10.2174/1389200219666180829121038] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 01/19/2018] [Revised: 05/21/2018] [Accepted: 08/02/2018] [Indexed: 01/15/2023]
Abstract
Background: Targeting critical viral-host Protein-Protein Interactions (PPIs) has enormous application prospects for therapeutics. Using experimental methods to evaluate all possible virus-host PPIs is labor-intensive and time-consuming. Recent growth in computational identification of virus-host PPIs provides new opportunities for gaining biological insights, including applications in disease control. We provide an overview of recent computational approaches for studying virus-host PPI interactions.
Methods: In this review, a variety of computational methods for virus-host PPIs prediction have been surveyed. These methods are categorized based on the features they utilize and different machine learning algorithms including classical and novel methods.
Results: We describe the pivotal and representative features extracted from relevant sources of biological data, mainly include sequence signatures, known domain interactions, protein motifs and protein structure information. We focus on state-of-the-art machine learning algorithms that are used to build binary prediction models for the classification of virus-host protein pairs and discuss their abilities, weakness and future directions.
Conclusion: The findings of this review confirm the importance of computational methods for finding the potential protein-protein interactions between virus and host. Although there has been significant progress in the prediction of virus-host PPIs in recent years, there is a lot of room for improvement in virus-host PPI prediction.
Affiliation(s)
- Nantao Zheng
  - School of Software, Central South University, Changsha, 410075, China
- Kairou Wang
  - School of Software, Central South University, Changsha, 410075, China
- Weihua Zhan
  - School of Electronics and Computer Science, Zhejiang Wanli University, Ningbo 315100, China
- Lei Deng
  - School of Software, Central South University, Changsha, 410075, China
  - Shanghai Key Lab of Intelligent Information Processing, Shanghai 200433, China
|
19
|
Chen X, Shi W, Deng L. Prediction of Disease Comorbidity Using HeteSim Scores based on Multiple Heterogeneous Networks. Curr Gene Ther 2019; 19:232-241. [DOI: 10.2174/1566523219666190917155959] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 04/30/2019] [Revised: 06/14/2019] [Accepted: 06/16/2019] [Indexed: 12/25/2022]
Abstract
Background: Accumulating experimental studies have indicated that disease comorbidity causes additional pain to patients and leads to the failure of standard treatments compared to patients who have a single disease. Therefore, accurate prediction of potential comorbidity is essential to design more efficient treatment strategies. However, only a few disease comorbidities have been discovered in the clinic.
Objective: In this work, we propose PCHS, an effective computational method for predicting disease comorbidity.
Materials and Methods: We utilized the HeteSim measure to calculate the relatedness score for different disease pairs in the global heterogeneous network, which integrates six networks based on biological information, including disease-disease associations, drug-drug interactions, protein-protein interactions and associations among them. We built the prediction model using the Support Vector Machine (SVM) based on the HeteSim scores.
Results and Conclusion: The results showed that PCHS performed significantly better than previous state-of-the-art approaches and achieved an AUC score of 0.90 in 10-fold cross-validation. Furthermore, some of our predictions have been verified in literatures, indicating the effectiveness of our method.
Affiliation(s)
- Xuegong Chen
  - School of Computer Science and Engineering, Central South University, Changsha, 410075, China
- Wanwan Shi
  - School of Computer Science and Engineering, Central South University, Changsha, 410075, China
- Lei Deng
  - School of Computer Science and Engineering, Central South University, Changsha, 410075, China
|
20
|
Mora A. Gene set analysis methods for the functional interpretation of non-mRNA data—Genomic range and ncRNA data. Brief Bioinform 2019; 21:1495-1508. [DOI: 10.1093/bib/bbz090] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 03/20/2019] [Revised: 05/30/2019] [Accepted: 06/28/2019] [Indexed: 12/31/2022] Open
Abstract
Gene set analysis (GSA) is one of the methods of choice for analyzing the results of current omics studies; however, it has been mainly developed to analyze mRNA (microarray, RNA-Seq) data. The following review includes an update regarding general methods and resources for GSA and then emphasizes GSA methods and tools for non-mRNA omics datasets, specifically genomic range data (ChIP-Seq, SNP and methylation) and ncRNA data (miRNAs, lncRNAs and others). In the end, the state of the GSA field for non-mRNA datasets is discussed, and some current challenges and trends are highlighted, especially the use of network approaches to face complexity issues.
Affiliation(s)
- Antonio Mora
  - Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of Biomedicine and Health - Chinese Academy of Sciences
|
21
|
Nakano FK, Lietaert M, Vens C. Machine learning for discovering missing or wrong protein function annotations: A comparison using updated benchmark datasets. BMC Bioinformatics 2019; 20:485. [PMID: 31547800 PMCID: PMC6755698 DOI: 10.1186/s12859-019-3060-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 05/07/2019] [Accepted: 08/27/2019] [Indexed: 12/22/2022] Open
Abstract
Background: A massive amount of proteomic data is generated on a daily basis, nonetheless annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus train their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we also evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information.
Results: The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic algorithms, was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, HMC-GA performed better for detecting removed annotations. However, in this evaluation, there were less significant differences among the methods.
Conclusions: The experiments have showed that protein function prediction is a very challenging task which should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies, nonetheless the old versions of the datasets should not be disregarded since other tasks in machine learning could benefit from them.
Affiliation(s)
- Felipe Kenji Nakano
  - KU Leuven, Campus KULAK - Department of Public Health and Primary Care, Etienne Sabbelaan 53, Kortrijk, 8500 Belgium
  - ITEC - imec, Etienne Sabbelaan 51, Kortrijk, 8500 Belgium
- Mathias Lietaert
  - Howest University of Applied Sciences, Campus Brugge Station, Rijselstraat 5, Brugge, 8200 Belgium
- Celine Vens
  - KU Leuven, Campus KULAK - Department of Public Health and Primary Care, Etienne Sabbelaan 53, Kortrijk, 8500 Belgium
  - ITEC - imec, Etienne Sabbelaan 51, Kortrijk, 8500 Belgium
|
22
|
Core-reviewer recommendation based on Pull Request topic model and collaborator social network. Soft comput 2019. [DOI: 10.1007/s00500-019-04217-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 10/26/2022]
|
23
|
Fusion of multiple heterogeneous networks for predicting circRNA-disease associations. Sci Rep 2019; 9:9605. [PMID: 31270357 PMCID: PMC6610109 DOI: 10.1038/s41598-019-45954-x] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 01/22/2019] [Accepted: 06/18/2019] [Indexed: 12/20/2022] Open
Abstract
Circular RNAs (circRNAs) are a newly identified type of non-coding RNA (ncRNA) that plays crucial roles in many cellular processes and human diseases, and are potential disease biomarkers and therapeutic targets in human diseases. However, experimentally verified circRNA-disease associations are very rare. Hence, developing an accurate and efficient method to predict the association between circRNA and disease may be beneficial to disease prevention, diagnosis, and treatment. Here, we propose a computational method named KATZCPDA, which is based on the KATZ method and the integrations among circRNAs, proteins, and diseases to predict circRNA-disease associations. KATZCPDA not only verifies existing circRNA-disease associations but also predicts unknown associations. As demonstrated by leave-one-out and 10-fold cross-validation, KATZCPDA achieves AUC values of 0.959 and 0.958, respectively. The performance of KATZCPDA was substantially higher than those of previously developed network-based methods. To further demonstrate the effectiveness of KATZCPDA, we apply KATZCPDA to predict the associated circRNAs of Colorectal cancer, glioma, breast cancer, and Tuberculosis. The results illustrated that the predicted circRNA-disease associations could rank the top 10 of the experimentally verified associations.
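The KATZ method named in the abstract above scores a pair of nodes by counting the walks between them, with longer walks damped by a factor `beta`. The following is a minimal sketch of a truncated Katz index on a toy network, not the KATZCPDA implementation. The five-node circRNA/protein/disease graph, the damping value `beta=0.1`, and the walk-length cutoff `max_len=3` are all hypothetical choices made for illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def katz_scores(adj, beta=0.1, max_len=3):
    """Truncated Katz index: sum over l = 1..max_len of beta**l * adj**l."""
    n = len(adj)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # holds adj**length, starting at length 1
    for length in range(1, max_len + 1):
        weight = beta ** length
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        if length < max_len:
            power = mat_mul(power, adj)
    return scores


# Hypothetical toy network: nodes 0-1 are circRNAs, 2-3 proteins, 4 a disease.
# No circRNA-disease edge is known; both circRNAs reach the disease via proteins.
edges = [(0, 2), (1, 2), (1, 3), (2, 4), (3, 4)]
adj = [[0] * 5 for _ in range(5)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

scores = katz_scores(adj)
# circRNA 1 has two protein-mediated walks to the disease, circRNA 0 only one,
# so circRNA 1 receives the higher association score.
print(scores[0][4], scores[1][4])
```

Ranking candidate circRNAs by their score against a disease node then yields a prioritized list. Any weighting of the individual circRNA, protein, and disease sub-networks would be an additional design choice that the abstract does not specify.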
|
24
|
Rabajante JF, Del Rosario RCH. Modeling Long ncRNA-Mediated Regulation in the Mammalian Cell Cycle. Methods Mol Biol 2019; 1912:427-445. [PMID: 30635904 DOI: 10.1007/978-1-4939-8982-9_17] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/23/2023]
Abstract
Long noncoding RNAs (lncRNAs) are transcripts longer than 200 nucleotides that are not translated into proteins. They have recently gained widespread attention due to the finding that tens of thousands of lncRNAs reside in the human genome, and due to an increasing number of lncRNAs that are found to be associated with disease. Some lncRNAs, including disease-associated ones, play different roles in regulating the cell cycle. Mathematical models of the cell cycle have been useful in better understanding this biological system, such as how it could be robust to some perturbations and how the cell cycle checkpoints could act as a switch. Here, we discuss mathematical modeling techniques for studying lncRNA regulation of the mammalian cell cycle. We present examples on how modeling via network analysis and differential equations can provide novel predictions toward understanding cell cycle regulation in response to perturbations such as DNA damage.
Affiliation(s)
- Jomar F Rabajante
  - Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, Laguna, Philippines
- Ricardo C H Del Rosario
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
|
25
|
Deng L, Sui Y, Zhang J. XGBPRH: Prediction of Binding Hot Spots at Protein-RNA Interfaces Utilizing Extreme Gradient Boosting. Genes (Basel) 2019; 10:genes10030242. [PMID: 30901953 PMCID: PMC6471955 DOI: 10.3390/genes10030242] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 01/16/2019] [Revised: 03/14/2019] [Accepted: 03/15/2019] [Indexed: 01/24/2023] Open
Abstract
Hot spot residues at protein-RNA complexes are vitally important for investigating the underlying molecular recognition mechanism. Accurately identifying protein-RNA binding hot spots is critical for drug designing and protein engineering. Although some progress has been made by utilizing various available features and a series of machine learning approaches, these methods are still in the infant stage. In this paper, we present a new computational method named XGBPRH, which is based on an eXtreme Gradient Boosting (XGBoost) algorithm and can effectively predict hot spot residues in protein-RNA interfaces utilizing an optimal set of properties. Firstly, we download 47 protein-RNA complexes and calculate a total of 156 sequence, structure, exposure, and network features. Next, we adopt a two-step feature selection algorithm to extract a combination of 6 optimal features from the combination of these 156 features. Compared with the state-of-the-art approaches, XGBPRH achieves better performances with an area under the ROC curve (AUC) score of 0.817 and an F1-score of 0.802 on the independent test set. Meanwhile, we also apply XGBPRH to two case studies. The results demonstrate that the method can effectively identify novel energy hotspots.
Affiliation(s)
- Lei Deng
  - School of Computer Science and Engineering, Central South University, Changsha 410075, China
- Yuanchao Sui
  - School of Computer Science and Engineering, Central South University, Changsha 410075, China
- Jingpu Zhang
  - School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan 467000, China
|
26
|
Li Y, Niu M, Zou Q. ELM-MHC: An Improved MHC Identification Method with Extreme Learning Machine Algorithm. J Proteome Res 2019; 18:1392-1401. [DOI: 10.1021/acs.jproteome.9b00012] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Indexed: 01/28/2023]
Affiliation(s)
- Yanjuan Li
  - School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
- Mengting Niu
  - School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
27
|
Deng L, Wang J, Zhang J. Predicting Gene Ontology Function of Human MicroRNAs by Integrating Multiple Networks. Front Genet 2019; 10:3. [PMID: 30761178 PMCID: PMC6361788 DOI: 10.3389/fgene.2019.00003] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 11/02/2018] [Accepted: 01/07/2019] [Indexed: 12/15/2022] Open
Abstract
MicroRNAs (miRNAs) have been demonstrated to play significant biological roles in many human biological processes. Inferring the functions of miRNAs is an important strategy for understanding disease pathogenesis at the molecular level. In this paper, we propose an integrated model, PmiRGO, to infer the gene ontology (GO) functions of miRNAs by integrating multiple data sources, including the expression profiles of miRNAs, miRNA-target interactions, and protein-protein interactions (PPI). PmiRGO starts by building a global network consisting of three networks. Then, it employs DeepWalk to learn latent representations as network features of the global heterogeneous network. Finally, the SVM-based models are applied to label the GO terms of miRNAs. The experimental results show that PmiRGO has a significantly better performance than existing state-of-the-art methods in terms of Fmax. A case study further demonstrates the feasibility of PmiRGO to annotate the potential functions of miRNAs.
Affiliation(s)
- Lei Deng
  - School of Software, Central South University, Changsha, China
- Jiacheng Wang
  - School of Software, Central South University, Changsha, China
- Jingpu Zhang
  - School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan, China
|
28
|
Zhao J, Ma X. Multiple Partial Regularized Nonnegative Matrix Factorization for Predicting Ontological Functions of lncRNAs. Front Genet 2019; 9:685. [PMID: 30728826 PMCID: PMC6351489 DOI: 10.3389/fgene.2018.00685] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/29/2018] [Accepted: 12/10/2018] [Indexed: 02/02/2023] Open
Abstract
Long non-coding RNAs (lncRNAs) are critical regulators of biological processes and are highly related to complex diseases. Even though next-generation sequencing technology facilitates the discovery of a great number of lncRNAs, knowledge about the functions of lncRNAs is limited. Thus, it is promising to predict the functions of lncRNAs, which sheds light on the mechanisms of complex diseases. The current algorithms predict the functions of lncRNAs by using the features of protein-coding genes. Generally speaking, these algorithms fuse heterogeneous genomic data to construct lncRNA-gene associations via a linear combination, which cannot fully characterize the function-lncRNA relations. To overcome this issue, we present a nonnegative matrix factorization algorithm with multiple partial regularization (MPrNMF) to predict the functions of lncRNAs without fusing the heterogeneous genomic data. In detail, for each type of genomic data, we construct the lncRNA-gene associations, resulting in multiple associations. The proposed method integrates them separately via a regularization strategy, rather than fusing them into a single type of association. The results demonstrate that the proposed algorithm outperforms state-of-the-art methods based on network analysis. The model and algorithm provide an effective way to explore the functions of lncRNAs.
Affiliation(s)
- Jianbang Zhao
  - College of Information Engineering, Northwest Agriculture & Forestry University, Xianyang, China
- Xiaoke Ma
  - School of Computer Science and Technology, Xidian University, Xi'an, China
|
29
|
Ehsani R, Drabløs F. Measures of co-expression for improved function prediction of long non-coding RNAs. BMC Bioinformatics 2018; 19:533. [PMID: 30567492 PMCID: PMC6300029 DOI: 10.1186/s12859-018-2546-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 08/07/2018] [Accepted: 11/28/2018] [Indexed: 02/01/2023] Open
Abstract
Background: Almost 16,000 human long non-coding RNA (lncRNA) genes have been identified in the GENCODE project. However, the function of most of them remains to be discovered. The function of lncRNAs and other novel genes can be predicted by identifying significantly enriched annotation terms in already annotated genes that are co-expressed with the lncRNAs. However, such approaches are sensitive to the methods that are used to estimate the level of co-expression.
Results: We have tested and compared two well-known statistical metrics (Pearson and Spearman) and two geometrical metrics (Sobolev and Fisher) for identification of the co-expressed genes, using experimental expression data across 19 normal human tissues. We have also used a benchmarking approach based on semantic similarity to evaluate how well these methods are able to predict annotation terms, using a well-annotated set of protein-coding genes.
Conclusion: This work shows that geometrical metrics, in particular in combination with the statistical metrics, will predict annotation terms more efficiently than traditional approaches. Tests on selected lncRNAs confirm that it is possible to predict the function of these genes given a reliable set of expression data. The software used for this investigation is freely available.
Affiliation(s)
- Rezvan Ehsani
  - Department of Mathematics, University of Zabol, Zabol, Iran
  - Department of Bioinformatics, University of Zabol, Zabol, Iran
- Finn Drabløs
  - Department of Clinical and Molecular Medicine, NTNU - Norwegian University of Science and Technology, NO-7491, Trondheim, Norway
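The Ehsani and Drabløs abstract above notes that co-expression-based function prediction is sensitive to the metric used. As a rough illustration of why, the sketch below computes the two statistical metrics it names, Pearson and Spearman, in plain Python. The five-value expression profiles are invented for the example, and the Sobolev and Fisher metrics from the paper are not attempted here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical expression of one lncRNA and one coding gene across five tissues.
# The relationship is monotone but strongly nonlinear.
lncrna_expr = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_expr = [v ** 3 for v in lncrna_expr]

r_pearson = pearson(lncrna_expr, gene_expr)
r_spearman = spearman(lncrna_expr, gene_expr)
print(r_pearson, r_spearman)
```

On this toy pair the rank-based Spearman value reaches 1 while Pearson stays below it, so a fixed co-expression cutoff could include or exclude the same gene depending on the metric.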
|
30
|
Zhang J, Zou S, Deng L. Gene Ontology-based function prediction of long non-coding RNAs using bi-random walk. BMC Med Genomics 2018; 11:99. [PMID: 30453964 PMCID: PMC6245587 DOI: 10.1186/s12920-018-0414-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 02/01/2023] Open
Abstract
Background: With the development of sequencing technology, more and more long non-coding RNAs (lncRNAs) have been identified. Some lncRNAs have been confirmed to play an important role in the process of development through the dosage compensation effect, epigenetic regulation, cell differentiation regulation and other aspects. However, the majority of the lncRNAs have not been functionally characterized. Exploring the function of lncRNAs and the regulatory network has become a hot research topic.
Methods: In this work, a network-based model named BiRWLGO is developed. The ultimate goal is to predict the probable functions for lncRNAs at large scale. The new model starts with building a global network composed of three networks: lncRNA similarity network, lncRNA-protein association network and protein-protein interaction (PPI) network. After that, it utilizes the bi-random walk algorithm to explore the similarities between lncRNAs and proteins. Finally, we can annotate an lncRNA with the Gene Ontology (GO) terms according to its neighboring proteins.
Results: We compare the performance of BiRWLGO with the state-of-the-art models on a manually annotated lncRNA benchmark with known GO terms. The experimental results assert that BiRWLGO outperforms other methods in terms of both maximum F-measure (Fmax) and coverage.
Conclusions: BiRWLGO is a relatively efficient method to predict the functions of lncRNA. When protein interaction data is integrated, the predictive performance of BiRWLGO gains a great improvement.
Affiliation(s)
- Jingpu Zhang
  - School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan, 467000, China
  - School of Information Science and Engineering, Central South University, Changsha, 410083, China
- Shuai Zou
  - School of Information Science and Engineering, Central South University, Changsha, 410083, China
- Lei Deng
  - School of Software, Central South University, Changsha, 410075, China
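Several abstracts in this list, including the BiRWLGO entry above, report results as maximum F-measure (Fmax). The sketch below shows one common way to compute it, assuming the CAFA-style item-centric definition: at each score threshold, precision is averaged over items with at least one prediction, recall is averaged over all items, and the best harmonic mean is kept. Whether the cited papers follow exactly this protocol is an assumption. The lncRNA names, term labels, and scores are placeholders.

```python
def fmax(predictions, truth, thresholds=None):
    """Maximum F-measure over score thresholds.

    predictions: {item: {term: score in [0, 1]}}
    truth:       {item: non-empty set of true terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for item, true_terms in truth.items():
            scored = predictions.get(item, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision only counts items with a prediction
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Placeholder annotations and scores (not real GO terms or model output).
truth = {"lnc1": {"term_a", "term_b"}, "lnc2": {"term_c"}}
good = {
    "lnc1": {"term_a": 0.9, "term_b": 0.4, "term_c": 0.2},
    "lnc2": {"term_c": 0.8, "term_a": 0.3},
}
poor = {"lnc1": {"term_a": 0.9}, "lnc2": {"term_a": 0.9}}

fmax_good = fmax(good, truth)
fmax_poor = fmax(poor, truth)
print(fmax_good, fmax_poor)
```

Because the threshold is chosen after seeing the scores, Fmax rewards a model whose ranking separates true from false terms at some cutoff, which is why the well-separated `good` scores outscore `poor` here.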
|
31
|
Zeng C, Zhan W, Deng L. SDADB: a functional annotation database of protein structural domains. Database: The Journal of Biological Databases and Curation 2018; 2018:5046758. [PMID: 29961821 PMCID: PMC6025185 DOI: 10.1093/database/bay064] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 10/11/2017] [Accepted: 06/04/2018] [Indexed: 12/27/2022]
Abstract
Annotating functional terms with individual domains is essential for understanding the functions of full-length proteins. We describe SDADB, a functional annotation database for structural domains. SDADB provides associations between gene ontology (GO) terms and SCOP domains calculated with an integrated framework. GO annotations are assigned probabilities of being correct, which are estimated with a Bayesian network by taking advantage of structural neighborhood mappings, SCOP-InterPro domain mapping information, position-specific scoring matrices (PSSMs) and sequence homolog features, with the most substantial contribution coming from high-coverage structure-based domain-protein mappings. The domain-protein mappings are computed using large-scale structure alignment. SDADB contains ontological terms with probabilistic scores for more than 214 000 distinct SCOP domains. It also provides additional features include 3D structure alignment visualization, GO hierarchical tree view, search, browse and download options. Database URL: http://sda.denglab.org
Affiliation(s)
- Cheng Zeng
  - School of Software, Central South University, Changsha 410075, China
- Weihua Zhan
  - School of Electronics and Computer Science, Zhejiang Wanli University, Ningbo 315100, China
- Lei Deng
  - School of Software, Central South University, Changsha 410075, China
  - Shanghai Key Lab of Intelligent Information Processing, Shanghai 200433, China
|
32
|
Feng S, Fu P, Zheng W. A hierarchical multi-label classification method based on neural networks for gene function prediction. BIOTECHNOL BIOTEC EQ 2018. [DOI: 10.1080/13102818.2018.1521302] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 01/13/2023] Open
Affiliation(s)
- Shou Feng
  - Department of Automatic Test and Control, School of Electrical Engineering and Automation, Harbin Institute of Technology, Harbin, PR China
- Ping Fu
  - Department of Automatic Test and Control, School of Electrical Engineering and Automation, Harbin Institute of Technology, Harbin, PR China
- Wenbin Zheng
  - Department of Automatic Test and Control, School of Electrical Engineering and Automation, Harbin Institute of Technology, Harbin, PR China
|
33
|
Deng L, Wang J, Xiao Y, Wang Z, Liu H. Accurate prediction of protein-lncRNA interactions by diffusion and HeteSim features across heterogeneous network. BMC Bioinformatics 2018; 19:370. [PMID: 30309340 PMCID: PMC6182872 DOI: 10.1186/s12859-018-2390-0] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 02/05/2018] [Accepted: 09/19/2018] [Indexed: 12/12/2022] Open
Abstract
Background: Identifying the interactions between proteins and long non-coding RNAs (lncRNAs) is of great importance to decipher the functional mechanisms of lncRNAs. However, current experimental techniques for detection of lncRNA-protein interactions are limited and inefficient. Many methods have been proposed to predict protein-lncRNA interactions, but few studies make use of the topological information of heterogenous biological networks associated with the lncRNAs.
Results: In this work, we propose a novel approach, PLIPCOM, using two groups of network features to detect protein-lncRNA interactions. In particular, diffusion features and HeteSim features are extracted from protein-lncRNA heterogenous network, and then combined to build the prediction model using the Gradient Tree Boosting (GTB) algorithm. Our study highlights that the topological features of the heterogeneous network are crucial for predicting protein-lncRNA interactions. The cross-validation experiments on the benchmark dataset show that PLIPCOM method substantially outperformed previous state-of-the-art approaches in predicting protein-lncRNA interactions. We also prove the robustness of the proposed method on three unbalanced data sets. Moreover, our case studies demonstrate that our method is effective and reliable in predicting the interactions between lncRNAs and proteins.
Availability: The source code and supporting files are publicly available at: http://denglab.org/PLIPCOM/.
Affiliation(s)
- Lei Deng
- School of Software, Central South University, Changsha, 410075, China
- Junqiang Wang
- School of Software, Central South University, Changsha, 410075, China
- Yun Xiao
- School of Software, Central South University, Changsha, 410075, China
- Zixiang Wang
- School of Software, Central South University, Changsha, 410075, China
- Hui Liu
- Lab of Information Management, Changzhou University, Jiangsu, 213164, China
|
34
|
Uszczynska-Ratajczak B, Lagarde J, Frankish A, Guigó R, Johnson R. Towards a complete map of the human long non-coding RNA transcriptome. Nat Rev Genet 2018; 19:535-548. [PMID: 29795125 PMCID: PMC6451964 DOI: 10.1038/s41576-018-0017-y] [Citation(s) in RCA: 416] [Impact Index Per Article: 59.4] [Indexed: 02/01/2023]
Abstract
Gene maps, or annotations, enable us to navigate the functional landscape of our genome. They are a resource upon which virtually all studies depend, from single-gene to genome-wide scales and from basic molecular biology to medical genetics. Yet present-day annotations suffer from trade-offs between quality and size, with serious but often unappreciated consequences for downstream studies. This is particularly true for long non-coding RNAs (lncRNAs), which are poorly characterized compared to protein-coding genes. Long-read sequencing technologies promise to improve current annotations, paving the way towards a complete annotation of lncRNAs expressed throughout a human lifetime.
Affiliation(s)
- Julien Lagarde
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Catalonia, Spain
- Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Roderic Guigó
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Catalonia, Spain
- Rory Johnson
- Department of Medical Oncology, Inselspital, University Hospital and University of Bern, Bern, Switzerland
- Department of Biomedical Research (DBMR), University of Bern, Bern, Switzerland
|
35
|
Long noncoding RNA HOTAIR promotes the self-renewal of leukemia stem cells through epigenetic silencing of p15. Exp Hematol 2018; 67:32-40.e3. [PMID: 30172749 DOI: 10.1016/j.exphem.2018.08.005] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 06/28/2018] [Revised: 08/09/2018] [Accepted: 08/22/2018] [Indexed: 11/20/2022]
Abstract
Acute myeloid leukemia (AML) is a heterogeneous hematopoietic disorder initiated from a small subset of leukemia stem cells (LSCs), which show unrestricted self-renewal and proliferation. Long non-coding RNA HOTAIR is abundantly expressed and plays oncogenic roles in solid cancers and AML. However, whether HOTAIR regulates the self-renewal of LSCs is largely unknown. Here, we report that the expression of HOTAIR was higher in LSCs than in normal hematopoietic stem and progenitor cells (HSPCs). HOTAIR inhibition by short hairpin RNAs (shRNAs) decreased colony formation in leukemia cell lines and primary AML blasts. We then investigated the role of HOTAIR in leukemia in vivo. HOTAIR knockdown extended the survival time of U937-transplanted NSG mice. Furthermore, HOTAIR knockdown reduced infiltration of leukemic blasts, decreased the frequency of LSCs, and prolonged overall survival in MLL-AF9-induced murine leukemia, suggesting that HOTAIR is required for the maintenance of AML. Mechanistically, HOTAIR inhibited p15 expression through enhancer of zeste homolog 2 (EZH2)-mediated tri-methylation of Lys 27 of histone H3 (H3K27me3) at the p15 promoter. In addition, p15 partially reversed the decrease in colony formation and proliferation induced by HOTAIR knockdown, suggesting that p15 plays an important role in HOTAIR-driven leukemogenesis. In conclusion, our study suggests that HOTAIR facilitates leukemogenesis by enhancing the self-renewal of LSCs. HOTAIR might be a potential target for anti-LSC therapy.
|
36
|
Blokhin I, Khorkova O, Hsiao J, Wahlestedt C. Developments in lncRNA drug discovery: where are we heading? Expert Opin Drug Discov 2018; 13:837-849. [PMID: 30078338 DOI: 10.1080/17460441.2018.1501024] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Indexed: 12/21/2022]
Abstract
Introduction: The central dogma of molecular biology, which states that the only role of long RNA transcripts is to convey information from gene to protein, has been brought into question in recent years by the discovery of the extensive presence and complex roles of long noncoding RNAs (lncRNAs). Furthermore, lncRNAs were found to be involved in the pathogenesis of multiple diseases and thus represent a new class of therapeutic targets. Translational efforts in the lncRNA field have been augmented by progress in optimizing the chemistry and delivery platforms of lncRNA-targeting modalities, including oligonucleotide-based drugs and CRISPR-Cas9. Areas covered: This review covers the current advances in characterizing the diversity and biological functions of lncRNAs, focusing on their therapeutic potential in selected therapeutic areas. Expert opinion: Due to accelerating parallel progress in lncRNA biology and lncRNA-compatible therapeutic modalities, it is likely that lncRNA-dependent mechanisms of pathogenesis will soon be targeted in various disorders, including neurological, psychiatric, cardiovascular, and infectious diseases, and cancer. Significant efforts, however, are still required to better understand the biology of both lncRNAs and lncRNA-targeting drugs. Further work is needed in the areas of lncRNA nomenclature, database representation, intra/interfield communication, and education of the community at large.
Affiliation(s)
- Ilya Blokhin
- Center for Therapeutic Innovation and Department of Psychiatry and Behavioral Sciences, University of Miami Miller School of Medicine, Miami, FL, USA
- Claes Wahlestedt
- Center for Therapeutic Innovation and Department of Psychiatry and Behavioral Sciences, University of Miami Miller School of Medicine, Miami, FL, USA
|
37
|
Niu M, Li Y, Wang C, Han K. RFAmyloid: A Web Server for Predicting Amyloid Proteins. Int J Mol Sci 2018; 19:ijms19072071. [PMID: 30013015 PMCID: PMC6073578 DOI: 10.3390/ijms19072071] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Received: 06/12/2018] [Revised: 07/10/2018] [Accepted: 07/12/2018] [Indexed: 12/22/2022]
Abstract
Amyloid is an insoluble fibrous protein, and its mis-aggregation can lead to diseases such as Alzheimer’s disease and Creutzfeldt–Jakob disease. Therefore, the identification of amyloid is essential for the discovery and understanding of disease. We established a novel predictor called RFAmy, based on random forest, to identify amyloid. It employs the SVMProt 188-D feature extraction method, based on protein composition and physicochemical properties, and the pse-in-one feature extraction method, based on amino acid composition, autocorrelation pseudo acid composition, profile-based features and predicted structure features. In the ten-fold cross-validation test, RFAmy’s overall accuracy was 89.19% and its F-measure was 0.891. Results were obtained by comparison experiments with other features, classifiers, and existing methods. This shows the effectiveness of RFAmy in predicting amyloid proteins. The RFAmy proposed in this paper can be accessed through the URL http://server.malab.cn/RFAmyloid/.
Affiliation(s)
- Mengting Niu
- School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China.
- Yanjuan Li
- School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China.
- Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150040, China.
- Ke Han
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150040, China.
|