1
|
Prediction of LncRNA-Protein Interactions Based on Kernel Combinations and Graph Convolutional Networks. IEEE J Biomed Health Inform 2024; 28:1937-1948. [PMID: 37327093 DOI: 10.1109/jbhi.2023.3286917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/18/2023]
Abstract
The complexes of long non-coding RNAs bound to proteins can be involved in regulating life activities at various stages of organisms. However, in the face of the growing number of lncRNAs and proteins, verifying LncRNA-Protein Interactions (LPI) based on traditional biological experiments is time-consuming and laborious. Therefore, with the improvement of computing power, predicting LPI has met new development opportunity. In virtue of the state-of-the-art works, a framework called LncRNA-Protein Interactions based on Kernel Combinations and Graph Convolutional Networks (LPI-KCGCN) has been proposed in this article. We first construct kernel matrices by taking advantage of extracting both the lncRNAs and protein concerning the sequence features, sequence similarity features, expression features, and gene ontology. Then reconstruct the existent kernel matrices as the input of the next step. Combined with known LPI interactions, the reconstructed similarity matrices, which can be used as features of the topology map of the LPI network, are exploited in extracting potential representations in the lncRNA and protein space using a two-layer Graph Convolutional Network. The predicted matrix can be finally obtained by training the network to produce scoring matrices w.r.t. lncRNAs and proteins. Different LPI-KCGCN variants are ensemble to derive the final prediction results and testify on balanced and unbalanced datasets. The 5-fold cross-validation shows that the optimal feature information combination on a dataset with 15.5% positive samples has an AUC value of 0.9714 and an AUPR value of 0.9216. On another highly unbalanced dataset with only 5% positive samples, LPI-KCGCN also has outperformed the state-of-the-art works, which achieved an AUC value of 0.9907 and an AUPR value of 0.9267.
Collapse
|
2
|
Prediction of lncRNA functions using deep neural networks based on multiple networks. BMC Genomics 2023; 23:865. [PMID: 37946156 PMCID: PMC10636874 DOI: 10.1186/s12864-023-09578-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2021] [Accepted: 08/10/2023] [Indexed: 11/12/2023] Open
Abstract
BACKGROUND More and more studies show that lncRNA is widely involved in various physiological processes of the organism. However, the functions of the vast majority of them continue to be unknown. In addition, data related to lncRNAs in biological databases are constantly increasing. Therefore, it is quite urgent to develop a computing method to make the utmost of these data. RESULTS In this paper, we propose a new computational method based on global heterogeneous networks to predict the functions of lncRNAs, called DNGRGO. DNGRGO first calculates the similarities among proteins, miRNAs, and lncRNAs, and annotates the functions of lncRNAs according to its similar protein-coding genes, which have been labeled with gene ontology (GO). To evaluate the performance of DNGRGO, we manually annotated GO terms to lncRNAs and implemented our method on these data. Compared with the existing methods, the results of DNGRGO show superior predictive performance of maximum F-measure and coverage. CONCLUSIONS DNGRGO is able to annotate lncRNAs through capturing the low-dimensional features of the heterogeneous network. Moreover, the experimental results show that integrating miRNA data can help to improve the predictive performance of DNGRGO.
Collapse
|
3
|
Protein Function Prediction With Functional and Topological Knowledge of Gene Ontology. IEEE Trans Nanobioscience 2023; 22:755-762. [PMID: 37204950 DOI: 10.1109/tnb.2023.3278033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/21/2023]
Abstract
Gene Ontology (GO) is a widely used bioinformatics resource for describing biological processes, molecular functions, and cellular components of proteins. It covers more than 5000 terms hierarchically organized into a directed acyclic graph and known functional annotations. Automatically annotating protein functions by using GO-based computational models has been an area of active research for a long time. However, due to the limited functional annotation information and complex topological structures of GO, existing models cannot effectively capture the knowledge representation of GO. To solve this issue, we present a method that fuses the functional and topological knowledge of GO to guide protein function prediction. This method employs a multi-view GCN model to extract a variety of GO representations from functional information, topological structure, and their combinations. To dynamically learn the significance weights of these representations, it adopts an attention mechanism to learn the final knowledge representation of GO. Furthermore, it uses a pre-trained language model (i.e., ESM-1b) to efficiently learn biological features for each protein sequence. Finally, it obtains all predicted scores by calculating the dot product of sequence features and GO representation. Our method outperforms other state-of-the-art methods, as demonstrated by the experimental results on datasets from three different species, namely Yeast, Human and Arabidopsis. Our proposed method's code can be accessed at: https://github.com/Candyperfect/Master.
Collapse
|
4
|
Topologically associating domain underlies tissue specific expression of long intergenic non-coding RNAs. iScience 2023; 26:106640. [PMID: 37250307 PMCID: PMC10214471 DOI: 10.1016/j.isci.2023.106640] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2022] [Revised: 12/15/2022] [Accepted: 04/05/2023] [Indexed: 05/31/2023] Open
Abstract
Accumulating evidence indicates that long intergenic non-coding RNAs (lincRNAs) show more tissue-specific expression patterns than protein-coding genes (PCGs). However, although lincRNAs are subject to canonical transcriptional regulation like PCGs, the molecular basis for the specificity of their expression patterns remains unclear. Here, using expression data and coordinates of topologically associating domains (TADs) in human tissues, we show that lincRNA loci are significantly enriched in the more internal region of TADs compared to PCGs and that lincRNAs within TADs have higher tissue specificity than those outside TADs. Based on these, we propose an analytical framework to interpret transcriptional status using lincRNA as an indicator. We applied it to hypertrophic cardiomyopathy data and found disease-specific transcriptional regulation: ectopic expression of keratin at the TAD level and derepression of myocyte differentiation-related genes by E2F1 with down-regulation of LINC00881. Our results provide understanding of the function and regulation of lincRNAs according to genomic structure.
Collapse
|
5
|
Hypermethylation-Mediated lncRNA MAGI2-AS3 Downregulation Facilitates Malignant Progression of Laryngeal Squamous Cell Carcinoma via Interacting With SPT6. Cell Transplant 2023; 32:9636897231154574. [PMID: 36852700 PMCID: PMC9986895 DOI: 10.1177/09636897231154574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/01/2023] Open
Abstract
Long noncoding RNAs (lncRNAs) have an effect on the occurrence and progression of a considerable number of diseases, especially cancer. Existing research has suggested that MAGI2 antisense RNA 3 (MAGI2-AS3) takes on a critical significance in the development of hepatocellular carcinoma and lung cancer. However, the functions of MAGI2-AS3 in laryngeal squamous cell carcinoma (LSCC) remain unclear. In this study, MAGI2-AS3 expression level in LSCC tissue and cell lines was detected, and the effect of MAGI2-AS3 overexpressed on LSCC phenotypes and the possible influence mechanisms were examined. MAGI2-AS3 was downregulated in the tissues of LSCC patients versus non-tumor tissues, and it was correlated with advanced TNM (tumor, node, metastasis) stage and lymph node metastases, as indicated by the results of this study. MAGI2-AS3 inhibited the proliferation, migration, and invasion of LSCC cells in vitro and in vivo. Furthermore, the hypermethylation level of the MAGI2-AS3 promoter region was indicated by bisulfite genomic sequencing and methylation-specific polymerase chain reaction, such that MAGI2-AS3 expression was downregulated. Besides, MAGI2-AS3 promoter hypermethylation was regulated by DNA methyltransferase 1 (DNMT1), and MAGI2-AS3 expression was reversed by 5-Aza-2'-deoxycytidine (5-Aza). Moreover, the result of the RNA pull-down experiment suggested that 38 proteins were enriched in the MAGI2-AS3 group versus the control group in TU177 cells. To be specific, SPT6 (ie, a conserved protein) was enriched by fold change >10. SPT6 knockdown reduced the antitumor effect of MAGI2-AS3 in TU177 and AMC-HN-8 cells. Meanwhile, SPT6 overexpression inhibited the proliferation, metastasis, and invasion of TU177 and AMC-HN-8 cells. As revealed by the above findings, DNMT1-regulated MAGI2-AS3 promoter hypermethylation led to downregulated MAGI2-AS3 expression, such that the presence and progression of LSCC were inhibited in an SPT6 binding-dependent manner.
Collapse
|
6
|
EVlncRNA-Dpred: improved prediction of experimentally validated lncRNAs by deep learning. Brief Bioinform 2022; 24:6961472. [PMID: 36573492 PMCID: PMC9851331 DOI: 10.1093/bib/bbac583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2022] [Revised: 11/02/2022] [Accepted: 11/29/2022] [Indexed: 12/28/2022] Open
Abstract
Long non-coding RNAs (lncRNAs) played essential roles in nearly every biological process and disease. Many algorithms were developed to distinguish lncRNAs from mRNAs in transcriptomic data and facilitated discoveries of more than 600 000 of lncRNAs. However, only a tiny fraction (<1%) of lncRNA transcripts (~4000) were further validated by low-throughput experiments (EVlncRNAs). Given the cost and labor-intensive nature of experimental validations, it is necessary to develop computational tools to prioritize those potentially functional lncRNAs because many lncRNAs from high-throughput sequencing (HTlncRNAs) could be resulted from transcriptional noises. Here, we employed deep learning algorithms to separate EVlncRNAs from HTlncRNAs and mRNAs. For overcoming the challenge of small datasets, we employed a three-layer deep-learning neural network (DNN) with a K-mer feature as the input and a small convolutional neural network (CNN) with one-hot encoding as the input. Three separate models were trained for human (h), mouse (m) and plant (p), respectively. The final concatenated models (EVlncRNA-Dpred (h), EVlncRNA-Dpred (m) and EVlncRNA-Dpred (p)) provided substantial improvement over a previous model based on support-vector-machines (EVlncRNA-pred). For example, EVlncRNA-Dpred (h) achieved 0.896 for the area under receiver-operating characteristic curve, compared with 0.582 given by sequence-based EVlncRNA-pred model. The models developed here should be useful for screening lncRNA transcripts for experimental validations. EVlncRNA-Dpred is available as a web server at https://www.sdklab-biophysics-dzu.net/EVlncRNA-Dpred/index.html, and the data and source code can be freely available along with the web server.
Collapse
|
7
|
Hierarchical multi-label classification based on LSTM network and Bayesian decision theory for LncRNA function prediction. Sci Rep 2022; 12:5819. [PMID: 35388048 PMCID: PMC8986818 DOI: 10.1038/s41598-022-09672-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2021] [Accepted: 03/21/2022] [Indexed: 02/01/2023] Open
Abstract
Growing evidence shows that long noncoding RNAs (lncRNAs) play an important role in cellular biological processes at multiple levels, such as gene imprinting, immune response, and genetic regulation, and are closely related to diseases because of their complex and precise control. However, most functions of lncRNAs remain undiscovered. Current computational methods for exploring lncRNA functions can avoid high-throughput experiments, but they usually focus on the construction of similarity networks and ignore the certain directed acyclic graph (DAG) formed by gene ontology annotations. In this paper, we view the function annotation work as a hierarchical multilabel classification problem and design a method HLSTMBD for classification with DAG-structured labels. With the help of a mathematical model based on Bayesian decision theory, the HLSTMBD algorithm is implemented with the long-short term memory network and a hierarchical constraint method DAGLabel. Compared with other state-of-the-art algorithms, the results on GOA-lncRNA datasets show that the proposed method can efficiently and accurately complete the label prediction work.
Collapse
|
8
|
GBDTLRL2D Predicts LncRNA-Disease Associations Using MetaGraph2Vec and K-Means Based on Heterogeneous Network. Front Cell Dev Biol 2021; 9:753027. [PMID: 34977011 PMCID: PMC8718797 DOI: 10.3389/fcell.2021.753027] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2021] [Accepted: 11/22/2021] [Indexed: 12/16/2022] Open
Abstract
In recent years, the long noncoding RNA (lncRNA) has been shown to be involved in many disease processes. The prediction of the lncRNA-disease association is helpful to clarify the mechanism of disease occurrence and bring some new methods of disease prevention and treatment. The current methods for predicting the potential lncRNA-disease association seldom consider the heterogeneous networks with complex node paths, and these methods have the problem of unbalanced positive and negative samples. To solve this problem, a method based on the Gradient Boosting Decision Tree (GBDT) and logistic regression (LR) to predict the lncRNA-disease association (GBDTLRL2D) is proposed in this paper. MetaGraph2Vec is used for feature learning, and negative sample sets are selected by using K-means clustering. The innovation of the GBDTLRL2D is that the clustering algorithm is used to select a representative negative sample set, and the use of MetaGraph2Vec can better retain the semantic and structural features in heterogeneous networks. The average area under the receiver operating characteristic curve (AUC) values of GBDTLRL2D obtained on the three datasets are 0.98, 0.98, and 0.96 in 10-fold cross-validation.
Collapse
|
9
|
LPI-EnEDT: an ensemble framework with extra tree and decision tree classifiers for imbalanced lncRNA-protein interaction data classification. BioData Min 2021; 14:50. [PMID: 34861891 PMCID: PMC8642957 DOI: 10.1186/s13040-021-00277-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2021] [Accepted: 08/22/2021] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Long noncoding RNAs (lncRNAs) have dense linkages with various biological processes. Identifying interacting lncRNA-protein pairs contributes to understand the functions and mechanisms of lncRNAs. Wet experiments are costly and time-consuming. Most computational methods failed to observe the imbalanced characterize of lncRNA-protein interaction (LPI) data. More importantly, they were measured based on a unique dataset, which produced the prediction bias. RESULTS In this study, we develop an Ensemble framework (LPI-EnEDT) with Extra tree and Decision Tree classifiers to implement imbalanced LPI data classification. First, five LPI datasets are arranged. Second, lncRNAs and proteins are separately characterized based on Pyfeat and BioTriangle and concatenated as a vector to represent each lncRNA-protein pair. Finally, an ensemble framework with Extra tree and decision tree classifiers is developed to classify unlabeled lncRNA-protein pairs. The comparative experiments demonstrate that LPI-EnEDT outperforms four classical LPI prediction methods (LPI-BLS, LPI-CatBoost, LPI-SKF, and PLIPCOM) under cross validations on lncRNAs, proteins, and LPIs. The average AUC values on the five datasets are 0.8480, 0,7078, and 0.9066 under the three cross validations, respectively. The average AUPRs are 0.8175, 0.7265, and 0.8882, respectively. Case analyses suggest that there are underlying associations between HOTTIP and Q9Y6M1, NRON and Q15717. CONCLUSIONS Fusing diverse biological features of lncRNAs and proteins and exploiting an ensemble learning model with Extra tree and decision tree classifiers, this work focus on imbalanced LPI data classification as well as interaction information inference for a new lncRNA (or protein).
Collapse
|
10
|
Fusion of KATZ measure and space projection to fast probe potential lncRNA-disease associations in bipartite graphs. PLoS One 2021; 16:e0260329. [PMID: 34807960 PMCID: PMC8608294 DOI: 10.1371/journal.pone.0260329] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2021] [Accepted: 11/06/2021] [Indexed: 11/19/2022] Open
Abstract
It is well known that numerous long noncoding RNAs (lncRNAs) closely relate to the physiological and pathological processes of human diseases and can serves as potential biomarkers. Therefore, lncRNA-disease associations that are identified by computational methods as the targeted candidates reduce the cost of biological experiments focusing on deep study furtherly. However, inaccurate construction of similarity networks and inadequate numbers of observed known lncRNA–disease associations, such inherent problems make many mature computational methods that have been developed for many years still exit some limitations. It motivates us to explore a new computational method that was fused with KATZ measure and space projection to fast probing potential lncRNA-disease associations (namely KATZSP). KATZSP is comprised of following key steps: combining all the global information with which to change Boolean network of known lncRNA–disease associations into the weighted networks; changing the similarities calculation into counting the number of walks that connect lncRNA nodes and disease nodes in bipartite graphs; obtaining the space projection scores to refine the primary prediction scores. The process to fuse KATZ measure and space projection was simplified and uncomplicated with needing only one attenuation factor. The leave-one-out cross validation (LOOCV) experimental results showed that, compared with other state-of-the-art methods (NCPLDA, LDAI-ISPS and IIRWR), KATZSP had a higher predictive accuracy shown with area-under-the-curve (AUC) value on the three datasets built, while KATZSP well worked on inferring potential associations related to new lncRNAs (or isolated diseases). The results from real cases study (such as pancreas cancer, lung cancer and colorectal cancer) further confirmed that KATZSP is capable of superior predictive ability to be applied as a guide for traditional biological experiments.
Collapse
|
11
|
A novel lncRNA-protein interaction prediction method based on deep forest with cascade forest structure. Sci Rep 2021; 11:18881. [PMID: 34556758 PMCID: PMC8460650 DOI: 10.1038/s41598-021-98277-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2021] [Accepted: 08/18/2021] [Indexed: 02/08/2023] Open
Abstract
Long noncoding RNAs (lncRNAs) regulate many biological processes by interacting with corresponding RNA-binding proteins. The identification of lncRNA-protein Interactions (LPIs) is significantly important to well characterize the biological functions and mechanisms of lncRNAs. Existing computational methods have been effectively applied to LPI prediction. However, the majority of them were evaluated only on one LPI dataset, thereby resulting in prediction bias. More importantly, part of models did not discover possible LPIs for new lncRNAs (or proteins). In addition, the prediction performance remains limited. To solve with the above problems, in this study, we develop a Deep Forest-based LPI prediction method (LPIDF). First, five LPI datasets are obtained and the corresponding sequence information of lncRNAs and proteins are collected. Second, features of lncRNAs and proteins are constructed based on four-nucleotide composition and BioSeq2vec with encoder-decoder structure, respectively. Finally, a deep forest model with cascade forest structure is developed to find new LPIs. We compare LPIDF with four classical association prediction models based on three fivefold cross validations on lncRNAs, proteins, and LPIs. LPIDF obtains better average AUCs of 0.9012, 0.6937 and 0.9457, and the best average AUPRs of 0.9022, 0.6860, and 0.9382, respectively, for the three CVs, significantly outperforming other methods. The results show that the lncRNA FTX may interact with the protein P35637 and needs further validation.
Collapse
|
12
|
Prioritizing Disease-Related Microbes Based on the Topological Properties of a Comprehensive Network. Front Microbiol 2021; 12:685549. [PMID: 34326821 PMCID: PMC8315281 DOI: 10.3389/fmicb.2021.685549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2021] [Accepted: 05/10/2021] [Indexed: 01/09/2023] Open
Abstract
Many microbes are parasitic within the human body, engaging in various physiological processes and playing an important role in human diseases. The discovery of new microbe-disease associations aids our understanding of disease pathogenesis. Computational methods can be applied in such investigations, thereby avoiding the time-consuming and laborious nature of experimental methods. In this study, we constructed a comprehensive microbe-disease network by integrating known microbe-disease associations from three large-scale databases (Peryton, Disbiome, and gutMDisorder), and extended the random walk with restart to the network for prioritizing unknown microbe-disease associations. The area under the curve values of the leave-one-out cross-validation and the fivefold cross-validation exceeded 0.9370 and 0.9366, respectively, indicating the high performance of this method. Despite being widely studied diseases, in case studies of inflammatory bowel disease, asthma, and obesity, some prioritized disease-related microbes were validated by recent literature. This suggested that our method is effective at prioritizing novel disease-related microbes and may offer further insight into disease pathogenesis.
Collapse
|
13
|
LDAH2V: Exploring Meta-Paths Across Multiple Networks for lncRNA-Disease Association Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1572-1581. [PMID: 31725386 DOI: 10.1109/tcbb.2019.2946257] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Accumulating evidence has demonstrated dysfunctions of long non-coding RNAs (lncRNAs) are involved in various complex human diseases. However, even today, the relationships between lncRNAs and diseases remain unknown in most cases. Developing effective computational approaches to identify potential lncRNA-disease associations has become a hot topic. Existing network-based approaches are usually focused on the intrinsic features of lncRNAs and diseases but ignore the heterogeneous information of biological networks. Considering the limitations in previous methods, we propose LDAH2V, an efficient computational framework for predicting potential lncRNA-disease associations. LDAH2V uses the HIN2Vec to calculate the meta-path and feature vector for each lncRNA-disease pair in the heterogeneous information network (HIN), which consists of lncRNA similarity network, disease similarity network, miRNA similarity network, and the associations between them. Then, a Gradient Boosting Tree (GBT) classifier to predict lncRNA-disease associations is built with the feature vectors. The results show that LDAH2V performs significantly better than the four existing state-of-the-art methods and gains an AUC of 0.97 in the 10-fold cross-validation test. Furthermore, case studies of colon cancer and ovarian cancer-related lncRNAs have been confirmed in related databases and medical literature.
Collapse
|
14
|
FGGA-lnc: automatic gene ontology annotation of lncRNA sequences based on secondary structures. Interface Focus 2021; 11:20200064. [PMID: 34123354 PMCID: PMC8193470 DOI: 10.1098/rsfs.2020.0064] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 04/13/2021] [Indexed: 02/01/2023] Open
Abstract
The study of long non-coding RNAs (lncRNAs), greater than 200 nucleotides, is central to understanding the development and progression of many complex diseases. Unlike proteins, the functionality of lncRNAs is only subtly encoded in their primary sequence. Current in-silico lncRNA annotation methods mostly rely on annotations inferred from interaction networks. But extensive experimental studies are required to build these networks. In this work, we present a graph-based machine learning method called FGGA-lnc for the automatic gene ontology (GO) annotation of lncRNAs across the three GO subdomains. We build upon FGGA (factor graph GO annotation), a computational method originally developed to annotate protein sequences from non-model organisms. In the FGGA-lnc version, a coding-based approach is introduced to fuse primary sequence and secondary structure information of lncRNA molecules. As a result, lncRNA sequences become sequences of a higher-order alphabet allowing supervised learning methods to assess individual GO-term annotations. Raw GO annotations obtained in this way are unaware of the GO structure and therefore likely to be inconsistent with it. The message-passing algorithm embodied by factor graph models overcomes this problem. Evaluations of the FGGA-lnc method on lncRNA data, from model and non-model organisms, showed promising results suggesting it as a candidate to satisfy the huge demand for functional annotations arising from high-throughput sequencing technologies.
Collapse
|
15
|
A Survey of Network Representation Learning Methods for Link Prediction in Biological Network. Curr Pharm Des 2021; 26:3076-3084. [PMID: 31951161 DOI: 10.2174/1381612826666200116145057] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2019] [Accepted: 01/09/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND Networks are powerful resources for describing complex systems. Link prediction is an important issue in network analysis and has important practical application value. Network representation learning has proven to be useful for network analysis, especially for link prediction tasks. OBJECTIVE To review the application of network representation learning on link prediction in a biological network, we summarize recent methods for link prediction in a biological network and discuss the application and significance of network representation learning in link prediction task. METHOD & RESULTS We first introduce the widely used link prediction algorithms, then briefly introduce the development of network representation learning methods, focusing on a few widely used methods, and their application in biological network link prediction. Existing studies demonstrate that using network representation learning to predict links in biological networks can achieve better performance. In the end, some possible future directions have been discussed.
Collapse
|
16
|
LPI-SKF: Predicting lncRNA-Protein Interactions Using Similarity Kernel Fusions. Front Genet 2020; 11:615144. [PMID: 33362868 PMCID: PMC7758075 DOI: 10.3389/fgene.2020.615144] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2020] [Accepted: 11/16/2020] [Indexed: 01/24/2023] Open
Abstract
Long non-coding RNAs (lncRNAs) play an important role in serval biological activities, including transcription, splicing, translation, and some other cellular regulation processes. lncRNAs perform their biological functions by interacting with various proteins. The studies on lncRNA-protein interactions are of great value to the understanding of lncRNA functional mechanisms. In this paper, we proposed a novel model to predict potential lncRNA-protein interactions using the SKF (similarity kernel fusion) and LapRLS (Laplacian regularized least squares) algorithms. We named this method the LPI-SKF. Various similarities of both lncRNAs and proteins were integrated into the LPI-SKF. LPI-SKF can be applied in predicting potential interactions involving novel proteins or lncRNAs. We obtained an AUROC (area under receiver operating curve) of 0.909 in a 5-fold cross-validation, which outperforms other state-of-the-art methods. A total of 19 out of the top 20 ranked interaction predictions were verified by existing data, which implied that the LPI-SKF had great potential in discovering unknown lncRNA-protein interactions accurately. All data and codes of this work can be downloaded from a GitHub repository (https://github.com/zyk2118216069/LPI-SKF).
Collapse
|
17
|
DeepciRGO: functional prediction of circular RNAs through hierarchical deep neural networks using heterogeneous network features. BMC Bioinformatics 2020; 21:519. [PMID: 33183227 PMCID: PMC7659092 DOI: 10.1186/s12859-020-03748-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2019] [Accepted: 09/11/2020] [Indexed: 12/28/2022] Open
Abstract
Background Circular RNAs (circRNAs) are special noncoding RNA molecules with closed loop structures. Compared with the traditional linear RNA, circRNA is more stable and not easily degraded. Many studies have shown that circRNAs are involved in the regulation of various diseases and cancers. Determining the functions of circRNAs in mammalian cells is of great significance for revealing their mechanism of action in physiological and pathological processes, diagnosis and treatment of diseases. However, determining the functions of circRNAs on a large scale is a challenging task because of the high experimental costs. Results In this paper, we present a hierarchical deep learning model, DeepciRGO, which can effectively predict gene ontology functions of circRNAs. We build a heterogeneous network containing circRNA co-expressions, protein–protein interactions and protein–circRNA interactions. The topology features of proteins and circRNAs are calculated using a novel representation learning approach HIN2Vec across the heterogeneous network. Then, a deep multi-label hierarchical classification model is trained with the topology features to predict the biological process function in the gene ontology for each circRNA. In particular, we manually curated a benchmark dataset containing 185 GO annotations for 62 circRNAs, namely, circRNA2GO-62. The DeepciRGO achieves promising performance on the circRNA2GO-62 dataset with a maximum F-measure of 0.412, a recall score of 0.400, and an accuracy of 0.425, which are significantly better than other state-of-the-art RNA function prediction methods. In addition, we demonstrate the considerable potential of integrating multiple interactions and association networks. Conclusions DeepciRGO will be a useful tool for accurately annotating circRNAs. The experimental results show that integrating multi-source data can help to improve the predictive performance of DeepciRGO. Moreover, The model also can combine RNA structure and sequence information to further optimize predictive performance.
Collapse
|
18
|
A Mendelian Randomization Study on Infant Length and Type 2 Diabetes Mellitus Risk. Curr Gene Ther 2020; 19:224-231. [PMID: 31553296 DOI: 10.2174/1566523219666190925115535] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2019] [Revised: 06/15/2019] [Accepted: 06/16/2019] [Indexed: 12/12/2022]
Abstract
OBJECTIVE Infant length (IL) is a positively associated phenotype of type 2 diabetes mellitus (T2DM), but the causal relationship of which is still unclear. Here, we applied a Mendelian randomization (MR) study to explore the causal relationship between IL and T2DM, which has the potential to provide guidance for assessing T2DM activity and T2DM- prevention in young at-risk populations. MATERIALS AND METHODS To classify the study, a two-sample MR, using genetic instrumental variables (IVs) to explore the causal effect was applied to test the influence of IL on the risk of T2DM. In this study, MR was carried out on GWAS data using 8 independent IL SNPs as IVs. The pooled odds ratio (OR) of these SNPs was calculated by the inverse-variance weighted method for the assessment of the risk the shorter IL brings to T2DM. Sensitivity validation was conducted to identify the effect of individual SNPs. MR-Egger regression was used to detect pleiotropic bias of IVs. RESULTS The pooled odds ratio from the IVW method was 1.03 (95% CI 0.89-1.18, P = 0.0785), low intercept was -0.477, P = 0.252, and small fluctuation of ORs ranged from -0.062 ((0.966 - 1.03) / 1.03) to 0.05 ((1.081 - 1.03) / 1.03) in leave-one-out validation. CONCLUSION We validated that the shorter IL causes no additional risk to T2DM. The sensitivity analysis and the MR-Egger regression analysis also provided adequate evidence that the above result was not due to any heterogeneity or pleiotropic effect of IVs.
Collapse
|
19
|
Microbes and complex diseases: from experimental results to computational models. Brief Bioinform 2020; 22:5882184. [PMID: 32766753 DOI: 10.1093/bib/bbaa158] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2020] [Revised: 06/19/2020] [Accepted: 06/22/2020] [Indexed: 12/13/2022] Open
Abstract
Studies have shown that the number of microbes in humans is almost 10 times that of cells. These microbes have been proven to play an important role in a variety of physiological processes, such as enhancing immunity, improving the digestion of gastrointestinal tract and strengthening metabolic function. In addition, in recent years, more and more research results have indicated that there are close relationships between the emergence of the human noncommunicable diseases and microbes, which provides a novel insight for us to further understand the pathogenesis of the diseases. An in-depth study about the relationships between diseases and microbes will not only contribute to exploring new strategies for the diagnosis and treatment of diseases but also significantly heighten the efficiency of new drugs development. However, applying the methods of biological experimentation to reveal the microbe-disease associations is costly and inefficient. In recent years, more and more researchers have constructed multiple computational models to predict microbes that are potentially associated with diseases. Here, we start with a brief introduction of microbes and databases as well as web servers related to them. Then, we mainly introduce four kinds of computational models, including score function-based models, network algorithm-based models, machine learning-based models and experimental analysis-based models. Finally, we summarize the advantages as well as disadvantages of them and set the direction for the future work of revealing microbe-disease associations based on computational models. We firmly believe that computational models are expected to be important tools in large-scale predictions of disease-related microbes.
Collapse
|
20
|
Predicting Preference of Transcription Factors for Methylated DNA Using Sequence Information. MOLECULAR THERAPY. NUCLEIC ACIDS 2020; 22:1043-1050. [PMID: 33294291 PMCID: PMC7691157 DOI: 10.1016/j.omtn.2020.07.035] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/27/2020] [Accepted: 07/28/2020] [Indexed: 12/12/2022]
Abstract
Transcription factors play key roles in cell-fate decisions by regulating 3D genome conformation and gene expression. The traditional view is that methylation of DNA hinders transcription factors binding to them, but recent research has shown that many transcription factors prefer to bind to methylated DNA. Therefore, identifying such transcription factors and understanding their functions is a stepping-stone for studying methylation-mediated biological processes. In this paper, a two-step discriminated method was proposed to recognize transcription factors and their preference for methylated DNA based only on sequences information. In the first step, the proposed model was used to discriminate transcription factors from non-transcription factors. The areas under the curve (AUCs) are 0.9183 and 0.9116, respectively, for the 5-fold cross-validation test and independent dataset test. Subsequently, for the classification of transcription factors that prefer methylated DNA and transcription factors that prefer non-methylated DNA, our model could produce the AUCs of 0.7744 and 0.7356, respectively, for the 5-fold cross-validation test and independent dataset test. Based on the proposed model, a user-friendly web server called TFPred was built, which can be freely accessed at http://lin-group.cn/server/TFPred/.
Collapse
|
21
|
Remarks on Computational Method for Identifying Acid and Alkaline Enzymes. Curr Pharm Des 2020; 26:3105-3114. [PMID: 32552636 DOI: 10.2174/1381612826666200617170826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2020] [Accepted: 05/07/2020] [Indexed: 11/22/2022]
Abstract
The catalytic efficiency of the enzyme is thousands of times higher than that of ordinary catalysts. Thus, they are widely used in industrial and medical fields. However, enzymes with protein structure can be destroyed and inactivated in high temperature, over acid or over alkali environment. It is well known that most of enzymes work well in an environment with pH of 6-8, while some special enzymes remain active only in an alkaline environment with pH > 8 or an acidic environment with pH < 6. Therefore, the identification of acidic and alkaline enzymes has become a key task for industrial production. Because of the wide varieties of enzymes, it is hard work to determine the acidity and alkalinity of the enzyme by experimental methods, and even this task cannot be achieved. Converting protein sequences into digital features and building computational models can efficiently and accurately identify the acidity and alkalinity of enzymes. This review summarized the progress of the digital features to express proteins and computational methods to identify acidic and alkaline enzymes. We hope that this paper will provide more convenience, ideas, and guides for computationally classifying acid and alkaline enzymes.
Collapse
|
22
|
A Review of Drug Side Effect Identification Methods. Curr Pharm Des 2020; 26:3096-3104. [PMID: 32532187 DOI: 10.2174/1381612826666200612163819] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2020] [Accepted: 05/18/2020] [Indexed: 11/22/2022]
Abstract
Drug side effects have become an important indicator for evaluating the safety of drugs. There are two main factors in the frequent occurrence of drug safety problems; on the one hand, the clinical understanding of drug side effects is insufficient, leading to frequent adverse drug reactions, while on the other hand, due to the long-term period and complexity of clinical trials, side effects of approved drugs on the market cannot be reported in a timely manner. Therefore, many researchers have focused on developing methods to identify drug side effects. In this review, we summarize the methods of identifying drug side effects and common databases in this field. We classified methods of identifying side effects into four categories: biological experimental, machine learning, text mining and network methods. We point out the key points of each kind of method. In addition, we also explain the advantages and disadvantages of each method. Finally, we propose future research directions.
Collapse
|
23
|
GBDTL2E: Predicting lncRNA-EF Associations Using Diffusion and HeteSim Features Based on a Heterogeneous Network. Front Genet 2020; 11:272. [PMID: 32351537 PMCID: PMC7174746 DOI: 10.3389/fgene.2020.00272] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2019] [Accepted: 03/06/2020] [Indexed: 12/02/2022] Open
Abstract
Interactions between genetic factors and environmental factors (EFs) play an important role in many diseases. Many diseases result from the interaction between genetics and EFs. The long non-coding RNA (lncRNA) is an important non-coding RNA that regulates life processes. The ability to predict the associations between lncRNAs and EFs is of important practical significance. However, the recent methods for predicting lncRNA-EF associations rarely use the topological information of heterogenous biological networks or simply treat all objects as the same type without considering the different and subtle semantic meanings of various paths in the heterogeneous network. In order to address this issue, a method based on the Gradient Boosting Decision Tree (GBDT) to predict the association between lncRNAs and EFs (GBDTL2E) is proposed in this paper. The innovation of the GBDTL2E integrates the structural information and heterogenous networks, combines the Hetesim features and the diffusion features based on multi-feature fusion, and uses the machine learning algorithm GBDT to predict the association between lncRNAs and EFs based on heterogeneous networks. The experimental results demonstrate that the proposed algorithm achieves a high performance.
Collapse
|
24
|
Pathogenic Gene Prediction Algorithm Based on Heterogeneous Information Fusion. Front Genet 2020; 11:5. [PMID: 32117433 PMCID: PMC7010852 DOI: 10.3389/fgene.2020.00005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2019] [Accepted: 01/06/2020] [Indexed: 12/23/2022] Open
Abstract
Complex diseases seriously affect people's physical and mental health. The discovery of disease-causing genes has become a target of research. With the emergence of bioinformatics and the rapid development of biotechnology, to overcome the inherent difficulties of the long experimental period and high cost of traditional biomedical methods, researchers have proposed many gene prioritization algorithms that use a large amount of biological data to mine pathogenic genes. However, because the currently known gene–disease association matrix is still very sparse and lacks evidence that genes and diseases are unrelated, there are limits to the predictive performance of gene prioritization algorithms. Based on the hypothesis that functionally related gene mutations may lead to similar disease phenotypes, this paper proposes a PU induction matrix completion algorithm based on heterogeneous information fusion (PUIMCHIF) to predict candidate genes involved in the pathogenicity of human diseases. On the one hand, PUIMCHIF uses different compact feature learning methods to extract features of genes and diseases from multiple data sources, making up for the lack of sparse data. On the other hand, based on the prior knowledge that most of the unknown gene–disease associations are unrelated, we use the PU-Learning strategy to treat the unknown unlabeled data as negative examples for biased learning. The experimental results of the PUIMCHIF algorithm regarding the three indexes of precision, recall, and mean percentile ranking (MPR) were significantly better than those of other algorithms. In the top 100 global prediction analysis of multiple genes and multiple diseases, the probability of recovering true gene associations using PUIMCHIF reached 50% and the MPR value was 10.94%. The PUIMCHIF algorithm has higher priority than those from other methods, such as IMC and CATAPULT.
Collapse
|
25
|
Exploration of the correlation between GPCRs and drugs based on a learning to rank algorithm. Comput Biol Med 2020; 119:103660. [PMID: 32090901 DOI: 10.1016/j.compbiomed.2020.103660] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2019] [Revised: 02/04/2020] [Accepted: 02/12/2020] [Indexed: 02/01/2023]
Abstract
Exploring the protein - drug correlation can not only solve the problem of selecting candidate compounds but also solve related problems such as drug redirection and finding potential drug targets. Therefore, many researchers have proposed different machine learning methods for prediction of protein-drug correlations. However, many existing models simply divide the protein-drug relationship into related or irrelevant categories and do not deeply explore the most relevant target (or drug) for a given drug (or target). In order to solve this problem, this paper applies the ranking concept to the prediction of the GPCR (G Protein-Coupled Receptors)-drug correlation. This study uses two different types of data sets to explore candidate compound and potential target problems, and both sets achieved good results. In addition, this study also found that the family to which a protein belongs is not an inherent factor that affects the ranking of GPCR-drug correlations; however, if the drug affects other family members of the protein, then the protein is likely to be a potential target of the drug. This study showed that the learning to rank algorithm is a good tool for exploring protein-drug correlations.
Collapse
|
26
|
6mA-RicePred: A Method for Identifying DNA N 6-Methyladenine Sites in the Rice Genome Based on Feature Fusion. FRONTIERS IN PLANT SCIENCE 2020; 11:4. [PMID: 32076430 PMCID: PMC7006724 DOI: 10.3389/fpls.2020.00004] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/04/2019] [Accepted: 01/06/2020] [Indexed: 06/01/2023]
Abstract
MOTIVATION The biological function of N 6-methyladenine DNA (6mA) in plants is largely unknown. Rice is one of the most important crops worldwide and is a model species for molecular and genetic studies. There are few methods for 6mA site recognition in the rice genome, and an effective computational method is needed. RESULTS In this paper, we propose a new computational method called 6mA-Pred to identify 6mA sites in the rice genome. 6mA-Pred employs a feature fusion method to combine advantageous features from other methods and thus obtain a new feature to identify 6mA sites. This method achieved an accuracy of 87.27% in the identification of 6mA sites with 10-fold cross-validation and achieved an accuracy of 85.6% in independent test sets.
Collapse
|
27
|
Predicting lncRNA-Protein Interactions With miRNAs as Mediators in a Heterogeneous Network Model. Front Genet 2020; 10:1341. [PMID: 32038709 PMCID: PMC6988623 DOI: 10.3389/fgene.2019.01341] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2019] [Accepted: 12/09/2019] [Indexed: 01/20/2023] Open
Abstract
Long non-coding RNAs (lncRNAs) play important roles in various biological processes, where lncRNA–protein interactions are usually involved. Therefore, identifying lncRNA–protein interactions is of great significance to understand the molecular functions of lncRNAs. Since the experiments to identify lncRNA–protein interactions are always costly and time consuming, computational methods are developed as alternative approaches. However, existing lncRNA–protein interaction predictors usually require prior knowledge of lncRNA–protein interactions with experimental evidences. Their performances are limited due to the number of known lncRNA–protein interactions. In this paper, we explored a novel way to predict lncRNA–protein interactions without direct prior knowledge. MiRNAs were picked up as mediators to estimate potential interactions between lncRNAs and proteins. By validating our results based on known lncRNA–protein interactions, our method achieved an AUROC (Area Under Receiver Operating Curve) of 0.821, which is comparable to the state-of-the-art methods. Moreover, our method achieved an improved AUROC of 0.852 by further expanding the training dataset. We believe that our method can be a useful supplement to the existing methods, as it provides an alternative way to estimate lncRNA–protein interactions in a heterogeneous network without direct prior knowledge. All data and codes of this work can be downloaded from GitHub (https://github.com/zyk2118216069/LncRNA-protein-interactions-prediction).
Collapse
|
28
|
LPGNMF: Predicting Long Non-Coding RNA and Protein Interaction Using Graph Regularized Nonnegative Matrix Factorization. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:189-197. [PMID: 30059315 DOI: 10.1109/tcbb.2018.2861009] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Long non-coding RNAs (lncRNA) play crucial roles in a variety of biological processes and complex diseases. Massive studies have indicated that lncRNAs interact with related proteins to exert regulation of cellular biological processes. Because it is time-consuming and expensive to determine lncRNA-protein interaction by experiment, more accurate predictions of interaction by computational methods are imperative. We propose a novel computational approach, predicting lncRNA-protein interaction using graph regularized nonnegative matrix factorization (LPGNMF), to discover unobserved lncRNA-protein association. First, we calculate lncRNA similarity and protein similarity by integrating the lncRNA expression information and gene ontology information. Subsequently, we utilize graph regularized nonnegative matrix factorization framework to predict potential interactions for all lncRNA simultaneously. In the cross validation test, LPGNMF achieves an AUC of 85.2 percent, higher than those of other compared methods. In addition, novel lncRNA-protein interactions detected by LPGNMF are validated by literatures or database. The results indicate that our method is effective to discover potential lncRNA-protein interaction.
Collapse
|
29
|
NMFGO: Gene Function Prediction via Nonnegative Matrix Factorization with Gene Ontology. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:238-249. [PMID: 30059316 DOI: 10.1109/tcbb.2018.2861379] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Gene Ontology (GO) is a controlled vocabulary of terms that describe molecule function, biological roles, and cellular locations of gene products (i.e., proteins and RNAs), it hierarchically organizes more than 43,000 GO terms via the direct acyclic graph. A gene is generally annotated with several of these GO terms. Therefore, accurately predicting the association between genes and massive terms is a difficult challenge. To combat with this challenge, we propose an matrix factorization based approach called NMFGO. NMFGO stores the available GO annotations of genes in a gene-term association matrix and adopts an ontological structure based taxonomic similarity measure to capture the GO hierarchy. Next, it factorizes the association matrix into two low-rank matrices via nonnegative matrix factorization regularized with the GO hierarchy. After that, it employs a semantic similarity based k nearest neighbor classifier in the low-rank matrices approximated subspace to predict gene functions. Empirical study on three model species (S. cerevisiae, H. sapiens, and A. thaliana) shows that NMFGO is robust to the input parameters and achieves significantly better prediction performance than GIC, TO, dRW- kNN, and NtN, which were re-implemented based on the instructions of the original papers. The supplementary file and demo codes of NMFGO are available at http://mlda.swu.edu.cn/codes.php?name=NMFGO.
Collapse
|
30
|
Abstract
BACKGROUND A collection of disease-associated data contributes to study the association between diseases. Discovering closely related diseases plays a crucial role in revealing their common pathogenic mechanisms. This might further imply treatment that can be appropriated from one disease to another. During the past decades, a number of approaches for calculating disease similarity have been developed. However, most of them are designed to take advantage of single or few data sources, which results in their low accuracy. METHODS In this paper, we propose a novel method, called MultiSourcDSim, to calculate disease similarity by integrating multiple data sources, namely, gene-disease associations, GO biological process-disease associations and symptom-disease associations. Firstly, we establish three disease similarity networks according to the three disease-related data sources respectively. Secondly, the representation of each node is obtained by integrating the three small disease similarity networks. In the end, the learned representations are applied to calculate the similarity between diseases. RESULTS Our approach shows the best performance compared to the other three popular methods. Besides, the similarity network built by MultiSourcDSim suggests that our method can also uncover the latent relationships between diseases. CONCLUSIONS MultiSourcDSim is an efficient approach to predict similarity between diseases.
Collapse
|
31
|
Predicting effective drug combinations using gradient tree boosting based on features extracted from drug-protein heterogeneous network. BMC Bioinformatics 2019; 20:645. [PMID: 31818267 PMCID: PMC6902475 DOI: 10.1186/s12859-019-3288-1] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2019] [Accepted: 11/21/2019] [Indexed: 01/30/2023] Open
Abstract
Background Although targeted drugs have contributed to impressive advances in the treatment of cancer patients, their clinical benefits on tumor therapies are greatly limited due to intrinsic and acquired resistance of cancer cells against such drugs. Drug combinations synergistically interfere with protein networks to inhibit the activity level of carcinogenic genes more effectively, and therefore play an increasingly important role in the treatment of complex disease. Results In this paper, we combined the drug similarity network, protein similarity network and known drug-protein associations into a drug-protein heterogenous network. Next, we ran random walk with restart (RWR) on the heterogenous network using the combinatorial drug targets as the initial probability, and obtained the converged probability distribution as the feature vector of each drug combination. Taking these feature vectors as input, we trained a gradient tree boosting (GTB) classifier to predict new drug combinations. We conducted performance evaluation on the widely used drug combination data set derived from the DCDB database. The experimental results show that our method outperforms seven typical classifiers and traditional boosting algorithms. Conclusions The heterogeneous network-derived features introduced in our method are more informative and enriching compared to the primary ontology features, which results in better performance. In addition, from the perspective of network pharmacology, our method effectively exploits the topological attributes and interactions of drug targets in the overall biological network, which proves to be a systematic and reliable approach for drug discovery.
Collapse
|
32
|
DeepMiR2GO: Inferring Functions of Human MicroRNAs Using a Deep Multi-Label Classification Model. Int J Mol Sci 2019; 20:E6046. [PMID: 31801264 PMCID: PMC6928926 DOI: 10.3390/ijms20236046] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2019] [Revised: 11/25/2019] [Accepted: 11/26/2019] [Indexed: 01/08/2023] Open
Abstract
MicroRNAs (miRNAs) are a highly abundant collection of functional non-coding RNAs involved in cellular regulation and various complex human diseases. Although a large number of miRNAs have been identified, most of their physiological functions remain unknown. Computational methods play a vital role in exploring the potential functions of miRNAs. Here, we present DeepMiR2GO, a tool for integrating miRNAs, proteins and diseases, to predict the gene ontology (GO) functions based on multiple deep neuro-symbolic models. DeepMiR2GO starts by integrating the miRNA co-expression network, protein-protein interaction (PPI) network, disease phenotype similarity network, and interactions or associations among them into a global heterogeneous network. Then, it employs an efficient graph embedding strategy to learn potential network representations of the global heterogeneous network as the topological features. Finally, a deep multi-label classification network based on multiple neuro-symbolic models is built and used to annotate the GO terms of miRNAs. The predicted results demonstrate that DeepMiR2GO performs significantly better than other state-of-the-art approaches in terms of precision, recall, and maximum F-measure.
Collapse
|
33
|
Targeting Virus-host Protein Interactions: Feature Extraction and Machine Learning Approaches. Curr Drug Metab 2019; 20:177-184. [PMID: 30156155 DOI: 10.2174/1389200219666180829121038] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2018] [Revised: 05/21/2018] [Accepted: 08/02/2018] [Indexed: 01/15/2023]
Abstract
BACKGROUND Targeting critical viral-host Protein-Protein Interactions (PPIs) has enormous application prospects for therapeutics. Using experimental methods to evaluate all possible virus-host PPIs is labor-intensive and time-consuming. Recent growth in computational identification of virus-host PPIs provides new opportunities for gaining biological insights, including applications in disease control. We provide an overview of recent computational approaches for studying virus-host PPI interactions. METHODS In this review, a variety of computational methods for virus-host PPIs prediction have been surveyed. These methods are categorized based on the features they utilize and different machine learning algorithms including classical and novel methods. RESULTS We describe the pivotal and representative features extracted from relevant sources of biological data, mainly include sequence signatures, known domain interactions, protein motifs and protein structure information. We focus on state-of-the-art machine learning algorithms that are used to build binary prediction models for the classification of virus-host protein pairs and discuss their abilities, weakness and future directions. CONCLUSION The findings of this review confirm the importance of computational methods for finding the potential protein-protein interactions between virus and host. Although there has been significant progress in the prediction of virus-host PPIs in recent years, there is a lot of room for improvement in virus-host PPI prediction.
Collapse
|
34
|
Prediction of Disease Comorbidity Using HeteSim Scores based on Multiple Heterogeneous Networks. Curr Gene Ther 2019; 19:232-241. [DOI: 10.2174/1566523219666190917155959] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2019] [Revised: 06/14/2019] [Accepted: 06/16/2019] [Indexed: 12/25/2022]
Abstract
Background:
Accumulating experimental studies have indicated that disease comorbidity
causes additional pain to patients and leads to the failure of standard treatments compared to patients
who have a single disease. Therefore, accurate prediction of potential comorbidity is essential to design
more efficient treatment strategies. However, only a few disease comorbidities have been discovered
in the clinic.
Objective:
In this work, we propose PCHS, an effective computational method for predicting disease
comorbidity.
Materials and Methods:
We utilized the HeteSim measure to calculate the relatedness score for different
disease pairs in the global heterogeneous network, which integrates six networks based on biological
information, including disease-disease associations, drug-drug interactions, protein-protein interactions
and associations among them. We built the prediction model using the Support Vector Machine
(SVM) based on the HeteSim scores.
Results and Conclusion:
The results showed that PCHS performed significantly better than previous
state-of-the-art approaches and achieved an AUC score of 0.90 in 10-fold cross-validation. Furthermore,
some of our predictions have been verified in literatures, indicating the effectiveness of our method.
Collapse
|
35
|
Deep Learning Enables Accurate Prediction of Interplay Between lncRNA and Disease. Front Genet 2019; 10:937. [PMID: 31649723 PMCID: PMC6795129 DOI: 10.3389/fgene.2019.00937] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2019] [Accepted: 09/05/2019] [Indexed: 11/13/2022] Open
Abstract
Many studies have suggested that lncRNAs are involved in distinct and diverse biological processes. The mutation of lncRNAs plays a major role in a wide range of diseases. A comprehensive information of lncRNA-disease associations would improve our understanding of the underlying molecular mechanism that can explain the development of disease. However, the discovery of the relationship between lncRNA and disease in biological experiment is costly and time-consuming. Although many computational algorithms have been proposed in the last decade, there still exists much room to improve because of diverse computational limitations. In this paper, we proposed a deep-learning framework, NNLDA, to predict potential lncRNA-disease associations. We compared it with other two widely-used algorithms on a network with 205,959 interactions between 19,166 lncRNAs and 529 diseases. Results show that NNLDA outperforms other existing algorithm in the prediction of lncRNA-disease association. Additionally, NNLDA can be easily applied to large-scale datasets using the technique of mini-batch stochastic gradient descent. To our best knowledge, NNLDA is the first algorithm that uses deep neural networks to predict lncRNA-disease association. The source code of NNLDA can be freely accessed at https://github.com/gao793583308/NNLDA.
Collapse
|
36
|
Functions and Regulatory Mechanisms of lncRNAs in Skeletal Myogenesis, Muscle Disease and Meat Production. Cells 2019; 8:cells8091107. [PMID: 31546877 PMCID: PMC6769631 DOI: 10.3390/cells8091107] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2019] [Revised: 09/04/2019] [Accepted: 09/17/2019] [Indexed: 12/20/2022] Open
Abstract
Myogenesis is a complex biological process, and understanding the regulatory network of skeletal myogenesis will contribute to the treatment of human muscle related diseases and improvement of agricultural animal meat production. Long noncoding RNAs (lncRNAs) serve as regulators in gene expression networks, and participate in various biological processes. Recent studies have identified functional lncRNAs involved in skeletal muscle development and disease. These lncRNAs regulate the proliferation, differentiation, and fusion of myoblasts through multiple mechanisms, such as chromatin modification, transcription regulation, and microRNA sponge activity. In this review, we presented the latest advances regarding the functions and regulatory activities of lncRNAs involved in muscle development, muscle disease, and meat production. Moreover, challenges and future perspectives related to the identification of functional lncRNAs were also discussed.
Collapse
|
37
|
Fusion of multiple heterogeneous networks for predicting circRNA-disease associations. Sci Rep 2019; 9:9605. [PMID: 31270357 PMCID: PMC6610109 DOI: 10.1038/s41598-019-45954-x] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2019] [Accepted: 06/18/2019] [Indexed: 12/20/2022] Open
Abstract
Circular RNAs (circRNAs) are a newly identified type of non-coding RNA (ncRNA) that plays crucial roles in many cellular processes and human diseases, and are potential disease biomarkers and therapeutic targets in human diseases. However, experimentally verified circRNA-disease associations are very rare. Hence, developing an accurate and efficient method to predict the association between circRNA and disease may be beneficial to disease prevention, diagnosis, and treatment. Here, we propose a computational method named KATZCPDA, which is based on the KATZ method and the integrations among circRNAs, proteins, and diseases to predict circRNA-disease associations. KATZCPDA not only verifies existing circRNA-disease associations but also predicts unknown associations. As demonstrated by leave-one-out and 10-fold cross-validation, KATZCPDA achieves AUC values of 0.959 and 0.958, respectively. The performance of KATZCPDA was substantially higher than those of previously developed network-based methods. To further demonstrate the effectiveness of KATZCPDA, we apply KATZCPDA to predict the associated circRNAs of Colorectal cancer, glioma, breast cancer, and Tuberculosis. The results illustrated that the predicted circRNA-disease associations could rank the top 10 of the experimentally verified associations.
Collapse
|
38
|
A Data-Driven Game Theoretic Strategy for Developers in Software Crowdsourcing: A Case Study. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9040721] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Crowdsourcing has the advantages of being cost-effective and saving time, which is a typical embodiment of collective wisdom and community workers’ collaborative development. However, this development paradigm of software crowdsourcing has not been used widely. A very important reason is that requesters have limited knowledge about crowd workers’ professional skills and qualities. Another reason is that the crowd workers in the competition cannot get the appropriate reward, which affects their motivation. To solve this problem, this paper proposes a method of maximizing reward based on the crowdsourcing ability of workers, they can choose tasks according to their own abilities to obtain appropriate bonuses. Our method includes two steps: Firstly, it puts forward a method to evaluate the crowd workers’ ability, then it analyzes the intensity of competition for tasks at Topcoder.com—an open community crowdsourcing platform—on the basis of the workers’ crowdsourcing ability; secondly, it follows dynamic programming ideas and builds game models under complete information in different cases, offering a strategy of reward maximization for workers by solving a mixed-strategy Nash equilibrium. This paper employs crowdsourcing data from Topcoder.com to carry out experiments. The experimental results show that the distribution of workers’ crowdsourcing ability is uneven, and to some extent it can show the activity degree of crowdsourcing tasks. Meanwhile, according to the strategy of reward maximization, a crowd worker can get the theoretically maximum reward.
Collapse
|
39
|
k-Skip-n-Gram-RF: A Random Forest Based Method for Alzheimer's Disease Protein Identification. Front Genet 2019; 10:33. [PMID: 30809242 PMCID: PMC6379451 DOI: 10.3389/fgene.2019.00033] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2018] [Accepted: 01/17/2019] [Indexed: 11/18/2022] Open
Abstract
In this paper, a computational method based on machine learning technique for identifying Alzheimer's disease genes is proposed. Compared with most existing machine learning based methods, existing methods predict Alzheimer's disease genes by using structural magnetic resonance imaging (MRI) technique. Most methods have attained acceptable results, but the cost is expensive and time consuming. Thus, we proposed a computational method for identifying Alzheimer disease genes by use of the sequence information of proteins, and classify the feature vectors by random forest. In the proposed method, the gene protein information is extracted by adaptive k-skip-n-gram features. The proposed method can attain the accuracy to 85.5% on the selected UniProt dataset, which has been demonstrated by the experimental results.
Collapse
|
40
|
Predicting Gene Ontology Function of Human MicroRNAs by Integrating Multiple Networks. Front Genet 2019; 10:3. [PMID: 30761178 PMCID: PMC6361788 DOI: 10.3389/fgene.2019.00003] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2018] [Accepted: 01/07/2019] [Indexed: 12/15/2022] Open
Abstract
MicroRNAs (miRNAs) have been demonstrated to play significant biological roles in many human biological processes. Inferring the functions of miRNAs is an important strategy for understanding disease pathogenesis at the molecular level. In this paper, we propose an integrated model, PmiRGO, to infer the gene ontology (GO) functions of miRNAs by integrating multiple data sources, including the expression profiles of miRNAs, miRNA-target interactions, and protein-protein interactions (PPI). PmiRGO starts by building a global network consisting of three networks. Then, it employs DeepWalk to learn latent representations as network features of the global heterogeneous network. Finally, the SVM-based models are applied to label the GO terms of miRNAs. The experimental results show that PmiRGO has a significantly better performance than existing state-of-the-art methods in terms of F max . A case study further demonstrates the feasibility of PmiRGO to annotate the potential functions of miRNAs.
Collapse
|
41
|
Multiple Partial Regularized Nonnegative Matrix Factorization for Predicting Ontological Functions of lncRNAs. Front Genet 2019; 9:685. [PMID: 30728826 PMCID: PMC6351489 DOI: 10.3389/fgene.2018.00685] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2018] [Accepted: 12/10/2018] [Indexed: 02/02/2023] Open
Abstract
Long non-coding RNAs (LncRNA) are critical regulators for biological processes, which are highly related to complex diseases. Even though the next generation sequence technology facilitates the discovery of a great number of lncRNAs, the knowledge about the functions of lncRNAs is limited. Thus, it is promising to predict the functions of lncRNAs, which shed light on revealing the mechanisms of complex diseases. The current algorithms predict the functions of lncRNA by using the features of protein-coding genes. Generally speaking, these algorithms fuse heterogeneous genomic data to construct lncRNA-gene associations via a linear combination, which cannot fully characterize the function-lncRNA relations. To overcome this issue, we present an nonnegative matrix factorization algorithm with multiple partial regularization (aka MPrNMF) to predict the functions of lncRNAs without fusing the heterogeneous genomic data. In details, for each type of genomic data, we construct the lncRNA-gene associations, resulting in multiple associations. The proposed method integrates separately them via regularization strategy, rather than fuse them into a single type of associations. The results demonstrate that the proposed algorithm outperforms state-of-the-art methods based network-analysis. The model and algorithm provide an effective way to explore the functions of lncRNAs.
Collapse
|
42
|
Integrating Multiple Interaction Networks for Gene Function Inference. Molecules 2018; 24:molecules24010030. [PMID: 30577643 PMCID: PMC6337127 DOI: 10.3390/molecules24010030] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2018] [Revised: 12/19/2018] [Accepted: 12/20/2018] [Indexed: 01/17/2023] Open
Abstract
In the past few decades, the number and variety of genomic and proteomic data available have increased dramatically. Molecular or functional interaction networks are usually constructed according to high-throughput data and the topological structure of these interaction networks provide a wealth of information for inferring the function of genes or proteins. It is a widely used way to mine functional information of genes or proteins by analyzing the association networks. However, it remains still an urgent but unresolved challenge how to combine multiple heterogeneous networks to achieve more accurate predictions. In this paper, we present a method named ReprsentConcat to improve function inference by integrating multiple interaction networks. The low-dimensional representation of each node in each network is extracted, then these representations from multiple networks are concatenated and fed to gcForest, which augment feature vectors by cascading and automatically determines the number of cascade levels. We experimentally compare ReprsentConcat with a state-of-the-art method, showing that it achieves competitive results on the datasets of yeast and human. Moreover, it is robust to the hyperparameters including the number of dimensions.
Collapse
|
43
|
SFPEL-LPI: Sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions. PLoS Comput Biol 2018; 14:e1006616. [PMID: 30533006 PMCID: PMC6331124 DOI: 10.1371/journal.pcbi.1006616] [Citation(s) in RCA: 93] [Impact Index Per Article: 15.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2018] [Revised: 01/14/2019] [Accepted: 11/02/2018] [Indexed: 01/12/2023] Open
Abstract
LncRNA-protein interactions play important roles in post-transcriptional gene regulation, poly-adenylation, splicing and translation. Identification of lncRNA-protein interactions helps to understand lncRNA-related activities. Existing computational methods utilize multiple lncRNA features or multiple protein features to predict lncRNA-protein interactions, but features are not available for all lncRNAs or proteins; most of existing methods are not capable of predicting interacting proteins (or lncRNAs) for new lncRNAs (or proteins), which don’t have known interactions. In this paper, we propose the sequence-based feature projection ensemble learning method, “SFPEL-LPI”, to predict lncRNA-protein interactions. First, SFPEL-LPI extracts lncRNA sequence-based features and protein sequence-based features. Second, SFPEL-LPI calculates multiple lncRNA-lncRNA similarities and protein-protein similarities by using lncRNA sequences, protein sequences and known lncRNA-protein interactions. Then, SFPEL-LPI combines multiple similarities and multiple features with a feature projection ensemble learning frame. In computational experiments, SFPEL-LPI accurately predicts lncRNA-protein associations and outperforms other state-of-the-art methods. More importantly, SFPEL-LPI can be applied to new lncRNAs (or proteins). The case studies demonstrate that our method can find out novel lncRNA-protein interactions, which are confirmed by literature. Finally, we construct a user-friendly web server, available at http://www.bioinfotech.cn/SFPEL-LPI/. LncRNA-protein interactions play important roles in post-transcriptional gene regulation, poly-adenylation, splicing and translation. Identification of lncRNA-protein interactions helps to understand lncRNA-related activities. In this paper, we propose a novel computational method “SFPEL-LPI” to predict lncRNA-protein interactions. SFPEL-LPI makes use of lncRNA sequences, protein sequences and known lncRNA-protein associations to extract features and calculate similarities for lncRNAs and proteins, and then combines them with a feature projection ensemble learning frame. SFPEL-LPI can predict unobserved interactions between lncRNAs and proteins, and also can make predictions for new lncRNAs (or proteins), which have no interactions with any proteins (or lncRNAs). SFPEL-LPI produces high-accuracy performances on the benchmark dataset when evaluated by five-fold cross validation, and outperforms state-of-the-art methods. The case studies demonstrate that SFPEL-LPI can find out novel associations, which are confirmed by literature. To facilitate the lncRNA-protein interaction prediction, we develop a user-friendly web server, available at http://www.bioinfotech.cn/SFPEL-LPI/.
Collapse
|
44
|
Abstract
Background With the development of sequencing technology, more and more long non-coding RNAs (lncRNAs) have been identified. Some lncRNAs have been confirmed that they play an important role in the process of development through the dosage compensation effect, epigenetic regulation, cell differentiation regulation and other aspects. However, the majority of the lncRNAs have not been functionally characterized. Explore the function of lncRNAs and the regulatory network has become a hot research topic currently. Methods In the work, a network-based model named BiRWLGO is developed. The ultimate goal is to predict the probable functions for lncRNAs at large scale. The new model starts with building a global network composed of three networks: lncRNA similarity network, lncRNA-protein association network and protein-protein interaction (PPI) network. After that, it utilizes bi-random walk algorithm to explore the similarities between lncRNAs and proteins. Finally, we can annotate an lncRNA with the Gene Ontology (GO) terms according to its neighboring proteins. Results We compare the performance of BiRWLGO with the state-of-the-art models on a manually annotated lncRNA benchmark with known GO terms. The experimental results assert that BiRWLGO outperforms other methods in terms of both maximum F-measure (Fmax) and coverage. Conclusions BiRWLGO is a relatively efficient method to predict the functions of lncRNA. When protein interaction data is integrated, the predictive performance of BiRWLGO gains a great improvement. Electronic supplementary material The online version of this article (10.1186/s12920-018-0414-2) contains supplementary material, which is available to authorized users.
Collapse
|
45
|
SDADB: a functional annotation database of protein structural domains. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018; 2018:5046758. [PMID: 29961821 PMCID: PMC6025185 DOI: 10.1093/database/bay064] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/11/2017] [Accepted: 06/04/2018] [Indexed: 12/27/2022]
Abstract
Annotating functional terms with individual domains is essential for understanding the functions of full-length proteins. We describe SDADB, a functional annotation database for structural domains. SDADB provides associations between gene ontology (GO) terms and SCOP domains calculated with an integrated framework. GO annotations are assigned probabilities of being correct, which are estimated with a Bayesian network by taking advantage of structural neighborhood mappings, SCOP-InterPro domain mapping information, position-specific scoring matrices (PSSMs) and sequence homolog features, with the most substantial contribution coming from high-coverage structure-based domain-protein mappings. The domain-protein mappings are computed using large-scale structure alignment. SDADB contains ontological terms with probabilistic scores for more than 214 000 distinct SCOP domains. It also provides additional features include 3D structure alignment visualization, GO hierarchical tree view, search, browse and download options. Database URL: http://sda.denglab.org
Collapse
|
46
|
Accurate prediction of protein-lncRNA interactions by diffusion and HeteSim features across heterogeneous network. BMC Bioinformatics 2018; 19:370. [PMID: 30309340 PMCID: PMC6182872 DOI: 10.1186/s12859-018-2390-0] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2018] [Accepted: 09/19/2018] [Indexed: 12/12/2022] Open
Abstract
Background Identifying the interactions between proteins and long non-coding RNAs (lncRNAs) is of great importance to decipher the functional mechanisms of lncRNAs. However, current experimental techniques for detection of lncRNA-protein interactions are limited and inefficient. Many methods have been proposed to predict protein-lncRNA interactions, but few studies make use of the topological information of heterogenous biological networks associated with the lncRNAs. Results In this work, we propose a novel approach, PLIPCOM, using two groups of network features to detect protein-lncRNA interactions. In particular, diffusion features and HeteSim features are extracted from protein-lncRNA heterogenous network, and then combined to build the prediction model using the Gradient Tree Boosting (GTB) algorithm. Our study highlights that the topological features of the heterogeneous network are crucial for predicting protein-lncRNA interactions. The cross-validation experiments on the benchmark dataset show that PLIPCOM method substantially outperformed previous state-of-the-art approaches in predicting protein-lncRNA interactions. We also prove the robustness of the proposed method on three unbalanced data sets. Moreover, our case studies demonstrate that our method is effective and reliable in predicting the interactions between lncRNAs and proteins. Availability The source code and supporting files are publicly available at: http://denglab.org/PLIPCOM/.
Collapse
|
47
|
RicyerDB: A Database For Collecting Rice Yield-related Genes with Biological Analysis. Int J Biol Sci 2018; 14:965-970. [PMID: 29989091 PMCID: PMC6036756 DOI: 10.7150/ijbs.23328] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2017] [Accepted: 12/25/2017] [Indexed: 11/16/2022] Open
Abstract
The Rice Yield-related Database (RicyerDB) was created to complement with related research of influence rice (Oryza sativa L.) yield in multiple traits by manually curating the related databases and literature, and genomics and proteomics information that could be useful for comprehensive understanding of the rice biology. RicyerDB provides a more valuable resource in which to efficiently investigate, browse and analyze yield-related genes. The whole data set can be easily queried and downloaded through the webpage. In addition, RicyerDB also constructed a protein-protein interaction network with biological analysis. The combined rice database opens a new path to facilitate researchers achieving information on rice gene in terms of their effects on traits important for rice breeding. The web server is freely available at: http://server.malab.cn/Ricyer/index.html.
Collapse
|
48
|
A sparse autoencoder-based deep neural network for protein solvent accessibility and contact number prediction. BMC Bioinformatics 2017; 18:569. [PMID: 29297299 PMCID: PMC5751690 DOI: 10.1186/s12859-017-1971-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND Direct prediction of the three-dimensional (3D) structures of proteins from one-dimensional (1D) sequences is a challenging problem. Significant structural characteristics such as solvent accessibility and contact number are essential for deriving restrains in modeling protein folding and protein 3D structure. Thus, accurately predicting these features is a critical step for 3D protein structure building. RESULTS In this study, we present DeepSacon, a computational method that can effectively predict protein solvent accessibility and contact number by using a deep neural network, which is built based on stacked autoencoder and a dropout method. The results demonstrate that our proposed DeepSacon achieves a significant improvement in the prediction quality compared with the state-of-the-art methods. We obtain 0.70 three-state accuracy for solvent accessibility, 0.33 15-state accuracy and 0.74 Pearson Correlation Coefficient (PCC) for the contact number on the 5729 monomeric soluble globular protein dataset. We also evaluate the performance on the CASP11 benchmark dataset, DeepSacon achieves 0.68 three-state accuracy and 0.69 PCC for solvent accessibility and contact number, respectively. CONCLUSIONS We have shown that DeepSacon can reliably predict solvent accessibility and contact number with stacked sparse autoencoder and a dropout approach.
Collapse
|
49
|
Protein-Protein Interactions Prediction Using a Novel Local Conjoint Triad Descriptor of Amino Acid Sequences. Int J Mol Sci 2017; 18:ijms18112373. [PMID: 29117139 PMCID: PMC5713342 DOI: 10.3390/ijms18112373] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2017] [Revised: 11/01/2017] [Accepted: 11/04/2017] [Indexed: 12/26/2022] Open
Abstract
Protein-protein interactions (PPIs) play crucial roles in almost all cellular processes. Although a large amount of PPIs have been verified by high-throughput techniques in the past decades, currently known PPIs pairs are still far from complete. Furthermore, the wet-lab experiments based techniques for detecting PPIs are time-consuming and expensive. Hence, it is urgent and essential to develop automatic computational methods to efficiently and accurately predict PPIs. In this paper, a sequence-based approach called DNN-LCTD is developed by combining deep neural networks (DNNs) and a novel local conjoint triad description (LCTD) feature representation. LCTD incorporates the advantage of local description and conjoint triad, thus, it is capable to account for the interactions between residues in both continuous and discontinuous regions of amino acid sequences. DNNs can not only learn suitable features from the data by themselves, but also learn and discover hierarchical representations of data. When performing on the PPIs data of Saccharomyces cerevisiae, DNN-LCTD achieves superior performance with accuracy as 93.12%, precision as 93.75%, sensitivity as 93.83%, area under the receiver operating characteristic curve (AUC) as 97.92%, and it only needs 718 s. These results indicate DNN-LCTD is very promising for predicting PPIs. DNN-LCTD can be a useful supplementary tool for future proteomics study.
Collapse
|
50
|
Systematic analysis reveals a lncRNA-mRNA co-expression network associated with platinum resistance in high-grade serous ovarian cancer. Invest New Drugs 2017; 36:187-194. [PMID: 29082457 DOI: 10.1007/s10637-017-0523-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2017] [Accepted: 10/11/2017] [Indexed: 10/18/2022]
Abstract
Resistance to platinum-based chemotherapy is the major barrier to treating high-grade serous ovarian cancer (HGS-OvCa). To improve HGS-OvCa patient prognosis, it is critical to identify the underlying mechanisms that promote platinum resistance. The goal of the present study was to identify a lncRNA-mRNA co-expression network and key lncRNAs that predict resistance to platinum-based chemotherapy in ovarian cancer patients. By systematically analyzing the expression profiles of lncRNAs and mRNAs in HGS-OvCa samples from the Cancer Genome Atlas (TCGA), we revealed that lncRNAs play important roles in platinum resistance in HGS-OvCa patients and delineate a lncRNA-mRNA co-expression network in HGS-OvCa patients who exhibit platinum resistance. Within the platinum resistance-specific lncRNA-mRNA network, 35 lncRNAs and 270 mRNAs showed 124 significant lncRNA-mRNA co-expression relationships. Pathway analysis revealed that lncRNAs in the platinum resistance network may participate in platinum resistance by regulating metabolic pathways. Moreover, HGS-OvCa patients with low lncRNA RP5-1120P11.1 expression showed a poorer prognosis than those with high lncRNA RP5-1120P11.1 expression in TCGA dataset (P = 2.74 × 10-5, log rank test), which was also validated in the GSE63885 dataset (P = 0.0242, log rank test). Network and function analysis revealed that lncRNA RP5-1120P11.1 regulates many cancer-related signaling pathways, such as the PI3K-AKT signaling pathway (P = 1.02 × 10-5, hypergeometric test) and the Jak-STAT signaling pathway (P = 1.71 × 10-4, hypergeometric test). Particularly, lncRNA RP5-1120P11.1 expression is significantly positively correlated with ABCC10 gene expression (P = 3.89 × 10-3, Pearson correlation test). Both lncRNA RP5-1120P11.1 and ABCC10 were down-regulated in platinum-resistant HGS-OvCa patients, and RP5-1120P11.1 is located near ABCC10 on chromosome 6. Gene ABCC10 has been implicated in resistance to docetaxel treatment. The present study paves the way for investigating lncRNA functions in platinum drug resistance and identifying lncRNAs with prognostic and therapeutic potential in HGS-OvCa.
Collapse
|