1
Zou HT, Ji BY, Xie XL. A multi-source molecular network representation model for protein-protein interactions prediction. Sci Rep 2024; 14:6184. [PMID: 38485942 PMCID: PMC10940665 DOI: 10.1038/s41598-024-56286-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Accepted: 03/05/2024] [Indexed: 03/18/2024] Open
Abstract
The prediction of potential protein-protein interactions (PPIs) is a critical step in decoding diseases and understanding cellular mechanisms. Traditional biological experiments have identified many potential PPIs in recent years, but the problem is still far from solved. Hence, there is an urgent need to develop accurate and efficient computational models to predict potential PPIs. In this study, we propose a multi-source molecular network representation learning model (called MultiPPIs) to predict potential protein-protein interactions. Specifically, we first extract protein sequence features from the physicochemical properties of amino acids using the auto covariance method. Second, a multi-source association network is constructed by integrating the known associations among miRNAs, proteins, lncRNAs, drugs, and diseases. The graph representation learning method DeepWalk is adopted to extract the multi-source association information of proteins with other biomolecules. In this way, the known protein-protein interaction pairs can be represented as a concatenation of the protein sequence features and the multi-source association features of proteins. Finally, a Random Forest classifier with its corresponding optimal parameters is used for training and prediction. MultiPPIs obtains an average prediction accuracy of 86.03% with a sensitivity of 82.69% and an AUC of 93.03% under five-fold cross-validation. The experimental results indicate that MultiPPIs has good prediction performance and provides valuable insights into the prediction of potential protein-protein interactions. MultiPPIs is freely available at https://github.com/jiboyalab/multiPPIs.
Affiliation(s)
- Hai-Tao Zou
  - College of Information Science and Engineering, Guilin University of Technology, Guilin, 541000, China
- Bo-Ya Ji
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410000, China
- Xiao-Lan Xie
  - College of Information Science and Engineering, Guilin University of Technology, Guilin, 541000, China
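The auto covariance step named in the MultiPPIs abstract can be illustrated with a minimal sketch. It assumes the commonly used definition, in which each physicochemical property is centered along the sequence and multiplied with itself at a fixed lag. The function name `auto_covariance`, the `max_lag` parameter, and the two-property table `toy_props` are hypothetical placeholders and are not taken from the MultiPPIs repository.

```python
import numpy as np

def auto_covariance(seq, props, max_lag=3):
    """Auto covariance features of a protein sequence.

    seq     : string of amino-acid letters
    props   : dict mapping each letter to a list of property values
    max_lag : largest distance between residue pairs (must be < len(seq))
    Returns a vector of length max_lag * number_of_properties.
    """
    X = np.array([props[aa] for aa in seq], dtype=float)  # shape (L, P)
    n = X.shape[0]
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    centered = X - X.mean(axis=0)  # center each property along the sequence
    feats = []
    for lag in range(1, max_lag + 1):
        # average product of property values that are `lag` residues apart
        feats.append((centered[:-lag] * centered[lag:]).sum(axis=0) / (n - lag))
    return np.concatenate(feats)

# Made-up property values for illustration only (not a real amino-acid scale).
toy_props = {"A": [1.0, 0.0], "G": [0.0, 1.0]}
vec_a = auto_covariance("AGAGAG", toy_props, max_lag=2)
vec_b = auto_covariance("AAGGAA", toy_props, max_lag=2)

# A protein pair can then be described by concatenating both vectors.
pair_features = np.concatenate([vec_a, vec_b])
```

In the pipeline the abstract describes, such sequence vectors would be concatenated with network embeddings before classification; that part is omitted here.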
2
Yu G, Huang Q, Zhang X, Guo M, Wang J. Tissue Specificity Based Isoform Function Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3048-3059. [PMID: 34185647 DOI: 10.1109/tcbb.2021.3093167] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/13/2023]
Abstract
Alternative splicing enables a gene to be spliced into different isoforms and hence protein variants. Identifying the individual functions of these isoforms helps decipher the functional diversity of proteins. Although much effort has been devoted to automatic gene function prediction, few efforts have been directed toward computational isoform function prediction, mainly due to the unavailable (or scanty) functional annotations of isoforms. Existing efforts directly combine multiple RNA-seq datasets without accounting for the important tissue specificity of alternative splicing. To bridge this gap, we introduce a novel approach called TS-Isofun to predict the functions of isoforms by integrating multiple functional association networks with respect to tissue specificity. TS-Isofun first constructs tissue-specific isoform functional association networks from multiple tissue-wise RNA-seq datasets. Next, TS-Isofun assigns weights to these networks and models the tissue specificity by selectively integrating them with adaptive weights. It then introduces a joint matrix factorization-based data fusion model to leverage the integrated network, gene-level data and functional annotations of genes to infer the functions of isoforms. To achieve coherent weight assignment and isoform function prediction, TS-Isofun jointly optimizes the weights of individual networks and the isoform function prediction in a unified objective function. Experimental results show that TS-Isofun significantly outperforms state-of-the-art methods and that accounting for tissue specificity contributes to more accurate isoform function prediction.
3
Yu G, Yang Y, Yan Y, Guo M, Zhang X, Wang J. DeepIDA: Predicting Isoform-Disease Associations by Data Fusion and Deep Neural Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:2166-2176. [PMID: 33571094 DOI: 10.1109/tcbb.2021.3058801] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 06/12/2023]
Abstract
Alternative splicing produces different isoforms from the same gene locus; it is an important mechanism for regulating gene expression and proteome diversity. Although the prediction of gene(ncRNA)-disease associations has been extensively studied, few (or no) computational solutions have been proposed for the prediction of isoform-disease associations (IDAs) at a large scale, mainly due to the lack of disease annotations of isoforms. However, increasing evidence confirms the associations between diseases and isoforms, which can more precisely uncover the pathology of complex diseases. Therefore, it is highly desirable to predict IDAs. To bridge this gap, we propose a deep neural network based solution (DeepIDA) to fuse multi-type genomics and transcriptomics data to predict IDAs. In particular, DeepIDA uses gene-isoform relations to dispatch gene-disease associations to isoforms. In addition, it utilizes two DNN sub-networks with different structures to capture nucleotide and expression features of isoforms, Gene Ontology data and miRNA target data, respectively. After that, these two sub-networks are merged in a dense layer to predict IDAs. The experimental results on public datasets show that DeepIDA can effectively predict IDAs with an AUPRC (area under the precision-recall curve) of 0.9141, a macro F-measure of 0.9155, a G-mean of 0.9278 and a balanced accuracy of 0.9303 across 732 diseases, which are much higher than those of competitive methods. Further study on sixteen isoform-disease association cases again corroborates the superiority of DeepIDA. The code of DeepIDA is available at http://mlda.swu.edu.cn/codes.php?name=DeepIDA.
4
Zhang Y, Duan L, Zheng H, Li-Ling J, Qin R, Chen Z, He C, Wang T. Mining Similar Aspects for Gene Similarity Explanation Based on Gene Information Network. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1734-1746. [PMID: 33259307 DOI: 10.1109/tcbb.2020.3041559] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/12/2023]
Abstract
Analysis of gene similarity not only can provide information on the understanding of the biological roles and functions of a gene, but may also reveal the relationships among various genes. In this paper, we introduce a novel idea of mining similar aspects from a gene information network, i.e., for a given gene pair, we want to know in which aspects (meta paths) they are most similar from the perspective of the gene information network. We defined a similarity metric based on the set of meta paths connecting the query genes in the gene information network and used the rank of similarity of a gene pair in a meta path set to measure the similarity significance in that aspect. A minimal set of gene meta paths where the query gene pair ranks the highest is a similar aspect, and the similar aspect of a query gene pair is far from trivial. We proposed a novel method, SCENARIO, to investigate minimal similar aspects. Our empirical study on the gene information network, constructed from six public gene-related databases, verified that our proposed method is effective, efficient, and useful.
5
Liu L, Zhu S. Computational Methods for Prediction of Human Protein-Phenotype Associations: A Review. Phenomics (Cham, Switzerland) 2021; 1:171-185. [PMID: 36939789 PMCID: PMC9590544 DOI: 10.1007/s43657-021-00019-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/23/2021] [Revised: 06/05/2021] [Accepted: 06/16/2021] [Indexed: 12/01/2022]
Abstract
Deciphering the relationship between human proteins (genes) and phenotypes is one of the fundamental tasks in phenomics research. The Human Phenotype Ontology (HPO) builds upon a standardized logical vocabulary to describe the abnormal phenotypes encountered in human diseases and paves the way towards the computational analysis of their genetic causes. To date, many computational methods have been proposed to predict the HPO annotations of proteins. In this paper, we conduct a comprehensive review of the existing approaches to predicting HPO annotations of novel proteins, identifying missing HPO annotations, and prioritizing candidate proteins with respect to a certain HPO term. For each topic, we first give a formalized description of the problem and then systematically revisit the published literature, highlighting its advantages and disadvantages, followed by a discussion of the challenges and promising future directions. In addition, we point out several potential topics worthy of exploration, including the selection of negative HPO annotations and the detection of HPO misannotations. We believe that this review will provide insight to researchers in the field of computational phenotype analysis in terms of comprehending and developing novel prediction algorithms.
Affiliation(s)
- Lizhi Liu
  - School of Computer Science, Fudan University, Shanghai, 200433 China
- Shanfeng Zhu
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, 200433 China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, 200433 China
  - MOE Frontiers Center for Brain Science, Fudan University, Shanghai, 200433 China
  - Zhangjiang Fudan International Innovation Center, Shanghai, 200433 China
  - Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, 200433 China
6
Zhao Y, Wang J, Guo M, Zhang X, Yu G. Cross-Species Protein Function Prediction with Asynchronous-Random Walk. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1439-1450. [PMID: 31562099 DOI: 10.1109/tcbb.2019.2943342] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 06/10/2023]
Abstract
Protein function prediction is a fundamental task in the post-genomic era. Available functional annotations of proteins are incomplete, and the annotations of two homologous species are complementary to each other. However, how to effectively leverage mutually complementary annotations of different species to further boost the prediction performance is still not well studied. In this paper, we propose a cross-species protein function prediction approach that performs an Asynchronous Random Walk on a heterogeneous network (AsyRW). AsyRW first constructs a heterogeneous network to integrate multiple functional association networks derived from different biological data, established homology relationships between proteins from different species, known annotations of proteins and the Gene Ontology (GO). To account for the intrinsic intra- and inter-species structures of proteins and that of GO, AsyRW quantifies the individual walk length of each network node using a gravity-like theory, and then performs an asynchronous random walk with the individual lengths to predict associations between proteins and GO terms. Experiments on annotations archived in different years show that individual walk lengths and the asynchronous random walk can effectively leverage the complementary annotations of different species, and that AsyRW significantly outperforms other related and competitive methods. The code of AsyRW is available at: http://mlda.swu.edu.cn/codes.php?name=AsyRW.
7
Yu G, Zeng J, Wang J, Zhang H, Zhang X, Guo M. Imbalance deep multi‐instance learning for predicting isoform–isoform interactions. Int J Intell Syst 2021. [DOI: 10.1002/int.22402] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/15/2022]
Affiliation(s)
- Guoxian Yu
  - School of Software, Shandong University, Jinan, China
  - College of Computer and Information Science, Southwest University, Chongqing, China
  - Joint SDU‐NTU Centre for Artificial Intelligence Research, Shandong University, Jinan, China
- Jie Zeng
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Jun Wang
  - College of Computer and Information Science, Southwest University, Chongqing, China
  - Joint SDU‐NTU Centre for Artificial Intelligence Research, Shandong University, Jinan, China
- Hong Zhang
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Xiangliang Zhang
  - CEMSE, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
8
Zhou G, Wang J, Zhang X, Guo M, Yu G. Predicting functions of maize proteins using graph convolutional network. BMC Bioinformatics 2020; 21:420. [PMID: 33323113 PMCID: PMC7739465 DOI: 10.1186/s12859-020-03745-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 11/27/2022] Open
Abstract
Background: Maize (Zea mays ssp. mays L.) is the most widely grown and highest-yield crop in the world, as well as an important model organism for fundamental research on the function of genes. The functions of maize proteins are annotated using the Gene Ontology (GO), which has more than 40000 terms and organizes them in a directed acyclic graph (DAG). It is a huge challenge to accurately annotate a maize protein with the relevant GO terms from such a large number of candidates. Some deep learning models have been proposed to predict protein function, but the effectiveness of these approaches is unsatisfactory. One major reason is that they inadequately utilize the GO hierarchy. Results: To use the knowledge encoded in the GO hierarchy, we propose a deep Graph Convolutional Network (GCN) based model (DeepGOA) to predict GO annotations of proteins. DeepGOA first quantifies the correlations (or edges) between GO terms and updates the edge weights of the DAG by leveraging GO annotations and the hierarchy, then learns the semantic representations and latent inter-relations of GO terms by applying GCN on the updated DAG. Meanwhile, a Convolutional Neural Network (CNN) is used to learn the feature representation of amino acid sequences with respect to the semantic representations. After that, DeepGOA computes the dot product of the two representations, which enables the whole network to be trained end-to-end coherently. Extensive experiments show that DeepGOA can effectively integrate GO structural information and amino acid information, and then annotate proteins accurately. Conclusions: Experiments on the maize PH207 inbred line and a human protein sequence dataset show that DeepGOA outperforms the state-of-the-art deep learning based methods. The ablation study shows that GCN can employ the knowledge of GO and boost the performance. Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=DeepGOA.
Affiliation(s)
- Guangjie Zhou
  - School of Software, Shandong University, Jinan, China
  - College of Computer and Information Sciences, Chongqing, China
- Jun Wang
  - College of Computer and Information Sciences, Chongqing, China
- Xiangliang Zhang
  - CEMSE, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
- Guoxian Yu
  - School of Software, Shandong University, Jinan, China
  - College of Computer and Information Sciences, Chongqing, China
  - CEMSE, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
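The final scoring step in the DeepGOA abstract, a dot product between sequence representations and GO term representations, can be sketched as follows. The random matrices only stand in for the CNN and GCN outputs, and squashing the result with a sigmoid is an assumption made here to obtain values between 0 and 1. None of the names below come from the DeepGOA code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_proteins, n_terms, dim = 4, 6, 8

# Placeholders: in the described model these would be learned representations.
protein_repr = rng.normal(size=(n_proteins, dim))  # stand-in for CNN output
term_repr = rng.normal(size=(n_terms, dim))        # stand-in for GCN output

def predict_scores(protein_repr, term_repr):
    """Score every (protein, GO term) pair by a dot product of representations."""
    logits = protein_repr @ term_repr.T      # shape (n_proteins, n_terms)
    return 1.0 / (1.0 + np.exp(-logits))     # assumed sigmoid squashing

scores = predict_scores(protein_repr, term_repr)
```

Because both representations feed a single differentiable product, gradients can flow to both encoders, which is consistent with the end-to-end training the abstract mentions.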
9
Makrodimitris S, van Ham RCHJ, Reinders MJT. Automatic Gene Function Prediction in the 2020's. Genes (Basel) 2020; 11:E1264. [PMID: 33120976 PMCID: PMC7692357 DOI: 10.3390/genes11111264] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 09/28/2020] [Revised: 10/19/2020] [Accepted: 10/21/2020] [Indexed: 02/06/2023] Open
Abstract
The current rate at which new DNA and protein sequences are being generated is too fast to experimentally discover the functions of those sequences, emphasizing the need for accurate Automatic Function Prediction (AFP) methods. AFP has been an active and growing research field for decades and has made considerable progress in that time. However, it is certainly not solved. In this paper, we describe challenges that the AFP field still has to overcome in the future to increase its applicability. The challenges we consider are how to: (1) include condition-specific functional annotation, (2) predict functions for non-model species, (3) include new informative data sources, (4) deal with the biases of Gene Ontology (GO) annotations, and (5) maximally exploit the GO to obtain performance gains. We also provide recommendations for addressing those challenges, by adapting (1) the way we represent proteins and genes, (2) the way we represent gene functions, and (3) the algorithms that perform the prediction from gene to function. Together, we show that AFP is still a vibrant research area that can benefit from continuing advances in machine learning with which AFP in the 2020s can again take a large step forward reinforcing the power of computational biology.
Affiliation(s)
- Stavros Makrodimitris
  - Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands
  - Keygene N.V., 6708PW Wageningen, The Netherlands
- Roeland C. H. J. van Ham
  - Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands
  - Keygene N.V., 6708PW Wageningen, The Netherlands
- Marcel J. T. Reinders
  - Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, 2333ZC Leiden, The Netherlands
10
Yu G, Wang K, Domeniconi C, Guo M, Wang J. Isoform function prediction based on bi-random walks on a heterogeneous network. Bioinformatics 2020; 36:303-310. [PMID: 31250882 DOI: 10.1093/bioinformatics/btz535] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 09/12/2018] [Revised: 06/21/2019] [Accepted: 06/26/2019] [Indexed: 01/29/2023] Open
Abstract
MOTIVATION Alternative splicing contributes to the functional diversity of protein species and the proteoforms translated from alternatively spliced isoforms of a gene actually execute the biological functions. Computationally predicting the functions of genes has been studied for decades. However, how to distinguish the functional annotations of isoforms, whose annotations are essential for understanding developmental abnormalities and cancers, is rarely explored. The main bottleneck is that functional annotations of isoforms are generally unavailable and functional genomic databases universally store the functional annotations at the gene level. RESULTS We propose IsoFun to accomplish Isoform Function prediction based on bi-random walks on a heterogeneous network. IsoFun firstly constructs an isoform functional association network based on the expression profiles of isoforms derived from multiple RNA-seq datasets. Next, IsoFun uses the available Gene Ontology annotations of genes, gene-gene interactions and the relations between genes and isoforms to construct a heterogeneous network. After this, IsoFun performs a tailored bi-random walk on the heterogeneous network to predict the association between GO terms and isoforms, thus accomplishing the prediction of GO annotations of isoforms. Experimental results show that IsoFun significantly outperforms the state-of-the-art algorithms and improves the area under the receiver-operating curve (AUROC) and the area under the precision-recall curve (AUPRC) by 17% and 44% at the gene-level, respectively. We further validated the performance of IsoFun on the genes ADAM15 and BCL2L1. IsoFun accurately differentiates the functions of respective isoforms of these two genes. AVAILABILITY AND IMPLEMENTATION The code of IsoFun is available at http://mlda.swu.edu.cn/codes.php?name=IsoFun. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Guoxian Yu
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Keyao Wang
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Carlotta Domeniconi
  - Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
  - Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing, China
- Jun Wang
  - College of Computer and Information Science, Southwest University, Chongqing, China
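As a rough illustration of the bi-random walk idea in the IsoFun abstract, the sketch below runs a generic averaged bi-random walk with restart over two similarity networks. It is not IsoFun's tailored variant. The toy matrices, the restart weight `alpha`, and the step limits are all assumed values chosen for the example.

```python
import numpy as np

def normalize(M):
    """Symmetric normalization D^(-1/2) M D^(-1/2); zero-degree rows stay zero."""
    d = M.sum(axis=1)
    inv = np.zeros_like(d, dtype=float)
    nz = d > 0
    inv[nz] = 1.0 / np.sqrt(d[nz])
    return M * inv[:, None] * inv[None, :]

def bi_random_walk(S_row, S_col, Y, alpha=0.8, left_steps=3, right_steps=3):
    """Walk on the row network (left) and the column network (right), then average.

    S_row : normalized similarity among row entities (e.g. isoforms)
    S_col : normalized similarity among column entities (e.g. GO terms)
    Y     : known row-column associations, used as the restart matrix
    """
    R = Y.astype(float).copy()
    for t in range(1, max(left_steps, right_steps) + 1):
        terms = []
        if t <= left_steps:
            terms.append(alpha * S_row @ R + (1 - alpha) * Y)
        if t <= right_steps:
            terms.append(alpha * R @ S_col + (1 - alpha) * Y)
        R = sum(terms) / len(terms)
    return R

# Toy data: isoform 1 is far more similar to isoform 0 than isoform 2 is.
S_iso = normalize(np.array([[0.0, 1.0, 0.1],
                            [1.0, 0.0, 0.1],
                            [0.1, 0.1, 0.0]]))
S_go = normalize(np.array([[0.0, 0.5],
                           [0.5, 0.0]]))
Y = np.array([[1.0, 0.0],   # only isoform 0 is annotated, with term 0
              [0.0, 0.0],
              [0.0, 0.0]])

R = bi_random_walk(S_iso, S_go, Y)
```

In this toy setting the annotation of isoform 0 spreads mostly to its close neighbor, so isoform 1 ends up with a higher score for term 0 than isoform 2.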
11
Abstract
MOTIVATION With the ever-increasing number and diversity of sequenced species, the challenge to characterize genes with functional information is even more important. In most species, this characterization almost entirely relies on automated electronic methods. As such, it is critical to benchmark the various methods. The Critical Assessment of protein Function Annotation algorithms (CAFA) series of community experiments provide the most comprehensive benchmark, with a time-delayed analysis leveraging newly curated experimentally supported annotations. However, the definition of a false positive in CAFA has not fully accounted for the open world assumption (OWA), leading to a systematic underestimation of precision. The main reason for this limitation is the relative paucity of negative experimental annotations. RESULTS This article introduces a new, OWA-compliant, benchmark based on a balanced test set of positive and negative annotations. The negative annotations are derived from expert-curated annotations of protein families on phylogenetic trees. This approach results in a large increase in the average information content of negative annotations. The benchmark has been tested using the naïve and BLAST baseline methods, as well as two orthology-based methods. This new benchmark could complement existing ones in future CAFA experiments. AVAILABILITY AND IMPLEMENTATION All data, as well as code used for analysis, is available from https://lab.dessimoz.org/20_not. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alex Warwick Vesztrocy
  - Department of Genetics, Evolution and Environment, University College London, London, WC1E 6BT, UK
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Christophe Dessimoz
  - Department of Genetics, Evolution and Environment, University College London, London, WC1E 6BT, UK
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
  - Department of Computer Science, University College London, London, WC1E 6BT, UK
  - Centre for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
12
Zhao Y, Wang J, Chen J, Zhang X, Guo M, Yu G. A Literature Review of Gene Function Prediction by Modeling Gene Ontology. Front Genet 2020; 11:400. [PMID: 32391061 PMCID: PMC7193026 DOI: 10.3389/fgene.2020.00400] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 01/02/2020] [Accepted: 03/30/2020] [Indexed: 12/14/2022] Open
Abstract
Annotating the functional properties of gene products, i.e., RNAs and proteins, is a fundamental task in biology. The Gene Ontology database (GO) was developed to systematically describe the functional properties of gene products across species, and to facilitate the computational prediction of gene function. As GO is routinely updated, it serves as the gold standard and main knowledge source in functional genomics. Many gene function prediction methods making use of GO have been proposed. But no literature review has summarized these methods and the possibilities for future efforts from the perspective of GO. To bridge this gap, we review the existing methods with an emphasis on recent solutions. First, we introduce the conventions of GO and the widely adopted evaluation metrics for gene function prediction. Next, we summarize current methods of gene function prediction that apply GO in different ways, such as using hierarchical or flat inter-relationships between GO terms, compressing massive GO terms and quantifying semantic similarities. Although many efforts have improved performance by harnessing GO, we conclude that there remain many largely overlooked but important topics for future research.
Affiliation(s)
- Yingwen Zhao
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Jun Wang
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Jian Chen
  - State Key Laboratory of Agrobiotechnology and National Maize Improvement Center, China Agricultural University, Beijing, China
- Xiangliang Zhang
  - CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
- Guoxian Yu
  - College of Computer and Information Science, Southwest University, Chongqing, China
  - CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
13
Yu G, Wang K, Fu G, Guo M, Wang J. NMFGO: Gene Function Prediction via Nonnegative Matrix Factorization with Gene Ontology. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:238-249. [PMID: 30059316 DOI: 10.1109/tcbb.2018.2861379] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 06/08/2023]
Abstract
Gene Ontology (GO) is a controlled vocabulary of terms that describe the molecular functions, biological roles, and cellular locations of gene products (i.e., proteins and RNAs); it hierarchically organizes more than 43,000 GO terms via a directed acyclic graph. A gene is generally annotated with several of these GO terms. Therefore, accurately predicting the associations between genes and this massive number of terms is a difficult challenge. To address this challenge, we propose a matrix factorization based approach called NMFGO. NMFGO stores the available GO annotations of genes in a gene-term association matrix and adopts an ontological structure based taxonomic similarity measure to capture the GO hierarchy. Next, it factorizes the association matrix into two low-rank matrices via nonnegative matrix factorization regularized with the GO hierarchy. After that, it employs a semantic similarity based k nearest neighbor classifier in the subspace approximated by the low-rank matrices to predict gene functions. An empirical study on three model species (S. cerevisiae, H. sapiens, and A. thaliana) shows that NMFGO is robust to the input parameters and achieves significantly better prediction performance than GIC, TO, dRW-kNN, and NtN, which were re-implemented based on the instructions of the original papers. The supplementary file and demo codes of NMFGO are available at http://mlda.swu.edu.cn/codes.php?name=NMFGO.
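The factorization step in the NMFGO abstract can be illustrated with plain nonnegative matrix factorization using the standard multiplicative update rules. This sketch deliberately leaves out the GO hierarchy regularization and the k nearest neighbor classifier, so it is not the NMFGO algorithm itself. The function `nmf`, its parameters, and the toy gene-term matrix are illustrative assumptions.

```python
import numpy as np

def nmf(A, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorize a nonnegative matrix A (n x m) into W (n x rank) and H (rank x m)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry of W and H nonnegative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy gene-term association matrix (rows: genes, columns: GO terms); made-up values.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]], dtype=float)

W, H = nmf(A, rank=2)
scores = W @ H  # low-rank reconstruction; larger entries suggest likelier associations
```

Entries of `scores` at positions where `A` is zero can be read as candidate annotations; a regularized variant would additionally pull related GO terms toward similar factors.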
14
Affiliation(s)
- Stanislav Mazurenko
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
- Zbynek Prokop
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
  - International Centre for Clinical Research, St. Ann’s Hospital, 602 00 Brno, Czech Republic
- Jiri Damborsky
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
  - International Centre for Clinical Research, St. Ann’s Hospital, 602 00 Brno, Czech Republic
15
Plyusnin I, Holm L, Törönen P. Novel comparison of evaluation metrics for gene ontology classifiers reveals drastic performance differences. PLoS Comput Biol 2019; 15:e1007419. [PMID: 31682632 PMCID: PMC6855565 DOI: 10.1371/journal.pcbi.1007419] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 10/08/2018] [Revised: 11/14/2019] [Accepted: 09/24/2019] [Indexed: 11/18/2022] Open
Abstract
Automated protein annotation using the Gene Ontology (GO) plays an important role in the biosciences. Evaluation has always been considered central to developing novel annotation methods, but little attention has been paid to the evaluation metrics themselves. Evaluation metrics define how well an annotation method performs and allow methods to be ranked against one another. Unfortunately, most of these metrics were adopted from the machine learning literature without establishing whether they were appropriate for GO annotations. We propose a novel approach for comparing GO evaluation metrics called Artificial Dilution Series (ADS). Our approach uses existing annotation data to generate a series of annotation sets with different levels of correctness (referred to as their signal level). We calculate the evaluation metric being tested for each annotation set in the series, allowing us to identify whether it can separate different signal levels. Finally, we contrast these results with several false positive annotation sets, which are designed to expose systematic weaknesses in GO assessment. We compared 37 evaluation metrics for GO annotation using ADS and identified drastic differences between metrics. We show that some metrics struggle to differentiate between different signal levels, while others give erroneously high scores to the false positive data sets. Based on our findings, we provide guidelines on which evaluation metrics perform well with the Gene Ontology and propose improvements to several well-known evaluation metrics. In general, we argue that evaluation metrics should be tested for their performance and we provide software for this purpose (https://bitbucket.org/plyusnin/ads/). ADS is applicable to other areas of science where the evaluation of prediction results is non-trivial.
Affiliation(s)
- Ilya Plyusnin
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Helsinki, Finland
- Liisa Holm
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Helsinki, Finland
- Research Programme in Organismal and Evolutionary Biology, Faculty of Biosciences, University of Helsinki, Helsinki, Finland
- Petri Törönen
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Helsinki, Finland
|
16
|
Abstract
Ontology fusion in knowledge bases has become more difficult because of the massive capacity required for semantic similarity calculation. Many similarity calculation methods have been developed, but they have hardly been unified. This article contributes a low-cost similarity calculation method for ontology fusion, inspired by binary metrics, with the aim of reducing the size of similarity calculations both spatially and logically. By introducing definitions of a heterogeneous ontology, the entities of ontologies and the rules of ontology fusion on the basis of concept fusion and relationship fusion, we put forward a main traversal algorithm, which we calculated to have the lowest cost in time and space in comparison with traditional methods. We conducted three experiments to test the usability of our approach from the perspective of actual library resources, small datasets and large datasets. In Experiment 1, bibliographic data from the East China Normal University Library were used to show the feasibility and capability of our proposal and to present the process of the algorithm. In both Experiments 2 and 3, our approach had at least 88% confidence in detecting accurate merging mappings and also decreased time cost. The test demonstrated a good fusion result. The lower recall identified in the error analysis results from the conflict between the complex structures in ontologies and the recursive functions, which will be addressed in future work.
Affiliation(s)
- Wen Lou
- Department of Information Management, Faculty of Economics and Management, East China Normal University, China
- Institute for Academic Evaluation and Development, East China Normal University, China
- Ruofan Pi
- Department of Accounting, Faculty of Economics and Management, East China Normal University, China
- Hui Wang
- Library, Zhejiang College of Construction, China
|
17
|
Zhang L, Yu G, Xia D, Wang J. Protein–protein interactions prediction based on ensemble deep neural networks. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.02.097] [Citation(s) in RCA: 74] [Impact Index Per Article: 12.3] [Indexed: 01/01/2023]
|
18
|
HashGO: hashing gene ontology for protein function prediction. Comput Biol Chem 2017; 71:264-273. [DOI: 10.1016/j.compbiolchem.2017.09.010] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 09/07/2017] [Accepted: 09/25/2017] [Indexed: 10/18/2022]
|
19
|
Rifaioglu AS, Doğan T, Saraç ÖS, Ersahin T, Saidi R, Atalay MV, Martin MJ, Cetin-Atalay R. Large-scale automated function prediction of protein sequences and an experimental case study validation on PTEN transcript variants. Proteins 2017; 86:135-151. [PMID: 29098713 DOI: 10.1002/prot.25416] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 09/09/2017] [Revised: 10/24/2017] [Accepted: 11/01/2017] [Indexed: 12/24/2022]
Abstract
Recent advances in computing power and machine learning empower functional annotation of protein sequences and their transcript variations. Here, we present UniGOPred, an automated prediction system for GO annotations, and a database of GO term predictions for proteomes of several organisms in the UniProt Knowledgebase (UniProtKB). UniGOPred provides function predictions for 514 molecular function (MF), 2909 biological process (BP), and 438 cellular component (CC) GO terms for each protein sequence. UniGOPred covers nearly the whole functionality spectrum of the Gene Ontology system and can predict both generic and specific GO terms. UniGOPred was run on CAFA2 challenge target protein sequences and ranked among the top 10 best-performing methods for the molecular function category. In addition, the performance of UniGOPred is higher than that of the baseline BLAST classifier in all categories of GO. UniGOPred predictions are compared with UniProtKB/TrEMBL database annotations as well. Furthermore, the proposed tool's ability to predict negatively associated GO terms, which define the functions that a protein does not possess, is discussed. UniGOPred annotations were also validated by case studies, experimentally on PTEN protein variants and with the literature on CHD8 protein variants. The UniGOPred protein functional annotation system is available as an open access tool at http://cansyl.metu.edu.tr/UniGOPred.html.
Affiliation(s)
- Ahmet Sureyya Rifaioglu
- Department of Computer Engineering, Middle East Technical University, Ankara, 06800, Turkey; Department of Computer Engineering, İskenderun Technical University, Hatay, 31200, Turkey
- Tunca Doğan
- Protein Function Development Team, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom; CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, 06800, Turkey
- Ömer Sinan Saraç
- Department of Computer Engineering, Istanbul Technical University, İstanbul, 34467, Turkey
- Tulin Ersahin
- CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, 06800, Turkey
- Rabie Saidi
- Protein Function Development Team, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
- Mehmet Volkan Atalay
- Department of Computer Engineering, Middle East Technical University, Ankara, 06800, Turkey
- Maria Jesus Martin
- Protein Function Development Team, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
- Rengul Cetin-Atalay
- CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, 06800, Turkey
|
20
|
Notaro M, Schubach M, Robinson PN, Valentini G. Prediction of Human Phenotype Ontology terms by means of hierarchical ensemble methods. BMC Bioinformatics 2017; 18:449. [PMID: 29025394 PMCID: PMC5639780 DOI: 10.1186/s12859-017-1854-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 03/31/2017] [Accepted: 10/02/2017] [Indexed: 03/12/2023] Open
Abstract
BACKGROUND The prediction of human gene-abnormal phenotype associations is a fundamental step toward the discovery of novel genes associated with human disorders, especially when no genes are known to be associated with a specific disease. In this context the Human Phenotype Ontology (HPO) provides a standard categorization of the abnormalities associated with human diseases. While the problem of predicting gene-disease associations has been widely investigated, the related problem of gene-phenotypic feature (i.e., HPO term) associations has been largely overlooked, even though no HPO term associations are known for most human genes and despite the increasing application of the HPO to relevant medical problems. Moreover, most of the methods proposed in the literature are not able to capture the hierarchical relationships between HPO terms, thus resulting in inconsistent and relatively inaccurate predictions. RESULTS We present two hierarchical ensemble methods that we formally prove to provide biologically consistent predictions according to the hierarchical structure of the HPO. The modular structure of the proposed methods, which consists of a "flat" learning first step and a hierarchical combination of the predictions in the second step, allows the predictions of virtually any flat learning method to be enhanced. The experimental results show that hierarchical ensemble methods are able to predict novel associations between genes and abnormal phenotypes with results that are competitive with state-of-the-art algorithms and with a significant reduction of the computational complexity. CONCLUSIONS Hierarchical ensembles are efficient computational methods that guarantee biologically meaningful predictions that obey the true path rule, and can be used as a tool to improve and make consistent the HPO term predictions starting from virtually any flat learning method. The implementation of the proposed methods is available as an R package from the CRAN repository.
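The two-step structure described here (flat scores first, hierarchical combination second) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R package: it uses a simple top-down rule in which a term's corrected score is capped by the corrected scores of its parents, so that predictions respect the true path rule. The names `topological_order` and `top_down_correct` and the dictionary-based DAG encoding are hypothetical.

```python
def topological_order(parents):
    """Order terms so that every parent precedes its children.

    `parents` maps each term to a list of its parent terms; the graph is
    assumed to be acyclic (a DAG), as ontologies such as the HPO are.
    """
    order, seen = [], set()

    def visit(term):
        if term in seen:
            return
        seen.add(term)
        for parent in parents.get(term, ()):
            visit(parent)
        order.append(term)

    for term in parents:
        visit(term)
    return order


def top_down_correct(flat_scores, parents):
    """Cap each term's flat score by its parents' corrected scores, so that
    no term ever scores higher than any of its ancestors."""
    corrected = {}
    for term in topological_order(parents):
        score = flat_scores[term]
        for parent in parents.get(term, ()):
            score = min(score, corrected[parent])
        corrected[term] = score
    return corrected
```

Because the correction only reads the flat scores, any flat classifier can supply them, which mirrors the modularity the abstract emphasizes.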
Affiliation(s)
- Marco Notaro
- Anacleto Lab - Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, Milan, 20135 Italy
- Max Schubach
- Institute for Medical and Human Genetics, Charité - Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin, 13353 Germany
- Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Str. 2, Berlin, 10178 Germany
- Peter N. Robinson
- Institute for Medical and Human Genetics, Charité - Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin, 13353 Germany
- Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, Berlin, 14195 Germany
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Dr, Farmington, 06032 CT USA
- Institute for Systems Genomics, University of Connecticut, 10 Discovery Dr, Farmington, 06032 CT USA
- Giorgio Valentini
- Anacleto Lab - Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, Milan, 20135 Italy
|
21
|
Yu G, Lu C, Wang J. NoGOA: predicting noisy GO annotations using evidences and sparse representation. BMC Bioinformatics 2017; 18:350. [PMID: 28732468 PMCID: PMC5521088 DOI: 10.1186/s12859-017-1764-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 05/02/2017] [Accepted: 07/14/2017] [Indexed: 01/11/2023] Open
Abstract
BACKGROUND Gene Ontology (GO) is a community effort to represent functional features of gene products. GO annotations (GOA) provide functional associations between GO terms and gene products. Owing to resource limitations, only a small portion of annotations are manually checked by curators, and the others are electronically inferred. Although quality control techniques have been applied to ensure the quality of annotations, the community consistently reports that there are still considerable noisy (or incorrect) annotations. Given the wide application of annotations, identifying noisy annotations is an important yet seldom studied open problem. RESULTS We introduce a novel approach called NoGOA to predict noisy annotations. NoGOA first applies sparse representation on the gene-term association matrix to reduce the impact of noisy annotations, and takes advantage of sparse representation coefficients to measure the semantic similarity between genes. Second, it preliminarily predicts noisy annotations of a gene based on aggregated votes from semantic neighborhood genes of that gene. Next, NoGOA estimates the ratio of noisy annotations for each evidence code based on direct annotations in GOA files archived at different periods, and then weights entries of the association matrix via the estimated ratios and propagates weights to ancestors of direct annotations using the GO hierarchy. Finally, it integrates the evidence-weighted association matrix and aggregated votes to predict noisy annotations. Experiments on archived GOA files of six model species (H. sapiens, A. thaliana, S. cerevisiae, G. gallus, B. taurus and M. musculus) demonstrate that NoGOA achieves significantly better results than other related methods and that removing noisy annotations improves the performance of gene function prediction. CONCLUSIONS The comparative study justifies the effectiveness of integrating evidence codes with sparse representation for predicting noisy GO annotations. Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=NoGOA.
Affiliation(s)
- Guoxian Yu
- College of Computer and Information Sciences, Southwest University, Chongqing, China
- Chang Lu
- College of Computer and Information Sciences, Southwest University, Chongqing, China
- Jun Wang
- College of Computer and Information Sciences, Southwest University, Chongqing, China
|
22
|
Tian Z, Wang C, Guo M, Liu X, Teng Z. SGFSC: speeding the gene functional similarity calculation based on hash tables. BMC Bioinformatics 2016; 17:445. [PMID: 27814675 PMCID: PMC5096311 DOI: 10.1186/s12859-016-1294-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/04/2016] [Accepted: 10/19/2016] [Indexed: 12/23/2022] Open
Abstract
Background In recent years, many measures of gene functional similarity have been proposed and widely used in all kinds of essential research. These methods are mainly divided into two categories: pairwise approaches and group-wise approaches. However, a common problem with these methods is their time consumption, especially when measuring the gene functional similarities of a large number of gene pairs. The problem of computational efficiency for pairwise approaches is even more prominent because they depend on combining semantic similarities. Therefore, the efficient measurement of gene functional similarity remains a challenging problem. Results To speed up current gene functional similarity calculation methods, a novel two-step computing strategy is proposed: (1) establish a hash table for each method to store essential information obtained from the Gene Ontology (GO) graph and (2) measure gene functional similarity based on the corresponding hash table. With the help of the hash table, there is no need to traverse the GO graph repeatedly for each method. The analysis of time complexity shows that the computational efficiency of these methods is significantly improved. We also implement a novel Speeding Gene Functional Similarity Calculation tool, namely SGFSC, which is bundled with seven typical measures using our proposed strategy. Further experiments show the great advantage of SGFSC in measuring gene functional similarity on the whole genomic scale. Conclusions The proposed strategy is successful in speeding up current gene functional similarity calculation methods. SGFSC is an efficient tool that is freely available at http://nclab.hit.edu.cn/SGFSC. The source code of SGFSC can be downloaded from http://pan.baidu.com/s/1dFFmvpZ.
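The two-step strategy in this abstract (precompute a hash table from the GO graph, then answer similarity queries by lookup) can be sketched in Python. This is an illustrative assumption rather than SGFSC itself: the table stores ancestor sets, term similarity is a plain Jaccard index over those sets, and gene similarity uses a best-match average. None of these is claimed to be one of the seven measures bundled with the tool, and all function names are hypothetical.

```python
def build_ancestor_table(parents):
    """Step 1: traverse the graph once and store, for every term, the set of
    its ancestors (including the term itself) in a hash table (a dict)."""
    table = {}

    def ancestors(term):
        if term not in table:
            result = {term}
            for parent in parents.get(term, ()):
                result |= ancestors(parent)
            table[term] = frozenset(result)
        return table[term]

    for term in parents:
        ancestors(term)
    return table


def term_similarity(table, t1, t2):
    """Step 2: compare two terms by table lookup only (Jaccard index of
    their ancestor sets), with no further graph traversal."""
    a1, a2 = table[t1], table[t2]
    return len(a1 & a2) / len(a1 | a2)


def gene_similarity(table, terms1, terms2):
    """Best-match average over the term lists annotating two genes."""
    best1 = [max(term_similarity(table, a, b) for b in terms2) for a in terms1]
    best2 = [max(term_similarity(table, a, b) for a in terms1) for b in terms2]
    return (sum(best1) + sum(best2)) / (len(best1) + len(best2))
```

The table is built once per ontology release; every later gene-pair comparison reuses it, which is where the saving over repeated graph traversal would come from.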
Affiliation(s)
- Zhen Tian
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Maozu Guo
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Xiaoyan Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Zhixia Teng
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China; Department of Information Management and Information System, Northeast Forestry University, Harbin, 150001, People's Republic of China
|
23
|
Lu C, Wang J, Zhang Z, Yang P, Yu G. NoisyGOA: Noisy GO annotations prediction using taxonomic and semantic similarity. Comput Biol Chem 2016; 65:203-211. [PMID: 27670689 DOI: 10.1016/j.compbiolchem.2016.09.005] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/02/2016] [Accepted: 09/07/2016] [Indexed: 10/21/2022]
Abstract
Gene Ontology (GO) provides GO annotations (GOA) that associate gene products with GO terms that summarize their cellular, molecular and functional aspects in the context of biological pathways. The GO Consortium (GOC) relies on various quality assurance measures to ensure the correctness of annotations. Owing to resource limitations, only a small portion of annotations are manually added or checked by GO curators, and a large portion of available annotations are computationally inferred. While computationally inferred annotations provide greater coverage of known genes, they may also introduce annotation errors (noise) that could mislead the interpretation of gene functions and their roles in cellular and biological processes. In this paper, we investigate how to identify noisy annotations, a rarely addressed problem, and propose a novel approach called NoisyGOA. NoisyGOA first measures taxonomic similarity between ontological terms using the GO hierarchy and semantic similarity between genes. Next, it leverages the taxonomic similarity and semantic similarity to predict noisy annotations. We compare NoisyGOA with alternative methods on identifying noisy annotations under different simulated cases of noisy annotations, and on archived GO annotations. NoisyGOA achieved higher accuracy than the alternative methods compared. These results demonstrate that both taxonomic similarity and semantic similarity contribute to the identification of noisy annotations. Our study shows that annotation errors are predictable and that removing noisy annotations improves the performance of gene function prediction. This study can prompt the community to study methods for removing inaccurate annotations, a critical step for annotating gene and pathway functions.
Affiliation(s)
- Chang Lu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
- Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
- Zili Zhang
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
- Pengyi Yang
- School of Mathematics and Statistics, The University of Sydney, New South Wales, Australia
- Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
|