1
Moghim M, Javanmard A, Lolaei F, Taghavi M, Bakhshalizadeh S. Nuclear Multi-Microsatellite Marker Profiling Provides Clues to Molecular Genetic Diversity in Culture-Based Caspian Beluga Sturgeon (Huso huso) Brood Stocks: Ecological Mirror for Restoration. Vet Med Sci 2025; 11:e70255. [PMID: 40198651 PMCID: PMC11977658 DOI: 10.1002/vms3.70255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2024] [Revised: 01/28/2025] [Accepted: 02/04/2025] [Indexed: 04/10/2025] [Open Access]
Abstract
The decreasing natural genetic diversity within intensive beluga sturgeon aquaculture presents a complex challenge for culture-based sturgeon stocks. The present report aims to assess the situation using capillary electrophoresis (CE) and multi-microsatellite nuclear markers. We conducted a study involving 436 individuals of both sexes, collected from eight private breeder farms with diverse breeding histories and generational backgrounds. Using eight microsatellites, we amplified a non-coding core genomic region in the species, followed by CE to confirm both the observed genotypes and the actual allele sizes in the SSR profiles. Using the descriptive statistics in the POPGENE software, we calculated allele frequencies, expected and observed heterozygosity within populations, the number of observed and effective alleles (na and ne) and Shannon's Information Index. Furthermore, we performed analysis of molecular variance (AMOVA), model-based clustering, principal coordinate analysis (PCoA) and STRUCTURE analysis to genetically characterize the populations. LS19 (na = 16) and Afu54 (na = 3) exhibited the highest and lowest levels of polymorphism, respectively, within the studied families. Moreover, the Iranian Fisheries Research Organization (IFRO) farm population was found to have the highest genetic diversity (Ave_Het = 0.67), whereas the Rajaeei Sturgeon Private Farm (RJI) displayed the lowest diversity score (Ave_Het = 0.55) among the examined populations. INS and Jahanpouri Sturgeon Private Farm (JPR) showed the highest similarity (0.91), whereas the Saeei Sturgeon Private Farm showed the lowest similarity (0.74) among the populations studied. Furthermore, the STRUCTURE analysis highlighted notable levels of allelic sharing and admixture among the eight studied populations, indirectly indicating the presence of genetic diversity within each population and a relatively low genetic distance between the populations. The results demonstrate a significant level of genetic variability, providing evidence that supports a low level of inbreeding in brood management.
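The per-locus statistics named in this abstract (na, ne, expected and observed heterozygosity, Shannon's Information Index) can be illustrated with a minimal Python sketch. It assumes diploid genotypes and uses the plain textbook estimators; the function name `locus_stats` and the allele sizes are hypothetical, and POPGENE itself may apply sample-size corrections that are not reproduced here.

```python
import math
from collections import Counter


def locus_stats(genotypes):
    """Summarize one locus from diploid genotypes given as (allele, allele) pairs.

    Plain textbook estimators (assumption): no small-sample correction.
    """
    alleles = [allele for pair in genotypes for allele in pair]
    total = len(alleles)
    freqs = [count / total for count in Counter(alleles).values()]
    sum_sq = sum(p * p for p in freqs)  # expected homozygosity
    return {
        "na": len(freqs),                      # observed number of alleles
        "ne": 1.0 / sum_sq,                    # effective number of alleles
        "He": 1.0 - sum_sq,                    # expected heterozygosity
        "Ho": sum(a != b for a, b in genotypes) / len(genotypes),
        "I": -sum(p * math.log(p) for p in freqs),  # Shannon's index
    }


# Hypothetical allele sizes (base pairs) for four individuals at one locus.
example = [(100, 102), (100, 100), (102, 104), (100, 102)]
stats = locus_stats(example)
print(stats)
```

Averaging `He` over loci gives a per-population diversity score comparable in spirit to the Ave_Het values reported above, although the exact averaging used in the study is not stated in the abstract.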
Affiliation(s)
- Mehdi Moghim
- Department of Genetics, Caspian Sea Ecology Research Centre, Sari, Iran
- Arash Javanmard
- Department of Animal Science, Faculty of Agriculture, University of Tabriz, Tabriz, Iran
- Faramarz Lolaei
- Department of Stock Assessment, Caspian Sea Ecology Research Centre, Sari, Iran
- Shima Bakhshalizadeh
- Department of Marine Science, Caspian Sea Basin Research Center, University of Guilan, Rasht, Iran
2
Jia X, Luo W, Li J, Xing J, Sun H, Wu S, Su X. A deep learning framework for predicting disease-gene associations with functional modules and graph augmentation. BMC Bioinformatics 2024; 25:214. [PMID: 38877401 PMCID: PMC11549817 DOI: 10.1186/s12859-024-05841-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2024] [Accepted: 06/12/2024] [Indexed: 06/16/2024] [Open Access]
Abstract
BACKGROUND The exploration of gene-disease associations is crucial for understanding the mechanisms underlying disease onset and progression, with significant implications for prevention and treatment strategies. Advances in high-throughput biotechnology have generated a wealth of data linking diseases to specific genes. While graph representation learning has recently introduced groundbreaking approaches for predicting novel associations, existing studies have overlooked the cumulative impact of functional modules such as protein complexes and the incompleteness of some important data such as protein interactions, which limits detection performance. RESULTS To address these limitations, we introduce a deep learning framework called ModulePred for predicting disease-gene associations. ModulePred performs graph augmentation on the protein interaction network using L3 link prediction algorithms. It builds a heterogeneous module network by integrating disease-gene associations, protein complexes and augmented protein interactions, and develops a novel graph embedding for the heterogeneous module network. Subsequently, a graph neural network is constructed to learn node representations by collectively aggregating information from the topological structure, and gene prioritization is carried out using the disease and gene embeddings obtained from the graph neural network. Experimental results underscore the superiority of ModulePred, showcasing the effectiveness of incorporating functional modules and graph augmentation in predicting disease-gene associations. This research introduces innovative ideas and directions, enhancing the understanding and prediction of gene-disease relationships.
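The graph augmentation step can be pictured with a small sketch of degree-normalized L3 link prediction, which scores a non-adjacent pair (x, y) by summing 1/sqrt(k_u * k_v) over all length-3 paths x-u-v-y. This is only an illustration under assumptions: the abstract says "L3 link prediction algorithms" without giving the exact variant, and the function name `l3_scores` and the toy edge list are hypothetical.

```python
from itertools import combinations
from math import sqrt


def l3_scores(edges):
    """Score non-adjacent node pairs by degree-normalized paths of length 3."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops (assumption of this sketch)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    scores = {}
    for x, y in combinations(sorted(adj), 2):
        if y in adj[x]:
            continue  # only score pairs that are not already linked
        total = 0.0
        for u in adj[x]:
            for v in adj[y]:
                if v in adj[u]:  # x-u, u-v and v-y are all edges
                    total += 1.0 / sqrt(len(adj[u]) * len(adj[v]))
        if total > 0:
            scores[(x, y)] = total
    return scores


# Toy interaction network: a simple path a-b-c-d.
toy_scores = l3_scores([("a", "b"), ("b", "c"), ("c", "d")])
print(toy_scores)
```

In an augmentation setting, the highest-scoring pairs would be added to the network as predicted interactions; how many to add is a tuning choice the abstract does not specify.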
Affiliation(s)
- Xianghu Jia
- College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
- Weiwen Luo
- College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
- Jiaqi Li
- College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
- Jieqi Xing
- College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
- Hongjie Sun
- College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
- Shunyao Wu
- College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
- Xiaoquan Su
- College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
3
Zeng Z, Xie J, Yang Z, Ma T, Chen D. TO-UGDA: target-oriented unsupervised graph domain adaptation. Sci Rep 2024; 14:9165. [PMID: 38644394 PMCID: PMC11576983 DOI: 10.1038/s41598-024-59890-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Accepted: 04/16/2024] [Indexed: 04/23/2024] [Open Access]
Abstract
Graph domain adaptation (GDA) aims to address the challenge of limited labeled data in the target graph domain. Existing methods such as UDAGCN, GRADE, DEAL, and COCO for adaptation tasks at different levels (node-level, graph-level) vary in how they extract domain features, and most of them rely solely on representation alignment to transfer label information from a labeled source domain to an unlabeled target domain. However, this approach can be influenced by irrelevant information and usually ignores the conditional shift of the downstream predictor. To address this issue, we introduce a target-oriented unsupervised graph domain adaptation framework called TO-UGDA. In particular, domain-invariant feature representations are extracted using the graph information bottleneck. The discrepancy between the two domains is minimized using an adversarial alignment strategy to obtain a unified feature distribution. Additionally, a meta pseudo-label is introduced to enhance downstream adaptation and improve the model's generalizability. Extensive experiments on real-world graph datasets show that the proposed framework achieves excellent performance across various node-level and graph-level adaptation tasks.
Affiliation(s)
- Zhuo Zeng
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Chengdu Union Big Data Tech. Inc., Chengdu, 610041, China
- Jianyu Xie
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Chengdu Union Big Data Tech. Inc., Chengdu, 610041, China
- Zhijie Yang
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Chengdu Union Big Data Tech. Inc., Chengdu, 610041, China
- Tengfei Ma
- College of Information Science and Engineering, Hunan University, Changsha, 410082, China
- Duanbing Chen
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Chengdu Union Big Data Tech. Inc., Chengdu, 610041, China
- Suining Institute of Digital Economy, Suining, 629018, China
4
Lin S, Liu C, Zhou P, Hu ZY, Wang S, Zhao R, Zheng Y, Lin L, Xing E, Liang X. Prototypical Graph Contrastive Learning. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:2747-2758. [PMID: 35895656 DOI: 10.1109/tnnls.2022.3191086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Graph-level representations are critical in various real-world applications, such as predicting the properties of molecules. However, in practice, precise graph annotations are generally very expensive and time-consuming. To address this issue, graph contrastive learning constructs an instance discrimination task, which pulls together positive pairs (augmentation pairs of the same graph) and pushes away negative pairs (augmentation pairs of different graphs) for unsupervised representation learning. However, since the negatives for a query are uniformly sampled from all graphs, existing methods suffer from a critical sampling bias issue, i.e., the negatives are likely to have the same semantic structure as the query, leading to performance degradation. To mitigate this sampling bias issue, in this article, we propose a prototypical graph contrastive learning (PGCL) approach. Specifically, PGCL models the underlying semantic structure of the graph data by clustering semantically similar graphs into the same group and simultaneously encourages clustering consistency for different augmentations of the same graph. Then, given a query, it performs negative sampling by drawing graphs from clusters that differ from the query's cluster, which ensures the semantic difference between the query and its negative samples. Moreover, for a query, PGCL further reweights its negative samples based on the distance between their prototypes (cluster centroids) and the query prototype, such that negatives with a moderate prototype distance receive relatively large weights. This reweighting strategy is proven to be more effective than uniform sampling. Experimental results on various graph benchmarks attest to the advantages of PGCL over state-of-the-art methods. The code is publicly available at https://github.com/ha-lins/PGCL.
5
Solano LE, D’Sa NM, Nikolaidis N. PRRGO: A Tool for Visualizing and Mapping Globally Expressed Genes in Public Gene Expression Omnibus RNA-Sequencing Studies to PageRank-scored Gene Ontology Terms. bioRxiv 2024:2024.01.21.576540. [PMID: 38328158 PMCID: PMC10849496 DOI: 10.1101/2024.01.21.576540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2024]
Abstract
We herein report PageRankeR Gene Ontology (PRRGO), a downloadable web application that can integrate differentially expressed gene (DEG) data from the gene expression omnibus (GEO) GEO2R web tool with the gene ontology (GO) database [1]. Unlike existing tools, PRRGO computes the PageRank for the entire GO network and can generate both interactive GO networks on the web interface and comma-separated values (CSV) files containing the DEG statistics categorized by GO term. These hierarchical and tabular GO-DEG data are especially conducive to hypothesis generation and overlap studies with the use of PageRank data, which can provide a metric of GO term centrality. We verified the tool for accuracy and reliability across nine independent heat shock (HS) studies for which the RNA-seq data was publicly available on GEO and found that the tool produced increasing concordance between study DEGs, GO terms, and select HS-specific GO terms.
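As a rough illustration of PageRank as a centrality metric on a term hierarchy, the following Python sketch runs power iteration on a tiny directed graph. Everything specific here is an assumption rather than PRRGO's implementation: the damping factor of 0.85, the child-to-parent edge direction, the uniform redistribution of rank from terms without outgoing edges, and the placeholder term names.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank; graph maps each node to a list of successors."""
    nodes = sorted(set(graph) | {v for targets in graph.values() for v in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without successors is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        new = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for u in nodes:
            targets = graph.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
        delta = sum(abs(new[node] - rank[node]) for node in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Placeholder hierarchy: each term points to its parent term.
toy_terms = {"term_b": ["root"], "term_c": ["root"], "term_d": ["term_b"]}
ranks = pagerank(toy_terms)
print(sorted(ranks.items(), key=lambda item: -item[1]))
```

With edges pointing toward parents, broad terms near the root accumulate rank; reversing the edge direction would instead favor specific leaf terms, so the direction is a design choice worth stating whenever PageRank values are compared across tools.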
Affiliation(s)
- Luis E. Solano
- Department of Biological Science, Center for Applied Biotechnology Studies, and Center for Computational and Applied Mathematics, College of Natural Sciences and Mathematics, California State University Fullerton, Fullerton, CA 92834-6850
- Center for Complex Biological Systems, University of California, Irvine, Irvine, CA
- Nicholas M. D’Sa
- Department of Biological Science, Center for Applied Biotechnology Studies, and Center for Computational and Applied Mathematics, College of Natural Sciences and Mathematics, California State University Fullerton, Fullerton, CA 92834-6850
- University of California, Irvine, Irvine, CA
- Nikolas Nikolaidis
- Department of Biological Science, Center for Applied Biotechnology Studies, and Center for Computational and Applied Mathematics, College of Natural Sciences and Mathematics, California State University Fullerton, Fullerton, CA 92834-6850
6
Yan TC, Yue ZX, Xu HQ, Liu YH, Hong YF, Chen GX, Tao L, Xie T. A systematic review of state-of-the-art strategies for machine learning-based protein function prediction. Comput Biol Med 2023; 154:106446. [PMID: 36680931 DOI: 10.1016/j.compbiomed.2022.106446] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/22/2022] [Revised: 12/07/2022] [Accepted: 12/19/2022] [Indexed: 12/24/2022]
Abstract
New drug discovery is inseparable from the discovery of drug targets, and the vast majority of the known targets are proteins. At the same time, proteins are essential structural and functional elements of living cells necessary for the maintenance of all forms of life. Therefore, protein functions have become the focus of many pharmacological and biological studies. Traditional experimental techniques are no longer adequate for rapidly growing annotation of protein sequences, and approaches to protein function prediction using computational methods have emerged and flourished. A significant trend has been to use machine learning to achieve this goal. In this review, approaches to protein function prediction based on the sequence, structure, protein-protein interaction (PPI) networks, and fusion of multi-information sources are discussed. The current status of research on protein function prediction using machine learning is considered, and existing challenges and prominent breakthroughs are discussed to provide ideas and methods for future studies.
Affiliation(s)
- Tian-Ci Yan
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Zi-Xuan Yue
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Hong-Quan Xu
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Yu-Hong Liu
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Yan-Feng Hong
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Gong-Xing Chen
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Lin Tao
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Tian Xie
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
7
Ju W, Gu Y, Luo X, Wang Y, Yuan H, Zhong H, Zhang M. Unsupervised graph-level representation learning with hierarchical contrasts. Neural Netw 2023; 158:359-368. [PMID: 36516542 DOI: 10.1016/j.neunet.2022.11.019] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 09/28/2022] [Revised: 11/08/2022] [Accepted: 11/13/2022] [Indexed: 11/27/2022]
Abstract
Unsupervised graph-level representation learning has recently shown great potential in a variety of domains, ranging from bioinformatics to social networks. Many graph contrastive learning methods have been proposed to generate discriminative graph-level representations. They typically design multiple types of graph augmentations and enforce a graph to have consistent representations under different views. However, these techniques mostly neglect the intrinsic hierarchical structure of the graph, resulting in a limited exploration of semantic information for graph representation. Moreover, they often rely on a large number of negative samples to prevent collapsing into trivial solutions, and this heavy need for negative samples may lead to memory issues during optimization in graph domains. To address these two issues, this paper develops an unsupervised graph-level representation learning framework named Hierarchical Graph Contrastive Learning (HGCL), which investigates the hierarchical structural semantics of a graph at both node and graph levels. Specifically, HGCL consists of three parts, i.e., node-level contrastive learning, graph-level contrastive learning, and mutual contrastive learning, to capture graph semantics hierarchically. Furthermore, a Siamese network and momentum update are used to relieve the demand for excessive negative samples. Finally, experimental results on both benchmark datasets for graph classification and large-scale OGB datasets for transfer learning demonstrate that the proposed HGCL significantly outperforms a broad range of state-of-the-art baselines.
Affiliation(s)
- Wei Ju
- School of Computer Science, Peking University, Beijing, 100871, China
- Yiyang Gu
- School of Computer Science, Peking University, Beijing, 100871, China
- Xiao Luo
- Department of Computer Science, University of California, Los Angeles, 90095, USA
- Yifan Wang
- School of Computer Science, Peking University, Beijing, 100871, China
- Haochen Yuan
- School of Computer Science, Peking University, Beijing, 100871, China
- Ming Zhang
- School of Computer Science, Peking University, Beijing, 100871, China
8
Peng M, Juan X, Li Z. Graph prototypical contrastive learning. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.09.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
9
Luo X, Ju W, Qu M, Gu Y, Chen C, Deng M, Hua XS, Zhang M. CLEAR: Cluster-Enhanced Contrast for Self-Supervised Graph Representation Learning. IEEE Transactions on Neural Networks and Learning Systems 2022; PP:899-912. [PMID: 35675236 DOI: 10.1109/tnnls.2022.3177775] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/15/2023]
Abstract
This article studies self-supervised graph representation learning, which is critical to various tasks, such as protein property prediction. Existing methods typically aggregate representations of each individual node as graph representations, but fail to comprehensively explore local substructures (i.e., motifs and subgraphs), which also play important roles in many graph mining tasks. In this article, we propose a self-supervised graph representation learning framework named Cluster-Enhanced Contrast (CLEAR) that models the structural semantics of a graph at graph-level and substructure-level granularities, i.e., global semantics and local semantics, respectively. Specifically, we use graph-level augmentation strategies followed by a graph neural network-based encoder to explore global semantics. As for local semantics, we first use graph clustering techniques to partition each whole graph into several subgraphs while preserving as much semantic information as possible. We further employ a self-attention interaction module to aggregate the semantics of all subgraphs into a local-view graph representation. Moreover, we integrate both global semantics and local semantics into a multiview graph contrastive learning framework, enhancing the semantic-discriminative ability of graph representations. Extensive experiments on various real-world benchmarks demonstrate the efficacy of the proposed framework over current graph self-supervised representation learning approaches on both graph classification and transfer learning tasks.
10
Wang M, Shao W, Hao X, Huang S, Zhang D. Identify connectome between genotypes and brain network phenotypes via deep self-reconstruction sparse canonical correlation analysis. Bioinformatics 2022; 38:2323-2332. [PMID: 35143604 DOI: 10.1093/bioinformatics/btac074] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/10/2021] [Revised: 01/21/2022] [Accepted: 02/02/2022] [Indexed: 02/03/2023] [Open Access]
Abstract
MOTIVATION As a rising research topic, brain imaging genetics aims to investigate the potential genetic architecture of both brain structure and function. It should be noted that in the brain, not all variation is necessarily caused by genetic effects, and it is generally unknown which imaging phenotypes are promising for genetic analysis. RESULTS In this work, genetic variants (i.e. single nucleotide polymorphisms, SNPs) are correlated with brain networks (i.e. quantitative traits, QTs), so that the connectome (including the brain regions and connectivity features) of functional brain networks from functional magnetic resonance imaging data is identified. Specifically, a connection matrix is first constructed, whose upper triangle elements are selected as connectivity features. Then, the PageRank algorithm is used to estimate the importance of different brain regions as the brain region features. Finally, a deep self-reconstruction sparse canonical correlation analysis (DS-SCCA) method is developed for the identification of genetic associations with functional connectivity phenotypic markers. This approach is a regularized, deep, scalable multi-SNP-multi-QT method, which is well-suited for applying imaging genetic association analysis to the Alzheimer's Disease Neuroimaging Initiative datasets. It is further optimized by adopting a parametric approach, augmented Lagrange and stochastic gradient descent. Extensive experiments validate that the DS-SCCA approach realizes strong associations and discovers functional connectivity and brain region phenotypic biomarkers to guide disease interpretation. AVAILABILITY AND IMPLEMENTATION The Matlab code is available at https://github.com/meimeiling/DS-SCCA/tree/main. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Meiling Wang
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing 211106, China
- Wei Shao
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing 211106, China
- Xiaoke Hao
- School of Artificial Intelligence, Hebei University of Technology, Tianjin 300401, China
- Shuo Huang
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing 211106, China
- Daoqiang Zhang
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing 211106, China
11
Law JN, Akers K, Tasnina N, Santina CMD, Deutsch S, Kshirsagar M, Klein-Seetharaman J, Crovella M, Rajagopalan P, Kasif S, Murali TM. Interpretable network propagation with application to expanding the repertoire of human proteins that interact with SARS-CoV-2. Gigascience 2021; 10:giab082. [PMID: 34966926 PMCID: PMC8716363 DOI: 10.1093/gigascience/giab082] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/30/2021] [Revised: 09/21/2021] [Accepted: 11/28/2021] [Indexed: 01/02/2023] [Open Access]
Abstract
BACKGROUND Network propagation has been widely used for nearly 20 years to predict gene functions and phenotypes. Despite the popularity of this approach, little attention has been paid to the question of provenance tracing in this context, e.g., determining how much any experimental observation in the input contributes to the score of every prediction. RESULTS We design a network propagation framework with 2 novel components and apply it to predict human proteins that directly or indirectly interact with SARS-CoV-2 proteins. First, we trace the provenance of each prediction to its experimentally validated sources, which in our case are human proteins experimentally determined to interact with viral proteins. Second, we design a technique that helps to reduce the manual adjustment of parameters by users. We find that for every top-ranking prediction, the highest contribution to its score arises from a direct neighbor in a human protein-protein interaction network. We further analyze these results to develop functional insights on SARS-CoV-2 that expand on known biology such as the connection between endoplasmic reticulum stress, HSPA5, and anti-clotting agents. CONCLUSIONS We examine how our provenance-tracing method can be generalized to a broad class of network-based algorithms. We provide a useful resource for the SARS-CoV-2 community that implicates many previously undocumented proteins with putative functional relationships to viral infection. This resource includes potential drugs that can be opportunistically repositioned to target these proteins. We also discuss how our overall framework can be extended to other, newly emerging viruses.
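One way to picture provenance tracing is to note that many propagation schemes are linear in the seed scores, so a node's final score splits exactly into per-seed contributions. The sketch below uses a random walk with restart as a stand-in technique; it is not the authors' algorithm, and the restart weight `alpha`, the function names and the toy network are all assumptions made for illustration.

```python
def propagate(adj, seeds, alpha=0.5, iters=200):
    """Random walk with restart on an undirected network.

    adj maps every node to a list of neighbours; seeds maps seed nodes to weights.
    """
    nodes = sorted(adj)
    score = {node: seeds.get(node, 0.0) for node in nodes}
    for _ in range(iters):
        nxt = {node: (1.0 - alpha) * seeds.get(node, 0.0) for node in nodes}
        for u in nodes:
            if adj[u]:
                share = alpha * score[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
        score = nxt
    return score


def contributions(adj, seeds, alpha=0.5, iters=200):
    """Propagate each seed alone; by linearity these sum to the full score."""
    return {s: propagate(adj, {s: w}, alpha, iters) for s, w in seeds.items()}


# Toy network: s1 and s2 play the role of experimentally validated sources.
toy_adj = {
    "s1": ["p"],
    "s2": ["p", "q"],
    "p": ["s1", "s2", "q"],
    "q": ["p", "s2"],
}
toy_seeds = {"s1": 1.0, "s2": 1.0}
full = propagate(toy_adj, toy_seeds)
parts = contributions(toy_adj, toy_seeds)
print({s: round(parts[s]["q"], 4) for s in toy_seeds}, round(full["q"], 4))
```

Ranking the entries of `parts` for a given predicted node shows which seed contributes most to its score, which is the kind of question the abstract raises about direct neighbors of top-ranking predictions.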
Affiliation(s)
- Jeffrey N Law
- Interdisciplinary Ph.D. Program in Genetics, Bioinformatics, and Computational Biology, Virginia Tech, Blacksburg, VA 24061, USA
- Kyle Akers
- Interdisciplinary Ph.D. Program in Genetics, Bioinformatics, and Computational Biology, Virginia Tech, Blacksburg, VA 24061, USA
- Nure Tasnina
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
- Shay Deutsch
- Department of Mathematics, University of California, Los Angeles, CA 90095, USA
- Mark Crovella
- Department of Computer Science, Boston University, Boston, MA 02215, USA
- Simon Kasif
- Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
- T M Murali
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
12
Ghorbani M, Dehmer M, Lotfi A, Amraei N, Mowshowitz A, Emmert-Streib F. On the relationship between PageRank and automorphisms of a graph. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.08.013] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
13
Zhu Q, Yang J, Xu B, Hou Z, Sun L, Zhang D. Multimodal Brain Network Jointly Construction and Fusion for Diagnosis of Epilepsy. Front Neurosci 2021; 15:734711. [PMID: 34658773 PMCID: PMC8511490 DOI: 10.3389/fnins.2021.734711] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/01/2021] [Accepted: 08/10/2021] [Indexed: 11/24/2022] [Open Access]
Abstract
Brain network analysis has proven to be one of the most effective methods in brain disease diagnosis. In order to construct discriminative brain networks and improve the performance of disease diagnosis, many machine learning–based methods have been proposed. Recent studies show that combining functional and structural brain networks is more effective than using only single-modality data. However, most existing multi-modal brain network analysis methods construct functional and structural networks separately, which makes it difficult to embed the complementary information of the different modalities. To address this issue, we propose a unified brain network construction algorithm, which jointly learns from both functional and structural data and effectively fuses the connectivity and node features to improve classification. First, we conduct space alignment and brain network construction under a unified framework, and then build the correlation model among all brain regions with functional data by low-rank representation so that the global brain region correlation can be captured. Simultaneously, the local manifold with structural data is embedded into this model to preserve the local structural information. Second, the PageRank algorithm is adaptively used to evaluate the significance of different brain regions, in which the interaction of multiple brain regions is considered. Finally, a multi-kernel strategy is utilized to solve the data heterogeneity problem and merge the connectivity as well as node information for classification. We apply the proposed method to the diagnosis of epilepsy, and the experimental results show that our method can achieve promising performance.
Affiliation(s)
- Qi Zhu
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Jing Yang
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Bingliang Xu
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Zhenghua Hou
- Department of Psychosomatics and Psychiatry, Affiliated Zhongda Hospital, School of Medicine, Southeast University, Nanjing, China
- Liang Sun
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Daoqiang Zhang
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
14
Tsoni R, Panagiotakopoulos CT, Verykios VS. Revealing latent traits in the social behavior of distance learning students. Education and Information Technologies 2021; 27:3529-3565. [PMID: 34602848 PMCID: PMC8479270 DOI: 10.1007/s10639-021-10742-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2021] [Accepted: 09/02/2021] [Indexed: 06/13/2023]
Abstract
This paper proposes a multilayered methodology for analyzing distance learning students' data to gain insight into the learning progress of the student subjects, both on an individual basis and as members of a learning community, during the course-taking process. The communication aspect is of high importance in educational research. Additionally, it is difficult to assess, as it involves multiple relationships and different levels of interaction. Social network analysis (SNA) allows the visualization of this complexity and provides quantified measures for evaluation. Thus, initially, SNA techniques were applied to create one-mode, undirected networks and capture important metrics originating from students' interactions in the fora of the courses offered in the context of distance learning programs. Principal component analysis and clustering were used next to reveal latent students' traits and common patterns in their social interactions with other students and their learning behavior. We selected two different courses to test this methodology and to highlight convergent and divergent features between them. Three major factors that explain over 70% of the variance were identified, and four groups of students were found, characterized by common elements in the students' learning profiles. The results highlight the importance of academic performance, social behavior and online participation as the main criteria for clustering, which could help tutors in distance learning to closely monitor the learning process and promptly intervene when needed.
Affiliation(s)
- Rozita Tsoni
- School of Science and Technology, Hellenic Open University, Patras, Greece
|
15
|
Picart-Armada S, Thompson WK, Buil A, Perera-Lluna A. The effect of statistical normalization on network propagation scores. Bioinformatics 2021; 37:845-852. [PMID: 33070187 PMCID: PMC8097756 DOI: 10.1093/bioinformatics/btaa896] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/21/2020] [Revised: 09/18/2020] [Accepted: 10/07/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Network diffusion and label propagation are fundamental tools in computational biology, with applications like gene-disease association, protein function prediction and module discovery. More recently, several publications have introduced a permutation analysis after the propagation process, due to concerns that network topology can bias diffusion scores. This raises the question of the statistical properties and possible bias of such diffusion processes in each of their applications. In this work, we characterized some common null models behind the permutation analysis and the statistical properties of the diffusion scores. We benchmarked seven diffusion scores on three case studies: synthetic signals on a yeast interactome, simulated differential gene expression on a protein-protein interaction network and prospective gene set prediction on another interaction network. For clarity, all the datasets were based on binary labels, but we also present theoretical results for quantitative labels. RESULTS Diffusion scores starting from binary labels were affected by the label codification and exhibited a problem-dependent topological bias that could be removed by statistical normalization. Parametric and non-parametric normalization addressed both points by being codification-independent and by equalizing the bias. We identified and quantified two sources of bias, mean value and variance, that yielded performance differences when normalizing the scores. We provided closed formulae for both and showed how the null covariance is related to the spectral properties of the graph. Although none of the proposed scores systematically outperformed the others, normalization was preferred when the sought positive labels were not aligned with the bias. We conclude that the decision on bias removal should be problem- and data-driven, i.e. based on a quantitative analysis of the bias and its relation to the positive entities.
AVAILABILITY The code is publicly available at https://github.com/b2slab/diffuBench and the data underlying this article are available at https://github.com/b2slab/retroData. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
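The idea of normalizing diffusion scores against a label-permutation null can be sketched as follows. This is a Monte Carlo illustration under stated assumptions, not one of the seven scores benchmarked in the paper and not its closed formulae: the diffusion rule, the names `diffuse` and `normalized_scores`, and the toy path graph are all hypothetical.

```python
import random
import statistics


def diffuse(adj, labels, alpha=0.5, steps=50):
    """Iterate f <- y + alpha * P f, where P averages over each node's neighbours."""
    scores = dict(labels)
    for _ in range(steps):
        scores = {
            v: labels[v] + alpha * sum(scores[u] for u in adj[v]) / max(len(adj[v]), 1)
            for v in adj
        }
    return scores


def normalized_scores(adj, labels, n_perm=500, seed=0):
    """Return raw diffusion scores and z-scores against a label-permutation null."""
    rng = random.Random(seed)
    raw = diffuse(adj, labels)
    nodes = list(adj)
    values = [labels[v] for v in nodes]
    null = {v: [] for v in nodes}
    for _ in range(n_perm):
        rng.shuffle(values)  # reassign the same labels to random nodes
        permuted = dict(zip(nodes, values))
        for v, s in diffuse(adj, permuted).items():
            null[v].append(s)
    z = {}
    for v in nodes:
        sd = statistics.pstdev(null[v])
        z[v] = (raw[v] - statistics.mean(null[v])) / sd if sd > 0 else 0.0
    return raw, z


# Hypothetical path graph a-b-c-d-e with a single positive label on "a".
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
y = {"a": 1, "b": 0, "c": 0, "d": 0, "e": 0}
raw, z = normalized_scores(path, y)
print(raw)
print(z)
```

Each node is compared with its own null mean and variance, so nodes that score highly under any labelling purely because of their position in the graph are no longer favoured.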
Affiliation(s)
- Sergio Picart-Armada
- B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, CIBER-BBN, Barcelona, 08028, Spain
- Esplugues de Llobregat, Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, Barcelona, 08950, Spain
- Wesley K Thompson
- Mental Health Center Sct. Hans, 4000 Roskilde, Denmark
- Department of Family Medicine and Public Health, University of California, San Diego, La Jolla, CA, USA
- Alfonso Buil
- Mental Health Center Sct. Hans, 4000 Roskilde, Denmark
- Alexandre Perera-Lluna
- B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, CIBER-BBN, Barcelona, 08028, Spain
- Esplugues de Llobregat, Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, Barcelona, 08950, Spain
|
16
|
Law JN, Kale SD, Murali TM. Accurate and efficient gene function prediction using a multi-bacterial network. Bioinformatics 2021; 37:800-806. [PMID: 33063084 DOI: 10.1093/bioinformatics/btaa885] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2020] [Revised: 09/23/2020] [Accepted: 09/30/2020] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Nearly 40% of the genes in sequenced genomes have no experimentally or computationally derived functional annotations. To fill this gap, we seek to develop methods for network-based gene function prediction that can integrate heterogeneous data for multiple species with experimentally based functional annotations and systematically transfer them to newly sequenced organisms on a genome-wide scale. However, the large sizes of such networks pose a challenge for the scalability of current methods. RESULTS We develop a label propagation algorithm called FastSinkSource. By formally bounding its rate of progress, we decrease the running time by a factor of 100 without sacrificing accuracy. We systematically evaluate many approaches to construct multi-species bacterial networks and apply FastSinkSource and other state-of-the-art methods to these networks. We find that the most accurate and efficient approach is to pre-compute annotation scores for species with experimental annotations, and then to transfer them to other organisms. In this manner, FastSinkSource runs in under 3 min for 200 bacterial species. AVAILABILITY AND IMPLEMENTATION An implementation of our framework and all data used in this research are available at https://github.com/Murali-group/multi-species-GOA-prediction. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
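Label propagation of the kind described in this abstract can be sketched generically as follows. This is not the FastSinkSource implementation (the authors' code is at the URL above); it is a plain sink-source style iteration in which annotated genes are clamped to 1 (positives) or 0 (negatives) and unannotated genes repeatedly take a damped weighted average of their neighbours. The function name `propagate`, the damping factor `alpha` and the toy network are assumptions made for illustration.

```python
def propagate(adj, positives, negatives, alpha=0.9, n_iter=100):
    """Damped label propagation on a weighted graph {node: {neighbour: weight}}."""
    scores = {v: 0.0 for v in adj}
    for v in positives:
        scores[v] = 1.0
    fixed = set(positives) | set(negatives)  # clamped nodes never change
    for _ in range(n_iter):
        new = dict(scores)
        for v in adj:
            if v in fixed:
                continue
            total = sum(adj[v].values())
            if total > 0:
                new[v] = alpha * sum(w * scores[u] for u, w in adj[v].items()) / total
        scores = new
    return scores


# Hypothetical chain: positive gene, two unannotated genes, negative gene.
network = {
    "pos": {"u1": 1.0},
    "u1": {"pos": 1.0, "u2": 1.0},
    "u2": {"u1": 1.0, "neg": 1.0},
    "neg": {"u2": 1.0},
}
result = propagate(network, positives={"pos"}, negatives={"neg"})
print(result["u1"] > result["u2"])  # the gene nearer the positive scores higher
```

Because `alpha` is below one, each iteration shrinks the remaining error by at least that factor, which is why a fixed, modest number of iterations is enough in this sketch.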
Affiliation(s)
- Jeffrey N Law
- Genetics, Bioinformatics and Computational Biology Ph.D. Program, Blacksburg, VA 24061, USA
- Shiv D Kale
- Fralin Life Sciences Institute, Blacksburg, VA 24061, USA
- T M Murali
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
|
17
|
Janyasupab P, Suratanee A, Plaimas K. Network diffusion with centrality measures to identify disease-related genes. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:2909-2929. [PMID: 33892577 DOI: 10.3934/mbe.2021147] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 06/12/2023]
Abstract
Disease-related gene prioritization is one of the most well-established pharmaceutical techniques used to identify genes that are important to a biological process relevant to a disease. In identifying these essential genes, the network diffusion (ND) approach is a widely used technique applied in gene prioritization. However, there is still a large number of candidate genes that need to be evaluated experimentally. Therefore, it would be of great value to develop a new strategy to improve the precision of the prioritization. Given the efficiency and simplicity of centrality measures in capturing a gene that might be important to the network structure, herein, we propose a technique that extends the scope of ND through a centrality measure to identify new disease-related genes. Five common centrality measures with different aspects were examined for integration in the traditional ND model. A total of 40 diseases were used to test our developed approach and to find new genes that might be related to a disease. Results indicated that the best measure to combine with the diffusion is closeness centrality. The novel candidate genes identified by the model for all 40 diseases were provided along with supporting evidence. In conclusion, the integration of network centrality in ND is a simple but effective technique to discover more precise disease-related genes, which is extremely useful for biomedical science.
Affiliation(s)
- Panisa Janyasupab
- Advanced Virtual and Intelligent Computing (AVIC) Center, Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok, 10330, Thailand
- Apichat Suratanee
- Intelligent and Nonlinear Dynamic Innovations Research Center, Department of Mathematics, Faculty of Applied Science, King Mongkut's University of Technology North Bangkok, Bangkok 10800, Thailand
- Kitiporn Plaimas
- Advanced Virtual and Intelligent Computing (AVIC) Center, Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok, 10330, Thailand
- Omics Science and Bioinformatics Center, Faculty of Science, Chulalongkorn University, Bangkok, 10330, Thailand
|
18
|
Chen Q, Li Y, Tan K, Qiao Y, Pan S, Jiang T, Chen YPP. Network-based methods for gene function prediction. Brief Funct Genomics 2021; 20:249-257. [PMID: 33686431 DOI: 10.1093/bfgp/elab006] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 11/02/2020] [Revised: 01/25/2021] [Accepted: 01/26/2021] [Indexed: 12/23/2022] Open
Abstract
The rapid development of high-throughput technology has generated a large number of biological networks. Network-based methods are able to provide rich information for inferring gene function. This involves analyzing the topological characteristics of genes in related networks, integrating biological information, and considering data from different sources. To promote network biology and related biotechnology research, this article surveys state-of-the-art methods for network-based gene function prediction and discusses the potential challenges.
Affiliation(s)
- Qingfeng Chen
- University of Technology Sydney, China and Hundred-Talent Program
- Yongjie Li
- School of Computer and Electronic Information at Guangxi University
- Kai Tan
- School of Computer and Electronic Information at Guangxi University
- Yvlu Qiao
- School of Computer and Electronic Information at Guangxi University
- Shirui Pan
- Computer science from the University of Technology Sydney
- Taijiao Jiang
- Suzhou Institute of System Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College
- Yi-Ping Phoebe Chen
- Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Australia
|
19
|
González-Gomariz J, Serrano G, Tilve-Álvarez CM, Corrales FJ, Guruceaga E, Segura V. UPEFinder: A Bioinformatic Tool for the Study of Uncharacterized Proteins Based on Gene Expression Correlation and the PageRank Algorithm. J Proteome Res 2020; 19:4795-4807. [PMID: 33155801 DOI: 10.1021/acs.jproteome.0c00364] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 12/27/2022]
Abstract
The Human Proteome Project (HPP) is leading the international effort to characterize the human proteome. Although the main goal of this project was first focused on the detection of missing proteins, a new challenge arose from the need to assign biological functions to the uncharacterized human proteins and describe their implications in human diseases. Not only the proteins with experimental evidence (uPE1 proteins) but also the uncharacterized missing proteins (uMPs) were the objects of study in this challenge, neXt-CP50. In this work, we developed a new bioinformatic approach to infer biological annotations for the uPE1 proteins and uMPs based on a "guilt-by-association" analysis using public RNA-Seq data sets. We used the correlation of these proteins with the well-characterized PE1 proteins to construct a network. In this way, we applied the PageRank algorithm to this network to identify the most relevant nodes, which were the biological annotations of the uncharacterized proteins. All of the generated information was stored in a database. In addition, we implemented the web application UPEFinder (https://upefinder.proteored.org) to facilitate the access to this new resource. This information is especially relevant for the researchers of the HPP who are interested in the generation and validation of new hypotheses about the functions of these proteins. Both the database and the web application are publicly available (https://github.com/ubioinformat/UPEfinder).
Affiliation(s)
- Guillermo Serrano
- Bioinformatics Platform, CIMA University of Navarra, Pamplona E-31008, Spain
- Carlos M Tilve-Álvarez
- Fundación Profesor Nóvoa-Santos, Instituto de Investigación Biomédica da Coruña, Coruña E-15006, Spain
- Fernando J Corrales
- Proteomics Unit, National Center for Biotechnology, CSIC, Madrid E-28049, Spain
- Elizabeth Guruceaga
- IdiSNA, Navarra Institute for Health Research, Pamplona E-31008, Spain
- Bioinformatics Platform, CIMA University of Navarra, Pamplona E-31008, Spain
- Victor Segura
- Tracasa Instrumental, Sarriguren E-31621, Spain
- Sección de Ingeniería del Dato, Dirección General de Telecomunicaciones y Digitalización, Gobierno de Navarra, Sarriguren E-31621, Spain
|
20
|
Wagner MJ, Pratapa A, Murali TM. Reconstructing signaling pathways using regular language constrained paths. Bioinformatics 2020; 35:i624-i633. [PMID: 31510694 PMCID: PMC6612893 DOI: 10.1093/bioinformatics/btz360] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION High-quality curation of the proteins and interactions in signaling pathways is slow and painstaking. As a result, many experimentally detected interactions are not annotated to any pathways. A natural question that arises is whether or not it is possible to automatically leverage existing pathway annotations to identify new interactions for inclusion in a given pathway. RESULTS We present RegLinker, an algorithm that achieves this purpose by computing multiple short paths from pathway receptors to transcription factors within a background interaction network. The key idea underlying RegLinker is the use of regular language constraints to control the number of non-pathway interactions that are present in the computed paths. We systematically evaluate RegLinker and five alternative approaches against a comprehensive set of 15 signaling pathways and demonstrate that RegLinker recovers withheld pathway proteins and interactions with the best precision and recall. We used RegLinker to propose new extensions to the pathways. We discuss the literature that supports the inclusion of these proteins in the pathways. These results show the broad potential of automated analysis to attenuate difficulties of traditional manual inquiry. AVAILABILITY AND IMPLEMENTATION https://github.com/Murali-group/RegLinker. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
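The core idea of a regular-language constrained path can be sketched as a shortest-path search over pairs of (network node, automaton state). The example below is a generic illustration, not RegLinker itself, which computes multiple paths (see the repository above). The edge labels `"p"` (pathway) and `"n"` (non-pathway), the automaton that allows at most one non-pathway edge, and the toy network are assumptions.

```python
import heapq


def constrained_shortest_path(edges, dfa, start_state, accepting, source, target):
    """Dijkstra over (node, DFA state) pairs.

    edges: {u: [(v, label, cost), ...]}
    dfa:   {(state, label): next_state}; missing entries mean "not allowed".
    """
    queue = [(0.0, source, start_state, [source])]
    seen = set()
    while queue:
        cost, node, state, path = heapq.heappop(queue)
        if (node, state) in seen:
            continue
        seen.add((node, state))
        if node == target and state in accepting:
            return cost, path
        for nxt, label, weight in edges.get(node, []):
            nstate = dfa.get((state, label))
            if nstate is not None and (nxt, nstate) not in seen:
                heapq.heappush(queue, (cost + weight, nxt, nstate, path + [nxt]))
    return None


# Hypothetical network from a receptor "R" to a transcription factor "T".
edges = {
    "R": [("A", "n", 1.0), ("B", "p", 1.0)],
    "A": [("T", "n", 1.0)],
    "B": [("C", "n", 1.0)],
    "C": [("T", "p", 1.0)],
}
# The DFA state counts non-pathway edges used so far; a second "n" is rejected.
at_most_one_n = {(0, "p"): 0, (0, "n"): 1, (1, "p"): 1}
print(constrained_shortest_path(edges, at_most_one_n, 0, {0, 1}, "R", "T"))
# -> (3.0, ['R', 'B', 'C', 'T'])
```

The unconstrained shortest path R, A, T costs less but uses two non-pathway edges, so the automaton rejects it and the search returns the longer path with a single non-pathway edge.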
Affiliation(s)
- Aditya Pratapa
- Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
- T M Murali
- Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
|
21
|
Wekesa JS, Luan Y, Meng J. Predicting Protein Functions Based on Differential Co-expression and Neighborhood Analysis. J Comput Biol 2020; 28:1-18. [PMID: 32302512 DOI: 10.1089/cmb.2019.0120] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 01/22/2023] Open
Abstract
Proteins are polypeptides essential in biological processes. Protein physical interactions are complemented by other types of functional relationship data including genetic interactions, knowledge about co-expression, and evolutionary pathways. Existing algorithms integrate protein interaction and gene expression data to retrieve context-specific subnetworks composed of genes/proteins with known and unknown functions. However, most protein function prediction algorithms fail to exploit diverse intrinsic information in feature and label spaces. We develop a novel integrative method based on differential Co-expression analysis and Neighbor-voting algorithm for Protein Function Prediction, namely CNPFP. The method integrates heterogeneous data and exploits intrinsic and latent linkages via a global iterative approach and genomic features. CNPFP performs three tasks: clustering, differential co-expression analysis, and protein function prediction. Our aim is to identify yeast cell cycle-specific proteins linked to differentially expressed proteins in the protein-protein interaction network. To capture intrinsic information, CNPFP selects the most relevant feature subset based on a global iterative neighbor-voting algorithm. We identify eight condition-specific modules. The most relevant subnetwork has 87 genes highly enriched with cyclin-dependent kinases, a protein kinase relevant for cell cycle regulation. We present comprehensive annotations for 3538 Saccharomyces cerevisiae proteins. Our method achieves an AUROC of 0.9862, accuracy of 0.9710, and F-score of 0.9691. From the results, we conclude that exploiting the intrinsic nature of protein relationships improves the quality of function prediction. Thus, the proposed method is useful in functional genomics studies.
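Basic neighbor voting, the building block named in this abstract, can be sketched in a few lines. This is the textbook single-pass form, not CNPFP's global iterative variant; the function name `neighbor_vote`, the edge weights and the toy annotations are assumptions for illustration.

```python
def neighbor_vote(adj, annotations, function):
    """Score each unannotated protein by the weighted share of its neighbours
    that carry `function`.

    adj:         {protein: {neighbour: weight}}
    annotations: {protein: set of known functions}
    """
    scores = {}
    for protein, neighbours in adj.items():
        if protein in annotations:
            continue  # only proteins without any known function are scored
        total = sum(neighbours.values())
        votes = sum(
            w for n, w in neighbours.items() if function in annotations.get(n, set())
        )
        scores[protein] = votes / total if total else 0.0
    return scores


# Hypothetical interaction network around an unannotated protein "P1".
ppi = {
    "P1": {"P2": 1.0, "P3": 1.0, "P4": 2.0},
    "P2": {"P1": 1.0},
    "P3": {"P1": 1.0},
    "P4": {"P1": 2.0},
}
known = {"P2": {"cell cycle"}, "P3": {"transport"}, "P4": {"cell cycle"}}
print(neighbor_vote(ppi, known, "cell cycle"))  # -> {'P1': 0.75}
```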
Affiliation(s)
- Jael Sanyanda Wekesa
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
- School of Computing and Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
- Yushi Luan
- School of Life Science and Biotechnology, Dalian University of Technology, Dalian, China
- Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
|
22
|
Liu C, Ma Y, Zhao J, Nussinov R, Zhang YC, Cheng F, Zhang ZK. Computational network biology: Data, models, and applications. PHYSICS REPORTS 2020; 846:1-66. [DOI: 10.1016/j.physrep.2019.12.004] [Citation(s) in RCA: 86] [Impact Index Per Article: 17.2] [Indexed: 01/03/2025]
|
23
|
Picart-Armada S, Barrett SJ, Willé DR, Perera-Lluna A, Gutteridge A, Dessailly BH. Benchmarking network propagation methods for disease gene identification. PLoS Comput Biol 2019; 15:e1007276. [PMID: 31479437 PMCID: PMC6743778 DOI: 10.1371/journal.pcbi.1007276] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 01/24/2019] [Revised: 09/13/2019] [Accepted: 07/16/2019] [Indexed: 12/17/2022] Open
Abstract
In-silico identification of potential target genes for disease is an essential aspect of drug target discovery. Recent studies suggest that successful targets can be found by leveraging genetic, genomic and protein interaction information. Here, we systematically tested the ability of 12 varied algorithms, based on network propagation, to identify genes that have been targeted by any drug, on gene-disease data from 22 common non-cancerous diseases in OpenTargets. We considered two biological networks, six performance metrics and compared two types of input gene-disease association scores. The impact of the design factors on performance was quantified through additive explanatory models. Standard cross-validation led to over-optimistic performance estimates due to the presence of protein complexes. In order to obtain realistic estimates, we introduced two novel protein complex-aware cross-validation schemes. When seeding biological networks with known drug targets, machine learning and diffusion-based methods found around 2-4 true targets within the top 20 suggestions. Seeding the networks with genes associated with disease by genetics decreased performance below 1 true hit on average. The use of a larger network, although noisier, improved overall performance. We conclude that diffusion-based prioritisers and machine learning applied to diffusion-based features are suited for drug discovery in practice and improve over simpler neighbour-voting methods. We also demonstrate the large impact of choosing an adequate validation strategy and the definition of seed disease genes.
The use of biological network data has proven its effectiveness in many areas of computational biology. Networks consist of nodes, usually genes or proteins, and edges that connect pairs of nodes, representing information such as physical interactions, regulatory roles or co-occurrence. In order to find new candidate nodes for a given biological property, the so-called network propagation algorithms start from the set of known nodes with that property and leverage the connections from the biological network to make predictions. Here, we assess the performance of several network propagation algorithms to find sensible gene targets for 22 common non-cancerous diseases, i.e. those that have been found promising enough to start clinical trials with any compound. We focus on obtaining performance metrics that reflect a practical scenario in drug development where only a small set of genes can be assayed. We found that the presence of protein complexes biased the performance estimates, leading to over-optimistic conclusions, and introduced two novel strategies to address it. Our results support that network propagation is still a viable approach to find drug targets, but that special care needs to be taken with the validation strategy. Algorithms benefitted from the use of a larger (although noisier) network and of direct evidence data, rather than indirect genetic associations to disease.
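The motivation for complex-aware cross-validation can be illustrated with a group-wise fold assignment: genes that share a protein complex are kept in the same fold, so a held-out gene is never predicted from its own complex partners. The abstract does not specify the paper's two schemes, so the sketch below is a generic grouped split; the function name `complex_aware_folds` and the toy data are assumptions.

```python
def complex_aware_folds(genes, complexes, k=3):
    """Assign genes to k folds so that members of one complex never straddle folds."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    # Merge genes that co-occur in a complex; overlapping complexes chain together.
    for members in complexes:
        members = [g for g in members if g in parent]
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    groups = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)

    # Greedy balancing: place the largest groups first into the emptiest fold.
    folds = [[] for _ in range(k)]
    for group in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(group)
    return folds


# Hypothetical genes and complexes; g3 links the first two complexes together.
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
complexes = [["g1", "g2", "g3"], ["g3", "g4"], ["g6", "g7"]]
for i, fold in enumerate(complex_aware_folds(genes, complexes)):
    print(i, fold)
```

The trade-off is fold balance: one very large complex can dominate a fold, which the greedy placement mitigates but cannot remove.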
Affiliation(s)
- Sergio Picart-Armada
- B2SLab, Departament d’Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, CIBER-BBN, Barcelona, Spain
- Networking Biomedical Research Centre in the subject area of Bioengineering, Biomaterials and Nanomedicine (CIBER-BBN), Madrid, Spain
- Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, Esplugues de Llobregat, Spain
- Alexandre Perera-Lluna
- B2SLab, Departament d’Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, CIBER-BBN, Barcelona, Spain
- Networking Biomedical Research Centre in the subject area of Bioengineering, Biomaterials and Nanomedicine (CIBER-BBN), Madrid, Spain
- Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, Esplugues de Llobregat, Spain
- Alex Gutteridge
- Computational Biology and Statistics, GSK, Stevenage, United Kingdom
|
24
|
Chen W, Li W, Huang G, Flavel M. The Applications of Clustering Methods in Predicting Protein Functions. CURR PROTEOMICS 2019. [DOI: 10.2174/1570164616666181212114612] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]
Abstract
Background:
The understanding of protein function is essential to the study of biological processes. However, the prediction of protein function has been a difficult task for bioinformatics to overcome. This has resulted in many scholars focusing on the development of computational methods to address this problem.
Objective:
In this review, we introduce the recently developed computational methods of protein function prediction and assess the validity of these methods. We then introduce the applications of clustering methods in predicting protein functions.
Affiliation(s)
- Weiyang Chen
- College of Information, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Weiwei Li
- College of Information, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Guohua Huang
- College of Information Engineering, Shaoyang University, Shaoyang, Hunan 422000, China
- Matthew Flavel
- School of Life Sciences, La Trobe University, Bundoora, Vic 3083, Australia
|
25
|
Lin CH, Konecki DM, Liu M, Wilson SJ, Nassar H, Wilkins AD, Gleich DF, Lichtarge O. Multimodal network diffusion predicts future disease-gene-chemical associations. Bioinformatics 2019; 35:1536-1543. [PMID: 30304494 PMCID: PMC6499233 DOI: 10.1093/bioinformatics/bty858] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 03/23/2018] [Revised: 09/14/2018] [Accepted: 10/08/2018] [Indexed: 01/05/2023] Open
Abstract
MOTIVATION Precision medicine is an emerging field with hopes to improve patient treatment and reduce morbidity and mortality. To these ends, computational approaches have predicted associations among genes, chemicals and diseases. Such efforts, however, were often limited to using just some available association types. This lowers prediction coverage and, since prior evidence shows that integrating heterogeneous data is likely beneficial, it may limit accuracy. Therefore, we systematically tested whether using more association types improves prediction. RESULTS We study multimodal networks linking diseases, genes and chemicals (drugs) by applying three diffusion algorithms and varying information content. Ten-fold cross-validation shows that these networks are internally consistent, both within and across association types. Also, diffusion methods recovered missing edges, even if all the edges from an entire mode of association were removed. This suggests that information is transferable between these association types. As a realistic validation, time-stamped experiments simulated the predictions of future associations based solely on information known prior to a given date. The results show that many future published results are predictable from current associations. Moreover, in most cases, using more association types increases prediction coverage without significantly decreasing sensitivity and specificity. In case studies, literature-supported validation shows that these predictions mimic human-formulated hypotheses. Overall, this study suggests that diffusion over a more comprehensive multimodal network will generate more useful hypotheses of associations among diseases, genes and chemicals, which may guide the development of precision therapies. AVAILABILITY AND IMPLEMENTATION Code and data are available at https://github.com/LichtargeLab/multimodal-network-diffusion. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chih-Hsu Lin
- Graduate Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, TX, USA
- Daniel M Konecki
- Graduate Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, TX, USA
- Meng Liu
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Stephen J Wilson
- Department of Biochemistry and Molecular Biology, Houston, TX, USA
- Huda Nassar
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Angela D Wilkins
- Departments of Molecular and Human Genetics, and Pharmacology, Houston, TX, USA
- Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, TX, USA
- David F Gleich
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Olivier Lichtarge
- Graduate Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, TX, USA
- Department of Biochemistry and Molecular Biology, Houston, TX, USA
- Departments of Molecular and Human Genetics, and Pharmacology, Houston, TX, USA
- Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, TX, USA
|
26
|
Golestan Hashemi FS, Razi Ismail M, Rafii Yusop M, Golestan Hashemi MS, Nadimi Shahraki MH, Rastegari H, Miah G, Aslani F. Intelligent mining of large-scale bio-data: Bioinformatics applications. BIOTECHNOL BIOTEC EQ 2017. [DOI: 10.1080/13102818.2017.1364977] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/05/2023] Open
Affiliation(s)
- Farahnaz Sadat Golestan Hashemi
- Plant Genetics, AgroBioChem Department, Gembloux Agro-Bio Tech, University of Liege, Liege, Belgium
- Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Mohd Razi Ismail
- Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Department of Crop Science, Faculty of Agriculture, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Mohd Rafii Yusop
- Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Department of Crop Science, Faculty of Agriculture, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Mahboobe Sadat Golestan Hashemi
- Department of Software Engineering, Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Big Data Research Center, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Mohammad Hossein Nadimi Shahraki
- Department of Software Engineering, Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Big Data Research Center, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Hamid Rastegari
- Department of Software Engineering, Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Gous Miah
- Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Farzad Aslani
- Department of Crop Science, Faculty of Agriculture, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
|