1
Spencer NR, Gunabalasingam M, Dial K, Di X, Malcolm T, Magarvey NA. An integrated AI knowledge graph framework of bacterial enzymology and metabolism. Proc Natl Acad Sci U S A 2025; 122:e2425048122. [PMID: 40193601 PMCID: PMC12012490 DOI: 10.1073/pnas.2425048122] [Received: 12/05/2024] [Accepted: 02/27/2025] [Indexed: 04/09/2025] Open
Abstract
The study of bacterial metabolism holds immense significance for improving human health and advancing agricultural practices. The prospective applications of genomically encoded bacterial metabolism present a compelling opportunity, particularly in light of the rapid expansion of genomic sequencing data. Current metabolic inference tools face challenges in scaling with large datasets, leading to increased computational demands, and often exhibit limited inter-relatability and interoperability. Here, we introduce the Integrated Biosynthetic Inference Suite (IBIS), which employs deep learning models and a knowledge graph to facilitate rapid, scalable bacterial metabolic inference. This system leverages a series of Transformer-based models to generate high-quality, meaningful embeddings for individual enzymes, biosynthetic domains, and metabolic pathways. These embedded representations enable rapid, large-scale comparisons of metabolic proteins and pathways, surpassing the capabilities of conventional methodologies. The examination of evolutionary and functionally conserved metabolites across diverse bacterial species is facilitated by integrating the predictive capabilities of IBIS into a graph database enriched with comprehensive metadata. The consideration of both primary and specialized metabolism, combined with an embedding logic for enzyme discovery, uniquely positions IBIS to identify potential novel metabolic pathways. With the expansion of genomic data necessitating transformative approaches to advance molecular metabolism research, IBIS delivers an AI-driven holistic investigation of bacterial metabolism.
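The embedding-based comparison described in this abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure IBIS uses, so cosine similarity is assumed here, and the function names, enzyme labels, and three-dimensional toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, library):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy embeddings; real model embeddings would have far more dimensions.
library = {
    "enzyme_A": [1.0, 0.0, 0.0],
    "enzyme_B": [0.6, 0.8, 0.0],
    "enzyme_C": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity([1.0, 0.0, 0.0], library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Once every enzyme or pathway is a fixed-length vector, a comparison reduces to a few arithmetic operations, which is what makes large-scale comparison cheap relative to pairwise sequence alignment.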
Affiliation(s)
- Norman R. Spencer
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON L8S 4L8, Canada
- Mathusan Gunabalasingam
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON L8S 4L8, Canada
- Keshav Dial
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON L8S 4L8, Canada
- Xiaxia Di
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON L8S 4L8, Canada
- Tonya Malcolm
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON L8S 4L8, Canada
- Nathan A. Magarvey
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON L8S 4L8, Canada
2
Li T, Zhang Y, Su D, Liu M, Ge M, Chen L, Li C, Tang J. Knowledge Graph-Based Few-Shot Learning for Label of Medical Imaging Reports. Acad Radiol 2025:S1076-6332(25)00189-8. [PMID: 40140273 DOI: 10.1016/j.acra.2025.02.045] [Received: 11/07/2024] [Revised: 02/23/2025] [Accepted: 02/25/2025] [Indexed: 03/28/2025]
Abstract
BACKGROUND: The application of artificial intelligence (AI) to automatic imaging report labeling faces the challenge of manually labeling large datasets. PURPOSE: To propose a data augmentation method using a knowledge graph (KG) and few-shot learning. METHODS: A KG of lumbar spine X-ray images was constructed, and 2000 data items were annotated based on the KG and divided into training, validation, and test sets in a ratio of 7:2:1. The training dataset was augmented based on the synonym/replacement attributes of the KG, and the augmented data were input into the BERT (Bidirectional Encoder Representations from Transformers) model for automatic annotation training. The performance of the model under different augmentation ratios (1:10, 1:100, 1:1000) and augmentation methods (synonyms only, replacements only, combination of synonyms and replacements) was evaluated using precision and F1 scores. In addition, with the augmentation ratio fixed, iterative experiments were performed by supplementing the data of nodes that performed poorly on the validation set to further improve the model's performance. RESULTS: Prior to data augmentation, the precision was 0.728 and the F1 score was 0.666. By adjusting the augmentation ratio, the precision increased from 0.912 at a 1:10 augmentation ratio to 0.932 at a 1:100 augmentation ratio (P<.05), while the F1 score improved from 0.853 at a 1:10 augmentation ratio to 0.881 at a 1:100 augmentation ratio (P<.05). Additionally, the effectiveness of various augmentation methods was compared at a 1:100 augmentation ratio. The augmentation method that combined synonyms and replacements (F1=0.881) was superior to the methods that used only synonyms (F1=0.815) or only replacements (F1=0.753) (P<.05). For nodes that exhibited suboptimal performance on the validation set, supplementing the training set with target data improved model performance, increasing the average F1 score to 0.979 (P<.05).
CONCLUSION: Based on the KG, this study trained an automatic labeling model of radiology reports using a few-shot data set. This method effectively reduces the workload of manual labeling, improves the efficiency and accuracy of image data labeling, and provides an important research strategy for the application of AI in the domain of automatic labeling of image reports.
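The 7:2:1 split and the synonym-driven augmentation described in the methods can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the synonym table, the example sentence, and the function names are hypothetical, and only the synonym route is shown (the KG's "replacement" attribute is not modeled).

```python
import random


def split_7_2_1(items, seed=0):
    """Shuffle items reproducibly and split them into train/validation/test at 7:2:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 7 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def augment(sentence, synonyms, max_variants):
    """Return up to max_variants rewrites, each swapping one known term for a synonym."""
    variants = []
    for term, alternatives in synonyms.items():
        if term not in sentence:
            continue
        for alternative in alternatives:
            variants.append(sentence.replace(term, alternative))
            if len(variants) == max_variants:
                return variants
    return variants


# Hypothetical synonym attributes, standing in for entries read from a KG.
synonyms = {
    "narrowing": ["stenosis", "reduction"],
    "disc space": ["intervertebral space"],
}
train, val, test = split_7_2_1(range(2000))
variants = augment("narrowing of the L4-L5 disc space", synonyms, max_variants=10)
print(len(train), len(val), len(test))
print(variants)
```

An augmentation ratio such as 1:10 would correspond to `max_variants=10` in this sketch. A small synonym table cannot supply that many rewrites on its own, which is consistent with the abstract's finding that combining synonyms and replacements worked best.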
Affiliation(s)
- Tiancheng Li
  - The First Affiliated Hospital of Anhui Medical University, Anhui Medical University, Hefei 230032, China (T.L., D.S., J.T.); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China (T.L., D.S., C.L., J.T.)
- Yuxuan Zhang
  - College of Medical Information Engineering, Anhui University of Traditional Chinese Medicine, Hefei, China (Y.Z., M.G., L.C., C.L.)
- Deyu Su
  - The First Affiliated Hospital of Anhui Medical University, Anhui Medical University, Hefei 230032, China (T.L., D.S., J.T.); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China (T.L., D.S., C.L., J.T.)
- Ming Liu
  - College of Artificial Intelligence, Anhui University, Hefei, China (M.L.)
- Mingxin Ge
  - College of Medical Information Engineering, Anhui University of Traditional Chinese Medicine, Hefei, China (Y.Z., M.G., L.C., C.L.)
- Linyu Chen
  - College of Medical Information Engineering, Anhui University of Traditional Chinese Medicine, Hefei, China (Y.Z., M.G., L.C., C.L.)
- Chuanfu Li
  - College of Medical Information Engineering, Anhui University of Traditional Chinese Medicine, Hefei, China (Y.Z., M.G., L.C., C.L.); First Clinical Medical College, Anhui University of Traditional Chinese Medicine, Hefei, China (C.L.); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China (T.L., D.S., C.L., J.T.)
- Jin Tang
  - The First Affiliated Hospital of Anhui Medical University, Anhui Medical University, Hefei 230032, China (T.L., D.S., J.T.); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China (T.L., D.S., C.L., J.T.)
3
Zhang Y, Sui X, Pan F, Yu K, Li K, Tian S, Erdengasileng A, Han Q, Wang W, Wang J, Wang J, Sun D, Chung H, Zhou J, Zhou E, Lee B, Zhang P, Qiu X, Zhao T, Zhang J. A comprehensive large scale biomedical knowledge graph for AI powered data driven biomedical research. bioRxiv 2025:2023.10.13.562216. [PMID: 38168218 PMCID: PMC10760044 DOI: 10.1101/2023.10.13.562216] [Indexed: 01/05/2024]
Abstract
To address the rapid growth of scientific publications and data in biomedical research, knowledge graphs (KGs) have become a critical tool for integrating large volumes of heterogeneous data to enable efficient information retrieval and automated knowledge discovery (AKD). However, transforming unstructured scientific literature into KGs remains a significant challenge, with previous methods unable to achieve human-level accuracy. In this study, we utilized an information extraction pipeline that won first place in the LitCoin NLP Challenge (2022) to construct a large-scale KG named iKraph using all PubMed abstracts. The extracted information matches human expert annotations and significantly exceeds the content of manually curated public databases. To enhance the KG's comprehensiveness, we integrated relation data from 40 public databases and relation information inferred from high-throughput genomics data. This KG facilitates rigorous performance evaluation of AKD, which was infeasible in previous studies. We designed an interpretable, probabilistic-based inference method to identify indirect causal relations and applied it to real-time COVID-19 drug repurposing from March 2020 to May 2023. Our method identified 600-1400 candidate drugs per month, with one-third of those discovered in the first two months later supported by clinical trials or PubMed publications. These outcomes are very challenging to attain through alternative approaches that lack a thorough understanding of the existing literature. A cloud-based platform (https://biokde.insilicom.com) was developed for academic users to access this rich structured data and associated tools.
Affiliation(s)
- Yuan Zhang
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
  - Insilicom LLC, Tallahassee, FL 32303
- Xin Sui
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Feng Pan
  - Insilicom LLC, Tallahassee, FL 32303
- Keqiao Li
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Shubo Tian
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Qing Han
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Wanjing Wang
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Jian Wang
  - 977 Wisteria Ter., Sunnyvale, CA 94086
- Jun Zhou
  - Insilicom LLC, Tallahassee, FL 32303
- Eric Zhou
  - Insilicom LLC, Tallahassee, FL 32303
- Ben Lee
  - Insilicom LLC, Tallahassee, FL 32303
- Peili Zhang
  - Forward Informatics, Winchester, Massachusetts, 01890
- Xing Qiu
  - Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14642
- Tingting Zhao
  - Insilicom LLC, Tallahassee, FL 32303
  - Department of Geography, Florida State University, Tallahassee, FL 32306
- Jinfeng Zhang
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
  - Insilicom LLC, Tallahassee, FL 32303
4
Robertson H, Han BA, Castellanos AA, Rosado D, Stott G, Zimmerman R, Drake JM, Graeden E. Understanding ecological systems using knowledge graphs: an application to highly pathogenic avian influenza. Bioinform Adv 2025; 5:vbaf016. [PMID: 40041112 PMCID: PMC11879169 DOI: 10.1093/bioadv/vbaf016] [Received: 09/10/2024] [Revised: 12/23/2024] [Accepted: 01/31/2025] [Indexed: 03/06/2025]
Abstract
Motivation: Ecological systems are complex. Representing heterogeneous knowledge about ecological systems is a pervasive challenge because data are generated from many subdisciplines, exist in disparate sources, and only capture a subset of interactions underpinning system dynamics. Knowledge graphs (KGs) have been successfully applied to organize heterogeneous data and to predict new linkages in complex systems. Though not previously applied broadly in ecology, KGs have much to offer in an era when system dynamics are responding to rapid changes across multiple scales. Results: We developed a KG to demonstrate the method's utility for ecological problems focused on highly pathogenic avian influenza (HPAI), a highly transmissible virus with a broad host range, wide geographic distribution, and rapid evolution with pandemic potential. We describe the development of a graph to include data related to HPAI including pathogen-host associations, species distributions, and population demographics, using a semantic ontology that defines relationships within and between datasets. We use the graph to perform a set of proof-of-concept analyses validating the method and identifying patterns of HPAI ecology. We underscore the generalizable value of KGs to ecology, including their ability to reveal previously known relationships and testable hypotheses in support of a deeper mechanistic understanding of ecological systems. Availability and implementation: The data and code are available under the MIT License on GitHub at https://github.com/cghss-data-lab/uga-pipp.
Affiliation(s)
- Hailey Robertson
  - Department of Epidemiology of Microbial Diseases, Yale University School of Public Health, New Haven, CT 06510, United States
  - Center for Global Health Science and Security, Georgetown University Medical Center, Washington, DC 20007, United States
- Barbara A Han
  - Cary Institute of Ecosystem Studies, Millbrook, NY 12545, United States
- David Rosado
  - Center for Global Health Science and Security, Georgetown University Medical Center, Washington, DC 20007, United States
- Guppy Stott
  - Institute of Bioinformatics, University of Georgia, Athens, GA 30602, United States
  - Center for Ecology of Infectious Diseases, University of Georgia, Athens, GA 30602, United States
  - Odum School of Ecology, University of Georgia, Athens, GA 30602, United States
- Ryan Zimmerman
  - Center for Global Health Science and Security, Georgetown University Medical Center, Washington, DC 20007, United States
- John M Drake
  - Center for Ecology of Infectious Diseases, University of Georgia, Athens, GA 30602, United States
  - Odum School of Ecology, University of Georgia, Athens, GA 30602, United States
- Ellie Graeden
  - Center for Global Health Science and Security, Georgetown University Medical Center, Washington, DC 20007, United States
  - Massive Data Institute, Georgetown University, Washington, DC 20007, United States
5
Wang X, Zhang M, Xu J, Li X, Xiong J, Cao H, Dou F, Zhai X, Sun H. A novel approach for target deconvolution from phenotype-based screening using knowledge graph. Sci Rep 2025; 15:2414. [PMID: 39827292 PMCID: PMC11742725 DOI: 10.1038/s41598-025-86166-w] [Received: 08/26/2024] [Accepted: 01/08/2025] [Indexed: 01/22/2025] Open
Abstract
Deconvoluting drug targets is crucial in modern drug development, yet both traditional and artificial intelligence (AI)-driven methods face challenges in terms of completeness, accuracy, and efficiency. Identifying drug targets, especially within complex systems such as the p53 pathway, remains a formidable task. The regulation of this pathway by myriad stress signals and regulatory elements adds layers of complexity to the discovery of effective p53 pathway activators. Recent insights into p53 activation have led to two main screening strategies for p53 activators. The target-based approach focuses on p53 and its regulators (MDM2, MDMX, USP7, Sirt proteins), but requires separate systems for each target and may miss multi-target compounds. Phenotype-based screening can reveal new targets but involves a lengthy process to elucidate mechanisms and targets, hindering drug development. Knowledge graphs have emerged as powerful tools that offer strengths in link prediction and knowledge inference to address these issues. In this study, we constructed a protein-protein interaction knowledge graph (PPIKG) and pioneered an integrated drug target deconvolution system that combines AI with molecular docking techniques. Analysis based on the PPIKG narrowed down candidate proteins from 1088 to 35, significantly saving time and cost. Subsequent molecular docking led us to pinpoint USP7 as a direct target for the p53 pathway activator UNBS5162. Leveraging knowledge graphs and a multidisciplinary approach allows us to streamline the laborious and expensive process of reverse targeting drug discovery through phenotype screening. Our findings have the potential to revolutionize drug screening and open new avenues in pharmacological research, increasing the speed and efficiency of pursuing novel therapeutics. The code is available at https://github.com/Xiong-Jing/PPIKG.
Affiliation(s)
- Xiaohong Wang
  - Shandong Foreign Trade Vocational College, Qingdao, 266100, China
- Meifang Zhang
  - Key Laboratory of Marine Drugs, Chinese Ministry of Education, School of Medicine and Pharmacy, Ocean University of China, Qingdao, 266100, China
- Jianliang Xu
  - Faculty of Information Science and Engineering, Ocean University of China, Qingdao, 266071, China
- Xin Li
  - Gansu Health Vocational College, Lanzhou, 730000, China
- Jing Xiong
  - School of Computer Science, Qufu Normal University, Rizhao, 276827, China
  - Rizhao-Qufu Normal University Joint Technology Transfer Center, Rizhao, 276827, China
  - International Joint Research Laboratory for Perception Data Intelligent Processing of Henan, Anyang Normal University, Anyang, 455000, China
- Haowei Cao
  - Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan, 255000, China
  - Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan, 255000, China
- Fangkun Dou
  - Oceanographic Data Center, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Xue Zhai
  - School of Engineering, Qufu Normal University, Rizhao, 276827, China
- Hua Sun
  - International Joint Research Laboratory for Perception Data Intelligent Processing of Henan, Anyang Normal University, Anyang, 455000, China
6
Wu Y, Xie X, Zhu J, Guan L, Li M. Overview and Prospects of DNA Sequence Visualization. Int J Mol Sci 2025; 26:477. [PMID: 39859192 PMCID: PMC11764684 DOI: 10.3390/ijms26020477] [Received: 11/19/2024] [Revised: 12/30/2024] [Accepted: 01/04/2025] [Indexed: 01/27/2025] Open
Abstract
Due to advances in big data technology, deep learning, and knowledge engineering, biological sequence visualization has been extensively explored. In the post-genome era, biological sequence visualization enables the visual representation of both structured and unstructured biological sequence data. However, a universal visualization method for all types of sequences has not been reported. Biological sequence data are expanding exponentially, and the acquisition, extraction, fusion, and inference of knowledge from biological sequences are critical supporting technologies for visualization research. These areas are important and require in-depth exploration. This paper presents a comprehensive overview of visualization methods for DNA sequences from four perspectives (two-dimensional, three-dimensional, four-dimensional, and dynamic visualization) and discusses the strengths and limitations of each method in detail. Furthermore, this paper proposes two potential future research directions for biological sequence visualization in response to the challenges of inefficient graphical feature extraction and knowledge association network generation in existing methods. The first direction is the construction of knowledge graphs for biological sequence big data, and the second direction is the cross-modal visualization of biological sequences using machine learning methods. This review is anticipated to provide valuable insights and contributions to computational biology, bioinformatics, genomic computing, genetic breeding, evolutionary analysis, and other related disciplines in the fields of biology, medicine, chemistry, statistics, and computing. It has important reference value for biological sequence recommendation systems and knowledge question answering systems.
Affiliation(s)
- Mengshan Li
  - School of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, China; (Y.W.); (X.X.); (J.Z.); (L.G.)
7
Yin H, Duo H, Li S, Qin D, Xie L, Xiao Y, Sun J, Tao J, Zhang X, Li Y, Zou Y, Yang Q, Yang X, Hao Y, Li B. Unlocking biological insights from differentially expressed genes: Concepts, methods, and future perspectives. J Adv Res 2024:S2090-1232(24)00560-5. [PMID: 39647635 DOI: 10.1016/j.jare.2024.12.004] [Received: 07/28/2024] [Revised: 10/12/2024] [Accepted: 12/03/2024] [Indexed: 12/10/2024] Open
Abstract
BACKGROUND: Identifying differentially expressed genes (DEGs) is a core task of transcriptome analysis, as DEGs can reveal the molecular mechanisms underlying biological processes. However, interpreting the biological significance of large DEG lists is challenging. Currently, gene ontology, pathway enrichment, and protein-protein interaction analysis are common strategies employed by biologists. Additionally, emerging analytical strategies (such as network module analysis, knowledge graph, drug repurposing, cell marker discovery, trajectory analysis, and cell communication analysis) have been proposed. Despite these advances, comprehensive guidelines for systematically and thoroughly mining the biological information within DEGs remain lacking. AIM OF REVIEW: This review aims to provide an overview of essential concepts and methodologies for the biological interpretation of DEGs, enhancing the contextual understanding. It also addresses the current limitations and future perspectives of these approaches, highlighting their broad applications in deciphering the molecular mechanism of complex diseases and phenotypes. To assist users in extracting insights from extensive datasets, especially various DEG lists, we developed DEGMiner (https://www.ciblab.net/DEGMiner/), which integrates over 300 easily accessible databases and tools. KEY SCIENTIFIC CONCEPTS OF REVIEW: This review offers strong support and guidance for exploring DEGs and will accelerate the discovery of hidden biological insights within genomes.
Affiliation(s)
- Huachun Yin
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, PR China; Department of Neurosurgery, Xinqiao Hospital, The Army Medical University, Chongqing 400037, PR China; Department of Neurobiology, Chongqing Key Laboratory of Neurobiology, The Army Medical University, Chongqing 400038, PR China
- Hongrui Duo
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, PR China
- Song Li
  - Department of Neurosurgery, Xinqiao Hospital, The Army Medical University, Chongqing 400037, PR China
- Dan Qin
  - Department of Biology, College of Science, Northeastern University, Boston, MA 02115, USA
- Lingling Xie
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, PR China
- Yingxue Xiao
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, PR China
- Jing Sun
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, PR China
- Jingxin Tao
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, PR China
- Xiaoxi Zhang
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, PR China
- Yinghong Li
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, PR China
- Yue Zou
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, PR China
- Qingxia Yang
  - Zhejiang Provincial Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou 310058, PR China
- Xian Yang
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, PR China
- Youjin Hao
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, PR China
- Bo Li
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, PR China
8
Sunil RS, Lim SC, Itharajula M, Mutwil M. The gene function prediction challenge: Large language models and knowledge graphs to the rescue. Curr Opin Plant Biol 2024; 82:102665. [PMID: 39579414 DOI: 10.1016/j.pbi.2024.102665] [Received: 08/13/2024] [Revised: 10/23/2024] [Accepted: 10/24/2024] [Indexed: 11/25/2024]
Abstract
Elucidating gene function is one of the ultimate goals of plant science. Despite this, only ~15% of all genes in the model plant Arabidopsis thaliana have comprehensively experimentally verified functions. While bioinformatic gene function prediction approaches can guide biologists in their experimental efforts, neither the performance of gene function prediction methods nor the number of experimentally characterized genes has increased dramatically in recent years. In this review, we will discuss the status quo and the trajectory of gene function elucidation and outline the recent advances in gene function prediction approaches. We will then discuss how recent artificial intelligence advances in large language models and knowledge graphs can be leveraged to accelerate gene function predictions and keep us updated with scientific literature.
Affiliation(s)
- Rohan Shawn Sunil
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
- Shan Chun Lim
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
- Manoj Itharajula
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
- Marek Mutwil
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
9
Chai X, An S, Chen S, Li W, Feng Z, Li X, Gong H, Luo Q, Li A. Knowledge mining of brain connectivity in massive literature based on transfer learning. Bioinformatics 2024; 40:btae648. [PMID: 39656949 PMCID: PMC11631446 DOI: 10.1093/bioinformatics/btae648] [Received: 04/28/2024] [Revised: 10/17/2024] [Accepted: 12/04/2024] [Indexed: 12/17/2024] Open
Abstract
MOTIVATION: Neuroscientists have long endeavored to map brain connectivity, yet the intricate nature of brain networks often leads them to concentrate on specific regions, hindering efforts to unveil a comprehensive connectivity map. Recent advancements in imaging and text mining techniques have enabled the accumulation of a vast body of literature containing valuable insights into brain connectivity, facilitating the extraction of whole-brain connectivity relations from this corpus. However, the diverse representations of brain region names and connectivity relations pose a challenge for conventional machine learning methods and dictionary-based approaches in identifying all instances accurately. RESULTS: We propose BioSEPBERT, a biomedical pre-trained model based on start-end position pointers and BERT. In addition, our model integrates specialized identifiers with enhanced self-attention capabilities for preceding and succeeding brain regions, thereby improving the performance of named entity recognition and relation extraction in neuroscience. Our approach achieves optimal F1 scores of 85.0%, 86.6%, and 86.5% for named entity recognition, connectivity relation extraction, and directional relation extraction, respectively, surpassing state-of-the-art models by 2.6%, 1.1%, and 1.1%. Furthermore, we leverage BioSEPBERT to extract 22.6 million standardized brain regions and 165 072 directional relations from a corpus comprising 1.3 million abstracts and 193 100 full-text articles. The results demonstrate that our model helps researchers rapidly acquire knowledge regarding neural circuits across various brain regions, thereby enhancing comprehension of brain connectivity in specific regions. AVAILABILITY AND IMPLEMENTATION: Data and source code are available at: http://atlas.brainsmatics.org/res/BioSEPBERT and https://github.com/Brainsmatics/BioSEPBERT.
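The "start-end position pointers" idea named in this abstract can be illustrated by the decoding step alone: given per-token scores for "an entity starts here" and "an entity ends here", recover entity spans. The pairing rule below (each confident start is matched to the nearest confident end at or after it) is a common heuristic and is an assumption here, as are the threshold, the function name, and the toy scores; the abstract does not describe BioSEPBERT's actual decoder.

```python
def decode_spans(tokens, start_probs, end_probs, threshold=0.5):
    """Pair each start position scoring >= threshold with the nearest end position
    (at or after it) scoring >= threshold; return (start, end, text) tuples."""
    spans = []
    for i, p_start in enumerate(start_probs):
        if p_start < threshold:
            continue
        for j in range(i, len(tokens)):
            if end_probs[j] >= threshold:
                spans.append((i, j, " ".join(tokens[i:j + 1])))
                break  # keep only the nearest end for this start
    return spans


# Hypothetical pointer scores, standing in for the outputs of a trained model.
tokens = ["The", "medial", "prefrontal", "cortex", "projects", "to", "the", "amygdala"]
start_probs = [0.1, 0.9, 0.2, 0.1, 0.0, 0.0, 0.1, 0.8]
end_probs = [0.0, 0.1, 0.2, 0.9, 0.0, 0.0, 0.0, 0.7]
spans = decode_spans(tokens, start_probs, end_probs)
print(spans)
```

Pointer-style decoding handles multi-word region names such as "medial prefrontal cortex" as a single span without needing a dictionary entry for every name variant.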
Affiliation(s)
- Xiaokang Chai
  - Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
- Sile An
  - Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
- Simeng Chen
  - Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
- Wenwei Li
  - Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
- Zhao Feng
  - Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China
- Xiangning Li
  - Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China
  - HUST-Suzhou Institute for Brainsmatics, JITRI, Suzhou 215123, China
- Hui Gong
  - Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
  - HUST-Suzhou Institute for Brainsmatics, JITRI, Suzhou 215123, China
- Qingming Luo
  - Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China
- Anan Li
  - Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
  - Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China
  - HUST-Suzhou Institute for Brainsmatics, JITRI, Suzhou 215123, China
10
Li T, Li M, Wu Y, Li Y. Visualization Methods for DNA Sequences: A Review and Prospects. Biomolecules 2024; 14:1447. [PMID: 39595624 PMCID: PMC11592258 DOI: 10.3390/biom14111447] [Received: 10/17/2024] [Revised: 11/08/2024] [Accepted: 11/12/2024] [Indexed: 11/28/2024] Open
Abstract
The efficient analysis and interpretation of biological sequence data remain major challenges in bioinformatics. Graphical representation, as an emerging and effective visualization technique, offers a more intuitive method for analyzing DNA sequences. However, many visualization approaches are dispersed across research databases, requiring urgent organization, integration, and analysis. Additionally, no single visualization method excels in all aspects. To advance these methods, knowledge graphs and advanced machine learning techniques have become key areas of exploration. This paper reviews the current 2D and 3D DNA sequence visualization methods and proposes a new research direction focused on constructing knowledge graphs for biological sequence visualization, explaining the relevant theories, techniques, and models involved. Additionally, we summarize machine learning techniques applicable to sequence visualization, such as graph embedding methods and the use of convolutional neural networks (CNNs) for processing graphical representations. These machine learning techniques and knowledge graphs aim to provide valuable insights into computational biology, bioinformatics, genomic computing, and evolutionary analysis. The study serves as an important reference for improving intelligent search systems, enriching knowledge bases, and enhancing query systems related to biological sequence visualization, offering a comprehensive framework for future research.
Affiliation(s)
- Tan Li
- School of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, China; (T.L.); (Y.L.)
| | - Mengshan Li
- School of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, China; (T.L.); (Y.L.)
| | - Yan Wu
- School of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, China;
| | - Yelin Li
- School of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, China; (T.L.); (Y.L.)
11
Ma P, Shang S, Liu R, Dong Y, Wu J, Gu W, Yu M, Liu J, Li Y, Chen Y. Prediction of teicoplanin plasma concentration in critically ill patients: a combination of machine learning and population pharmacokinetics. J Antimicrob Chemother 2024; 79:2815-2827. [PMID: 39207798 DOI: 10.1093/jac/dkae292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Accepted: 08/02/2024] [Indexed: 09/04/2024] Open
Abstract
BACKGROUND: Teicoplanin has been widely used in patients with infections caused by Staphylococcus aureus, especially critically ill patients. The pharmacokinetics (PK) of teicoplanin vary between individuals and within the same individual. We aim to establish a prediction model via a combination of machine learning and population PK (PPK) to support personalized medication decisions for critically ill patients. METHODS: A retrospective study was performed incorporating 33 variables, including PPK parameters (clearance and volume of distribution). Multiple algorithms and Shapley additive explanations were employed for feature selection to determine the strongest driving factors. RESULTS: The performance of each algorithm with PPK parameters was superior to that without PPK parameters. A combination of support vector regression, categorical boosting and a backpropagation neural network (7:2:1), which had the highest R² (0.809), was selected as the final ensemble model. The model included 15 variables after feature selection, and its predictive performance was superior to that of models considering all variables or using only PPK. The R², mean absolute error, mean squared error, absolute accuracy (±5 mg/L) and relative accuracy (±30%) of external validation were 0.649, 3.913, 28.347, 76.12% and 76.12%, respectively. CONCLUSIONS: Our study offers a non-invasive, fast and cost-effective prediction model of teicoplanin plasma concentration in critically ill patients. The model serves as a fundamental tool for clinicians to determine the effective plasma concentration range of teicoplanin and formulate individualized dosing regimens accordingly.
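As a rough illustration of the quantities named in this abstract, the sketch below blends three sets of predictions with 7:2:1 weights and then computes absolute accuracy (share of predictions within ±5 mg/L of the observed value) and relative accuracy (share within ±30% of the observed value). The function names, the sample concentrations, and the exact accuracy definitions are assumptions made for illustration; the abstract does not state how the authors computed them.

```python
def weighted_ensemble(predictions, weights):
    """Blend per-model predictions using normalized weights (e.g. 7:2:1)."""
    total = sum(weights)
    n = len(predictions[0])
    return [
        sum(w * model[i] for w, model in zip(weights, predictions)) / total
        for i in range(n)
    ]


def absolute_accuracy(observed, predicted, tol=5.0):
    """Fraction of predictions within +/- tol (mg/L) of the observed value."""
    hits = sum(abs(p - o) <= tol for o, p in zip(observed, predicted))
    return hits / len(observed)


def relative_accuracy(observed, predicted, tol=0.30):
    """Fraction of predictions within +/- tol (as a ratio) of the observed value."""
    hits = sum(abs(p - o) <= tol * abs(o) for o, p in zip(observed, predicted))
    return hits / len(observed)


# Hypothetical concentrations in mg/L (not data from the study).
observed = [10.0, 20.0, 30.0, 40.0]
svr = [12.0, 18.0, 33.0, 60.0]        # stand-in for support vector regression
catboost = [11.0, 22.0, 27.0, 50.0]   # stand-in for categorical boosting
bpnn = [9.0, 20.0, 30.0, 40.0]        # stand-in for the backpropagation network

blended = weighted_ensemble([svr, catboost, bpnn], [7, 2, 1])
abs_acc = absolute_accuracy(observed, blended)
rel_acc = relative_accuracy(observed, blended)
print(blended, abs_acc, rel_acc)
```

Normalizing by the sum of the weights means the ratio can be passed as given (7, 2, 1) rather than as fractions.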
Affiliation(s)
- Pan Ma
- Department of Pharmacy, The First Affiliated Hospital of Army Medical University, Chongqing 400038, China
| | - Shenglan Shang
- Department of Clinical Pharmacy, General Hospital of Central Theater Command, Wuhan, Hubei Province 430070, China
| | - Ruixiang Liu
- Department of Pharmacy, The First Affiliated Hospital of Army Medical University, Chongqing 400038, China
| | - Yuzhu Dong
- Department of Pharmacy, The Third Affiliated Hospital of Chongqing Medical University, Chongqing 401120, China
| | - Jiangfan Wu
- Department of Pharmacy, The First Affiliated Hospital of Chongqing Medical University, Chongqing 400016, China
| | - Wenrui Gu
- Department of Pharmacy, The First Affiliated Hospital of Army Medical University, Chongqing 400038, China
| | - Mengchen Yu
- Department of Clinical Pharmacy, General Hospital of Central Theater Command, Wuhan, Hubei Province 430070, China
| | - Jing Liu
- Department of Clinical Pharmacy, General Hospital of Central Theater Command, Wuhan, Hubei Province 430070, China
| | - Ying Li
- Medical Big Data and Artificial Intelligence Center, The First Affiliated Hospital of Army Medical University, Chongqing 400038, China
| | - Yongchuan Chen
- Department of Pharmacy, The First Affiliated Hospital of Army Medical University, Chongqing 400038, China
12
Qi F, Gao N, Li J, Zhou C, Jiang J, Zhou B, Guo L, Feng X, Ji J, Cai Q, Yang L, Zhu R, Que X, Wu J, Xi W, Qin W, Zhang J. A multidimensional recommendation framework for identifying biological targets to aid the diagnosis and treatment of liver metastasis in patients with colorectal cancer. Mol Cancer 2024; 23:239. [PMID: 39449040 PMCID: PMC11515508 DOI: 10.1186/s12943-024-02155-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2024] [Accepted: 10/11/2024] [Indexed: 10/26/2024] Open
Abstract
The quest to understand the molecular mechanisms of tumour metastasis and identify pivotal biomarkers for cancer therapy is increasing in importance. Single-omics analyses, constrained by their focus on a single biological layer, cannot fully elucidate the complexities of tumour molecular profiles and can thus overlook crucial molecular targets. In response to this limitation, we developed a multiobjective recommendation system (RJH-Metastasis 1.0) anchored in a multiomics knowledge graph to integrate genome, transcriptome, and proteome data and corroborative literature evidence and then conducted comprehensive analyses of colorectal cancer with liver metastasis (CRCLM). A total of 25 key genes significantly associated with CRCLM were recommended by our system, and GNB1, GATAD2A, GBP2, MACROD1, and EIF5B were further highlighted. Specifically, GNB1 presented fewer mutations but elevated RNA transcription and protein expression in CRCLM patients. The role of GNB1 in promoting the malignant behaviours of colon cancer cells was demonstrated via in vitro and in vivo studies. Aberrant expression of GNB1 could be regulated by METTL1-driven m7G modification. METTL1 knockdown decreased m7G modification in the 3' UTR of GNB1, increasing its mRNA transcription and translation during liver metastasis. Furthermore, GNB1 induced the formation of an immunosuppressive microenvironment by promoting the CLEC2C-KLRB1 interaction between memory B cells and KLRB1+PD-1+CD8+ cells. GNB1 expression and the efficacy of PD-1 antibody-based treatment in CRCLM patients were significantly correlated. In summary, our recommendation system can be used for effective exploration of key molecules in colorectal cancer, among which GNB1 was identified as a critical CRCLM promoter and immunotherapy biomarker in colorectal cancer patients.
Affiliation(s)
- Feng Qi
- Department of Oncology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, P. R. China.
| | - Na Gao
- Department of Laboratory Medicine, Zhongnan Hospital of Wuhan University, Wuhan University, Wuhan, 430071, P. R. China
| | - Jia Li
- Department of Thoracic Surgery, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, P. R. China
| | - Chenfei Zhou
- Department of Oncology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, P. R. China
| | - Jinling Jiang
- Department of Oncology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, P. R. China
| | - Bin Zhou
- Department of Hepatic Surgery IV, Eastern Hepatobiliary Surgery Hospital, Naval Medical University, Shanghai, 200438, P. R. China
| | - Liting Guo
- Department of Oncology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, P. R. China
| | - Xiaohui Feng
- Department of Oncology, Loujiang New City Hospital of Taicang (Taicang Branch of Ruijin Hospital Affiliated with Shanghai Jiao Tong University School of Medicine), Suzhou, 215400, P. R. China
| | - Jun Ji
- Department of Oncology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, P. R. China
| | - Qu Cai
- Department of Oncology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, P. R. China
| | - Liu Yang
- Department of Oncology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, P. R. China
| | - Rongjia Zhu
- Department of Oncology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, P. R. China
| | - Xinyi Que
- Department of Oncology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, P. R. China
| | - Junwei Wu
- Department of Oncology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, P. R. China
| | - Wenqi Xi
- Department of Oncology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, P. R. China.
| | - Wenxing Qin
- Department of Medical Oncology, Fudan University Shanghai Cancer Center, Shanghai, 200032, P. R. China.
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, P. R. China.
| | - Jun Zhang
- Department of Oncology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, P. R. China.
13
Ahmed KT, Ansari MI, Zhang W. DTI-LM: language model powered drug-target interaction prediction. Bioinformatics 2024; 40:btae533. [PMID: 39221997 PMCID: PMC11520403 DOI: 10.1093/bioinformatics/btae533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2024] [Revised: 08/05/2024] [Accepted: 08/29/2024] [Indexed: 09/04/2024] Open
Abstract
MOTIVATION: The identification and understanding of drug-target interactions (DTIs) play a pivotal role in the drug discovery and development process. Sequence representations of drugs and proteins in computational models offer advantages such as their widespread availability, easier input quality control, and reduced computational resource requirements. These make them efficient and accessible tools for various computational biology and drug discovery applications. Many sequence-based DTI prediction methods have been developed over the years. Despite the advancement in methodology, cold start DTI prediction involving an unknown drug or protein remains a challenging task, particularly for sequence-based models. Introducing DTI-LM, a novel framework leveraging advanced pretrained language models, we harness their exceptional context-capturing abilities along with neighborhood information to predict DTIs. DTI-LM is specifically designed to rely solely on sequence representations for drugs and proteins, aiming to bridge the gap between warm start and cold start predictions. RESULTS: Large-scale experiments on four datasets show that DTI-LM can achieve state-of-the-art performance on DTI predictions. Notably, it excels in overcoming the common challenges faced by sequence-based models in cold start predictions for proteins, yielding impressive results. The incorporation of neighborhood information through a graph attention network further enhances prediction accuracy. Nevertheless, a disparity persists between cold start predictions for proteins and drugs. A detailed examination of DTI-LM reveals that language models exhibit contrasting capabilities in capturing similarities between drugs and proteins. AVAILABILITY AND IMPLEMENTATION: Source code is available at: https://github.com/compbiolabucf/DTI-LM.
Affiliation(s)
- Khandakar Tanvir Ahmed
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, United States
- Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, United States
| | - Md Istiaq Ansari
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, United States
- Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, United States
| | - Wei Zhang
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, United States
- Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, United States
14
Hu Y, Oleshko S, Firmani S, Zhu Z, Cheng H, Ulmer M, Arnold M, Colomé-Tatché M, Tang J, Xhonneux S, Marsico A. Path-based reasoning for biomedical knowledge graphs with BioPathNet. bioRxiv 2024:2024.06.17.599219. [PMID: 39149355 PMCID: PMC11326122 DOI: 10.1101/2024.06.17.599219] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/17/2024]
Abstract
Understanding complex interactions in biomedical networks is crucial for advancements in biomedicine, but traditional link prediction (LP) methods are limited in capturing this complexity. Representation-based learning techniques improve prediction accuracy by mapping nodes to low-dimensional embeddings, yet they often struggle with interpretability and scalability. We present BioPathNet, a novel graph neural network framework based on the Neural Bellman-Ford Network (NBFNet), addressing these limitations through path-based reasoning for LP in biomedical knowledge graphs. Unlike node-embedding frameworks, BioPathNet learns representations between node pairs by considering all relations along paths, enhancing prediction accuracy and interpretability. This allows visualization of influential paths and facilitates biological validation. BioPathNet leverages a background regulatory graph (BRG) for enhanced message passing and uses stringent negative sampling to improve precision. In evaluations across various LP tasks, such as gene function annotation, drug-disease indication, synthetic lethality, and lncRNA-mRNA interaction prediction, BioPathNet consistently outperformed shallow node embedding methods, relational graph neural networks and task-specific state-of-the-art methods, demonstrating robust performance and versatility. Our study predicts novel drug indications for diseases like acute lymphoblastic leukemia (ALL) and Alzheimer's, validated by medical experts and clinical trials. We also identified new synthetic lethality gene pairs and regulatory interactions involving lncRNAs and target genes, confirmed through literature reviews. BioPathNet's interpretability will enable researchers to trace prediction paths and gain molecular insights, making it a valuable tool for drug discovery, personalized medicine and biology in general.
Affiliation(s)
- Yue Hu
- Computational Health Center, Helmholtz Center Munich, Ingolstaedter Landstrasse 1, Neuherberg, 85764, Bavaria, Germany
- School of Life Sciences, Technical University of Munich, Alte Akademie 8, Freising, 85354, Bavaria, Germany
| | - Svitlana Oleshko
- Computational Health Center, Helmholtz Center Munich, Ingolstaedter Landstrasse 1, Neuherberg, 85764, Bavaria, Germany
- School of Computation, Information and Technology, Technical University of Munich, Arcisstrasse 21, Munich, 80333, Bavaria, Germany
| | - Samuele Firmani
- Computational Health Center, Helmholtz Center Munich, Ingolstaedter Landstrasse 1, Neuherberg, 85764, Bavaria, Germany
| | - Zhaocheng Zhu
- Department, Mila - Québec AI Institute, 6666 St-Urbain, Montréal, QC H2S 3H1, Quebec, Canada
- Department, Université de Montréal, 2900, boul. Édouard-Montpetit, Montréal, QC H3T 1J4, Quebec, Canada
| | - Hui Cheng
- School of Computation, Information and Technology, Technical University of Munich, Arcisstrasse 21, Munich, 80333, Bavaria, Germany
| | - Maria Ulmer
- Computational Health Center, Helmholtz Center Munich, Ingolstaedter Landstrasse 1, Neuherberg, 85764, Bavaria, Germany
- School of Life Sciences, Technical University of Munich, Alte Akademie 8, Freising, 85354, Bavaria, Germany
| | - Matthias Arnold
- Computational Health Center, Helmholtz Center Munich, Ingolstaedter Landstrasse 1, Neuherberg, 85764, Bavaria, Germany
- Department of Psychiatry and Behavioural Sciences, Duke University, 905 W Main St., Durham, NC 27701, North Carolina, United States
| | - Maria Colomé-Tatché
- Computational Health Center, Helmholtz Center Munich, Ingolstaedter Landstrasse 1, Neuherberg, 85764, Bavaria, Germany
- School of Life Sciences, Technical University of Munich, Alte Akademie 8, Freising, 85354, Bavaria, Germany
- Faculty of Biology, Ludwig-Maximilian University of Munich, Grosshaderner Str. 2, Planegg-Martinsried, 82152, Bavaria, Germany
| | - Jian Tang
- Department, Mila - Québec AI Institute, 6666 St-Urbain, Montréal, QC H2S 3H1, Quebec, Canada
- Department, CIFAR AI Chair, 661 University Ave, Toronto, ON M5G 1M1, Ontario, Canada
- Department, HEC Montréal, 3000 Chem. de la Côte-Sainte-Catherine, Montréal, QC H3T 2A7, Quebec, Canada
| | - Sophie Xhonneux
- Department, Mila - Québec AI Institute, 6666 St-Urbain, Montréal, QC H2S 3H1, Quebec, Canada
- Department, Université de Montréal, 2900, boul. Édouard-Montpetit, Montréal, QC H3T 1J4, Quebec, Canada
| | - Annalisa Marsico
- Computational Health Center, Helmholtz Center Munich, Ingolstaedter Landstrasse 1, Neuherberg, 85764, Bavaria, Germany
15
Hu X, Sun Z, Nian Y, Wang Y, Dang Y, Li F, Feng J, Yu E, Tao C. Self-Explainable Graph Neural Network for Alzheimer Disease and Related Dementias Risk Prediction: Algorithm Development and Validation Study. JMIR Aging 2024; 7:e54748. [PMID: 38976869 PMCID: PMC11263893 DOI: 10.2196/54748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Revised: 03/31/2024] [Accepted: 06/02/2024] [Indexed: 07/10/2024] Open
Abstract
BACKGROUND: Alzheimer disease and related dementias (ADRD) rank as the sixth leading cause of death in the United States, underlining the importance of accurate ADRD risk prediction. While recent advancements in ADRD risk prediction have primarily relied on imaging analysis, not all patients undergo medical imaging before an ADRD diagnosis. Merging machine learning with claims data can reveal additional risk factors and uncover interconnections among diverse medical codes. OBJECTIVE: The study aims to use graph neural networks (GNNs) with claims data for ADRD risk prediction. Addressing the lack of human-interpretable reasons behind these predictions, we introduce an innovative, self-explainable method to evaluate relationship importance and its influence on ADRD risk prediction. METHODS: We used a variationally regularized encoder-decoder GNN (variational GNN [VGNN]) integrated with our proposed relation importance method for estimating ADRD likelihood. This self-explainable method can provide a feature-importance explanation in the context of ADRD risk prediction, leveraging relational information within a graph. Three scenarios with 1-year, 2-year, and 3-year prediction windows were created to assess the model's efficiency. Random forest (RF) and light gradient boosting machine (LGBM) were used as baselines. By using this method, we further clarify the key relationships for ADRD risk prediction. RESULTS: In scenario 1, the VGNN model showed area under the receiver operating characteristic (AUROC) scores of 0.7272 and 0.7480 for the small subset and the matched cohort data set. It outperformed RF and LGBM by 10.6% and 9.1%, respectively, on average. In scenario 2, it achieved AUROC scores of 0.7125 and 0.7281, surpassing the other models by 10.5% and 8.9%, respectively. Similarly, in scenario 3, AUROC scores of 0.7001 and 0.7187 were obtained, exceeding the baseline models by 10.1% and 8.5%, respectively. These results clearly demonstrate the significant superiority of the graph-based approach over the tree-based models (RF and LGBM) in predicting ADRD. Furthermore, the integration of the VGNN model and our relation importance interpretation could provide valuable insight into paired factors that may contribute to or delay ADRD progression. CONCLUSIONS: Using our innovative self-explainable method with claims data enhances ADRD risk prediction and provides insights into the impact of interconnected medical code relationships. This methodology not only enables ADRD risk modeling but also shows potential for other image analysis predictions using claims data.
Affiliation(s)
- Xinyue Hu
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, United States
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Zenan Sun
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Yi Nian
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Yichen Wang
- Division of Hospital Medicine at Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, United States
| | - Yifang Dang
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Fang Li
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, United States
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Jingna Feng
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, United States
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Evan Yu
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Cui Tao
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, United States
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
16
Gualdi F, Oliva B, Piñero J. Predicting gene disease associations with knowledge graph embeddings for diseases with curtailed information. NAR Genom Bioinform 2024; 6:lqae049. [PMID: 38745993 PMCID: PMC11091931 DOI: 10.1093/nargab/lqae049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Revised: 03/08/2024] [Accepted: 04/24/2024] [Indexed: 05/16/2024] Open
Abstract
Knowledge graph embeddings (KGE) are a powerful technique used in the biomedical domain to represent biological knowledge in a low-dimensional space. However, a deep understanding of these methods is still missing, particularly regarding their application to prioritizing genes associated with complex diseases for which genetic information is limited. In this contribution, we built a knowledge graph (KG) by integrating heterogeneous biomedical data and generated KGE by implementing state-of-the-art methods and two novel algorithms: Dlemb and BioKG2vec. Extensive testing of the embeddings with unsupervised clustering and supervised methods showed that KGE can be successfully implemented to predict genes associated with diseases and that our novel approaches outperform most existing algorithms in both scenarios. Our findings underscore the significance of data quality, preprocessing, and integration in achieving accurate predictions. Additionally, we applied KGE to predict genes linked to Intervertebral Disc Degeneration (IDD) and illustrated that functions pertinent to the disease are enriched within the prioritized gene set.
Affiliation(s)
- Francesco Gualdi
- Integrative Biomedical Informatics, Research Programme on Biomedical Informatics (IBI-GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain
- Structural Bioinformatics Lab, Research Programme on Biomedical Informatics (SBI-GRIB), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain
| | - Baldomero Oliva
- Structural Bioinformatics Lab, Research Programme on Biomedical Informatics (SBI-GRIB), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain
| | - Janet Piñero
- Integrative Biomedical Informatics, Research Programme on Biomedical Informatics (IBI-GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain
- Medbioinformatics Solutions SL, Barcelona, Spain
17
Huan JM, Wang XJ, Li Y, Zhang SJ, Hu YL, Li YL. The biomedical knowledge graph of symptom phenotype in coronary artery plaque: machine learning-based analysis of real-world clinical data. BioData Min 2024; 17:13. [PMID: 38773619 PMCID: PMC11110203 DOI: 10.1186/s13040-024-00365-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Accepted: 05/17/2024] [Indexed: 05/24/2024] Open
Abstract
A knowledge graph can effectively showcase the essential characteristics of data and is increasingly emerging as a significant means of integrating information in the field of artificial intelligence. Coronary artery plaque represents a significant etiology of cardiovascular events, posing a diagnostic challenge for clinicians who are confronted with a multitude of nonspecific symptoms. To visualize the hierarchical relationship network graph of the molecular mechanisms underlying plaque properties and symptom phenotypes, patient symptomatology was extracted from electronic health record data from real-world clinical settings. Phenotypic networks were constructed utilizing clinical data and protein‒protein interaction networks. Machine learning techniques, including convolutional neural networks, Dijkstra's algorithm, and gene ontology semantic similarity, were employed to quantify clinical and biological features within the network. The resulting features were then utilized to train a K-nearest neighbor model, yielding 23 symptoms, 41 association rules, and 61 hub genes across the three types of plaques studied, achieving an area under the curve of 92.5%. Weighted correlation network analysis and pathway enrichment were subsequently utilized to identify lipid status-related genes and inflammation-associated pathways that could help explain the differences in plaque properties. To confirm the validity of the network graph model, we conducted coexpression analysis of the hub genes to evaluate their potential diagnostic value. Additionally, we investigated immune cell infiltration, examined the correlations between hub genes and immune cells, and validated the reliability of the identified biological pathways. By integrating clinical data and molecular network information, this biomedical knowledge graph model effectively elucidated the potential molecular mechanisms that link symptoms, diseases, and molecules.
Affiliation(s)
- Jia-Ming Huan
- First School of Clinical Medicine, Shandong University of Traditional Chinese Medicine, Jinan, 250355, China
| | - Xiao-Jie Wang
- First School of Clinical Medicine, Shandong University of Traditional Chinese Medicine, Jinan, 250355, China
| | - Yuan Li
- First School of Clinical Medicine, Shandong University of Traditional Chinese Medicine, Jinan, 250355, China
| | - Shi-Jun Zhang
- First School of Clinical Medicine, Shandong University of Traditional Chinese Medicine, Jinan, 250355, China
| | - Yuan-Long Hu
- First School of Clinical Medicine, Shandong University of Traditional Chinese Medicine, Jinan, 250355, China
| | - Yun-Lun Li
- First School of Clinical Medicine, Shandong University of Traditional Chinese Medicine, Jinan, 250355, China.
- Department of Cardiovascular, Affiliated Hospital of Shandong University of Traditional Chinese Medicine, Jinan, 250014, China.
- Precision Diagnosis and Treatment of Cardiovascular Diseases with Traditional Chinese Medicine Shandong Engineering Research Center, Jinan, 250355, China.
18
Xia Y, Pan X, Shen HB. Heterogeneous sampled subgraph neural networks with knowledge distillation to enhance double-blind compound-protein interaction prediction. Structure 2024; 32:611-620.e4. [PMID: 38447575 DOI: 10.1016/j.str.2024.02.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Revised: 12/18/2023] [Accepted: 02/08/2024] [Indexed: 03/08/2024]
Abstract
Identifying binding compounds against a target protein is crucial for large-scale virtual screening in drug development. Recently, network-based methods have been developed for compound-protein interaction (CPI) prediction. However, they are difficult to apply to unseen (i.e., never-seen-before) proteins and compounds. In this study, we propose SgCPI to incorporate local known interacting networks to predict CPIs. SgCPI randomly samples the local CPI network of the query compound-protein pair as a subgraph and applies a heterogeneous graph neural network (HGNN) to embed the active/inactive message of the subgraph. For unseen compounds and proteins, SgCPI-KD takes SgCPI as the teacher model and distills its knowledge by estimating the potential neighbors. Experimental results indicate: (1) the sampled subgraphs of the CPI network introduce efficient knowledge for unseen molecular prediction with the HGNNs, and (2) the knowledge distillation strategy is beneficial to the double-blind interaction prediction by estimating molecular neighbors and distilling knowledge.
Affiliation(s)
- Ying Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
- Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.
19
Liu JX, Zhang X, Huang YQ, Hao GF, Yang GF. Multi-level bioinformatics resources support drug target discovery of protein-protein interactions. Drug Discov Today 2024; 29:103979. [PMID: 38608830 DOI: 10.1016/j.drudis.2024.103979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2024] [Revised: 03/14/2024] [Accepted: 04/05/2024] [Indexed: 04/14/2024]
Abstract
Drug discovery often begins with a new target. Protein-protein interactions (PPIs) are crucial to numerous cellular processes and offer a promising avenue for drug-target discovery. PPIs are characterized by multi-level complexity: at the protein level, interaction networks can be used to identify potential targets, whereas at the residue level, the details of individual PPIs can be used to examine a target's druggability. Great progress has been made in target discovery through multi-level PPI-related computational approaches, but these resources have not been fully discussed. Here, we systematically survey bioinformatics tools for identifying and assessing potential drug targets, examining their characteristics, limitations and applications. This work will aid the integration of the broader protein-to-network context with the analysis of detailed binding mechanisms to support the discovery of drug targets.
Affiliation(s)
- Jia-Xin Liu
- National Key Laboratory of Green Pesticide, Key Laboratory of Pesticide & Chemical Biology, Ministry of Education, International Joint Research Center for Intelligent Biosensor Technology and Health, Central China Normal University, Wuhan 430079, PR China
- Xiao Zhang
- State Key Laboratory of Green Pesticide, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for R&D of Fine Chemicals, Guizhou University, Guiyang 550025, PR China
- Yuan-Qin Huang
- State Key Laboratory of Green Pesticide, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for R&D of Fine Chemicals, Guizhou University, Guiyang 550025, PR China
- Ge-Fei Hao
- National Key Laboratory of Green Pesticide, Key Laboratory of Pesticide & Chemical Biology, Ministry of Education, International Joint Research Center for Intelligent Biosensor Technology and Health, Central China Normal University, Wuhan 430079, PR China; State Key Laboratory of Green Pesticide, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for R&D of Fine Chemicals, Guizhou University, Guiyang 550025, PR China.
- Guang-Fu Yang
- National Key Laboratory of Green Pesticide, Key Laboratory of Pesticide & Chemical Biology, Ministry of Education, International Joint Research Center for Intelligent Biosensor Technology and Health, Central China Normal University, Wuhan 430079, PR China.
20
Harrigan WL, Ferrell BD, Wommack KE, Polson SW, Schreiber ZD, Belcaid M. Improvements in viral gene annotation using large language models and soft alignments. BMC Bioinformatics 2024; 25:165. [PMID: 38664627 PMCID: PMC11046836 DOI: 10.1186/s12859-024-05779-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 04/12/2024] [Indexed: 04/28/2024] Open
Abstract
BACKGROUND: The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this challenge by annotating protein sequences based on embeddings. RESULTS: Central to our contribution is the soft alignment algorithm, which draws from traditional protein alignment but leverages embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method surpasses pooled embedding-based models not only in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect. CONCLUSION: The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.
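The idea of scoring residue pairs by embedding similarity instead of a substitution matrix can be sketched as follows. This is an illustrative stand-in, not the authors' algorithm: it uses a plain Smith-Waterman-style local alignment, and the `gap` and `offset` parameters, the function names, and the two-dimensional toy embeddings are assumptions (real per-residue embeddings would come from a protein language model).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_local_align(emb_a, emb_b, gap=0.5, offset=0.5):
    """Local alignment where the match score is cosine similarity minus an offset.

    emb_a, emb_b: lists of per-residue embedding vectors.
    Returns (best_score, aligned_index_pairs).
    """
    n, m = len(emb_a), len(emb_b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = cosine(emb_a[i - 1], emb_b[j - 1]) - offset
            H[i][j] = max(0.0, H[i - 1][j - 1] + s, H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the local score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = cosine(emb_a[i - 1], emb_b[j - 1]) - offset
        if math.isclose(H[i][j], H[i - 1][j - 1] + s):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(H[i][j], H[i - 1][j] - gap):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy embeddings: residues 1 and 2 of A match residues 0 and 1 of B.
emb_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emb_b = [[0.0, 1.0], [1.0, 1.0]]
score, pairs = soft_local_align(emb_a, emb_b)
```

Because the traceback returns explicit residue index pairs, the alignment stays inspectable, which is the interpretability property the abstract contrasts with pooled embeddings.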
Affiliation(s)
- William L Harrigan
- Hawai'i Institute of Marine Biology, University of Hawai'i at Mānoa, Honolulu, HI, 96822, USA
- Barbra D Ferrell
- Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA
- K Eric Wommack
- Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA
- Shawn W Polson
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19713, USA
- Zachary D Schreiber
- Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA
- Mahdi Belcaid
- Department of Computer Science, University of Hawai'i at Mānoa, Honolulu, HI, 96822, USA.
21
Luo H, Yin W, Wang J, Zhang G, Liang W, Luo J, Yan C. Drug-drug interactions prediction based on deep learning and knowledge graph: A review. iScience 2024; 27:109148. [PMID: 38405609 PMCID: PMC10884936 DOI: 10.1016/j.isci.2024.109148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024] Open
Abstract
Drug-drug interactions (DDIs) can produce unpredictable pharmacological effects and lead to adverse events that have the potential to cause irreversible damage to the organism. Traditional methods to detect DDIs through biological or pharmacological analysis are time-consuming and expensive; therefore, there is an urgent need to develop computational methods to effectively predict DDIs. Currently, deep learning and knowledge graph techniques, which can effectively extract features of entities, have been widely utilized to develop DDI prediction methods. In this review, we systematically survey DDI prediction research that applies deep learning and knowledge graphs. We first summarize the available biomedical data and public databases related to drugs. Then, we discuss the existing DDI prediction methods that utilize deep learning and knowledge graph techniques and group them into three main classes: deep learning-based methods, knowledge graph-based methods, and methods that combine deep learning with knowledge graphs. We comprehensively analyze the commonly used drug-related data and various DDI prediction methods, and compare these prediction methods on benchmark datasets. Finally, we briefly discuss the challenges related to DDI prediction, including asymmetric DDI prediction and high-order DDI prediction.
Affiliation(s)
- Huimin Luo
- School of Computer and Information Engineering, Henan University, Kaifeng, China
- Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, China
- Weijie Yin
- School of Computer and Information Engineering, Henan University, Kaifeng, China
- Jianlin Wang
- School of Computer and Information Engineering, Henan University, Kaifeng, China
- Academy for Advanced Interdisciplinary Studies, Zhengzhou, China
- Ge Zhang
- School of Computer and Information Engineering, Henan University, Kaifeng, China
- Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, China
- Wenjuan Liang
- School of Computer and Information Engineering, Henan University, Kaifeng, China
- Junwei Luo
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Chaokun Yan
- School of Computer and Information Engineering, Henan University, Kaifeng, China
- Academy for Advanced Interdisciplinary Studies, Zhengzhou, China
22
Alvarez-Mamani E, Dechant R, Beltran-Castañón CA, Ibáñez AJ. Graph embedding on mass spectrometry- and sequencing-based biomedical data. BMC Bioinformatics 2024; 25:1. [PMID: 38166530 PMCID: PMC10763173 DOI: 10.1186/s12859-023-05612-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Accepted: 12/11/2023] [Indexed: 01/04/2024] Open
Abstract
Graph embedding techniques use deep learning algorithms in data analysis to solve problems such as node classification, link prediction, community detection, and visualization. Although typically used in the context of guessing friendships in social media, several applications for graph embedding techniques in biomedical data analysis have emerged. While these approaches remain computationally demanding, several developments in recent years facilitate their application to biomedical data and thus may help advance biological discoveries. Therefore, in this review, we discuss the principles of graph embedding techniques and explore their usefulness for understanding biological network data derived from mass spectrometry and sequencing experiments, the current workhorses of systems biology studies. In particular, we focus on recent examples of characterizing protein-protein interaction networks and predicting novel drug functions.
Affiliation(s)
- Edwin Alvarez-Mamani
- Engineering Department, Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
- Institute for Omics Sciences and Applied Biotechnology (ICOBA PUCP), Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
- Reinhard Dechant
- Institute for Omics Sciences and Applied Biotechnology (ICOBA PUCP), Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
- Calico Life Sciences, 1170 Veterans Blvd, San Francisco, CA, 94080, USA
- Alfredo J Ibáñez
- Institute for Omics Sciences and Applied Biotechnology (ICOBA PUCP), Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru.
- Science Department, Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru.
23
Galluzzo Y. A comprehensive review of the data and knowledge graphs approaches in bioinformatics. Computer Science and Information Systems 2024; 21:1055-1075. [DOI: 10.2298/csis230530027g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2025]
Abstract
The scientific community is currently showing strong interest in constructing knowledge graphs (KGs) from heterogeneous domains (genomic, pharmaceutical, clinical, etc.). The main goal here is to support researchers in gaining an immediate overview of the biomedical and clinical data that can be utilized to construct and extend KGs. An in-depth overview of the available biomedical data and the latest applications of knowledge graphs, from the biological to the clinical context, is provided, showing the most recent methods of representing biomedical knowledge with knowledge graph embeddings (KGEs). Furthermore, this review differentiates biomedical databases based on their construction process (whether manually curated by experts or not), aiming to offer a detailed overview and guide researchers in selecting the appropriate database for their research, considering the specific project needs, available resources, and data complexity. In conclusion, the review highlights current challenges: the integration of different knowledge graphs and the interpretability of predictions of new relations.
24
Abstract
Knowledge graphs represent information in the form of entities and relationships between those entities. Such a representation has multiple potential applications in drug discovery, including democratizing access to biomedical data, contextualizing or visualizing that data, and generating novel insights through the application of machine learning approaches. Knowledge graphs put data into context and therefore offer the opportunity to generate explainable predictions, which is a key topic in contemporary artificial intelligence. In this chapter, we outline some of the factors that need to be considered when constructing biomedical knowledge graphs, examine recent advances in mining such systems to gain insights for drug discovery, and identify potential future areas for further development.
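The entity-and-relationship representation, and the way a connecting path can serve as an explanation for a prediction, can be illustrated with a small sketch. The `KnowledgeGraph` class, the relation names, and the entities (`drug_A`, `protein_X`, `disease_Y`) are hypothetical placeholders, not data or an API from the chapter.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Minimal triple store: each fact is a (head, relation, tail) triple."""

    def __init__(self):
        self._out = defaultdict(list)

    def add(self, head, relation, tail):
        self._out[head].append((relation, tail))

    def neighbors(self, head, relation=None):
        """Tails reachable from head, optionally restricted to one relation."""
        return [t for r, t in self._out.get(head, []) if relation is None or r == relation]

    def two_hop_paths(self, start, first_rel, second_rel):
        """Enumerate start -first_rel-> mid -second_rel-> end paths."""
        paths = []
        for mid in self.neighbors(start, first_rel):
            for end in self.neighbors(mid, second_rel):
                paths.append((start, first_rel, mid, second_rel, end))
        return paths


kg = KnowledgeGraph()
kg.add("drug_A", "targets", "protein_X")
kg.add("protein_X", "associated_with", "disease_Y")
kg.add("drug_A", "targets", "protein_Z")

# Each returned path is a human-readable chain of facts linking a drug to a disease.
explanations = kg.two_hop_paths("drug_A", "targets", "associated_with")
```

Here the only path runs through `protein_X`, so a hypothesis linking `drug_A` to `disease_Y` comes with the intermediate facts that support it; machine learning methods over real graphs typically score such candidate links rather than enumerate them exhaustively.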
Affiliation(s)
- Tim James
- Evotec (UK) Ltd., Abingdon, Oxfordshire, UK.
25
Su C, Hou Y, Levin M, Zhang R, Wang F. Protocol to implement a computational pipeline for biomedical discovery based on a biomedical knowledge graph. STAR Protoc 2023; 4:102666. [PMID: 37883224 PMCID: PMC10630678 DOI: 10.1016/j.xpro.2023.102666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Revised: 09/06/2023] [Accepted: 10/03/2023] [Indexed: 10/28/2023] Open
Abstract
Biomedical knowledge graphs (BKGs) provide a new paradigm for managing abundant biomedical knowledge efficiently. Today's artificial intelligence techniques enable mining BKGs to discover new knowledge. Here, we present a protocol for implementing a computational pipeline for biomedical knowledge discovery (BKD) based on a BKG. We describe steps of the pipeline including data processing, implementing BKD based on knowledge graph embeddings, and prediction result interpretation. We detail how our pipeline can be used for drug repurposing hypothesis generation for Parkinson's disease. For complete details on the use and execution of this protocol, please refer to Su et al. [1].
Affiliation(s)
- Chang Su
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA.
- Yu Hou
- Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Michael Levin
- Bioengineering Department, College of Engineering, Temple University, Philadelphia, PA 19122, USA
- Rui Zhang
- Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Fei Wang
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA.
26
Fu C, Huang Z, van Harmelen F, He T, Jiang X. Food4healthKG: Knowledge graphs for food recommendations based on gut microbiota and mental health. Artif Intell Med 2023; 145:102677. [PMID: 37925207 DOI: 10.1016/j.artmed.2023.102677] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2022] [Revised: 08/05/2023] [Accepted: 10/03/2023] [Indexed: 11/06/2023]
Abstract
Food is increasingly acknowledged as a powerful means to promote and maintain mental health. The introduction of the gut-brain axis has been instrumental in understanding the impact of food on mental health. It is widely reported that food can significantly influence gut microbiota metabolism, thereby playing a pivotal role in maintaining mental health. However, the vast amount of heterogeneous data published in recent research lacks systematic integration and application development. To remedy this, we construct a comprehensive knowledge graph, named Food4healthKG, focusing on food, gut microbiota, and mental diseases. The construction workflow includes the integration of numerous heterogeneous data sources, entity linking to a normalized format, and a well-designed representation of the acquired knowledge. To illustrate the availability of Food4healthKG, we design two case studies: the knowledge query and the food recommendation based on Food4healthKG. Furthermore, we propose two evaluation methods to validate the quality of the results obtained from Food4healthKG. The results demonstrate the system's effectiveness in practical applications, particularly in providing convincing food recommendations based on gut microbiota and mental health. Food4healthKG is accessible at https://github.com/ccszbd/Food4healthKG.
Affiliation(s)
- Chengcheng Fu
- National Engineering Research Center for E-Learning, Central China Normal University, Wuhan, China; School of Computer Science, Central China Normal University, Wuhan, China; Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands; National Language Resources Monitor Research Center for Network Media, Central China Normal University, Wuhan, China
- Zhisheng Huang
- Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands; Clinical Research Center for Mental Disorders, Shanghai Pudong New Area Mental Health Center, Tongji University School of Medicine, Shanghai, China; Deep Blue Technology Group, Shanghai, China
- Frank van Harmelen
- Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- Tingting He
- School of Computer Science, Central China Normal University, Wuhan, China; National Language Resources Monitor Research Center for Network Media, Central China Normal University, Wuhan, China
- Xingpeng Jiang
- School of Computer Science, Central China Normal University, Wuhan, China; National Language Resources Monitor Research Center for Network Media, Central China Normal University, Wuhan, China.
27
Shan W, Shen C, Luo L, Ding P. Multi-task learning for predicting synergistic drug combinations based on auto-encoding multi-relational graphs. iScience 2023; 26:108020. [PMID: 37854693 PMCID: PMC10579440 DOI: 10.1016/j.isci.2023.108020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Revised: 08/26/2023] [Accepted: 09/19/2023] [Indexed: 10/20/2023] Open
Abstract
Combinatorial drug therapy is a promising approach for treating complex diseases by combining drugs with synergistic effects. However, predicting effective drug combinations is challenging due to the complexity of biological systems and the limited understanding of pathophysiological mechanisms and drug targets. In this paper, we propose a computational framework called VGAETF (Variational Graph Autoencoder Tensor Decomposition), which leverages a multi-relational graph to model complex relationships between entities in biological systems and predicts disease-related synergistic drug combinations in an end-to-end manner. In the computational experiments, VGAETF achieved high performance (AUROC [the area under the receiver operating characteristic curve] = 0.9767, AUPR [the area under the precision-recall curve] = 0.9660), outperforming the other compared methods. Moreover, case studies further demonstrated the effectiveness of VGAETF in identifying potential disease-related synergistic drug combinations.
Affiliation(s)
- Wenyu Shan
- School of Computer Science, University of South China, Hengyang, Hunan 421001, China
- Cong Shen
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
- Lingyun Luo
- School of Computer Science, University of South China, Hengyang, Hunan 421001, China
- Hunan Medical Big Data International Science and Technology Innovation Cooperation Base, Hengyang, Hunan 421001, China
- Pingjian Ding
- School of Computer Science, University of South China, Hengyang, Hunan 421001, China
28
Dlamini SB, Saunders CJ, Laguette MJN, Gibbon A, Gamieldien J, Collins M, September AV. Application of an in silico approach identifies a genetic locus within ITGB2, and its interactions with HSPG2 and FGF9, to be associated with anterior cruciate ligament rupture risk. Eur J Sport Sci 2023; 23:2098-2108. [PMID: 36680346 DOI: 10.1080/17461391.2023.2171906] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/22/2023]
Abstract
We developed a Biomedical Knowledge Graph model that is phenotype and biological function-aware by integrating knowledge from multiple domains in a Neo4j graph database. All known human genes were assessed through the model to identify potential new risk genes for anterior cruciate ligament (ACL) ruptures and Achilles tendinopathy (AT). Genes were prioritised and explored in a case-control study comparing participants with ACL ruptures (ACL-R), including a sub-group with non-contact mechanism injuries (ACL-NON), to uninjured control individuals (CON). After gene filtering, 3376 genes, including 411 genes identified through previous whole exome sequencing, were found to be potentially linked to AT and ACL ruptures. Four variants were prioritised: HSPG2:rs2291826A/G, HSPG2:rs2291827G/A, ITGB2:rs2230528C/T and FGF9:rs2274296C/T. The rs2230528 CC genotype was over-represented in the CON group compared to ACL-R (p < 0.001) and ACL-NON (p < 0.001), and the TT genotype and T allele were over-represented in the ACL-R and ACL-NON groups compared to the CON group (p < 0.001). Several significant differences in distributions were noted for the gene-gene interactions: (HSPG2:rs2291826, rs2291827 and ITGB2:rs2230528) and (ITGB2:rs2230528 and FGF9:rs2297429). This study substantiates the efficiency of using a prior knowledge-driven in silico approach to identify candidate genes linked to tendon and ACL injuries. Our biomedical knowledge graph identified and, with further testing, highlighted novel associations with ACL rupture risk of the ITGB2 gene, which has not been explored in a genetic case-control association study. We thus recommend a multistep approach including bioinformatics in conjunction with next generation sequencing technology to improve the discovery potential of genomics technologies in musculoskeletal soft tissue injuries.
Highlights
- A biomedical knowledge graph was modelled for musculoskeletal soft tissue injuries to efficiently identify candidate genes for genetic susceptibility analyses.
- The biomedical knowledge graph and sequencing data identified potential biologically relevant variants to explore susceptibility to common tendon and ligament injuries. Specifically, genetic variants within the ITGB2 and FGF9 genes were associated with ACL risk.
- Novel allele combinations (HSPG2-ITGB2 and ITGB2-FGF9) showcase the potential effect of ITGB2 in influencing risk of ACL rupture.
Affiliation(s)
- Senanile B Dlamini
- Division of Physiological Sciences, Department of Human Biology, University of Cape Town, Cape Town, South Africa
- Department of Human Biology, Health through Physical Activity Lifestyle and Sport Research Centre (HPALS), Newlands, South Africa
- Colleen J Saunders
- Division of Emergency Medicine, Department of Surgery, University of Cape Town, Cape Town, South Africa
- South African National Bioinformatics Institute, University of the Western Cape, Cape Town, South Africa
- Mary-Jessica N Laguette
- Division of Physiological Sciences, Department of Human Biology, University of Cape Town, Cape Town, South Africa
- Department of Human Biology, Health through Physical Activity Lifestyle and Sport Research Centre (HPALS), Newlands, South Africa
- Andrea Gibbon
- Division of Physiological Sciences, Department of Human Biology, University of Cape Town, Cape Town, South Africa
- Junaid Gamieldien
- South African National Bioinformatics Institute, University of the Western Cape, Cape Town, South Africa
- Malcolm Collins
- Division of Physiological Sciences, Department of Human Biology, University of Cape Town, Cape Town, South Africa
- Department of Human Biology, Health through Physical Activity Lifestyle and Sport Research Centre (HPALS), Newlands, South Africa
- Department of Human Biology, International Federation of Sports Medicine (FIMS) Collaborative Centre of Sports Medicine, University of Cape Town, Newlands, South Africa
- Alison V September
- Division of Physiological Sciences, Department of Human Biology, University of Cape Town, Cape Town, South Africa
- Department of Human Biology, Health through Physical Activity Lifestyle and Sport Research Centre (HPALS), Newlands, South Africa
- Department of Human Biology, International Federation of Sports Medicine (FIMS) Collaborative Centre of Sports Medicine, University of Cape Town, Newlands, South Africa
29
Cheng M, Jiang Y, Xu J, Mentis AFA, Wang S, Zheng H, Sahu SK, Liu L, Xu X. Spatially resolved transcriptomics: a comprehensive review of their technological advances, applications, and challenges. J Genet Genomics 2023; 50:625-640. [PMID: 36990426 DOI: 10.1016/j.jgg.2023.03.011] [Citation(s) in RCA: 42] [Impact Index Per Article: 21.0] [Received: 10/05/2022] [Revised: 03/11/2023] [Accepted: 03/16/2023] [Indexed: 03/29/2023]
Abstract
The ability to explore the kingdoms of life is largely driven by innovations and breakthroughs in technology, from the invention of the microscope 350 years ago to the recent emergence of single-cell sequencing, by which the scientific community has been able to visualize life at an unprecedented resolution. Most recently, Spatially Resolved Transcriptomics (SRT) technologies have filled the gap in probing the spatial or even three-dimensional organization of the molecular foundation behind the molecular mysteries of life, including the origin of different cellular populations developed from totipotent cells and human diseases. In this review, we introduce recent progress and challenges in SRT from the perspectives of technologies and bioinformatic tools, as well as representative SRT applications. With the currently fast-moving progress of SRT technologies and promising results from early adopted research projects, we can foresee a bright future for such new tools in understanding life at the most profound analytical level.
Affiliation(s)
- Yujia Jiang
- BGI-Hangzhou, Hangzhou, Zhejiang 310012, China
- Shuai Wang
- BGI-Hangzhou, Hangzhou, Zhejiang 310012, China; College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Sunil Kumar Sahu
- BGI-Shenzhen, Shenzhen, Guangdong 518103, China; State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, Guangdong 518083, China
- Longqi Liu
- BGI-Hangzhou, Hangzhou, Zhejiang 310012, China; College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.
- Xun Xu
- BGI-Hangzhou, Hangzhou, Zhejiang 310012, China; BGI-Shenzhen, Shenzhen, Guangdong 518103, China; College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China; Guangdong Provincial Key Laboratory of Genome Read and Write, Shenzhen, Guangdong 518120, China.
30
Evangelista JE, Xie Z, Marino GB, Nguyen N, Clarke DB, Ma’ayan A. Enrichr-KG: bridging enrichment analysis across multiple libraries. Nucleic Acids Res 2023; 51:W168-W179. [PMID: 37166973 PMCID: PMC10320098 DOI: 10.1093/nar/gkad393] [Citation(s) in RCA: 59] [Impact Index Per Article: 29.5] [Received: 03/06/2023] [Revised: 04/23/2023] [Accepted: 05/02/2023] [Indexed: 05/12/2023] Open
Abstract
Gene and protein set enrichment analysis is a critical step in the analysis of data collected from omics experiments. Enrichr is a popular gene set enrichment analysis web-server search engine that contains hundreds of thousands of annotated gene sets. While Enrichr has been useful in providing enrichment analysis with many gene set libraries from different categories, integrating enrichment results across libraries and domains of knowledge can further hypothesis generation. To this end, Enrichr-KG is a knowledge graph database and a web-server application that combines selected gene set libraries from Enrichr for integrative enrichment analysis and visualization. The enrichment results are presented as subgraphs made of nodes and links that connect genes to their enriched terms. In addition, users of Enrichr-KG can add gene-gene links, as well as predicted genes to the subgraphs. This graphical representation of cross-library results with enriched and predicted genes can illuminate hidden associations between genes and annotated enriched terms from across datasets and resources. Enrichr-KG currently serves 26 gene set libraries from different categories that include transcription, pathways, ontologies, diseases/drugs, and cell types. To demonstrate the utility of Enrichr-KG we provide several case studies. Enrichr-KG is freely available at: https://maayanlab.cloud/enrichr-kg.
Affiliation(s)
- John Erol Evangelista
- Department of Pharmacological Sciences, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, NY, NY, USA
- Zhuorui Xie
- Department of Pharmacological Sciences, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, NY, NY, USA
- Giacomo B Marino
- Department of Pharmacological Sciences, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, NY, NY, USA
- Nhi Nguyen
- Department of Pharmacological Sciences, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, NY, NY, USA
- Daniel J B Clarke
- Department of Pharmacological Sciences, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, NY, NY, USA
- Avi Ma’ayan
- Department of Pharmacological Sciences, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, NY, NY, USA
31
Zhang B, Shi H, Wang H. Machine Learning and AI in Cancer Prognosis, Prediction, and Treatment Selection: A Critical Approach. J Multidiscip Healthc 2023; 16:1779-1791. [PMID: 37398894 PMCID: PMC10312208 DOI: 10.2147/jmdh.s410301] [Citation(s) in RCA: 65] [Impact Index Per Article: 32.5] [Received: 03/15/2023] [Accepted: 06/12/2023] [Indexed: 07/04/2023] Open
Abstract
Cancer is a leading cause of morbidity and mortality worldwide. While progress has been made in the diagnosis, prognosis, and treatment of cancer patients, individualized and data-driven care remains a challenge. Artificial intelligence (AI), which is used to predict and automate many cancers, has emerged as a promising option for improving healthcare accuracy and patient outcomes. AI applications in oncology include risk assessment, early diagnosis, patient prognosis estimation, and treatment selection based on deep knowledge. Machine learning (ML), a subset of AI that enables computers to learn from training data, has been highly effective at predicting various types of cancer, including breast, brain, lung, liver, and prostate cancer. In fact, AI and ML have demonstrated greater accuracy in predicting cancer than clinicians. These technologies also have the potential to improve the diagnosis, prognosis, and quality of life of patients with various illnesses, not just cancer. Therefore, it is important to improve current AI and ML technologies and to develop new programs to benefit patients. This article examines the use of AI and ML algorithms in cancer prediction, including their current applications, limitations, and future prospects.
Affiliation(s)
- Bo Zhang
  - Jinling Institute of Science and Technology, Nanjing City, Jiangsu Province, People’s Republic of China
- Huiping Shi
  - Jinling Institute of Science and Technology, Nanjing City, Jiangsu Province, People’s Republic of China
- Hongtao Wang
  - School of Life Science, Tonghua Normal University, Tonghua City, Jilin Province, People’s Republic of China
|
32
|
Aldughayfiq B, Ashfaq F, Jhanjhi NZ, Humayun M. Capturing Semantic Relationships in Electronic Health Records Using Knowledge Graphs: An Implementation Using MIMIC III Dataset and GraphDB. Healthcare (Basel) 2023; 11:1762. [PMID: 37372880 DOI: 10.3390/healthcare11121762] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2023] [Revised: 06/03/2023] [Accepted: 06/12/2023] [Indexed: 06/29/2023] Open
Abstract
Electronic health records (EHRs) are an increasingly important source of information for healthcare professionals and researchers. However, EHRs are often fragmented, unstructured, and difficult to analyze due to the heterogeneity of the data sources and the sheer volume of information. Knowledge graphs have emerged as a powerful tool for capturing and representing complex relationships within large datasets. In this study, we explore the use of knowledge graphs to capture and represent complex relationships within EHRs. Specifically, we address the following research question: Can a knowledge graph created using the MIMIC III dataset and GraphDB effectively capture semantic relationships within EHRs and enable more efficient and accurate data analysis? We map the MIMIC III dataset to an ontology using text refinement and Protege; then, we create a knowledge graph using GraphDB and use SPARQL queries to retrieve and analyze information from the graph. Our results demonstrate that knowledge graphs can effectively capture semantic relationships within EHRs, enabling more efficient and accurate data analysis. We provide examples of how our implementation can be used to analyze patient outcomes and identify potential risk factors. Our results demonstrate that knowledge graphs are an effective tool for capturing semantic relationships within EHRs, enabling a more efficient and accurate data analysis. Our implementation provides valuable insights into patient outcomes and potential risk factors, contributing to the growing body of literature on the use of knowledge graphs in healthcare. In particular, our study highlights the potential of knowledge graphs to support decision-making and improve patient outcomes by enabling a more comprehensive and holistic analysis of EHR data. Overall, our research contributes to a better understanding of the value of knowledge graphs in healthcare and lays the foundation for further research in this area.
Affiliation(s)
- Bader Aldughayfiq
  - Department of Information Systems, College of Computer and Information Sciences, Jouf University, Sakaka 72388, Saudi Arabia
- Farzeen Ashfaq
  - School of Computer Science-SCS, Taylor's University, Subang Jaya 47500, Malaysia
- N Z Jhanjhi
  - School of Computer Science-SCS, Taylor's University, Subang Jaya 47500, Malaysia
- Mamoona Humayun
  - Department of Information Systems, College of Computer and Information Sciences, Jouf University, Sakaka 72388, Saudi Arabia
|
33
|
Zhang DY, Cui WQ, Hou L, Yang J, Lyu LY, Wang ZY, Linghu KG, He WB, Yu H, Hu YJ. Expanding potential targets of herbal chemicals by node2vec based on herb-drug interactions. Chin Med 2023; 18:64. [PMID: 37264453 PMCID: PMC10233865 DOI: 10.1186/s13020-023-00763-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Accepted: 05/01/2023] [Indexed: 06/03/2023] Open
Abstract
BACKGROUND The identification of chemical-target interaction is key to pharmaceutical research and development, but the unclear materials basis and complex mechanisms of traditional medicine (TM) make it difficult, especially for low-content chemicals which are hard to test in experiments. In this research, we aim to apply the node2vec algorithm in the context of drug-herb interactions for expanding potential targets and taking advantage of molecular docking and experiments for verification. METHODS Regarding the widely reported risks between cardiovascular drugs and herbs, Salvia miltiorrhiza (Danshen, DS) and Ligusticum chuanxiong (Chuanxiong, CX), which are widely used in the treatment of cardiovascular disease (CVD), and approved drugs for CVD form the new dataset as an example. Three data groups DS-drug, CX-drug, and DS-CX-drug were applied to serve as the context of drug-herb interactions for link prediction. Three types of datasets were set under three groups, containing information from chemical-target connection (CTC), chemical-chemical connection (CCC) and protein-protein interaction (PPI) in increasing steps. Five algorithms, including node2vec, were applied as comparisons. Molecular docking and pharmacological experiments were used for verification. RESULTS Node2vec represented the best performance with average AUROC and AP values of 0.91 on the datasets "CTC, CCC, PPI". Targets of 32 herbal chemicals were identified within 43 predicted edges of herbal chemicals and drug targets. Among them, 11 potential chemical-drug target interactions showed better binding affinity by molecular docking. Further pharmacological experiments indicated caffeic acid increased the thermal stability of the protein GGT1 and ligustilide and low-content chemical neocryptotanshinone induced mRNA change of FGF2 and MTNR1A, respectively. 
CONCLUSIONS The analytical framework and methods established in the study provide an important reference for researchers in discovering herb-drug interactions, alerting clinical risks, and understanding complex mechanisms of TM.
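To make the node2vec step above more concrete, here is a minimal sketch of the second-order biased random walk that node2vec uses to sample node neighbourhoods, controlled by a return parameter p and an in-out parameter q. It assumes an unweighted, undirected graph; the toy graph, node labels, and function name are hypothetical and are not the study's data or code. In the full algorithm the sampled walks would then be passed to a skip-gram model to learn embeddings for link prediction.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Sample one biased walk. `adj` maps each node to a set of neighbours."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])  # sorted for reproducibility with a seeded rng
        if not nbrs:
            break
        if len(walk) == 1:
            # First step has no previous node, so it is uniform.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay within one hop of the previous node
            else:
                weights.append(1.0 / q)  # move outward, two hops from the previous node
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph mixing chemical, target, and protein nodes.
toy_graph = {
    "chem_A": {"target_1", "chem_B"},
    "chem_B": {"chem_A", "target_1"},
    "target_1": {"chem_A", "chem_B", "protein_X"},
    "protein_X": {"target_1"},
}
print(node2vec_walk(toy_graph, "chem_A", length=8, p=0.5, q=2.0, rng=random.Random(0)))
```

A low p keeps walks close to where they came from, while a low q pushes them outward; tuning both trades off local against more global neighbourhood structure.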
Affiliation(s)
- Dai-Yan Zhang
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, 999078, Macao, China
- Wen-Qing Cui
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, 999078, Macao, China
- Ling Hou
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, 999078, Macao, China
- Jing Yang
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, 999078, Macao, China
- Li-Yang Lyu
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, 999078, Macao, China
- Ze-Yu Wang
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, 999078, Macao, China
- Ke-Gang Linghu
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, 999078, Macao, China
- Wen-Bin He
  - Shanxi Key Laboratory of Chinese Medicine Encephalopathy, Shanxi University of Chinese Medicine, Taiyuan, China
- Hua Yu
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, 999078, Macao, China
- Yuan-Jia Hu
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, 999078, Macao, China
  - DPM, Faculty of Health Sciences, University of Macau, Macao, China
|
34
|
Su C, Hou Y, Zhou M, Rajendran S, Maasch JRA, Abedi Z, Zhang H, Bai Z, Cuturrufo A, Guo W, Chaudhry FF, Ghahramani G, Tang J, Cheng F, Li Y, Zhang R, DeKosky ST, Bian J, Wang F. Biomedical discovery through the integrative biomedical knowledge hub (iBKH). iScience 2023; 26:106460. [PMID: 37020958 PMCID: PMC10068563 DOI: 10.1016/j.isci.2023.106460] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 06/16/2022] [Revised: 09/20/2022] [Accepted: 03/16/2023] [Indexed: 04/01/2023] Open
Abstract
The abundance of biomedical knowledge gained from biological experiments and clinical practices is an invaluable resource for biomedicine. The emerging biomedical knowledge graphs (BKGs) provide an efficient and effective way to manage the abundant knowledge in biomedical and life science. In this study, we created a comprehensive BKG called the integrative Biomedical Knowledge Hub (iBKH) by harmonizing and integrating information from diverse biomedical resources. To make iBKH easily accessible for biomedical research, we developed a web-based, user-friendly graphical portal that allows fast and interactive knowledge retrieval. Additionally, we also implemented an efficient and scalable graph learning pipeline for discovering novel biomedical knowledge in iBKH. As a proof of concept, we performed our iBKH-based method for computational in-silico drug repurposing for Alzheimer's disease. The iBKH is publicly available.
Affiliation(s)
- Chang Su
  - Department of Health Service Administration and Policy, College of Public Health, Temple University, Philadelphia, PA 19122, USA
- Yu Hou
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
  - Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Manqi Zhou
  - Department of Computational Biology, Cornell University, Ithaca, NY 14850, USA
- Suraj Rajendran
  - Tri-Institutional Computational Biology & Medicine Program, Cornell University, New York, NY 10065, USA
- Zehra Abedi
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
- Haotan Zhang
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Zilong Bai
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
- Winston Guo
  - Department of Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Fayzan F. Chaudhry
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Gregory Ghahramani
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Jian Tang
  - Mila-Quebec AI Institute and HEC Montreal, Montreal, QC H2S 3H1, Canada
- Feixiong Cheng
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
  - Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA
  - Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA
- Yue Li
  - School of Computer Science, McGill University, Montreal, QC H3A 0C6, Canada
- Rui Zhang
  - Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Steven T. DeKosky
  - Department of Neurology, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Jiang Bian
  - Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Fei Wang
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
|
35
|
Peng C, Xia F, Naseriparsa M, Osborne F. Knowledge Graphs: Opportunities and Challenges. Artif Intell Rev 2023; 56:1-32. [PMID: 37362886 PMCID: PMC10068207 DOI: 10.1007/s10462-023-10465-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/09/2023] [Indexed: 04/05/2023]
Abstract
With the explosive growth of artificial intelligence (AI) and big data, it has become vitally important to organize and represent the enormous volume of knowledge appropriately. As graph data, knowledge graphs accumulate and convey knowledge of the real world. It has been well recognized that knowledge graphs effectively represent complex information; hence, they have rapidly gained the attention of academia and industry in recent years. Thus, to develop a deeper understanding of knowledge graphs, this paper presents a systematic overview of this field. Specifically, we focus on the opportunities and challenges of knowledge graphs. We first review the opportunities of knowledge graphs in terms of two aspects: (1) AI systems built upon knowledge graphs; (2) potential application fields of knowledge graphs. Then, we thoroughly discuss severe technical challenges in this field, such as knowledge graph embeddings, knowledge acquisition, knowledge graph completion, knowledge fusion, and knowledge reasoning. We expect that this survey will shed new light on future research and the development of knowledge graphs.
Affiliation(s)
- Ciyuan Peng
  - Institute of Innovation, Science and Sustainability, Federation University Australia, Ballarat, 3353 VIC Australia
- Feng Xia
  - School of Computing Technologies, RMIT University, Melbourne, 3000 VIC Australia
- Mehdi Naseriparsa
  - Global Professional School, Federation University Australia, Ballarat, 3353 VIC Australia
- Francesco Osborne
  - Knowledge Media Institute, The Open University, Milton Keynes, MK7 6AA UK
|
36
|
Sanders LM, Scott RT, Yang JH, Qutub AA, Garcia Martin H, Berrios DC, Hastings JJA, Rask J, Mackintosh G, Hoarfrost AL, Chalk S, Kalantari J, Khezeli K, Antonsen EL, Babdor J, Barker R, Baranzini SE, Beheshti A, Delgado-Aparicio GM, Glicksberg BS, Greene CS, Haendel M, Hamid AA, Heller P, Jamieson D, Jarvis KJ, Komarova SV, Komorowski M, Kothiyal P, Mahabal A, Manor U, Mason CE, Matar M, Mias GI, Miller J, Myers JG, Nelson C, Oribello J, Park SM, Parsons-Wingerter P, Prabhu RK, Reynolds RJ, Saravia-Butler A, Saria S, Sawyer A, Singh NK, Snyder M, Soboczenski F, Soman K, Theriot CA, Van Valen D, Venkateswaran K, Warren L, Worthey L, Zitnik M, Costes SV. Biological research and self-driving labs in deep space supported by artificial intelligence. Nat Mach Intell 2023. [DOI: 10.1038/s42256-023-00618-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/28/2023]
|
37
|
A Quick Prototype for Assessing OpenIE Knowledge Graph-Based Question-Answering Systems. Information 2023. [DOI: 10.3390/info14030186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2023] Open
Abstract
Due to the rapid growth of knowledge graphs (KG) as representational learning methods in recent years, question-answering approaches have received increasing attention from academia and industry. Question-answering systems use knowledge graphs to organize, navigate, search and connect knowledge entities. Managing such systems requires a thorough understanding of the underlying graph-oriented structures and, at the same time, an appropriate query language, such as SPARQL, to access relevant data. Natural language interfaces are needed to enable non-technical users to query ever more complex data. The paper proposes a question-answering approach to support end users in querying graph-oriented knowledge bases. The system pipeline is composed of two main modules: one is dedicated to translating a natural language query submitted by the user into a triple of the form <subject, predicate, object>, while the second module implements knowledge graph embedding (KGE) models, exploiting the previous module triple and retrieving the answer to the question. Our framework delivers a fast OpenIE-based knowledge extraction system and a graph-based answer prediction model for question-answering tasks. The system was designed by leveraging existing tools to accomplish a simple prototype for fast experimentation, especially across different knowledge domains, with the added benefit of reducing development time and costs. The experimental results confirm the effectiveness of the proposed system, which provides promising performance, as assessed at the module level. In particular, in some cases, the system outperforms the literature. Finally, a use case example shows the KG generated by user questions in a graphical interface provided by an ad-hoc designed web application.
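The two-module pipeline above ends with a knowledge graph embedding model that completes a <subject, predicate, object> triple. As an illustrative sketch, the code below scores candidate objects with TransE, one widely used KGE model; the abstract does not say which models the authors used, and the hand-set two-dimensional vectors, entity names, and function names are hypothetical stand-ins for trained embeddings.

```python
import math


def transe_score(h, r, t):
    """TransE plausibility: negative Euclidean distance between h + r and t."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation, entity_vecs, relation_vecs):
    """Rank every known entity as a candidate object for (head, relation, ?)."""
    h = entity_vecs[head]
    r = relation_vecs[relation]
    return sorted(
        entity_vecs,
        key=lambda e: transe_score(h, r, entity_vecs[e]),
        reverse=True,  # highest score (smallest distance) first
    )


# Hand-set toy vectors; a real system would learn these from the graph.
entity_vecs = {"Rome": (0.0, 0.0), "Italy": (1.0, 0.0), "France": (0.0, 3.0)}
relation_vecs = {"capital_of": (1.0, 0.0)}

# Question "Rome is the capital of what?" as the partial triple <Rome, capital_of, ?>.
print(rank_tails("Rome", "capital_of", entity_vecs, relation_vecs))
```

The head entity itself is left in the candidate list for simplicity; a practical system would usually filter it out along with triples already known to be true.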
|
38
|
Carvalho RMS, Oliveira D, Pesquita C. Knowledge Graph Embeddings for ICU readmission prediction. BMC Med Inform Decis Mak 2023; 23:12. [PMID: 36658526 PMCID: PMC9850812 DOI: 10.1186/s12911-022-02070-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/31/2022] [Accepted: 11/28/2022] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND Intensive Care Unit (ICU) readmissions represent both a health risk for patients, with increased mortality rates and overall health deterioration, and a financial burden for healthcare facilities. As healthcare became more data-driven with the introduction of Electronic Health Records (EHR), machine learning methods have been applied to predict ICU readmission risk. However, these methods disregard the meaning and relationships of data objects and work blindly over clinical data without taking into account scientific knowledge and context. Ontologies and Knowledge Graphs can help bridge this gap between data and scientific context, as they are computational artefacts that represent the entities of a domain and their relationships to each other in a formalized way. METHODS AND RESULTS We have developed an approach that enriches EHR data with semantic annotations to ontologies to build a Knowledge Graph. A patient's ICU stay is represented by Knowledge Graph embeddings in a contextualized manner, which are used by machine learning models to predict 30-day ICU readmissions. This approach is based on several contributions: (1) an enrichment of the MIMIC-III dataset with patient-oriented annotations to various biomedical ontologies; (2) a Knowledge Graph that defines patient data with biomedical ontologies; (3) a predictive model of ICU readmission risk that uses Knowledge Graph embeddings; (4) a variant of the predictive model that targets different time points during an ICU stay. Our predictive approaches outperformed both a baseline and state-of-the-art works achieving a mean Area Under the Receiver Operating Characteristic Curve of 0.827 and an Area Under the Precision-Recall Curve of 0.691.
The application of this novel approach to help clinicians decide whether a patient can be discharged has the potential to prevent the readmission of [Formula: see text] of Intensive Care Unit patients, without unnecessarily prolonging the stay of those who would not require it. CONCLUSION The coupling of semantic annotation and Knowledge Graph embeddings affords two clear advantages: they consider scientific context and they are able to build representations of EHR information of different types in a common format. This work demonstrates the potential for impact that integrating ontologies and Knowledge Graphs into clinical machine learning applications can have.
Affiliation(s)
- Ricardo M. S. Carvalho
  - LASIGE, Faculty of Sciences, University of Lisbon, Lisbon, Portugal
- Daniela Oliveira
  - LASIGE, Faculty of Sciences, University of Lisbon, Lisbon, Portugal
- Catia Pesquita
  - LASIGE, Faculty of Sciences, University of Lisbon, Lisbon, Portugal
|
39
|
Li G, Siddharth L, Luo J. Embedding knowledge graph of patent metadata to measure knowledge proximity. J Assoc Inf Sci Technol 2023. [DOI: 10.1002/asi.24736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/18/2023]
Affiliation(s)
- Guangtong Li
  - Data‐Driven Innovation Lab, Engineering Product Development Pillar, Singapore University of Technology and Design, Singapore, Singapore
- L. Siddharth
  - Data‐Driven Innovation Lab, Engineering Product Development Pillar, Singapore University of Technology and Design, Singapore, Singapore
- Jianxi Luo
  - Data‐Driven Innovation Lab, Engineering Product Development Pillar, Singapore University of Technology and Design, Singapore, Singapore
|
40
|
Nourani E, Asgari E, McHardy AC, Mofrad MRK. TripletProt: Deep Representation Learning of Proteins Based On Siamese Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3744-3753. [PMID: 34460382 DOI: 10.1109/tcbb.2021.3108718] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/13/2023]
Abstract
Pretrained representations have recently gained attention in various machine learning applications. Nonetheless, the high computational costs associated with training these models have motivated alternative approaches for representation learning. Herein we introduce TripletProt, a new approach for protein representation learning based on the Siamese neural networks. Representation learning of biological entities which capture essential features can alleviate many of the challenges associated with supervised learning in bioinformatics. The most important distinction of our proposed method is relying on the protein-protein interaction (PPI) network. The computational cost of the generated representations for any potential application is significantly lower than comparable methods since the length of the representations is significantly smaller than that in other approaches. TripletProt offers great potentials for the protein informatics tasks and can be widely applied to similar tasks. We evaluate TripletProt comprehensively in protein functional annotation tasks including sub-cellular localization (14 categories) and gene ontology prediction (more than 2000 classes), which are both challenging multi-class, multi-label classification machine learning problems. We compare the performance of TripletProt with the state-of-the-art approaches including a recurrent language model-based approach (i.e., UniRep), as well as a protein-protein interaction (PPI) network and sequence-based method (i.e., DeepGO). Our TripletProt showed an overall improvement of F1 score in the above mentioned comprehensive functional annotation tasks, solely relying on the PPI network. Availability: The source code and datasets are available at https://github.com/EsmaeilNourani/TripletProt.
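Siamese networks of the kind named above are often trained with a triplet objective that pulls an anchor towards a related example and pushes it away from an unrelated one. The sketch below shows that standard margin-based triplet loss on plain vectors. It is an assumption that TripletProt uses this exact form (the abstract does not give the loss), and the protein names and embedding values are made up for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical embeddings: protein_b interacts with protein_a, protein_c does not.
protein_a = (0.0, 0.0)
protein_b = (0.0, 1.0)
protein_c = (0.0, 1.5)

print(triplet_loss(protein_a, protein_b, protein_c))
```

In a PPI-based setting, positives could be drawn from interacting partners and negatives from non-interacting proteins, so that network proximity shapes the embedding space; some implementations use squared distances instead.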
|
41
|
Abstract
As genetic circuits become more sophisticated, the size and complexity of data about their designs increase. The data captured goes beyond genetic sequences alone; information about circuit modularity and functional details improves comprehension, performance analysis, and design automation techniques. However, new data types expose new challenges around the accessibility, visualization, and usability of design data (and metadata). Here, we present a method to transform circuit designs into networks and showcase its potential to enhance the utility of design data. Since networks are dynamic structures, initial graphs can be interactively shaped into subnetworks of relevant information based on requirements such as the hierarchy of biological parts or interactions between entities. A significant advantage of a network approach is the ability to scale abstraction, providing an automatic sliding level of detail that further tailors the visualization to a given situation. Additionally, several visual changes can be applied, such as coloring or clustering nodes based on types (e.g., genes or promoters), resulting in easier comprehension from a user perspective. This approach allows circuit designs to be coupled to other networks, such as metabolic pathways or implementation protocols captured in graph-like formats. We advocate using networks to structure, access, and improve synthetic biology information.
Affiliation(s)
- Matthew Crowther
  - School of Computing, Newcastle University, Newcastle Upon Tyne NE4 5TG, United Kingdom
  - Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid, Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC), Pozuelo de Alarcón, 28223 Madrid, Spain
- Anil Wipat
  - School of Computing, Newcastle University, Newcastle Upon Tyne NE4 5TG, United Kingdom
- Ángel Goñi-Moreno
  - Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid, Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC), Pozuelo de Alarcón, 28223 Madrid, Spain
|
42
|
Gao Z, Ding P, Xu R. KG-Predict: A knowledge graph computational framework for drug repurposing. J Biomed Inform 2022; 132:104133. [PMID: 35840060 PMCID: PMC9595135 DOI: 10.1016/j.jbi.2022.104133] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 11/22/2021] [Revised: 06/18/2022] [Accepted: 07/03/2022] [Indexed: 11/26/2022]
Abstract
The emergence of large-scale phenotypic, genetic, and other multi-model biochemical data has offered unprecedented opportunities for drug discovery including drug repurposing. Various knowledge graph-based methods have been developed to integrate and analyze complex and heterogeneous data sources to find new therapeutic applications for existing drugs. However, existing methods have limitations in modeling and capturing context-sensitive inter-relationships among tens of thousands of biomedical entities. In this paper, we developed KG-Predict: a knowledge graph computational framework for drug repurposing. We first integrated multiple types of entities and relations from various genotypic and phenotypic databases to construct a knowledge graph termed GP-KG. GP-KG was composed of 1,246,726 associations between 61,146 entities. KG-Predict then aggregated the heterogeneous topological and semantic information from GP-KG to learn low-dimensional representations of entities and relations, and further utilized these representations to infer new drug-disease interactions. In cross-validation experiments, KG-Predict achieved high performances [AUROC (the area under receiver operating characteristic) = 0.981, AUPR (the area under precision-recall) = 0.409 and MRR (the mean reciprocal rank) = 0.261], outperforming other state-of-art graph embedding methods. We applied KG-Predict in identifying novel repositioned candidate drugs for Alzheimer's disease (AD) and showed that KG-Predict prioritized both FDA-approved and active clinical trial anti-AD drugs among the top (AUROC = 0.868 and AUPR = 0.364).
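Among the metrics reported above, the mean reciprocal rank (MRR) is the least self-explanatory: for each query it takes the reciprocal of the position at which the correct answer appears in the ranked predictions, then averages over queries. The following minimal sketch computes it on made-up rankings; the drug and disease labels and the function name are hypothetical, and this is not the authors' evaluation code.

```python
def mean_reciprocal_rank(ranked_lists, true_items):
    """Average of 1/rank of the correct item; a query contributes 0 if the item is missing."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, true_items):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)  # ranks are 1-based
    return total / len(true_items)


# Toy predictions: each inner list ranks candidate diseases for one drug.
predictions = [
    ["disease_1", "disease_2", "disease_3"],  # correct answer ranked 1st
    ["disease_2", "disease_1", "disease_3"],  # correct answer ranked 2nd
]
truths = ["disease_1", "disease_1"]
print(mean_reciprocal_rank(predictions, truths))
```

Because the metric rewards placing the single correct answer near the top, it tends to be much lower than AUROC on large candidate sets, which is consistent with the gap between the two figures quoted in the abstract.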
Affiliation(s)
- Zhenxiang Gao
  - Center for Artificial Intelligence in Drug Discovery, School of Medicine, Case Western Reserve University, Cleveland, 44106 OH, USA
- Pingjian Ding
  - Center for Artificial Intelligence in Drug Discovery, School of Medicine, Case Western Reserve University, Cleveland, 44106 OH, USA
- Rong Xu
  - Center for Artificial Intelligence in Drug Discovery, School of Medicine, Case Western Reserve University, Cleveland, 44106 OH, USA
|
43
|
Knowledge-graph-based cell-cell communication inference for spatially resolved transcriptomic data with SpaTalk. Nat Commun 2022; 13:4429. [PMID: 35908020 PMCID: PMC9338929 DOI: 10.1038/s41467-022-32111-8] [Citation(s) in RCA: 61] [Impact Index Per Article: 20.3] [Received: 04/20/2022] [Accepted: 07/18/2022] [Indexed: 12/19/2022] Open
Abstract
Spatially resolved transcriptomics provides genetic information in space toward elucidation of the spatial architecture in intact organs and the spatially resolved cell-cell communications mediating tissue homeostasis, development, and disease. To facilitate inference of spatially resolved cell-cell communications, we here present SpaTalk, which relies on a graph network and knowledge graph to model and score the ligand-receptor-target signaling network between spatially proximal cells by dissecting cell-type composition through a non-negative linear model and spatial mapping between single-cell transcriptomic and spatially resolved transcriptomic data. The benchmarked performance of SpaTalk on public single-cell spatial transcriptomic datasets is superior to that of existing inference methods. Then we apply SpaTalk to STARmap, Slide-seq, and 10X Visium data, revealing the in-depth communicative mechanisms underlying normal and disease tissues with spatial structure. SpaTalk can uncover spatially resolved cell-cell communications for single-cell and spot-based spatially resolved transcriptomic data universally, providing valuable insights into spatial inter-cellular tissue dynamics. Cell-cell communication is a vital feature involving numerous biological processes. Here, the authors develop SpaTalk, a cell-cell communication inference method using knowledge graph for spatially resolved transcriptomic data, providing valuable insights into spatial intercellular tissue dynamics.
|
44
|
Pan X, Lin X, Cao D, Zeng X, Yu PS, He L, Nussinov R, Cheng F. Deep learning for drug repurposing: Methods, databases, and applications. WIREs Computational Molecular Science 2022. [DOI: 10.1002/wcms.1597] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/15/2022]
Affiliation(s)
- Xiaoqin Pan
  - School of Computer Science and Engineering, Hunan University, Changsha, Hunan, China
- Xuan Lin
  - School of Computer Science, Xiangtan University, Xiangtan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, China
- Dongsheng Cao
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China
- Xiangxiang Zeng
  - School of Computer Science and Engineering, Hunan University, Changsha, Hunan, China
- Philip S. Yu
  - Department of Computer Science, University of Illinois at Chicago, Chicago, Illinois, USA
- Lifang He
  - Department of Computer Science and Engineering, Lehigh University, Bethlehem, Pennsylvania, USA
- Ruth Nussinov
  - Computational Structural Biology Section, Basic Science Program, Frederick National Laboratory for Cancer Research, National Cancer Institute at Frederick, Frederick, Maryland, USA
  - Department of Human Molecular Genetics and Biochemistry, Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel
- Feixiong Cheng
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio, USA
  - Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, Ohio, USA
  - Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, Ohio, USA
| |
Collapse
|
45
|
Zhu C, Yang Z, Xia X, Li N, Zhong F, Liu L. Multimodal reasoning based on knowledge graph embedding for specific diseases. Bioinformatics 2022; 38:2235-2245. [PMID: 35150235 PMCID: PMC9004655 DOI: 10.1093/bioinformatics/btac085] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 10/14/2021] [Revised: 01/06/2022] [Accepted: 02/07/2022] [Indexed: 02/03/2023]
Abstract
MOTIVATION: Knowledge graphs (KGs) are becoming increasingly important in the biomedical field. Deriving new and reliable knowledge from existing knowledge through KG embedding is a cutting-edge approach. Some methods add a variety of additional information to aid reasoning, which is known as multimodal reasoning. However, few works based on existing biomedical KGs focus on specific diseases. RESULTS: This work develops a construction and multimodal reasoning process for Specific Disease Knowledge Graphs (SDKGs). We construct SDKG-11, an SDKG set including five cancers, six non-cancer diseases, a combined Cancer5 and a combined Diseases11, aiming to discover new reliable knowledge and provide universal pre-trained knowledge for that specific disease field. SDKG-11 is obtained through original triplet extraction, standard entity set construction, entity linking and relation linking. We implement multimodal reasoning by reverse-hyperplane projection for SDKGs based on structure, category and description embeddings. Multimodal reasoning improves pre-existing models on all SDKGs, using the entity prediction task as the evaluation protocol. We verify the model's reliability in discovering new knowledge by manually proofreading predicted drug-gene, gene-disease and disease-drug pairs. Using the embedding results as initialization parameters for biomolecular interaction classification, we demonstrate the universality of the embedding models. AVAILABILITY AND IMPLEMENTATION: The constructed SDKG-11 and the TensorFlow implementation are available from https://github.com/ZhuChaoY/SDKG-11. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chaoyu Zhu
  - Institute of Biomedical Sciences and School of Basic Medical Science, Shanghai Medical College, Fudan University, Shanghai 200032, China
- Zhihao Yang
  - College of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Xiaoqiong Xia
  - Institute of Biomedical Sciences and School of Basic Medical Science, Shanghai Medical College, Fudan University, Shanghai 200032, China
- Nan Li
  - College of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Fan Zhong
  - To whom correspondence should be addressed.
- Lei Liu
  - To whom correspondence should be addressed.
|
46
|
Zhu F, Li F, Deng L, Meng F, Liang Z. Protein Interaction Network Reconstruction with a Structural Gated Attention Deep Model by Incorporating Network Structure Information. J Chem Inf Model 2022; 62:258-273. [PMID: 35005980 DOI: 10.1021/acs.jcim.1c00982] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/22/2022]
Abstract
Protein-protein interactions (PPIs) provide a physical basis of molecular communications for a wide range of biological processes in living cells. Establishing the PPI network has become a fundamental and essential task for a better understanding of biological events and disease pathogenesis. Although many machine learning algorithms have been employed to predict PPIs, models trained with only protein sequence information as features suffer from low robustness and prediction accuracy. In this study, a new deep-learning-based framework named the Structural Gated Attention Deep (SGAD) model was proposed to improve the performance of PPI network reconstruction (PINR). The improved predictive performances were achieved by augmenting multiple protein sequence descriptors, the topological features and information flow of the PPI network, which were further implemented with a gating mechanism to improve its robustness to noise. On 11 independent test data sets and one combined data set, SGAD yielded area under the curve values of approximately 0.83-0.93, outperforming other models. Furthermore, the SGAD ensemble can learn more characteristic information about protein pairs through a two-layer neural network, serving as a powerful tool in the exploration of PPI biological space.
Affiliation(s)
- Fei Zhu
  - School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Feifei Li
  - School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Lei Deng
  - School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Fanwang Meng
  - Department of Chemistry and Chemical Biology, McMaster University, Hamilton, Ontario L8S 4L8, Canada
- Zhongjie Liang
  - Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215006, China
|
47
|
Ye Q, Hsieh CY, Yang Z, Kang Y, Chen J, Cao D, He S, Hou T. A unified drug-target interaction prediction framework based on knowledge graph and recommendation system. Nat Commun 2021; 12:6775. [PMID: 34811351 PMCID: PMC8635420 DOI: 10.1038/s41467-021-27137-3] [Citation(s) in RCA: 107] [Impact Index Per Article: 26.8] [Received: 06/29/2021] [Accepted: 11/05/2021] [Indexed: 02/06/2023]
Abstract
Prediction of drug-target interactions (DTI) plays a vital role in drug development in various areas, such as virtual screening, drug repurposing and identification of potential drug side effects. Although extensive efforts have been invested in perfecting DTI prediction, existing methods still suffer from the high sparsity of DTI datasets and the cold start problem. Here, we develop KGE_NFM, a unified framework for DTI prediction that combines a knowledge graph (KG) and a recommendation system. This framework first learns a low-dimensional representation for various entities in the KG, and then integrates the multimodal information via a neural factorization machine (NFM). KGE_NFM is evaluated under three realistic scenarios, and achieves accurate and robust predictions on four benchmark datasets, especially in the scenario of the cold start for proteins. Our results indicate that KGE_NFM provides valuable insight into integrating KG and recommendation system-based techniques into a unified framework for novel DTI discovery.
Affiliation(s)
- Qing Ye
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
  - College of Control Science and Engineering, Zhejiang University, Hangzhou, 310027, Zhejiang, China
  - State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Chang-Yu Hsieh
  - Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China
- Ziyi Yang
  - Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China
- Yu Kang
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
- Jiming Chen
  - College of Control Science and Engineering, Zhejiang University, Hangzhou, 310027, Zhejiang, China
- Dongsheng Cao
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410013, Hunan, China
- Shibo He
  - College of Control Science and Engineering, Zhejiang University, Hangzhou, 310027, Zhejiang, China
- Tingjun Hou
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
  - State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, 310058, China
|
48
|
Zeng X, Tu X, Liu Y, Fu X, Su Y. Toward better drug discovery with knowledge graph. Curr Opin Struct Biol 2021; 72:114-126. [PMID: 34649044 DOI: 10.1016/j.sbi.2021.09.003] [Citation(s) in RCA: 74] [Impact Index Per Article: 18.5] [Received: 04/22/2021] [Revised: 08/18/2021] [Accepted: 09/06/2021] [Indexed: 01/08/2023]
Abstract
Drug discovery is the process of identifying new drugs. This process is driven by the increasing data from existing chemical libraries and data banks. The knowledge graph has been introduced to the domain of drug discovery to impose an explicit structure for integrating heterogeneous biomedical data. The graph can provide structured relations among multiple entities and unstructured semantic relations associated with entities. In this review, we summarize knowledge graph-based works that implement drug repurposing and adverse drug reaction prediction for drug discovery. As knowledge representation learning is a common way to explore knowledge graphs for prediction problems, we introduce several representative embedding models to provide a comprehensive understanding of knowledge representation learning.
Affiliation(s)
- Xiangxiang Zeng
  - College of Information Science and Engineering, Hunan University, Changsha, 410086, China
- Xinqi Tu
  - College of Information Science and Engineering, Hunan University, Changsha, 410086, China
- Yuansheng Liu
  - College of Information Science and Engineering, Hunan University, Changsha, 410086, China
- Xiangzheng Fu
  - College of Information Science and Engineering, Hunan University, Changsha, 410086, China
- Yansen Su
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Computer Science and Technology, Anhui University, Hefei, 230601, China
|
49
|
Improving Risk Assessment of Miscarriage During Pregnancy with Knowledge Graph Embeddings. Journal of Healthcare Informatics Research 2021; 5:359-381. [DOI: 10.1007/s41666-021-00096-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/13/2020] [Revised: 02/28/2021] [Accepted: 03/03/2021] [Indexed: 01/08/2023]
|
50
|
Biomedical Knowledge Graph Embeddings for Personalized Medicine. Progress in Artificial Intelligence 2021. [DOI: 10.1007/978-3-030-86230-5_46] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|