1
Devarakonda MV, Mohanty S, Sunkishala RR, Mallampalli N, Liu X. Clinical trial recommendations using Semantics-Based inductive inference and knowledge graph embeddings. J Biomed Inform 2024; 154:104627. [PMID: 38561170 DOI: 10.1016/j.jbi.2024.104627] [Received: 10/20/2023] [Revised: 02/06/2024] [Accepted: 03/20/2024] [Indexed: 04/04/2024]
Abstract
OBJECTIVE Designing a new clinical trial entails many decisions, such as defining a cohort and setting the study objectives, to name a few, and therefore can benefit from recommendations based on exhaustive mining of past clinical trial records. This study proposes an approach based on knowledge graph embeddings and semantics-driven inductive inference for generating such recommendations. METHOD The proposed recommendation methodology is based on neural embeddings trained on a first-of-its-kind knowledge graph constructed from clinical trials data. The methodology includes design of a knowledge graph for clinical trial data, evaluation of various knowledge graph embedding techniques for it, application of a novel inductive inference method using these embeddings, and generation of recommendations for clinical trial design. The study uses freely available data from clinicaltrials.gov and related sources. RESULTS The proposed approach for recommendations obtained relevance scores ranging from 70% to 83%. These scores were determined by evaluating the text similarity of recommended elements to actual elements used in clinical trials that are in progress. Furthermore, the most pertinent recommendations were consistently located towards the top of the list, indicating the effectiveness of our method. CONCLUSION Our study suggests that inductive inference using node semantics is a viable approach for generating recommendations using graph neural embeddings, and that there is a potential for improvement in training graph embeddings using node semantics.
Affiliation(s)
- Xiong Liu
- Biomedical Research, Novartis, Cambridge, MA, USA
2
Callahan TJ, Tripodi IJ, Stefanski AL, Cappelletti L, Taneja SB, Wyrwa JM, Casiraghi E, Matentzoglu NA, Reese J, Silverstein JC, Hoyt CT, Boyce RD, Malec SA, Unni DR, Joachimiak MP, Robinson PN, Mungall CJ, Cavalleri E, Fontana T, Valentini G, Mesiti M, Gillenwater LA, Santangelo B, Vasilevsky NA, Hoehndorf R, Bennett TD, Ryan PB, Hripcsak G, Kahn MG, Bada M, Baumgartner WA, Hunter LE. An open source knowledge graph ecosystem for the life sciences. Sci Data 2024; 11:363. [PMID: 38605048 PMCID: PMC11009265 DOI: 10.1038/s41597-024-03171-w] [Received: 07/26/2023] [Accepted: 03/21/2024] [Indexed: 04/13/2024] Open
Abstract
Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoint resources and abstraction algorithms), and benchmarks (e.g., prebuilt KGs). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 different large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.
Affiliation(s)
- Tiffany J Callahan
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA
- Ignacio J Tripodi
- Computer Science Department, Interdisciplinary Quantitative Biology, University of Colorado Boulder, Boulder, CO, 80301, USA
- Adrianne L Stefanski
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Luca Cappelletti
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
- Sanya B Taneja
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, 15260, USA
- Jordan M Wyrwa
- Department of Physical Medicine and Rehabilitation, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Elena Casiraghi
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Justin Reese
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Jonathan C Silverstein
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15206, USA
- Charles Tapley Hoyt
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, 02115, USA
- Richard D Boyce
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15206, USA
- Scott A Malec
- Division of Translational Informatics, University of New Mexico School of Medicine, Albuquerque, NM, 87131, USA
- Deepak R Unni
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Marcin P Joachimiak
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Peter N Robinson
- Berlin Institute of Health at Charité-Universitätsmedizin, 10117, Berlin, Germany
- Christopher J Mungall
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Emanuele Cavalleri
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
- Tommaso Fontana
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
- Giorgio Valentini
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
- ELLIS, European Laboratory for Learning and Intelligent Systems, Milan Unit, Italy
- Marco Mesiti
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
- Lucas A Gillenwater
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Brook Santangelo
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Nicole A Vasilevsky
- Data Collaboration Center, Critical Path Institute, 1840 E River Rd. Suite 100, Tucson, AZ, 85718, USA
- Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Kingdom of Saudi Arabia
- Tellen D Bennett
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Department of Pediatrics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Patrick B Ryan
- Janssen Research and Development, Raritan, NJ, 08869, USA
- George Hripcsak
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA
- Michael G Kahn
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Michael Bada
- Division of General Internal Medicine, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- William A Baumgartner
- Division of General Internal Medicine, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Lawrence E Hunter
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
3
Verma G, Rebholz-Schuhmann D, Madden MG. Enabling personalised disease diagnosis by combining a patient's time-specific gene expression profile with a biomedical knowledge base. BMC Bioinformatics 2024; 25:62. [PMID: 38326757 PMCID: PMC10848462 DOI: 10.1186/s12859-024-05674-0] [Received: 12/11/2022] [Accepted: 01/25/2024] [Indexed: 02/09/2024] Open
Abstract
BACKGROUND Recent developments in the domain of biomedical knowledge bases (KBs) open up new ways to exploit biomedical knowledge that is available in the form of KBs. Significant work has been done in the direction of biomedical KB creation and KB completion, specifically, those having gene-disease associations and other related entities. However, the use of such biomedical KBs in combination with patients' temporal clinical data still largely remains unexplored, but has the potential to immensely benefit medical diagnostic decision support systems. RESULTS We propose two new algorithms, LOADDx and SCADDx, to combine a patient's gene expression data with gene-disease association and other related information available in the form of a KB, to assist personalized disease diagnosis. We have tested both of the algorithms on two KBs and on four real-world gene expression datasets of respiratory viral infection caused by Influenza-like viruses of 19 subtypes. We also compare the performance of proposed algorithms with that of five existing state-of-the-art machine learning algorithms (k-NN, Random Forest, XGBoost, Linear SVM, and SVM with RBF Kernel) using two validation approaches: LOOCV and a single internal validation set. Both SCADDx and LOADDx outperform the existing algorithms when evaluated with both validation approaches. SCADDx is able to detect infections with up to 100% accuracy in the cases of Datasets 2 and 3. Overall, SCADDx and LOADDx are able to detect an infection within 72 h of infection with 91.38% and 92.66% average accuracy respectively considering all four datasets, whereas XGBoost, which performed best among the existing machine learning algorithms, can detect the infection with only 86.43% accuracy on an average. 
CONCLUSIONS We demonstrate how our novel idea of using the most and least differentially expressed genes in combination with a KB can enable identification of the diseases that a patient is most likely to have at a particular time, from a KB with thousands of diseases. Moreover, the proposed algorithms can provide a short ranked list of the most likely diseases for each patient along with their most affected genes, and other entities linked with them in the KB, which can support health care professionals in their decision-making.
Affiliation(s)
- Ghanshyam Verma
- Insight Centre for Data Analytics, School of Computer Science, University of Galway, Galway, Ireland.
- School of Computer Science, University of Galway, Galway, Ireland.
- Michael G Madden
- Insight Centre for Data Analytics, School of Computer Science, University of Galway, Galway, Ireland
- School of Computer Science, University of Galway, Galway, Ireland
4
Daza D, Alivanistos D, Mitra P, Pijnenburg T, Cochez M, Groth P. BioBLP: a modular framework for learning on multimodal biomedical knowledge graphs. J Biomed Semantics 2023; 14:20. [PMID: 38066573 PMCID: PMC10709903 DOI: 10.1186/s13326-023-00301-y] [Received: 06/07/2023] [Accepted: 11/29/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND Knowledge graphs (KGs) are an important tool for representing complex relationships between entities in the biomedical domain. Several methods have been proposed for learning embeddings that can be used to predict new links in such graphs. Some methods ignore valuable attribute data associated with entities in biomedical KGs, such as protein sequences, or molecular graphs. Other works incorporate such data, but assume that entities can be represented with the same data modality. This is not always the case for biomedical KGs, where entities exhibit heterogeneous modalities that are central to their representation in the subject domain. OBJECTIVE We aim to understand how to incorporate multimodal data into biomedical KG embeddings, and analyze the resulting performance in comparison with traditional methods. We propose a modular framework for learning embeddings in KGs with entity attributes, that allows encoding attribute data of different modalities while also supporting entities with missing attributes. We additionally propose an efficient pretraining strategy for reducing the required training runtime. We train models using a biomedical KG containing approximately 2 million triples, and evaluate the performance of the resulting entity embeddings on the tasks of link prediction, and drug-protein interaction prediction, comparing against methods that do not take attribute data into account. RESULTS In the standard link prediction evaluation, the proposed method results in competitive, yet lower performance than baselines that do not use attribute data. When evaluated in the task of drug-protein interaction prediction, the method compares favorably with the baselines. Further analyses show that incorporating attribute data does outperform baselines over entities below a certain node degree, comprising approximately 75% of the diseases in the graph. We also observe that optimizing attribute encoders is a challenging task that increases optimization costs. 
Our proposed pretraining strategy yields significantly higher performance while reducing the required training runtime. CONCLUSION BioBLP allows investigation of different ways of incorporating multimodal biomedical data for learning representations in KGs. With a particular implementation, we find that incorporating attribute data does not consistently outperform baselines, but improvements are obtained on a comparatively large subset of entities below a specific node degree. Our results indicate a potential for improved performance in scientific discovery tasks where understudied areas of the KG would benefit from link prediction methods.
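As a rough illustration of what link prediction over KG embeddings means in the abstract above, the sketch below scores candidate triples with a TransE-style distance. It is a generic textbook scoring function under stated assumptions, not the BioBLP implementation; the entity names and two-dimensional embedding values are hypothetical and chosen only for readability.

```python
import math

def transe_score(head, relation, tail):
    """TransE-style plausibility: negative L2 distance of (head + relation) from tail.

    Higher (closer to zero) means the triple is considered more plausible.
    """
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def rank_tails(head, relation, candidates):
    """Return candidate tail names ordered from most to least plausible."""
    return sorted(
        candidates,
        key=lambda name: transe_score(head, relation, candidates[name]),
        reverse=True,
    )

# Hypothetical toy embeddings (illustrative values only, not learned from data).
drug = [0.1, 0.2]
treats = [0.3, 0.1]
diseases = {"disease_a": [0.4, 0.3], "disease_b": [0.9, 0.9]}

ranking = rank_tails(drug, treats, diseases)
print(ranking)
```

In a trained model the vectors would come from an optimisation over known triples; attribute-aware approaches such as the one described above would additionally derive entity vectors from encoders over sequences, molecular graphs, or text.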
Affiliation(s)
- Daniel Daza
- Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- University of Amsterdam, Amsterdam, The Netherlands
- Discovery Lab, Elsevier, Amsterdam, The Netherlands
- Dimitrios Alivanistos
- Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- Discovery Lab, Elsevier, Amsterdam, The Netherlands
- Michael Cochez
- Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- Discovery Lab, Elsevier, Amsterdam, The Netherlands
- Paul Groth
- University of Amsterdam, Amsterdam, The Netherlands
- Discovery Lab, Elsevier, Amsterdam, The Netherlands
5
Boudin M, Diallo G, Drancé M, Mougin F. The OREGANO knowledge graph for computational drug repurposing. Sci Data 2023; 10:871. [PMID: 38057380 PMCID: PMC10700660 DOI: 10.1038/s41597-023-02757-0] [Received: 07/13/2023] [Accepted: 11/16/2023] [Indexed: 12/08/2023] Open
Abstract
Drug repositioning is a faster and more affordable solution than traditional drug discovery approaches. From this perspective, computational drug repositioning using knowledge graphs is a very promising direction. Knowledge graphs constructed from drug data and information can be used to generate hypotheses (molecule/drug - target links) through link prediction using machine learning algorithms. However, it remains rare to have a holistically constructed knowledge graph using the broadest possible features and drug characteristics, which is freely available to the community. The OREGANO knowledge graph aims at filling this gap. The purpose of this paper is to present the OREGANO knowledge graph, which includes natural compound-related data. The graph was developed from scratch by retrieving data directly from the knowledge sources to be integrated. We therefore designed the expected graph model and proposed a method for merging nodes between the different knowledge sources, and finally, the data were cleaned. The knowledge graph, as well as the source code for the ETL process, is openly available on the GitHub of the OREGANO project (https://gitub.u-bordeaux.fr/erias/oregano).
Affiliation(s)
- Marina Boudin
- AHeaD team, Bordeaux Population Health Inserm U1219, Univ. Bordeaux, F-33000, Bordeaux, France
- Gayo Diallo
- AHeaD team, Bordeaux Population Health Inserm U1219, Univ. Bordeaux, F-33000, Bordeaux, France
- Martin Drancé
- AHeaD team, Bordeaux Population Health Inserm U1219, Univ. Bordeaux, F-33000, Bordeaux, France
- Fleur Mougin
- AHeaD team, Bordeaux Population Health Inserm U1219, Univ. Bordeaux, F-33000, Bordeaux, France
6
Pascazio L, Rihm S, Naseri A, Mosbach S, Akroyd J, Kraft M. Chemical Species Ontology for Data Integration and Knowledge Discovery. J Chem Inf Model 2023; 63:6569-6586. [PMID: 37883649 PMCID: PMC10647085 DOI: 10.1021/acs.jcim.3c00820] [Received: 07/06/2023] [Revised: 10/13/2023] [Accepted: 10/13/2023] [Indexed: 10/28/2023]
Abstract
Web ontologies are important tools in modern scientific research because they provide a standardized way to represent and manage web-scale amounts of complex data. In chemistry, a semantic database for chemical species is indispensable for its ability to interrelate and infer relationships, enabling a more precise analysis and prediction of chemical behavior. This paper presents OntoSpecies, a web ontology designed to represent chemical species and their properties. The ontology serves as a core component of The World Avatar knowledge graph chemistry domain and includes a wide range of identifiers, chemical and physical properties, chemical classifications and applications, and spectral information associated with each species. The ontology includes provenance and attribution metadata, ensuring the reliability and traceability of data. Most of the information about chemical species is sourced from PubChem and ChEBI data on the respective compound Web pages using a software agent, making OntoSpecies a comprehensive semantic database of chemical species able to solve novel types of problems in the field. Access to this reliable source of chemical data is provided through a SPARQL endpoint. The paper presents example use cases to demonstrate the contribution of OntoSpecies in solving complex tasks that require integrated semantically searchable chemical data. The approach presented in this paper represents a significant advancement in the field of chemical data management, offering a powerful tool for representing, navigating, and analyzing chemical information to support scientific research.
Affiliation(s)
- Laura Pascazio
- CARES, Cambridge Centre for Advanced Research and Education in Singapore, 1 Create Way, CREATE Tower, #05-05, Singapore 138602, Singapore
- Simon Rihm
- CARES, Cambridge Centre for Advanced Research and Education in Singapore, 1 Create Way, CREATE Tower, #05-05, Singapore 138602, Singapore
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
- Ali Naseri
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
- Sebastian Mosbach
- CARES, Cambridge Centre for Advanced Research and Education in Singapore, 1 Create Way, CREATE Tower, #05-05, Singapore 138602, Singapore
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
- CMCL Innovations, Sheraton House, Castle Park, Cambridge CB3 0AX, U.K.
- Jethro Akroyd
- CARES, Cambridge Centre for Advanced Research and Education in Singapore, 1 Create Way, CREATE Tower, #05-05, Singapore 138602, Singapore
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
- CMCL Innovations, Sheraton House, Castle Park, Cambridge CB3 0AX, U.K.
- Markus Kraft
- CARES, Cambridge Centre for Advanced Research and Education in Singapore, 1 Create Way, CREATE Tower, #05-05, Singapore 138602, Singapore
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
- CMCL Innovations, Sheraton House, Castle Park, Cambridge CB3 0AX, U.K.
- School of Chemical and Biomedical Engineering, Nanyang Technological University, 62 Nanyang Drive, Singapore 637459, Singapore
- The Alan Turing Institute, 96 Euston Rd., London NW1 2DB, U.K.
7
Diaz Benavides S, Cardoso SD, Da Silveira M, Pruski C. Analysis and implementation of the DynDiff tool when comparing versions of ontology. J Biomed Semantics 2023; 14:15. [PMID: 37770956 PMCID: PMC10537977 DOI: 10.1186/s13326-023-00295-7] [Received: 03/23/2023] [Accepted: 09/09/2023] [Indexed: 09/30/2023] Open
Abstract
BACKGROUND Ontologies play a key role in the management of medical knowledge because they have the properties to support a wide range of knowledge-intensive tasks. The dynamic nature of knowledge requires frequent changes to the ontologies to keep them up-to-date. The challenge is to understand and manage these changes and their impact on dependent systems well in order to handle the growing volume of data annotated with ontologies and the limited documentation describing the changes. METHODS We present a method to detect and characterize the changes occurring between different versions of an ontology together with an ontology of changes entitled DynDiffOnto, designed according to Semantic Web best practices and FAIR principles. We further describe the implementation of the method and the evaluation of the tool with different ontologies from the biomedical domain (i.e. ICD9-CM, MeSH, NCIt, SNOMEDCT, GO, IOBC and CIDO), showing its performance in terms of execution time and capacity to classify ontological changes, compared with other state-of-the-art approaches. RESULTS The experiments show a top-level performance of DynDiff for large ontologies and a good performance for smaller ones, with respect to execution time and capability to identify complex changes. In this paper, we further highlight the impact of ontology matchers on the diff computation and the possibility to parameterize the matcher in DynDiff, making it possible to benefit from state-of-the-art matchers. CONCLUSION DynDiff is an efficient tool to compute differences between ontology versions and classify these differences according to DynDiffOnto concepts. This work also contributes to a better understanding of ontological changes through DynDiffOnto, which was designed to express the semantics of the changes between versions of an ontology and can be used to document the evolution of an ontology.
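To make the idea of comparing two ontology versions concrete, the sketch below computes the most basic kind of diff: triples added and triples removed between versions, each version modelled as a set of (subject, predicate, object) tuples. This is a deliberately simplified illustration and is not the DynDiff algorithm, which the abstract says also identifies complex changes and classifies them against DynDiffOnto; the triples and the `diff_versions` helper are hypothetical.

```python
def diff_versions(old, new):
    """Return basic additions and removals between two sets of triples."""
    return {
        "added": sorted(new - old),      # present only in the newer version
        "removed": sorted(old - new),    # present only in the older version
    }

# Hypothetical toy ontology versions (not taken from any real ontology).
version_1 = {
    ("Flu", "subClassOf", "Infection"),
    ("Flu", "label", "Flu"),
}
version_2 = {
    ("Flu", "subClassOf", "ViralInfection"),
    ("Flu", "label", "Flu"),
}

changes = diff_versions(version_1, version_2)
print(changes)
```

A tool that characterises complex changes would go further, for example by recognising that the removed and added `subClassOf` triples together amount to moving a concept within the hierarchy.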
Affiliation(s)
- Sara Diaz Benavides
- Luxembourg Institute of Science and Technology, 5, avenue des Hauts-Fourneaux, L-4362, Esch-sur-Alzette, Luxembourg
- Silvio D Cardoso
- Dynaccurate, 9, avenue des Hauts-Fourneaux, L-4362, Esch-sur-Alzette, Luxembourg
- Marcos Da Silveira
- Luxembourg Institute of Science and Technology, 5, avenue des Hauts-Fourneaux, L-4362, Esch-sur-Alzette, Luxembourg
- Cédric Pruski
- Luxembourg Institute of Science and Technology, 5, avenue des Hauts-Fourneaux, L-4362, Esch-sur-Alzette, Luxembourg
8
Rongen S, Nikolova N, van der Pas M. Modelling with AAS and RDF in Industry 4.0. Comput Ind 2023. [DOI: 10.1016/j.compind.2023.103910] [Indexed: 04/07/2023]
9
Van Woensel W, Tu SW, Michalowski W, Sibte Raza Abidi S, Abidi S, Alonso JR, Bottrighi A, Carrier M, Edry R, Hochberg I, Rao M, Kingwell S, Kogan A, Marcos M, Martínez Salvador B, Michalowski M, Piovesan L, Riaño D, Terenziani P, Wilk S, Peleg M. A Community-of-Practice-based Evaluation Methodology for Knowledge Intensive Computational Methods and its Application to Multimorbidity Decision Support. J Biomed Inform 2023; 142:104395. [PMID: 37201618 DOI: 10.1016/j.jbi.2023.104395] [Received: 09/24/2022] [Revised: 04/25/2023] [Accepted: 05/15/2023] [Indexed: 05/20/2023]
Abstract
OBJECTIVE The study has dual objectives. Our first objective (1) is to develop a community-of-practice-based evaluation methodology for knowledge-intensive computational methods. We target a whitebox analysis of the computational methods to gain insight on their functional features and inner workings. In more detail, we aim to answer evaluation questions on (i) support offered by computational methods for functional features within the application domain; and (ii) in-depth characterizations of the underlying computational processes, models, data and knowledge of the computational methods. Our second objective (2) involves applying the evaluation methodology to answer questions (i) and (ii) for knowledge-intensive clinical decision support (CDS) methods, which operationalize clinical knowledge as computer interpretable guidelines (CIG); we focus on multimorbidity CIG-based clinical decision support (MGCDS) methods that target multimorbidity treatment plans. MATERIALS AND METHODS Our methodology directly involves the research community of practice in (a) identifying functional features within the application domain; (b) defining exemplar case studies covering these features; and (c) solving the case studies using their developed computational methods; research groups detail their solutions and functional feature support in solution reports. Next, the study authors (d) perform a qualitative analysis of the solution reports, identifying and characterizing common themes (or dimensions) among the computational methods. This methodology is well suited to perform whitebox analysis, as it directly involves the respective developers in studying inner workings and feature support of computational methods. Moreover, the established evaluation parameters (e.g., features, case studies, themes) constitute a re-usable benchmark framework, which can be used to evaluate new computational methods as they are developed.
We applied our community-of-practice-based evaluation methodology on MGCDS methods. RESULTS Six research groups submitted comprehensive solution reports for the exemplar case studies. Solutions for two of these case studies were reported by all groups. We identified four evaluation dimensions: detection of adverse interactions, management strategy representation, implementation paradigms, and human-in-the-loop support. Based on our whitebox analysis, we present answers to the evaluation questions (i) and (ii) for MGCDS methods. DISCUSSION The proposed evaluation methodology includes features of illuminative and comparison-based approaches; focusing on understanding rather than judging/scoring or identifying gaps in current methods. It involves answering evaluation questions with direct involvement of the research community of practice, who participate in setting up evaluation parameters and solving exemplar case studies. Our methodology was successfully applied to evaluate six MGCDS knowledge-intensive computational methods. We established that, while the evaluated methods provide a multifaceted set of solutions with different benefits and drawbacks, no single MGCDS method currently provides a comprehensive solution for MGCDS. CONCLUSION We posit that our evaluation methodology, applied here to gain new insights into MGCDS, can be used to assess other types of knowledge-intensive computational methods and answer other types of evaluation questions. Our case studies can be accessed at our GitHub repository (https://github.com/william-vw/MGCDS).
Affiliation(s)
- Samson W Tu
- Center for BioMedical Informatics Research, Stanford University, Stanford, CA, 94305, USA
- Samina Abidi
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
- Ruth Edry
- Bruce Rappaport Faculty of Medicine, Technion - Israel Institute of Technology, Haifa, Israel; Rambam Medical Center, Haifa, Israel
- Irit Hochberg
- Bruce Rappaport Faculty of Medicine, Technion - Israel Institute of Technology, Haifa, Israel; Rambam Medical Center, Haifa, Israel
- Malvika Rao
- Telfer School of Management, University of Ottawa, Ottawa, ON, Canada
- Alexandra Kogan
- Department of Information Systems, University of Haifa, Haifa, Israel, 3498838
- Mar Marcos
- Universitat Jaume I, Castelló de la Plana, Spain
- Luca Piovesan
- DISIT, Università del Piemonte Orientale, Alessandria, Italy
- David Riaño
- Universitat Rovira i Virgili, Tarragona, Spain; Institut d'Investigació Sanitària Pere Virgili, Tarragona, Spain
- Szymon Wilk
- Institute of Computing Science, Poznan University of Technology, Poznan, Poland
- Mor Peleg
- Department of Information Systems, University of Haifa, Haifa, Israel, 3498838
10
Touré V, Krauss P, Gnodtke K, Buchhorn J, Unni D, Horki P, Raisaro JL, Kalt K, Teixeira D, Crameri K, Österle S. FAIRification of health-related data using semantic web technologies in the Swiss Personalized Health Network. Sci Data 2023; 10:127. [PMID: 36899064 PMCID: PMC10006404 DOI: 10.1038/s41597-023-02028-y] [Received: 08/15/2022] [Accepted: 02/17/2023] [Indexed: 03/12/2023] Open
Abstract
The Swiss Personalized Health Network (SPHN) is a government-funded initiative developing federated infrastructures for a responsible and efficient secondary use of health data for research purposes in compliance with the FAIR principles (Findable, Accessible, Interoperable and Reusable). We built a common standard infrastructure with a fit-for-purpose strategy to bring together health-related data and ease the work of both data providers to supply data in a standard manner and researchers by enhancing the quality of the collected data. As a result, the SPHN Resource Description Framework (RDF) schema was implemented together with a data ecosystem that encompasses data integration, validation tools, analysis helpers, training and documentation for representing health metadata and data in a consistent manner and reaching nationwide data interoperability goals. Data providers can now efficiently deliver several types of health data in a standardised and interoperable way while a high degree of flexibility is granted for the various demands of individual research projects. Researchers in Switzerland have access to FAIR health data for further use in RDF triplestores.
Affiliation(s)
- Vasundra Touré
- Personalized Health Informatics Group, SIB Swiss Institute of Bioinformatics, 4051, Basel, Switzerland
- Philip Krauss
- Trivadis - Part of Accenture, 4051, Basel, Switzerland
- Kristin Gnodtke
- Personalized Health Informatics Group, SIB Swiss Institute of Bioinformatics, 4051, Basel, Switzerland
- Deepak Unni
- Personalized Health Informatics Group, SIB Swiss Institute of Bioinformatics, 4051, Basel, Switzerland
- Petar Horki
- Personalized Health Informatics Group, SIB Swiss Institute of Bioinformatics, 4051, Basel, Switzerland
- Jean Louis Raisaro
- Health Informatics and Data Privacy Group, Biomedical Data Science Center, 1010 Lausanne University Hospital, Lausanne, Switzerland
- Katie Kalt
- Clinical Data Platform Research, Directorate of Research and Education, Zurich University Hospital, 8091, Zurich, Switzerland
- Daniel Teixeira
- DSI - Data Group, Geneva University Hospital, 1205, Geneva, Switzerland
- Katrin Crameri
- Personalized Health Informatics Group, SIB Swiss Institute of Bioinformatics, 4051, Basel, Switzerland
- Sabine Österle
- Personalized Health Informatics Group, SIB Swiss Institute of Bioinformatics, 4051, Basel, Switzerland
11
Zhou F, Uddin S. Interpretable Drug-to-Drug Network Features for Predicting Adverse Drug Reactions. Healthcare (Basel) 2023; 11:healthcare11040610. [PMID: 36833144 PMCID: PMC9957267 DOI: 10.3390/healthcare11040610] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2022] [Revised: 01/29/2023] [Accepted: 02/06/2023] [Indexed: 02/22/2023] Open
Abstract
Recent years have witnessed booming data on drugs and their associated adverse drug reactions (ADRs). It was reported that these ADRs have resulted in a high hospitalisation rate worldwide. Therefore, a tremendous amount of research has been carried out to predict ADRs in the early phases of drug development, with the goal of reducing possible future risks. The pre-clinical and clinical phases of drug research can be time-consuming and cost-ineffective, so academics are looking forward to more extensive data mining and machine learning methods to be applied in this field of study. In this paper, we try to construct a drug-to-drug network based on non-clinical data sources. The network presents underlying relationships between drug pairs according to their common ADRs. Then, multiple node-level and graph-level network features are extracted from this network, e.g., weighted degree centrality, weighted PageRanks, etc. After concatenating the network features to the original drug features, they were fed into seven machine learning models, e.g., logistic regression, random forest, support vector machine, etc., and were compared to the baseline, where there were no network-based features considered. These experiments indicate that all the tested machine-learning methods would benefit from adding these network features. Among all these models, logistic regression (LR) had the highest mean AUROC score (82.1%) across all ADRs tested. Weighted degree centrality and weighted PageRanks were identified to be the most critical network features in the LR classifier. These pieces of evidence strongly indicate that the network approach can be vital in future ADR prediction, and this network-based approach could also be applied to other health informatics datasets.
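As a rough illustration of the network features named in this abstract, the following minimal Python sketch builds a drug-to-drug network whose edge weights count shared ADRs, then computes weighted degree centrality and a weighted PageRank by power iteration. The toy data, the function names, the damping factor of 0.85, and the handling of isolated drugs are all assumptions made for illustration; they are not taken from the paper.

```python
from itertools import combinations

# Hypothetical toy data (not from the paper): drug -> set of reported ADRs.
drug_adrs = {
    "drugA": {"nausea", "headache", "rash"},
    "drugB": {"nausea", "rash"},
    "drugC": {"headache"},
    "drugD": {"dizziness"},
}


def build_network(adrs_by_drug):
    """Undirected weighted graph; edge weight = number of ADRs two drugs share."""
    graph = {drug: {} for drug in adrs_by_drug}
    for a, b in combinations(sorted(adrs_by_drug), 2):
        shared = len(adrs_by_drug[a] & adrs_by_drug[b])
        if shared:
            graph[a][b] = shared
            graph[b][a] = shared
    return graph


def weighted_degree(graph):
    """Sum of incident edge weights for every node."""
    return {node: sum(nbrs.values()) for node, nbrs in graph.items()}


def weighted_pagerank(graph, damping=0.85, iterations=100):
    """Power iteration; a node passes rank to neighbours in proportion to edge weight."""
    n = len(graph)
    strength = weighted_degree(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # Assumption: rank held by isolated drugs is redistributed uniformly.
        dangling = sum(rank[v] for v in graph if strength[v] == 0)
        new_rank = {}
        for node in graph:
            incoming = sum(
                rank[nbr] * weight / strength[nbr]
                for nbr, weight in graph[node].items()
            )
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


graph = build_network(drug_adrs)
degree = weighted_degree(graph)
rank = weighted_pagerank(graph)
```

In a pipeline like the one the abstract describes, values such as `degree[drug]` and `rank[drug]` would be appended to each drug's original feature vector before a classifier is trained; that step is omitted here.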
12
Das P, Mazumder DH. An extensive survey on the use of supervised machine learning techniques in the past two decades for prediction of drug side effects. Artif Intell Rev 2023; 56:1-28. [PMID: 36819660 PMCID: PMC9930028 DOI: 10.1007/s10462-023-10413-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/01/2023] [Indexed: 02/19/2023]
Abstract
Approved drugs for sale must be effective and safe, implying that the drug's advantages outweigh its known harmful side effects. Side effects (SE) of drugs are one of the common reasons for drug failure that may halt the whole drug discovery pipeline. The side effects might vary from minor concerns like a runny nose to potentially life-threatening issues like liver damage, heart attack, and death. Therefore, predicting the side effects of a drug is vital in drug development, discovery, and design. The supervised machine learning-based side effect prediction task has recently received much attention since it reduces time, chemical waste, design complexity, risk of failure, and cost. Supervised learning approaches for predicting side effects have emerged as essential computational tools. Supervised machine learning techniques provide early information on drug side effects to develop an effective drug based on drug properties. Still, there are several challenges to predicting drug side effects. Thus, a near-exhaustive survey is carried out in this paper on the use of supervised machine learning approaches employed in drug side effect prediction tasks in the past two decades. In addition, this paper summarizes the drug descriptors required for the side effect prediction task, commonly utilized sources of drug properties, computational models, and their performances. Finally, the research gap, open problems, and challenges for further supervised learning-based side effect prediction are discussed.
Affiliation(s)
- Pranab Das
- Department of Computer Science and Engineering, National Institute of Technology Nagaland, Chumukedima, Dimapur, Nagaland 797103 India
- Dilwar Hussain Mazumder
- Department of Computer Science and Engineering, National Institute of Technology Nagaland, Chumukedima, Dimapur, Nagaland 797103 India
13
Artificial Intelligence and Data Mining for the Pharmacovigilance of Drug-Drug Interactions. Clin Ther 2023; 45:117-133. [PMID: 36732152 DOI: 10.1016/j.clinthera.2023.01.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2022] [Revised: 12/15/2022] [Accepted: 01/09/2023] [Indexed: 02/01/2023]
Abstract
Despite increasing mechanistic understanding, undetected and underrecognized drug-drug interactions (DDIs) persist. This elusiveness relates to an interwoven complexity of increasing polypharmacy, multiplex mechanistic pathways, and human biological individuality. This persistent elusiveness motivates development of artificial intelligence (AI)-based approaches to enhancing DDI detection and prediction capabilities. The literature is vast and roughly divided into "prediction" and "detection." The former relatively emphasizes biological and chemical knowledge bases, drug development, new drugs, and beneficial interactions, whereas the latter utilizes more traditional sources such as spontaneous reports, claims data, and electronic health records to detect novel adverse DDIs with authorized drugs. However, it is not a bright line, either nominally or in practice, and both are in scope for pharmacovigilance supporting signal detection but also signal refinement and evaluation, by providing data-based mechanistic arguments for/against DDI signals. The wide array of intricate and elegant methods has expanded the pharmacovigilance tool kit. How much they add to real prospective pharmacovigilance, reduce the public health impact of DDIs, and at what cost in terms of false alarms amplified by automation bias and its sequelae are open questions.
14
Systematic Construction of Knowledge Graphs for Research-Performing Organizations. INFORMATION 2022. [DOI: 10.3390/info13120562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2022] Open
Abstract
Research-Performing Organizations (e.g., research centers, universities) usually accumulate a wealth of data related to their researchers, the generated scientific results and research outputs, and publicly and privately-funded projects that support their activities, etc. Even though the types of data handled may look similar across organizations, it is common to see that each institution has developed its own data model to provide support for many of their administrative activities (project reporting, curriculum management, personnel management, etc.). This creates obstacles to the integration and linking of knowledge across organizations, as well as difficulties when researchers move from one institution to another. In this paper, we take advantage of the ontology network created by the Spanish HERCULES initiative to facilitate the construction of knowledge graphs from existing information systems, such as the one managed by the company Universitas XXI, which provides support to more than 100 Spanish-speaking research-performing organizations worldwide. Our effort is not just focused on following the modeling choices from that ontology, but also on demonstrating how the use of standard declarative mapping rules (i.e., R2RML) guarantees a systematic and sustainable workflow for constructing and maintaining a KG. We also present several real-world use cases in which the proposed workflow is adopted together with a set of lessons learned and general recommendations that may also apply to other domains. The next steps include researching the automation of the creation of the mapping rules, the enrichment of the KG with external sources, and its exploitation through distributed environments.
15
Bonner S, Barrett IP, Ye C, Swiers R, Engkvist O, Bender A, Hoyt CT, Hamilton WL. A review of biomedical datasets relating to drug discovery: a knowledge graph perspective. Brief Bioinform 2022; 23:6712301. [PMID: 36151740 DOI: 10.1093/bib/bbac404] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 05/12/2022] [Revised: 07/14/2022] [Accepted: 08/20/2022] [Indexed: 12/14/2022] Open
Abstract
Drug discovery and development is a complex and costly process. Machine learning approaches are being investigated to help improve the effectiveness and speed of multiple stages of the drug discovery pipeline. Of these, those that use Knowledge Graphs (KG) have promise in many tasks, including drug repurposing, drug toxicity prediction and target gene-disease prioritization. In a drug discovery KG, crucial elements including genes, diseases and drugs are represented as entities, while relationships between them indicate an interaction. However, to construct high-quality KGs, suitable data are required. In this review, we detail publicly available sources suitable for use in constructing drug discovery focused KGs. We aim to help guide machine learning and KG practitioners who are interested in applying new techniques to the drug discovery field, but who may be unfamiliar with the relevant data sources. The datasets are selected via strict criteria, categorized according to the primary type of information contained within and are considered based upon what information could be extracted to build a KG. We then present a comparative analysis of existing public drug discovery KGs and an evaluation of selected motivating case studies from the literature. Additionally, we raise numerous and unique challenges and issues associated with the domain and its datasets, while also highlighting key future research directions. We hope this review will motivate KGs use in solving key and emerging questions in the drug discovery domain.
Affiliation(s)
- Stephen Bonner
- Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Ian P Barrett
- Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Cheng Ye
- Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Rowan Swiers
- Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Ola Engkvist
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
- Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, UK
- William L Hamilton
- School of Computer Science, McGill University, Canada; Mila-Quebec AI Institute, Montreal, Canada
16
Ikeda S, Ono H, Ohta T, Chiba H, Naito Y, Moriya Y, Kawashima S, Yamamoto Y, Okamoto S, Goto S, Katayama T. TogoID: an exploratory ID converter to bridge biological datasets. Bioinformatics 2022; 38:4194-4199. [PMID: 35801937 PMCID: PMC9438948 DOI: 10.1093/bioinformatics/btac491] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/07/2022] [Revised: 06/08/2022] [Accepted: 07/07/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Understanding life cannot be accomplished without making full use of biological data, which are scattered across databases of diverse categories in life sciences. To connect such data seamlessly, identifier (ID) conversion plays a key role. However, existing ID conversion services have disadvantages, such as covering only a limited range of biological categories of databases, not keeping up with the updates of the original databases and outputs being hard to interpret in the context of biological relations, especially when converting IDs in multiple steps. RESULTS TogoID is an ID conversion service implementing unique features with an intuitive web interface and an application programming interface (API) for programmatic access. TogoID currently supports 65 datasets covering various biological categories. TogoID users can perform exploratory multistep conversions to find a path among IDs. To guide the interpretation of biological meanings in the conversions, we crafted an ontology that defines the semantics of the dataset relations. AVAILABILITY AND IMPLEMENTATION The TogoID service is freely available on the TogoID website (https://togoid.dbcls.jp/) and the API is also provided to allow programmatic access. To encourage developers to add new dataset pairs, the system stores the configurations of pairs at the GitHub repository (https://github.com/togoid/togoid-config) and accepts the request of additional pairs. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
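The multistep conversion idea in this abstract, finding a path among IDs across linked datasets, can be pictured as a shortest-path search over a graph of dataset pairs. The minimal Python sketch below uses breadth-first search for that. The dataset labels, the `PAIRS` list, the `find_route` function, and the assumption that every pair can be traversed in both directions are hypothetical and do not reflect TogoID's actual configuration or API.

```python
from collections import deque

# Hypothetical dataset pairs for illustration only (not TogoID's real configuration).
PAIRS = [
    ("ncbigene", "ensembl_gene"),
    ("ensembl_gene", "ensembl_transcript"),
    ("ensembl_transcript", "uniprot"),
    ("uniprot", "pdb"),
    ("ncbigene", "hgnc"),
]


def find_route(pairs, source, target):
    """Breadth-first search for a shortest chain of datasets linking source to target."""
    neighbors = {}
    for a, b in pairs:
        # Assumption: each pair can be converted in both directions.
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of pairs connects the two datasets


route = find_route(PAIRS, "ncbigene", "pdb")
```

Each hop in the returned route would correspond to one ID conversion step; a real service would additionally carry the ID values themselves and the relation semantics along the route.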
Affiliation(s)
- Tazro Ohta
- Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, University of Tokyo Kashiwanoha-campus Station Satellite 6F, Kashiwa, Chiba 277-0871, Japan
- Hirokazu Chiba
- Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, University of Tokyo Kashiwanoha-campus Station Satellite 6F, Kashiwa, Chiba 277-0871, Japan
- Yuki Naito
- Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, University of Tokyo Kashiwanoha-campus Station Satellite 6F, Kashiwa, Chiba 277-0871, Japan
- Yuki Moriya
- Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, University of Tokyo Kashiwanoha-campus Station Satellite 6F, Kashiwa, Chiba 277-0871, Japan
- Shuichi Kawashima
- Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, University of Tokyo Kashiwanoha-campus Station Satellite 6F, Kashiwa, Chiba 277-0871, Japan
- Yasunori Yamamoto
- Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, University of Tokyo Kashiwanoha-campus Station Satellite 6F, Kashiwa, Chiba 277-0871, Japan
- Shinobu Okamoto
- Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, University of Tokyo Kashiwanoha-campus Station Satellite 6F, Kashiwa, Chiba 277-0871, Japan
- Susumu Goto
- Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, University of Tokyo Kashiwanoha-campus Station Satellite 6F, Kashiwa, Chiba 277-0871, Japan
17
Zong N, Li N, Wen A, Ngo V, Yu Y, Huang M, Chowdhury S, Jiang C, Fu S, Weinshilboum R, Jiang G, Hunter L, Liu H. BETA: a comprehensive benchmark for computational drug-target prediction. Brief Bioinform 2022; 23:6596989. [PMID: 35649342 PMCID: PMC9294420 DOI: 10.1093/bib/bbac199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2022] [Revised: 04/10/2022] [Accepted: 04/29/2022] [Indexed: 11/14/2022] Open
Abstract
Internal validation is the most popular evaluation strategy used for drug-target predictive models. The simple random shuffling in the cross-validation, however, is not always ideal to handle large, diverse and copious datasets as it could potentially introduce bias. Hence, these predictive models cannot be comprehensively evaluated to provide insight into their general performance on a variety of use-cases (e.g. permutations of different levels of connectiveness and categories in drug and target space, as well as validations based on different data sources). In this work, we introduce a benchmark, BETA, that aims to address this gap by (i) providing an extensive multipartite network consisting of 0.97 million biomedical concepts and 8.5 million associations, in addition to 62 million drug-drug and protein-protein similarities and (ii) presenting evaluation strategies that reflect seven cases (i.e. general, screening with different connectivity, target and drug screening based on categories, searching for specific drugs and targets and drug repurposing for specific diseases), a total of seven Tests (consisting of 344 Tasks in total) across multiple sampling and validation strategies. Six state-of-the-art methods covering two broad input data types (chemical structure- and gene sequence-based and network-based) were tested across all the developed Tasks. The best-worst performing cases have been analyzed to demonstrate the ability of the proposed benchmark to identify limitations of the tested methods for running over the benchmark tasks. The results highlight BETA as a benchmark in the selection of computational strategies for drug repurposing and target discovery.
Affiliation(s)
- Nansu Zong
- Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Ning Li
- Center for Structure Biology, Center for Cancer Research, National Cancer Institute, Frederick, MD
- Andrew Wen
- Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Victoria Ngo
- Betty Irene Moore School of Nursing, University of California Davis Health, Sacramento, CA; Stanford Health Policy, Stanford School of Medicine and Freeman Spogli Institute for International Studies, Palo Alto, CA
- Yue Yu
- Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Ming Huang
- Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Shaika Chowdhury
- Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Chao Jiang
- Department of Computer Science and Software Engineering, Auburn University, Auburn, AL
- Sunyang Fu
- Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Richard Weinshilboum
- Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN
- Guoqian Jiang
- Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Lawrence Hunter
- Department of Pharmacology, University of Colorado Denver, Aurora, CO
- Hongfang Liu
- Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
18
Gurupur VP. Key observations in terms of management of electronic health records from a mHealth perspective. Mhealth 2022; 8:18. [PMID: 35449505 PMCID: PMC9014234 DOI: 10.21037/mhealth-21-39] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2021] [Accepted: 01/11/2022] [Indexed: 11/06/2022] Open
Abstract
The article is a narrative review that briefly describes some of the recent advances in healthcare data management that will have positive effect on mHealth. The advances described in this article are in fact innovation introduced by the author to the field of data management with respect to electronic health records. The research delineated is transdisciplinary in nature and will potentially have positive impact on healthcare outcomes. Also, the article illustrates the necessity for an out of the box thinking approach to improve mHealth while discussing the current impending issues related to data incompleteness of electronic health records and the much-needed decision support systems for mHealth. It is to be noted that most of the electronic health records are now accessed by patients through mobile devices. These mobile devices will run as clients while much of the heavy computing is performed using servers. Here it is important to discuss some of the important technologies and methods used for decision making. The article attempts to present a discussion on how this myriad of intertwining technologies support this decision making with respect to electronic health records. More importantly it is these processes that assist in decision making and efficiency for both mHealth users and providers. In this respect, the article first provides insights on the complexities of decision making involved with electronic health records. This is followed by a discussion on the problem of data incompleteness of electronic health records. Finally, the author provides some insights into the gravity of the problem of data incompleteness in terms of revenue loss/gain for healthcare providers.
Affiliation(s)
- Varadraj P Gurupur
- School of Global Health Management and Informatics, University of Central Florida, Orlando, FL, USA
19
Alshahrani M, Almansour A, Alkhaldi A, Thafar MA, Uludag M, Essack M, Hoehndorf R. Combining biomedical knowledge graphs and text to improve predictions for drug-target interactions and drug-indications. PeerJ 2022; 10:e13061. [PMID: 35402106 PMCID: PMC8988936 DOI: 10.7717/peerj.13061] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/28/2020] [Accepted: 02/13/2022] [Indexed: 01/11/2023] Open
Abstract
Biomedical knowledge is represented in structured databases and published in biomedical literature, and different computational approaches have been developed to exploit each type of information in predictive models. However, the information in structured databases and literature is often complementary. We developed a machine learning method that combines information from literature and databases to predict drug targets and indications. To effectively utilize information in published literature, we integrate knowledge graphs and published literature using named entity recognition and normalization before applying a machine learning model that utilizes the combination of graph and literature. We then use supervised machine learning to show the effects of combining features from biomedical knowledge and published literature on the prediction of drug targets and drug indications. We demonstrate that our approach using datasets for drug-target interactions and drug indications is scalable to large graphs and can be used to improve the ranking of targets and indications by exploiting features from either structure or unstructured information alone.
Affiliation(s)
- Mona Alshahrani
- National Center for Artificial Intelligence (NCAI), Saudi Data and Artificial Intelligence Authority (SDAIA), Riyadh, Saudi Arabia
- Abdullah Almansour
- National Center for Artificial Intelligence (NCAI), Saudi Data and Artificial Intelligence Authority (SDAIA), Riyadh, Saudi Arabia
- Asma Alkhaldi
- National Center for Artificial Intelligence (NCAI), Saudi Data and Artificial Intelligence Authority (SDAIA), Riyadh, Saudi Arabia
- Maha A. Thafar
- College of Computers and Information Technology, Taif University, Taif, Saudi Arabia; Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Mahmut Uludag
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Magbubah Essack
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
20
21
Han X, Xie R, Li X, Li J. SmileGNN: Drug–Drug Interaction Prediction Based on the SMILES and Graph Neural Network. Life (Basel) 2022; 12:life12020319. [PMID: 35207606 PMCID: PMC8879716 DOI: 10.3390/life12020319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2021] [Revised: 12/27/2021] [Accepted: 01/05/2022] [Indexed: 11/16/2022] Open
Abstract
Concurrent use of multiple drugs can lead to unexpected adverse drug reactions. The interaction between drugs can be confirmed by routine in vitro and clinical trials. However, it is difficult to test the drug–drug interactions widely and effectively before the drugs enter the market. Therefore, the prediction of drug–drug interactions has become one of the research priorities in the biomedical field. In recent years, researchers have been using deep learning to predict drug–drug interactions by exploiting drug structural features and graph theory, and have achieved a series of achievements. A drug–drug interaction prediction model SmileGNN is proposed in this paper, which can be characterized by aggregating the structural features of drugs constructed by SMILES data and the topological features of drugs in knowledge graphs obtained by graph neural networks. The experimental results show that the model proposed in this paper combines a variety of data sources and has a better prediction performance compared with existing prediction models of drug–drug interactions. Five out of the top ten predicted new drug–drug interactions are verified from the latest database, which proves the credibility of SmileGNN.
Affiliation(s)
- Xueting Han
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China
- Ruixia Xie
- School of Medical Technology and Nursing, Shenzhen Polytechnic, Shenzhen 518055, China
- Xutao Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China
- Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China
22
Angioni S, Salatino A, Osborne F, Recupero DR, Motta E. AIDA: A knowledge graph about research dynamics in academia and industry. QUANTITATIVE SCIENCE STUDIES 2022. [DOI: 10.1162/qss_a_00162] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/04/2022] Open
Abstract
Academia and industry share a complex, multifaceted, and symbiotic relationship. Analyzing the knowledge flow between them, understanding which directions have the biggest potential, and discovering the best strategies to harmonize their efforts is a critical task for several stakeholders. Research publications and patents are an ideal medium to analyze this space, but current data sets of scholarly data cannot be used for such a purpose because they lack a high-quality characterization of the relevant research topics and industrial sectors. In this paper, we introduce the Academia/Industry DynAmics (AIDA) Knowledge Graph, which describes 21 million publications and 8 million patents according to the research topics drawn from the Computer Science Ontology. 5.1 million publications and 5.6 million patents are further characterized according to the type of the author’s affiliations and 66 industrial sectors from the proposed Industrial Sectors Ontology (INDUSO). AIDA was generated by an automatic pipeline that integrates data from Microsoft Academic Graph, Dimensions, DBpedia, the Computer Science Ontology, and the Global Research Identifier Database. It is publicly available under CC BY 4.0 and can be downloaded as a dump or queried via a triplestore. We evaluated the different parts of the generation pipeline on a manually crafted gold standard yielding competitive results.
Affiliation(s)
- Simone Angioni
- Department of Mathematics and Computer Science, University of Cagliari (Italy)
| | - Angelo Salatino
- Knowledge Media Institute, The Open University, Milton Keynes (UK)
| | | | | | - Enrico Motta
- Knowledge Media Institute, The Open University, Milton Keynes (UK)
| |
|
23
|
Larmande P, Tagny Ngompe G, Venkatesan A, Ruiz M. AgroLD: A Knowledge Graph Database for Plant Functional Genomics. Methods Mol Biol 2022; 2443:527-540. [PMID: 35037225 DOI: 10.1007/978-1-0716-2067-0_28] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
Recent advances in high-throughput technologies have resulted in a tremendous increase in the amount of data in the agronomic domain. There is an urgent need to effectively integrate complementary information to understand the biological system in its entirety. We have developed AgroLD, a knowledge graph that exploits Semantic Web technology and some of the relevant standard domain ontologies to integrate information on plant species, thereby facilitating the formulation of new scientific hypotheses. This chapter outlines some integration results of the project, which initially focused on genomics, proteomics and phenomics.
Affiliation(s)
- Pierre Larmande
- DIADE, IRD, CIRAD, Univ. Montpellier, Montpellier, France.
- French Institute of Bioinformatics (IFB)-South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier, France.
| | - Gildas Tagny Ngompe
- French Institute of Bioinformatics (IFB)-South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier, France
- AGAP, CIRAD, INRAE, Univ. Montpellier, av Agropolis, Montpellier, France
| | | | - Manuel Ruiz
- French Institute of Bioinformatics (IFB)-South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier, France
- AGAP, CIRAD, INRAE, Univ. Montpellier, av Agropolis, Montpellier, France
| |
|
24
|
Abstract
Understanding the role played by genetic variations in diseases, exploring genomic variants and discovering disease-associated loci are among the most pressing challenges of genomic medicine. A huge and ever-increasing amount of information is available to researchers to address these challenges. Unfortunately, it is stored in fragmented ontologies and databases, which use heterogeneous formats and poorly integrated schemas. To overcome these limitations, we propose a linked data approach, based on the formalism of multilayer networks, able to integrate and harmonize biomedical information from multiple sources into a single dense network covering different aspects on Neuroendocrine Neoplasms (NENs). The proposed integration schema consists of three interconnected layers representing, respectively, information on the disease, on the affected genes, on the related biological processes and molecular functions. An easy-to-use client-server application was also developed to browse and search for information on the model supporting multilayer network analysis.
|
25
|
Nayyeri M, Cil GM, Vahdati S, Osborne F, Rahman M, Angioni S, Salatino A, Recupero DR, Vassilyeva N, Motta E, Lehmann J. Trans4E: Link prediction on scholarly knowledge graphs. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.02.100] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
|
26
|
Wang M, Wang H, Liu X, Ma X, Wang B. Drug-Drug Interaction Predictions via Knowledge Graph and Text Embedding: Instrument Validation Study. JMIR Med Inform 2021; 9:e28277. [PMID: 34185011 PMCID: PMC8277366 DOI: 10.2196/28277] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2021] [Revised: 04/29/2021] [Accepted: 05/05/2021] [Indexed: 11/23/2022] Open
Abstract
Background Minimizing adverse reactions caused by drug-drug interactions (DDIs) has always been a prominent research topic in clinical pharmacology. Detecting all possible interactions through clinical studies before a drug is released to the market is a demanding task. The power of big data is opening up new approaches to discovering various DDIs. However, these data contain a huge amount of noise and provide knowledge bases that are far from being complete or used with reliability. Most existing studies focus on predicting binary DDIs between drug pairs and ignore other interactions. Objective Leveraging both drug knowledge graphs and biomedical text is a promising pathway for rich and comprehensive DDI prediction, but it is not without issues. Our proposed model seeks to address the following challenges: data noise and incompleteness, data sparsity, and computational complexity. Methods We propose a novel framework, Predicting Rich DDI, to predict DDIs. The framework uses graph embedding to overcome data incompleteness and sparsity issues to make multiple DDI label predictions. First, a large-scale drug knowledge graph is generated from different sources. The knowledge graph is then embedded with comprehensive biomedical text into a common low-dimensional space. Finally, the learned embeddings are used to efficiently compute rich DDI information through a link prediction process. Results To validate the effectiveness of the proposed framework, extensive experiments were conducted on real-world data sets. The results demonstrate that our model outperforms several state-of-the-art baseline methods in terms of capability and accuracy. Conclusions We propose a novel framework, Predicting Rich DDI, to predict DDIs. Using rich DDI information, it can competently predict multiple labels for a pair of drugs across numerous domains, ranging from pharmacological mechanisms to side effects. 
To the best of our knowledge, this framework is the first to provide a joint translation-based embedding model that learns DDIs by integrating drug knowledge graphs and biomedical text simultaneously in a common low-dimensional space. The model also predicts DDIs using multiple labels rather than single or binary labels. Extensive experiments were conducted on real-world data sets to demonstrate the effectiveness and efficiency of the model. The results show our proposed framework outperforms several state-of-the-art baselines.
Affiliation(s)
- Meng Wang
- School of Computer Science and Engineering, Southeast University, Nanjing, China; Key Laboratory of Computer Network and Information Integration, Southeast University, Nanjing, China
| | - Haofen Wang
- College of Design and Innovation, Tongji University, Shanghai, China
| | - Xing Liu
- Third Xiangya Hospital, Central South University, Changsha, China
| | - Xinyu Ma
- School of Computer Science and Engineering, Southeast University, Nanjing, China
| | - Beilun Wang
- School of Computer Science and Engineering, Southeast University, Nanjing, China
| |
|
27
|
Galgonek J, Vondrášek J. IDSM ChemWebRDF: SPARQLing small-molecule datasets. J Cheminform 2021; 13:38. [PMID: 33980298 PMCID: PMC8117646 DOI: 10.1186/s13321-021-00515-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2021] [Accepted: 04/23/2021] [Indexed: 11/12/2022] Open
Abstract
The Resource Description Framework (RDF), together with well-defined ontologies, significantly increases data interoperability and usability. The SPARQL query language was introduced to retrieve requested RDF data and to explore links between them. Among other useful features, SPARQL supports federated queries that combine multiple independent data source endpoints. This allows users to obtain insights that are not possible using only a single data source. Owing to all of these useful features, many biological and chemical databases present their data in RDF, and support SPARQL querying. In our project, we primarily focused on PubChem, ChEMBL and ChEBI small-molecule datasets. These datasets are already being exported to RDF by their creators. However, none of them has an official and currently supported SPARQL endpoint. This omission makes it difficult to construct complex or federated queries that could access all of the datasets, thus underutilising the main advantage of the availability of RDF data. Our goal is to address this gap by integrating the datasets into one database called the Integrated Database of Small Molecules (IDSM) that will be accessible through a SPARQL endpoint. Beyond that, we will also focus on increasing mutual interoperability of the datasets. To realise the endpoint, we decided to implement an in-house developed SPARQL engine based on the PostgreSQL relational database for data storage. In our approach, data are stored in the traditional relational form, and the SPARQL engine translates incoming SPARQL queries into equivalent SQL queries. An important feature of the engine is that it optimises the resulting SQL queries. Together with optimisations performed by PostgreSQL, this allows efficient evaluations of SPARQL queries. The endpoint provides not only querying in the dataset, but also the compound substructure and similarity search supported by our Sachem project.
Although the endpoint is accessible from an internet browser, it is mainly intended to be used for programmatic access by other services, for example as a part of federated queries. For regular users, we offer a rich web application called ChemWebRDF using the endpoint. The application is publicly available at https://idsm.elixir-czech.cz/chemweb/.
Affiliation(s)
- Jakub Galgonek
- Institute of Organic Chemistry and Biochemistry of the CAS, Flemingovo náměstí 2, 166 10, Prague 6, Czech Republic.
| | - Jiří Vondrášek
- Institute of Organic Chemistry and Biochemistry of the CAS, Flemingovo náměstí 2, 166 10, Prague 6, Czech Republic
| |
|
28
|
|
29
|
Abstract
INTRODUCTION Knowledge graphs have proven to be promising systems of information storage and retrieval. Due to the recent explosion of heterogeneous multimodal data sources generated in the biomedical domain, and an industry shift toward a systems biology approach, knowledge graphs have emerged as attractive methods of data storage and hypothesis generation. AREAS COVERED In this review, the author summarizes the applications of knowledge graphs in drug discovery. They evaluate their utility; differentiating between academic exercises in graph theory, and useful tools to derive novel insights, highlighting target identification and drug repurposing as two areas showing particular promise. They provide a case study on COVID-19, summarizing the research that used knowledge graphs to identify repurposable drug candidates. They describe the dangers of degree and literature bias, and discuss mitigation strategies. EXPERT OPINION Whilst knowledge graphs and graph-based machine learning have certainly shown promise, they remain relatively immature technologies. Many popular link prediction algorithms fail to address strong biases in biomedical data, and only highlight biological associations, failing to model causal relationships in complex dynamic biological systems. These problems need to be addressed before knowledge graphs reach their true potential in drug discovery.
Affiliation(s)
- Finlay MacLean
- Target Identification, BenevolentAI, United Kingdom of Great Britain and Northern Ireland
| |
|
30
|
Chen Y, Ma T, Yang X, Wang J, Song B, Zeng X. MUFFIN: Multi-Scale Feature Fusion for Drug-Drug Interaction Prediction. Bioinformatics 2021; 37:2651-2658. [PMID: 33720331 DOI: 10.1093/bioinformatics/btab169] [Citation(s) in RCA: 73] [Impact Index Per Article: 24.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2020] [Revised: 02/05/2021] [Accepted: 03/11/2021] [Indexed: 01/08/2023] Open
Abstract
MOTIVATION Adverse drug-drug interactions (DDIs) are crucial for drug research and mainly cause morbidity and mortality. Thus, the identification of potential DDIs is essential for doctors, patients, and the society. Existing traditional machine learning models rely heavily on handcraft features and lack generalization. Recently, the deep learning approaches that can automatically learn drug features from the molecular graph or drug-related network have improved the ability of computational models to predict unknown DDIs. However, previous works utilized large labeled data and merely considered the structure or sequence information of drugs without considering the relations or topological information between drug and other biomedical objects (e.g., gene, disease, and pathway), or considered knowledge graph (KG) without considering the information from the drug molecular structure. RESULTS Accordingly, to effectively explore the joint effect of drug molecular structure and semantic information of drugs in knowledge graph for DDI prediction, we propose a multi-scale feature fusion deep learning model named MUFFIN. MUFFIN can jointly learn the drug representation based on both the drug-self structure information and the KG with rich bio-medical information. In MUFFIN, we designed a bi-level cross strategy that includes cross- and scalar-level components to fuse multi-modal features well. MUFFIN can alleviate the restriction of limited labeled data on deep learning models by crossing the features learned from large-scale KG and drug molecular graph. We evaluated our approach on three datasets and three different tasks including binary-class, multi-class, and multi-label DDI prediction tasks. The results showed that MUFFIN outperformed other state-of-the-art baselines. AVAILABILITY The source code and data are available at https://github.com/xzenglab/MUFFIN. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yujie Chen
- School of Computer Science and Engineering, Hunan University, Changsha, 410012, China
| | - Tengfei Ma
- School of Computer Science and Engineering, Hunan University, Changsha, 410012, China
| | - Xixi Yang
- School of Computer Science and Engineering, Hunan University, Changsha, 410012, China
| | - Jianmin Wang
- School of Computer Science and Engineering, Hunan University, Changsha, 410012, China
| | - Bosheng Song
- School of Computer Science and Engineering, Hunan University, Changsha, 410012, China
| | - Xiangxiang Zeng
- School of Computer Science and Engineering, Hunan University, Changsha, 410012, China
| |
|
31
|
Biswas S, Mitra P, Rao KS. Relation Prediction of Co-Morbid Diseases Using Knowledge Graph Completion. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:708-717. [PMID: 31295118 DOI: 10.1109/tcbb.2019.2927310] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
A co-morbid disease condition refers to the simultaneous presence of one or more diseases along with the primary disease. A patient suffering from co-morbid diseases faces a higher mortality risk than a patient with a single disease, so it is necessary to predict co-morbid disease pairs. Although researchers have proposed several methods for predicting co-morbid diseases in past years, little work has been done on prediction using knowledge graph embedding based on tensor factorization. Moreover, complex-valued vector-based tensor factorization has not been used in any knowledge graph with biological and biomedical entities. We propose a tensor factorization based approach on biological knowledge graphs. Our method introduces the concept of complex-valued embedding in knowledge graphs with biological entities. Here, we build a knowledge graph with disease-gene associations and their corresponding background information. To predict the association between prevalent diseases, we use the ComplEx embedding based tensor decomposition method. In addition, we obtain new prevalent disease pairs using the MCL algorithm in a disease-gene-gene network and check their corresponding inter-relations using an edge prediction task.
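For readers unfamiliar with complex-valued embeddings, the following is a minimal sketch of the standard ComplEx scoring function, which scores a triple as the real part of a trilinear product in which the object embedding is conjugated. It is not the authors' code: the function name `complex_score` and the toy vectors are assumptions made purely for illustration, and real embeddings would be learned from the knowledge graph rather than written by hand.

```python
def complex_score(e_s, w_r, e_o):
    """Standard ComplEx score: Re(sum_k e_s[k] * w_r[k] * conj(e_o[k])).

    e_s, w_r, e_o are equal-length sequences of Python complex numbers
    (subject embedding, relation embedding, object embedding).
    """
    return sum((s * r * o.conjugate()).real for s, r, o in zip(e_s, w_r, e_o))


# Hypothetical two-dimensional embeddings for two diseases (illustration only).
disease_a = [1 + 2j, 0.5 - 1j]
disease_b = [2 - 1j, -1 + 0.5j]

# A purely real relation vector yields a symmetric relation,
# a purely imaginary one yields an antisymmetric relation.
symmetric_rel = [1 + 0j, 2 + 0j]
antisymmetric_rel = [1j, 2j]

forward = complex_score(disease_a, symmetric_rel, disease_b)
backward = complex_score(disease_b, symmetric_rel, disease_a)
```

The conjugate on the object embedding is what lets a single model represent both symmetric relations (such as co-occurrence of two diseases) and antisymmetric ones, depending on whether the relation vector is real or imaginary.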
|
32
|
Irshad O, Ghani Khan MU. Formalization and Semantic Integration of Heterogeneous Omics Annotations for Exploratory Searches. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200127122818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Aim:
To help researchers and practitioners unveil the mysterious functional aspects of the human cellular system through exploratory searching on semantically integrated, heterogeneous and geographically dispersed omics annotations.
Background:
Improving health standards of life is one of the motives that continuously drives researchers and practitioners to uncover the mysterious aspects of the human cellular system. Inferring new knowledge from known facts always requires a reasonably large amount of data in a well-structured, integrated and unified form. With the advent of high-throughput and sensor technologies in particular, biological data is growing heterogeneously and geographically at an astronomical rate. Several data integration systems have been deployed to cope with the issues of data heterogeneity and global dispersion. Systems based on semantic data integration models are more flexible and expandable than syntax-based ones but still lack aspect-based data integration, persistence and querying. Furthermore, these systems do not fully support warehousing biological entities in the form of semantic associations as naturally possessed by the human cell.
Objective:
To develop an aspect-oriented formal data integration model for semantically integrating heterogeneous and geographically dispersed omics annotations and for providing exploratory querying on the integrated data.
Method:
We propose an aspect-oriented formal data integration model that uses web semantics standards to formally specify each of its constructs. The proposed model supports aspect-oriented representation of biological entities while addressing the issues of data heterogeneity and global dispersion. It associates and warehouses biological entities in the way they relate with
Result:
To show the significance of the proposed model, we developed a data warehouse and information retrieval system based on a multi-layered and multi-modular software architecture compliant with the proposed model. Results show that our model supports gathering, associating, integrating, persisting and querying each entity with respect to all of its possible aspects within or across the various associated omics layers.
Conclusion:
Formal specifications better facilitate addressing data integration issues by providing formal means for understanding omics data based on meaning instead of syntax.
Affiliation(s)
- Omer Irshad
- Department of Computer Science & Engineering, Faculty of Electrical Engineering, The University of Engineering and Technology, Lahore, Pakistan
| | - Muhammad Usman Ghani Khan
- Department of Computer Science & Engineering, Faculty of Electrical Engineering, The University of Engineering and Technology, Lahore, Pakistan
| |
|
33
|
Rickett CD, Maschhoff KJ, Sukumar SR. Does tetanus vaccination contribute to reduced severity of the COVID-19 infection? Med Hypotheses 2021; 146:110395. [PMID: 33341328 PMCID: PMC7695568 DOI: 10.1016/j.mehy.2020.110395] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2020] [Accepted: 11/06/2020] [Indexed: 02/09/2023]
Abstract
We present the hypothesis to the scientific community actively designing clinical trials and recommending public health guidelines to control the pandemic that - "Tetanus vaccination may be contributing to reduced severity of the COVID-19 infection" - and urge further research to validate or invalidate the effectiveness of the tetanus toxoid vaccine against COVID-19. This hypothesis was revealed by an explainable artificial intelligence system unleashed on open public biomedical datasets. As a foundation for scientific rigor, we describe the data and the artificial intelligence system, document the provenance and methodology used to derive the hypothesis and also gather potentially relevant data/evidence from recent studies. We conclude that while correlation may not imply causation, correlations from multiple sources are more than a serendipitous coincidence and are worthy of further and deeper investigation.
|
34
|
Marín-Llaó J, Mubeen S, Perera-Lluna A, Hofmann-Apitius M, Picart-Armada S, Domingo-Fernández D. MultiPaths: a python framework for analyzing multi-layer biological networks using diffusion algorithms. Bioinformatics 2020; 37:137-139. [PMID: 33367476 PMCID: PMC8034528 DOI: 10.1093/bioinformatics/btaa1069] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2020] [Revised: 11/23/2020] [Accepted: 12/14/2020] [Indexed: 11/13/2022] Open
Abstract
Summary High-throughput screening yields vast amounts of biological data which can be highly challenging to interpret. In response, knowledge-driven approaches emerged as possible solutions to analyze large datasets by leveraging prior knowledge of biomolecular interactions represented in the form of biological networks. Nonetheless, given their size and complexity, their manual investigation quickly becomes impractical. Thus, computational approaches, such as diffusion algorithms, are often employed to interpret and contextualize the results of high-throughput experiments. Here, we present MultiPaths, a framework consisting of two independent Python packages for network analysis. While the first package, DiffuPy, comprises numerous commonly used diffusion algorithms applicable to any generic network, the second, DiffuPath, enables the application of these algorithms on multi-layer biological networks. To facilitate its usability, the framework includes a command line interface, reproducible examples and documentation. To demonstrate the framework, we conducted several diffusion experiments on three independent multi-omics datasets over disparate networks generated from pathway databases, thus, highlighting the ability of multi-layer networks to integrate multiple modalities. Finally, the results of these experiments demonstrate how the generation of harmonized networks from disparate databases can improve predictive performance with respect to individual resources. Availability and implementation DiffuPy and DiffuPath are publicly available under the Apache License 2.0 at https://github.com/multipaths. Supplementary information Supplementary data are available at Bioinformatics online.
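As a rough illustration of what a network diffusion algorithm does, the sketch below implements random walk with restart, one widely used diffusion method, in plain Python. It does not use or reproduce the DiffuPy or DiffuPath APIs; the function name `random_walk_with_restart`, its parameters, and the toy path graph are all assumptions introduced for this example.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, iterations=100):
    """Propagate scores from seed nodes over an undirected graph.

    adjacency: dict mapping each node to a list of its neighbours.
    seeds: nodes carrying the initial signal (e.g. experimentally measured hits).
    restart: probability of jumping back to the seed set at each step.
    """
    seeds = list(seeds)
    nodes = list(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(iterations):
        # Every step begins with the restart mass returned to the seeds.
        updated = {n: restart * start[n] for n in nodes}
        for node in nodes:
            neighbours = adjacency[node]
            walking_mass = (1.0 - restart) * scores[node]
            if not neighbours:
                # Isolated node: send its mass back to the seeds so the total stays 1.
                for s in seeds:
                    updated[s] += walking_mass / len(seeds)
                continue
            # Spread the remaining mass evenly over the neighbours.
            for m in neighbours:
                updated[m] += walking_mass / len(neighbours)
        scores = updated
    return scores


# Toy path graph a - b - c - d with the signal placed on node "a".
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
diffused = random_walk_with_restart(path_graph, seeds=["a"])
```

On this toy graph the score decays with distance from the seed, which is the behaviour that lets diffusion methods rank network nodes by their proximity to experimentally observed entities.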
Affiliation(s)
- Josep Marín-Llaó
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, 53757, Germany; B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, CIBER-BBN, Barcelona, 08028, Spain
| | - Sarah Mubeen
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, 53757, Germany; Fraunhofer Center for Machine Learning, Germany
| | - Alexandre Perera-Lluna
- B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, CIBER-BBN, Barcelona, 08028, Spain
| | - Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, 53757, Germany
| | - Sergio Picart-Armada
- B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, CIBER-BBN, Barcelona, 08028, Spain
| | - Daniel Domingo-Fernández
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, 53757, Germany; Fraunhofer Center for Machine Learning, Germany
| |
|
35
|
Zheng S, Rao J, Song Y, Zhang J, Xiao X, Fang EF, Yang Y, Niu Z. PharmKG: a dedicated knowledge graph benchmark for bomedical data mining. Brief Bioinform 2020; 22:6042240. [PMID: 33341877 DOI: 10.1093/bib/bbaa344] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2020] [Revised: 10/12/2020] [Accepted: 10/28/2020] [Indexed: 12/11/2022] Open
Abstract
Biomedical knowledge graphs (KGs), which can help with the understanding of complex biological systems and pathologies, have begun to play a critical role in medical practice and research. However, challenges remain in their embedding and use due to their complex nature and the specific demands of their construction. Existing studies often suffer from problems such as sparse and noisy datasets, insufficient modeling methods and non-uniform evaluation metrics. In this work, we established a comprehensive KG system for the biomedical field in an attempt to bridge the gap. Here, we introduced PharmKG, a multi-relational, attributed biomedical KG, composed of more than 500 000 individual interconnections between genes, drugs and diseases, with 29 relation types over a vocabulary of ~8000 disambiguated entities. Each entity in PharmKG is attached with heterogeneous, domain-specific information obtained from multi-omics data, i.e. gene expression, chemical structure and disease word embedding, while preserving the semantic and biomedical features. For baselines, we offered nine state-of-the-art KG embedding (KGE) approaches and a new biological, intuitive, graph neural network-based KGE method that uses a combination of both global network structure and heterogeneous domain features. Based on the proposed benchmark, we conducted extensive experiments to assess these KGE models using multiple evaluation metrics. Finally, we discussed our observations across various downstream biological tasks and provide insights and guidelines for how to use a KG in biomedicine. We hope that the unprecedented quality and diversity of PharmKG will lead to advances in biomedical KG construction, embedding and application.
Affiliation(s)
- Shuangjia Zheng
- School of Data and Computer Science at the Sun Yat-Sen University
| | - Jiahua Rao
- School of Data and Computer Science at the Sun Yat-Sen University
| | - Ying Song
- School of Systems Science and Engineering at the Sun Yat-Sen University
| | | | | | - Evandro Fei Fang
- Department of Clinical Molecular Biology, University of Oslo and Akershus University Hospital, Lørenskog, Norway
| | - Yuedong Yang
- School of Data and Computer Science and the National Super Computer Center at Guangzhou, Sun Yat-sen University, China
| | | |
|
36
|
Rossanez A, Dos Reis JC, Torres RDS, de Ribaupierre H. KGen: a knowledge graph generator from biomedical scientific literature. BMC Med Inform Decis Mak 2020; 20:314. [PMID: 33317512 PMCID: PMC7734730 DOI: 10.1186/s12911-020-01341-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2020] [Accepted: 11/17/2020] [Indexed: 11/26/2022] Open
Abstract
Background Knowledge is often produced from data generated in scientific investigations. An ever-growing number of scientific studies in several domains result in a massive amount of data, from which obtaining new knowledge requires computational help. One example is Alzheimer’s Disease, a life-threatening degenerative disease that is not yet curable. As the scientific community strives to better understand it and find a cure, great amounts of data have been generated, and new knowledge can be produced. A proper representation of such knowledge brings great benefits to researchers, to the scientific community, and consequently, to society. Methods In this article, we study and evaluate a semi-automatic method that generates knowledge graphs (KGs) from biomedical texts in the scientific literature. Our solution explores natural language processing techniques with the aim of extracting and representing scientific literature knowledge encoded in KGs. Our method links entities and relations represented in KGs to concepts from existing biomedical ontologies available on the Web. We demonstrate the effectiveness of our method by generating KGs from unstructured texts obtained from a set of abstracts taken from scientific papers on Alzheimer’s Disease. We involve physicians to compare our extracted triples with their manual extraction via their analysis of the abstracts. The evaluation further concerned a qualitative analysis by the physicians of the KGs generated with our software tool. Results The experimental results indicate the quality of the generated KGs. The proposed method extracts a large number of triples, showing the effectiveness of our rule-based method employed in the identification of relations in texts. In addition, ontology links are successfully obtained, which demonstrates the effectiveness of the ontology linking method proposed in this investigation.
Conclusions We demonstrate that our proposal is effective at building ontology-linked KGs representing the knowledge obtained from biomedical scientific texts. Such representation can add value to the research in various domains, enabling researchers to compare the occurrence of concepts from different studies. The KGs generated may pave the way to potential proposal of new theories based on data analysis to advance the state of the art in their research domains.
Affiliation(s)
- Anderson Rossanez
- Institute of Computing, University of Campinas, Campinas, SP, Brazil.
| | | | - Ricardo da Silva Torres
- Department of ICT and Natural Sciences, Faculty of Information Technology and Electrical Engineering, NTNU - Norwegian University of Science and Technology, Ålesund, Norway
| | | |
|
37
|
Sung HY, Chi YL. A knowledge-based system to find over-the-counter medicines for self-medication. J Biomed Inform 2020; 108:103504. [PMID: 32673790 DOI: 10.1016/j.jbi.2020.103504] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2020] [Revised: 07/05/2020] [Accepted: 07/06/2020] [Indexed: 11/30/2022]
Abstract
This study developed a medicine query system based on Semantic Web and open data especially for self-medication users to search over-the-counter (OTC) medicines. Most existing medicine query systems are based on keyword searches. If users are uncertain about the exact search words, these query systems do not offer effective help. Furthermore, most systems provide inadequate explanations of symptoms and ailments for users to use with confidence. To remedy these issues, this study builds a knowledge base to enable inference-based searches and data mashup for integrating information from across the Web. Three components were identified: (1) building an ontology model to describe the relationships between ailments and symptoms; (2) upgrading medicinal product datasets to link them with the ontology model on a semantic level; and (3) developing a data mashup to integrate web resources to help users to find references. Furthermore, the aim was to develop a web-based application that utilizes inference mechanisms to provide users with tools for interactive manipulation. A pilot experiment for skin ailments was implemented to learn the problem-solving skills of the system. Finally, two experts utilized a content validity index to rate a four-dimension 15-item scale. The evaluation results show that experts found the proposed system excellent for content validity.
Affiliation(s)
- Han-Yu Sung
- Department of Allied Health Education and Digital Learning, National Taipei University of Nursing and Health Sciences, No. 365, Mingde Road, Beitou District, Taipei City 11219, Taiwan, ROC
- Yu-Liang Chi
- Department of Information Management, Chung Yuan Christian University, No. 200, Zhongbei Road, Zhongli District, Taoyuan City 32023, Taiwan, ROC
|
38
|
Pellison FC, Rijo RPCL, Lima VC, Crepaldi NY, Bernardi FA, Galliez RM, Kritski A, Abhishek K, Alves D. Data Integration in the Brazilian Public Health System for Tuberculosis: Use of the Semantic Web to Establish Interoperability. JMIR Med Inform 2020; 8:e17176. [PMID: 32628611 PMCID: PMC7381074 DOI: 10.2196/17176] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 11/24/2019] [Revised: 02/17/2020] [Accepted: 03/22/2020] [Indexed: 11/13/2022] Open
Abstract
Background Interoperability of health information systems is a challenge due to the heterogeneity of existing systems at both the technological and semantic levels of their data. The lack of existing data about interoperability disrupts intra-unit and inter-unit medical operations as well as creates challenges in conducting studies on existing data. The goal is to exchange data while providing the same meaning for data from different sources. Objective To find ways to solve this challenge, this research paper proposes an interoperability solution for the tuberculosis treatment and follow-up scenario in Brazil using Semantic Web technology supported by an ontology. Methods The entities of the ontology were allocated under the definitions of Basic Formal Ontology. Brazilian tuberculosis applications were tagged with entities from the resulting ontology. Results An interoperability layer was developed to retrieve data with the same meaning and in a structured way enabling semantic and functional interoperability. Conclusions Health professionals could use the data gathered from several data sources to enhance the effectiveness of their actions and decisions, as shown in a practical use case to integrate tuberculosis data in the State of São Paulo.
Affiliation(s)
- Felipe Carvalho Pellison
- Bioengineering Postgraduate Program of the São Carlos School of Engineering, University of São Paulo, São Carlos, Brazil
- Rui Pedro Charters Lopes Rijo
- Polytechnic Institute of Leiria, Leiria, Portugal; Institute for Systems and Computers Engineering at Coimbra, Coimbra, Portugal; Center for Health Technology and Services Research, Porto, Portugal; Department of Social Medicine of Ribeirão Preto Medical School, University of São Paulo, Ribeirão Preto, Brazil
- Vinicius Costa Lima
- Bioengineering Postgraduate Program of the São Carlos School of Engineering, University of São Paulo, São Carlos, Brazil
- Filipe Andrade Bernardi
- Bioengineering Postgraduate Program of the São Carlos School of Engineering, University of São Paulo, São Carlos, Brazil
- Rafael Mello Galliez
- Academic Tuberculosis Program, Medical School of Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
- Afrânio Kritski
- Academic Tuberculosis Program, Medical School of Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
- Kumar Abhishek
- Department of Computer Science and Engineering, National Institute of Technology, Patna, India
- Domingos Alves
- Department of Social Medicine of Ribeirão Preto Medical School, University of São Paulo, Ribeirão Preto, Brazil
|
39
|
Abstract
Knowledge-based biomedical data science involves the design and implementation of computer systems that act as if they knew about biomedicine. Such systems depend on formally represented knowledge in computer systems, often in the form of knowledge graphs. Here we survey recent progress in systems that use formally represented knowledge to address data science problems in both clinical and biological domains, as well as progress on approaches for creating knowledge graphs. Major themes include the relationships between knowledge graphs and machine learning, the use of natural language processing to construct knowledge graphs, and the expansion of novel knowledge-based approaches to clinical and biological domains.
Affiliation(s)
- Tiffany J Callahan
- Computational Bioscience Program and Department of Pharmacology, University of Colorado Denver Anschutz Medical Campus, Aurora, Colorado 80045, USA
- Ignacio J Tripodi
- Department of Computer Science, University of Colorado, Boulder, Colorado 80309, USA
- Harrison Pielke-Lombardo
- Computational Bioscience Program and Department of Pharmacology, University of Colorado Denver Anschutz Medical Campus, Aurora, Colorado 80045, USA
- Lawrence E Hunter
- Computational Bioscience Program and Department of Pharmacology, University of Colorado Denver Anschutz Medical Campus, Aurora, Colorado 80045, USA
|
40
|
ABHD11, a new diacylglycerol lipase involved in weight gain regulation. PLoS One 2020; 15:e0234780. [PMID: 32579589 PMCID: PMC7313976 DOI: 10.1371/journal.pone.0234780] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/31/2020] [Accepted: 06/02/2020] [Indexed: 01/26/2023] Open
Abstract
The obesity epidemic continues to spread, and obesity rates are increasing worldwide. In addition to public health efforts to reduce obesity, there is a need to better understand the underlying biology to enable more effective treatment and the discovery of new pharmacological agents. Abhydrolase domain-containing protein 11 (ABHD11) is a serine hydrolase enzyme, localized in mitochondria, that can synthesize the endocannabinoid 2-arachidonoyl glycerol (2AG) in vitro. In vivo preclinical studies demonstrated that ABHD11 knock-out mice have a 2AG level similar to that of WT mice and exhibit a lean metabolic phenotype. Such mice resist weight gain in Diet Induced Obesity (DIO) studies and display normal biochemical plasma parameters. Metabolic and transcriptomic analyses on serum and tissues of ABHD11 KO mice from DIO studies show a modulation in bile salts associated with reduced intestinal fat absorption. These data suggest that modulating the ABHD11 signaling pathway could be of therapeutic value for the treatment of metabolic disorders.
|
41
|
Nicholson DN, Greene CS. Constructing knowledge graphs and their biomedical applications. Comput Struct Biotechnol J 2020; 18:1414-1428. [PMID: 32637040 PMCID: PMC7327409 DOI: 10.1016/j.csbj.2020.05.017] [Citation(s) in RCA: 76] [Impact Index Per Article: 19.0] [Received: 03/12/2020] [Revised: 05/22/2020] [Accepted: 05/23/2020] [Indexed: 12/31/2022] Open
Abstract
Knowledge graphs can support many biomedical applications. These graphs represent biomedical concepts and relationships in the form of nodes and edges. In this review, we discuss how these graphs are constructed and applied with a particular focus on how machine learning approaches are changing these processes. Biomedical knowledge graphs have often been constructed by integrating databases that were populated by experts via manual curation, but we are now seeing a more robust use of automated systems. A number of techniques are used to represent knowledge graphs, but often machine learning methods are used to construct a low-dimensional representation that can support many different applications. This representation is designed to preserve a knowledge graph's local and/or global structure. Additional machine learning methods can be applied to this representation to make predictions within genomic, pharmaceutical, and clinical domains. We frame our discussion first around knowledge graph construction and then around unifying representational learning techniques and unifying applications. Advances in machine learning for biomedicine are creating new opportunities across many domains, and we note potential avenues for future work with knowledge graphs that appear particularly promising.
Affiliation(s)
- David N. Nicholson
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, United States
- Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Childhood Cancer Data Lab, Alex’s Lemonade Stand Foundation, United States
|
42
|
Affiliation(s)
- Tianyi Fan
- College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Li Yan
- College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Zongmin Ma
- College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
|
43
|
Sima AC, Mendes de Farias T, Zbinden E, Anisimova M, Gil M, Stockinger H, Stockinger K, Robinson-Rechavi M, Dessimoz C. Enabling semantic queries across federated bioinformatics databases. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020; 2019:5614223. [PMID: 31697362 PMCID: PMC6836710 DOI: 10.1093/database/baz106] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 06/28/2019] [Revised: 08/01/2019] [Accepted: 08/02/2019] [Indexed: 11/23/2022]
Abstract
Motivation: Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data available publicly. However, the heterogeneity of the different data sources, both at the syntactic and the semantic level, still poses significant challenges for achieving interoperability among biological databases. Results: We introduce an ontology-based federated approach for data integration. We applied this approach to three heterogeneous data stores that span different areas of biological knowledge: (i) Bgee, a gene expression relational database; (ii) Orthologous Matrix (OMA), a Hierarchical Data Format 5 orthology DS; and (iii) UniProtKB, a Resource Description Framework (RDF) store containing protein sequence and functional information. To enable federated queries across these sources, we first defined a new semantic model for gene expression called GenEx. We then show how the relational data in Bgee can be expressed as a virtual RDF graph, instantiating GenEx, through dedicated relational-to-RDF mappings. By applying these mappings, Bgee data are now accessible through a public SPARQL endpoint. Similarly, the materialized RDF data of OMA, expressed in terms of the Orthology ontology, is made available in a public SPARQL endpoint. We identified and formally described intersection points (i.e. virtual links) among the three data sources. These allow performing joint queries across the data stores. Finally, we lay the groundwork to enable nontechnical users to benefit from the integrated data, by providing a natural language template-based search interface.
Affiliation(s)
- Ana Claudia Sima
- ZHAW Zurich University of Applied Sciences, Obere Kirchgasse 2, 8400 Winterthur, Switzerland; Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland; Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Tarcisio Mendes de Farias
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland; Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland; Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland
- Erich Zbinden
- ZHAW Zurich University of Applied Sciences, Obere Kirchgasse 2, 8400 Winterthur, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Maria Anisimova
- ZHAW Zurich University of Applied Sciences, Obere Kirchgasse 2, 8400 Winterthur, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Manuel Gil
- ZHAW Zurich University of Applied Sciences, Obere Kirchgasse 2, 8400 Winterthur, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Heinz Stockinger
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Kurt Stockinger
- ZHAW Zurich University of Applied Sciences, Obere Kirchgasse 2, 8400 Winterthur, Switzerland
- Marc Robinson-Rechavi
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland; Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland
- Christophe Dessimoz
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland; Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland; Department of Genetics, Evolution, and Environment, University College London, Gower St, London WC1E 6BT, UK; Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
|
44
|
Mohamed SK, Nounu A, Nováček V. Biological applications of knowledge graph embedding models. Brief Bioinform 2020; 22:1679-1693. [PMID: 32065227 DOI: 10.1093/bib/bbaa012] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 09/18/2019] [Revised: 01/10/2020] [Accepted: 01/21/2020] [Indexed: 01/04/2023] Open
Abstract
Complex biological systems are traditionally modelled as graphs of interconnected biological entities. These graphs, i.e. biological knowledge graphs, are then processed using graph exploratory approaches to perform different types of analytical and predictive tasks. Despite the high predictive accuracy of these approaches, they have limited scalability due to their dependency on time-consuming path exploratory procedures. In recent years, owing to the rapid advances of computational technologies, new approaches for modelling graphs and mining them with high accuracy and scalability have emerged. These approaches, i.e. knowledge graph embedding (KGE) models, operate by learning low-rank vector representations of graph nodes and edges that preserve the graph's inherent structure. These approaches were used to analyse knowledge graphs from different domains where they showed superior performance and accuracy compared to previous graph exploratory approaches. In this work, we study this class of models in the context of biological knowledge graphs and their different applications. We then show how KGE models can be a natural fit for representing complex biological knowledge modelled as graphs. We also discuss their predictive and analytical capabilities in different biology applications. In this regard, we present two example case studies that demonstrate the capabilities of KGE models: prediction of drug-target interactions and polypharmacy side effects. Finally, we analyse different practical considerations for KGEs, and we discuss possible opportunities and challenges related to adopting them for modelling biological systems.
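To make the idea of scoring facts with learned vectors concrete, here is a minimal sketch using a TransE-style translation score. TransE is chosen here only as a familiar example of a KGE model; the abstract does not say which models the authors studied. The entity names and the three-dimensional vectors are hypothetical and hand-picked, not learned from any data.

```python
import math

def transe_score(head, relation, tail):
    """TransE-style plausibility: negative L2 distance between head + relation and tail.

    Scores closer to zero mean the triple (head, relation, tail) is more plausible.
    """
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

# Hypothetical embeddings (illustrative values only, not trained).
drug = [0.2, 0.1, -0.3]
targets = [0.5, -0.1, 0.4]       # embedding of a "targets" relation
protein_a = [0.7, 0.0, 0.1]      # lies at drug + targets
protein_b = [-0.9, 0.8, 0.6]     # lies far from drug + targets

score_a = transe_score(drug, targets, protein_a)
score_b = transe_score(drug, targets, protein_b)

# Rank candidate tails by score, best first.
ranking = sorted(
    [("protein_a", score_a), ("protein_b", score_b)],
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

In a real setting the vectors would be trained so that observed triples score higher than corrupted ones; link prediction tasks such as drug-target interaction then reduce to ranking candidate tails in this way.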
Affiliation(s)
- Aayah Nounu
- Insight Centre for Data Analytics, NUI Galway, Galway, Ireland
- Vít Nováček
- MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol, Bristol, UK
|
45
|
Irshad O, Khan MUG. Integration and Querying of Heterogeneous Omics Semantic Annotations for Biomedical and Biomolecular Knowledge Discovery. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190409112025] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/16/2022]
Abstract
Background: Exploring various functional aspects of a biological cell system has been a focused research trend for many decades. Biologists, scientists and researchers are continuously striving to unveil the mysteries of these functional aspects to improve the health standards of life. To gain such understanding, astronomically growing, heterogeneous and geographically dispersed omics data needs to be critically analyzed. Currently, omics data is available in different types and formats through various data access interfaces. Applications which require offline and integrated data encounter a lot of data heterogeneity and global dispersion issues. Objective: To facilitate such applications in particular, heterogeneous data must be collected, integrated and warehoused in a loosely coupled way so that each molecular entity can be computationally understood independently or in association with other entities within or across the various cellular aspects. Methods: In this paper, we propose an omics data integration schema and its corresponding data warehouse system for integrating, warehousing and presenting heterogeneous and geographically dispersed omics entities according to the cellular functional aspects. Results & Conclusion: Such aspect-oriented data integration, warehousing and data access interfacing through graphical search, web services and application programming interfaces make our proposed integrated data schema and warehouse system better and more useful than other contemporary ones.
Affiliation(s)
- Omer Irshad
- Department of Computer Science & Engineering, Faculty of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan
- Muhammad Usman Ghani Khan
- Department of Computer Science & Engineering, Faculty of Electrical Engineering, The University of Engineering and Technology, Lahore, Pakistan
|
46
|
Struck A, Walsh B, Buchanan A, Lee JA, Spangler R, Stuart JM, Ellrott K. Exploring Integrative Analysis Using the BioMedical Evidence Graph. JCO Clin Cancer Inform 2020; 4:147-159. [PMID: 32097025 PMCID: PMC7049249 DOI: 10.1200/cci.19.00110] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Accepted: 01/16/2020] [Indexed: 12/22/2022] Open
Abstract
PURPOSE The analysis of cancer biology data involves extremely heterogeneous data sets, including information from RNA sequencing, genome-wide copy number, DNA methylation data reporting on epigenetic regulation, somatic mutations from whole-exome or whole-genome analyses, pathology estimates from imaging sections or subtyping, drug response or other treatment outcomes, and various other clinical and phenotypic measurements. Bringing these different resources into a common framework, with a data model that allows for complex relationships as well as dense vectors of features, will unlock integrated data set analysis. METHODS We introduce the BioMedical Evidence Graph (BMEG), a graph database and query engine for discovery and analysis of cancer biology. The BMEG is unique from other biologic data graphs in that sample-level molecular and clinical information is connected to reference knowledge bases. It combines gene expression and mutation data with drug-response experiments, pathway information databases, and literature-derived associations. RESULTS The construction of the BMEG has resulted in a graph containing > 41 million vertices and 57 million edges. The BMEG system provides a graph query-based application programming interface to enable analysis, with client code available for Python, Javascript, and R, and a server online at bmeg.io. Using this system, we have demonstrated several forms of cross-data set analysis to show the utility of the system. CONCLUSION The BMEG is an evolving resource dedicated to enabling integrative analysis. We have demonstrated queries on the system that illustrate mutation significance analysis, drug-response machine learning, patient-level knowledge-base queries, and pathway level analysis. We have compared the resulting graph to other available integrated graph systems and demonstrated the former is unique in the scale of the graph and the type of data it makes available.
Affiliation(s)
- Adam Struck
- Biomedical Engineering, Oregon Health and Science University, Portland, OR
- Brian Walsh
- Biomedical Engineering, Oregon Health and Science University, Portland, OR
- Alexander Buchanan
- Biomedical Engineering, Oregon Health and Science University, Portland, OR
- Jordan A. Lee
- Biomedical Engineering, Oregon Health and Science University, Portland, OR
- Ryan Spangler
- Biomedical Engineering, Oregon Health and Science University, Portland, OR
- Joshua M. Stuart
- Biomolecular Engineering Department, University of California, Santa Cruz, Santa Cruz, CA
- University of California Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA
- Kyle Ellrott
- Biomedical Engineering, Oregon Health and Science University, Portland, OR
|
47
|
Sima AC, Stockinger K, de Farias TM, Gil M. Semantic Integration and Enrichment of Heterogeneous Biological Databases. Methods Mol Biol 2020; 1910:655-690. [PMID: 31278681 DOI: 10.1007/978-1-4939-9074-0_22] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/25/2023]
Abstract
Biological databases are growing at an exponential rate, currently being among the major producers of Big Data, almost on par with commercial generators, such as YouTube or Twitter. While traditionally biological databases evolved as independent silos, each purposely built by a different research group in order to answer specific research questions, more recently significant efforts have been made toward integrating these heterogeneous sources into unified data access systems or interoperable systems using the FAIR principles of data sharing. Semantic Web technologies have been key enablers in this process, opening the path for new insights into the unified data, which were not visible at the level of each independent database. In this chapter, we first provide an introduction to two of the most used database models for biological data: relational databases and RDF stores. Next, we discuss ontology-based data integration, which serves to unify and enrich heterogeneous data sources. We present an extensive timeline of milestones in data integration based on Semantic Web technologies in the field of life sciences. Finally, we discuss some of the remaining challenges in making ontology-based data access (OBDA) systems easily accessible to a larger audience. In particular, we introduce natural language search interfaces, which alleviate the need for database users to be familiar with technical query languages. We illustrate the main theoretical concepts of data integration through concrete examples, using two well-known biological databases: a gene expression database, Bgee, and an orthology database, OMA.
Affiliation(s)
- Ana Claudia Sima
- ZHAW Zurich University of Applied Sciences, Winterthur, Switzerland; University of Lausanne, Lausanne, Switzerland
- Kurt Stockinger
- ZHAW Zurich University of Applied Sciences, Winterthur, Switzerland
- Tarcisio Mendes de Farias
- University of Lausanne, Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Manuel Gil
- ZHAW Zurich University of Applied Sciences, Winterthur, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
|
48
|
Liu J, Yang M, Zhang L, Zhou W. An effective biomedical data migration tool from resource description framework to JSON. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020; 2019:5538640. [PMID: 31343683 PMCID: PMC6657663 DOI: 10.1093/database/baz088] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/21/2019] [Revised: 06/06/2019] [Accepted: 06/09/2019] [Indexed: 12/28/2022]
Abstract
Resource Description Framework (RDF) is widely used for representing biomedical data in practical applications. With the increase in RDF-based applications, there is an emerging need for novel architectures that can effectively support the future RDF data explosion. Inspired by the success of the new designs in the National Center for Biotechnology Information dbSNP (The Single Nucleotide Polymorphism Database) for managing increasing data volumes using JSON (JavaScript Object Notation), in this paper we present an effective mapping tool that allows data migration from RDF to JSON to support future massive data explosions and releases. We first introduce a set of mapping rules, which transform the RDF format into the JSON format, and then present the corresponding transformation algorithm. On this basis, we develop an effective and user-friendly tool called RDF2JSON, which automates the process of RDF data extraction and the corresponding JSON data generation.
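As a rough illustration of what a rule-based RDF-to-JSON mapping can look like, the sketch below groups triples by subject and turns predicates into keys. This is a generic mapping assumed for illustration and is not the RDF2JSON rule set, which the abstract does not spell out; the function name `triples_to_json` and the `ex:` identifiers are hypothetical.

```python
import json
from collections import defaultdict

def triples_to_json(triples):
    """Group (subject, predicate, object) triples into one JSON object per subject.

    Assumed rules: each predicate becomes a key; a predicate with a single
    object maps to a scalar, and one with several objects maps to a list.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        grouped[subject][predicate].append(obj)
    return {
        subject: {
            predicate: (objects[0] if len(objects) == 1 else objects)
            for predicate, objects in predicates.items()
        }
        for subject, predicates in grouped.items()
    }

# Hypothetical triples, written as plain strings for simplicity.
triples = [
    ("ex:rs123", "ex:chromosome", "7"),
    ("ex:rs123", "ex:gene", "ex:GENE_A"),
    ("ex:rs123", "ex:gene", "ex:GENE_B"),
]

document = triples_to_json(triples)
print(json.dumps(document, indent=2))
```

A production mapping would also have to handle datatypes, language tags, and blank nodes, which this sketch ignores.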
Affiliation(s)
- Jian Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Mo Yang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Lei Zhang
- Zhejiang University of Science and Technology, Hangzhou, China
- Weijun Zhou
- Department of Hematology, Zhujiang Hospital, Southern Medical University, Guangzhou, China
|
49
|
Zong N, Wong RSN, Yu Y, Wen A, Huang M, Li N. Drug-target prediction utilizing heterogeneous bio-linked network embeddings. Brief Bioinform 2019; 22:568-580. [PMID: 31885036 DOI: 10.1093/bib/bbz147] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/27/2019] [Revised: 10/11/2019] [Accepted: 10/29/2019] [Indexed: 11/12/2022] Open
Abstract
To enable modularization for network-based prediction, we conducted a review of known methods conducting the various subtasks corresponding to the creation of a drug-target prediction framework and associated benchmarking to determine the highest-performing approaches. Accordingly, our contributions are as follows: (i) from a network perspective, we benchmarked the association-mining performance of 32 distinct subnetwork permutations, arranging based on a comprehensive heterogeneous biomedical network derived from 12 repositories; (ii) from a methodological perspective, we identified the best prediction strategy based on a review of combinations of the components with off-the-shelf classification, inference methods and graph embedding methods. Our benchmarking strategy consisted of two series of experiments, totaling six distinct tasks from the two perspectives, to determine the best prediction. We demonstrated that the proposed method outperformed the existing network-based methods as well as how combinatorial networks and methodologies can influence the prediction. In addition, we conducted disease-specific prediction tasks for 20 distinct diseases and showed the reliability of the strategy in predicting 75 novel drug-target associations as shown by a validation utilizing DrugBank 5.1.0. In particular, we revealed a connection of the network topology with the biological explanations for predicting the diseases 'Asthma', 'Hypertension', and 'Dementia'. The results of our benchmarking produced knowledge on a network-based prediction framework with the modularization of the feature selection and association prediction, which can be easily adapted and extended to other feature sources or machine learning algorithms as well as a performed baseline to comprehensively evaluate the utility of incorporating varying data sources.
Affiliation(s)
- Nansu Zong
- Department of Health Sciences Research, Mayo Clinic, 200 First St. SW, Rochester, MN 55905, USA
- Rachael Sze Nga Wong
- Department of Bioengineering, UC San Diego, 9500 Gilman Drive, San Diego, CA 92093-0412, USA
- Yue Yu
- Department of Health Sciences Research, Mayo Clinic, 200 First St. SW, Rochester, MN 55905, USA
- Andrew Wen
- Department of Health Sciences Research, Mayo Clinic, 200 First St. SW, Rochester, MN 55905, USA
- Ming Huang
- Department of Health Sciences Research, Mayo Clinic, 200 First St. SW, Rochester, MN 55905, USA
- Ning Li
- Scripps Research Institute, 10550 North Torrey Pines Road, San Diego, CA, 92037, USA
|
50
|
Nguyen DA, Nguyen CH, Mamitsuka H. A survey on adverse drug reaction studies: data, tasks and machine learning methods. Brief Bioinform 2019; 22:164-177. [PMID: 31838499 DOI: 10.1093/bib/bbz140] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Adverse drug reaction (ADR) or drug side effect studies play a crucial role in drug discovery. Recently, with the rapid increase of both clinical and non-clinical data, machine learning methods have emerged as prominent tools to support analyzing and predicting ADRs. Nonetheless, challenges remain in ADR studies. RESULTS In this paper, we summarize ADR data sources and review ADR studies in three tasks: drug-ADR benchmark data creation, drug-ADR prediction and ADR mechanism analysis. We focus on machine learning methods used in each task and then compare the performance of the methods on the drug-ADR prediction task. Finally, we discuss open problems for further ADR studies. AVAILABILITY Data and code are available at https://github.com/anhnda/ADRPModels.
Affiliation(s)
- Canh Hao Nguyen
- Bioinformatics Center, Institute for Chemical Research, Kyoto University
- Hiroshi Mamitsuka
- Bioinformatics Center, Institute for Chemical Research, Kyoto University
|