1
|
Glen AK, Ma C, Mendoza L, Womack F, Wood EC, Sinha M, Acevedo L, Kvarfordt LG, Peene RC, Liu S, Hoffman AS, Roach JC, Deutsch EW, Ramsey SA, Koslicki D. ARAX: a graph-based modular reasoning tool for translational biomedicine. Bioinformatics 2023; 39:7031241. [PMID: 36752514 PMCID: PMC10027432 DOI: 10.1093/bioinformatics/btad082] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2022] [Revised: 12/17/2022] [Accepted: 02/07/2023] [Indexed: 04/12/2023] Open
Abstract
MOTIVATION With the rapidly growing volume of knowledge and data in biomedical databases, improved methods for knowledge-graph-based computational reasoning are needed in order to answer translational questions. Previous efforts to solve such challenging computational reasoning problems have contributed tools and approaches, but progress has been hindered by the lack of an expressive analysis workflow language for translational reasoning and by the lack of a reasoning engine-supporting that language-that federates semantically integrated knowledge-bases. RESULTS We introduce ARAX, a new reasoning system for translational biomedicine that provides a web browser user interface and an application programming interface (API). ARAX enables users to encode translational biomedical questions and to integrate knowledge across sources to answer the user's query and facilitate exploration of results. For ARAX, we developed new approaches to query planning, knowledge-gathering, reasoning and result ranking and dynamically integrate knowledge providers for answering biomedical questions. To illustrate ARAX's application and utility in specific disease contexts, we present several use-case examples. AVAILABILITY AND IMPLEMENTATION The source code and technical documentation for building the ARAX server-side software and its built-in knowledge database are freely available online (https://github.com/RTXteam/RTX). We provide a hosted ARAX service with a web browser interface at arax.rtx.ai and a web API endpoint at arax.rtx.ai/api/arax/v1.3/ui/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | | | - Luis Mendoza
- Institute for Systems Biology, Seattle, WA 98109, USA
| | - Finn Womack
- Huck Institutes of the Life Sciences, Pennsylvania State University, State College, PA 16802, USA
| | - E C Wood
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331, USA
| | - Meghamala Sinha
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331, USA
| | - Liliana Acevedo
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331, USA
| | - Lindsey G Kvarfordt
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331, USA
| | - Ross C Peene
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331, USA
| | - Shaopeng Liu
- Huck Institutes of the Life Sciences, Pennsylvania State University, State College, PA 16802, USA
| | - Andrew S Hoffman
- Interdisciplinary Hub for Digitalization and Society, Radboud University, Nijmegen 6500GL, The Netherlands
| | - Jared C Roach
- Institute for Systems Biology, Seattle, WA 98109, USA
| | | | | | | |
Collapse
|
2
|
Schäfer J, Tang M, Luu D, Bergmann AK, Wiese L. Graph4Med: a web application and a graph database for visualizing and analyzing medical databases. BMC Bioinformatics 2022; 23:537. [PMID: 36503436 PMCID: PMC9743588 DOI: 10.1186/s12859-022-05092-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2022] [Accepted: 12/01/2022] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Medical databases normally contain large amounts of data in a variety of forms. Although they grant significant insights into diagnosis and treatment, implementing data exploration into current medical databases is challenging since these are often based on a relational schema and cannot be used to easily extract information for cohort analysis and visualization. As a consequence, valuable information regarding cohort distribution or patient similarity may be missed. With the rapid advancement of biomedical technologies, new forms of data from methods such as Next Generation Sequencing (NGS) or chromosome microarray (array CGH) are constantly being generated; hence it can be expected that the amount and complexity of medical data will rise and bring relational database systems to a limit. DESCRIPTION We present Graph4Med, a web application that relies on a graph database obtained by transforming a relational database. Graph4Med provides a straightforward visualization and analysis of a selected patient cohort. Our use case is a database of pediatric Acute Lymphoblastic Leukemia (ALL). Along routine patients' health records it also contains results of latest technologies such as NGS data. We developed a suitable graph data schema to convert the relational data into a graph data structure and store it in Neo4j. We used NeoDash to build a dashboard for querying and displaying patients' cohort analysis. This way our tool (1) quickly displays the overview of patients' cohort information such as distributions of gender, age, mutations (fusions), diagnosis; (2) provides mutation (fusion) based similarity search and display in a maneuverable graph; (3) generates an interactive graph of any selected patient and facilitates the identification of interesting patterns among patients. CONCLUSION We demonstrate the feasibility and advantages of a graph database for storing and querying medical databases. Our dashboard allows a fast and interactive analysis and visualization of complex medical data. It is especially useful for patients similarity search based on mutations (fusions), of which vast amounts of data have been generated by NGS in recent years. It can discover relationships and patterns in patients cohorts that are normally hard to grasp. Expanding Graph4Med to more medical databases will bring novel insights into diagnostic and research.
Collapse
Affiliation(s)
- Jero Schäfer
- grid.7839.50000 0004 1936 9721Institute of Computer Science, Goethe-Universität Frankfurt, Frankfurt am Main, Germany
| | - Ming Tang
- grid.10423.340000 0000 9529 9877Department of Human Genetics, Hannover Medical School, Hannover, Germany ,grid.9122.80000 0001 2163 2777L3S Research Center, Leibniz Universität Hannover, Hannover, Germany
| | - Danny Luu
- grid.10423.340000 0000 9529 9877Department of Human Genetics, Hannover Medical School, Hannover, Germany
| | - Anke Katharina Bergmann
- grid.10423.340000 0000 9529 9877Department of Human Genetics, Hannover Medical School, Hannover, Germany
| | - Lena Wiese
- grid.7839.50000 0004 1936 9721Institute of Computer Science, Goethe-Universität Frankfurt, Frankfurt am Main, Germany ,grid.418009.40000 0000 9191 9864Bioinformatics Group, Fraunhofer ITEM, Hannover, Germany
| |
Collapse
|
3
|
Toti D, Macari G, Barbierato E, Polticelli F. FGDB: a comprehensive graph database of ligand fragments from the Protein Data Bank. Database (Oxford) 2022; 2022:6619197. [PMID: 35763362 PMCID: PMC9239314 DOI: 10.1093/database/baac044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2022] [Revised: 05/06/2022] [Accepted: 05/31/2022] [Indexed: 11/22/2022]
Abstract
This work presents Fragment Graph DataBase (FGDB), a graph database of ligand fragments extracted and generated from the protein entries available in the Protein Data Bank (PDB). FGDB is meant to support and elicit campaigns of fragment-based drug design, by enabling users to query it in order to construct ad hoc, target-specific libraries. In this regard, the database features more than 17 000 fragments, typically small, highly soluble and chemically stable molecules expressed via their canonical Simplified Molecular Input Line Entry System (SMILES) representation. For these fragments, the database provides information related to their contact frequencies with the amino acids, the ligands they are contained in and the proteins the latter bind to. The graph database can be queried via standard web forms and textual searches by a number of identifiers (SMILES, ligand and protein PDB ids) as well as via graphical queries that can be performed against the graph itself, providing users with an intuitive and effective view upon the underlying biological entities. Further search mechanisms via advanced conjunctive/disjunctive/negated textual queries are also possible, in order to allow scientists to look for specific relationships and export their results for further studies. This work also presents two sample use cases where maternal embryonic leucine zipper kinase and mesotrypsin are used as a target, being proteins of high biomedical relevance for the development of cancer therapies. Database URL: http://biochimica3.bio.uniroma3.it/fragments-web/
Collapse
Affiliation(s)
- Daniele Toti
- Department of Mathematics and Physics, Catholic University of the Sacred Heart, Faculty of Mathematical, Physical and Natural Sciences , via della Garzetta 48, Brescia 25133, Italy
| | - Gabriele Macari
- Department of Sciences, Roma Tre University , viale Marconi 446, Roma, Lazio 00146, Italy
| | - Enrico Barbierato
- Department of Mathematics and Physics, Catholic University of the Sacred Heart, Faculty of Mathematical, Physical and Natural Sciences , via della Garzetta 48, Brescia 25133, Italy
| | - Fabio Polticelli
- Department of Sciences, Roma Tre University , viale Marconi 446, Roma, Lazio 00146, Italy
- Roma Tre Section, National Institute of Nuclear Physics , via della Vasca Navale 84, Roma 00146, Italy
| |
Collapse
|
4
|
|
5
|
Doğan T, Atas H, Joshi V, Atakan A, Rifaioglu A, Nalbat E, Nightingale A, Saidi R, Volynkin V, Zellner H, Cetin-Atalay R, Martin M, Atalay V. CROssBAR: comprehensive resource of biomedical relations with knowledge graph representations. Nucleic Acids Res 2021; 49:e96. [PMID: 34181736 PMCID: PMC8450100 DOI: 10.1093/nar/gkab543] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2020] [Revised: 04/11/2021] [Accepted: 06/10/2021] [Indexed: 12/11/2022] Open
Abstract
Systemic analysis of available large-scale biological/biomedical data is critical for studying biological mechanisms, and developing novel and effective treatment approaches against diseases. However, different layers of the available data are produced using different technologies and scattered across individual computational resources without any explicit connections to each other, which hinders extensive and integrative multi-omics-based analysis. We aimed to address this issue by developing a new data integration/representation methodology and its application by constructing a biological data resource. CROssBAR is a comprehensive system that integrates large-scale biological/biomedical data from various resources and stores them in a NoSQL database. CROssBAR is enriched with the deep-learning-based prediction of relationships between numerous data entries, which is followed by the rigorous analysis of the enriched data to obtain biologically meaningful modules. These complex sets of entities and relationships are displayed to users via easy-to-interpret, interactive knowledge graphs within an open-access service. CROssBAR knowledge graphs incorporate relevant genes-proteins, molecular interactions, pathways, phenotypes, diseases, as well as known/predicted drugs and bioactive compounds, and they are constructed on-the-fly based on simple non-programmatic user queries. These intensely processed heterogeneous networks are expected to aid systems-level research, especially to infer biological mechanisms in relation to genes, proteins, their ligands, and diseases.
Collapse
Affiliation(s)
- Tunca Doğan
- Department of Computer Engineering, Hacettepe University, Ankara 06800, Turkey
- Institute of Informatics, Hacettepe University, Ankara 06800, Turkey
- Cancer Systems Biology Laboratory, Graduate School of Informatics, METU, Ankara 06800, Turkey
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
| | - Heval Atas
- Cancer Systems Biology Laboratory, Graduate School of Informatics, METU, Ankara 06800, Turkey
| | - Vishal Joshi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
| | - Ahmet Atakan
- Department of Computer Engineering, METU, Ankara 06800, Turkey
- Department of Computer Engineering, EBYU, Erzincan 24002, Turkey
| | - Ahmet Sureyya Rifaioglu
- Department of Computer Engineering, METU, Ankara 06800, Turkey
- Department of Computer Engineering, İskenderun Technical University, Hatay 31200, Turkey
| | - Esra Nalbat
- Cancer Systems Biology Laboratory, Graduate School of Informatics, METU, Ankara 06800, Turkey
| | - Andrew Nightingale
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
| | - Rabie Saidi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
| | - Vladimir Volynkin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
| | - Hermann Zellner
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
| | - Rengul Cetin-Atalay
- Cancer Systems Biology Laboratory, Graduate School of Informatics, METU, Ankara 06800, Turkey
- Section of Pulmonary and Critical Care Medicine, University of Chicago, Chicago, IL 60637, USA
| | - Maria Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
| | - Volkan Atalay
- Department of Computer Engineering, METU, Ankara 06800, Turkey
| |
Collapse
|
6
|
Hassani‐Pak K, Singh A, Brandizi M, Hearnshaw J, Parsons JD, Amberkar S, Phillips AL, Doonan JH, Rawlings C. KnetMiner: a comprehensive approach for supporting evidence-based gene discovery and complex trait analysis across species. PLANT BIOTECHNOLOGY JOURNAL 2021; 19:1670-1678. [PMID: 33750020 PMCID: PMC8384599 DOI: 10.1111/pbi.13583] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/30/2020] [Revised: 12/17/2020] [Accepted: 03/16/2021] [Indexed: 05/03/2023]
Abstract
The generation of new ideas and scientific hypotheses is often the result of extensive literature and database searches, but, with the growing wealth of public and private knowledge, the process of searching diverse and interconnected data to generate new insights into genes, gene networks, traits and diseases is becoming both more complex and more time-consuming. To guide this technically challenging data integration task and to make gene discovery and hypotheses generation easier for researchers, we have developed a comprehensive software package called KnetMiner which is open-source and containerized for easy use. KnetMiner is an integrated, intelligent, interactive gene and gene network discovery platform that supports scientists explore and understand the biological stories of complex traits and diseases across species. It features fast algorithms for generating rich interactive gene networks and prioritizing candidate genes based on knowledge mining approaches. KnetMiner is used in many plant science institutions and has been adopted by several plant breeding organizations to accelerate gene discovery. The software is generic and customizable and can therefore be readily applied to new species and data types; for example, it has been applied to pest insects and fungal pathogens; and most recently repurposed to support COVID-19 research. Here, we give an overview of the main approaches behind KnetMiner and we report plant-centric case studies for identifying genes, gene networks and trait relationships in Triticum aestivum (bread wheat), as well as, an evidence-based approach to rank candidate genes under a large Arabidopsis thaliana QTL. KnetMiner is available at: https://knetminer.org.
Collapse
|
7
|
Timón-Reina S, Rincón M, Martínez-Tomás R. An overview of graph databases and their applications in the biomedical domain. Database (Oxford) 2021; 2021:baab026. [PMID: 34003247 PMCID: PMC8130509 DOI: 10.1093/database/baab026] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2020] [Revised: 03/24/2021] [Accepted: 04/30/2021] [Indexed: 01/18/2023]
Abstract
Over the past couple of decades, the explosion of densely interconnected data has stimulated the research, development and adoption of graph database technologies. From early graph models to more recent native graph databases, the landscape of implementations has evolved to cover enterprise-ready requirements. Because of the interconnected nature of its data, the biomedical domain has been one of the early adopters of graph databases, enabling more natural representation models and better data integration workflows, exploration and analysis facilities. In this work, we survey the literature to explore the evolution, performance and how the most recent graph database solutions are applied in the biomedical domain, compiling a great variety of use cases. With this evidence, we conclude that the available graph database management systems are fit to support data-intensive, integrative applications, targeted at both basic research and exploratory tasks closer to the clinic.
Collapse
Affiliation(s)
- Santiago Timón-Reina
- Departamento de Inteligencia Artificial, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal, 16 Ciudad Universitaria, Madrid 28040, Spain
| | - Mariano Rincón
- Departamento de Inteligencia Artificial, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal, 16 Ciudad Universitaria, Madrid 28040, Spain
| | - Rafael Martínez-Tomás
- Departamento de Inteligencia Artificial, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal, 16 Ciudad Universitaria, Madrid 28040, Spain
| |
Collapse
|
8
|
D’Agostino D, Liò P, Aldinucci M, Merelli I. Advantages of using graph databases to explore chromatin conformation capture experiments. BMC Bioinformatics 2021; 22:43. [PMID: 33902433 PMCID: PMC8073886 DOI: 10.1186/s12859-020-03937-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2020] [Accepted: 12/15/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND High-throughput sequencing Chromosome Conformation Capture (Hi-C) allows the study of DNA interactions and 3D chromosome folding at the genome-wide scale. Usually, these data are represented as matrices describing the binary contacts among the different chromosome regions. On the other hand, a graph-based representation can be advantageous to describe the complex topology achieved by the DNA in the nucleus of eukaryotic cells. METHODS Here we discuss the use of a graph database for storing and analysing data achieved by performing Hi-C experiments. The main issue is the size of the produced data and, working with a graph-based representation, the consequent necessity of adequately managing a large number of edges (contacts) connecting nodes (genes), which represents the sources of information. For this, currently available graph visualisation tools and libraries fall short with Hi-C data. The use of graph databases, instead, supports both the analysis and the visualisation of the spatial pattern present in Hi-C data, in particular for comparing different experiments or for re-mapping omics data in a space-aware context efficiently. In particular, the possibility of describing graphs through statistical indicators and, even more, the capability of correlating them through statistical distributions allows highlighting similarities and differences among different Hi-C experiments, in different cell conditions or different cell types. RESULTS These concepts have been implemented in NeoHiC, an open-source and user-friendly web application for the progressive visualisation and analysis of Hi-C networks based on the use of the Neo4j graph database (version 3.5). CONCLUSION With the accumulation of more experiments, the tool will provide invaluable support to compare neighbours of genes across experiments and conditions, helping in highlighting changes in functional domains and identifying new co-organised genomic compartments.
Collapse
Affiliation(s)
- Daniele D’Agostino
- Institute of Electronics, Computer and Telecommunication Engineering, National Research Council of Italy, Genoa, Italy
| | - Pietro Liò
- Computer Laboratory, University of Cambridge, Cambridge, UK
| | - Marco Aldinucci
- Computer Science Department, University of Turin, Turin, Italy
| | - Ivan Merelli
- Institute for Biomedical Technologies, National Research Council of Italy, Segrate, MI Italy
| |
Collapse
|
9
|
Queralt-Rosinach N, Stupp GS, Li TS, Mayers M, Hoatlin ME, Might M, Good BM, Su AI. Structured reviews for data and knowledge-driven research. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2021; 2020:5818923. [PMID: 32283553 PMCID: PMC7153956 DOI: 10.1093/database/baaa015] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/12/2019] [Revised: 01/21/2020] [Accepted: 02/07/2020] [Indexed: 12/25/2022]
Abstract
Hypothesis generation is a critical step in research and a cornerstone in the rare disease field. Research is most efficient when those hypotheses are based on the entirety of knowledge known to date. Systematic review articles are commonly used in biomedicine to summarize existing knowledge and contextualize experimental data. But the information contained within review articles is typically only expressed as free-text, which is difficult to use computationally. Researchers struggle to navigate, collect and remix prior knowledge as it is scattered in several silos without seamless integration and access. This lack of a structured information framework hinders research by both experimental and computational scientists. To better organize knowledge and data, we built a structured review article that is specifically focused on NGLY1 Deficiency, an ultra-rare genetic disease first reported in 2012. We represented this structured review as a knowledge graph and then stored this knowledge graph in a Neo4j database to simplify dissemination, querying and visualization of the network. Relative to free-text, this structured review better promotes the principles of findability, accessibility, interoperability and reusability (FAIR). In collaboration with domain experts in NGLY1 Deficiency, we demonstrate how this resource can improve the efficiency and comprehensiveness of hypothesis generation. We also developed a read–write interface that allows domain experts to contribute FAIR structured knowledge to this community resource. In contrast to traditional free-text review articles, this structured review exists as a living knowledge graph that is curated by humans and accessible to computational analyses. Finally, we have generalized this workflow into modular and repurposable components that can be applied to other domain areas. This NGLY1 Deficiency-focused network is publicly available at http://ngly1graph.org/. Availability and implementation Database URL: http://ngly1graph.org/. Network data files are at: https://github.com/SuLab/ngly1-graph and source code at: https://github.com/SuLab/bioknowledge-reviewer. Contact asu@scripps.edu
Collapse
Affiliation(s)
- Núria Queralt-Rosinach
- Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd. La Jolla, CA 92037, USA
| | - Gregory S Stupp
- Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd. La Jolla, CA 92037, USA
| | - Tong Shu Li
- Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd. La Jolla, CA 92037, USA
| | - Michael Mayers
- Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd. La Jolla, CA 92037, USA
| | - Maureen E Hoatlin
- Department of Biochemistry and Molecular Biology, Oregon Health and Science University, 3181 SW Sam Jackson Parkway, Portland, OR 97239, USA
| | - Matthew Might
- Department of Medicine, Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, 510 20th St S, Birmingham, AL 35210, USA
| | - Benjamin M Good
- Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd. La Jolla, CA 92037, USA
| | - Andrew I Su
- Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd. La Jolla, CA 92037, USA
| |
Collapse
|
10
|
Mei S, Huang X, Xie C, Mora A. GREG-studying transcriptional regulation using integrative graph databases. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020; 2020:5735477. [PMID: 32055858 PMCID: PMC7018612 DOI: 10.1093/database/baz162] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/21/2019] [Revised: 12/08/2019] [Accepted: 12/31/2019] [Indexed: 01/08/2023]
Abstract
A gene regulatory process is the result of the concerted action of transcription factors, co-factors, regulatory non-coding RNAs (ncRNAs) and chromatin interactions. Therefore, the combination of protein-DNA, protein-protein, ncRNA-DNA, ncRNA-protein and DNA-DNA data in a single graph database offers new possibilities regarding generation of biological hypotheses. GREG (The Gene Regulation Graph Database) is an integrative database and web resource that allows the user to visualize and explore the network of all above-mentioned interactions for a query transcription factor, long non-coding RNA, genomic range or DNA annotation, as well as extracting node and interaction information, identifying connected nodes and performing advanced graphical queries directly on the regulatory network, in a simple and efficient way. In this article, we introduce GREG together with some application examples (including exploratory research of Nanog's regulatory landscape and the etiology of chronic obstructive pulmonary disease), which we use as a demonstration of the advantages of using graph databases in biomedical research. Database URL: https://mora-lab.github.io/projects/greg.html, www.moralab.science/GREG/.
Collapse
Affiliation(s)
- Songqing Mei
- School of Basic Medical Sciences, Guangzhou Medical University, Panyu Campus of Guangzhou Medical University, Xinzao, 511436 Guangzhou, P.R. China
| | - Xiaowei Huang
- Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of Biomedicine and Health (Chinese Academy of Sciences), Panyu Campus of Guangzhou Medical University, Xinzao, 511436 Guangzhou, P.R. China
| | - Chengshu Xie
- Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of Biomedicine and Health (Chinese Academy of Sciences), Panyu Campus of Guangzhou Medical University, Xinzao, 511436 Guangzhou, P.R. China
| | - Antonio Mora
- Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of Biomedicine and Health (Chinese Academy of Sciences), Panyu Campus of Guangzhou Medical University, Xinzao, 511436 Guangzhou, P.R. China
| |
Collapse
|
11
|
Mrozek D. A review of Cloud computing technologies for comprehensive microRNA analyses. Comput Biol Chem 2020; 88:107365. [PMID: 32906056 DOI: 10.1016/j.compbiolchem.2020.107365] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2020] [Revised: 08/05/2020] [Accepted: 08/18/2020] [Indexed: 01/08/2023]
Abstract
Cloud computing revolutionized many fields that require ample computational power. Cloud platforms may also provide huge support for microRNA analysis mainly through disclosing scalable resources of different types. In Clouds, these resources are available as services, which simplifies their allocation and releasing. This feature is especially useful during the analysis of large volumes of data, like the one produced by next generation sequencing experiments, which require not only extended storage space but also a distributed computing environment. In this paper, we show which of the Cloud properties and service models can be especially beneficial for microRNA analysis. We also explain the most useful services of the Cloud (including storage space, computational power, web application hosting, machine learning models, and Big Data frameworks) that can be used for microRNA analysis. At the same time, we review several solutions for microRNA and show that the utilization of the Cloud in this field is still weak, but can increase in the future when the awareness of their applicability grows.
Collapse
Affiliation(s)
- Dariusz Mrozek
- Department of Applied Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland.
| |
Collapse
|
12
|
Aguilera-Mendoza L, Marrero-Ponce Y, Beltran JA, Tellez Ibarra R, Guillen-Ramirez HA, Brizuela CA. Graph-based data integration from bioactive peptide databases of pharmaceutical interest: toward an organized collection enabling visual network analysis. Bioinformatics 2020; 35:4739-4747. [PMID: 30994884 DOI: 10.1093/bioinformatics/btz260] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2018] [Revised: 03/30/2019] [Accepted: 04/10/2019] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Bioactive peptides have gained great attention in the academy and pharmaceutical industry since they play an important role in human health. However, the increasing number of bioactive peptide databases is causing the problem of data redundancy and duplicated efforts. Even worse is the fact that the available data is non-standardized and often dirty with data entry errors. Therefore, there is a need for a unified view that enables a more comprehensive analysis of the information on this topic residing at different sites. RESULTS After collecting web pages from a large variety of bioactive peptide databases, we organized the web content into an integrated graph database (starPepDB) that holds a total of 71 310 nodes and 348 505 relationships. In this graph structure, there are 45 120 nodes representing peptides, and the rest of the nodes are connected to peptides for describing metadata. Additionally, to facilitate a better understanding of the integrated data, a software tool (starPep toolbox) has been developed for supporting visual network analysis in a user-friendly way; providing several functionalities such as peptide retrieval and filtering, network construction and visualization, interactive exploration and exporting data options. AVAILABILITY AND IMPLEMENTATION Both starPepDB and starPep toolbox are freely available at http://mobiosd-hub.com/starpep/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Longendri Aguilera-Mendoza
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Mexico.,Grupo de Investigación de Bioinformática, Universidad de las Ciencias Informáticas (UCI), CP 17100, La Habana, Cuba
| | - Yovani Marrero-Ponce
- Universidad San Francisco de Quito (USFQ), Grupo de Medicina Molecular y Translacional (MeM&T), Colegio de Ciencias de la Salud (COCSA), Escuela de Medicina, Edificio de Especialidades Médicas, CP 170901, Quito, Pichincha, Ecuador.,Grupo de Investigación Ambiental (GIA), Programas Ambientales, Facultad de Ingenierías, Fundacion Universitaria Tecnologico Comfenalco - Cartagena, Cr 44 D N° 30A - 91, CP 130015, Cartagena, Bolívar, Colombia
| | - Jesus A Beltran
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Mexico
| | - Roberto Tellez Ibarra
- Grupo de Investigación de Bioinformática, Universidad de las Ciencias Informáticas (UCI), CP 17100, La Habana, Cuba
| | - Hugo A Guillen-Ramirez
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Mexico
| | - Carlos A Brizuela
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Mexico
| |
Collapse
|
13
|
Struck A, Walsh B, Buchanan A, Lee JA, Spangler R, Stuart JM, Ellrott K. Exploring Integrative Analysis Using the BioMedical Evidence Graph. JCO Clin Cancer Inform 2020; 4:147-159. [PMID: 32097025 PMCID: PMC7049249 DOI: 10.1200/cci.19.00110] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/16/2020] [Indexed: 12/22/2022] Open
Abstract
PURPOSE The analysis of cancer biology data involves extremely heterogeneous data sets, including information from RNA sequencing, genome-wide copy number, DNA methylation data reporting on epigenetic regulation, somatic mutations from whole-exome or whole-genome analyses, pathology estimates from imaging sections or subtyping, drug response or other treatment outcomes, and various other clinical and phenotypic measurements. Bringing these different resources into a common framework, with a data model that allows for complex relationships as well as dense vectors of features, will unlock integrated data set analysis. METHODS We introduce the BioMedical Evidence Graph (BMEG), a graph database and query engine for discovery and analysis of cancer biology. The BMEG is unique from other biologic data graphs in that sample-level molecular and clinical information is connected to reference knowledge bases. It combines gene expression and mutation data with drug-response experiments, pathway information databases, and literature-derived associations. RESULTS The construction of the BMEG has resulted in a graph containing > 41 million vertices and 57 million edges. The BMEG system provides a graph query-based application programming interface to enable analysis, with client code available for Python, Javascript, and R, and a server online at bmeg.io. Using this system, we have demonstrated several forms of cross-data set analysis to show the utility of the system. CONCLUSION The BMEG is an evolving resource dedicated to enabling integrative analysis. We have demonstrated queries on the system that illustrate mutation significance analysis, drug-response machine learning, patient-level knowledge-base queries, and pathway level analysis. We have compared the resulting graph to other available integrated graph systems and demonstrated the former is unique in the scale of the graph and the type of data it makes available.
Collapse
Affiliation(s)
- Adam Struck
- Biomedical Engineering, Oregon Health and Science University, Portland OR
| | - Brian Walsh
- Biomedical Engineering, Oregon Health and Science University, Portland OR
| | - Alexander Buchanan
- Biomedical Engineering, Oregon Health and Science University, Portland OR
| | - Jordan A. Lee
- Biomedical Engineering, Oregon Health and Science University, Portland OR
| | - Ryan Spangler
- Biomedical Engineering, Oregon Health and Science University, Portland OR
| | - Joshua M. Stuart
- Biomolecular Engineering Department, University of California, Santa Cruz, Santa Cruz, CA
- University of California Santa Cruz Genomics Institute, University of California, Santa Cruz Santa Cruz, CA
| | - Kyle Ellrott
- Biomedical Engineering, Oregon Health and Science University, Portland OR
| |
Collapse
|
14
|
|
15
|
3D-PP: A Tool for Discovering Conserved Three-Dimensional Protein Patterns. Int J Mol Sci 2019; 20:ijms20133174. [PMID: 31261733 PMCID: PMC6651053 DOI: 10.3390/ijms20133174] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2019] [Revised: 06/19/2019] [Accepted: 06/20/2019] [Indexed: 01/25/2023] Open
Abstract
Discovering conserved three-dimensional (3D) patterns among protein structures may provide valuable insights into protein classification, functional annotations or the rational design of multi-target drugs. Thus, several computational tools have been developed to discover and compare protein 3D-patterns. However, most of them only consider previously known 3D-patterns such as orthosteric binding sites or structural motifs. This fact makes necessary the development of new methods for the identification of all possible 3D-patterns that exist in protein structures (allosteric sites, enzyme-cofactor interaction motifs, among others). In this work, we present 3D-PP, a new free access web server for the discovery and recognition all similar 3D amino acid patterns among a set of proteins structures (independent of their sequence similarity). This new tool does not require any previous structural knowledge about ligands, and all data are organized in a high-performance graph database. The input can be a text file with the PDB access codes or a zip file of PDB coordinates regardless of the origin of the structural data: X-ray crystallographic experiments or in silico homology modeling. The results are presented as lists of sequence patterns that can be further analyzed within the web page. We tested the accuracy and suitability of 3D-PP using two sets of proteins coming from the Protein Data Bank: (a) Zinc finger containing and (b) Serotonin target proteins. We also evaluated its usefulness for the discovering of new 3D-patterns, using a set of protein structures coming from in silico homology modeling methodologies, all of which are overexpressed in different types of cancer. Results indicate that 3D-PP is a reliable, flexible and friendly-user tool to identify conserved structural motifs, which could be relevant to improve the knowledge about protein function or classification. The web server can be freely utilized at https://appsbio.utalca.cl/3d-pp/.
Collapse
|