1
Mulero-Hernández J, Mironov V, Miñarro-Giménez JA, Kuiper M, Fernández-Breis J. Integration of chromosome locations and functional aspects of enhancers and topologically associating domains in knowledge graphs enables versatile queries about gene regulation. Nucleic Acids Res 2024; 52:e69. [PMID: 38967009; PMCID: PMC11347148; DOI: 10.1093/nar/gkae566] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Revised: 06/12/2024] [Accepted: 06/19/2024] [Indexed: 07/06/2024]
Abstract
Knowledge about transcription factor binding and regulation, target genes, cis-regulatory modules and topologically associating domains is not only defined by functional associations like biological processes or diseases but also has a determinative genome location aspect. Here, we exploit these location and functional aspects together to develop new strategies to enable advanced data querying. Many databases have been developed to provide information about enhancers, but a schema that allows the standardized representation of data, securing interoperability between resources, has been lacking. In this work, we use knowledge graphs for the standardized representation of enhancers and topologically associating domains, together with data about their target genes, transcription factors, location on the human genome, and functional data about diseases and gene ontology annotations. We used this schema to integrate twenty-five enhancer datasets and two domain datasets, creating the most powerful integrative resource in this field to date. The knowledge graphs have been implemented using the Resource Description Framework and integrated within the open-access BioGateway knowledge network, generating a resource that contains an interoperable set of knowledge graphs (enhancers, TADs, genes, proteins, diseases, GO terms, and interactions between domains). We show how advanced queries, which combine functional and location restrictions, can be used to develop new hypotheses about functional aspects of gene expression regulation.
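The combination of location and functional restrictions described in this abstract can be illustrated with a minimal Python sketch. It does not use the BioGateway data or its RDF schema: the regions, gene names and GO annotations below are invented toy values, and `Region`, `overlaps` and `enhancers_in_tad_with_term` are hypothetical names chosen for this example. The query keeps enhancers that overlap a given TAD (location restriction) and whose target gene carries a given GO term (functional restriction).

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a: Region, b: Region) -> bool:
    """True when both regions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


# Toy data, invented for illustration (not taken from the paper).
enhancers = [
    Region("enh1", "chr1", 100, 200),
    Region("enh2", "chr1", 5000, 5200),
    Region("enh3", "chr2", 100, 200),
]
tads = [Region("tadA", "chr1", 1, 1000)]

# Hypothetical enhancer -> target gene and gene -> GO term annotations.
targets = {"enh1": "GENE_A", "enh2": "GENE_B", "enh3": "GENE_A"}
go_terms = {"GENE_A": {"GO:0006355"}, "GENE_B": {"GO:0008150"}}


def enhancers_in_tad_with_term(tad_name: str, term: str) -> list[str]:
    """Enhancers located in the TAD whose target gene is annotated with the term."""
    tad = next(t for t in tads if t.name == tad_name)
    return [
        e.name
        for e in enhancers
        if overlaps(e, tad)
        and term in go_terms.get(targets.get(e.name, ""), set())
    ]


result = enhancers_in_tad_with_term("tadA", "GO:0006355")
print(result)
```

In an RDF setting the same two restrictions would typically be expressed as graph patterns plus a numeric filter on coordinates; the in-memory version here only shows the logic of the query.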
Affiliation(s)
- Juan Mulero-Hernández
  - Departamento de Informática y Sistemas, Universidad de Murcia, CEIR Campus Mare Nostrum, Instituto Murciano de Investigación Biosanitaria (IMIB), 30100 Murcia, Spain
- Vladimir Mironov
  - Department of Biology, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway
- José Antonio Miñarro-Giménez
  - Departamento de Informática y Sistemas, Universidad de Murcia, CEIR Campus Mare Nostrum, Instituto Murciano de Investigación Biosanitaria (IMIB), 30100 Murcia, Spain
- Martin Kuiper
  - Department of Biology, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway
- Jesualdo Tomás Fernández-Breis
  - Departamento de Informática y Sistemas, Universidad de Murcia, CEIR Campus Mare Nostrum, Instituto Murciano de Investigación Biosanitaria (IMIB), 30100 Murcia, Spain
2
Louarn M, Collet G, Barré È, Fest T, Dameron O, Siegel A, Chatonnet F. Regulus infers signed regulatory relations from few samples' information using discretization and likelihood constraints. PLoS Comput Biol 2024; 20:e1011816. [PMID: 38252636; PMCID: PMC10833539; DOI: 10.1371/journal.pcbi.1011816] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2022] [Revised: 02/01/2024] [Accepted: 01/08/2024] [Indexed: 01/24/2024]
Abstract
MOTIVATION: Transcriptional regulation is performed by transcription factors (TF) binding to DNA in context-dependent regulatory regions and determines the activation or inhibition of gene expression. Current methods for inferring transcriptional regulatory circuits, based on one or all of TF, region and gene activity measurements, require a large number of samples for ranking the candidate TF-gene regulation relations and rarely predict whether they are activations or inhibitions. We hypothesize that transcriptional regulatory circuits can be inferred from fewer samples by (1) fully integrating information on TF binding, gene expression and regulatory region accessibility, (2) reducing data complexity and (3) using biology-based likelihood constraints to determine the global consistency between a candidate TF-gene relation and patterns of gene expression and region activation, as well as to qualify regulations as activations or inhibitions.
RESULTS: We introduce Regulus, a method which computes TF-gene relations from gene expressions, regulatory region activities and TF binding site data, together with the genomic locations of all entities. After aggregating gene expressions and region activities into patterns, data are integrated into an RDF (Resource Description Framework) endpoint. A dedicated SPARQL (SPARQL Protocol and RDF Query Language) query retrieves all potential relations between expressed TF and genes involving active regulatory regions. These TF-region-gene relations are then filtered using biological likelihood constraints that make it possible to qualify them as activation or inhibition. Regulus provides signed relations consistent with public databases and, when applied to biological data, identifies both known and potential new regulators.
Regulus is devoted to context-specific transcriptional circuits inference in human settings where samples are scarce and cell populations are closely related, using discretization into patterns and likelihood reasoning to decipher the most robust regulatory relations.
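The steps summarised in this abstract (discretized patterns, a join over TF-region-gene relations, then a filter that assigns a sign) can be sketched in a few lines of Python. This is a simplified stand-in, not the Regulus implementation: the abstract does not spell out the likelihood constraints, so the rule used here (identical patterns suggest activation, exactly opposite patterns suggest inhibition, anything else is discarded) is an assumption made for illustration, as are all identifiers and data.

```python
# Discretized activity patterns across three hypothetical cell populations:
# +1 = high, 0 = medium, -1 = low. All values are invented toy data.
tf_pattern = {"TF1": (1, 0, -1), "TF2": (-1, 0, 1), "TF3": (0, 1, 0)}
gene_pattern = {"G1": (1, 0, -1)}

# Hypothetical binding and proximity relations.
binds = {"TF1": ["R1"], "TF2": ["R1"], "TF3": ["R1"]}  # TF -> bound regions
region_gene = {"R1": ["G1"]}                            # region -> nearby genes


def candidate_relations():
    """Join TF -> region -> gene to enumerate candidate relations."""
    for tf, regions in binds.items():
        for region in regions:
            for gene in region_gene.get(region, []):
                yield tf, region, gene


def sign(tf, gene):
    """Assumed toy rule: same pattern = activation, mirrored = inhibition."""
    a, b = tf_pattern[tf], gene_pattern[gene]
    if a == b:
        return "activation"
    if a == tuple(-x for x in b):
        return "inhibition"
    return None  # inconsistent with both hypotheses, so the relation is dropped


signed = []
for tf, region, gene in candidate_relations():
    label = sign(tf, gene)
    if label is not None:
        signed.append((tf, region, gene, label))

print(signed)
```

The point of the sketch is the shape of the pipeline: discretization shrinks each entity to a short pattern, which is what makes consistency checks feasible with few samples.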
Affiliation(s)
- Marine Louarn
  - Univ Rennes, CNRS, Inria, IRISA - UMR 6074, Rennes, France
  - UMR_S 1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, France
- Ève Barré
  - Univ Rennes, CNRS, Inria, IRISA - UMR 6074, Rennes, France
- Thierry Fest
  - UMR_S 1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, France
  - Laboratoire d’Hématologie, Pôle de Biologie, CHU de Rennes, Rennes, France
- Anne Siegel
  - Univ Rennes, CNRS, Inria, IRISA - UMR 6074, Rennes, France
- Fabrice Chatonnet
  - UMR_S 1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, France
  - Laboratoire d’Hématologie, Pôle de Biologie, CHU de Rennes, Rennes, France
3
Falda M, Atzori M, Corbetta M. Semantic wikis as flexible database interfaces for biomedical applications. Sci Rep 2023; 13:1095. [PMID: 36658254; PMCID: PMC9851594; DOI: 10.1038/s41598-023-27743-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2022] [Accepted: 01/06/2023] [Indexed: 01/20/2023]
Abstract
Several challenges prevent extracting knowledge from biomedical resources, including data heterogeneity and the difficulty medical doctors face in obtaining and collaborating on data and annotations. Therefore, flexibility in their representation and interconnection is required; it is also essential to be able to interact easily with such data. In recent years, semantic tools have been developed: semantic wikis are collections of wiki pages that can be annotated with properties and so combine flexibility and expressiveness, two desirable aspects when modeling databases, especially in the dynamic biomedical domain. However, semantic and collaborative analysis of biomedical data is still an unsolved challenge. The aim of this work is to create a tool for easing the design and the setup of semantic databases and to give the possibility to enrich them with biostatistical applications. As a side effect, this will also make them reproducible, fostering their application by other research groups. Command-line software has been developed for creating all structures required by Semantic MediaWiki. In addition, a way to expose statistical analyses as R Shiny applications in the interface is provided, along with a facility to export Prolog predicates for reasoning with external tools. The software made it possible to create a set of biomedical databases for the Neuroscience Department of the University of Padova in a more automated way. They can be extended with additional qualitative and statistical analyses of data, including for instance regressions, geographical distribution of diseases, and clustering. The software is released as open-source code under the GPL-3 license at https://github.com/mfalda/tsv2swm.
Affiliation(s)
- Marco Falda
  - Neuroscience Department, University of Padova, Padova, Italy
- Manfredo Atzori
  - Neuroscience Department, University of Padova, Padova, Italy
  - Institute of Information Systems, University of Applied Sciences Western Switzerland (HES-SO Valais), Sierre, Switzerland
  - Padova Neuroscience Center (PNC), Clinica Neurologica, and Venetian Institute of Molecular Medicine, VIMM, Padova, Italy
- Maurizio Corbetta
  - Neuroscience Department, University of Padova, Padova, Italy
  - Padova Neuroscience Center (PNC), Clinica Neurologica, and Venetian Institute of Molecular Medicine, VIMM, Padova, Italy
  - Department of Neurology, Radiology, Neuroscience, Washington University School of Medicine, St. Louis, MO, USA
4
Lou P, Wang C, Guo R, Yao L, Zhang G, Yang J, Yuan Y, Dong Y, Gao Z, Gong T, Li C. HistoML, a markup language for representation and exchange of histopathological features in pathology images. Sci Data 2022; 9:387. [PMID: 35803960; PMCID: PMC9270329; DOI: 10.1038/s41597-022-01505-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2022] [Accepted: 06/23/2022] [Indexed: 11/09/2022]
Abstract
The study of histopathological phenotypes is vital for cancer research and medicine as it links molecular mechanisms to disease prognosis. It typically involves integration of heterogeneous histopathological features in whole-slide images (WSI) to objectively characterize a histopathological phenotype. However, the large-scale implementation of phenotype characterization has been hindered by the fragmentation of histopathological features, resulting from the lack of a standardized format and a controlled vocabulary for structured and unambiguous representation of semantics in WSIs. To fill this gap, we propose the Histopathology Markup Language (HistoML), a representation language along with a controlled vocabulary (Histopathology Ontology) based on Semantic Web technologies. Multiscale features within a WSI, from single-cell features to mesoscopic features, can be represented using HistoML, which is a crucial step towards the goal of making WSIs findable, accessible, interoperable and reusable (FAIR). We pilot HistoML in representing WSIs of kidney cancer as well as thyroid carcinoma and exemplify the uses of HistoML representations in semantic queries to demonstrate the potential of HistoML-powered applications for phenotype characterization.
Affiliation(s)
- Peiliang Lou
  - School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi, 710049, China
- Chunbao Wang
  - Department of Pathology, The First Affiliated Hospital of Xi'an Jiaotong University, 277 West Yanta Road, Xi'an, Shaanxi, China
- Ruifeng Guo
  - Division of Anatomic Pathology, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota, USA
- Lixia Yao
  - Department of Health Services Administration and Policy, Temple University, Philadelphia, PA, USA
- Guanjun Zhang
  - Department of Pathology, The First Affiliated Hospital of Xi'an Jiaotong University, 277 West Yanta Road, Xi'an, Shaanxi, China
- Jun Yang
  - Department of Pathology, The Second Affiliated Hospital of Xi'an Jiaotong University, No. 3, Shang Qin Road, Xi'an, Shaanxi, China
- Yong Yuan
  - Department of Pathology, Shaanxi Provincial Tumor Hospital, Xi'an Jiaotong University, 309 Yanta West Road, Xi'an, Shaanxi, China
- Yuxin Dong
  - School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi, 710049, China
- Zeyu Gao
  - School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi, 710049, China
- Tieliang Gong
  - Key Laboratory of Intelligent Networks and Network Security (Xi'an Jiaotong University), Ministry of Education, Xi'an, Shaanxi, 710049, China
- Chen Li
  - National Engineering Lab for Big Data Analytics, Xi'an Jiaotong University, Xi'an, Shaanxi, 710049, China
5
Louarn M, Chatonnet F, Garnier X, Fest T, Siegel A, Faron C, Dameron O. Improving reusability along the data life cycle: a regulatory circuits case study. J Biomed Semantics 2022; 13:11. [PMID: 35346379; PMCID: PMC8962212; DOI: 10.1186/s13326-022-00266-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/02/2021] [Accepted: 03/07/2022] [Indexed: 12/22/2022]
Abstract
BACKGROUND: In life sciences, there has been a long-standing effort of standardization and integration of reference datasets and databases. Despite these efforts, data from many studies are provided in specific and non-standard formats. This hampers the capacity to reuse the studies' data in other pipelines, the capacity to reuse the pipelines' results in other studies, and the capacity to enrich the data with additional information. The Regulatory Circuits project is one of the largest efforts for integrating human cell genomics data to predict tissue-specific transcription factor-gene interaction networks. In spite of its success, it exhibits the usual shortcomings limiting its update, its reuse (as a whole or partially), and its extension with new data samples. To address these limitations, the resource has previously been integrated in an RDF triplestore so that TF-gene interaction networks could be generated with two SPARQL queries. However, this triplestore did not store the computed networks and did not integrate metadata about tissues and samples, therefore limiting the reuse of this dataset. In particular, it does not make it possible to reuse only a portion of Regulatory Circuits if a study focuses on a subset of the tissues, nor to combine the samples described in the datasets with samples from other studies. Overall, these limitations advocate for the design of a complete, flexible and reusable representation of the Regulatory Circuits dataset based on Semantic Web technologies.
RESULTS: We provide a modular RDF representation of the Regulatory Circuits, called Linked Extended Regulatory Circuits (LERC). It consists of (i) descriptions of biological and experimental context mapped to the reference databases, (ii) annotations about TF-gene interactions at the sample level for 808 samples, (iii) annotations about TF-gene interactions at the tissue level for 394 tissues, and (iv) metadata connecting the knowledge graphs cited above. LERC is based on a modular organisation into 1,205 RDF named graphs for representing the biological data, the sample-specific and the tissue-specific networks, and the corresponding metadata. In total it contains 3,910,794,050 triples and is available as a SPARQL endpoint.
CONCLUSION: The flexible and modular architecture of LERC supports biologically-relevant SPARQL queries. It allows easy and fast querying of the resources related to the initial Regulatory Circuits datasets and facilitates its reuse in other studies.
ASSOCIATED WEBSITE: https://regulatorycircuits-lod.genouest.org.
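The benefit of organising triples into named graphs, as this abstract describes, is that a query can be restricted to a subset of graphs, for example one tissue. The sketch below mimics that idea with a plain Python dictionary; it is not the LERC schema, and every graph name, predicate and identifier (`ex:tissue/liver`, `ex:regulates`, `targets_of`, and so on) is a hypothetical placeholder.

```python
from collections import defaultdict

# graph name -> set of (subject, predicate, object) triples
graphs = defaultdict(set)


def add(graph, s, p, o):
    """Store one triple inside the given named graph."""
    graphs[graph].add((s, p, o))


# Invented toy content: one graph per tissue, plus a metadata graph.
add("ex:tissue/liver", "ex:TF1", "ex:regulates", "ex:G1")
add("ex:tissue/liver", "ex:TF2", "ex:regulates", "ex:G2")
add("ex:tissue/brain", "ex:TF1", "ex:regulates", "ex:G3")
add("ex:metadata", "ex:tissue/liver", "ex:label", "liver")


def targets_of(tf, graph_names):
    """Genes regulated by tf, looking only at the selected named graphs."""
    return sorted(
        o
        for g in graph_names
        for s, p, o in graphs[g]
        if s == tf and p == "ex:regulates"
    )


liver_targets = targets_of("ex:TF1", ["ex:tissue/liver"])
all_targets = targets_of("ex:TF1", ["ex:tissue/liver", "ex:tissue/brain"])
print(liver_targets, all_targets)
```

A study that focuses on a subset of tissues simply passes fewer graph names, which corresponds to scoping a SPARQL query to particular named graphs.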
Affiliation(s)
- Marine Louarn
  - Univ Rennes, CNRS, Inria, IRISA, UMR 6074, Rennes, F-35000 France
  - UMR_S1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, 35000 France
- Fabrice Chatonnet
  - UMR_S1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, 35000 France
  - Laboratoire d’Hématologie, Pôle de Biologie, Centre Hospitalier Universitaire de Rennes, Rennes, 35033 France
- Xavier Garnier
  - Univ Rennes, CNRS, Inria, IRISA, UMR 6074, Rennes, F-35000 France
- Thierry Fest
  - UMR_S1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, 35000 France
  - Laboratoire d’Hématologie, Pôle de Biologie, Centre Hospitalier Universitaire de Rennes, Rennes, 35033 France
- Anne Siegel
  - Univ Rennes, CNRS, Inria, IRISA, UMR 6074, Rennes, F-35000 France
- Catherine Faron
  - Université Côte d’Azur, Inria, CNRS, I3S, Sophia-Antipolis, France
- Olivier Dameron
  - Univ Rennes, CNRS, Inria, IRISA, UMR 6074, Rennes, F-35000 France
6
Irshad O, Ghani Khan MU. Formalization and Semantic Integration of Heterogeneous Omics Annotations for Exploratory Searches. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200127122818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Aim:
To help researchers and practitioners unveil the functional aspects of the human cellular system through exploratory searching on semantically integrated, heterogeneous and geographically dispersed omics annotations.
Background:
Improving health standards is one of the motives that continuously drives researchers and practitioners to uncover unknown aspects of the human cellular system. Inferring new knowledge from known facts always requires a reasonably large amount of data in well-structured, integrated and unified form. With the advent of high-throughput and sensor technologies in particular, biological data are growing at an astronomical rate while becoming more heterogeneous and geographically dispersed. Several data integration systems have been deployed to cope with the issues of data heterogeneity and global dispersion. Systems based on semantic data integration models are more flexible and expandable than syntax-based ones but still lack aspect-based data integration, persistence and querying. Furthermore, these systems do not fully support warehousing biological entities in the form of semantic associations as naturally possessed by the human cell.
Objective:
To develop an aspect-oriented formal data integration model for semantically integrating heterogeneous and geographically dispersed omics annotations and for providing exploratory querying on the integrated data.
Method:
We propose an aspect-oriented formal data integration model which uses web semantics standards to formally specify each of its constructs. The proposed model supports aspect-oriented representation of biological entities while addressing the issues of data heterogeneity and global dispersion. It associates and warehouses biological entities in the way they relate with
Result:
To show the significance of the proposed model, we developed a data warehouse and information retrieval system based on a multi-layered and multi-modular software architecture compliant with the proposed model. Results show that our model supports gathering, associating, integrating, persisting and querying each entity with respect to all of its possible aspects within or across the various associated omics layers.
Conclusion:
Formal specifications help address data integration issues by providing formal means for understanding omics data based on meaning instead of syntax.
Affiliation(s)
- Omer Irshad
  - Department of Computer Science & Engineering, Faculty of Electrical Engineering, The University of Engineering and Technology, Lahore, Pakistan
- Muhammad Usman Ghani Khan
  - Department of Computer Science & Engineering, Faculty of Electrical Engineering, The University of Engineering and Technology, Lahore, Pakistan
7
SCALEUS-FD: A FAIR Data Tool for Biomedical Applications. BioMed Research International 2020; 2020:3041498. [PMID: 32908882; PMCID: PMC7471816; DOI: 10.1155/2020/3041498] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/19/2020] [Accepted: 08/18/2020] [Indexed: 11/17/2022]
Abstract
The Semantic Web and Linked Data concepts and technologies have empowered the scientific community with solutions to take full advantage of the increasingly available distributed and heterogeneous data in distinct silos. Additionally, FAIR Data principles established guidelines for data to be Findable, Accessible, Interoperable, and Reusable, and they are gaining traction in data stewardship. However, to explore their full potential, we must be able to transform legacy solutions smoothly into the FAIR Data ecosystem. In this paper, we introduce SCALEUS-FD, a FAIR Data extension of a legacy semantic web tool successfully used for data integration and semantic annotation and enrichment. The core functionalities of the solution follow the Semantic Web and Linked Data principles, offering a FAIR REST API for machine-to-machine operations. We applied a set of metrics to evaluate its “FAIRness” and created an application scenario in the rare diseases domain.
8
Sima AC, Stockinger K, de Farias TM, Gil M. Semantic Integration and Enrichment of Heterogeneous Biological Databases. Methods Mol Biol 2020; 1910:655-690. [PMID: 31278681; DOI: 10.1007/978-1-4939-9074-0_22] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 02/25/2023]
Abstract
Biological databases are growing at an exponential rate, currently being among the major producers of Big Data, almost on par with commercial generators, such as YouTube or Twitter. While traditionally biological databases evolved as independent silos, each purposely built by a different research group in order to answer specific research questions; more recently significant efforts have been made toward integrating these heterogeneous sources into unified data access systems or interoperable systems using the FAIR principles of data sharing. Semantic Web technologies have been key enablers in this process, opening the path for new insights into the unified data, which were not visible at the level of each independent database. In this chapter, we first provide an introduction into two of the most used database models for biological data: relational databases and RDF stores. Next, we discuss ontology-based data integration, which serves to unify and enrich heterogeneous data sources. We present an extensive timeline of milestones in data integration based on Semantic Web technologies in the field of life sciences. Finally, we discuss some of the remaining challenges in making ontology-based data access (OBDA) systems easily accessible to a larger audience. In particular, we introduce natural language search interfaces, which alleviate the need for database users to be familiar with technical query languages. We illustrate the main theoretical concepts of data integration through concrete examples, using two well-known biological databases: a gene expression database, Bgee, and an orthology database, OMA.
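The contrast between the two database models named in this abstract, relational tables and RDF stores, can be made concrete with a small Python sketch. It flattens relational-style rows into subject-predicate-object triples and answers a query by pattern matching. The rows, the `ex:` prefix and the helper names `row_to_triples` and `match` are assumptions made for this example; they are not taken from Bgee, OMA or the chapter.

```python
# Relational view: one row per gene, one column per attribute (toy data).
rows = [
    {"gene_id": "g1", "symbol": "ABC1", "expressed_in": "brain"},
    {"gene_id": "g2", "symbol": "XYZ2", "expressed_in": "liver"},
]


def row_to_triples(row, key="gene_id"):
    """Turn one row into triples: the key becomes the subject,
    every other column becomes a predicate with its cell as the object."""
    subject = f"ex:{row[key]}"
    return [(subject, f"ex:{col}", val) for col, val in row.items() if col != key]


# RDF-style view: a flat collection of triples.
triples = [t for row in rows for t in row_to_triples(row)]


def match(pattern, store):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


brain_genes = [s for s, _, _ in match((None, "ex:expressed_in", "brain"), triples)]
print(brain_genes)
```

Because every fact is a self-describing triple, data from a second source can be appended to `triples` without changing a table schema, which is the property that ontology-based integration builds on.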
Affiliation(s)
- Ana Claudia Sima
  - ZHAW Zurich University of Applied Sciences, Winterthur, Switzerland
  - University of Lausanne, Lausanne, Switzerland
- Kurt Stockinger
  - ZHAW Zurich University of Applied Sciences, Winterthur, Switzerland
- Tarcisio Mendes de Farias
  - University of Lausanne, Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Manuel Gil
  - ZHAW Zurich University of Applied Sciences, Winterthur, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
9
Cejovic J, Radenkovic J, Mladenovic V, Stanojevic A, Miletic M, Radanovic S, Bajcic D, Djordjevic D, Jelic F, Nesic M, Lau J, Grady P, Groves-Kirkby N, Kural D, Davis-Dusenbery B. Using Semantic Web Technologies to Enable Cancer Genomics Discovery at Petabyte Scale. Cancer Inform 2018; 17:1176935118774787. [PMID: 30283230; PMCID: PMC6166304; DOI: 10.1177/1176935118774787] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/17/2017] [Accepted: 08/03/2017] [Indexed: 11/17/2022]
Abstract
Increased efforts in cancer genomics research and bioinformatics are producing tremendous amounts of data. These data are diverse in origin, format, and content. As the amount of available sequencing data increases, technologies that make them discoverable and usable are critically needed. In response, we have developed a Semantic Web-based Data Browser, a tool allowing users to visually build and execute ontology-driven queries. This approach simplifies access to available data and improves the process of using them in analyses on the Seven Bridges Cancer Genomics Cloud (CGC; www.cancergenomicscloud.org). The Data Browser makes large data sets easily explorable and simplifies the retrieval of specific data of interest. Although initially implemented on top of The Cancer Genome Atlas (TCGA) data set, the Data Browser's architecture allows for seamless integration of other data sets. By deploying it on the CGC, we have enabled remote researchers to access data and perform collaborative investigations.
Affiliation(s)
- Filip Jelic
  - Seven Bridges Genomics Inc., Cambridge, MA, USA
- Milos Nesic
  - Seven Bridges Genomics Inc., Cambridge, MA, USA
- Jessica Lau
  - Seven Bridges Genomics Inc., Cambridge, MA, USA
- Deniz Kural
  - Seven Bridges Genomics Inc., Cambridge, MA, USA
10
Aite M, Chevallier M, Frioux C, Trottier C, Got J, Cortés MP, Mendoza SN, Carrier G, Dameron O, Guillaudeux N, Latorre M, Loira N, Markov GV, Maass A, Siegel A. Traceability, reproducibility and wiki-exploration for "à-la-carte" reconstructions of genome-scale metabolic models. PLoS Comput Biol 2018; 14:e1006146. [PMID: 29791443; PMCID: PMC5988327; DOI: 10.1371/journal.pcbi.1006146] [Citation(s) in RCA: 65] [Impact Index Per Article: 9.3] [Received: 08/21/2017] [Revised: 06/05/2018] [Accepted: 04/17/2018] [Indexed: 11/27/2022]
Abstract
Genome-scale metabolic models have become the tool of choice for the global analysis of microorganism metabolism, and their reconstruction has attained high standards of quality and reliability. Improvements in this area have been accompanied by the development of some major platforms and databases, and an explosion of individual bioinformatics methods. Consequently, many recent models result from "à la carte" pipelines, combining the use of platforms, individual tools and biological expertise to enhance the quality of the reconstruction. Although very useful, introducing heterogeneous tools, that hardly interact with each other, causes loss of traceability and reproducibility in the reconstruction process. This represents a real obstacle, especially when considering less studied species whose metabolic reconstruction can greatly benefit from the comparison to good quality models of related organisms. This work proposes an adaptable workspace, AuReMe, for sustainable reconstructions or improvements of genome-scale metabolic models involving personalized pipelines. At each step, relevant information related to the modifications brought to the model by a method is stored. This ensures that the process is reproducible and documented regardless of the combination of tools used. Additionally, the workspace establishes a way to browse metabolic models and their metadata through the automatic generation of ad-hoc local wikis dedicated to monitoring and facilitating the process of reconstruction. AuReMe supports exploration and semantic query based on RDF databases. We illustrate how this workspace allowed handling, in an integrated way, the metabolic reconstructions of non-model organisms such as an extremophile bacterium or eukaryote algae. Among relevant applications, the latter reconstruction led to putative evolutionary insights of a metabolic pathway.
Affiliation(s)
- Marie Chevallier
  - IRISA, Univ Rennes, Inria, CNRS, Rennes, France
  - ECOBIO, Univ Rennes, CNRS, Rennes, France
- Camille Trottier
  - IRISA, Univ Rennes, Inria, CNRS, Rennes, France
  - UMR 6004 ComBi, Université de Nantes, CNRS, Nantes, France
- Jeanne Got
  - IRISA, Univ Rennes, Inria, CNRS, Rennes, France
- María Paz Cortés
  - Centro de Modelamiento Matemático, Universidad de Chile, Santiago, Chile
  - Facultad de Ingeniería y Ciencias, Universidad Adolfo Ibáñez, Santiago, Chile
  - Centro para la Regulación del Genoma (Fondap 15090007), Universidad de Chile, Santiago, Chile
- Sebastián N. Mendoza
  - Centro de Modelamiento Matemático, Universidad de Chile, Santiago, Chile
  - Centro para la Regulación del Genoma (Fondap 15090007), Universidad de Chile, Santiago, Chile
- Grégory Carrier
  - Laboratoire de Physiologie et de Biotechnologie des Algues, IFREMER, Nantes, France
- Mauricio Latorre
  - Centro de Modelamiento Matemático, Universidad de Chile, Santiago, Chile
  - Centro para la Regulación del Genoma (Fondap 15090007), Universidad de Chile, Santiago, Chile
  - Instituto de ciencias de la ingeniería, Universidad de O'Higgins, Rancagua, Chile
  - Instituto de Nutrición y Tecnología de los Alimentos, Universidad de Chile, Santiago, Chile
- Nicolás Loira
  - Centro de Modelamiento Matemático, Universidad de Chile, Santiago, Chile
  - Centro para la Regulación del Genoma (Fondap 15090007), Universidad de Chile, Santiago, Chile
- Gabriel V. Markov
  - UMR 8227, Integrative Biology of Marine Models, Station biologique de Roscoff, Sorbonne Université, CNRS, Roscoff, France
- Alejandro Maass
  - Centro de Modelamiento Matemático, Universidad de Chile, Santiago, Chile
  - Centro para la Regulación del Genoma (Fondap 15090007), Universidad de Chile, Santiago, Chile
- Anne Siegel
  - IRISA, Univ Rennes, Inria, CNRS, Rennes, France
11
Kawashima S, Katayama T, Hatanaka H, Kushida T, Takagi T. NBDC RDF portal: a comprehensive repository for semantic data in life sciences. Database: The Journal of Biological Databases and Curation 2018; 2018:5255118. [PMID: 30576482; PMCID: PMC6301334; DOI: 10.1093/database/bay123] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 07/05/2018] [Accepted: 10/15/2018] [Indexed: 11/28/2022]
Abstract
In the life sciences, researchers increasingly want to access multiple databases in an integrated way. However, different databases currently use different formats and vocabularies, hindering the proper integration of heterogeneous life science data. Adopting the Resource Description Framework (RDF) has the potential to address such issues by improving database interoperability, leading to advances in automatic data processing. Based on this idea, we have advised many Japanese database development groups to expose their databases in RDF. To further promote such activities, we have developed an RDF-based life science dataset repository called the National Bioscience Database Center (NBDC) RDF portal. All the datasets in this repository have been reviewed by the NBDC to ensure interoperability and queryability. As of July 2018, the service includes 21 RDF datasets, comprising over 45.5 billion triples. It provides SPARQL endpoints for all datasets, useful metadata and the ability to download RDF files. The NBDC RDF portal can be accessed at https://integbio.jp/rdf/.
Affiliation(s)
- Shuichi Kawashima
- Database Center for Life Science, Research Organization of Information and Systems, 178-4-4 Wakashiba, Kashiwa, Chiba, Japan
| | - Toshiaki Katayama
- Database Center for Life Science, Research Organization of Information and Systems, 178-4-4 Wakashiba, Kashiwa, Chiba, Japan
| | - Hideki Hatanaka
- National Bioscience Database Center, Japan Science and Technology Agency, 5-3 Yonbancho, Chiyoda-ku, Tokyo, Japan
| | - Tatsuya Kushida
- National Bioscience Database Center, Japan Science and Technology Agency, 5-3 Yonbancho, Chiyoda-ku, Tokyo, Japan
| | - Toshihisa Takagi
- National Bioscience Database Center, Japan Science and Technology Agency, 5-3 Yonbancho, Chiyoda-ku, Tokyo, Japan; DNA Data Bank of Japan Center, National Institute of Genetics, Shizuoka, Japan; Department of Biological Sciences, Graduate School of Science, The University of Tokyo, 2-11-16 Yayoi, Bunkyo-ku, Tokyo, Japan
| |
|
12
|
Venkatesan A, Kim JH, Talo F, Ide-Smith M, Gobeill J, Carter J, Batista-Navarro R, Ananiadou S, Ruch P, McEntyre J. SciLite: a platform for displaying text-mined annotations as a means to link research articles with biological data. Wellcome Open Res 2017. [PMID: 28948232 DOI: 10.12688/wellcomeopenres.10210.1] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 12/24/2022] Open
Abstract
The tremendous growth in biological data has resulted in an increase in the number of research papers being published. This presents a great challenge for scientists in searching and assimilating facts described in those papers. Particularly, biological databases depend on curators to add highly precise and useful information that are usually extracted by reading research articles. Therefore, there is an urgent need to find ways to improve linking literature to the underlying data, thereby minimising the effort in browsing content and identifying key biological concepts. As part of the development of Europe PMC, we have developed a new platform, SciLite, which integrates text-mined annotations from different sources and overlays those outputs on research articles. The aim is to aid researchers and curators using Europe PMC in finding key concepts more easily and provide links to related resources or tools, bridging the gap between literature and biological data.
Affiliation(s)
- Aravind Venkatesan
- Literature Service group, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Jee-Hyub Kim
- Literature Service group, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Francesco Talo
- Literature Service group, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Michele Ide-Smith
- Literature Service group, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Julien Gobeill
- SIB Text Mining, Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Jacob Carter
- National Centre for Text Mining (NaCTeM), Manchester Institute of Biotechnology, Manchester, UK
| | - Riza Batista-Navarro
- National Centre for Text Mining (NaCTeM), Manchester Institute of Biotechnology, Manchester, UK
| | - Sophia Ananiadou
- National Centre for Text Mining (NaCTeM), Manchester Institute of Biotechnology, Manchester, UK
| | - Patrick Ruch
- SIB Text Mining, Swiss Institute of Bioinformatics, Geneva, Switzerland; Bibliomics and Text Mining Group (BiTeM), HES-SO, Geneva, Switzerland
| | - Johanna McEntyre
- Literature Service group, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| |
|
13
|
Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R, Spjuth O. RDFIO: extending Semantic MediaWiki for interoperable biomedical data management. J Biomed Semantics 2017; 8:35. [PMID: 28870259 PMCID: PMC5584330 DOI: 10.1186/s13326-017-0136-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/02/2017] [Accepted: 08/01/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Biological sciences are characterised not only by an increasing amount but also the extreme complexity of its data. This stresses the need for efficient ways of integrating these data in a coherent description of biological systems. In many cases, biological data needs organization before integration. This is not seldom a collaborative effort, and it is thus important that tools for data integration support a collaborative way of working. Wiki systems with support for structured semantic data authoring, such as Semantic MediaWiki, provide a powerful solution for collaborative editing of data combined with machine-readability, so that data can be handled in an automated fashion in any downstream analyses. Semantic MediaWiki lacks a built-in data import function though, which hinders efficient round-tripping of data between interoperable Semantic Web formats such as RDF and the internal wiki format. RESULTS: To solve this deficiency, the RDFIO suite of tools is presented, which supports importing of RDF data into Semantic MediaWiki, with metadata needed to export it again in the same RDF format, or ontology. Additionally, the new functionality enables mash-ups of automated data imports combined with manually created data presentations. The application of the suite of tools is demonstrated by importing drug discovery related data about rare diseases from Orphanet and acid dissociation constants from Wikidata. The RDFIO suite of tools is freely available for download via pharmb.io/project/rdfio. CONCLUSIONS: Through a set of biomedical demonstrators, it is demonstrated how the new functionality enables a number of usage scenarios where the interoperability of SMW and the wider Semantic Web is leveraged for biomedical data sets, to create an easy to use and flexible platform for exploring and working with biomedical data.
Affiliation(s)
- Samuel Lampa
- Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, SE-751 24, Sweden.
| | - Egon Willighagen
- Department of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, P.O. Box 616, UNS50 Box 19, Maastricht, NL-6200, MD, The Netherlands
| | - Pekka Kohonen
- Institute of Environmental Medicine, Karolinska Institutet, Stockholm, SE-171 77, Sweden; Division of Toxicology, Misvik Biology Oy, Turku, Finland
| | | | | | - Roland Grafström
- Institute of Environmental Medicine, Karolinska Institutet, Stockholm, SE-171 77, Sweden; Division of Toxicology, Misvik Biology Oy, Turku, Finland
| | - Ola Spjuth
- Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, SE-751 24, Sweden
| |
|
14
|
Matiasz NJ, Wood J, Wang W, Silva AJ, Hsu W. Computer-Aided Experiment Planning toward Causal Discovery in Neuroscience. Front Neuroinform 2017; 11:12. [PMID: 28243197 PMCID: PMC5304468 DOI: 10.3389/fninf.2017.00012] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 05/01/2016] [Accepted: 01/26/2017] [Indexed: 11/13/2022] Open
Abstract
Computers help neuroscientists to analyze experimental results by automating the application of statistics; however, computer-aided experiment planning is far less common, due to a lack of similar quantitative formalisms for systematically assessing evidence and uncertainty. While ontologies and other Semantic Web resources help neuroscientists to assimilate required domain knowledge, experiment planning requires not only ontological but also epistemological (e.g., methodological) information regarding how knowledge was obtained. Here, we outline how epistemological principles and graphical representations of causality can be used to formalize experiment planning toward causal discovery. We outline two complementary approaches to experiment planning: one that quantifies evidence per the principles of convergence and consistency, and another that quantifies uncertainty using logical representations of constraints on causal structure. These approaches operationalize experiment planning as the search for an experiment that either maximizes evidence or minimizes uncertainty. Despite work in laboratory automation, humans must still plan experiments and will likely continue to do so for some time. There is thus a great need for experiment-planning frameworks that are not only amenable to machine computation but also useful as aids in human reasoning.
Affiliation(s)
- Nicholas J Matiasz
- Medical Imaging Informatics Group, Department of Radiological Sciences, University of California, Los Angeles, Los Angeles, CA, USA; Silva Laboratory, Departments of Neurobiology, Psychiatry, and Psychology, Integrative Center for Learning and Memory, Brain Research Institute, University of California, Los Angeles, Los Angeles, CA, USA
| | - Justin Wood
- Silva Laboratory, Departments of Neurobiology, Psychiatry, and Psychology, Integrative Center for Learning and Memory, Brain Research Institute, University of California, Los Angeles, Los Angeles, CA, USA; Department of Computer Science, Scalable Analytics Institute, University of California, Los Angeles, Los Angeles, CA, USA
| | - Wei Wang
- Department of Computer Science, Scalable Analytics Institute, University of California, Los Angeles, Los Angeles, CA, USA
| | - Alcino J Silva
- Silva Laboratory, Departments of Neurobiology, Psychiatry, and Psychology, Integrative Center for Learning and Memory, Brain Research Institute, University of California, Los Angeles, Los Angeles, CA, USA
| | - William Hsu
- Medical Imaging Informatics Group, Department of Radiological Sciences, University of California, Los Angeles, Los Angeles, CA, USA
| |
|
15
|
Fernandez JD, Lenzerini M, Masseroli M, Venco F, Ceri S. Ontology-Based Search of Genomic Metadata. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:233-247. [PMID: 26529777 DOI: 10.1109/tcbb.2015.2495179] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 06/05/2023]
Abstract
The Encyclopedia of DNA Elements (ENCODE) is a huge and still expanding public repository of more than 4,000 experiments and 25,000 data files, assembled by a large international consortium since 2007; unknown biological knowledge can be extracted from these huge and largely unexplored data, leading to data-driven genomic, transcriptomic, and epigenomic discoveries. Yet, search of relevant datasets for knowledge discovery is limitedly supported: metadata describing ENCODE datasets are quite simple and incomplete, and not described by a coherent underlying ontology. Here, we show how to overcome this limitation, by adopting an ENCODE metadata searching approach which uses high-quality ontological knowledge and state-of-the-art indexing technologies. Specifically, we developed S.O.S. GeM (http://www.bioinformatics.deib.polimi.it/SOSGeM/), a system supporting effective semantic search and retrieval of ENCODE datasets. First, we constructed a Semantic Knowledge Base by starting with concepts extracted from ENCODE metadata, matched to and expanded on biomedical ontologies integrated in the well-established Unified Medical Language System. We prove that this inference method is sound and complete. Then, we leveraged the Semantic Knowledge Base to semantically search ENCODE data from arbitrary biologists' queries. This allows correctly finding more datasets than those extracted by a purely syntactic search, as supported by the other available systems. We empirically show the relevance of found datasets to the biologists' queries.
|
16
|
A survey on knowledge representation in materials science and engineering: An ontological perspective. Comput Ind 2015. [DOI: 10.1016/j.compind.2015.07.005] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Indexed: 10/23/2022]
|
17
|
Abstract
The life sciences field is entering an era of big data with the breakthroughs of science and technology. More and more big data-related projects and activities are being performed in the world. Life sciences data generated by new technologies are continuing to grow in not only size but also variety and complexity, with great speed. To ensure that big data has a major influence in the life sciences, comprehensive data analysis across multiple data sources and even across disciplines is indispensable. The increasing volume of data and the heterogeneous, complex varieties of data are two principal issues mainly discussed in life science informatics. The ever-evolving next-generation Web, characterized as the Semantic Web, is an extension of the current Web, aiming to provide information for not only humans but also computers to semantically process large-scale data. The paper presents a survey of big data in life sciences, big data related projects and Semantic Web technologies. The paper introduces the main Semantic Web technologies and their current situation, and provides a detailed analysis of how Semantic Web technologies address the heterogeneous variety of life sciences big data. The paper helps to understand the role of Semantic Web technologies in the big data era and how they provide a promising solution for the big data in life sciences.
Affiliation(s)
- Hongyan Wu
- Database Center for Life Science, Research Organization of Information and Systems
| | | |
|
18
|
Piñero J, Queralt-Rosinach N, Bravo À, Deu-Pons J, Bauer-Mehren A, Baron M, Sanz F, Furlong LI. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database: The Journal of Biological Databases and Curation 2015; 2015:bav028. [PMID: 25877637 PMCID: PMC4397996 DOI: 10.1093/database/bav028] [Citation(s) in RCA: 677] [Impact Index Per Article: 67.7] [Received: 10/17/2014] [Accepted: 03/09/2015] [Indexed: 11/25/2022]
Abstract
DisGeNET is a comprehensive discovery platform designed to address a variety of questions concerning the genetic underpinning of human diseases. DisGeNET contains over 380 000 associations between >16 000 genes and 13 000 diseases, which makes it one of the largest repositories currently available of its kind. DisGeNET integrates expert-curated databases with text-mined data, covers information on Mendelian and complex diseases, and includes data from animal disease models. It features a score based on the supporting evidence to prioritize gene-disease associations. It is an open access resource available through a web interface, a Cytoscape plugin and as a Semantic Web resource. The web interface supports user-friendly data exploration and navigation. DisGeNET data can also be analysed via the DisGeNET Cytoscape plugin, and enriched with the annotations of other plugins of this popular network analysis software suite. Finally, the information contained in DisGeNET can be expanded and complemented using Semantic Web technologies and linked to a variety of resources already present in the Linked Data cloud. Hence, DisGeNET offers one of the most comprehensive collections of human gene-disease associations and a valuable set of tools for investigating the molecular mechanisms underlying diseases of genetic origin, designed to fulfill the needs of different user profiles, including bioinformaticians, biologists and health-care practitioners. Database URL: http://www.disgenet.org/
Affiliation(s)
- Janet Piñero
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
| | - Núria Queralt-Rosinach
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
| | - Àlex Bravo
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
| | - Jordi Deu-Pons
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
| | - Anna Bauer-Mehren
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
| | - Martin Baron
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
| | - Ferran Sanz
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
| | - Laura I Furlong
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
| |
|
19
|
Chiba H, Nishide H, Uchiyama I. Construction of an ortholog database using the semantic web technology for integrative analysis of genomic data. PLoS One 2015; 10:e0122802. [PMID: 25875762 PMCID: PMC4395280 DOI: 10.1371/journal.pone.0122802] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 10/22/2014] [Accepted: 02/13/2015] [Indexed: 12/30/2022] Open
Abstract
Recently, various types of biological data, including genomic sequences, have been rapidly accumulating. To discover biological knowledge from such growing heterogeneous data, a flexible framework for data integration is necessary. Ortholog information is a central resource for interlinking corresponding genes among different organisms, and the Semantic Web provides a key technology for the flexible integration of heterogeneous data. We have constructed an ortholog database using the Semantic Web technology, aiming at the integration of numerous genomic data and various types of biological information. To formalize the structure of the ortholog information in the Semantic Web, we have constructed the Ortholog Ontology (OrthO). While the OrthO is a compact ontology for general use, it is designed to be extended to the description of database-specific concepts. On the basis of OrthO, we described the ortholog information from our Microbial Genome Database for Comparative Analysis (MBGD) in the form of Resource Description Framework (RDF) and made it available through the SPARQL endpoint, which accepts arbitrary queries specified by users. In this framework based on the OrthO, the biological data of different organisms can be integrated using the ortholog information as a hub. Besides, the ortholog information from different data sources can be compared with each other using the OrthO as a shared ontology. Here we show some examples demonstrating that the ortholog information described in RDF can be used to link various biological data such as taxonomy information and Gene Ontology. Thus, the ortholog database using the Semantic Web technology can contribute to biological knowledge discovery through integrative data analysis.
Affiliation(s)
- Hirokazu Chiba
- Laboratory of Genome Informatics, National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, Aichi, Japan
| | - Hiroyo Nishide
- Data Integration and Analysis Facility, National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, Aichi, Japan
| | - Ikuo Uchiyama
- Laboratory of Genome Informatics, National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, Aichi, Japan
- Data Integration and Analysis Facility, National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, Aichi, Japan
| |
|
20
|
Chen HS, Hutter CM, Mechanic LE, Amos CI, Bafna V, Hauser ER, Hernandez RD, Li C, Liberles DA, McAllister K, Moore JH, Paltoo DN, Papanicolaou GJ, Peng B, Ritchie MD, Rosenfeld G, Witte JS, Gillanders EM, Feuer EJ. Genetic simulation tools for post-genome wide association studies of complex diseases. Genet Epidemiol 2014; 39:11-19. [PMID: 25371374 DOI: 10.1002/gepi.21870] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 06/15/2014] [Revised: 09/02/2014] [Accepted: 09/26/2014] [Indexed: 01/12/2023]
Abstract
Genetic simulation programs are used to model data under specified assumptions to facilitate the understanding and study of complex genetic systems. Standardized data sets generated using genetic simulation are essential for the development and application of novel analytical tools in genetic epidemiology studies. With continuing advances in high-throughput genomic technologies and generation and analysis of larger, more complex data sets, there is a need for updating current approaches in genetic simulation modeling. To provide a forum to address current and emerging challenges in this area, the National Cancer Institute (NCI) sponsored a workshop, entitled "Genetic Simulation Tools for Post-Genome Wide Association Studies of Complex Diseases" at the National Institutes of Health (NIH) in Bethesda, Maryland on March 11-12, 2014. The goals of the workshop were to (1) identify opportunities, challenges, and resource needs for the development and application of genetic simulation models; (2) improve the integration of tools for modeling and analysis of simulated data; and (3) foster collaborations to facilitate development and applications of genetic simulation. During the course of the meeting, the group identified challenges and opportunities for the science of simulation, software and methods development, and collaboration. This paper summarizes key discussions at the meeting, and highlights important challenges and opportunities to advance the field of genetic simulation.
Affiliation(s)
- Huann-Sheng Chen
- Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, NIH, Bethesda, MD 20892
| | - Carolyn M Hutter
- Division of Genomic Medicine, National Human Genome Research Institute, NIH, Bethesda, MD 20892
| | - Leah E Mechanic
- Epidemiology and Genomics Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, NIH, Bethesda, MD 20892
| | - Christopher I Amos
- Division of Community, Family Medicine, Dartmouth College, Lebanon, NH 03755
| | - Vineet Bafna
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093
| | | | - Ryan D Hernandez
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94143
| | - Chun Li
- Department of Biostatistics, Vanderbilt University, Nashville, TN 37235
| | - David A Liberles
- Department of Molecular Biology, University of Wyoming, Laramie, WY 82071
| | - Kimberly McAllister
- Susceptibility and Population Health Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709
| | - Jason H Moore
- Department of Genetics, Dartmouth College, Lebanon, NH 03755
| | - Dina N Paltoo
- Office of Director, National Institutes of Health, Bethesda, MD 20892
| | - George J Papanicolaou
- Division of Cardiovascular Sciences, Prevention and Population Sciences Program, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD 20892
| | - Bo Peng
- Department of Bioinformatics and Computational Biology, University of Texas MD Anderson Cancer Center, Houston, TX 77030
| | - Marylyn D Ritchie
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA 16802
| | - Gabriel Rosenfeld
- Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, NIH, Bethesda, MD 20892
| | - John S Witte
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA 94107
| | - Elizabeth M Gillanders
- Epidemiology and Genomics Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, NIH, Bethesda, MD 20892
| | - Eric J Feuer
- Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, NIH, Bethesda, MD 20892
| |
|
21
|
Butler WE, Atai N, Carter B, Hochberg F. Informatic system for a global tissue-fluid biorepository with a graph theory-oriented graphical user interface. J Extracell Vesicles 2014; 3:24247. [PMID: 25317275 PMCID: PMC4172698 DOI: 10.3402/jev.v3.24247] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/03/2014] [Revised: 06/13/2014] [Accepted: 06/15/2014] [Indexed: 12/12/2022] Open
Abstract
The Richard Floor Biorepository supports collaborative studies of extracellular vesicles (EVs) found in human fluids and tissue specimens. The current emphasis is on biomarkers for central nervous system neoplasms but its structure may serve as a template for collaborative EV translational studies in other fields. The informatic system provides specimen inventory tracking with bar codes assigned to specimens and containers and projects, is hosted on globalized cloud computing resources, and embeds a suite of shared documents, calendars, and video-conferencing features. Clinical data are recorded in relation to molecular EV attributes and may be tagged with terms drawn from a network of externally maintained ontologies thus offering expansion of the system as the field matures. We fashioned the graphical user interface (GUI) around a web-based data visualization package. This system is now in an early stage of deployment, mainly focused on specimen tracking and clinical, laboratory, and imaging data capture in support of studies to optimize detection and analysis of brain tumour-specific mutations. It currently includes 4,392 specimens drawn from 611 subjects, the majority with brain tumours. As EV science evolves, we plan biorepository changes which may reflect multi-institutional collaborations, proteomic interfaces, additional biofluids, changes in operating procedures and kits for specimen handling, novel procedures for detection of tumour-specific EVs, and for RNA extraction and changes in the taxonomy of EVs. We have used an ontology-driven data model and web-based architecture with a graph theory-driven GUI to accommodate and stimulate the semantic web of EV science.
Affiliation(s)
- William E. Butler
- Neurosurgical Service, Massachusetts General Hospital, Boston, MA, USA
- Massachusetts General Hospital, Boston, MA, USA
| | - Nadia Atai
- Neurosurgical Service, Massachusetts General Hospital, Boston, MA, USA
- Massachusetts General Hospital, Boston, MA, USA
- Department of Cell Biology and Histology, University of Amsterdam, Amsterdam, The Netherlands
| | - Bob Carter
- Department of Neurosurgery, University of San Diego Medical School, San Diego, CA, USA
| | | |
|
22
|
Hettne KM, Dharuri H, Zhao J, Wolstencroft K, Belhajjame K, Soiland-Reyes S, Mina E, Thompson M, Cruickshank D, Verdes-Montenegro L, Garrido J, de Roure D, Corcho O, Klyne G, van Schouwen R, ’t Hoen PAC, Bechhofer S, Goble C, Roos M. Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomed Semantics 2014; 5:41. [PMID: 25276335 PMCID: PMC4177597 DOI: 10.1186/2041-1480-5-41] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 05/13/2013] [Accepted: 07/29/2014] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND: One of the main challenges for biomedical research lies in the computer-assisted integrative study of large and increasingly complex combinations of data in order to understand molecular mechanisms. The preservation of the materials and methods of such computational experiments with clear annotations is essential for understanding an experiment, and this is increasingly recognized in the bioinformatics community. Our assumption is that offering means of digital, structured aggregation and annotation of the objects of an experiment will provide necessary meta-data for a scientist to understand and recreate the results of an experiment. To support this we explored a model for the semantic description of a workflow-centric Research Object (RO), where an RO is defined as a resource that aggregates other resources, e.g., datasets, software, spreadsheets, text, etc. We applied this model to a case study where we analysed human metabolite variation by workflows. RESULTS: We present the application of the workflow-centric RO model for our bioinformatics case study. Three workflows were produced following recently defined Best Practices for workflow design. By modelling the experiment as an RO, we were able to automatically query the experiment and answer questions such as "which particular data was input to a particular workflow to test a particular hypothesis?", and "which particular conclusions were drawn from a particular workflow?". CONCLUSIONS: Applying a workflow-centric RO model to aggregate and annotate the resources used in a bioinformatics experiment, allowed us to retrieve the conclusions of the experiment in the context of the driving hypothesis, the executed workflows and their input data. The RO model is an extendable reference model that can be used by other systems as well. AVAILABILITY: The Research Object is available at http://www.myexperiment.org/packs/428. The Wf4Ever Research Object Model is available at http://wf4ever.github.io/ro.
Affiliation(s)
- Kristina M Hettne
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Harish Dharuri
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Jun Zhao
  - Department of Zoology, University of Oxford, Oxford, UK
- Katherine Wolstencroft
  - School of Computer Science, University of Manchester, Manchester, UK
  - Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands
- Khalid Belhajjame
  - School of Computer Science, University of Manchester, Manchester, UK
- Eleni Mina
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Mark Thompson
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- David de Roure
  - Department of Zoology, University of Oxford, Oxford, UK
- Oscar Corcho
  - Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain
- Graham Klyne
  - Department of Zoology, University of Oxford, Oxford, UK
- Reinout van Schouwen
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Peter A C ’t Hoen
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Sean Bechhofer
  - School of Computer Science, University of Manchester, Manchester, UK
- Carole Goble
  - School of Computer Science, University of Manchester, Manchester, UK
- Marco Roos
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
|
23
|
Stucky BJ, Deck J, Conlin T, Ziemba L, Cellinese N, Guralnick R. The BiSciCol Triplifier: bringing biodiversity data to the Semantic Web. BMC Bioinformatics 2014; 15:257. [PMID: 25073721 PMCID: PMC4124153 DOI: 10.1186/1471-2105-15-257] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 06/12/2014] [Accepted: 07/22/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Recent years have brought great progress in efforts to digitize the world's biodiversity data, but integrating data from many different providers, and across research domains, remains challenging. Semantic Web technologies have been widely recognized by biodiversity scientists for their potential to help solve this problem, yet these technologies have so far seen little use for biodiversity data. Such slow uptake has been due, in part, to the relative complexity of Semantic Web technologies along with a lack of domain-specific software tools to help non-experts publish their data to the Semantic Web. RESULTS The BiSciCol Triplifier is new software that greatly simplifies the process of converting biodiversity data in standard, tabular formats, such as Darwin Core-Archives, into Semantic Web-ready Resource Description Framework (RDF) representations. The Triplifier uses a vocabulary based on the popular Darwin Core standard, includes both Web-based and command-line interfaces, and is fully open-source software. CONCLUSIONS Unlike most other RDF conversion tools, the Triplifier does not require detailed familiarity with core Semantic Web technologies, and it is tailored to a widely popular biodiversity data format and vocabulary standard. As a result, the Triplifier can often fully automate the conversion of biodiversity data to RDF, thereby making the Semantic Web much more accessible to biodiversity scientists who might otherwise have relatively little knowledge of Semantic Web technologies. Easy availability of biodiversity data as RDF will allow researchers to combine data from disparate sources and analyze them with powerful linked data querying tools. However, before software like the Triplifier, and Semantic Web technologies in general, can reach their full potential for biodiversity science, the biodiversity informatics community must address several critical challenges, such as the widespread failure to use robust, globally unique identifiers for biodiversity data.
Affiliation(s)
- Brian J Stucky
  - Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, Colorado, USA.
|
24
|
Wu H, Fujiwara T, Yamamoto Y, Bolleman J, Yamaguchi A. BioBenchmark Toyama 2012: an evaluation of the performance of triple stores on biological data. J Biomed Semantics 2014; 5:32. [PMID: 25089180 PMCID: PMC4118313 DOI: 10.1186/2041-1480-5-32] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 05/13/2013] [Accepted: 04/27/2014] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Biological databases vary enormously in size and data complexity, from small databases that contain a few million Resource Description Framework (RDF) triples to large databases that contain billions of triples. In this paper, we evaluate whether RDF native stores can be used to meet the needs of a biological database provider. Prior evaluations have used synthetic data with a limited database size. For example, the largest BSBM benchmark uses 1 billion synthetic e-commerce knowledge RDF triples on a single node. However, real-world biological data differ considerably from such simple synthetic data, and it is difficult to determine whether synthetic e-commerce data are adequate to represent biological databases. Therefore, for this evaluation, we used five real data sets from biological databases. RESULTS We evaluated five triple stores, 4store, Bigdata, Mulgara, Virtuoso, and OWLIM-SE, with five biological data sets, Cell Cycle Ontology, Allie, PDBj, UniProt, and DDBJ, ranging in size from approximately 10 million to 8 billion triples. For each database, we loaded all the data into our single node and prepared the database for use in a classical data warehouse scenario. Then, we ran a series of SPARQL queries against each endpoint and recorded the execution time and the accuracy of the query response. CONCLUSIONS Our paper shows that, with appropriate configuration, Virtuoso and OWLIM-SE can satisfy the basic requirements to load and query biological data of less than about 8 billion triples on a single node, with simultaneous access by 64 clients. OWLIM-SE performs best for databases with approximately 11 million triples. For data sets that contain 94 million and 590 million triples, OWLIM-SE and Virtuoso perform best, and neither shows an overwhelming advantage over the other. For data over 4 billion triples, Virtuoso works best. 4store performs well on small data sets with limited features when the number of triples is less than 100 million, but our test shows that its scalability is poor. Bigdata demonstrates average performance and is a good open-source triple store for middle-sized data sets (500 million triples or so). Mulgara shows some fragility.
Affiliation(s)
- Hongyan Wu
  - Database Center for Life Science, Research Organization of Information and Systems, 178-4-4 Wakashiba, Kashiwa, Chiba 277-0871, Japan
- Yasunori Yamamoto
  - Database Center for Life Science, Research Organization of Information and Systems, 178-4-4 Wakashiba, Kashiwa, Chiba 277-0871, Japan
- Jerven Bolleman
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, CMU, 1 Michel Servet, 1211 Geneva 4, Switzerland
- Atsuko Yamaguchi
  - Database Center for Life Science, Research Organization of Information and Systems, 178-4-4 Wakashiba, Kashiwa, Chiba 277-0871, Japan
|
25
|
Bölling C, Weidlich M, Holzhütter HG. SEE: structured representation of scientific evidence in the biomedical domain using Semantic Web techniques. J Biomed Semantics 2014; 5:S1. [PMID: 25093070 PMCID: PMC4108886 DOI: 10.1186/2041-1480-5-s1-s1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Accounts of evidence are vital to evaluate and reproduce scientific findings and integrate data on an informed basis. Currently, such accounts are often inadequate, unstandardized and inaccessible for computational knowledge engineering even though computational technologies, among them those of the semantic web, are ever more employed to represent, disseminate and integrate biomedical data and knowledge. RESULTS We present SEE (Semantic EvidencE), an RDF/OWL based approach for detailed representation of evidence in terms of the argumentative structure of the supporting background for claims even in complex settings. We derive design principles and identify minimal components for the representation of evidence. We specify the Reasoning and Discourse Ontology (RDO), an OWL representation of the model of scientific claims, their subjects, their provenance and their argumentative relations underlying the SEE approach. We demonstrate the application of SEE and illustrate its design patterns in a case study by providing an expressive account of the evidence for certain claims regarding the isolation of the enzyme glutamine synthetase. CONCLUSIONS SEE is suited to provide coherent and computationally accessible representations of evidence-related information such as the materials, methods, assumptions, reasoning and information sources used to establish a scientific finding by adopting a consistently claim-based perspective on scientific results and their evidence. SEE allows for extensible evidence representations, in which the level of detail can be adjusted and which can be extended as needed. It supports representation of arbitrarily many consecutive layers of interpretation and attribution and different evaluations of the same data. SEE and its underlying model could be a valuable component in a variety of use cases that require careful representation or examination of evidence for data presented on the semantic web or in other formats.
Affiliation(s)
- Christian Bölling
  - Institute of Biochemistry, Charité Universitätsmedizin Berlin, Berlin, Germany
- Michael Weidlich
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany
|
26
|
Boja ES, Rodriguez H. Proteogenomic convergence for understanding cancer pathways and networks. Clin Proteomics 2014; 11:22. [PMID: 24994965 PMCID: PMC4067069 DOI: 10.1186/1559-0275-11-22] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Received: 01/02/2014] [Accepted: 03/31/2014] [Indexed: 11/21/2022] Open
Abstract
During the past several decades, the understanding of cancer at the molecular level has been primarily focused on the mechanisms by which signaling molecules transform homeostatically balanced cells into malignant ones within an individual pathway. However, it is becoming more apparent that pathways are dynamic and crosstalk at different control points of the signaling cascades, making the traditional linear signaling models inadequate to interpret complex biological systems. Recent technological advances in high-throughput, deep sequencing of human genomes and in proteomic technologies to comprehensively characterize human proteomes, in conjunction with multiplexed targeted proteomic assays to measure panels of proteins involved in biologically relevant pathways, have enabled significant progress in understanding cancer at the molecular level. It is undeniable that proteomic profiling of differentially expressed proteins under many perturbation conditions, or between normal and "diseased" states, is important to capture a first glance at the overall proteomic landscape, which has been a main focus of proteomics research during the past 15-20 years. However, the research community is gradually shifting its heavy focus from that initial discovery step to protein target verification using multiplexed quantitative proteomic assays, capable of measuring changes in proteins and their interacting partners, isoforms, and post-translational modifications (PTMs) in response to stimuli in the context of signaling pathways and protein networks. With a critical link to genotypes (i.e., high-throughput genomics and transcriptomics data), new and complementary information can be gleaned from multi-dimensional omics data to (1) assess the effect of genomic and transcriptomic aberrations on such complex molecular machinery in the context of cell signaling architectures associated with pathological diseases such as cancer (i.e., from genotype to proteotype to phenotype); and (2) target pathway- and network-driven changes and map the fluctuations of these functional units (proteins) responsible for cellular activities in response to perturbation in a spatiotemporal fashion to better understand cancer biology as a whole system.
Affiliation(s)
- Emily S Boja
  - Office of Cancer Clinical Proteomics Research, National Cancer Institute, National Institutes of Health, 31 Center Drive, MSC 2580, 20892 Bethesda, MD, USA
- Henry Rodriguez
  - Office of Cancer Clinical Proteomics Research, National Cancer Institute, National Institutes of Health, 31 Center Drive, MSC 2580, 20892 Bethesda, MD, USA
|
27
|
Berman HM, Kleywegt GJ, Nakamura H, Markley JL. How community has shaped the Protein Data Bank. Structure 2014; 21:1485-91. [PMID: 24010707 DOI: 10.1016/j.str.2013.07.010] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 06/14/2013] [Revised: 07/12/2013] [Accepted: 07/17/2013] [Indexed: 11/19/2022]
Abstract
Following several years of community discussion, the Protein Data Bank (PDB) was established in 1971 as a public repository for the coordinates of three-dimensional models of biological macromolecules. Since then, the number, size, and complexity of structural models have continued to grow, reflecting the productivity of structural biology. Managed by the Worldwide PDB organization, the PDB has been able to meet increasing demands for both the quantity and the quality of structural information. In addition to providing unrestricted access to structural information, the PDB also works to promote data standards and to raise the profile of structural biology with broader audiences. In this perspective, we describe the history of PDB and the many ways in which the community continues to shape the archive.
Affiliation(s)
- Helen M Berman
  - RCSB PDB, Center for Integrative Proteomics Research and Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ USA 08854.
|
28
|
Santra T, Kolch W, Kholodenko BN. Navigating the multilayered organization of eukaryotic signaling: a new trend in data integration. PLoS Comput Biol 2014; 10:e1003385. [PMID: 24550716 PMCID: PMC3923657 DOI: 10.1371/journal.pcbi.1003385] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022] Open
Abstract
The ever-increasing capacity of biological molecular data acquisition outpaces our ability to understand the meaningful relationships between molecules in a cell. Multiple databases were developed to store and organize these molecular data. However, emerging fundamental questions about concerted functions of these molecules in hierarchical cellular networks are poorly addressed. Here we review recent advances in the development of publicly available databases that help us analyze the signal integration and processing by multilayered networks that specify biological responses in model organisms and human cells.
Affiliation(s)
- Tapesh Santra
  - Systems Biology Ireland, University College Dublin, Belfield, Dublin, Ireland
- Walter Kolch
  - Systems Biology Ireland, University College Dublin, Belfield, Dublin, Ireland
  - Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin, Ireland
  - School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Ireland
- Boris N. Kholodenko
  - Systems Biology Ireland, University College Dublin, Belfield, Dublin, Ireland
  - Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin, Ireland
  - School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Ireland
|
29
|
Mayer G, Jones AR, Binz PA, Deutsch EW, Orchard S, Montecchi-Palazzi L, Vizcaíno JA, Hermjakob H, Oveillero D, Julian R, Stephan C, Meyer HE, Eisenacher M. Controlled vocabularies and ontologies in proteomics: overview, principles and practice. Biochim Biophys Acta 2014; 1844:98-107. [PMID: 23429179 PMCID: PMC3898906 DOI: 10.1016/j.bbapap.2013.02.017] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 11/23/2012] [Revised: 02/05/2013] [Accepted: 02/09/2013] [Indexed: 11/30/2022]
Abstract
This paper focuses on the use of controlled vocabularies (CVs) and ontologies, especially in the area of proteomics, primarily related to the work of the Proteomics Standards Initiative (PSI). It describes the relevant proteomics standard formats and the ontologies used within them. Software and tools for working with these ontology files are also discussed. The article also examines the "mapping files" used to ensure that correct controlled vocabulary terms are placed within PSI standards and that the MIAPE (Minimum Information about a Proteomics Experiment) requirements are fulfilled. This article is part of a Special Issue entitled: Computational Proteomics in the Post-Identification Era. Guest Editors: Martin Eisenacher and Christian Stephan.
Affiliation(s)
- Gerhard Mayer
  - Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, D-44801 Bochum, Germany
- Andrew R. Jones
  - Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
- Pierre-Alain Binz
  - SIB Swiss Institute of Bioinformatics, Swiss-Prot group, Rue Michel-Servet 1, CH-1211 Geneva 4, Switzerland
- Eric W. Deutsch
  - Institute for Systems Biology, 401 Terry Avenue North, Seattle, WA 98109, USA
- Sandra Orchard
  - EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Henning Hermjakob
  - EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- David Oveillero
  - EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Christian Stephan
  - Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, D-44801 Bochum, Germany
  - Kairos GmbH, Universitätsstraße 136, D-44799 Bochum, Germany
- Helmut E. Meyer
  - Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, D-44801 Bochum, Germany
- Martin Eisenacher
  - Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, D-44801 Bochum, Germany
|
30
|
Rebholz-Schuhmann D, Grabmüller C, Kavaliauskas S, Croset S, Woollard P, Backofen R, Filsell W, Clark D. A case study: semantic integration of gene-disease associations for type 2 diabetes mellitus from literature and biomedical data resources. Drug Discov Today 2013; 19:882-9. [PMID: 24201223 DOI: 10.1016/j.drudis.2013.10.024] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 07/17/2012] [Revised: 09/24/2013] [Accepted: 10/28/2013] [Indexed: 10/26/2022]
Abstract
In the Semantic Enrichment of the Scientific Literature (SESL) project, researchers from academia and from life science and publishing companies collaborated in a pre-competitive way to integrate and share information for type 2 diabetes mellitus (T2DM) in adults. This case study exposes benefits from semantic interoperability after integrating the scientific literature with biomedical data resources, such as UniProt Knowledgebase (UniProtKB) and the Gene Expression Atlas (GXA). We annotated scientific documents in a standardized way, by applying public terminological resources for diseases and proteins, and other text-mining approaches. Eventually, we compared the genetic causes of T2DM across the data resources to demonstrate the benefits from the SESL triple store. Our solution enables publishers to distribute their content with little overhead into remote data infrastructures, such as into any Virtual Knowledge Broker.
Affiliation(s)
- Dietrich Rebholz-Schuhmann
  - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; Computerlinguistik, Universität Zürich, Binzmühlestrasse 14, 8050 Zürich, Switzerland.
- Christoph Grabmüller
  - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Silvestras Kavaliauskas
  - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Samuel Croset
  - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Peter Woollard
  - GlaxoSmithKline, GlaxoSmithKline Medicines Research Centre, Gunnels Wood Road, Stevenage SG1 2NY, UK
- Rolf Backofen
  - Albert-Ludwigs-University Freiburg, Fahnenbergplatz, D-79085 Freiburg, Germany
- Wendy Filsell
  - Unilever R&D, Colworth Science Park, Sharnbrook MK44 1LQ, UK
- Dominic Clark
  - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
31
|
Kamdar MR, Zeginis D, Hasnain A, Decker S, Deus HF. ReVeaLD: a user-driven domain-specific interactive search platform for biomedical research. J Biomed Inform 2013; 47:112-30. [PMID: 24135450 DOI: 10.1016/j.jbi.2013.10.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 02/25/2013] [Revised: 09/22/2013] [Accepted: 10/01/2013] [Indexed: 10/26/2022]
Abstract
Bioinformatics research relies heavily on the ability to discover and correlate data from various sources. The specialization of life sciences over the past decade, coupled with an increasing number of biomedical datasets available through standardized interfaces, has created opportunities for new methods in biomedical discovery. Despite the popularity of semantic web technologies in tackling the integrative bioinformatics challenge, there are many obstacles to their use by non-technical research audiences. In particular, fully exploiting integrated information requires improved interactive methods that are intuitive to biomedical experts. In this report we present ReVeaLD (a Real-time Visual Explorer and Aggregator of Linked Data), a user-centered visual analytics platform devised to increase intuitive interaction with data from distributed sources. ReVeaLD facilitates query formulation using a domain-specific language (DSL) identified by biomedical experts and mapped to a self-updated catalogue of elements from external sources. ReVeaLD was implemented in a cancer research setting; queries included retrieving data from in silico experiments, protein modeling and gene expression. ReVeaLD was developed using Scalable Vector Graphics and JavaScript, and a demo with explanatory video is available at http://www.srvgal78.deri.ie:8080/explorer. A set of user-defined graphic rules controls the display of information through media-rich user interfaces. Evaluation of ReVeaLD was carried out as a game: biomedical researchers were asked to assemble a set of 5 challenge questions, and time and interactions with the platform were recorded. Preliminary results indicate that complex queries could be formulated in less than two minutes by unskilled researchers. The results also indicate that supporting the identification of the elements of a DSL significantly increased intuitiveness of the platform and usability of semantic web technologies by domain users.
Affiliation(s)
- Maulik R Kamdar
  - Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, Ireland.
- Dimitris Zeginis
  - Centre for Research and Technology Hellas, Thessaloniki, Greece; Information Systems Lab, University of Macedonia, Thessaloniki, Greece.
- Ali Hasnain
  - Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, Ireland.
- Stefan Decker
  - Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, Ireland.
- Helena F Deus
  - Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, Ireland.
|
32
|
Callahan A, Cruz-Toledo J, Dumontier M. Ontology-Based Querying with Bio2RDF's Linked Open Data. J Biomed Semantics 2013; 4 Suppl 1:S1. [PMID: 23735196 PMCID: PMC3632999 DOI: 10.1186/2041-1480-4-s1-s1] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND A key activity for life scientists in this post "-omics" age involves searching for and integrating biological data from a multitude of independent databases. However, our ability to find relevant data is hampered by non-standard web and database interfaces backed by an enormous variety of data formats. This heterogeneity presents an overwhelming barrier to the discovery and reuse of resources which have been developed at great public expense. To address this issue, the open-source Bio2RDF project promotes a simple convention to integrate diverse biological data using Semantic Web technologies. However, querying Bio2RDF remains difficult due to the lack of uniformity in the representation of Bio2RDF datasets. RESULTS We describe an update to Bio2RDF that includes tighter integration across 19 new and updated RDF datasets. All available open-source scripts were first consolidated to a single GitHub repository and then redeveloped using a common API that generates normalized IRIs using a centralized dataset registry. We then mapped dataset-specific types and relations to the Semanticscience Integrated Ontology (SIO) and demonstrate simplified federated queries across multiple Bio2RDF endpoints. CONCLUSIONS This coordinated release marks an important milestone for the Bio2RDF open source linked data framework. Principally, it improves the quality of linked data in the Bio2RDF network and makes it easier to access or recreate the linked data locally. We hope to continue improving the Bio2RDF network of linked data by identifying priority databases and increasing the vocabulary coverage to additional dataset vocabularies beyond SIO.
Affiliation(s)
- Alison Callahan
  - Department of Biology, Carleton University, 1125 Colonel By Drive, Ottawa, ON, Canada
- José Cruz-Toledo
  - Department of Biology, Carleton University, 1125 Colonel By Drive, Ottawa, ON, Canada
- Michel Dumontier
  - Department of Biology, Carleton University, 1125 Colonel By Drive, Ottawa, ON, Canada
  - Institute of Biochemistry, Carleton University, 1125 Colonel By Drive, Ottawa, ON, Canada
  - School of Computer Science, Carleton University, 1125 Colonel By Drive, Ottawa, ON, Canada
|
33
|
Callahan A, Cruz-Toledo J, Ansell P, Dumontier M. Bio2RDF Release 2: Improved Coverage, Interoperability and Provenance of Life Science Linked Data. The Semantic Web: Semantics and Big Data 2013. [DOI: 10.1007/978-3-642-38288-8_14] [Citation(s) in RCA: 66] [Impact Index Per Article: 5.5] [Indexed: 12/05/2022]
|
34
|
Wang Z, Sagotsky J, Taylor T, Shironoshita P, Deisboeck TS. Accelerating cancer systems biology research through Semantic Web technology. Wiley Interdiscip Rev Syst Biol Med 2012. [PMID: 23188758 DOI: 10.1002/wsbm.1200] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 11/06/2022]
Abstract
Cancer systems biology is an interdisciplinary, rapidly expanding research field in which collaborations are a critical means to advance the field. Yet the prevalent database technologies often isolate data rather than making it easily accessible. The Semantic Web has the potential to help facilitate web-based collaborative cancer research by presenting data in a manner that is self-descriptive, human and machine readable, and easily sharable. We have created a semantically linked online Digital Model Repository (DMR) for storing, managing, executing, annotating, and sharing computational cancer models. Within the DMR, distributed, multidisciplinary, and inter-organizational teams can collaborate on projects, without forfeiting intellectual property. This is achieved by the introduction of a new stakeholder to the collaboration workflow, the institutional licensing officer, part of the Technology Transfer Office. Furthermore, the DMR has achieved silver level compatibility with the National Cancer Institute's caBIG, so users can interact with the DMR not only through a web browser but also through a semantically annotated and secure web service. We also discuss the technology behind the DMR leveraging the Semantic Web, ontologies, and grid computing to provide secure inter-institutional collaboration on cancer modeling projects, online grid-based execution of shared models, and the collaboration workflow protecting researchers' intellectual property.
Affiliation(s)
- Zhihui Wang
  - Department of Pathology, University of New Mexico, Albuquerque, NM, USA
|
35
|
Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012; 13:829-39. [DOI: 10.1038/nrg3337] [Citation(s) in RCA: 170] [Impact Index Per Article: 13.1] [Indexed: 12/24/2022]
|