1
|
Welter D, Juty N, Rocca-Serra P, Xu F, Henderson D, Gu W, Strubel J, Giessmann RT, Emam I, Gadiya Y, Abbassi-Daloii T, Alharbi E, Gray AJG, Courtot M, Gribbon P, Ioannidis V, Reilly DS, Lynch N, Boiten JW, Satagopam V, Goble C, Sansone SA, Burdett T. FAIR in action - a flexible framework to guide FAIRification. Sci Data 2023; 10:291. [PMID: 37208349 PMCID: PMC10199076 DOI: 10.1038/s41597-023-02167-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2022] [Accepted: 03/28/2023] [Indexed: 05/21/2023] Open
Abstract
The COVID-19 pandemic has highlighted the need for FAIR (Findable, Accessible, Interoperable, and Reusable) data more than any other scientific challenge to date. We developed a flexible, multi-level, domain-agnostic FAIRification framework, providing practical guidance to improve the FAIRness for both existing and future clinical and molecular datasets. We validated the framework in collaboration with several major public-private partnership projects, demonstrating and delivering improvements across all aspects of FAIR and across a variety of datasets and their contexts. We therefore managed to establish the reproducibility and far-reaching applicability of our approach to FAIRification tasks.
Affiliation(s)
- Danielle Welter
  - Luxembourg Centre for Systems Biomedicine, ELIXIR Luxembourg, University of Luxembourg, L-4367, Belval, Luxembourg
- Nick Juty
  - University of Manchester, Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
- Philippe Rocca-Serra
  - Oxford e-Research Centre, Department of Engineering Science, University of Oxford, 7 Keble Road, OX1 3QG, Oxford, UK
- Fuqi Xu
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
- David Henderson
  - Bayer AG, Business Development & Licensing & OI, Muellerstrasse 178, 13353, Berlin, Germany
- Wei Gu
  - Luxembourg Centre for Systems Biomedicine, ELIXIR Luxembourg, University of Luxembourg, L-4367, Belval, Luxembourg
- Jolanda Strubel
  - The Hyve BV, Arthur van Schendelstraat 650, 3511 MJ, Utrecht, The Netherlands
- Robert T Giessmann
  - Bayer AG, Business Development & Licensing & OI, Muellerstrasse 178, 13353, Berlin, Germany
  - Institute for Globally Distributed Open Research and Education (IGDORE), Gothenburg, Sweden
- Ibrahim Emam
  - Data Science Institute, Imperial College, London, UK
- Yojana Gadiya
  - Fraunhofer Institute for Translational Medicine and Pharmacology (ITMP) and Fraunhofer Cluster of Excellence for Immune Mediated Diseases (CIMD), Schnackenburgallee 114, 22525 Hamburg, and Theodor Stern Kai 7, 60590, Frankfurt, Germany
- Tooba Abbassi-Daloii
  - Department of Bioinformatics (BiGCaT), NUTRIM, FHML, Maastricht University, Maastricht, The Netherlands
- Ebtisam Alharbi
  - College of Computer and Information Systems, Umm Al-Qura University, Mecca, Saudi Arabia
- Alasdair J G Gray
  - Department of Computer Science, Heriot-Watt University, Edinburgh, EH14 4AS, Scotland, UK
- Melanie Courtot
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
  - Ontario Institute for Cancer Research MaRS Centre, 661 University Avenue, Suite 510, Toronto, Ontario, M5G 0A3, Canada
- Philip Gribbon
  - Fraunhofer Institute for Translational Medicine and Pharmacology (ITMP) and Fraunhofer Cluster of Excellence for Immune Mediated Diseases (CIMD), Schnackenburgallee 114, 22525 Hamburg, and Theodor Stern Kai 7, 60590, Frankfurt, Germany
- Vassilios Ioannidis
  - Vital-IT Group, SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Dorothy S Reilly
  - Novartis Institutes for BioMedical Research, Novartis Pharma AG, Basel, Switzerland
- Venkata Satagopam
  - Luxembourg Centre for Systems Biomedicine, ELIXIR Luxembourg, University of Luxembourg, L-4367, Belval, Luxembourg
- Carole Goble
  - University of Manchester, Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
- Susanna-Assunta Sansone
  - Oxford e-Research Centre, Department of Engineering Science, University of Oxford, 7 Keble Road, OX1 3QG, Oxford, UK
- Tony Burdett
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
|
2
|
Ohmann C, David R, Abadia MC, Bietrix F, Boiten JW, Canham S, Chiusano ML, Dastrù W, Laroquette A, Longo D, Mayrhofer MT, Panagiotopoulou M, Richard A, Verde PE. Pilot Study on the Intercalibration of a Categorisation System for FAIRer Digital Objects Related to Sensitive Data in the Life Sciences. Data Intelligence 2022. [DOI: 10.1162/dint_a_00126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2022] Open
Abstract
Sharing sensitive data is a specific challenge for research infrastructures in the field of life sciences. For that reason a toolbox has been developed, providing resources for researchers who wish to share and use sensitive data, to support the workflows for handling these kinds of digital objects. Common and community approved annotations are required to be compliant with FAIR principles (Findability, Accessibility, Interoperability, Reusability). The toolbox makes use of a tagging (categorisation) system, allowing consistent labelling and categorisation of digital objects, in terms relevant to data sharing tasks and activities. A pilot study was performed within the Horizon 2020 project EOSC-Life, in which 2 experts from 6 life sciences research infrastructures were recruited to independently assign tags to the same set of 10 to 25 resources related to sensitive data management and data sharing (in total 110). Summary statistics of agreement and observer variation per research infrastructure are provided. The pilot study has shown that experts were able to attribute tags but in most cases with a considerable observer variation between experts. In the context of CWFR (Canonical Workflow Frameworks for Research), this indicates the necessity for careful definition, evaluation and validation of parameters and processes related to workflow descriptions. The results from this pilot study were used to tackle this issue by revising the categorisation system and providing an updated version.
Affiliation(s)
- Christian Ohmann
  - European Clinical Research Infrastructure Network (ECRIN), Paris 75013, France
- Romain David
  - European Research Infrastructure on Highly Pathogenic Agents (ERINHA AISBL), Brussels 1000, Belgium
- Mónica Cano Abadia
  - Biobanking and Biomolecular Resources Research Infrastructure (BBMRI), Graz 8010, Austria
- Florence Bietrix
  - European Infrastructure for Translational Medicine (EATRIS), Amsterdam 1081 HZ, The Netherlands
- Jan-Willem Boiten
  - European Advanced Translational Research Infrastructure (EATRIS)/Lygature, Utrecht 3521 AL, The Netherlands
- Steve Canham
  - European Clinical Research Infrastructure Network (ECRIN), Paris 75013, France
- Maria Luisa Chiusano
  - European Marine Biological Resource Centre (EMBRC), Department of Agricultural Sciences, University Federico II of Naples, via Università, Naples 80138, Italy
- Walter Dastrù
  - Department of Molecular Biotechnology and Health Sciences, Molecular Imaging Center, University of Torino, Torino I-10125, Italy
- Arnaud Laroquette
  - European Marine Biological Resource Centre (EMBRC), Paris 75252, France
- Dario Longo
  - European Research Infrastructure for Biological and Biomedical Imaging (Euro-BioImaging), Torino 10126, Italy
- Audrey Richard
  - European Research Infrastructure on Highly Pathogenic Agents (ERINHA AISBL), Brussels 1000, Belgium
- Pablo Emilio Verde
  - Coordination Centre for Clinical Trials, Heinrich Heine University Düsseldorf, Nordrhein-Westfalen 40225, Germany
|
3
|
Zhang C, Bijlard J, Staiger C, Scollen S, van Enckevort D, Hoogstrate Y, Senf A, Hiltemann S, Repo S, Pipping W, Bierkens M, Payralbe S, Stringer B, Heringa J, Stubbs A, Bonino Da Silva Santos LO, Belien J, Weistra W, Azevedo R, van Bochove K, Meijer G, Boiten JW, Rambla J, Fijneman R, Spalding JD, Abeln S. Systematically linking tranSMART, Galaxy and EGA for reusing human translational research data. F1000Res 2017; 6. [PMID: 29123641 PMCID: PMC5657030 DOI: 10.12688/f1000research.12168.1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Accepted: 08/14/2017] [Indexed: 01/11/2023] Open
Abstract
The availability of high-throughput molecular profiling techniques has provided more accurate and informative data for regular clinical studies. Nevertheless, complex computational workflows are required to interpret these data. Over the past years, the data volume has been growing explosively, requiring robust human data management to organise and integrate the data efficiently. For this reason, we set up an ELIXIR implementation study, together with the Translational research IT (TraIT) programme, to design a data ecosystem that is able to link raw and interpreted data. In this project, the data from the TraIT Cell Line Use Case (TraIT-CLUC) are used as a test case for this system. Within this ecosystem, we use the European Genome-phenome Archive (EGA) to store raw molecular profiling data; tranSMART to collect interpreted molecular profiling data and clinical data for corresponding samples; and Galaxy to store, run and manage the computational workflows. We can integrate these data by linking their repositories systematically. To showcase our design, we have structured the TraIT-CLUC data, which contain a variety of molecular profiling data types, for storage in both tranSMART and EGA. The metadata provided allows referencing between tranSMART and EGA, fulfilling the cycle of data submission and discovery; we have also designed a data flow from EGA to Galaxy, enabling reanalysis of the raw data in Galaxy. In this way, users can select patient cohorts in tranSMART, trace them back to the raw data and perform (re)analysis in Galaxy. Our conclusion is that the majority of metadata does not necessarily need to be stored (redundantly) in both databases, but that instead FAIR persistent identifiers should be available for well-defined data ontology levels: study, data access committee, physical sample, data sample and raw data file. This approach will pave the way for the stable linkage and reuse of data.
Affiliation(s)
- Chao Zhang
  - Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, 1081 HV, Netherlands
- Jochem Bijlard
  - Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, 1081 HV, Netherlands
  - The Hyve, Utrecht, 3511 MJ, Netherlands
- David van Enckevort
  - Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, 9712 CP, Netherlands
- Youri Hoogstrate
  - Department of Bioinformatics, Erasmus University Medical Center, Rotterdam, 3015 CE, Netherlands
- Saskia Hiltemann
  - Department of Bioinformatics, Erasmus University Medical Center, Rotterdam, 3015 CE, Netherlands
- Bas Stringer
  - Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, 1081 HV, Netherlands
- Jaap Heringa
  - Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, 1081 HV, Netherlands
- Andrew Stubbs
  - Department of Bioinformatics, Erasmus University Medical Center, Rotterdam, 3015 CE, Netherlands
- Jeroen Belien
  - Department of Pathology, VU University Medical Center Amsterdam, Amsterdam, 1081 HV, Netherlands
- Gerrit Meijer
  - Netherlands Cancer Institute, Amsterdam, 1066 CX, Netherlands
- Jordi Rambla
  - Centre for Genomic Regulation (CRG), Barcelona, 08003, Spain
- Remond Fijneman
  - Netherlands Cancer Institute, Amsterdam, 1066 CX, Netherlands
- Sanne Abeln
  - Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, 1081 HV, Netherlands
|
4
|
Hoogstrate Y, Zhang C, Senf A, Bijlard J, Hiltemann S, van Enckevort D, Repo S, Heringa J, Jenster G, J A Fijneman R, Boiten JW, A Meijer G, Stubbs A, Rambla J, Spalding D, Abeln S. Integration of EGA secure data access into Galaxy. F1000Res 2017; 5. [PMID: 28232859 PMCID: PMC5302147 DOI: 10.12688/f1000research.10221.1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Accepted: 11/30/2016] [Indexed: 12/31/2022] Open
Abstract
High-throughput molecular profiling techniques are routinely generating vast amounts of data for translational medicine studies. Secure access controlled systems are needed to manage, store, transfer and distribute these data due to its personally identifiable nature. The European Genome-phenome Archive (EGA) was created to facilitate access and management to long-term archival of bio-molecular data. Each data provider is responsible for ensuring a Data Access Committee is in place to grant access to data stored in the EGA. Moreover, the transfer of data during upload and download is encrypted. ELIXIR, a European research infrastructure for life-science data, initiated a project (2016 Human Data Implementation Study) to understand and document the ELIXIR requirements for secure management of controlled-access data. As part of this project, a full ecosystem was designed to connect archived raw experimental molecular profiling data with interpreted data and the computational workflows, using the CTMM Translational Research IT (CTMM-TraIT) infrastructure http://www.ctmm-trait.nl as an example. Here we present the first outcomes of this project, a framework to enable the download of EGA data to a Galaxy server in a secure way. Galaxy provides an intuitive user interface for molecular biologists and bioinformaticians to run and design data analysis workflows. More specifically, we developed a tool, ega_download_streamer, that can download data securely from EGA into a Galaxy server, which can subsequently be further processed. This tool will allow a user within the browser to run an entire analysis containing sensitive data from EGA, and to make this analysis available for other researchers in a reproducible manner, as shown with a proof of concept study. The tool ega_download_streamer is available in the Galaxy tool shed: https://toolshed.g2.bx.psu.edu/view/yhoogstrate/ega_download_streamer.
Affiliation(s)
- Youri Hoogstrate
  - Department of Bioinformatics, ErasmusMC Rotterdam, Rotterdam, Netherlands
- Chao Zhang
  - Department of Computer Science, Vrije Universiteit, Amsterdam, Netherlands
- Alexander Senf
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
- Saskia Hiltemann
  - Department of Bioinformatics, ErasmusMC Rotterdam, Rotterdam, Netherlands
- Jaap Heringa
  - Department of Computer Science, Vrije Universiteit, Amsterdam, Netherlands
- Guido Jenster
  - Department of Urology, ErasmusMC Rotterdam, Rotterdam, Netherlands
- Gerrit A Meijer
  - Diagnostic Oncology, Netherlands Cancer Institute, Amsterdam, Netherlands
- Andrew Stubbs
  - Department of Bioinformatics, ErasmusMC Rotterdam, Rotterdam, Netherlands
- Jordi Rambla
  - Centre for Genomic Regulation, Parc de Recerca Biomédica de Barcelona, Barcelona, Spain
- Dylan Spalding
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
- Sanne Abeln
  - Department of Computer Science, Vrije Universiteit, Amsterdam, Netherlands
|
5
|
Bierkens M, van der Linden W, Weistra W, van Bochove K, Beliën JA, Fijneman RJ, Azevedo R, Boiten JW, Meijer GA. Abstract 3166: Querying, viewing and analyzing colorectal cancer translational research studies in tranSMART. Cancer Res 2016. [DOI: 10.1158/1538-7445.am2016-3166] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/16/2022]
Abstract
All translational research projects basically share the same design, which is correlating variation in disease phenotype to variation in underlying biology. Typical questions to be addressed are: ‘Which (out of thousands) biomarkers predict good/bad outcome?’ and ‘Which (out of thousands) biomarkers predict whether a patient will benefit from a particular therapy?’. In line with this concept, many research teams have generated large amounts of experimental molecular data from patient samples, yet generally this information is inaccessible for examination due to local storage of both (meta)data and the data processing workflows used. Alternatively, data stored in central databases may only be available for exploration and interpretation by data specialists, provided that the processing workflow has been published and is available. Thus, if data is not recorded and easily retrievable, validation of obtained results (e.g. promising biomarkers) may be virtually impossible. Additionally, it will be difficult to query existing data sets to answer new questions, which may lead to experiments being unnecessarily repeated and biological materials being wasted. In the Netherlands, the Translational Research IT (TraIT) project initiated by the Center for Translational Molecular Medicine (CTMM, www.ctmm-trait.nl) aims to provide IT solutions to support translational research from start to end, including sustainable management and analysis of data. These data types involve the clinical, biomedical imaging, biobanking, (molecular) experimental data domains and their associated workflows. In addition to domain-specific solutions to manage these data types, the processed or ‘final’ data of these different domains will become available in the data-integration platform tranSMART for querying, visualization and analysis. 
To ensure sustainable data stewardship and provide easy access to existing data for the CTMM colorectal cancer project ‘Decrease in Colorectal Cancer Death (DeCoDe)’, data has been made available in tranSMART and more datasets are being added. The data encompasses both clinical and (molecular) experimental data from non-high-throughput molecular profiling (NHTMP) assays and array-based profiling techniques.
It is now possible with tranSMART to explore the respective DeCoDe datasets and perform various analyses (survival, Fisher-exact tests, ANOVA, aCGH tests etc.), without needing deep bioinformatics expertise. Through metadata tags it is possible to trace back raw or pre-processed data in other tools (e.g. clinical data in OpenClinica, array-data in GEO, or histological images of tissue microarrays). Data can be examined in more detail by domain experts using tools they are familiar with or other tools provided by CTMM-TraIT. Thus, existing rich data are made findable, accessible, interoperable and reusable (FAIR) and source data can be traced back for customized processing.
Citation Format: Mariska Bierkens, Wim van der Linden, Ward Weistra, Kees van Bochove, Jeroen A.M. Beliën, Remond J.A. Fijneman, Rita Azevedo, Jan-Willem Boiten, Gerrit A. Meijer. Querying, viewing and analyzing colorectal cancer translational research studies in tranSMART. [abstract]. In: Proceedings of the 107th Annual Meeting of the American Association for Cancer Research; 2016 Apr 16-20; New Orleans, LA. Philadelphia (PA): AACR; Cancer Res 2016;76(14 Suppl):Abstract nr 3166.
Affiliation(s)
- Jan-Willem Boiten
  - Center for Translational Molecular Medicine, Eindhoven, Netherlands
|
6
|
Cavelaars M, Rousseau J, Parlayan C, de Ridder S, Verburg A, Ross R, Visser G, Rotte A, Azevedo R, Boiten JW, Meijer GA, Belien JAM, Verheul H. OpenClinica. J Clin Bioinforma 2015. [PMCID: PMC4460614 DOI: 10.1186/2043-9113-5-s1-s2] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Indexed: 11/10/2022] Open
|
7
|
|
8
|
Baede EJ, den Bekker E, Boiten JW, Cronin D, van Gammeren R, de Vlieg J. Integrated Project Views: Decision Support Platform for Drug Discovery Project Teams. J Chem Inf Model 2012; 52:1438-49. [DOI: 10.1021/ci200253g] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/14/2022]
Affiliation(s)
- Eric J. Baede
  - Discovery Informatics, Molecular Design and Informatics Department, MSD, Molenstraat 110, 5342 CC Oss, The Netherlands
- Ernest den Bekker
  - Discovery Informatics, Molecular Design and Informatics Department, MSD, Molenstraat 110, 5342 CC Oss, The Netherlands
- Jan-Willem Boiten
  - Discovery Informatics, Molecular Design and Informatics Department, MSD, Molenstraat 110, 5342 CC Oss, The Netherlands
- Deborah Cronin
  - Discovery Informatics, Molecular Design and Informatics Department, MSD, Molenstraat 110, 5342 CC Oss, The Netherlands
- Rob van Gammeren
  - Discovery Informatics, Molecular Design and Informatics Department, MSD, Molenstraat 110, 5342 CC Oss, The Netherlands
- Jacob de Vlieg
  - Discovery Informatics, Molecular Design and Informatics Department, MSD, Molenstraat 110, 5342 CC Oss, The Netherlands
|
9
|
Lusher SJ, McGuire R, Azevedo R, Boiten JW, van Schaik RC, de Vlieg J. A molecular informatics view on best practice in multi-parameter compound optimization. Drug Discov Today 2011; 16:555-68. [DOI: 10.1016/j.drudis.2011.05.005] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 12/28/2010] [Revised: 02/25/2011] [Accepted: 05/06/2011] [Indexed: 01/30/2023]
|
10
|
|