1
Ohmann C, Khorchani T, Cracanel A, Brüning J, Verde PE. An open source statistical web application for validation and analysis of virtual cohorts. Sci Rep 2025; 15:15744. [PMID: 40328940 PMCID: PMC12056029 DOI: 10.1038/s41598-025-99720-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2024] [Accepted: 04/22/2025] [Indexed: 05/08/2025] Open
Abstract
The conventional approach to developing medical treatments and medical devices usually covers pre-clinical and in-vitro investigations, in-vivo animal studies and clinical trials with humans. In-silico trials and virtual cohorts present a promising avenue for addressing the challenges inherent in clinical research and improving its efficiency. Despite considerable advancements in the field of in-silico trials, several notable gaps and challenges still need to be addressed; one is the limited availability of open and user-friendly statistical tools to support the specific analysis of virtual cohorts and in-silico trials. In the EU-Horizon funded project SIMCor we have developed a web application providing an R statistical environment that supports the validation of virtual cohorts and the application of validated cohorts for in-silico trials. It provides a practical platform for validating cohorts and has implemented existing statistical techniques that can be applied to compare virtual cohorts with real datasets. It is fully open, generic and menu driven and provides user guidance and help (https://github.com/ecrin-github/SIMCor, https://zenodo.org/records/14718597). The tool has been developed according to specified user requirements and has been extensively tested and validated. Important next steps are to gain more experience with the tool in other domains and research environments and to extend its functionality.
Affiliation(s)
- Christian Ohmann
- European Clinical Research Infrastructure Network (ECRIN), Kaiserswerther Strasse 70, 40477, Düsseldorf, Germany.
- Takoua Khorchani
- European Clinical Research Infrastructure Network (ECRIN), 30 Bd Saint-Jacques, 75014, Paris, France
- Alexandru Cracanel
- Automation and Information Technology, Transilvania University of Brasov, Mihai Viteazu nr. 5, 5000174, Brasov, Romania
- Jan Brüning
- Institut Für Kardiovaskuläre Computer-Assistierte Medizin, Charité - Universitätsmedizin Berlin, Augustenburger Pl. 1, 13353, Berlin, Germany
- Pablo Emilio Verde
- Coordination Centre for Clinical Trials, Heinrich Heine University Düsseldorf, Moorenstrasse 5, 40225, Düsseldorf, Nordrhein-Westfalen, Germany
2
Pasculli G, Virgolin M, Myles P, Vidovszky A, Fisher C, Biasin E, Mourby M, Pappalardo F, D'Amico S, Torchia M, Chebykin A, Carbone V, Emili L, Roeshammar D. Synthetic Data in Healthcare and Drug Development: Definitions, Regulatory Frameworks, Issues. CPT Pharmacometrics Syst Pharmacol 2025; 14:840-852. [PMID: 40193292 PMCID: PMC12072219 DOI: 10.1002/psp4.70021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2025] [Revised: 03/01/2025] [Accepted: 03/10/2025] [Indexed: 04/09/2025] Open
Abstract
With the recent and evolving regulatory frameworks regarding the usage of Artificial Intelligence (AI) in both drug and medical device development, the differentiation between data derived from observed ('true' or 'real') sources and artificial data obtained using process-driven and/or (data-driven) algorithmic processes is emerging as a critical consideration in clinical research and regulatory discourse. We conducted a critical literature review that revealed evidence of the current ambivalent usage of the term "synthetic" (along with derivative terms) to refer to "true/observed" data in the context of clinical trials and AI-generated data (or "artificial" data). This paper, stemming from a critical evaluation of different perspectives captured from the scientific literature and recent regulatory endeavors, seeks to elucidate this distinction, exploring their respective utilities, regulatory stances, and upcoming needs, as well as the potential for both data types in advancing medical science and therapeutic development.
Affiliation(s)
- Marco Virgolin
- InSilicoTrials Technologies B.V., 's-Hertogenbosch, the Netherlands
- Puja Myles
- Medicines and Healthcare products Regulatory Agency, London, UK
- Miranda Mourby
- Centre for Health, Law, and Emerging Technologies (HeLEX), Faculty of Law, University of Oxford, Oxford, UK
- Saverio D'Amico
- Humanitas Clinical and Research Center-IRCCS, Milan, Italy
- Train s.r.l., Milan, Italy
- Luca Emili
- InSilicoTrials Technologies S.p.A., Trieste, Italy
3
Lyte JM, Seyoum MM, Ayala D, Kers JG, Caputi V, Johnson T, Zhang L, Rehberger J, Zhang G, Dridi S, Hale B, De Oliveira JE, Grum D, Smith AH, Kogut M, Ricke SC, Ballou A, Potter B, Proszkowiec-Weglarz M. Do we need a standardized 16S rRNA gene amplicon sequencing analysis protocol for poultry microbiota research? Poult Sci 2025; 104:105242. [PMID: 40334389 DOI: 10.1016/j.psj.2025.105242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2025] [Revised: 04/30/2025] [Accepted: 04/30/2025] [Indexed: 05/09/2025] Open
Abstract
Bacteria are the major component of poultry gastrointestinal tract (GIT) microbiota and play an important role in host health, nutrition, physiology regulation, intestinal development, and growth. Bacterial community profiling based on the 16S ribosomal RNA (rRNA) gene amplicon sequencing approach has become the most popular method to determine the taxonomic composition and diversity of the poultry microbiota. The 16S rRNA gene profiling involves numerous steps, including sample collection and storage, DNA isolation, 16S rRNA gene primer selection, Polymerase Chain Reaction (PCR), library preparation, sequencing, raw sequencing reads processing, taxonomic classification, α- and β-diversity calculations, and statistical analysis. However, there is currently no standardized protocol for 16S rRNA gene analysis profiling and data deposition for poultry microbiota studies. Variations in DNA storage and isolation, primer design, and library preparation are known to introduce biases, affecting community structure and microbial population analysis leading to over- or under-representation of individual bacteria within communities. Additionally, different sequencing platforms, bioinformatics pipeline, and taxonomic database selection can affect classification and determination of the microbial taxa. Moreover, detailed experimental design and DNA processing and sequencing methods are often inadequately reported in poultry 16S rRNA gene sequencing studies. Consequently, poultry microbiota results are often difficult to reproduce and compare across studies. This manuscript reviews current practices in profiling poultry microbiota using 16S rRNA gene amplicon sequencing and proposes the development of guidelines for protocol for 16S rRNA gene sequencing that spans from sample collection through data deposition to achieve more reliable data comparisons across studies and allow for comparisons and/or interpretations of poultry studies conducted worldwide.
Affiliation(s)
- Joshua M Lyte
- United States Department of Agriculture, Agricultural Research Service, Southeast Area, Poultry Production and Product Safety Research, Fayetteville 72701, AR, United States
- Mitiku M Seyoum
- Center of Excellence for Poultry Science, University of Arkansas, Fayetteville 72701, AR, United States
- Diana Ayala
- Purina Animal Nutrition Center, Land O'Lakes, Gray Summit 63039, MO, United States
- Jannigje G Kers
- Faculty of Veterinary Medicine, Utrecht University, and Laboratory of Microbiology, Wageningen University & Research, The Netherlands
- Valentina Caputi
- United States Department of Agriculture, Agricultural Research Service, Southeast Area, Poultry Production and Product Safety Research, Fayetteville 72701, AR, United States
- Timothy Johnson
- University of Minnesota, Saint Paul 55108, MN, United States
- Li Zhang
- Mississippi State University, Mississippi State 39762, MS, United States
- Joshua Rehberger
- Arm and Hammer Animal Nutrition, Waukesha 53186, WI, United States
- Guolong Zhang
- Department of Animal and Food Sciences, Oklahoma State University, Stillwater 74078, OK, United States
- Sami Dridi
- Center of Excellence for Poultry Science, University of Arkansas, Fayetteville 72701, AR, United States
- Brett Hale
- AgriGro, Doniphan 6393, MO, United States
- Daniel Grum
- Purina Animal Nutrition Center, Land O'Lakes, Gray Summit 63039, MO, United States
- Alexandra H Smith
- Mississippi State University, Mississippi State 39762, MS, United States
- Michael Kogut
- United States Department of Agriculture, Agricultural Research Service, Southern Plains Agricultural Research Center, College Station 77845, TX, United States
- Steven C Ricke
- Department of Animal and Dairy Sciences, University of Wisconsin, Madison, 53706, WI, United States
- Anne Ballou
- Iluma Alliance, Durham 27703, NC, United States
- Bill Potter
- Center of Excellence for Poultry Science, University of Arkansas, Fayetteville 72701, AR, United States
- Monika Proszkowiec-Weglarz
- United States Department of Agriculture, Agricultural Research Service, Northeast Area, Beltsville Agriculture Research Center, Animal Biosciences and Biotechnology Laboratory, Beltsville 20705, MD, United States.
4
Clarke DJ, Evangelista JE, Xie Z, Marino GB, Byrd AI, Maurya MR, Srinivasan S, Yu K, Petrosyan V, Roth ME, Milinkov M, King CH, Vora JK, Keeney J, Nemarich C, Khan W, Lachmann A, Ahmed N, Agris A, Pan J, Ramachandran S, Fahy E, Esquivel E, Mihajlovic A, Jevtic B, Milinovic V, Kim S, McNeely P, Wang T, Wenger E, Brown MA, Sickler A, Zhu Y, Jenkins SL, Blood PD, Taylor DM, Resnick AC, Mazumder R, Milosavljevic A, Subramaniam S, Ma’ayan A. Playbook workflow builder: Interactive construction of bioinformatics workflows. PLoS Comput Biol 2025; 21:e1012901. [PMID: 40179105 PMCID: PMC11967941 DOI: 10.1371/journal.pcbi.1012901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2024] [Accepted: 02/24/2025] [Indexed: 04/05/2025] Open
Abstract
The Playbook Workflow Builder (PWB) is a web-based platform to dynamically construct and execute bioinformatics workflows by utilizing a growing network of input datasets, semantically annotated API endpoints, and data visualization tools contributed by an ecosystem of collaborators. Via a user-friendly user interface, workflows can be constructed from contributed building-blocks without technical expertise. The output of each step of the workflow is added into reports containing textual descriptions, figures, tables, and references. To construct workflows, users can click on cards that represent each step in a workflow, or construct workflows via a chat interface that is assisted by a large language model (LLM). Completed workflows are compatible with Common Workflow Language (CWL) and can be published as research publications, slideshows, and posters. To demonstrate how the PWB generates meaningful hypotheses that draw knowledge from across multiple resources, we present several use cases. For example, one of these use cases prioritizes drug targets for individual cancer patients using data from the NIH Common Fund programs GTEx, LINCS, Metabolomics, GlyGen, and ExRNA. The workflows created with PWB can be repurposed to tackle similar use cases using different inputs. The PWB platform is available from: https://playbook-workflow-builder.cloud/.
Affiliation(s)
- Daniel J.B. Clarke
- Department of Pharmacological Sciences, Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- John Erol Evangelista
- Department of Pharmacological Sciences, Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Zhuorui Xie
- Department of Pharmacological Sciences, Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Giacomo B. Marino
- Department of Pharmacological Sciences, Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Anna I. Byrd
- Department of Pharmacological Sciences, Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Mano R. Maurya
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Sumana Srinivasan
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Keyang Yu
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Varduhi Petrosyan
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Matthew E. Roth
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Charles Hadley King
- Department of Biochemistry and Molecular Medicine, The George Washington School of Medicine and Health Sciences, Washington, DC, United States of America
- Jeet Kiran Vora
- Department of Biochemistry and Molecular Medicine, The George Washington School of Medicine and Health Sciences, Washington, DC, United States of America
- Jonathon Keeney
- Department of Biochemistry and Molecular Medicine, The George Washington School of Medicine and Health Sciences, Washington, DC, United States of America
- Christopher Nemarich
- Department of Biomedical and Health Informatics; Department of Pediatrics, The Children’s Hospital of Philadelphia, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States of America
- Center for Data Driven Discovery in Biomedicine, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- William Khan
- Department of Biomedical and Health Informatics; Department of Pediatrics, The Children’s Hospital of Philadelphia, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States of America
- Center for Data Driven Discovery in Biomedicine, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Alexander Lachmann
- Department of Pharmacological Sciences, Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Nasheath Ahmed
- Department of Pharmacological Sciences, Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Alexandra Agris
- Department of Pharmacological Sciences, Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Juncheng Pan
- Department of Pharmacological Sciences, Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Srinivasan Ramachandran
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Eoin Fahy
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Emmanuel Esquivel
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Bosko Jevtic
- Persida Inc., Brooklyn, New York, United States of America
- Vuk Milinovic
- Persida Inc., Brooklyn, New York, United States of America
- Sean Kim
- Department of Biochemistry and Molecular Medicine, The George Washington School of Medicine and Health Sciences, Washington, DC, United States of America
- Patrick McNeely
- Department of Biochemistry and Molecular Medicine, The George Washington School of Medicine and Health Sciences, Washington, DC, United States of America
- Tianyi Wang
- Department of Biochemistry and Molecular Medicine, The George Washington School of Medicine and Health Sciences, Washington, DC, United States of America
- Eric Wenger
- Department of Biomedical and Health Informatics; Department of Pediatrics, The Children’s Hospital of Philadelphia, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States of America
- Center for Data Driven Discovery in Biomedicine, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Miguel A. Brown
- Department of Biomedical and Health Informatics; Department of Pediatrics, The Children’s Hospital of Philadelphia, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States of America
- Center for Data Driven Discovery in Biomedicine, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Alexander Sickler
- Department of Biomedical and Health Informatics; Department of Pediatrics, The Children’s Hospital of Philadelphia, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States of America
- Center for Data Driven Discovery in Biomedicine, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Yuankun Zhu
- Department of Biomedical and Health Informatics; Department of Pediatrics, The Children’s Hospital of Philadelphia, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States of America
- Center for Data Driven Discovery in Biomedicine, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Sherry L. Jenkins
- Department of Pharmacological Sciences, Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Philip D. Blood
- Pittsburgh Supercomputing Center, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Deanne M. Taylor
- Department of Biomedical and Health Informatics; Department of Pediatrics, The Children’s Hospital of Philadelphia, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States of America
- Center for Data Driven Discovery in Biomedicine, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Adam C. Resnick
- Department of Biomedical and Health Informatics; Department of Pediatrics, The Children’s Hospital of Philadelphia, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States of America
- Center for Data Driven Discovery in Biomedicine, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Raja Mazumder
- Department of Biochemistry and Molecular Medicine, The George Washington School of Medicine and Health Sciences, Washington, DC, United States of America
- Aleksandar Milosavljevic
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Shankar Subramaniam
- Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Avi Ma’ayan
- Department of Pharmacological Sciences, Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
5
Stroggilos R, Tserga A, Zoidakis J, Vlahou A, Makridakis M. Tissue proteomics repositories for data reanalysis. MASS SPECTROMETRY REVIEWS 2024; 43:1270-1284. [PMID: 37534389 DOI: 10.1002/mas.21860] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/14/2023] [Revised: 07/17/2023] [Accepted: 07/18/2023] [Indexed: 08/04/2023]
Abstract
We are approaching the third decade since the establishment of the very first proteomics repositories back in the mid-'00s. New experimental approaches and technologies continuously enrich the field while producing vast amounts of mass spectrometry data. Together with initiatives to establish standard terminology and file formats, proteomics is rapidly transforming into a mature component of systems biology. Here we describe the ProteomeXchange consortium repositories. We specifically search, collect and evaluate public human tissue datasets (categorized as "complete" by the repository) submitted in 2015-2022, to both map the existing information and assess the data set reusability. Human tissue data are variably represented in the repositories reviewed, ranging between 10% and 25% of the total data submitted, with cancers being the most represented, followed by neuronal and cardiovascular diseases. About half of the retrieved data sets were found to lack annotations or metadata necessary to directly replicate the analysis. This poses a serious challenge to data reusability and highlights the need to increase awareness of the MAGE-TAB file format for metadata in the community. Overall, proteomics repositories have evolved greatly over the past 7 years, as they have grown in size and become equipped with various powerful applications and tools that enable data searching and analytical tasks. However, to make the most of this potential, priority must be given to finding ways to secure detailed metadata for each submission, which is likely the next major milestone for proteomics repositories.
Affiliation(s)
- Rafael Stroggilos
- Biomedical Research Foundation, Academy of Athens, Department of Biotechnology, Athens, Greece
- Aggeliki Tserga
- Biomedical Research Foundation, Academy of Athens, Department of Biotechnology, Athens, Greece
- Jerome Zoidakis
- Biomedical Research Foundation, Academy of Athens, Department of Biotechnology, Athens, Greece
- Antonia Vlahou
- Biomedical Research Foundation, Academy of Athens, Department of Biotechnology, Athens, Greece
- Manousos Makridakis
- Biomedical Research Foundation, Academy of Athens, Department of Biotechnology, Athens, Greece
6
Vahdati S, Khosravi B, Mahmoudi E, Zhang K, Rouzrokh P, Faghani S, Moassefi M, Tahmasebi A, Andriole KP, Chang P, Farahani K, Flores MG, Folio L, Houshmand S, Giger ML, Gichoya JW, Erickson BJ. A Guideline for Open-Source Tools to Make Medical Imaging Data Ready for Artificial Intelligence Applications: A Society of Imaging Informatics in Medicine (SIIM) Survey. JOURNAL OF IMAGING INFORMATICS IN MEDICINE 2024; 37:2015-2024. [PMID: 38558368 PMCID: PMC11522208 DOI: 10.1007/s10278-024-01083-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2024] [Revised: 02/29/2024] [Accepted: 03/08/2024] [Indexed: 04/04/2024]
Abstract
In recent years, the role of Artificial Intelligence (AI) in medical imaging has become increasingly prominent, with the majority of AI applications approved by the FDA being in imaging and radiology in 2023. The surge in AI model development to tackle clinical challenges underscores the necessity for preparing high-quality medical imaging data. Proper data preparation is crucial as it fosters the creation of standardized and reproducible AI models while minimizing biases. Data curation transforms raw data into a valuable, organized, and dependable resource and is a fundamental process to the success of machine learning and analytical projects. Considering the plethora of available tools for data curation in different stages, it is crucial to stay informed about the most relevant tools within specific research areas. In the current work, we propose a descriptive outline for the different steps of data curation and, for each of these stages, provide compilations of tools collected from a survey conducted among members of the Society of Imaging Informatics in Medicine (SIIM). This collection has the potential to enhance the decision-making process for researchers as they select the most appropriate tool for their specific tasks.
Affiliation(s)
- Sanaz Vahdati
- Artificial Intelligence Laboratory, Department of Radiology, Mayo Clinic, 200 1st Street, SW, Rochester, MN, 55905, USA
- Bardia Khosravi
- Artificial Intelligence Laboratory, Department of Radiology, Mayo Clinic, 200 1st Street, SW, Rochester, MN, 55905, USA
- Elham Mahmoudi
- Artificial Intelligence Laboratory, Department of Radiology, Mayo Clinic, 200 1st Street, SW, Rochester, MN, 55905, USA
- Kuan Zhang
- Artificial Intelligence Laboratory, Department of Radiology, Mayo Clinic, 200 1st Street, SW, Rochester, MN, 55905, USA
- Pouria Rouzrokh
- Artificial Intelligence Laboratory, Department of Radiology, Mayo Clinic, 200 1st Street, SW, Rochester, MN, 55905, USA
- Shahriar Faghani
- Artificial Intelligence Laboratory, Department of Radiology, Mayo Clinic, 200 1st Street, SW, Rochester, MN, 55905, USA
- Mana Moassefi
- Artificial Intelligence Laboratory, Department of Radiology, Mayo Clinic, 200 1st Street, SW, Rochester, MN, 55905, USA
- Aylin Tahmasebi
- Department of Radiology, Thomas Jefferson University, Philadelphia, PA, USA
- Katherine P Andriole
- Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Peter Chang
- Department of Radiological Sciences, Irvine Medical Center, University of California, Orange, CA, USA
- Keyvan Farahani
- Center for Biomedical Informatics and Information Technology, National Cancer Institute, Bethesda, MD, USA
- Les Folio
- Diagnostic Imaging & Interventional Radiology, Moffitt Cancer Center, Tampa, FL, USA
- Sina Houshmand
- Department of Radiology and Biomedical Imaging, University of California San Francisco, San Francisco, CA, USA
- Maryellen L Giger
- Department of Radiology, The University of Chicago, Chicago, IL, USA
- Judy W Gichoya
- Department of Radiology, Emory University School of Medicine, Atlanta, GA, USA
- Bradley J Erickson
- Artificial Intelligence Laboratory, Department of Radiology, Mayo Clinic, 200 1st Street, SW, Rochester, MN, 55905, USA.
7
Baykal PI, Łabaj PP, Markowetz F, Schriml LM, Stekhoven DJ, Mangul S, Beerenwinkel N. Genomic reproducibility in the bioinformatics era. Genome Biol 2024; 25:213. [PMID: 39123217 PMCID: PMC11312195 DOI: 10.1186/s13059-024-03343-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Accepted: 07/23/2024] [Indexed: 08/12/2024] Open
Abstract
In biomedical research, validating a scientific discovery hinges on the reproducibility of its experimental results. However, in genomics, the definition and implementation of reproducibility remain imprecise. We argue that genomic reproducibility, defined as the ability of bioinformatics tools to maintain consistent results across technical replicates, is essential for advancing scientific knowledge and medical applications. Initially, we examine different interpretations of reproducibility in genomics to clarify terms. Subsequently, we discuss the impact of bioinformatics tools on genomic reproducibility and explore methods for evaluating these tools regarding their effectiveness in ensuring genomic reproducibility. Finally, we recommend best practices to improve genomic reproducibility.
Affiliation(s)
- Pelin Icer Baykal
- Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, 4058, Basel, Switzerland
- Paweł Piotr Łabaj
- Małopolska Centre of Biotechnology, Jagiellonian University, 30-387, Gronostajowa 7A, Krakow, Poland
- Department of Biotechnology, Boku University Vienna, Muthgasse 18, 1190, Vienna, Austria
- Florian Markowetz
- Cancer Research UK Cambridge Research Institute, Cambridge, CB2 0RE, UK
- Department of Oncology, University of Cambridge, Cambridge, CB2 2XZ, UK
- Lynn M Schriml
- Institute for Genome Sciences, University of Maryland School of Medicine, HSFIII, 670 W. Baltimore St, Baltimore, MD, 21201, USA
- Daniel J Stekhoven
- SIB Swiss Institute of Bioinformatics, 4058, Basel, Switzerland
- NEXUS Personalized Health Technologies, ETH Zurich, 8952, Zurich, Switzerland
- Serghei Mangul
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, 1540 Alcazar Street, Los Angeles, CA, 90033, USA.
- Department of Quantitative and Computational Biology, University of Southern California Dornsife College of Letters, Arts, and Sciences, Los Angeles, CA, 90089, USA.
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland.
- SIB Swiss Institute of Bioinformatics, 4058, Basel, Switzerland.
8
Conrad TOF, Ferrer E, Mietchen D, Pusch L, Stegmüller J, Schubotz M. Making Mathematical Research Data FAIR: Pathways to Improved Data Sharing. Sci Data 2024; 11:676. [PMID: 38909043 PMCID: PMC11193822 DOI: 10.1038/s41597-024-03480-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2023] [Accepted: 06/05/2024] [Indexed: 06/24/2024] Open
Abstract
The sharing and citation of research data is becoming increasingly recognized as an essential building block in scientific research across various fields and disciplines. Sharing research data allows other researchers to reproduce results, replicate findings, and build on them. Ultimately, this will foster faster cycles in knowledge generation. Some disciplines, such as astronomy or bioinformatics, already have a long history of sharing data; many others do not. The current landscape of available systems for sharing research data is diverse. In this article, we conduct a detailed analysis of existing web-based systems, specifically focusing on mathematical research data.
Affiliation(s)
| | | | - Daniel Mietchen
- FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, Berlin, Germany
- Ronin Institute for Independent Scholarship, Montclair, USA
| | | | - Johannes Stegmüller
- FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, Berlin, Germany
| | - Moritz Schubotz
- FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, Berlin, Germany.
| |
|
9
|
Hwang H, Jeon H, Yeo N, Baek D. Big data and deep learning for RNA biology. Exp Mol Med 2024; 56:1293-1321. [PMID: 38871816 PMCID: PMC11263376 DOI: 10.1038/s12276-024-01243-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/31/2024] [Revised: 02/27/2024] [Accepted: 03/05/2024] [Indexed: 06/15/2024] Open
Abstract
The exponential growth of big data in RNA biology (RB) has led to the development of deep learning (DL) models that have driven crucial discoveries. As constantly evidenced by DL studies in other fields, the successful implementation of DL in RB depends heavily on the effective utilization of large-scale datasets from public databases. In achieving this goal, data encoding methods, learning algorithms, and techniques that align well with biological domain knowledge have played pivotal roles. In this review, we provide guiding principles for applying these DL concepts to various problems in RB by demonstrating successful examples and associated methodologies. We also discuss the remaining challenges in developing DL models for RB and suggest strategies to overcome these challenges. Overall, this review aims to illuminate the compelling potential of DL for RB and ways to apply this powerful technology to investigate the intriguing biology of RNA more effectively.
Affiliation(s)
- Hyeonseo Hwang
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
| | - Hyeonseong Jeon
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Genome4me Inc., Seoul, Republic of Korea
| | - Nagyeong Yeo
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
| | - Daehyun Baek
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea.
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea.
- Genome4me Inc., Seoul, Republic of Korea.
| |
|
10
|
Seep L, Grein S, Splichalova I, Ran D, Mikhael M, Hildebrand S, Lauterbach M, Hiller K, Ribeiro DJS, Sieckmann K, Kardinal R, Huang H, Yu J, Kallabis S, Behrens J, Till A, Peeva V, Strohmeyer A, Bruder J, Blum T, Soriano-Arroquia A, Tischer D, Kuellmer K, Li Y, Beyer M, Gellner AK, Fromme T, Wackerhage H, Klingenspor M, Fenske WK, Scheja L, Meissner F, Schlitzer A, Mass E, Wachten D, Latz E, Pfeifer A, Hasenauer J. From Planning Stage Towards FAIR Data: A Practical Metadatasheet For Biomedical Scientists. Sci Data 2024; 11:524. [PMID: 38778016 PMCID: PMC11111677 DOI: 10.1038/s41597-024-03349-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Accepted: 05/08/2024] [Indexed: 05/25/2024] Open
Abstract
Datasets consist of measurement data and metadata. Metadata provides context, essential for understanding and (re-)using data. Various metadata standards exist for different methods, systems and contexts. However, relevant information resides at differing stages across the data-lifecycle. Often, this information is defined and standardized only at publication stage, which can lead to data loss and workload increase. In this study, we developed Metadatasheet, a metadata standard based on interviews with members of two biomedical consortia and systematic screening of data repositories. It aligns with the data-lifecycle allowing synchronous metadata recording within Microsoft Excel, a widespread data recording software. Additionally, we provide an implementation, the Metadata Workbook, that offers user-friendly features like automation, dynamic adaption, metadata integrity checks, and export options for various metadata standards. By design and due to its extensive documentation, the proposed metadata standard simplifies recording and structuring of metadata for biomedical scientists, promoting practicality and convenience in data management. This framework can accelerate scientific progress by enhancing collaboration and knowledge transfer throughout the intermediate steps of data creation.
Affiliation(s)
- Lea Seep
- Computational Biology, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany
| | - Stephan Grein
- Computational Biology, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany
| | - Iva Splichalova
- Developmental Biology of the Immune System, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany
| | - Danli Ran
- Institute of Pharmacology and Toxicology, University Hospital, University of Bonn, Bonn, Germany
| | - Mickel Mikhael
- Institute of Pharmacology and Toxicology, University Hospital, University of Bonn, Bonn, Germany
| | - Staffan Hildebrand
- Institute of Pharmacology and Toxicology, University Hospital, University of Bonn, Bonn, Germany
| | - Mario Lauterbach
- Department of Bioinformatics and Biochemistry, Technical University Braunschweig, Braunschweig, Germany
| | - Karsten Hiller
- Department of Bioinformatics and Biochemistry, Technical University Braunschweig, Braunschweig, Germany
| | | | - Katharina Sieckmann
- Institute of Innate Immunity, University Hospital Bonn, University of Bonn, Bonn, Germany
| | - Ronja Kardinal
- Institute of Innate Immunity, University Hospital Bonn, University of Bonn, Bonn, Germany
| | - Hao Huang
- Developmental Biology of the Immune System, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany
| | - Jiangyan Yu
- Computational Biology, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany
- Quantitative Systems Biology, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany
| | - Sebastian Kallabis
- Systems Immunology and Proteomics, Institute of Innate Immunity, Medical Faculty, University of Bonn, Bonn, Germany
| | - Janina Behrens
- Department of Biochemistry and Molecular Cell Biology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
| | - Andreas Till
- Department of Internal Medicine I, Division of Endocrinology, Diabetes and Metabolism, University Medical Center Bonn, Bonn, Germany
| | - Viktoriya Peeva
- Department of Internal Medicine I, Division of Endocrinology, Diabetes and Metabolism, University Medical Center Bonn, Bonn, Germany
| | - Akim Strohmeyer
- Chair of Molecular Nutritional Medicine, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Johanna Bruder
- Chair of Molecular Nutritional Medicine, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Tobias Blum
- Immunology and Environment, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany
| | - Ana Soriano-Arroquia
- Institute of Pharmacology and Toxicology, University Hospital, University of Bonn, Bonn, Germany
| | - Dominik Tischer
- Institute of Pharmacology and Toxicology, University Hospital, University of Bonn, Bonn, Germany
| | - Katharina Kuellmer
- Chair of Molecular Nutritional Medicine, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Yuanfang Li
- Immunogenomics & Neurodegeneration, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany
| | - Marc Beyer
- Immunogenomics & Neurodegeneration, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany
- PRECISE, Platform for Single Cell Genomics and Epigenomics at the German Center for Neurodegenerative Diseases and the University of Bonn, Bonn, Germany
| | - Anne-Kathrin Gellner
- Department of Psychiatry and Psychotherapy, University Hospital Bonn, Bonn, Germany
- Institute of Physiology II, Medical Faculty, University of Bonn, Bonn, Germany
| | - Tobias Fromme
- Chair of Molecular Nutritional Medicine, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Henning Wackerhage
- School for Medicine and Health, Faculty of Sport and Health Sciences, Technical University of Munich, Munich, Germany
| | - Martin Klingenspor
- Chair of Molecular Nutritional Medicine, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- EKFZ-Else Kröner-Fresenius Center for Nutritional Medicine, Technical University of Munich, Freising, Germany
- ZIEL Institute for Food & Health, Technical University of Munich, Freising, Germany
| | - Wiebke K Fenske
- Department of Internal Medicine I, Division of Endocrinology, Diabetes and Metabolism, University Medical Center Bonn, Bonn, Germany
- Department of Internal Medicine I - Endocrinology, Diabetology and Metabolism, Gastroenterology and Hepatology, University Hospital Bergmannsheil, Bochum, Germany
| | - Ludger Scheja
- Department of Biochemistry and Molecular Cell Biology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
| | - Felix Meissner
- Systems Immunology and Proteomics, Institute of Innate Immunity, Medical Faculty, University of Bonn, Bonn, Germany
- Experimental Systems Immunology, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Andreas Schlitzer
- Quantitative Systems Biology, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany
| | - Elvira Mass
- Developmental Biology of the Immune System, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany
| | - Dagmar Wachten
- Institute of Innate Immunity, University Hospital Bonn, University of Bonn, Bonn, Germany
| | - Eicke Latz
- Institute of Innate Immunity, University Hospital Bonn, University of Bonn, Bonn, Germany
| | - Alexander Pfeifer
- Institute of Pharmacology and Toxicology, University Hospital, University of Bonn, Bonn, Germany
- PharmaCenter Bonn, University of Bonn, Bonn, Germany
| | - Jan Hasenauer
- Computational Biology, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany.
- Helmholtz Center Munich, German Research Center for Environmental Health, Computational Health Center, Munich, Germany.
| |
|
11
|
LeRoy NJ, Khoroshevskyi O, O’Brien A, Stępień R, Arslan A, Sheffield NC. PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata. bioRxiv: the preprint server for biology 2024:2023.08.15.551388. [PMID: 37645717 PMCID: PMC10462087 DOI: 10.1101/2023.08.15.551388] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/31/2023]
Abstract
BACKGROUND As biological data increase, we need additional infrastructure to share them and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important, and in some ways has a wider scope than sharing data itself. RESULTS Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to find existing biological research data or to share new data. AVAILABILITY https://pephub.databio.org.
Affiliation(s)
- Nathan J. LeRoy
- Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, 22904, Charlottesville VA
| | - Oleksandr Khoroshevskyi
- Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
| | - Aaron O’Brien
- Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
| | - Rafał Stępień
- Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
| | - Alip Arslan
- Department of Computer Science, School of Engineering, University of Virginia, 22908, Charlottesville VA
| | - Nathan C. Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
- School of Data Science, University of Virginia, 22904, Charlottesville VA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, 22904, Charlottesville VA
- Department of Public Health Sciences, School of Medicine, University of Virginia, 22908, Charlottesville VA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, 22908, Charlottesville VA
- Child Health Research Center, School of Medicine, University of Virginia, 22908, Charlottesville VA
| |
|
12
|
Kumar B, Lorusso E, Fosso B, Pesole G. A comprehensive overview of microbiome data in the light of machine learning applications: categorization, accessibility, and future directions. Front Microbiol 2024; 15:1343572. [PMID: 38419630 PMCID: PMC10900530 DOI: 10.3389/fmicb.2024.1343572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Accepted: 01/29/2024] [Indexed: 03/02/2024] Open
Abstract
Metagenomics, metabolomics, and metaproteomics have significantly advanced our knowledge of microbial communities by providing culture-independent insights into their composition and functional potential. However, a critical challenge in this field is the lack of standard and comprehensive metadata associated with raw data, hindering the ability to perform robust data stratifications and consider confounding factors. In this comprehensive review, we categorize publicly available microbiome data into five types: shotgun sequencing, amplicon sequencing, metatranscriptomic, metabolomic, and metaproteomic data. We explore the importance of metadata for data reuse and address the challenges in collecting standardized metadata. We also assess the limitations in metadata collection of existing public repositories collecting metagenomic data. This review emphasizes the vital role of metadata in interpreting and comparing datasets and highlights the need for standardized metadata protocols to fully leverage metagenomic data's potential. Furthermore, we explore future directions for implementing Machine Learning (ML) in metadata retrieval, offering promising avenues for a deeper understanding of microbial communities and their ecological roles. Leveraging these tools will enhance our insights into microbial functional capabilities and ecological dynamics in diverse ecosystems. Finally, we emphasize the crucial role of metadata in ML model development.
Affiliation(s)
- Bablu Kumar
- Università degli Studi di Milano, Milan, Italy
- Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
| | - Erika Lorusso
- Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
- National Research Council, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, Bari, Italy
| | - Bruno Fosso
- Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
| | - Graziano Pesole
- Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
- National Research Council, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, Bari, Italy
| |
|
13
|
Samuel S, Mietchen D. Computational reproducibility of Jupyter notebooks from biomedical publications. Gigascience 2024; 13:giad113. [PMID: 38206590 PMCID: PMC10783158 DOI: 10.1093/gigascience/giad113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2022] [Revised: 08/09/2023] [Accepted: 12/08/2023] [Indexed: 01/12/2024] Open
Abstract
BACKGROUND Jupyter notebooks facilitate the bundling of executable code with its documentation and output in one interactive environment, and they represent a popular mechanism to document and share computational workflows, including for research publications. The reproducibility of computational aspects of research is a key component of scientific reproducibility but has not yet been assessed at scale for Jupyter notebooks associated with biomedical publications. APPROACH We address computational reproducibility at 2 levels: (i) using fully automated workflows, we analyzed the computational reproducibility of Jupyter notebooks associated with publications indexed in the biomedical literature repository PubMed Central. We identified such notebooks by mining the article's full text, trying to locate them on GitHub, and attempting to rerun them in an environment as close to the original as possible. We documented reproduction success and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or publications. (ii) This study represents a reproducibility attempt in and of itself, using essentially the same methodology twice on PubMed Central over the course of 2 years, during which the corpus of Jupyter notebooks from articles indexed in PubMed Central has grown in a highly dynamic fashion. RESULTS Out of 27,271 Jupyter notebooks from 2,660 GitHub repositories associated with 3,467 publications, 22,578 notebooks were written in Python, including 15,817 that had their dependencies declared in standard requirement files and that we attempted to rerun automatically. For 10,388 of these, all declared dependencies could be installed successfully, and we reran them to assess reproducibility. Of these, 1,203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones. Running the other notebooks resulted in exceptions. CONCLUSIONS We zoom in on common problems and practices, highlight trends, and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.
Affiliation(s)
- Sheeba Samuel
- Heinz-Nixdorf Chair for Distributed Information Systems, Friedrich Schiller University Jena, Jena 07743, Germany
- Michael Stifel Center Jena, Jena 07743, Germany
| | - Daniel Mietchen
- Ronin Institute, Montclair 07043-2314, NJ, United States
- Institute for Globally Distributed Open Research and Education (IGDORE)
- FIZ Karlsruhe—Leibniz Institute for Information Infrastructure, Berlin 76344, Germany
| |
|
14
|
LeRoy NJ, Khoroshevskyi O, O’Brien A, Stępień R, Arslan A, Sheffield NC. PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata. Gigascience 2024; 13:giae033. [PMID: 38991851 PMCID: PMC11238423 DOI: 10.1093/gigascience/giae033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 02/07/2024] [Accepted: 05/21/2024] [Indexed: 07/13/2024] Open
Abstract
BACKGROUND As biological data increase, we need additional infrastructure to share them and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important and in some ways has a wider scope than sharing data themselves. RESULTS Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural-language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural-language search. PEPhub thus provides a fast and user-friendly way to find existing biological research data or to share new data. AVAILABILITY https://pephub.databio.org.
Affiliation(s)
- Nathan J LeRoy
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
| | - Oleksandr Khoroshevskyi
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| | - Aaron O’Brien
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| | - Rafał Stępień
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| | - Alip Arslan
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Nathan C Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| |
|
15
|
Cerk K, Ugalde‐Salas P, Nedjad CG, Lecomte M, Muller C, Sherman DJ, Hildebrand F, Labarthe S, Frioux C. Community-scale models of microbiomes: Articulating metabolic modelling and metagenome sequencing. Microb Biotechnol 2024; 17:e14396. [PMID: 38243750 PMCID: PMC10832553 DOI: 10.1111/1751-7915.14396] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/09/2023] [Revised: 11/27/2023] [Accepted: 12/20/2023] [Indexed: 01/21/2024] Open
Abstract
Building models is essential for understanding the functions and dynamics of microbial communities. Metabolic models built on genome-scale metabolic network reconstructions (GENREs) are especially relevant as a means to decipher the complex interactions occurring among species. Model reconstruction increasingly relies on metagenomics, which permits direct characterisation of naturally occurring communities that may contain organisms that cannot be isolated or cultured. In this review, we provide an overview of the field of metabolic modelling and its increasing reliance on and synergy with metagenomics and bioinformatics. We survey the means of assigning functions and reconstructing metabolic networks from (meta-)genomes, and present the variety and mathematical fundamentals of metabolic models that foster the understanding of microbial dynamics. We emphasise the characterisation of interactions and the scaling of model construction to large communities, two important bottlenecks in the applicability of these models. We give an overview of the current state of the art in metagenome sequencing and bioinformatics analysis, focusing on the reconstruction of genomes in microbial communities. Metagenomics benefits tremendously from third-generation sequencing, and we discuss the opportunities of long-read sequencing, strain-level characterisation and eukaryotic metagenomics. We aim at providing algorithmic and mathematical support, together with tool and application resources, that permit bridging the gap between metagenomics and metabolic modelling.
Affiliation(s)
- Klara Cerk
- Quadram Institute Bioscience, Norwich, UK
- Earlham Institute, Norwich, UK
| | | | - Chabname Ghassemi Nedjad
- Inria, University of Bordeaux, INRAE, Talence, France
- University of Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, Talence, France
| | - Maxime Lecomte
- Inria, University of Bordeaux, INRAE, Talence, France
- INRAE STLO, University of Rennes, Rennes, France
| | | | | | - Falk Hildebrand
- Quadram Institute Bioscience, Norwich, UK
- Earlham Institute, Norwich, UK
| | - Simon Labarthe
- Inria, University of Bordeaux, INRAE, Talence, France
- INRAE, University of Bordeaux, BIOGECO, UMR 1202, Cestas, France
| | | |
|
16
|
Mackenzie A, Lewis E, Loveland J. Successes and challenges in extracting information from DICOM image databases for audit and research. Br J Radiol 2023; 96:20230104. [PMID: 37698251 PMCID: PMC10607388 DOI: 10.1259/bjr.20230104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Revised: 05/05/2023] [Accepted: 05/11/2023] [Indexed: 09/13/2023] Open
Abstract
In radiography, much valuable associated data (metadata) is generated during image acquisition. The current setup of picture archive and communication systems (PACS) can make extraction of this metadata difficult, especially as it is typically stored with the image. The aim of this work is to examine the current challenges in extracting image metadata and to discuss the potential benefits of using this rich information. This work focuses on breast screening, though the conclusions are applicable to other modalities.
The data stored in PACS contain information, currently underutilised, and is of great benefit for auditing and improving imaging and radiographic practice. From the literature, we present examples of the potential clinical benefit such as audits of dose, and radiographic practice, as well as more advanced research highlighting the effects of radiographic practice, e.g. cancer detection rates affected by imaging technology.
This review considers the challenges in extracting data, namely,
• The search tools for data on most PACS are inadequate being both time-consuming and limited in elements that can be searched.
• Security and information governance considerations
• Anonymisation of data if required
• Data curation
The review describes some solutions that have been successfully implemented.
• Retrospective extraction: direct query on PACS
• Extracting data prospectively
• Use of structured reports
• Use of trusted research environments
Ultimately, the data access process will be made easier by inclusion during PACS procurement. Auditing data from PACS can be used to improve quality of imaging and workflow, all of which will be a clinical benefit to patients.
Affiliation(s)
| | | | - John Loveland
- NCCPM, Royal Surrey NHS Foundation Trust, Guildford, United Kingdom
| |
|
17
|
Yang J, Liu Y, Shang J, Chen Q, Chen Q, Ren L, Zhang N, Yu Y, Li Z, Song Y, Yang S, Scherer A, Tong W, Hong H, Xiao W, Shi L, Zheng Y. The Quartet Data Portal: integration of community-wide resources for multiomics quality control. Genome Biol 2023; 24:245. [PMID: 37884999 PMCID: PMC10601216 DOI: 10.1186/s13059-023-03091-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 11/08/2022] [Accepted: 10/17/2023] [Indexed: 10/28/2023] Open
Abstract
The Quartet Data Portal facilitates community access to well-characterized reference materials, reference datasets, and related resources established based on a family of four individuals with identical twins from the Quartet Project. Users can request DNA, RNA, protein, and metabolite reference materials, as well as datasets generated across omics, platforms, labs, protocols, and batches. Reproducible analysis tools allow for objective performance assessment of user-submitted data, while interactive visualization tools support rapid exploration of reference datasets. A closed-loop "distribution-collection-evaluation-integration" workflow enables updates and integration of community-contributed multiomics data. Ultimately, this portal helps promote the advancement of reference datasets and multiomics quality control.
Affiliation(s)
- Jingcheng Yang
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Greater Bay Area Institute of Precision Medicine, Guangzhou, Guangdong, China
| | - Yaqing Liu
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Jun Shang
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Qiaochu Chen
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Qingwang Chen
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Luyao Ren
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Naixin Zhang
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Ying Yu
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Zhihui Li
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Yueqiang Song
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Shengpeng Yang
- Intelligent Storage, Alibaba Cloud, Alibaba Group, Hangzhou, Zhejiang, China
| | - Andreas Scherer
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- EATRIS ERIC-European Infrastructure for Translational Medicine, Amsterdam, the Netherlands
| | - Weida Tong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
| | - Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
| | - Wenming Xiao
- Office of Oncological Diseases, Office of New Drugs, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, USA
| | - Leming Shi
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China.
- International Human Phenome Institutes (Shanghai), Shanghai, China.
| | - Yuanting Zheng
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China.
| |
|
18
|
Xue B, Khoroshevskyi O, Gomez RA, Sheffield NC. Opportunities and challenges in sharing and reusing genomic interval data. Front Genet 2023; 14:1155809. [PMID: 37020996 PMCID: PMC10067617 DOI: 10.3389/fgene.2023.1155809] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 01/31/2023] [Accepted: 03/07/2023] [Indexed: 03/22/2023] Open
Affiliation(s)
- Bingjie Xue
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA, United States
| | - Oleksandr Khoroshevskyi
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
| | - R. Ariel Gomez
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA, United States
| | - Nathan C. Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA, United States
- School of Data Science, University of Virginia, Charlottesville, VA, United States
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- *Correspondence: Nathan C. Sheffield
| |
|
19
|
Pokutnaya D, Childers B, Arcury-Quandt AE, Hochheiser H, Van Panhuis WG. An implementation framework to improve the transparency and reproducibility of computational models of infectious diseases. PLoS Comput Biol 2023; 19:e1010856. [PMID: 36928042 PMCID: PMC10019712 DOI: 10.1371/journal.pcbi.1010856] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 03/18/2023] Open
Abstract
Computational models of infectious diseases have become valuable tools for research and the public health response against epidemic threats. The reproducibility of computational models has been limited, undermining the scientific process and possibly trust in modeling results and related response strategies, such as vaccination. We translated published reproducibility guidelines from a wide range of scientific disciplines into an implementation framework for improving reproducibility of infectious disease computational models. The framework comprises 22 elements that should be described, grouped into 6 categories: computational environment, analytical software, model description, model implementation, data, and experimental protocol. The framework can be used by scientific communities to develop actionable tools for sharing computational models in a reproducible way.
Affiliation(s)
- Darya Pokutnaya
- University of Pittsburgh, Department of Epidemiology, Pittsburgh, Pennsylvania, United States of America
| | - Bruce Childers
- University of Pittsburgh, Department of Computer Science, Pittsburgh, Pennsylvania, United States of America
| | - Alice E. Arcury-Quandt
- University of Pittsburgh, Department of Epidemiology, Pittsburgh, Pennsylvania, United States of America
| | - Harry Hochheiser
- University of Pittsburgh, Department of Biomedical Informatics and Intelligent Systems Program, Pittsburgh, Pennsylvania, United States of America
| | - Willem G. Van Panhuis
- University of Pittsburgh, Department of Epidemiology, Pittsburgh, Pennsylvania, United States of America
- * E-mail:
| |
|
20
|
Tsueng G, Cano MAA, Bento J, Czech C, Kang M, Pache L, Rasmussen LV, Savidge TC, Starren J, Wu Q, Xin J, Yeaman MR, Zhou X, Su AI, Wu C, Brown L, Shabman RS, Hughes LD. Developing a standardized but extendable framework to increase the findability of infectious disease datasets. Sci Data 2023; 10:99. [PMID: 36823157 PMCID: PMC9950378 DOI: 10.1038/s41597-023-01968-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/12/2022] [Accepted: 01/13/2023] [Indexed: 02/25/2023] Open
Abstract
Biomedical datasets are increasing in size, are stored in many repositories, and face challenges in FAIRness (findability, accessibility, interoperability, reusability). As a Consortium of infectious disease researchers from 15 Centers, we aim to adopt open science practices to promote transparency, encourage reproducibility, and accelerate research advances through data reuse. To improve FAIRness of our datasets and computational tools, we evaluated metadata standards across established biomedical data repositories. The vast majority do not adhere to a single standard, such as Schema.org, which is widely adopted by generalist repositories. Consequently, datasets in these repositories are not findable in aggregation projects like Google Dataset Search. We addressed this gap by creating a reusable metadata schema based on Schema.org and catalogued nearly 400 datasets and computational tools we collected. The approach is easily reusable to create schemas that are interoperable with community standards but customized to a particular context. Our approach enabled data discovery, increased the reusability of datasets from a large research consortium, and accelerated research. Lastly, we discuss ongoing challenges with FAIRness beyond discoverability.
Affiliation(s)
- Ginger Tsueng
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA.
| | - Marco A Alvarado Cano
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - José Bento
- Department of Computer Science, Boston College, 245 Beacon St, Chestnut Hill, MA, 02467, USA
| | - Candice Czech
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Mengjia Kang
- Division of Pulmonary and Critical Care, Feinberg School of Medicine, Northwestern University, Chicago, IL, 60611, USA
| | - Lars Pache
- Infectious and Inflammatory Disease Center, Immunity and Pathogenesis Program, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, 92037, USA
| | - Luke V Rasmussen
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
| | - Tor C Savidge
- Texas Children's Microbiome Center & Department of Pathology & Immunology, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Justin Starren
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
| | - Qinglong Wu
- Texas Children's Microbiome Center & Department of Pathology & Immunology, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Jiwen Xin
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Michael R Yeaman
- Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Divisions of Molecular Medicine and Infectious Diseases, Harbor-UCLA Medical Center, Torrance, CA, 90502, USA
- Lundquist Institute for Infection & Immunity at Harbor-UCLA Medical Center, Torrance, CA, 90502, USA
| | - Xinghua Zhou
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Andrew I Su
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Scripps Research Translational Institute, La Jolla, CA, 92037, USA
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Chunlei Wu
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Scripps Research Translational Institute, La Jolla, CA, 92037, USA
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Liliana Brown
- Office of Genomics and Advanced Technologies, National Institute of Allergy and Infectious Diseases, Rockville, MD, 20852, USA
| | - Reed S Shabman
- Office of Genomics and Advanced Technologies, National Institute of Allergy and Infectious Diseases, Rockville, MD, 20852, USA
| | - Laura D Hughes
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA.
| |
|
21
|
Duong-Trung N, Born S, Kim JW, Schermeyer MT, Paulick K, Borisyak M, Cruz-Bournazou MN, Werner T, Scholz R, Schmidt-Thieme L, Neubauer P, Martinez E. When Bioprocess Engineering Meets Machine Learning: A Survey from the Perspective of Automated Bioprocess Development. Biochem Eng J 2022. [DOI: 10.1016/j.bej.2022.108764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]
|
22
|
Lubbock ALR, Lopez CF. Microbench: automated metadata management for systems biology benchmarking and reproducibility in Python. Bioinformatics 2022; 38:4823-4825. [PMID: 36000837 PMCID: PMC9563693 DOI: 10.1093/bioinformatics/btac580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2021] [Revised: 02/21/2022] [Accepted: 08/23/2022] [Indexed: 11/21/2022] Open
Abstract
Motivation: Computational systems biology analyses typically make use of multiple software packages and their dependencies, which are often run across heterogeneous compute environments. This can introduce differences in performance and reproducibility. Capturing metadata (e.g. package versions, GPU model) currently requires repetitious code, and the metadata are difficult to store centrally for analysis. Even where virtual environments and containers are used, updates over time mean that versioning metadata should still be captured within analysis pipelines to guarantee reproducibility. Results: Microbench is a simple and extensible Python package to automate metadata capture to a file or Redis database. Captured metadata can include execution time, software package versions, environment variables, hardware information, Python version and, with plugins, more. We present three case studies demonstrating Microbench usage to benchmark code execution and examine environment metadata for reproducibility purposes. Availability and implementation: Install from the Python Package Index using pip install microbench. Source code is available from https://github.com/alubbock/microbench. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alexander L R Lubbock
- Department of Biochemistry, Vanderbilt University, Nashville, TN 37232, USA
- Vanderbilt-Ingram Cancer Center, Vanderbilt University, Nashville, TN 37232, USA
| | - Carlos F Lopez
- Department of Biochemistry, Vanderbilt University, Nashville, TN 37232, USA
- Vanderbilt-Ingram Cancer Center, Vanderbilt University, Nashville, TN 37232, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
| |
|
23
|
Lortie CJ, Vargas Poulsen C, Brun J, Kui L. Tabular strategies for metadata in ecology, evolution, and the environmental sciences. Ecol Evol 2022; 12:e9245. [PMID: 36035265 PMCID: PMC9405493 DOI: 10.1002/ece3.9245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Accepted: 08/04/2022] [Indexed: 11/07/2022] Open
Abstract
Data support knowledge development and theory advances in ecology and evolution. We are increasingly reusing data within our teams and projects and through the global, openly archived datasets of others. Metadata can be challenging to write and interpret, but it is always crucial for reuse. The value of metadata cannot be overstated, even as a relatively independent research object, because it describes the work that has been done in a structured format. We advance a new perspective and classify methods for metadata curation and development with tables. Tables with templates can be effectively used to capture all components of an experiment or project in a single, easy-to-read file familiar to most scientists. If coupled with the R programming language, metadata from tables can then be rapidly and reproducibly converted to publication formats including extensible markup language files suitable for data repositories. Tables can also be used to summarize existing metadata and store metadata across many datasets. A case study is provided, and the added benefits of tables for metadata, a priori, are developed to ensure a more streamlined publishing process for many data repositories used in ecology, evolution, and the environmental sciences. In ecology and evolution, researchers are often highly tabular thinkers from experimental data collection in the lab and/or field, and representations of metadata as a table will provide novel research and reuse insights.
Affiliation(s)
- C. J. Lortie
- National Center for Ecological Analysis and Synthesis, UCSB, Santa Barbara, California, USA
- Department of Biology, York University, Toronto, Ontario, Canada
| | | | - Julien Brun
- National Center for Ecological Analysis and Synthesis, UCSB, Santa Barbara, California, USA
| | - Li Kui
- Marine Science Institute, UCSB, Santa Barbara, California, USA
| |
|
24
|
Cui X, Lu J, Han Y. A Novel Unified Data Modeling Method for Equipment Lifecycle Integrated Logistics Support. Sensors 2022; 22:s22114265. [PMID: 35684887 PMCID: PMC9185433 DOI: 10.3390/s22114265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2022] [Revised: 05/31/2022] [Accepted: 05/31/2022] [Indexed: 11/24/2022]
Abstract
Integrated logistics support (ILS) is of great significance for maintaining equipment operational capability in the whole lifecycle. Numerous segments and complex product objects exist in the process of equipment ILS, which gives ILS data multi-source, heterogeneous, and multidimensional characteristics. The present ILS data cannot satisfy the demand for efficient utilization. Therefore, the unified modeling of ILS data is extremely urgent and significant. In this paper, a unified data modeling method is proposed to solve the consistent and comprehensive expression problem of ILS data. Firstly, a four-tier unified data modeling framework is constructed based on the analysis of ILS data characteristics. Secondly, the Core unified data model, Domain unified data model, and Instantiated unified data model are built successively. Then, the expressions of ILS data in the three dimensions of time, product, and activity are analyzed. Thirdly, the Lifecycle ILS unified data model is constructed, and the multidimensional information retrieval methods are discussed. Based on these, different systems in the equipment ILS process can share a set of data models and provide ILS designers with relevant data through different views. Finally, the practical ILS data models are constructed based on the developed unified data modeling software prototype, which verifies the feasibility of the proposed method.
|
25
|
Alharbi E, Gadiya Y, Henderson D, Zaliani A, Delfin-Rossaro A, Cambon-Thomsen A, Kohler M, Witt G, Welter D, Juty N, Jay C, Engkvist O, Goble C, Reilly DS, Satagopam V, Ioannidis V, Gu W, Gribbon P. Selection of data sets for FAIRification in drug discovery and development: Which, why, and how? Drug Discov Today 2022; 27:2080-2085. [PMID: 35595012 PMCID: PMC9236643 DOI: 10.1016/j.drudis.2022.05.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2022] [Revised: 04/28/2022] [Accepted: 05/10/2022] [Indexed: 11/30/2022]
Abstract
Research organisations are focussed on quantifying the costs and benefits of implementing FAIR. Criteria used for the selection of data for FAIRification can be opaque and inconsistent. FAIRification effort depends on individual skills, competencies, resources, and time available. FAIRification should satisfy reuse scenarios, and lead to scientific and economic impacts. Organisational challenges include providing training to individuals and developing a FAIR organisation culture.
Despite the intuitive value of adopting the Findable, Accessible, Interoperable, and Reusable (FAIR) principles in both academic and industrial sectors, challenges exist in resourcing, balancing long- versus short-term priorities, and achieving technical implementation. This situation is exacerbated by the unclear mechanisms by which costs and benefits can be assessed when decisions on FAIR are made. Scientific and research and development (R&D) leadership need reliable evidence of the potential benefits and information on effective implementation mechanisms and remediating strategies. In this article, we describe procedures for cost–benefit evaluation, and identify best-practice approaches to support the decision-making process involved in FAIR implementation.
Affiliation(s)
- Ebtisam Alharbi
- Department of Computer Science, The University of Manchester, Oxford Road, Manchester, UK
| | - Yojana Gadiya
- Fraunhofer Institute for Translational Medicine and Pharmacology (ITMP), Schnackenburgallee 114, 22525 Hamburg, and Theodor Stern Kai 7, 60590 Frankfurt, Germany; Fraunhofer Cluster of Excellence for Immune Mediated Diseases (CIMD), Theodor Stern Kai 7, 60590 Frankfurt, Germany
| | - David Henderson
- Bayer AG, Research & Development, Pharmaceuticals, Müllerstrasse 178, 13353 Berlin, Germany
| | - Andrea Zaliani
- Fraunhofer Institute for Translational Medicine and Pharmacology (ITMP), Schnackenburgallee 114, 22525 Hamburg, and Theodor Stern Kai 7, 60590 Frankfurt, Germany; Fraunhofer Cluster of Excellence for Immune Mediated Diseases (CIMD), Theodor Stern Kai 7, 60590 Frankfurt, Germany
| | | | | | - Manfred Kohler
- Fraunhofer Institute for Translational Medicine and Pharmacology (ITMP), Schnackenburgallee 114, 22525 Hamburg, and Theodor Stern Kai 7, 60590 Frankfurt, Germany; Fraunhofer Cluster of Excellence for Immune Mediated Diseases (CIMD), Theodor Stern Kai 7, 60590 Frankfurt, Germany
| | - Gesa Witt
- Fraunhofer Institute for Translational Medicine and Pharmacology (ITMP), Schnackenburgallee 114, 22525 Hamburg, and Theodor Stern Kai 7, 60590 Frankfurt, Germany; Fraunhofer Cluster of Excellence for Immune Mediated Diseases (CIMD), Theodor Stern Kai 7, 60590 Frankfurt, Germany
| | - Danielle Welter
- Luxembourg Centre for Systems Biomedicine, ELIXIR Luxembourg, University of Luxembourg, L-4367 Belval, Luxembourg
| | - Nick Juty
- Department of Computer Science, The University of Manchester, Oxford Road, Manchester, UK
| | - Caroline Jay
- Department of Computer Science, The University of Manchester, Oxford Road, Manchester, UK
| | - Ola Engkvist
- Discovery Sciences, R&D, AstraZeneca, SE-43183 Mölndal, Sweden
| | - Carole Goble
- Department of Computer Science, The University of Manchester, Oxford Road, Manchester, UK
| | - Dorothy S Reilly
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Basel, Switzerland
| | - Venkata Satagopam
- Luxembourg Centre for Systems Biomedicine, ELIXIR Luxembourg, University of Luxembourg, L-4367 Belval, Luxembourg
| | - Vassilios Ioannidis
- SIB Swiss Institute of Bioinformatics, Quartier Sorge - Batiment Amphipole, 1015 Lausanne, Switzerland.
| | - Wei Gu
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Basel, Switzerland.
| | - Philip Gribbon
- Fraunhofer Institute for Translational Medicine and Pharmacology (ITMP), Schnackenburgallee 114, 22525 Hamburg, and Theodor Stern Kai 7, 60590 Frankfurt, Germany; Fraunhofer Cluster of Excellence for Immune Mediated Diseases (CIMD), Theodor Stern Kai 7, 60590 Frankfurt, Germany.
| |
|
26
|
Soiland-Reyes S, Bayarri G, Andrio P, Long R, Lowe D, Niewielska A, Hospital A, Groth P. Making Canonical Workflow Building Blocks Interoperable across Workflow Languages. Data Intelligence 2022. [DOI: 10.1162/dint_a_00135] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/04/2022] Open
Abstract
We introduce the concept of Canonical Workflow Building Blocks (CWBB), a methodology of describing and wrapping computational tools, in order for them to be utilised in a reproducible manner from multiple workflow languages and execution platforms. The concept is implemented and demonstrated with the BioExcel Building Blocks library (BioBB), a collection of tool wrappers in the field of computational biomolecular simulation. Interoperability across different workflow languages is showcased through a protein Molecular Dynamics setup transversal workflow, built using this library and run with 5 different Workflow Manager Systems (WfMS). We argue such practice is a necessary requirement for FAIR Computational Workflows and an element of Canonical Workflow Frameworks for Research (CWFR) in order to improve widespread adoption and reuse of computational methods across workflow language barriers.
Affiliation(s)
- Stian Soiland-Reyes
- Department of Computer Science, The University of Manchester, Manchester, Manchester M13 9PL, UK
- Informatics Institute, University of Amsterdam, Amsterdam 1000 GG, The Netherlands
| | - Genís Bayarri
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology (BIST), Barcelona 08028, Spain
| | - Pau Andrio
- The Spanish National Bioinformatics Institute (INB), Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain
| | - Robin Long
- Data Science Institute, Lancaster University, Lancaster, Lancashire LA1 4YW, UK
- Research IT, IT Services, The University of Manchester, Manchester, Manchester M13 9PL, UK
| | - Douglas Lowe
- Research IT, IT Services, The University of Manchester, Manchester, Manchester M13 9PL, UK
| | - Ania Niewielska
- European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire CB10 1SD, UK
| | - Adam Hospital
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology (BIST), Barcelona 08028, Spain
| | - Paul Groth
- Informatics Institute, University of Amsterdam, Amsterdam 1000 GG, The Netherlands
| |
|
27
|
The ATCC Genome Portal: Microbial Genome Reference Standards with Data Provenance. Microbiol Resour Announc 2021; 10:e0081821. [PMID: 34817215 PMCID: PMC8612085 DOI: 10.1128/mra.00818-21] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 11/29/2022] Open
Abstract
Lack of data provenance negatively impacts scientific reproducibility and the reliability of genomic data. The ATCC Genome Portal (https://genomes.atcc.org) addresses this by providing data provenance information for microbial whole-genome assemblies originating from authenticated biological materials. To date, we have sequenced 1,579 complete genomes, including 466 type strains and 1,156 novel genomes.
|