1
Ziemann M, Poulain P, Bora A. The five pillars of computational reproducibility: bioinformatics and beyond. Brief Bioinform 2023; 24:bbad375. PMID: 37870287. PMCID: PMC10591307. DOI: 10.1093/bib/bbad375.
Abstract
Computational reproducibility is a simple premise in theory, but is difficult to achieve in practice. Building upon past efforts and proposals to maximize reproducibility and rigor in bioinformatics, we present a framework called the five pillars of reproducible computational research. These include (1) literate programming, (2) code version control and sharing, (3) compute environment control, (4) persistent data sharing and (5) documentation. These practices will ensure that computational research work can be reproduced quickly and easily, long into the future. This guide is designed for bioinformatics data analysts and bioinformaticians in training, but should be relevant to other domains of study.
Affiliation(s)
- Mark Ziemann
- Deakin University, School of Life and Environmental Sciences, Geelong, Australia
- Burnet Institute, Melbourne, Australia
- Pierre Poulain
- Université Paris Cité, CNRS, Institut Jacques Monod, Paris, France
- Anusuiya Bora
- Deakin University, School of Life and Environmental Sciences, Geelong, Australia
2
Ziemski M, Adamov A, Kim L, Flörl L, Bokulich NA. Reproducible acquisition, management and meta-analysis of nucleotide sequence (meta)data using q2-fondue. Bioinformatics 2022; 38:5081-5091. PMID: 36130056. PMCID: PMC9665871. DOI: 10.1093/bioinformatics/btac639.
Abstract
MOTIVATION The volume of public nucleotide sequence data has blossomed over the past two decades and is ripe for re- and meta-analyses to enable novel discoveries. However, reproducible re-use and management of sequence datasets and associated metadata remain critical challenges. We created the open source Python package q2-fondue to enable user-friendly acquisition, re-use and management of public sequence (meta)data while adhering to open data principles. RESULTS q2-fondue allows fully provenance-tracked programmatic access to and management of data from the NCBI Sequence Read Archive (SRA). Unlike other packages allowing download of sequence data from the SRA, q2-fondue enables full data provenance tracking from data download to final visualization, integrates with the QIIME 2 ecosystem, prevents data loss upon space exhaustion and allows download of (meta)data given a publication library. To highlight its manifold capabilities, we present executable demonstrations using publicly available amplicon, whole genome and metagenome datasets. AVAILABILITY AND IMPLEMENTATION q2-fondue is available as an open-source BSD-3-licensed Python package at https://github.com/bokulich-lab/q2-fondue. Usage tutorials are available in the same repository. All Jupyter notebooks used in this article are available under https://github.com/bokulich-lab/q2-fondue-examples. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
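The full provenance tracking that this abstract highlights, with every downloaded accession tied to a verifiable record of what was fetched and how, can be sketched in plain Python. This is an illustrative stand-in using only the standard library, not the actual q2-fondue or QIIME 2 API; the function name, record fields and accession numbers are hypothetical.

```python
import hashlib
import json

def record_provenance(accession, raw_bytes, command, log):
    """Append a provenance record for one downloaded accession.

    Stores a SHA-256 checksum of the raw payload alongside the exact
    command that produced it, so a later re-download can be verified
    against the original.
    """
    log.append({
        "accession": accession,
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "command": command,
    })
    return log

# Two fake payloads standing in for downloaded sequence files.
log = []
record_provenance("SRR0000001", b"ACGT" * 10, "fetch SRR0000001", log)
record_provenance("SRR0000002", b"TTGA" * 10, "fetch SRR0000002", log)
print(json.dumps(log, indent=2))
```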
Affiliation(s)
- Lina Kim
- Laboratory of Food Systems Biotechnology, Institute of Food, Nutrition, and Health, ETH Zürich, Zürich 8092, Switzerland
- Lena Flörl
- Laboratory of Food Systems Biotechnology, Institute of Food, Nutrition, and Health, ETH Zürich, Zürich 8092, Switzerland
3
Garzón W, Benavides L, Gaignard A, Redon R, Südholt M. A taxonomy of tools and approaches for distributed genomic analyses. Informatics in Medicine Unlocked 2022. DOI: 10.1016/j.imu.2022.101024.
4
Mammoliti A, Smirnov P, Nakano M, Safikhani Z, Eeles C, Seo H, Nair SK, Mer AS, Smith I, Ho C, Beri G, Kusko R, Lin E, Yu Y, Martin S, Hafner M, Haibe-Kains B; Massive Analysis Quality Control (MAQC) Society Board of Directors. Orchestrating and sharing large multimodal data for transparent and reproducible research. Nat Commun 2021; 12:5797. PMID: 34608132. DOI: 10.1038/s41467-021-25974-w.
Abstract
Reproducibility is essential to open science: findings that cannot be reproduced by independent research groups have limited relevance, regardless of their validity. It is therefore crucial for scientists to describe their experiments in sufficient detail so they can be reproduced, scrutinized, challenged, and built upon. However, the intrinsic complexity and continuous growth of biomedical data make it increasingly difficult to process, analyze, and share with the community in a FAIR (findable, accessible, interoperable, and reusable) manner. To overcome these issues, we created a cloud-based platform called ORCESTRA (orcestra.ca), which provides a flexible framework for the reproducible processing of multimodal biomedical data. It enables processing of clinical, genomic and perturbation profiles of cancer samples through automated, user-customizable processing pipelines. ORCESTRA creates integrated and fully documented data objects with persistent identifiers (DOIs) and manages multiple dataset versions, which can be shared for future studies.
5
Ma L, Peterson EA, Shin IJ, Muesse J, Marino K, Steliga MA, Johann DJ. NPARS-A Novel Approach to Address Accuracy and Reproducibility in Genomic Data Science. Front Big Data 2021; 4:725095. PMID: 34647017. PMCID: PMC8503682. DOI: 10.3389/fdata.2021.725095.
Abstract
Background: Accuracy and reproducibility are vital in science and present a significant challenge in the emerging discipline of data science, especially when the data are scientifically complex and massive in size. Further complicating matters, in genomic-based science high-throughput sequencing technologies generate considerable amounts of data that need to be stored, manipulated, and analyzed using a plethora of software tools, and researchers are rarely able to reproduce published genomic studies. Results: Presented is a novel approach that facilitates accuracy and reproducibility for large genomic research data sets. All data needed are loaded into a portable local database, which serves as an interface for well-known software frameworks, including Python-based Jupyter Notebooks, RStudio projects and R Markdown. All software is encapsulated using Docker containers and managed by Git, simplifying software configuration management. Conclusion: Accuracy and reproducibility in science are of paramount importance. For the biomedical sciences, advances in high-throughput technologies, molecular biology and quantitative methods are providing unprecedented insights into disease mechanisms. With these insights comes the associated challenge of scientific data that are complex and massive in size, which makes collaboration, verification, validation, and reproducibility of findings difficult. To address these challenges, the NGS post-pipeline accuracy and reproducibility system (NPARS) was developed. NPARS is a robust software infrastructure and methodology that can encapsulate data, code, and reporting for large genomic studies. This paper demonstrates the successful use of NPARS on large and complex genomic data sets across different computational platforms.
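The portable-local-database idea in this abstract can be sketched with Python's built-in sqlite3 module: one database file holds analysis-ready tables that every analysis session queries, instead of each analyst re-parsing raw files. The table and column names below are illustrative only, not the actual NPARS schema.

```python
import sqlite3

# A single portable database file (":memory:" here; a .db path in
# practice) serves as the shared interface for downstream analyses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variants (sample TEXT, gene TEXT, vaf REAL)")
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?)",
    [("S1", "TP53", 0.42), ("S1", "KRAS", 0.11), ("S2", "TP53", 0.35)],
)
conn.commit()

# Analysis code queries the frozen snapshot rather than raw files,
# so every analyst sees exactly the same data.
rows = conn.execute(
    "SELECT gene, COUNT(*) FROM variants GROUP BY gene ORDER BY gene"
).fetchall()
print(rows)  # [('KRAS', 1), ('TP53', 2)]
```

Because the database is a single file, it can be version-controlled or containerized alongside the analysis code, which is the reproducibility property the abstract emphasizes.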
Affiliation(s)
- Li Ma
- Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Department of Information Science, University of Arkansas at Little Rock, Little Rock, AR, United States
- Erich A. Peterson
- Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Ik Jae Shin
- Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Jason Muesse
- Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Katy Marino
- Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Matthew A. Steliga
- Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Donald J. Johann
- Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, AR, United States
6
Charbonnier B, Hadida M, Marchat D. Additive manufacturing pertaining to bone: Hopes, reality and future challenges for clinical applications. Acta Biomater 2021; 121:1-28. PMID: 33271354. DOI: 10.1016/j.actbio.2020.11.039.
Abstract
For the past 20 years, the democratization of additive manufacturing (AM) technologies has made many of us dream of low-cost, waste-free, on-demand production of functional parts; fully customized tools; designs limited only by imagination; and more. As every patient is unique, the potential of AM for the medical field is thought to be considerable: AM would allow the delivery of dedicated patient-specific healthcare solutions entirely adapted to the patients' clinical needs. Pertinently, this review offers an extensive overview of bone-related clinical applications of AM and ongoing research trends, from 3D anatomical models for patient and student education to ephemeral structures supporting and promoting bone regeneration. Today, AM has undoubtedly improved patient care and should facilitate many more improvements in the near future. However, despite extensive research, AM-based strategies for bone regeneration remain the only bone-related field without compelling clinical proof of concept to date. This may be due to a lack of understanding of the biological mechanisms guiding and promoting bone formation, and to the traditional top-down strategies devised to solve clinical issues. Indeed, the integrated holistic approach recommended for the design of regenerative systems (i.e., fixation systems and scaffolds) has remained at the conceptual stage. Challenged by these issues, a slower but incremental research dynamic has taken hold over the last few years, and recent progress suggests notable improvement in the years to come, with a view to developing safe, robust and standardized patient-specific clinical solutions for the regeneration of large bone defects.
7
Lenz S, Farhadyar K, Hess M, Binder H, Knaus J. Encoding Medical Experiments in Jupyter Notebooks. Systems Medicine 2021. DOI: 10.1016/b978-0-12-801238-3.11687-3.
8
Demin KA, Lakstygal AM, Volgin AD, de Abreu MS, Genario R, Alpyshov ET, Serikuly N, Wang D, Wang J, Yan D, Wang M, Yang L, Hu G, Bytov M, Zabegalov KN, Zhdanov A, Harvey BH, Costa F, Rosemberg DB, Leonard BE, Fontana BD, Cleal M, Parker MO, Wang J, Song C, Amstislavskaya TG, Kalueff AV. Cross-species Analyses of Intra-species Behavioral Differences in Mammals and Fish. Neuroscience 2020; 429:33-45. DOI: 10.1016/j.neuroscience.2019.12.035.
9
Brancato V, Cavaliere C, Salvatore M, Monti S. Non-Gaussian models of diffusion weighted imaging for detection and characterization of prostate cancer: a systematic review and meta-analysis. Sci Rep 2019; 9:16837. PMID: 31728007. PMCID: PMC6856159. DOI: 10.1038/s41598-019-53350-8.
Abstract
The importance of Diffusion Weighted Imaging (DWI) in prostate cancer (PCa) diagnosis has been widely addressed in the literature. In the last decade, owing to the limitations of the mono-exponential model, several studies investigated non-Gaussian DWI models and their utility in PCa diagnosis. Since their results were often inconsistent and conflicting, we performed a systematic review of studies from 2012 onward examining the most commonly used non-Gaussian DWI models for PCa detection and characterization. A meta-analysis was conducted to assess the ability of each non-Gaussian model to detect PCa lesions and to distinguish between low and intermediate/high grade lesions. Weighted mean differences and 95% confidence intervals were calculated, and heterogeneity was estimated using the I2 statistic. Twenty-nine studies were selected for the systematic review; their results were inconsistent and left unclear the actual usefulness and added value of the non-Gaussian model parameters. Twelve studies were included in the meta-analyses, which showed statistical significance for several non-Gaussian parameters for PCa detection, and to a lesser extent for PCa characterization. Our findings show that non-Gaussian model parameters may play a role in the detection and characterization of PCa, but further studies are required to identify a standardized DWI acquisition protocol for PCa diagnosis.
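The I2 statistic used in the meta-analysis follows the standard Higgins-Thompson definition, I2 = max(0, (Q - df)/Q) x 100, where Q is Cochran's weighted heterogeneity statistic. A minimal sketch, with hypothetical effect sizes and variances rather than data from the reviewed studies:

```python
def i_squared(effects, variances):
    """Cochran's Q and the I^2 heterogeneity statistic.

    effects: per-study effect sizes (e.g. weighted mean differences);
    variances: their sampling variances (inverse-variance weighting).
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, 100.0 * (q - df) / q) if q > 0 else 0.0
    return q, i2

# Three hypothetical studies: moderate spread in effects yields
# substantial heterogeneity.
q, i2 = i_squared([0.30, 0.45, 0.90], [0.01, 0.02, 0.05])
print(f"Q = {q:.2f}, I^2 = {i2:.1f}%")
```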
10
Abstract
With the advent of big data and database-driven research, the need for reproducible methods has become especially relevant. Given the rise of evidence-based practice, it is crucial to ensure that findings making use of big data can be consistently replicated by other physician-scientists. A call for transparency and reproducibility must occur at the individual, institutional, and national levels. Given the rising popularity of national and large databases in research, the responsibility of authors to ensure reproducibility of clinical research merits renewed discussion. In this article, the authors offer strategies to increase clinical research reproducibility at both the individual and institutional levels, within the context of plastic surgery.
11
Ling A, Hay EH, Aggrey SE, Rekaya R. A Bayesian approach for analysis of ordered categorical responses subject to misclassification. PLoS One 2018; 13:e0208433. PMID: 30543662. PMCID: PMC6292639. DOI: 10.1371/journal.pone.0208433.
Abstract
Ordinal categorical responses are frequently collected in survey studies, human medicine, and animal and plant improvement programs, to mention just a few settings. Errors in this type of data are neither rare nor easy to detect. They tend to bias inference, reduce statistical power and ultimately degrade the efficiency of the decision-making process. Contrary to the binary situation, where misclassification occurs between two response classes, noise in ordinal categorical data is more complex due to the increased number of categories and the diversity and asymmetry of errors. Although several approaches have been presented for dealing with misclassification in binary data, only limited practical methods have been proposed for analyzing noisy categorical responses. A latent variable model implemented within a Bayesian framework was proposed to analyze ordinal categorical data subject to misclassification, using simulated and real datasets. The simulated scenario consisted of a discrete response with three categories and a symmetric error rate of 5% between any two classes. The real data consisted of calving ease records of beef cows. In both the real and simulated data, ignoring misclassification resulted in substantial bias in the estimation of genetic parameters and reduced accuracy of predicted breeding values. Using our proposed approach, a significant reduction in bias and an increase in accuracy ranging from 11% to 17% were observed. Furthermore, most of the misclassified observations in the simulated data were identified with substantially higher probability. Similar results were observed for a scenario with asymmetric misclassification. While the extension to traits with more categories is straightforward, it could be computationally costly. For traits with high heritability, the performance of the methodology would be expected to improve.
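The simulated scenario above, three categories with a symmetric 5% error rate between any two classes, corresponds to a simple misclassification matrix, and its biasing effect on observed category frequencies can be verified in a few lines. The true frequencies here are hypothetical, chosen only to illustrate the distortion:

```python
# True category frequencies (hypothetical) and a symmetric 5%
# misclassification rate between any two of three classes.
true_freq = [0.6, 0.3, 0.1]
eps = 0.05
# Row i gives P(observed class j | true class i); each class keeps
# 1 - 2*eps of its mass and leaks eps to each of the other two.
M = [[1 - 2 * eps if i == j else eps for j in range(3)] for i in range(3)]

observed = [sum(true_freq[i] * M[i][j] for i in range(3)) for j in range(3)]
print([round(p, 3) for p in observed])  # [0.56, 0.305, 0.135]
```

Even this modest error rate visibly flattens the observed distribution toward the rarer classes, which is exactly the bias a latent-variable correction is designed to undo.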
Affiliation(s)
- Ashley Ling
- Department of Animal and Dairy Science, University of Georgia, Athens, Georgia, United States of America
- El Hamidi Hay
- USDA Agricultural Research Service, Fort Keogh Livestock and Range Research Laboratory, Miles City, Montana, United States of America
- Samuel E. Aggrey
- Institute of Bioinformatics, University of Georgia, Athens, Georgia, United States of America
- Department of Poultry Science, University of Georgia, Athens, Georgia, United States of America
- Romdhane Rekaya
- Department of Animal and Dairy Science, University of Georgia, Athens, Georgia, United States of America
- Institute of Bioinformatics, University of Georgia, Athens, Georgia, United States of America
- Department of Statistics, University of Georgia, Athens, Georgia, United States of America
12
Kilicoglu H. Biomedical text mining for research rigor and integrity: tasks, challenges, directions. Brief Bioinform 2018; 19:1400-1414. PMID: 28633401. PMCID: PMC6291799. DOI: 10.1093/bib/bbx057.
Abstract
An estimated quarter of a trillion US dollars is invested in the biomedical research enterprise annually. There is growing alarm that a significant portion of this investment is wasted because of problems in reproducibility of research findings and in the rigor and integrity of research conduct and reporting. Recent years have seen a flurry of activities focusing on standardization and guideline development to enhance the reproducibility and rigor of biomedical research. Research activity is primarily communicated via textual artifacts, ranging from grant applications to journal publications. These artifacts can be both the source and the manifestation of practices leading to research waste. For example, an article may describe a poorly designed experiment, or the authors may reach conclusions not supported by the evidence presented. In this article, we pose the question of whether biomedical text mining techniques can assist the stakeholders in the biomedical research enterprise in doing their part toward enhancing research integrity and rigor. In particular, we identify four key areas in which text mining techniques can make a significant contribution: plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload and accurate citation/enhanced bibliometrics. We review the existing methods and tools for specific tasks, if they exist, or discuss relevant research that can provide guidance for future work. With the exponential increase in biomedical research output and the ability of text mining approaches to perform automatic tasks at large scale, we propose that such approaches can support tools that promote responsible research practices, providing significant benefits for the biomedical research enterprise.
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, US National Library of Medicine
13
Finak G, Mayer B, Fulp W, Obrecht P, Sato A, Chung E, Holman D, Gottardo R. DataPackageR: Reproducible data preprocessing, standardization and sharing using R/Bioconductor for collaborative data analysis. Gates Open Res 2018; 2:31. PMID: 30234197. PMCID: PMC6139382. DOI: 10.12688/gatesopenres.12832.2.
Abstract
A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows, and scientific journals have increasingly demanded that code and primary data be made available with publications. There has been little practical advice on implementing reproducible research work-flows for large 'omics' or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure all analysts use the same version of a data set for their analyses. Yet instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst's usual work-flow. Ideally a reproducible research work-flow should fit naturally into an individual's existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control and R, specifically R's package system combined with a new tool, DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures packaged data objects are properly documented, performs checksum verification of these objects along with basic package version management, and, importantly, leaves a record of data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.
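The checksum verification described above, storing a digest of each packaged data object so that later loads can detect silent changes, can be sketched in Python (DataPackageR itself implements this in R; the dataset contents and function name here are hypothetical):

```python
import hashlib
import pickle

def fingerprint(obj):
    """SHA-256 digest of a deterministic serialization of a data object."""
    return hashlib.sha256(pickle.dumps(obj, protocol=4)).hexdigest()

# At packaging time: store the checksum alongside the data object.
dataset = {"assay": "ICS", "values": [1.2, 3.4, 5.6]}
stored = fingerprint(dataset)

# At load time: recompute and compare. A mismatch means the packaged
# object no longer matches what the processing code produced.
assert fingerprint(dataset) == stored

# Any modification to the object is caught.
dataset["values"].append(9.9)
assert fingerprint(dataset) != stored
print("checksum verification ok")
```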
Affiliation(s)
- Greg Finak
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Vaccine Immunology Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Statistical Center For HIV AIDS Research and Prevention, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
- Bryan Mayer
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Vaccine Immunology Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Statistical Center For HIV AIDS Research and Prevention, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
- William Fulp
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Vaccine Immunology Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Statistical Center For HIV AIDS Research and Prevention, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
- Paul Obrecht
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Vaccine Immunology Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Statistical Center For HIV AIDS Research and Prevention, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
- Alicia Sato
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Vaccine Immunology Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Statistical Center For HIV AIDS Research and Prevention, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
- Eva Chung
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Vaccine Immunology Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Statistical Center For HIV AIDS Research and Prevention, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
- Drienna Holman
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Vaccine Immunology Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Statistical Center For HIV AIDS Research and Prevention, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
- Raphael Gottardo
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Vaccine Immunology Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA; Statistical Center For HIV AIDS Research and Prevention, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
14
Finak G, Mayer B, Fulp W, Obrecht P, Sato A, Chung E, Holman D, Gottardo R. DataPackageR: Reproducible data preprocessing, standardization and sharing using R/Bioconductor for collaborative data analysis. Gates Open Res 2018. DOI: 10.12688/gatesopenres.12832.1.
15
Spidlen J, Brinkman RR. Use FlowRepository to share your clinical data upon study publication. Cytometry B Clin Cytom 2016; 94:196-198. PMID: 27342384. DOI: 10.1002/cyto.b.21393.
Abstract
A fundamental tenet of scientific research is that published results, including underlying data, should be open to independent validation and refutation. Data sharing encourages collaboration, facilitates quality and reduces redundancy in data production. Authors submitting manuscripts to several journals have already adopted the habit of sharing their underlying flow cytometry data by deposition to FlowRepository, a data repository jointly supported by the International Society for Advancement of Cytometry, the International Clinical Cytometry Society and the European Society for Clinical Cell Analysis. De-identification is required for publishing data from clinical studies, and we discuss ways to satisfy data sharing requirements and patient privacy requirements simultaneously. Scientific communities in the fields of microarray, proteomics and sequencing have been benefiting from reuse and re-exploration of data in public repositories for over a decade. We believe it is time that clinicians follow suit and that de-identified clinical data also become routinely available along with published cytometry-based findings. © 2016 International Clinical Cytometry Society.
Affiliation(s)
- Josef Spidlen
- Terry Fox Laboratory, BC Cancer Agency, Vancouver, British Columbia, Canada
- Ryan R Brinkman
- Terry Fox Laboratory, BC Cancer Agency, Vancouver, British Columbia, Canada; Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
16
Wilke S, Groenveld D, Grittner U, List J, Flöel A. cSPider - Evaluation of a Free and Open-Source Automated Tool to Analyze Corticomotor Silent Period. PLoS One 2016; 11:e0156066. [PMID: 27249017 PMCID: PMC4889140 DOI: 10.1371/journal.pone.0156066] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2015] [Accepted: 05/09/2016] [Indexed: 11/19/2022] Open
Abstract
Background The corticomotor silent period (CSP), as assessed noninvasively by transcranial magnetic stimulation (TMS) of the primary motor cortex, has been found to reflect intracortical inhibitory mechanisms. Analysis of the CSP is mostly conducted manually. However, this approach is time-consuming, and comparison of results from different laboratories may be compromised by inter-rater variability in analysis. No open-source program for automated analysis is currently available. Methods/Results Here, we describe cross-validation of cSPider, an in-house automated tool to assess the CSP, against manual analysis. We found high inter-method reliability between automated and manual analysis (p<0.001) and significantly reduced analysis time (median = 10.3 sec for automated analysis of 10 CSPs vs. median = 270 sec for manual analysis of 10 CSPs). cSPider can be downloaded free of charge. Conclusion cSPider allows automated analysis of the CSP in a reliable and time-efficient manner. Use of this open-source tool may help to improve comparison of data from different laboratories.
Affiliations:
- Skadi Wilke: Department of Neurology, Charité-Universitätsmedizin Berlin, Berlin, Germany
- Dennis Groenveld: Department of Neurology, Charité-Universitätsmedizin Berlin, Berlin, Germany; Department of Biomedical Engineering, University of Twente, Enschede, Netherlands
- Ulrike Grittner: Center for Stroke Research Berlin, Charité-Universitätsmedizin Berlin, Berlin, Germany; Department for Biostatistics and Clinical Epidemiology, Charité-Universitätsmedizin Berlin, Berlin, Germany
- Jonathan List: Department of Neurology, Charité-Universitätsmedizin Berlin, Berlin, Germany
- Agnes Flöel: Department of Neurology, Charité-Universitätsmedizin Berlin, Berlin, Germany; Center for Stroke Research Berlin, Charité-Universitätsmedizin Berlin, Berlin, Germany; NeuroCure Cluster of Excellence, Charité-Universitätsmedizin Berlin, Berlin, Germany
17. Stitz H, Luger S, Streit M, Gehlenborg N. AVOCADO: Visualization of Workflow-Derived Data Provenance for Reproducible Biomedical Research. Comput Graph Forum 2016; 35:481-490. [PMID: 29973745] [PMCID: PMC6027754] [DOI: 10.1111/cgf.12924]
Abstract
A major challenge in data-driven biomedical research lies in the collection and representation of data provenance information to ensure that findings are reproducible. In order to communicate and reproduce multi-step analysis workflows executed on datasets that contain data for dozens or hundreds of samples, it is crucial to be able to visualize the provenance graph at different levels of aggregation. Most existing approaches are based on node-link diagrams, which do not scale to the complexity of typical data provenance graphs. In our proposed approach, we reduce the complexity of the graph using hierarchical and motif-based aggregation. Based on user action and graph attributes, a modular degree-of-interest (DoI) function is applied to expand parts of the graph that are relevant to the user. This interest-driven adaptive approach to provenance visualization allows users to review and communicate complex multi-step analyses, which can be based on hundreds of files that are processed by numerous workflows. We have integrated our approach into an analysis platform that captures extensive data provenance information, and demonstrate its effectiveness by means of a biomedical usage scenario.
Affiliations:
- H. Stitz: Johannes Kepler University Linz, Austria
- S. Luger: Johannes Kepler University Linz, Austria
- M. Streit: Johannes Kepler University Linz, Austria
- N. Gehlenborg: Harvard Medical School, United States of America
18. Zheng K, Vydiswaran VGV, Liu Y, Wang Y, Stubbs A, Uzuner Ö, Gururaj AE, Bayer S, Aberdeen J, Rumshisky A, Pakhomov S, Liu H, Xu H. Ease of adoption of clinical natural language processing software: An evaluation of five systems. J Biomed Inform 2015; 58 Suppl:S189-S196. [PMID: 26210361] [PMCID: PMC4974203] [DOI: 10.1016/j.jbi.2015.07.008]
Abstract
OBJECTIVE In recognition of potential barriers that may inhibit the widespread adoption of biomedical software, the 2014 i2b2 Challenge introduced a special track, Track 3 - Software Usability Assessment, in order to develop a better understanding of the adoption issues that might be associated with state-of-the-art clinical NLP systems. This paper reports the ease-of-adoption assessment methods we developed for this track, and the results of evaluating five clinical NLP system submissions. MATERIALS AND METHODS A team of human evaluators performed a series of scripted adoptability test tasks with each of the participating systems. The evaluation team consisted of four "expert evaluators" with training in computer science, and eight "end user evaluators" with mixed backgrounds in medicine, nursing, pharmacy, and health informatics. We assessed how easy it is to adopt the submitted systems along three dimensions: communication effectiveness (i.e., how effectively a system communicates its designed objectives to the intended audience), effort required to install, and effort required to use. We used a formal software usability testing tool, TURF, to record the evaluators' interactions with the systems and 'think-aloud' data revealing their thought processes when installing and using the systems and when resolving unexpected issues. RESULTS Overall, the ease-of-adoption ratings that the five systems received are unsatisfactory. Installation of some of the systems proved to be rather difficult, and some systems failed to adequately communicate their designed objectives to intended adopters. Further, the average ratings provided by the end user evaluators on ease of use and ease of interpreting output are -0.35 and -0.53, respectively, indicating that this group of users generally deemed the systems extremely difficult to work with. While the ratings provided by the expert evaluators are higher, 0.6 and 0.45, respectively, these ratings are still low, indicating that they also experienced considerable struggles. DISCUSSION The results of the Track 3 evaluation show that the adoptability of the five participating clinical NLP systems has a great margin for improvement. Remedy strategies suggested by the evaluators included (1) more detailed and operating-system-specific usage instructions; (2) provision of more pertinent onscreen feedback for easier diagnosis of problems; (3) including screen walk-throughs in usage instructions so users know what to expect and what might have gone wrong; (4) avoiding jargon and acronyms in materials intended for end users; and (5) packaging prerequisites within software distributions so that prospective adopters do not have to obtain each third-party component on their own.
Affiliations:
- Kai Zheng: Department of Health Management and Policy, School of Public Health, University of Michigan, Ann Arbor, MI, USA; School of Information, University of Michigan, Ann Arbor, MI, USA
- V G Vinod Vydiswaran: Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, MI, USA
- Yang Liu: School of Information, University of Michigan, Ann Arbor, MI, USA
- Yue Wang: Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA
- Amber Stubbs: School of Library and Information Science, Simmons College, Boston, MA, USA
- Özlem Uzuner: Department of Information Studies, University at Albany, SUNY, Albany, NY, USA
- Anupama E Gururaj: The University of Texas School of Biomedical Informatics at Houston, Houston, TX, USA
- Anna Rumshisky: Department of Computer Science, University of Massachusetts, Lowell, MA, USA
- Hongfang Liu: Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Hua Xu: The University of Texas School of Biomedical Informatics at Houston, Houston, TX, USA
19. Stegmayer G, Pividori M, Milone DH. A very simple and fast way to access and validate algorithms in reproducible research. Brief Bioinform 2015. [DOI: 10.1093/bib/bbv054]
20. Lawrence TJ, Kauffman KT, Amrine KCH, Carper DL, Lee RS, Becich PJ, Canales CJ, Ardell DH. FAST: FAST Analysis of Sequences Toolbox. Front Genet 2015; 6:172. [PMID: 26042145] [PMCID: PMC4437040] [DOI: 10.3389/fgene.2015.00172]
Abstract
FAST (FAST Analysis of Sequences Toolbox) provides simple, powerful open source command-line tools to filter, transform, annotate and analyze biological sequence data. Modeled after the GNU (GNU's Not Unix) Textutils such as grep, cut, and tr, FAST tools such as fasgrep, fascut, and fastr make it easy to rapidly prototype expressive bioinformatic workflows in a compact and generic command vocabulary. Compact combinatorial encoding of data workflows with FAST commands can simplify the documentation and reproducibility of bioinformatic protocols, supporting better transparency in biological data science. Interface self-consistency and conformity with conventions of GNU, Matlab, Perl, BioPerl, R, and GenBank help make FAST easy and rewarding to learn. FAST automates numerical, taxonomic, and text-based sorting, selection and transformation of sequence records and alignment sites based on content, index ranges, descriptive tags, annotated features, and in-line calculated analytics, including composition and codon usage. Automated content- and feature-based extraction of sites and support for molecular population genetic statistics make FAST useful for molecular evolutionary analysis. FAST is portable, easy to install and secure thanks to the relative maturity of its Perl and BioPerl foundations, with stable releases posted to CPAN. Development as well as a publicly accessible Cookbook and Wiki are available on the FAST GitHub repository at https://github.com/tlawrence3/FAST. The default data exchange format in FAST is Multi-FastA (specifically, a restriction of BioPerl FastA format). Sanger and Illumina 1.8+ FastQ formatted files are also supported. FAST makes it easier for non-programmer biologists to interactively investigate and control biological data at the speed of thought.
Affiliations:
- Travis J Lawrence: Quantitative and Systems Biology Program, University of California, Merced, Merced, CA, USA
- Kyle T Kauffman: Molecular Cell Biology Unit, School of Natural Sciences, University of California, Merced, Merced, CA, USA
- Katherine C H Amrine: Quantitative and Systems Biology Program, University of California, Merced, Merced, CA, USA; Department of Viticulture and Enology, University of California, Davis, Davis, CA, USA
- Dana L Carper: Quantitative and Systems Biology Program, University of California, Merced, Merced, CA, USA
- Raymond S Lee: School of Engineering, University of California, Merced, Merced, CA, USA
- Peter J Becich: Molecular Cell Biology Unit, School of Natural Sciences, University of California, Merced, Merced, CA, USA
- Claudia J Canales: School of Engineering, University of California, Merced, Merced, CA, USA
- David H Ardell: Quantitative and Systems Biology Program, University of California, Merced, Merced, CA, USA; Molecular Cell Biology Unit, School of Natural Sciences, University of California, Merced, Merced, CA, USA
21. Beck TN, Chikwem AJ, Solanki NR, Golemis EA. Bioinformatic approaches to augment study of epithelial-to-mesenchymal transition in lung cancer. Physiol Genomics 2014; 46:699-724. [PMID: 25096367] [PMCID: PMC4187119] [DOI: 10.1152/physiolgenomics.00062.2014]
Abstract
Bioinformatic approaches are intended to provide systems level insight into the complex biological processes that underlie serious diseases such as cancer. In this review we describe current bioinformatic resources, and illustrate how they have been used to study a clinically important example: epithelial-to-mesenchymal transition (EMT) in lung cancer. Lung cancer is the leading cause of cancer-related deaths and is often diagnosed at advanced stages, leading to limited therapeutic success. While EMT is essential during development and wound healing, pathological reactivation of this program by cancer cells contributes to metastasis and drug resistance, both major causes of death from lung cancer. Challenges of studying EMT include its transient nature, its molecular and phenotypic heterogeneity, and the complicated networks of rewired signaling cascades. Given the biology of lung cancer and the role of EMT, it is critical to better align the two in order to advance the impact of precision oncology. This task relies heavily on the application of bioinformatic resources. Besides summarizing recent work in this area, we use four EMT-associated genes, TGF-β (TGFB1), NEDD9/HEF1, β-catenin (CTNNB1) and E-cadherin (CDH1), as exemplars to demonstrate the current capacities and limitations of probing bioinformatic resources to inform hypothesis-driven studies with therapeutic goals.
Affiliations:
- Tim N Beck: Developmental Therapeutics Program, Fox Chase Cancer Center, Philadelphia, Pennsylvania; Program in Molecular and Cell Biology and Genetics, Drexel University College of Medicine, Philadelphia, Pennsylvania
- Adaeze J Chikwem: Developmental Therapeutics Program, Fox Chase Cancer Center, Philadelphia, Pennsylvania; Temple University School of Medicine, Philadelphia, Pennsylvania
- Nehal R Solanki: Immune Cell Development and Host Defense Program, Fox Chase Cancer Center, Philadelphia, Pennsylvania; Program in Microbiology and Immunology, Drexel University College of Medicine, Philadelphia, Pennsylvania
- Erica A Golemis: Developmental Therapeutics Program, Fox Chase Cancer Center, Philadelphia, Pennsylvania; Temple University School of Medicine, Philadelphia, Pennsylvania; Program in Molecular and Cell Biology and Genetics, Drexel University College of Medicine, Philadelphia, Pennsylvania
22. Atmanspacher H, Bezzola Lambert L, Folkers G, Schubiger PA. Relevance relations for the concept of reproducibility. J R Soc Interface 2014; 11:20131030. [PMID: 24554574] [PMCID: PMC3973355] [DOI: 10.1098/rsif.2013.1030]
Abstract
The concept of reproducibility is widely considered a cornerstone of scientific methodology. However, recent problems with the reproducibility of empirical results in large-scale systems and in biomedical research have cast doubts on its universal and rigid applicability beyond the so-called basic sciences. Reproducibility is a particularly difficult issue in interdisciplinary work where the results to be reproduced typically refer to different levels of description of the system considered. In such cases, it is mandatory to distinguish between more and less relevant features, attributes or observables of the system, depending on the level at which they are described. For this reason, we propose a scheme for a general 'relation of relevance' between the level of complexity at which a system is considered and the granularity of its description. This relation implies relevance criteria for particular selected aspects of a system and its description, which can be operationally implemented by an interlevel relation called 'contextual emergence'. It yields a formally sound and empirically applicable procedure to translate between descriptive levels and thus construct level-specific criteria for reproducibility in an overall consistent fashion. Relevance relations merged with contextual emergence challenge the old idea of one fundamental ontology from which everything else derives. At the same time, our proposal is specific enough to resist the backlash into a relativist patchwork of unconnected model fragments.
Affiliations:
- H. Atmanspacher: Collegium Helveticum, Zurich, Switzerland; Institute for Frontier Areas of Psychology, Freiburg, Germany
- L. Bezzola Lambert: Collegium Helveticum, Zurich, Switzerland; Department of English, University of Basel, Switzerland
- G. Folkers: Collegium Helveticum, Zurich, Switzerland
23. Sariyar M, Hoffmann I, Binder H. Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data. BMC Bioinformatics 2014; 15:58. [PMID: 24571520] [PMCID: PMC3945780] [DOI: 10.1186/1471-2105-15-58]
Abstract
Background Molecular data, e.g. arising from microarray technology, are often used for predicting survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interaction terms in high-dimensional time-to-event settings, especially for settings in which it is computationally too expensive to check all possible interactions. Results We use a boosting technique for estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge is interactions composed of variables that do not represent main effects, but our findings are promising in this regard as well. Results on real-world data illustrate that effect sizes of interactions frequently may not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance. Conclusion Screening interactions through random forests is feasible and useful when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones.
Affiliations:
- Murat Sariyar: Institute of Medical Biostatistics, Epidemiology and Informatics, Medical Center of the Johannes Gutenberg University, Mainz 55131, Germany
24. Lourenço A, Coenye T, Goeres DM, Donelli G, Azevedo AS, Ceri H, Coelho FL, Flemming HC, Juhna T, Lopes SP, Oliveira R, Oliver A, Shirtliff ME, Sousa AM, Stoodley P, Pereira MO, Azevedo NF. Minimum information about a biofilm experiment (MIABiE): standards for reporting experiments and data on sessile microbial communities living at interfaces. Pathog Dis 2014; 70:250-6. [PMID: 24478124] [DOI: 10.1111/2049-632x.12146]
Abstract
The minimum information about a biofilm experiment (MIABiE) initiative has arisen from the need to find an adequate and scientifically sound way to control the quality of the documentation accompanying the public deposition of biofilm-related data, particularly those obtained using high-throughput devices and techniques. To this end, the MIABiE consortium has initiated the identification and organization of a set of modules containing the minimum information that needs to be reported to guarantee the interpretability and independent verification of experimental results and their integration with knowledge coming from other fields. MIABiE does not intend to propose specific standards on how biofilm experiments should be performed, because it is acknowledged that specific research questions require specific conditions which may deviate from any standardization. Instead, MIABiE presents guidelines about the data to be recorded and published in order for the procedure and results to be easily and unequivocally interpreted and reproduced. Overall, MIABiE opens up the discussion about a number of particular areas of interest and attempts to achieve a broad consensus about which biofilm data and metadata should be reported in scientific journals in a systematic, rigorous and understandable manner.
Affiliations:
- Anália Lourenço: Departamento de Informática, Universidade de Vigo, ESEI - Escuela Superior de Ingeniería Informática, Ourense, Spain; IBB - Institute for Biotechnology and Bioengineering, Centre of Biological Engineering, University of Minho, Braga, Portugal
25. Eckels J, Nathe C, Nelson EK, Shoemaker SG, Nostrand EV, Yates NL, Ashley VC, Harris LJ, Bollenbeck M, Fong Y, Tomaras GD, Piehler B. Quality control, analysis and secure sharing of Luminex® immunoassay data using the open source LabKey Server platform. BMC Bioinformatics 2013; 14:145. [PMID: 23631706] [PMCID: PMC3671158] [DOI: 10.1186/1471-2105-14-145]
Abstract
BACKGROUND Immunoassays that employ multiplexed bead arrays produce high information content per sample. Such assays are now frequently used to evaluate humoral responses in clinical trials. Integrated software is needed for the analysis, quality control, and secure sharing of the high volume of data produced by such multiplexed assays. Software that facilitates data exchange and provides flexibility to perform customized analyses (including multiple curve fits and visualizations of assay performance over time) could increase scientists' capacity to use these immunoassays to evaluate human clinical trials. RESULTS The HIV Vaccine Trials Network and the Statistical Center for HIV/AIDS Research and Prevention collaborated with LabKey Software to enhance the open source LabKey Server platform to facilitate workflows for multiplexed bead assays. This system now supports the management, analysis, quality control, and secure sharing of data from multiplexed immunoassays that leverage Luminex xMAP® technology. These assays may be custom or kit-based. Newly added features enable labs to: (i) import run data from spreadsheets output by Bio-Plex Manager™ software; (ii) customize data processing, curve fits, and algorithms through scripts written in common languages, such as R; (iii) select script-defined calculation options through a graphical user interface; (iv) collect custom metadata for each titration, analyte, run and batch of runs; (v) calculate dose-response curves for titrations; (vi) interpolate unknown concentrations from curves for titrated standards; (vii) flag run data for exclusion from analysis; (viii) track quality control metrics across runs using Levey-Jennings plots; and (ix) automatically flag outliers based on expected values. Existing system features allow researchers to analyze, integrate, visualize, export and securely share their data, as well as to construct custom user interfaces and workflows. CONCLUSIONS Unlike other tools tailored for Luminex immunoassays, LabKey Server allows labs to customize their Luminex analyses using scripting while still presenting users with a single, graphical interface for processing and analyzing data. The LabKey Server system also stands out among Luminex tools for enabling smooth, secure transfer of data, quality control information, and analyses between collaborators. LabKey Server and its Luminex features are freely available as open source software at http://www.labkey.com under the Apache 2.0 license.
Affiliations:
- Sara G Shoemaker: Statistical Center for HIV/AIDS Research & Prevention (SCHARP), Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Nicole L Yates: Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA; Department of Medicine, Duke University Medical Center, Durham, NC, USA
- Vicki C Ashley: Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA; Department of Surgery, Duke University Medical Center, Durham, NC, USA
- Linda J Harris: Statistical Center for HIV/AIDS Research & Prevention (SCHARP), Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Mark Bollenbeck: Statistical Center for HIV/AIDS Research & Prevention (SCHARP), Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Youyi Fong: Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Georgia D Tomaras: Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA; Departments of Medicine, Molecular Genetics and Microbiology, and Immunology, Duke University Medical Center, Durham, NC, USA
26. Nelson EK, Piehler B, Rauch A, Ramsay S, Holman D, Asare S, Asare A, Igra M. Ancillary study management systems: a review of needs. BMC Med Inform Decis Mak 2013; 13:5. [PMID: 23294514] [PMCID: PMC3564696] [DOI: 10.1186/1472-6947-13-5]
Abstract
BACKGROUND The valuable clinical data, specimens, and assay results collected during a primary clinical trial or observational study can enable researchers to answer additional, pressing questions with relatively small investments in new measurements. However, management of such follow-on, "ancillary" studies is complex. It requires coordinating across institutions, sites, repositories, and approval boards, as well as distributing, integrating, and analyzing diverse data types. General-purpose software systems that simplify the management of ancillary studies have not yet been explored in the research literature. METHODS We have identified requirements for ancillary study management primarily as part of our ongoing work with a number of large research consortia. These organizations include the Center for HIV/AIDS Vaccine Immunology (CHAVI), the Immune Tolerance Network (ITN), the HIV Vaccine Trials Network (HVTN), the U.S. Military HIV Research Program (MHRP), and the Network for Pancreatic Organ Donors with Diabetes (nPOD). We also consulted with researchers at a range of other disease research organizations regarding their workflows and data management strategies. Lastly, to enhance breadth, we reviewed process documents for ancillary study management from other organizations. RESULTS By exploring characteristics of ancillary studies, we identify differentiating requirements and scenarios for ancillary study management systems (ASMSs). Distinguishing characteristics of ancillary studies may include the collection of additional measurements (particularly new analyses of existing specimens); the initiation of studies by investigators unaffiliated with the original study; cross-protocol data pooling and analysis; pre-existing participant consent; and pre-existing data context and provenance. For an ASMS to address these characteristics, it would need to address both operational requirements (e.g., allocating existing specimens) and data management requirements (e.g., securely distributing and integrating primary and ancillary data). CONCLUSIONS The scenarios and requirements we describe can help guide the development of systems that make conducting ancillary studies easier, less expensive, and less error-prone. Given the relatively consistent characteristics and challenges of ancillary study management, general-purpose ASMSs are likely to be useful to a wide range of organizations. Using the requirements identified in this paper, we are currently developing an open-source, general-purpose ASMS based on LabKey Server (http://www.labkey.org) in collaboration with CHAVI, the ITN and nPOD.
Affiliations:
- Sarah Ramsay: Statistical Center for HIV/AIDS Research & Prevention (SCHARP), Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Drienna Holman: Statistical Center for HIV/AIDS Research & Prevention (SCHARP), Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Smita Asare: University of California, San Francisco, CA, USA; Immune Tolerance Network, Bethesda, MD, USA
- Adam Asare: University of California, San Francisco, CA, USA; Immune Tolerance Network, Bethesda, MD, USA