1
Fouad K, Vavrek R, Surles-Zeigler MC, Huie JR, Radabaugh HL, Gurkoff GG, Visser U, Grethe JS, Martone ME, Ferguson AR, Gensel JC, Torres-Espin A. A practical guide to data management and sharing for biomedical laboratory researchers. Exp Neurol 2024; 378:114815. [PMID: 38762093 DOI: 10.1016/j.expneurol.2024.114815] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2023] [Revised: 05/13/2024] [Accepted: 05/14/2024] [Indexed: 05/20/2024]
Abstract
Effective data management and sharing have become increasingly crucial in biomedical research; however, many laboratory researchers lack the necessary tools and knowledge to address this challenge. This article, produced by practicing scientists, provides an introductory guide to research data management (RDM) and to the importance of the FAIR (Findable, Accessible, Interoperable, and Reusable) data-sharing principles for laboratory researchers. We explore the advantages of implementing organized data management strategies and introduce key concepts such as data standards, data documentation, and the distinction between machine- and human-readable data formats. Furthermore, we offer practical guidance for creating a data management plan and establishing efficient data workflows within the laboratory setting, suitable for labs of all sizes. This includes an examination of requirements analysis, the development of a data dictionary for routine data elements, the implementation of unique subject identifiers, and the formulation of standard operating procedures (SOPs) for seamless data flow. To aid researchers in implementing these practices, we present a simple organizational system as an illustrative example, which can be tailored to suit individual needs and research requirements. By presenting a user-friendly approach, this guide serves as an introduction to the field of RDM and offers practical tips to help researchers effortlessly meet the common data management and sharing mandates rapidly becoming prevalent in biomedical research.
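As a rough illustration of two concepts named in this abstract, a data dictionary for routine data elements and unique subject identifiers, the following Python sketch checks a small table against a dictionary. The element names, allowed ranges, and the `LAB-0001` identifier pattern are hypothetical and are not taken from the article.

```python
import re

# Hypothetical data dictionary: one entry per routine data element.
DATA_DICTIONARY = {
    "subject_id": {"type": str, "pattern": r"^LAB-\d{4}$"},   # assumed ID format
    "weight_g": {"type": float, "min": 0.0, "max": 1000.0},   # assumed plausible range
    "sex": {"type": str, "allowed": {"F", "M"}},
}


def validate_rows(rows):
    """Return a list of human-readable problems found in `rows` (a list of dicts)."""
    problems = []
    seen_ids = set()
    for index, row in enumerate(rows):
        for name, rule in DATA_DICTIONARY.items():
            if name not in row:
                problems.append(f"row {index}: missing '{name}'")
                continue
            value = row[name]
            if not isinstance(value, rule["type"]):
                problems.append(f"row {index}: '{name}' has the wrong type")
                continue
            if "pattern" in rule and not re.match(rule["pattern"], value):
                problems.append(f"row {index}: '{name}' does not match the ID pattern")
            if "allowed" in rule and value not in rule["allowed"]:
                problems.append(f"row {index}: '{name}' is not an allowed value")
            if "min" in rule and not (rule["min"] <= value <= rule["max"]):
                problems.append(f"row {index}: '{name}' is out of range")
        # Subject identifiers must be unique across the whole table.
        subject = row.get("subject_id")
        if isinstance(subject, str):
            if subject in seen_ids:
                problems.append(f"row {index}: duplicate subject_id '{subject}'")
            seen_ids.add(subject)
    return problems
```

Calling `validate_rows` on each new data sheet before it is filed would be one way to make such a dictionary part of a routine workflow; the rules themselves would need to reflect each lab's own data elements.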
Affiliation(s)
- K Fouad
  - Department of Physical Therapy, Faculty of Rehabilitation Medicine, University of Alberta, Edmonton, AB, Canada.
- R Vavrek
  - Department of Physical Therapy, Faculty of Rehabilitation Medicine, University of Alberta, Edmonton, AB, Canada
- M C Surles-Zeigler
  - Department of Neuroscience, University of California, San Diego, La Jolla, CA, United States
- J R Huie
  - Department of Neurosurgery, Brain and Spinal Injury Center, Weill Institutes for Neurosciences, University of California, San Francisco, San Francisco, CA, United States; San Francisco Veterans Affairs Healthcare System, San Francisco, CA, United States
- H L Radabaugh
  - Department of Neurosurgery, Brain and Spinal Injury Center, Weill Institutes for Neurosciences, University of California, San Francisco, San Francisco, CA, United States
- G G Gurkoff
  - Center for Neuroscience, University of California Davis, Davis, CA, United States; Department of Neurological Surgery, University of California Davis, Davis, CA, United States; Northern California Veterans Affairs Healthcare System, Martinez, CA, United States
- U Visser
  - Department of Computer Science, University of Miami, Coral Gables, FL, United States
- J S Grethe
  - Department of Neuroscience, University of California, San Diego, La Jolla, CA, United States
- M E Martone
  - Department of Neuroscience, University of California, San Diego, La Jolla, CA, United States; San Francisco Veterans Affairs Healthcare System, San Francisco, CA, United States
- A R Ferguson
  - Department of Neurosurgery, Brain and Spinal Injury Center, Weill Institutes for Neurosciences, University of California, San Francisco, San Francisco, CA, United States; San Francisco Veterans Affairs Healthcare System, San Francisco, CA, United States
- J C Gensel
  - Spinal Cord and Brain Injury Research Center and Department of Physiology, University of Kentucky College of Medicine, Lexington, KY, United States.
- A Torres-Espin
  - Department of Physical Therapy, Faculty of Rehabilitation Medicine, University of Alberta, Edmonton, AB, Canada; Department of Neurosurgery, Brain and Spinal Injury Center, Weill Institutes for Neurosciences, University of California, San Francisco, San Francisco, CA, United States; School of Public Health Sciences, University of Waterloo, Waterloo, ON, Canada.
2
Tiemann JKS, Szczuka M, Bouarroudj L, Oussaren M, Garcia S, Howard RJ, Delemotte L, Lindahl E, Baaden M, Lindorff-Larsen K, Chavent M, Poulain P. MDverse: Shedding Light on the Dark Matter of Molecular Dynamics Simulations. bioRxiv 2024:2023.05.02.538537. [PMID: 37205542 PMCID: PMC10187166 DOI: 10.1101/2023.05.02.538537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2023]
Abstract
The rise of open science and the absence of a global dedicated data repository for molecular dynamics (MD) simulations have led to the accumulation of MD files in generalist data repositories, constituting the dark matter of MD - data that is technically accessible, but neither indexed, curated, nor easily searchable. Leveraging an original search strategy, we found and indexed about 250,000 files and 2,000 datasets from Zenodo, Figshare and Open Science Framework. With a focus on files produced by the Gromacs MD software, we illustrate the potential offered by the mining of publicly available MD data. We identified systems with specific molecular composition and were able to characterize essential parameters of MD simulation such as temperature and simulation length, and could identify model resolution, such as all-atom and coarse-grain. Based on this analysis, we inferred metadata to propose a search engine prototype to explore the MD data. To continue in this direction, we call on the community to pursue the effort of sharing MD data, and to report and standardize metadata to reuse this valuable matter.
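To make the idea of recovering temperature and simulation length from shared Gromacs files more concrete, here is a minimal Python sketch that reads the text of a `.mdp` parameter file. It is not the authors' pipeline: it assumes only the standard `dt` (time step in ps), `nsteps`, and `ref_t` parameters are of interest, and the function names are hypothetical.

```python
def parse_mdp(text):
    """Parse GROMACS .mdp text into a dict mapping parameter names to string values."""
    params = {}
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()  # ';' starts a comment in .mdp files
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Dashes and underscores are interchangeable in .mdp parameter names,
        # so normalize to underscores before storing.
        params[key.strip().replace("-", "_")] = value.strip()
    return params


def summarize(params):
    """Return (simulation length in ns, list of reference temperatures in K)."""
    dt_ps = float(params.get("dt", "0.001"))  # 0.001 ps is the GROMACS default
    nsteps = int(params.get("nsteps", "0"))
    length_ns = nsteps * dt_ps / 1000.0       # ps -> ns
    temperatures = [float(t) for t in params.get("ref_t", "").split()]
    return length_ns, temperatures
```

A file with `nsteps = -1` (an open-ended run) would yield a negative length here, so a real indexing tool would need to treat that case separately.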
3
Bibik P, Alibai S, Pandini A, Dantu SC. PyCoM: a python library for large-scale analysis of residue-residue coevolution data. Bioinformatics 2024; 40:btae166. [PMID: 38532297 PMCID: PMC11009027 DOI: 10.1093/bioinformatics/btae166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 02/02/2024] [Accepted: 03/25/2024] [Indexed: 03/28/2024]
Abstract
Motivation: Computational methods to detect correlated amino acid positions in proteins have become a valuable tool to predict intra- and inter-residue protein contacts, protein structures, and effects of mutation on protein stability and function. While there are many tools and webservers to compute coevolution scoring matrices, there is no central repository of alignments and coevolution matrices for large-scale studies and pattern detection leveraging on biological and structural annotations already available in UniProt.
Results: We present a Python library, PyCoM, which enables users to query and analyze coevolution matrices and sequence alignments of 457 622 proteins, selected from UniProtKB/Swiss-Prot database (length ≤ 500 residues), from a precompiled coevolution matrix database (PyCoMdb). PyCoM facilitates the development of statistical analyses of residue coevolution patterns using filters on biological and structural annotations from UniProtKB/Swiss-Prot, with simple access to PyCoMdb for both novice and advanced users, supporting Jupyter Notebooks, Python scripts, and a web API access. The resource is open source and will help in generating data-driven computational models and methods to study and understand protein structures, stability, function, and design.
Availability and implementation: PyCoM code is freely available from https://github.com/scdantu/pycom and PyCoMdb and the Jupyter Notebook tutorials are freely available from https://pycom.brunel.ac.uk.
Affiliation(s)
- Philipp Bibik
  - Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
- Sabriyeh Alibai
  - Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
- Alessandro Pandini
  - Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
- Sarath Chandra Dantu
  - Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
4
Emissah H, Ljungquist B, Ascoli GA. Bibliometric analysis of neuroscience publications quantifies the impact of data sharing. Bioinformatics 2023; 39:btad746. [PMID: 38070153 PMCID: PMC10733721 DOI: 10.1093/bioinformatics/btad746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2023] [Revised: 11/01/2023] [Accepted: 12/07/2023] [Indexed: 12/19/2023]
Abstract
Summary: Neural morphology, the branching geometry of brain cells, is an essential cellular substrate of nervous system function and pathology. Despite the accelerating production of digital reconstructions of neural morphology, the public accessibility of data remains a core issue in neuroscience. Deficiencies in the availability of existing data create redundancy of research efforts and limit synergy. We carried out a comprehensive bibliometric analysis of neural morphology publications to quantify the impact of data sharing in the neuroscience community. Our findings demonstrate that sharing digital reconstructions of neural morphology via NeuroMorpho.Org leads to a significant increase of citations to the original article, thus directly benefiting authors. The rate of data reusage remains constant for at least 16 years after sharing (the whole period analyzed), altogether nearly doubling the peer-reviewed discoveries in the field. Furthermore, the recent availability of larger and more numerous datasets fostered integrative applications, which accrue on average twice the citations of re-analyses of individual datasets. We also released an open-source citation tracking web-service allowing researchers to monitor reusage of their datasets in independent peer-reviewed reports. These results and tools can facilitate the recognition of shared data reuse for merit evaluations and funding decisions.
Availability and implementation: The application is available at http://cng-nmo-dev3.orc.gmu.edu:8181/. The source code is available at https://github.com/HerveEmissah/nmo-authors-app and https://github.com/HerveEmissah/nmo-bibliometric-analysis.
Affiliation(s)
- Herve Emissah
  - Bioinformatics Program, College of Science, George Mason University, Fairfax, VA 22030, United States
  - Center for Neural Informatics, Structures, & Plasticity (CN3) and Bioengineering Department, College of Engineering & Computing, George Mason University, Fairfax, VA 22030, United States
- Bengt Ljungquist
  - Center for Neural Informatics, Structures, & Plasticity (CN3) and Bioengineering Department, College of Engineering & Computing, George Mason University, Fairfax, VA 22030, United States
- Giorgio A Ascoli
  - Bioinformatics Program, College of Science, George Mason University, Fairfax, VA 22030, United States
  - Center for Neural Informatics, Structures, & Plasticity (CN3) and Bioengineering Department, College of Engineering & Computing, George Mason University, Fairfax, VA 22030, United States
5
Way GP, Sailem H, Shave S, Kasprowicz R, Carragher NO. Evolution and impact of high content imaging. SLAS Discov 2023; 28:292-305. [PMID: 37666456 DOI: 10.1016/j.slasd.2023.08.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Revised: 08/09/2023] [Accepted: 08/29/2023] [Indexed: 09/06/2023]
Abstract
The field of high content imaging has steadily evolved and expanded substantially across many industry and academic research institutions since it was first described in the early 1990s. High content imaging refers to the automated acquisition and analysis of microscopic images from a variety of biological sample types. Integration of high content imaging microscopes with multiwell plate handling robotics enables high content imaging to be performed at scale and support medium- to high-throughput screening of pharmacological, genetic and diverse environmental perturbations upon complex biological systems ranging from 2D cell cultures to 3D tissue organoids to small model organisms. In this perspective article the authors provide a collective view on the following key discussion points relevant to the evolution of high content imaging:
- Evolution and impact of high content imaging: An academic perspective
- Evolution and impact of high content imaging: An industry perspective
- Evolution of high content image analysis
- Evolution of high content data analysis pipelines towards multiparametric and phenotypic profiling applications
- The role of data integration and multiomics
- The role and evolution of image data repositories and sharing standards
- Future perspective of high content imaging hardware and software.
Affiliation(s)
- Gregory P Way
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Heba Sailem
  - School of Cancer and Pharmaceutical Sciences, King's College London, UK
- Steven Shave
  - GlaxoSmithKline Medicines Research Centre, Gunnels Wood Rd, Stevenage SG1 2NY, UK; Edinburgh Cancer Research, Cancer Research UK Scotland Centre, Institute of Genetics and Cancer, University of Edinburgh, UK
- Richard Kasprowicz
  - GlaxoSmithKline Medicines Research Centre, Gunnels Wood Rd, Stevenage SG1 2NY, UK
- Neil O Carragher
  - Edinburgh Cancer Research, Cancer Research UK Scotland Centre, Institute of Genetics and Cancer, University of Edinburgh, UK.
6
Emissah H, Ljungquist B, Ascoli GA. Bibliometric analysis of neuroscience publications quantifies the impact of data sharing. bioRxiv 2023:2023.09.12.557386. [PMID: 37745378 PMCID: PMC10515804 DOI: 10.1101/2023.09.12.557386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
Motivation: Neural morphology, the branching geometry of neurons and glia in the nervous system, is an essential cellular substrate of brain function and pathology. Despite the accelerating production of digital reconstructions of neural morphology in laboratories worldwide, the public accessibility of data remains a core issue in neuroscience. Deficiencies in the availability of existing data create redundancy of research efforts and prevent researchers from building on others' work. Data sharing complements the development of computational resources and literature mining tools to accelerate scientific discovery.
Results: We carried out a comprehensive bibliometric analysis of neural morphology publications to quantify the impact of data sharing in the neuroscience community. Our findings demonstrate that sharing digital reconstructions of neural morphology via the NeuroMorpho.Org online repository leads to a significant increase of citations to the original article, thus directly benefiting the authors. Moreover, the rate of data reusage remains constant for at least 16 years after sharing (the whole period analyzed), altogether nearly doubling the peer-reviewed discoveries in the field. Furthermore, the recent availability of larger and more numerous datasets fostered integrative meta-analysis applications, which accrue on average twice the citations of re-analyses of individual datasets. We also designed and deployed an open-source citation tracking web-service that allows researchers to monitor reusage of their datasets in independent peer-reviewed reports. These results and the released tool can facilitate the recognition of shared data reuse for promotion and tenure considerations, merit evaluations, and funding decisions.
Affiliation(s)
- Herve Emissah
  - Bioinformatics Program, College of Science, George Mason University
- Bengt Ljungquist
  - Center for Neural Informatics, Structures, and Plasticity, College of Engineering & Computing, George Mason University
- Giorgio A. Ascoli
  - Bioinformatics Program, College of Science, George Mason University
  - Center for Neural Informatics, Structures, and Plasticity, College of Engineering & Computing, George Mason University
7
Kemmer I, Keppler A, Serrano-Solano B, Rybina A, Özdemir B, Bischof J, El Ghadraoui A, Eriksson JE, Mathur A. Building a FAIR image data ecosystem for microscopy communities. Histochem Cell Biol 2023; 160:199-209. [PMID: 37341795 PMCID: PMC10492678 DOI: 10.1007/s00418-023-02203-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Accepted: 04/27/2023] [Indexed: 06/22/2023]
Abstract
Bioimaging has now entered the era of big data with faster-than-ever development of complex microscopy technologies leading to increasingly complex datasets. This enormous increase in data size and informational complexity within those datasets has brought with it several difficulties in terms of common and harmonized data handling, analysis, and management practices, which are currently hampering the full potential of image data being realized. Here, we outline a wide range of efforts and solutions currently being developed by the microscopy community to address these challenges on the path towards FAIR bioimaging data. We also highlight how different actors in the microscopy ecosystem are working together, creating synergies that develop new approaches, and how research infrastructures, such as Euro-BioImaging, are fostering these interactions to shape the field.
Affiliation(s)
- Isabel Kemmer
  - Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
- Antje Keppler
  - Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
- Beatriz Serrano-Solano
  - Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
- Arina Rybina
  - Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
- Buğra Özdemir
  - Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
- Johanna Bischof
  - Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
- Ayoub El Ghadraoui
  - Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
- John E Eriksson
  - Euro-BioImaging ERIC Statutory Seat, Tykistökatu 6, P.O. Box 123, 20521, Turku, Finland
- Aastha Mathur
  - Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany.
8
O'Connor LM, O'Connor BA, Lim SB, Zeng J, Lo CH. Integrative multi-omics and systems bioinformatics in translational neuroscience: A data mining perspective. J Pharm Anal 2023; 13:836-850. [PMID: 37719197 PMCID: PMC10499660 DOI: 10.1016/j.jpha.2023.06.011] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 11/29/2022] [Revised: 06/20/2023] [Accepted: 06/25/2023] [Indexed: 09/19/2023]
Abstract
Bioinformatic analysis of large and complex omics datasets has become increasingly useful in modern day biology by providing a great depth of information, with its application to neuroscience termed neuroinformatics. Data mining of omics datasets has enabled the generation of new hypotheses based on differentially regulated biological molecules associated with disease mechanisms, which can be tested experimentally for improved diagnostic and therapeutic targeting of neurodegenerative diseases. Importantly, integrating multi-omics data using a systems bioinformatics approach will advance the understanding of the layered and interactive network of biological regulation that exchanges systemic knowledge to facilitate the development of a comprehensive human brain profile. In this review, we first summarize data mining studies utilizing datasets from the individual type of omics analysis, including epigenetics/epigenomics, transcriptomics, proteomics, metabolomics, lipidomics, and spatial omics, pertaining to Alzheimer's disease, Parkinson's disease, and multiple sclerosis. We then discuss multi-omics integration approaches, including independent biological integration and unsupervised integration methods, for more intuitive and informative interpretation of the biological data obtained across different omics layers. We further assess studies that integrate multi-omics in data mining which provide convoluted biological insights and offer proof-of-concept proposition towards systems bioinformatics in the reconstruction of brain networks. Finally, we recommend a combination of high dimensional bioinformatics analysis with experimental validation to achieve translational neuroscience applications including biomarker discovery, therapeutic development, and elucidation of disease mechanisms. We conclude by providing future perspectives and opportunities in applying integrative multi-omics and systems bioinformatics to achieve precision phenotyping of neurodegenerative diseases and towards personalized medicine.
Affiliation(s)
- Lance M. O'Connor
  - College of Biological Sciences, University of Minnesota, Minneapolis, MN, 55455, USA
- Blake A. O'Connor
  - School of Pharmacy, University of Wisconsin, Madison, WI, 53705, USA
- Su Bin Lim
  - Department of Biochemistry and Molecular Biology, Ajou University School of Medicine, Suwon, 16499, South Korea
- Jialiu Zeng
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, 308232, Singapore
- Chih Hung Lo
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, 308232, Singapore
9
Danis D, Jacobsen JOB, Wagner AH, Groza T, Beckwith MA, Rekerle L, Carmody LC, Reese J, Hegde H, Ladewig MS, Seitz B, Munoz-Torres M, Harris NL, Rambla J, Baudis M, Mungall CJ, Haendel MA, Robinson PN. Phenopacket-tools: Building and validating GA4GH Phenopackets. PLoS One 2023; 18:e0285433. [PMID: 37196000 PMCID: PMC10191354 DOI: 10.1371/journal.pone.0285433] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Accepted: 04/21/2023] [Indexed: 05/19/2023]
Abstract
The Global Alliance for Genomics and Health (GA4GH) is a standards-setting organization that is developing a suite of coordinated standards for genomics. The GA4GH Phenopacket Schema is a standard for sharing disease and phenotype information that characterizes an individual person or biosample. The Phenopacket Schema is flexible and can represent clinical data for any kind of human disease including rare disease, complex disease, and cancer. It also allows consortia or databases to apply additional constraints to ensure uniform data collection for specific goals. We present phenopacket-tools, an open-source Java library and command-line application for construction, conversion, and validation of phenopackets. Phenopacket-tools simplifies construction of phenopackets by providing concise builders, programmatic shortcuts, and predefined building blocks (ontology classes) for concepts such as anatomical organs, age of onset, biospecimen type, and clinical modifiers. Phenopacket-tools can be used to validate the syntax and semantics of phenopackets as well as to assess adherence to additional user-defined requirements. The documentation includes examples showing how to use the Java library and the command-line tool to create and validate phenopackets. We demonstrate how to create, convert, and validate phenopackets using the library or the command-line application. Source code, API documentation, comprehensive user guide and a tutorial can be found at https://github.com/phenopackets/phenopacket-tools. The library can be installed from the public Maven Central artifact repository and the application is available as a standalone archive. The phenopacket-tools library helps developers implement and standardize the collection and exchange of phenotypic and other clinical data for use in phenotype-driven genomic diagnostics, translational research, and precision medicine applications.
Affiliation(s)
- Daniel Danis
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
- Julius O. B. Jacobsen
  - William Harvey Research Institute, Queen Mary University of London, London, United Kingdom
- Alex H. Wagner
  - Departments of Pediatrics and Biomedical Informatics, The Ohio State University College of Medicine, Columbus, OH, United States of America
  - The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, United States of America
- Martha A. Beckwith
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
- Lauren Rekerle
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
- Leigh C. Carmody
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
- Justin Reese
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, United States of America
- Harshad Hegde
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, United States of America
- Markus S. Ladewig
  - Department of Ophthalmology, Klinikum Saarbrücken, Saarbrücken, Germany
- Berthold Seitz
  - Department of Ophthalmology, Saarland University Medical Center, Homburg/Saar, Germany
- Monica Munoz-Torres
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America
- Nomi L. Harris
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, United States of America
- Jordi Rambla
  - European Genome-Phenome Archive (EGA) in the Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Michael Baudis
  - University of Zurich and Swiss Institute of Bioinformatics, Zurich, Switzerland
- Christopher J. Mungall
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, United States of America
- Melissa A. Haendel
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America
- Peter N. Robinson
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
  - Institute for Systems Genomics, University of Connecticut, Farmington, CT, United States of America
10
Tsueng G, Cano MAA, Bento J, Czech C, Kang M, Pache L, Rasmussen LV, Savidge TC, Starren J, Wu Q, Xin J, Yeaman MR, Zhou X, Su AI, Wu C, Brown L, Shabman RS, Hughes LD. Developing a standardized but extendable framework to increase the findability of infectious disease datasets. Sci Data 2023; 10:99. [PMID: 36823157 PMCID: PMC9950378 DOI: 10.1038/s41597-023-01968-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/12/2022] [Accepted: 01/13/2023] [Indexed: 02/25/2023]
Abstract
Biomedical datasets are increasing in size, stored in many repositories, and face challenges in FAIRness (findability, accessibility, interoperability, reusability). As a Consortium of infectious disease researchers from 15 Centers, we aim to adopt open science practices to promote transparency, encourage reproducibility, and accelerate research advances through data reuse. To improve FAIRness of our datasets and computational tools, we evaluated metadata standards across established biomedical data repositories. The vast majority do not adhere to a single standard, such as Schema.org, which is widely adopted by generalist repositories. Consequently, datasets in these repositories are not findable in aggregation projects like Google Dataset Search. We alleviated this gap by creating a reusable metadata schema based on Schema.org and catalogued nearly 400 datasets and computational tools we collected. The approach is easily reusable to create schemas interoperable with community standards, but customized to a particular context. Our approach enabled data discovery, increased the reusability of datasets from a large research consortium, and accelerated research. Lastly, we discuss ongoing challenges with FAIRness beyond discoverability.
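For readers unfamiliar with Schema.org metadata, the Python sketch below builds a minimal `Dataset` record as JSON-LD and enforces a small set of required properties. The property names (`name`, `description`, `identifier`, `license`, `keywords`) come from the Schema.org vocabulary, but the choice of which ones are required, the helper name, and the example values are assumptions for illustration and do not reproduce the consortium's schema.

```python
import json

# Hypothetical minimum set of properties a lab or consortium might require.
REQUIRED_PROPERTIES = ("name", "description", "identifier", "license")


def make_dataset_record(**properties):
    """Build a Schema.org Dataset record as a JSON-LD-ready dict."""
    missing = [p for p in REQUIRED_PROPERTIES if not properties.get(p)]
    if missing:
        raise ValueError(f"missing required properties: {', '.join(missing)}")
    record = {"@context": "https://schema.org", "@type": "Dataset"}
    record.update(properties)
    return record


# Placeholder values only; none of these describe a real dataset.
record = make_dataset_record(
    name="Example infection time course",
    description="Placeholder description of a hypothetical dataset.",
    identifier="https://example.org/datasets/0001",
    license="https://creativecommons.org/licenses/by/4.0/",
    keywords=["infectious disease", "example"],
)
print(json.dumps(record, indent=2))
```

Embedding such a record in a dataset landing page is the usual route by which dataset aggregators discover it; which properties to require is a design choice that each community would make for its own context.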
Affiliation(s)
- Ginger Tsueng
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA.
- Marco A Alvarado Cano
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- José Bento
  - Department of Computer Science, Boston College, 245 Beacon St, Chestnut Hill, MA, 02467, USA
- Candice Czech
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Mengjia Kang
  - Division of Pulmonary and Critical Care, Feinberg School of Medicine, Northwestern University, Chicago, IL, 60611, USA
- Lars Pache
  - Infectious and Inflammatory Disease Center, Immunity and Pathogenesis Program, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, 92037, USA
- Luke V Rasmussen
  - Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Tor C Savidge
  - Texas Children's Microbiome Center & Department of Pathology & Immunology, Baylor College of Medicine, Houston, TX, 77030, USA
- Justin Starren
  - Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Qinglong Wu
  - Texas Children's Microbiome Center & Department of Pathology & Immunology, Baylor College of Medicine, Houston, TX, 77030, USA
- Jiwen Xin
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Michael R Yeaman
  - Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Divisions of Molecular Medicine and Infectious Diseases, Harbor-UCLA Medical Center, Torrance, CA, 90502, USA
  - Lundquist Institute for Infection & Immunity at Harbor-UCLA Medical Center, Torrance, CA, 90502, USA
- Xinghua Zhou
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Andrew I Su
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
  - Scripps Research Translational Institute, La Jolla, CA, 92037, USA
  - Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Chunlei Wu
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
  - Scripps Research Translational Institute, La Jolla, CA, 92037, USA
  - Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Liliana Brown
  - Office of Genomics and Advanced Technologies, National Institute of Allergy and Infectious Diseases, Rockville, MD, 20852, USA
- Reed S Shabman
  - Office of Genomics and Advanced Technologies, National Institute of Allergy and Infectious Diseases, Rockville, MD, 20852, USA
- Laura D Hughes
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA.
| |
Collapse
|
11
|
Gomes DGE, Pottier P, Crystal-Ornelas R, Hudgins EJ, Foroughirad V, Sánchez-Reyes LL, Turba R, Martinez PA, Moreau D, Bertram MG, Smout CA, Gaynor KM. Why don't we share data and code? Perceived barriers and benefits to public archiving practices. Proc Biol Sci 2022; 289:20221113. [PMID: 36416041 PMCID: PMC9682438 DOI: 10.1098/rspb.2022.1113] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/08/2022] [Accepted: 11/02/2022] [Indexed: 08/10/2023] Open
Abstract
The biological sciences community is increasingly recognizing the value of open, reproducible and transparent research practices for science and society at large. Despite this recognition, many researchers fail to share their data and code publicly. This pattern may arise from knowledge barriers about how to archive data and code, concerns about its reuse, and misaligned career incentives. Here, we define, categorize and discuss barriers to data and code sharing that are relevant to many research fields. We explore how real and perceived barriers might be overcome or reframed in the light of the benefits relative to costs. By elucidating these barriers and the contexts in which they arise, we can take steps to mitigate them and align our actions with the goals of open science, both as individual scientists and as a scientific community.
Affiliation(s)
- Dylan G. E. Gomes
  - NRC Research Associate, Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, Seattle, WA 98112, USA
  - Cooperative Institute for Marine Resources Studies, Hatfield Marine Science Center, Oregon State University, Newport, OR 97365, USA
- Patrice Pottier
  - Evolution & Ecology Research Centre, School of Biological, Earth and Environmental Sciences, The University of New South Wales, Sydney, New South Wales 2052, Australia
- Robert Crystal-Ornelas
  - Earth and Environmental Sciences Area, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Emma J. Hudgins
  - Department of Biology, Carleton University, Ottawa, Canada, K1S 5B6
- Rachel Turba
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA 90095-7239, USA
- Paula Andrea Martinez
  - Australian Research Data Commons, The University of Queensland, Brisbane 4072, Australia
- David Moreau
  - School of Psychology and Centre for Brain Research, University of Auckland, Auckland 1010, New Zealand
- Michael G. Bertram
  - Department of Wildlife, Fish, and Environmental Studies, Swedish University of Agricultural Sciences, Umeå, SE-907 36, Sweden
- Cooper A. Smout
  - Institute for Globally Distributed Open Research and Education (IGDORE), Brisbane 4001, Australia
- Kaitlyn M. Gaynor
  - Departments of Zoology and Botany, University of British Columbia, Vancouver, Canada, BC V6T 1Z4
  - National Center for Ecological Analysis and Synthesis, Santa Barbara, CA 93101, USA
|
12
|
Unifying the identification of biomedical entities with the Bioregistry. Sci Data 2022; 9:714. [DOI: 10.1038/s41597-022-01807-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2022] [Accepted: 10/26/2022] [Indexed: 11/21/2022] Open
Abstract
The standardized identification of biomedical entities is a cornerstone of interoperability, reuse, and data integration in the life sciences. Several registries have been developed to catalog resources maintaining identifiers for biomedical entities such as small molecules, proteins, cell lines, and clinical trials. However, existing registries have struggled to provide sufficient coverage and metadata standards that meet the evolving needs of modern life sciences researchers. Here, we introduce the Bioregistry, an integrative, open, community-driven metaregistry that synthesizes and substantially expands upon 23 existing registries. The Bioregistry addresses the need for a sustainable registry by leveraging public infrastructure and automation, and employing a progressive governance model centered around open code and open data to foster community contribution. The Bioregistry can be used to support the standardized annotation of data, models, ontologies, and scientific literature, thereby promoting their interoperability and reuse. The Bioregistry can be accessed through https://bioregistry.io and its source code and data are available under the MIT and CC0 Licenses at https://github.com/biopragmatics/bioregistry.
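To make the idea of standardized identifiers concrete, the following sketch splits a compact identifier (CURIE) of the form `prefix:identifier` and builds a resolver link on the https://bioregistry.io site named in the abstract. The function names are hypothetical, the `<base>/<prefix>:<identifier>` URL pattern is an assumption about how the resolver is addressed, and lowercasing the prefix is a simplification rather than the Bioregistry's full normalization logic.

```python
def parse_curie(curie):
    """Split 'prefix:identifier' into a (prefix, identifier) pair.

    The prefix is lowercased as a simple normalization step (an assumption
    made for this sketch); the identifier is kept as written.
    """
    prefix, sep, identifier = curie.partition(":")
    prefix = prefix.strip().lower()
    identifier = identifier.strip()
    if not sep or not prefix or not identifier:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, identifier


def resolver_url(curie, base="https://bioregistry.io"):
    """Build a resolver link, assuming the pattern <base>/<prefix>:<identifier>."""
    prefix, identifier = parse_curie(curie)
    return f"{base}/{prefix}:{identifier}"


if __name__ == "__main__":
    print(parse_curie("CHEBI:138488"))
    print(resolver_url("CHEBI:138488"))
```

Only the first colon is treated as the separator, so identifiers that themselves contain a colon are passed through unchanged.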
|
13
|
Bittremieux W, Wang M, Dorrestein PC. The critical role that spectral libraries play in capturing the metabolomics community knowledge. Metabolomics 2022; 18:94. [PMID: 36409434 DOI: 10.1007/s11306-022-01947-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 08/01/2022] [Accepted: 10/19/2022] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Spectral library searching is currently the most common approach for compound annotation in untargeted metabolomics. Spectral libraries applicable to liquid chromatography mass spectrometry have grown in size over the past decade to include hundreds of thousands to millions of mass spectra and tens of thousands of compounds, forming an essential knowledge base for the interpretation of metabolomics experiments.
AIM OF REVIEW: We describe existing spectral library resources, highlight different strategies for compiling spectral libraries, and discuss quality considerations that should be taken into account when interpreting spectral library searching results. Finally, we describe how spectral libraries are empowering the next generation of machine learning tools in computational metabolomics, and discuss several opportunities for using increasingly accessible large spectral libraries.
KEY SCIENTIFIC CONCEPTS OF REVIEW: This review focuses on the current state of spectral libraries for untargeted LC-MS/MS based metabolomics. We show how the number of entries in publicly accessible spectral libraries has increased more than 60-fold in the past eight years to aid molecular interpretation and we discuss how the role of spectral libraries in untargeted metabolomics will evolve in the near future.
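Spectral library searching, as summarized in this abstract, compares a query spectrum against reference spectra and reports the best-scoring compound. The sketch below assumes cosine similarity on binned peaks as the score, which is one commonly used choice and not necessarily the method discussed in the review; the function names, the 1.0 m/z bin width, and the toy spectra are all illustrative assumptions.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine_score(query, reference, bin_width=1.0):
    """Cosine similarity between two binned spectra (0.0 if either is empty)."""
    q = bin_spectrum(query, bin_width)
    r = bin_spectrum(reference, bin_width)
    dot = sum(value * r.get(key, 0.0) for key, value in q.items())
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    norm_r = math.sqrt(sum(value * value for value in r.values()))
    if norm_q == 0.0 or norm_r == 0.0:
        return 0.0
    return dot / (norm_q * norm_r)


def best_match(query, library, bin_width=1.0):
    """Return the name of the library entry with the highest cosine score."""
    return max(
        library,
        key=lambda name: cosine_score(query, library[name], bin_width),
    )


if __name__ == "__main__":
    # Toy spectra with made-up peaks, for illustration only.
    library = {
        "compound_a": [(100.1, 50.0), (150.2, 100.0), (200.3, 25.0)],
        "compound_b": [(110.0, 80.0), (210.5, 60.0)],
    }
    query = [(100.2, 45.0), (150.1, 90.0), (200.4, 30.0)]
    print(best_match(query, library))
```

Real search tools typically add precursor mass filtering, intensity scaling, and score thresholds, which is where the quality considerations mentioned in the abstract come into play.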
Affiliation(s)
- Wout Bittremieux
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
- Mingxun Wang
  - Department of Computer Science, University of California Riverside, Riverside, CA, 92507, USA
- Pieter C Dorrestein
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
|
14
|
Wang LQ, Fernandez-Boyano I, Robinson WP. Genetic variation in placental insufficiency: What have we learned over time? Front Cell Dev Biol 2022; 10:1038358. [PMID: 36313546 PMCID: PMC9613937 DOI: 10.3389/fcell.2022.1038358] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/06/2022] [Accepted: 10/03/2022] [Indexed: 11/28/2022] Open
Abstract
Genetic variation shapes placental development and function, which has long been known to impact fetal growth and pregnancy outcomes such as miscarriage or maternal pre-eclampsia. Early epidemiology studies provided evidence of a strong heritable component to these conditions with both maternal and fetal-placental genetic factors contributing. Subsequently, cytogenetic studies of the placenta and the advent of prenatal diagnosis to detect chromosomal abnormalities provided direct evidence of the importance of spontaneously arising genetic variation in the placenta, such as trisomy and uniparental disomy, drawing inferences that remain relevant to this day. Candidate gene approaches highlighted the role of genetic variation in genes influencing immune interactions at the maternal-fetal interface and angiogenic factors. More recently, the emergence of molecular techniques and in particular high-throughput technologies such as Single-Nucleotide Polymorphism (SNP) arrays, has facilitated the discovery of copy number variation and study of SNP associations with conditions related to placental insufficiency. This review integrates past and more recent knowledge to provide important insights into the role of placental function on fetal and perinatal health, as well as into the mechanisms leading to genetic variation during development.
Affiliation(s)
- Li Qing Wang
  - BC Children’s Hospital Research Institute, Vancouver, BC, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
- Icíar Fernandez-Boyano
  - BC Children’s Hospital Research Institute, Vancouver, BC, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
- Wendy P. Robinson
  - BC Children’s Hospital Research Institute, Vancouver, BC, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
  - *Correspondence: Wendy P. Robinson,
|
15
|
Garcia BJ, Urrutia J, Zheng G, Becker D, Corbet C, Maschhoff P, Cristofaro A, Gaffney N, Vaughn M, Saxena U, Chen YP, Gordon DB, Eslami M. A toolkit for enhanced reproducibility of RNASeq analysis for synthetic biologists. Synthetic Biology (Oxford, England) 2022; 7:ysac012. [PMID: 36035514 PMCID: PMC9408027 DOI: 10.1093/synbio/ysac012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2021] [Revised: 06/17/2022] [Accepted: 08/22/2022] [Indexed: 11/13/2022]
Abstract
Sequencing technologies, in particular RNASeq, have become critical tools in the design, build, test and learn cycle of synthetic biology. They provide a better understanding of synthetic designs, and they help identify ways to improve and select designs. While these data are beneficial to design, collecting and analyzing them is a complex, multistep process that has implications for both discovery and reproducibility of experiments. Additionally, tool parameters, experimental metadata, normalization of data and standardization of file formats present challenges that are computationally intensive. This calls for high-throughput pipelines expressly designed to handle the combinatorial and longitudinal nature of synthetic biology. In this paper, we present a pipeline to maximize the analytical reproducibility of RNASeq for synthetic biologists. We also explore the impact of reproducibility on the validation of machine learning models. We present the design of a pipeline that combines traditional RNASeq data processing tools with structured metadata tracking to allow for the exploration of the combinatorial design in a high-throughput and reproducible manner. We then demonstrate utility via two different experiments: a control comparison experiment and a machine learning model experiment. The first experiment compares datasets collected from identical biological controls across multiple days for two different organisms. It shows that a reproducible experimental protocol for one organism does not guarantee reproducibility in another. The second experiment quantifies the differences in experimental runs from multiple perspectives. It shows that the lack of reproducibility from these different perspectives can place an upper bound on the validation of machine learning models trained on RNASeq data.
Affiliation(s)
- Benjamin J Garcia
  - Department of Biological Engineering, Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA, USA
- Joshua Urrutia
  - Texas Advanced Computing Center, University of Texas at Austin, Austin, TX, USA
- Alexander Cristofaro
  - Department of Biological Engineering, Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA, USA
- Niall Gaffney
  - Texas Advanced Computing Center, University of Texas at Austin, Austin, TX, USA
- Matthew Vaughn
  - Texas Advanced Computing Center, University of Texas at Austin, Austin, TX, USA
- Uma Saxena
  - Department of Biological Engineering, Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA, USA
- D Benjamin Gordon
  - Department of Biological Engineering, Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA, USA
|
16
|
Forero DA, Curioso WH, Patrinos GP. The importance of adherence to international standards for depositing open data in public repositories. BMC Res Notes 2021; 14:405. [PMID: 34727971 PMCID: PMC8561348 DOI: 10.1186/s13104-021-05817-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/26/2021] [Accepted: 10/22/2021] [Indexed: 12/14/2022] Open
Abstract
There has been substantial global interest in Open Science, which includes open data and methods in addition to open access publications. It has been proposed that public availability of raw data increases the value and the possibility of confirmation of scientific findings, in addition to the potential of reducing research waste. Availability of raw data in open repositories facilitates the adequate development of meta-analyses and the cumulative evaluation of evidence for specific topics. In this commentary, we discuss key elements of data sharing in open repositories and we invite researchers around the world to deposit their data in them.
Affiliation(s)
- Diego A Forero
  - Health and Sport Sciences Research Group, School of Health and Sport Sciences, Fundación Universitaria del Área Andina, Bogotá, Colombia
  - Professional Program in Respiratory Therapy, School of Health and Sport Sciences, Fundación Universitaria del Área Andina, Bogotá, Colombia
- Walter H Curioso
  - Vicerrectorado de Investigación, Universidad Continental, Lima, Peru
- George P Patrinos
  - Department of Pharmacy, School of Health Sciences, University of Patras, Patras, Greece
  - Department of Pathology, College of Medicine and Health Sciences, United Arab Emirates University, Al-Ain, UAE
  - Zayed Center for Health Sciences, United Arab Emirates University, Al-Ain, UAE
|
17
|
Heil BJ, Hoffman MM, Markowetz F, Lee SI, Greene CS, Hicks SC. Reproducibility standards for machine learning in the life sciences. Nat Methods 2021; 18:1132-1135. [PMID: 34462593 PMCID: PMC9131851 DOI: 10.1038/s41592-021-01256-7] [Citation(s) in RCA: 60] [Impact Index Per Article: 20.0] [Indexed: 02/08/2023]
Abstract
To make machine learning analyses in the life sciences more computationally reproducible, we propose standards based on data, model, and code publication, programming best practices, and workflow automation. By meeting these standards, the community of researchers applying machine learning methods in the life sciences can ensure that their analyses are worthy of trust. This article has been peer reviewed.
Affiliation(s)
- Benjamin J Heil
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Michael M Hoffman
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
  - Vector Institute, Toronto, Ontario, Canada
- Florian Markowetz
  - Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK
- Su-In Lee
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Casey S Greene
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, USA
  - Center for Health AI, University of Colorado School of Medicine, Aurora, CO, USA
- Stephanie C Hicks
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
|
18
|
Way GP, Greene CS, Carninci P, Carvalho BS, de Hoon M, Finley SD, Gosline SJC, Lȇ Cao KA, Lee JSH, Marchionni L, Robine N, Sindi SS, Theis FJ, Yang JYH, Carpenter AE, Fertig EJ. A field guide to cultivating computational biology. PLoS Biol 2021; 19:e3001419. [PMID: 34618807 PMCID: PMC8525744 DOI: 10.1371/journal.pbio.3001419] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Revised: 10/19/2021] [Indexed: 11/18/2022] Open
Abstract
Evolving in sync with the computation revolution over the past 30 years, computational biology has emerged as a mature scientific field. While the field has made major contributions toward improving scientific knowledge and human health, individual computational biology practitioners at various institutions often languish in career development. As optimistic biologists passionate about the future of our field, we propose solutions for both eager and reluctant individual scientists, institutions, publishers, funding agencies, and educators to fully embrace computational biology. We believe that in order to pave the way for the next generation of discoveries, we need to improve recognition for computational biologists and better align pathways of career success with pathways of scientific progress. With 10 outlined steps, we call on all adjacent fields to move away from the traditional individual, single-discipline investigator research model and embrace multidisciplinary, data-driven, team science.
Affiliation(s)
- Gregory P. Way
  - Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
  - Center for Health AI, University of Colorado School of Medicine, Aurora, Colorado, United States of America
- Casey S. Greene
  - Center for Health AI, University of Colorado School of Medicine, Aurora, Colorado, United States of America
- Piero Carninci
  - RIKEN Center for Integrative Medical Sciences Yokohama, Kanagawa, Japan
  - Human Technopole, Milan, Italy
- Benilton S. Carvalho
  - Department of Statistics, Institute of Mathematics, Statistics and Scientific Computing, University of Campinas, Campinas, Brazil
- Michiel de Hoon
  - RIKEN Center for Integrative Medical Sciences Yokohama, Kanagawa, Japan
- Stacey D. Finley
  - Department of Biomedical Engineering, Quantitative and Computational Biology, and Chemical Engineering & Materials Science, University of Southern California, Los Angeles, California, United States of America
- Sara J. C. Gosline
  - Pacific Northwest National Laboratory, Seattle, Washington, United States of America
- Kim-Anh Lȇ Cao
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, The University of Melbourne, Melbourne, Australia
- Jerry S. H. Lee
  - Ellison Institute and Departments of Medicine/Oncology, Chemical Engineering, and Material Sciences, University of Southern California, Los Angeles, California, United States of America
- Luigi Marchionni
  - Department of Pathology and Laboratory Medicine, Weill-Cornell Medicine, New York, New York, United States of America
- Nicolas Robine
  - Computational Biology Lab, New York Genome Center, New York, New York, United States of America
- Suzanne S. Sindi
  - Department of Applied Mathematics, University of California Merced, Merced, California, United States of America
- Fabian J. Theis
  - Institute of Computational Biology, Helmholtz Center Munich and Department of Mathematics, Technical University of Munich, Munich, Germany
- Jean Y. H. Yang
  - Charles Perkins Centre and School of Mathematics and Statistics, The University of Sydney, Australia
- Anne E. Carpenter
  - Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Elana J. Fertig
  - Convergence Institute, Departments of Oncology, Biomedical Engineering, and Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland, United States of America
|
19
|
Abstract
The volume of proteomics and mass spectrometry data available in public repositories continues to grow at a rapid pace as more researchers embrace open science practices. Open access to the data behind scientific discoveries has become critical to validate published findings and develop new computational tools. Here, we present ppx, a Python package that provides easy, programmatic access to the data stored in ProteomeXchange repositories, such as PRIDE and MassIVE. The ppx package can be used as either a command line tool or a Python package to retrieve the files and metadata associated with a project when provided its identifier. To demonstrate how ppx enhances reproducible research, we used ppx within a Snakemake workflow to reanalyze a published data set with the open modification search tool ANN-SoLo and compared our reanalysis to the original results. We show that ppx readily integrates into workflows, and our reanalysis produced results consistent with the original analysis. We envision that ppx will be a valuable tool for creating reproducible analyses, providing tool developers easy access to data for development, testing, and benchmarking, and enabling the use of mass spectrometry data in data-intensive analyses. The ppx package is freely available and open source under the MIT license at https://github.com/wfondrie/ppx.
Affiliation(s)
- William E Fondrie
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Wout Bittremieux
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, USA
  - Department of Computer Science, University of Antwerp, Antwerp, Belgium
- William S Noble
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
|