1
|
Making your raw data available to the macromolecular crystallography community. Acta Crystallogr F Struct Biol Commun 2023; 79:267-273. [PMID: 37815476 PMCID: PMC10565795 DOI: 10.1107/s2053230x23007987] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Accepted: 09/12/2023] [Indexed: 10/11/2023] Open
Abstract
A recent editorial in the IUCr macromolecular crystallography journals [Helliwell et al. (2019), Acta Cryst. D75, 455-457] called for the implementation of the FAIR data principles. This implies that the authors of a paper that describes research on a macromolecular structure should make their raw diffraction data available. Authors are already used to submitting the derived data (coordinates) and the processed data (structure factors, merged or unmerged) to the PDB, but may still be uncomfortable with making the raw diffraction images available. In this paper, some guidelines and instructions on depositing raw data to Zenodo are given.
|
2
|
Big data in contemporary electron microscopy: challenges and opportunities in data transfer, compute and management. Histochem Cell Biol 2023; 160:169-192. [PMID: 37052655 PMCID: PMC10492738 DOI: 10.1007/s00418-023-02191-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/21/2023] [Indexed: 04/14/2023]
Abstract
The second decade of the twenty-first century witnessed a new challenge in the handling of microscopy data. Big data, data deluge, large data, data compliance, data analytics, data integrity, data interoperability, data retention and data lifecycle are terms that have introduced themselves to the electron microscopy sciences. This is largely attributed to the booming development of new microscopy hardware tools. As a result, large digital image files with an average size of one terabyte within one single acquisition session are not uncommon nowadays, especially in the field of cryogenic electron microscopy. This brings along numerous challenges in data transfer, compute and management. In this review, we will discuss in detail the current state of international knowledge on big data in contemporary electron microscopy and how big data can be transferred, computed and managed efficiently and sustainably. Workflows, solutions, approaches and suggestions will be provided, with the example of the latest experiences in Australia. Finally, important principles such as data integrity, data lifetime and the FAIR and CARE principles will be considered.
|
3
|
FACT and FAIR with Big Data allows objectivity in science: The view of crystallography. Struct Dyn 2019; 6:054306. [PMID: 31673568 PMCID: PMC6816445 DOI: 10.1063/1.5124439] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 08/15/2019] [Accepted: 10/09/2019] [Indexed: 05/14/2023]
Abstract
A publication is an important narrative of the work done and interpretations made by researchers securing a scientific discovery. As The Royal Society neatly states, though, "Nullius in verba" ("Take nobody's word for it"), whereby the role of the underpinning data is paramount. The objectivity that preserving that data within the article provides is therefore due to readers being able to check the calculation decisions of the authors. But how to achieve full data archiving? This is the raw data archiving challenge, in size and need for correct metadata. Processed diffraction data and final derived molecular coordinates archiving in crystallography have achieved an exemplary state of the art relative to most fields. One can credit IUCr with developing exemplary peer review procedures, of narrative, underpinning structure factors and coordinate data and validation report, through its checkcif development and submission system introduced for Acta Cryst. C and subsequently developed for its other chemistry journals. The crystallographic databases likewise have achieved amazing success and sustainability these last 50 years or so. The wider science data scene is celebrating the FAIR data accord, namely, that data be Findable, Accessible, Interoperable, and Reusable [Wilkinson et al., "Comment: The FAIR guiding principles for scientific data management and stewardship," Sci. Data 3, 160018 (2016)]. Some social scientists also emphasize that more than FAIR is needed: the data should be "FACT," an acronym meaning Fair, Accurate, Confidential, and Transparent [van der Aalst et al., "Responsible data science," Bus Inf. Syst. Eng. 59(5), 311-313 (2017)], this being the issue of ensuring reproducibility, not just reusability. (Confidentiality is obviously unlikely to be relevant to our data.) Acta Cryst. B, C, E, and IUCrData are the closest I know to being both FACT and FAIR where, I repeat for due emphasis, the narrative, the automatic "general" validation checks, and the underpinning data are checked thoroughly by subject specialists (i.e., the specialist referees). IUCr Journals are also the best that I know of for encouraging and then expediting the citation of the DOI for a raw diffraction dataset in a publication; examples can be found in IUCrJ, Acta Cryst D, and Acta Cryst F. The wish for a checkcif for raw diffraction data has been championed by the IUCr Diffraction Data Deposition Working Group and its successor, the IUCr Committee on Data.
|
4
|
Synchrotron Big Data Science. Small 2018; 14:e1802291. [PMID: 30222245 DOI: 10.1002/smll.201802291] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 06/18/2018] [Revised: 07/27/2018] [Indexed: 06/08/2023]
Abstract
The rapid development of synchrotrons has massively increased the speed at which experiments can be performed, while new techniques have increased the amount of raw data collected during each experiment. While this has created enormous new opportunities, it has also created tremendous challenges for national facilities and users. With the huge increase in data volume, the manual analysis of data is no longer possible. As a result, only a fraction of the data collected during the time- and money-expensive synchrotron beam-time is analyzed and used to deliver new science. Additionally, the lack of an appropriate data analysis environment limits the realization of experiments that generate a large amount of data in a very short period of time. The current lack of automated data analysis pipelines prevents the fine-tuning of beam-time experiments, further reducing their potential usage. These effects, collectively known as the "data deluge," affect synchrotrons in several different ways including fast data collection, available local storage, data management systems, and curation of the data. This review highlights the Big Data strategies adopted nowadays at synchrotrons, documenting this novel and promising hybridization between science and technology, which promises a dramatic increase in the number of scientific discoveries.
|
5
|
DA+ data acquisition and analysis software at the Swiss Light Source macromolecular crystallography beamlines. J Synchrotron Radiat 2018; 25:293-303. [PMID: 29271779 PMCID: PMC5741135 DOI: 10.1107/s1600577517014503] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 05/04/2017] [Accepted: 10/08/2017] [Indexed: 05/19/2023]
Abstract
Data acquisition software is an essential component of modern macromolecular crystallography (MX) beamlines, enabling efficient use of beam time at synchrotron facilities. Developed at the Paul Scherrer Institute, the DA+ data acquisition software is implemented at all three Swiss Light Source (SLS) MX beamlines. DA+ consists of distributed services and components written in Python and Java, which communicate via messaging and streaming technologies. The major components of DA+ are the user interface, acquisition engine, online processing and database. Immediate data quality feedback is achieved with distributed automatic data analysis routines. The software architecture enables exploration of the full potential of the latest instrumentation at the SLS MX beamlines, such as the SmarGon goniometer and the EIGER X 16M detector, and development of new data collection methods.
|
6
|
Abstract
Understanding published research results should be through one's own eyes and include the opportunity to work with raw diffraction data to check the various decisions made in the analyses by the original authors. Today, preserving raw diffraction data is technically and organizationally viable at a growing number of data archives, both centralized and distributed, which are empowered to register data sets and obtain a preservation descriptor, typically a 'digital object identifier'. This introduces an important role of preserving raw data, namely understanding where we fail in or could improve our analyses. Individual science area case studies in crystallography are provided.
|
7
|
Databases, Repositories, and Other Data Resources in Structural Biology. Methods Mol Biol 2017; 1607:643-665. [PMID: 28573593 DOI: 10.1007/978-1-4939-7000-1_27] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 12/26/2022]
Abstract
Structural biology, like many other areas of modern science, produces an enormous amount of primary, derived, and "meta" data with a high demand on data storage and manipulations. Primary data come from various steps of sample preparation, diffraction experiments, and functional studies. These data are not only used to obtain tangible results, like macromolecular structural models, but also to enrich and guide our analysis and interpretation of various biomedical problems. Herein we define several categories of data resources, (a) Archives, (b) Repositories, (c) Databases, and (d) Advanced Information Systems, that can accommodate primary, derived, or reference data. Data resources may be used either as web portals or internally by structural biology software. To be useful, each resource must be maintained, curated, as well as integrated with other resources. Ideally, the system of interconnected resources should evolve toward comprehensive "hubs", or Advanced Information Systems. Such systems, encompassing the PDB and UniProt, are indispensable not only for structural biology, but for many related fields of science. The categories of data resources described herein are applicable well beyond our usual scientific endeavors.
|
8
|
Abstract
Macromolecular Big Data provide numerous challenges, and a number of initiatives that are starting to overcome these issues are discussed.
|
9
|
Raw diffraction data preservation and reuse: overview, update on practicalities and metadata requirements. IUCrJ 2017; 4:87-99. [PMID: 28250944 PMCID: PMC5331468 DOI: 10.1107/s2052252516018315] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 08/11/2016] [Accepted: 11/15/2016] [Indexed: 05/20/2023]
Abstract
A topical review is presented of the rapidly developing interest in and storage options for the preservation and reuse of raw data within the scientific domain of the IUCr and its Commissions, each of which operates within a great diversity of instrumentation. A résumé is included of the case for raw diffraction data deposition. An overall context is set by highlighting the initiatives of science policy makers towards an 'Open Science' model within which crystallographers will increasingly work in the future; this will bring new funding opportunities but also new codes of procedure within open science frameworks. Skills education and training for crystallographers will need to be expanded. Overall, there are now the means and the organization for the preservation of raw crystallographic diffraction data via different types of archive, such as at universities, discipline-specific repositories (Integrated Resource for Reproducibility in Macromolecular Crystallography, Structural Biology Data Grid), general public data repositories (Zenodo, ResearchGate) and centralized neutron and X-ray facilities. Formulation of improved metadata descriptors for the raw data types of each of the IUCr Commissions is in progress; some detailed examples are provided. A number of specific case studies are presented, including an example research thread that provides complete open access to raw data.
|
10
|
A public database of macromolecular diffraction experiments. Acta Crystallogr D Struct Biol 2016; 72:1181-1193. [PMID: 27841751 PMCID: PMC5108346 DOI: 10.1107/s2059798316014716] [Citation(s) in RCA: 91] [Impact Index Per Article: 11.4] [Received: 06/23/2016] [Accepted: 09/17/2016] [Indexed: 12/28/2022] Open
Abstract
The low reproducibility of published experimental results in many scientific disciplines has recently garnered negative attention in scientific journals and the general media. Public transparency, including the availability of 'raw' experimental data, will help to address growing concerns regarding scientific integrity. Macromolecular X-ray crystallography has led the way in requiring the public dissemination of atomic coordinates and a wealth of experimental data, making the field one of the most reproducible in the biological sciences. However, there remains no mandate for public disclosure of the original diffraction data. The Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC) has been developed to archive raw data from diffraction experiments and, equally importantly, to provide related metadata. Currently, the database of our resource contains data from 2920 macromolecular diffraction experiments (5767 data sets), accounting for around 3% of all depositions in the Protein Data Bank (PDB), with their corresponding partially curated metadata. IRRMC utilizes distributed storage implemented using a federated architecture of many independent storage servers, which provides both scalability and sustainability. The resource, which is accessible via the web portal at http://www.proteindiffraction.org, can be searched using various criteria. All data are available for unrestricted access and download. The resource serves as a proof of concept and demonstrates the feasibility of archiving raw diffraction data and associated metadata from X-ray crystallographic studies of biological macromolecules. The goal is to expand this resource and include data sets that failed to yield X-ray structures in order to facilitate collaborative efforts that will improve protein structure-determination methods and to ensure the availability of 'orphan' data left behind for various reasons by individual investigators and/or extinct structural genomics projects.
|
11
|
Data publication with the structural biology data grid supports live analysis. Nat Commun 2016; 7:10882. [PMID: 26947396 PMCID: PMC4786681 DOI: 10.1038/ncomms10882] [Citation(s) in RCA: 93] [Impact Index Per Article: 11.6] [Received: 10/16/2015] [Accepted: 01/28/2016] [Indexed: 11/26/2022] Open
Abstract
Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. It is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.
|
12
|
Safeguarding Structural Data Repositories against Bad Apples. Structure 2016; 24:216-20. [PMID: 26840827 PMCID: PMC4743038 DOI: 10.1016/j.str.2015.12.010] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 10/20/2015] [Revised: 12/15/2015] [Accepted: 12/16/2015] [Indexed: 11/17/2022]
Abstract
Structural biology research generates large amounts of data, some deposited in public databases or repositories, but a substantial remainder never becomes available to the scientific community. In addition, some of the deposited data contain more or less serious errors that may bias the results of data mining. Thorough analysis and discussion of these problems is needed to ameliorate this situation. This perspective is an attempt to propose some solutions and encourage both further discussion and action on the part of the relevant organizations, in particular the PDB and various bodies of the International Union of Crystallography.
|
13
|
MicroED data collection and processing. Acta Crystallogr A Found Adv 2015; 71:353-60. [PMID: 26131894 PMCID: PMC4487423 DOI: 10.1107/s2053273315010669] [Citation(s) in RCA: 94] [Impact Index Per Article: 10.4] [Received: 12/11/2014] [Accepted: 06/02/2015] [Indexed: 11/30/2022]
Abstract
The collection and processing of MicroED data are presented. MicroED, a method at the intersection of X-ray crystallography and electron cryo-microscopy, has rapidly progressed by exploiting advances in both fields and has already been successfully employed to determine the atomic structures of several proteins from sub-micron-sized, three-dimensional crystals. A major limiting factor in X-ray crystallography is the requirement for large and well ordered crystals. By permitting electron diffraction patterns to be collected from much smaller crystals, or even single well ordered domains of large crystals composed of several small mosaic blocks, MicroED has the potential to overcome the limiting size requirement and enable structural studies on difficult-to-crystallize samples. This communication details the steps for sample preparation, data collection and reduction necessary to obtain refined, high-resolution, three-dimensional models by MicroED, and presents some of its unique challenges.
|
14
|
The structure of human SFPQ reveals a coiled-coil mediated polymer essential for functional aggregation in gene regulation. Nucleic Acids Res 2015; 43:3826-40. [PMID: 25765647 PMCID: PMC4402515 DOI: 10.1093/nar/gkv156] [Citation(s) in RCA: 98] [Impact Index Per Article: 10.9] [Received: 01/29/2015] [Accepted: 02/18/2015] [Indexed: 12/14/2022] Open
Abstract
SFPQ (a.k.a. PSF) is a human tumor suppressor protein that regulates many important functions in the cell nucleus including coordination of long non-coding RNA molecules into nuclear bodies. Here we describe the first crystal structures of Splicing Factor Proline and Glutamine Rich (SFPQ), revealing structural similarity to the related PSPC1/NONO heterodimer and a strikingly extended structure (over 265 Å long) formed by an unusual anti-parallel coiled-coil that results in an infinite linear polymer of SFPQ dimers within the crystals. Small-angle X-ray scattering and transmission electron microscopy experiments show that polymerization is reversible in solution and can be templated by DNA. We demonstrate that the ability to polymerize is essential for the cellular functions of SFPQ: disruptive mutation of the coiled-coil interaction motif results in SFPQ mislocalization, reduced formation of nuclear bodies, abrogated molecular interactions and deficient transcriptional regulation. The coiled-coil interaction motif thus provides a molecular explanation for the functional aggregation of SFPQ that directs its role in regulating many aspects of cellular nucleic acid metabolism.
|
15
|
The design and structural characterization of a synthetic pentatricopeptide repeat protein. Acta Crystallogr D Biol Crystallogr 2015; 71:196-208. [PMID: 25664731 DOI: 10.1107/s1399004714024869] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Received: 06/05/2014] [Accepted: 11/12/2014] [Indexed: 11/10/2022]
Abstract
Proteins of the pentatricopeptide repeat (PPR) superfamily are characterized by tandem arrays of a degenerate 35-amino-acid α-hairpin motif. PPR proteins are typically single-stranded RNA-binding proteins with essential roles in organelle biogenesis, RNA editing and mRNA maturation. A modular, predictable code for sequence-specific binding of RNA by PPR proteins has recently been revealed, which opens the door to the de novo design of bespoke proteins with specific RNA targets, with widespread biotechnological potential. Here, the design and production of a synthetic PPR protein based on a consensus sequence and the determination of its crystal structure to 2.2 Å resolution are described. The crystal structure displays helical disorder, resulting in electron density representing an infinite superhelical PPR protein. A structural comparison with related tetratricopeptide repeat (TPR) proteins, and with native PPR proteins, reveals key roles for conserved residues in directing the structure and function of PPR proteins. The designed proteins have high solubility and thermal stability, and can form long tracts of PPR repeats. Thus, consensus-sequence synthetic PPR proteins could provide a suitable backbone for the design of bespoke RNA-binding proteins with the potential for high specificity.
|
16
|
Two-Pronged Attack: Dual Inhibition of Plasmodium falciparum M1 and M17 Metalloaminopeptidases by a Novel Series of Hydroxamic Acid-Based Inhibitors. J Med Chem 2014; 57:9168-83. [DOI: 10.1021/jm501323a] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.4] [Indexed: 11/29/2022]
|
17
|
How to make deposition of images a reality. Acta Crystallogr D Biol Crystallogr 2014; 70:2520-32. [PMID: 25286838 PMCID: PMC4188000 DOI: 10.1107/s1399004714005185] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Received: 10/24/2013] [Accepted: 03/06/2014] [Indexed: 11/24/2022]
Abstract
The IUCr Diffraction Data Deposition Working Group is investigating the rationale and policies for routine deposition of diffraction images (and other primary experimental data sets). An information-management framework is described that should inform policy directions, and some of the technical and other issues that need to be addressed in an effort to achieve such a goal are analysed. In the near future, routine data deposition could be encouraged at one of the growing number of institutional repositories that accept data sets or at a generic data-publishing web repository service. To realise all of the potential benefits of depositing diffraction data, specialized archives would be preferable. Funding such an initiative will be challenging.
|