1
|
Johnston LR, Hofelich Mohr A, Herndon J, Taylor S, Carlson JR, Ge L, Moore J, Petters J, Kozlowski W, Hudson Vitale C. Seek and you may (not) find: A multi-institutional analysis of where research data are shared. PLoS One 2024; 19:e0302426. [PMID: 38662676 PMCID: PMC11045069 DOI: 10.1371/journal.pone.0302426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2023] [Accepted: 04/04/2024] [Indexed: 04/28/2024] Open
Abstract
Research data sharing has become an expected component of scientific research and scholarly publishing practice over the last few decades, due in part to requirements for federally funded research. As part of a larger effort to better understand the workflows and costs of public access to research data, this project conducted a high-level analysis of where academic research data is most frequently shared. To do this, we leveraged the DataCite and Crossref application programming interfaces (APIs) in search of Publisher field elements demonstrating which data repositories were utilized by researchers from six academic research institutions between 2012-2022. In addition, we also ran a preliminary analysis of the quality of the metadata associated with these published datasets, comparing the extent to which information was missing from metadata fields deemed important for public access to research data. Results show that the top 10 publishers accounted for 89.0% to 99.8% of the datasets connected with the institutions in our study. Known data repositories, including institutional data repositories hosted by those institutions, were initially lacking from our sample due to varying metadata standards and practices. We conclude that the metadata quality landscape for published research datasets is uneven; key information, such as author affiliation, is often incomplete or missing from source data repositories and aggregators. To enhance the findability, interoperability, accessibility, and reusability (FAIRness) of research data, we provide a set of concrete recommendations that repositories and data authors can take to improve scholarly metadata associated with shared datasets.
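The two measurements this abstract reports (the share of datasets held by the top publishers, and how often affiliation metadata is missing) can be illustrated with a small offline tally. This is only a sketch under stated assumptions: the sample records, the field names `publisher`, `creators`, and `affiliation`, and both helper functions are hypothetical and are not taken from the study's code or from any specific API response.

```python
from collections import Counter

# Hypothetical sample records, loosely shaped like dataset metadata from an
# aggregator. Field names and values are assumptions for illustration only.
records = [
    {"publisher": "Dryad", "creators": [{"name": "A", "affiliation": ["Univ X"]}]},
    {"publisher": "Zenodo", "creators": [{"name": "B", "affiliation": []}]},
    {"publisher": "Dryad", "creators": [{"name": "C", "affiliation": ["Univ X"]}]},
    {"publisher": "Figshare", "creators": [{"name": "D", "affiliation": []}]},
]


def top_publisher_share(records, n=10):
    """Return the fraction of records held by the n most common publishers."""
    if not records:
        return 0.0
    counts = Counter(r.get("publisher") or "(missing)" for r in records)
    top = sum(count for _, count in counts.most_common(n))
    return top / len(records)


def missing_affiliation_rate(records):
    """Return the fraction of records in which no creator lists an affiliation."""
    if not records:
        return 0.0

    def has_affiliation(record):
        return any(c.get("affiliation") for c in record.get("creators", []))

    missing = sum(1 for r in records if not has_affiliation(r))
    return missing / len(records)


print(top_publisher_share(records, n=1))
print(missing_affiliation_rate(records))
```

In a real analysis the records would come from the aggregators' APIs rather than a literal list, and a record with no affiliation would be invisible to an affiliation-based search, which is the gap the abstract describes.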
Affiliation(s)
- Lisa R. Johnston
- Data, Academic Planning & Institutional Research, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
| | - Alicia Hofelich Mohr
- Liberal Arts Technologies and Innovation Services, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Joel Herndon
- Center for Data and Visualization Sciences, Duke University Libraries, Duke University, Durham, North Carolina, United States of America
| | - Shawna Taylor
- Association of Research Libraries, Washington, D.C., United States of America
| | - Jake R. Carlson
- University at Buffalo Libraries, University at Buffalo, Buffalo, New York, United States of America
| | - Lizhao Ge
- Milken Institute School of Public Health, George Washington University, Washington, D.C., United States of America
| | - Jennifer Moore
- University Libraries, Washington University in St. Louis, St. Louis, Missouri, United States of America
| | - Jonathan Petters
- Data Services, University Libraries, Virginia Tech, Blacksburg, Virginia, United States of America
| | - Wendy Kozlowski
- Research Data and Open Scholarship, Cornell University Library, Cornell University, Ithaca, New York, United States of America
| | | |
|
2
|
Claussnitzer M, Parikh VN, Wagner AH, Arbesfeld JA, Bult CJ, Firth HV, Muffley LA, Nguyen Ba AN, Riehle K, Roth FP, Tabet D, Bolognesi B, Glazer AM, Rubin AF. Minimum information and guidelines for reporting a multiplexed assay of variant effect. Genome Biol 2024; 25:100. [PMID: 38641812 PMCID: PMC11027375 DOI: 10.1186/s13059-024-03223-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2023] [Accepted: 03/25/2024] [Indexed: 04/21/2024] Open
Abstract
Multiplexed assays of variant effect (MAVEs) have emerged as a powerful approach for interrogating thousands of genetic variants in a single experiment. The flexibility and widespread adoption of these techniques across diverse disciplines have led to a heterogeneous mix of data formats and descriptions, which complicates the downstream use of the resulting datasets. To address these issues and promote reproducibility and reuse of MAVE data, we define a set of minimum information standards for MAVE data and metadata and outline a controlled vocabulary aligned with established biomedical ontologies for describing these experimental designs.
Affiliation(s)
- Melina Claussnitzer
- The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Harvard Medical School, Cambridge, MA, 02142, USA
| | - Victoria N Parikh
- Stanford Center for Inherited Cardiovascular Disease, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | - Alex H Wagner
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, 43215, USA
- Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, 43210, USA
| | - Jeremy A Arbesfeld
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, 43215, USA
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA
| | - Carol J Bult
- The Jackson Laboratory, Bar Harbor, ME, 04609, USA
| | - Helen V Firth
- Wellcome Sanger Institute, Hinxton, Cambridge, UK
- Dept of Medical Genetics, Cambridge University Hospitals NHS Trust, Cambridge, UK
| | - Lara A Muffley
- Department of Genome Sciences, University of Washington, Seattle, WA, 98105, USA
| | - Alex N Nguyen Ba
- Department of Biology, University of Toronto at Mississauga, Mississauga, ON, Canada
| | - Kevin Riehle
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Frederick P Roth
- Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
| | - Daniel Tabet
- Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
| | - Benedetta Bolognesi
- Institute for Bioengineering of Catalunya (IBEC), The Barcelona Institute of Science and Technology, Barcelona, Spain.
| | - Andrew M Glazer
- Vanderbilt University Medical Center, Nashville, TN, 37232, USA.
| | - Alan F Rubin
- Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia.
- Department of Medical Biology, University of Melbourne, Parkville, VIC, Australia.
| |
|
3
|
Athanasoulias S, Guasselli F, Doulamis N, Doulamis A, Ipiotis N, Katsari A, Stankovic L, Stankovic V. The Plegma dataset: Domestic appliance-level and aggregate electricity demand with metadata from Greece. Sci Data 2024; 11:376. [PMID: 38609400 PMCID: PMC11014970 DOI: 10.1038/s41597-024-03208-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2023] [Accepted: 04/02/2024] [Indexed: 04/14/2024] Open
Abstract
The growing availability of smart meter data has facilitated the development of energy-saving services like demand response, personalized energy feedback, and non-intrusive-load-monitoring applications, all of which heavily rely on advanced machine learning algorithms trained on energy consumption datasets. To ensure the accuracy and reliability of these services, real-world smart meter data collection is crucial. The Plegma dataset described in this paper addresses this need by providing whole-house aggregate loads and appliance-level consumption measurements at 10-second intervals from 13 different households over a period of one year. It also includes environmental data such as humidity and temperature, building characteristics, demographic information, and user practice routines to enable quantitative as well as qualitative analysis. Plegma is the first high-frequency electricity measurements dataset in Greece, capturing the consumption behavior of people in the Mediterranean area who use devices not commonly included in other datasets, such as AC and electric-water boilers. The dataset comprises 218 million readings from 88 installed meters and sensors. The collected data are available in CSV format.
Affiliation(s)
- Sotirios Athanasoulias
- National Technical University of Athens, School of Rural, Surveying and Geoinformatics Engineering, Athens, 157 80, Greece.
- Plegma Labs, Marousi, 151 24, Greece.
| | - Fernanda Guasselli
- Aalborg University, Department of the Built Environment, Copenhagen, 2450, Denmark
| | - Nikolaos Doulamis
- National Technical University of Athens, School of Rural, Surveying and Geoinformatics Engineering, Athens, 157 80, Greece
| | - Anastasios Doulamis
- National Technical University of Athens, School of Rural, Surveying and Geoinformatics Engineering, Athens, 157 80, Greece
| | | | | | - Lina Stankovic
- University of Strathclyde, Department of Electronic and Electrical Engineering, Glasgow, G1 1XQ, UK
| | - Vladimir Stankovic
- University of Strathclyde, Department of Electronic and Electrical Engineering, Glasgow, G1 1XQ, UK
| |
|
4
|
Sittinger M, Uhler J, Pink M, Herz A. Insect detect: An open-source DIY camera trap for automated insect monitoring. PLoS One 2024; 19:e0295474. [PMID: 38568922 PMCID: PMC10990185 DOI: 10.1371/journal.pone.0295474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 02/28/2024] [Indexed: 04/05/2024] Open
Abstract
Insect monitoring is essential to design effective conservation strategies, which are indispensable to mitigate worldwide declines and biodiversity loss. For this purpose, traditional monitoring methods are widely established and can provide data with a high taxonomic resolution. However, processing of captured insect samples is often time-consuming and expensive, which limits the number of potential replicates. Automated monitoring methods can facilitate data collection at a higher spatiotemporal resolution with a comparatively lower effort and cost. Here, we present the Insect Detect DIY (do-it-yourself) camera trap for non-invasive automated monitoring of flower-visiting insects, which is based on low-cost off-the-shelf hardware components combined with open-source software. Custom trained deep learning models detect and track insects landing on an artificial flower platform in real time on-device and subsequently classify the cropped detections on a local computer. Field deployment of the solar-powered camera trap confirmed its resistance to high temperatures and humidity, which enables autonomous deployment during a whole season. On-device detection and tracking can estimate insect activity/abundance after metadata post-processing. Our insect classification model achieved a high top-1 accuracy on the test dataset and generalized well on a real-world dataset with captured insect images. The camera trap design and open-source software are highly customizable and can be adapted to different use cases. With custom trained detection and classification models, as well as accessible software programming, many possible applications surpassing our proposed deployment method can be realized.
Affiliation(s)
- Maximilian Sittinger
- Julius Kühn Institute (JKI)—Federal Research Centre for Cultivated Plants, Institute for Biological Control, Dossenheim, Germany
| | - Johannes Uhler
- Julius Kühn Institute (JKI)—Federal Research Centre for Cultivated Plants, Institute for Biological Control, Dossenheim, Germany
| | - Maximilian Pink
- Julius Kühn Institute (JKI)—Federal Research Centre for Cultivated Plants, Institute for Biological Control, Dossenheim, Germany
| | - Annette Herz
- Julius Kühn Institute (JKI)—Federal Research Centre for Cultivated Plants, Institute for Biological Control, Dossenheim, Germany
| |
|
5
|
Sivagnanam S, Yeu S, Lin K, Sakai S, Garzon F, Yoshimoto K, Prantzalos K, Upadhyaya DP, Majumdar A, Sahoo SS, Lytton WW. Towards building a trustworthy pipeline integrating Neuroscience Gateway and Open Science Chain. Database (Oxford) 2024; 2024:baae023. [PMID: 38581360 PMCID: PMC10998337 DOI: 10.1093/database/baae023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Revised: 02/22/2024] [Accepted: 03/11/2024] [Indexed: 04/08/2024]
Abstract
When the scientific dataset evolves or is reused in workflows creating derived datasets, the integrity of the dataset with its metadata information, including provenance, needs to be securely preserved while providing assurances that they are not accidentally or maliciously altered during the process. Providing a secure method to efficiently share and verify the data as well as metadata is essential for the reuse of the scientific data. The National Science Foundation (NSF) funded Open Science Chain (OSC) utilizes consortium blockchain to provide a cyberinfrastructure solution to maintain integrity of the provenance metadata for published datasets and provides a way to perform independent verification of the dataset while promoting reuse and reproducibility. The NSF- and National Institutes of Health (NIH)-funded Neuroscience Gateway (NSG) provides a freely available web portal that allows neuroscience researchers to execute computational data analysis pipeline on high performance computing resources. Combined, the OSC and NSG platforms form an efficient, integrated framework to automatically and securely preserve and verify the integrity of the artifacts used in research workflows while using the NSG platform. This paper presents the results of the first study that integrates OSC-NSG frameworks to track the provenance of neurophysiological signal data analysis to study brain network dynamics using the Neuro-Integrative Connectivity tool, which is deployed in the NSG platform. Database URL: https://www.opensciencechain.org.
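The integrity idea in this abstract, recording a fingerprint of a dataset together with its provenance metadata so that later alteration can be detected, can be sketched with a simple hash chain. This is a deliberately simplified stand-in and not the OSC implementation: OSC is described as using a consortium blockchain, whereas the sketch below is a single in-memory list, and every function name and record field in it is hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def fingerprint(data: bytes, metadata: dict) -> str:
    """Hash the dataset bytes together with canonically serialized metadata."""
    meta_bytes = json.dumps(metadata, sort_keys=True).encode("utf-8")
    h = hashlib.sha256()
    h.update(hashlib.sha256(data).digest())
    h.update(hashlib.sha256(meta_bytes).digest())
    return h.hexdigest()


def append_entry(ledger: list, data: bytes, metadata: dict) -> str:
    """Append a fingerprint to the ledger, chained to the previous entry."""
    prev_hash = ledger[-1]["entry_hash"] if ledger else GENESIS
    fp = fingerprint(data, metadata)
    entry_hash = hashlib.sha256((prev_hash + fp).encode("ascii")).hexdigest()
    ledger.append({"prev_hash": prev_hash, "fingerprint": fp, "entry_hash": entry_hash})
    return entry_hash


def verify_ledger(ledger: list) -> bool:
    """Check that no ledger entry was altered or reordered after the fact."""
    prev_hash = GENESIS
    for entry in ledger:
        expected = hashlib.sha256((prev_hash + entry["fingerprint"]).encode("ascii")).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
            return False
        prev_hash = entry["entry_hash"]
    return True


def verify_artifact(ledger: list, index: int, data: bytes, metadata: dict) -> bool:
    """Check that a dataset and its metadata still match the recorded fingerprint."""
    return ledger[index]["fingerprint"] == fingerprint(data, metadata)


ledger = []
append_entry(ledger, b"raw signal data", {"version": 1})
append_entry(ledger, b"derived results", {"version": 2, "derived_from": 0})
print(verify_ledger(ledger))
print(verify_artifact(ledger, 0, b"raw signal data", {"version": 1}))
```

Because each entry hash depends on the previous one, changing an earlier fingerprint invalidates the chain from that point on. A distributed ledger adds independent verification by multiple parties, which this single-process sketch does not attempt.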
Affiliation(s)
- S Sivagnanam
- San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
- Biomedical Engineering, SUNY Downstate Health Sciences University, 450 Clarkson Avenue, Brooklyn, NY 11203, USA
| | - S Yeu
- San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
| | - K Lin
- San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
| | - S Sakai
- San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
| | - F Garzon
- San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
| | - K Yoshimoto
- San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
| | - K Prantzalos
- School of Medicine, Case Western University, 9501 Euclid Ave, Cleveland, OH 44106, USA
| | - D P Upadhyaya
- School of Medicine, Case Western University, 9501 Euclid Ave, Cleveland, OH 44106, USA
| | - A Majumdar
- San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
| | - S S Sahoo
- School of Medicine, Case Western University, 9501 Euclid Ave, Cleveland, OH 44106, USA
| | - W W Lytton
- Biomedical Engineering, SUNY Downstate Health Sciences University, 450 Clarkson Avenue, Brooklyn, NY 11203, USA
| |
|
6
|
Foer D, Rubins DM, Nguyen V, McDowell A, Quint M, Kellaway M, Reisner SL, Zhou L, Bates DW. Utilization of electronic health record sex and gender demographic fields: a metadata and mixed methods analysis. J Am Med Inform Assoc 2024; 31:910-918. [PMID: 38308819 PMCID: PMC10990507 DOI: 10.1093/jamia/ocae016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2023] [Revised: 12/12/2023] [Accepted: 01/17/2024] [Indexed: 02/05/2024] Open
Abstract
OBJECTIVES Despite federally mandated collection of sex and gender demographics in the electronic health record (EHR), longitudinal assessments are lacking. We assessed sex and gender demographic field utilization using EHR metadata. MATERIALS AND METHODS Patients ≥18 years of age in the Mass General Brigham health system with a first Legal Sex entry (registration requirement) between January 8, 2018 and January 1, 2022 were included in this retrospective study. Metadata for all sex and gender fields (Legal Sex, Sex Assigned at Birth [SAAB], Gender Identity) were quantified by completion rates, user types, and longitudinal change. A nested qualitative study of providers from specialties with high and low field use identified themes related to utilization. RESULTS 1 576 120 patients met inclusion criteria: 100% had a Legal Sex, 20% a Gender Identity, and 19% a SAAB; 321 185 patients had field changes other than initial Legal Sex entry. About 2% of patients had a subsequent Legal Sex change, and 25% of those had ≥2 changes; 20% of patients had ≥1 update to Gender Identity and 19% to SAAB. Excluding the first Legal Sex entry, administrators made most changes (67%) across all fields, followed by patients (25%), providers (7.2%), and automated Health Level-7 (HL7) interface messages (0.7%). Provider utilization varied by subspecialty; themes related to systems barriers and personal perceptions were identified. DISCUSSION Sex and gender demographic fields are primarily used by administrators and raise concern about data accuracy; provider use is heterogenous and lacking. Provider awareness of field availability and variable workflows may impede use. CONCLUSION EHR metadata highlights areas for improvement of sex and gender field utilization.
Affiliation(s)
- Dinah Foer
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, MA 02115, United States
- Harvard Medical School, Boston, MA 02115, United States
| | - David M Rubins
- Harvard Medical School, Boston, MA 02115, United States
- Mass General Brigham Digital, Somerville, MA 02145, United States
| | - Vi Nguyen
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, MA 02115, United States
| | - Alex McDowell
- Harvard Medical School, Boston, MA 02115, United States
- Health Policy Research Institute, Mongan Institute, Massachusetts General Hospital, Boston, MA 02114, United States
| | - Meg Quint
- Division of Endocrinology, Diabetes and Hypertension, Brigham and Women’s Hospital, Boston, MA 02115, United States
| | - Mitchell Kellaway
- Adult Primary Care, Boston Medical Center, Boston, MA 02118, United States
| | - Sari L Reisner
- Harvard Medical School, Boston, MA 02115, United States
- Division of Endocrinology, Diabetes and Hypertension, Brigham and Women’s Hospital, Boston, MA 02115, United States
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States
| | - Li Zhou
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, MA 02115, United States
- Harvard Medical School, Boston, MA 02115, United States
| | - David W Bates
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, MA 02115, United States
- Harvard Medical School, Boston, MA 02115, United States
| |
|
7
|
Gancz AS, Wright SL, Weyrich LS. Ancient human dental calculus metadata collection and sampling strategies: Recommendations for best practices. Am J Biol Anthropol 2024; 183:e24871. [PMID: 37994571 DOI: 10.1002/ajpa.24871] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Revised: 09/26/2023] [Accepted: 10/18/2023] [Indexed: 11/24/2023]
Abstract
OBJECTIVES Ancient human dental calculus is a unique, nonrenewable biological resource encapsulating key information about the diets, lifestyles, and health conditions of past individuals and populations. With compounding calls for its destructive analysis, it is imperative to refine the ways in which the scientific community documents, samples, and analyzes dental calculus so as to maximize its utility to the public and scientific community. MATERIALS AND METHODS Our research team conducted an IRB-approved survey of dental calculus researchers with diverse academic backgrounds, research foci, and analytical specializations. RESULTS This survey reveals variation in how metadata is collected and utilized across different subdisciplines and highlights how these differences have profound implications for dental calculus research. Moreover, the survey suggests the need for more communication between those who excavate, curate, and analyze biomolecular data from dental calculus. DISCUSSION Challenges in cross-disciplinary communication limit researchers' ability to effectively utilize samples in rigorous and reproducible ways. Specifically, the lack of standardized skeletal and dental metadata recording and contamination avoidance procedures hinders downstream anthropological applications, as well as the pursuit of broader paleodemographic and paleoepidemiological inquiries that rely on more complete information about the individuals sampled. To provide a path forward toward more ethical and standardized dental calculus sampling and documentation approaches, we review the current methods by which skeletal and dental metadata are recorded. We also describe trends in sampling and contamination-control approaches. Finally, we use that information to suggest new guidelines for ancient dental calculus documentation and sampling strategies that will improve research practices in the future.
Affiliation(s)
- Abigail S Gancz
- Department of Anthropology, The Pennsylvania State University, University Park, Pennsylvania, USA
- One Health Microbiome Center, The Pennsylvania State University, University Park, Pennsylvania, USA
| | - Sterling L Wright
- Department of Anthropology, The Pennsylvania State University, University Park, Pennsylvania, USA
- One Health Microbiome Center, The Pennsylvania State University, University Park, Pennsylvania, USA
| | - Laura S Weyrich
- Department of Anthropology, The Pennsylvania State University, University Park, Pennsylvania, USA
- One Health Microbiome Center, The Pennsylvania State University, University Park, Pennsylvania, USA
- Huck Institutes of Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA
- School of Biological Sciences, University of Adelaide, Adelaide, South Australia, Australia
| |
|
8
|
Dall’Amico L, Kleynhans J, Gauvin L, Tizzoni M, Ozella L, Makhasi M, Wolter N, Language B, Wagner RG, Cohen C, Tempia S, Cattuto C. Estimating household contact matrices structure from easily collectable metadata. PLoS One 2024; 19:e0296810. [PMID: 38483886 PMCID: PMC10939291 DOI: 10.1371/journal.pone.0296810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 12/18/2023] [Indexed: 03/17/2024] Open
Abstract
Contact matrices are a commonly adopted data representation, used to develop compartmental models for epidemic spreading, accounting for the contact heterogeneities across age groups. Their estimation, however, is generally time and effort consuming and model-driven strategies to quantify the contacts are often needed. In this article we focus on household contact matrices, describing the contacts among the members of a family and develop a parametric model to describe them. This model combines demographic and easily quantifiable survey-based data and is tested on high resolution proximity data collected in two sites in South Africa. Given its simplicity and interpretability, we expect our method to be easily applied to other contexts as well and we identify relevant questions that need to be addressed during the data collection procedure.
Affiliation(s)
| | - Jackie Kleynhans
- National Institute for Communicable Diseases of the National Health Laboratory Service, Johannesburg, South Africa
- School of Public Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
| | - Laetitia Gauvin
- ISI Foundation, Turin, Italy
- Institute for Research on sustainable Development, UMR215 PRODIG, Aubervilliers, France
| | - Michele Tizzoni
- ISI Foundation, Turin, Italy
- Department of Sociology and Social Research, University of Trento, Trento, Italy
| | | | - Mvuyo Makhasi
- National Institute for Communicable Diseases of the National Health Laboratory Service, Johannesburg, South Africa
| | - Nicole Wolter
- National Institute for Communicable Diseases of the National Health Laboratory Service, Johannesburg, South Africa
- School of Pathology, University of the Witwatersrand, Johannesburg, South Africa
| | - Brigitte Language
- Unit for Environmental Science and Management, Climatology Research Group, North-West University, Potchefstroom, South Africa
| | - Ryan G. Wagner
- MRC/Wits Rural Public Health and Health Transitions Research Unit (Agincourt), Agincourt, South Africa
| | - Cheryl Cohen
- National Institute for Communicable Diseases of the National Health Laboratory Service, Johannesburg, South Africa
- School of Public Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
| | - Stefano Tempia
- National Institute for Communicable Diseases of the National Health Laboratory Service, Johannesburg, South Africa
- School of Public Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
| | - Ciro Cattuto
- ISI Foundation, Turin, Italy
- Department of Informatics, University of Turin, Turin, Italy
| |
|
9
|
Zemaityte V, Karjus A, Rohn U, Schich M, Ibrus I. Quantifying the global film festival circuit: Networks, diversity, and public value creation. PLoS One 2024; 19:e0297404. [PMID: 38446758 PMCID: PMC10917328 DOI: 10.1371/journal.pone.0297404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Accepted: 12/31/2023] [Indexed: 03/08/2024] Open
Abstract
Film festivals are a key component in the global film industry in terms of trendsetting, publicity, trade, and collaboration. We present an unprecedented analysis of the international film festival circuit, which has so far remained relatively understudied quantitatively, partly due to the limited availability of suitable data sets. We use large-scale data from the Cinando platform of the Cannes Film Market, widely used by industry professionals. We explicitly model festival events as a global network connected by shared films and quantify festivals as aggregates of the metadata of their showcased films. Importantly, we argue against using simple count distributions for discrete labels such as language or production country, as such categories are typically not equidistant. Rather, we propose embedding them in continuous latent vector spaces. We demonstrate how these "festival embeddings" provide insight into changes in programmed content over time, predict festival connections, and can be used to measure diversity in film festival programming across various cultural, social, and geographical variables-which all constitute an aspect of public value creation by film festivals. Our results provide a novel mapping of the film festival circuit between 2009-2021 (616 festivals, 31,989 unique films), highlighting festival types that occupy specific niches, diverse series, and those that evolve over time. We also discuss how these quantitative findings fit into media studies and research on public value creation by cultural industries. With festivals occupying a central position in the film industry, investigations into the data they generate hold opportunities for researchers to better understand industry dynamics and cultural impact, and for organizers, policymakers, and industry actors to make more informed, data-driven decisions. 
We hope our proposed methodological approach to festival data paves the way for more comprehensive film festival studies and large-scale quantitative cultural event analytics in general.
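The abstract's argument that discrete labels are "not equidistant" can be made concrete with a toy comparison between a count-based diversity score and a distance-based one. This is a sketch only: the two-dimensional embedding values are invented, and the mean pairwise distance used here is an assumed stand-in, not the measure the authors report.

```python
import math
from itertools import combinations

# Hypothetical 2-D label embeddings; the coordinates are invented so that
# two labels sit close together and a third sits far away.
embedding = {
    "Spanish": (0.0, 0.0),
    "Portuguese": (0.1, 0.0),
    "Japanese": (1.0, 1.0),
}


def label_entropy(labels):
    """Shannon entropy of the label counts; treats all labels as equidistant."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def mean_pairwise_distance(labels, embedding):
    """Average Euclidean distance between the embeddings of all label pairs."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(embedding[a], embedding[b]) for a, b in pairs) / len(pairs)


festival_a = ["Spanish", "Portuguese"]
festival_b = ["Spanish", "Japanese"]

print(label_entropy(festival_a), label_entropy(festival_b))
print(mean_pairwise_distance(festival_a, embedding))
print(mean_pairwise_distance(festival_b, embedding))
```

Both programmes contain two distinct labels, so the count-based score cannot tell them apart, while the distance-based score rates the second programme as more diverse because its labels lie further apart in the assumed embedding space.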
Affiliation(s)
- Vejune Zemaityte
- Baltic Film, Media and Arts School, Tallinn University, Tallinn, Estonia
- ERA Chair for Cultural Data Analytics, Tallinn University, Tallinn, Estonia
| | - Andres Karjus
- ERA Chair for Cultural Data Analytics, Tallinn University, Tallinn, Estonia
- School of Humanities, Tallinn University, Tallinn, Estonia
- Estonian Business School, Tallinn, Estonia
| | - Ulrike Rohn
- Baltic Film, Media and Arts School, Tallinn University, Tallinn, Estonia
| | - Maximilian Schich
- Baltic Film, Media and Arts School, Tallinn University, Tallinn, Estonia
- ERA Chair for Cultural Data Analytics, Tallinn University, Tallinn, Estonia
| | - Indrek Ibrus
- Baltic Film, Media and Arts School, Tallinn University, Tallinn, Estonia
- ERA Chair for Cultural Data Analytics, Tallinn University, Tallinn, Estonia
| |
|
10
|
Cordes J, Enzlein T, Hopf C, Wolf I. pyM2aia: Python interface for mass spectrometry imaging with focus on deep learning. Bioinformatics 2024; 40:btae133. [PMID: 38445753 PMCID: PMC10948279 DOI: 10.1093/bioinformatics/btae133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2022] [Revised: 11/20/2023] [Accepted: 03/04/2024] [Indexed: 03/07/2024] Open
Abstract
SUMMARY Python is the most commonly used language for deep learning (DL). Existing Python packages for mass spectrometry imaging (MSI) data are not optimized for DL tasks. We, therefore, introduce pyM2aia, a Python package for MSI data analysis with a focus on memory-efficient handling, processing and convenient data-access for DL applications. pyM2aia provides interfaces to its parent application M2aia, which offers interactive capabilities for exploring and annotating MSI data in imzML format. pyM2aia utilizes the image input and output routines, data formats, and processing functions of M2aia, ensures data interchangeability, and enables the writing of readable and easy-to-maintain DL pipelines by providing batch generators for typical MSI data access strategies. We showcase the package in several examples, including imzML metadata parsing, signal processing, ion-image generation, and, in particular, DL model training and inference for spectrum-wise approaches, ion-image-based approaches, and approaches that use spectral and spatial information simultaneously. AVAILABILITY AND IMPLEMENTATION Python package, code and examples are available at (https://m2aia.github.io/m2aia).
Affiliation(s)
- Jonas Cordes
  - Faculty of Computer Science, Mannheim University of Applied Sciences, Mannheim 68163, Germany
  - Medical Faculty Mannheim, Heidelberg University, Mannheim 68167, Germany
- Thomas Enzlein
  - Center for Mass Spectrometry and Optical Spectroscopy, Mannheim University of Applied Sciences, Mannheim 68163, Germany
- Carsten Hopf
  - Medical Faculty Mannheim, Heidelberg University, Mannheim 68167, Germany
  - Center for Mass Spectrometry and Optical Spectroscopy, Mannheim University of Applied Sciences, Mannheim 68163, Germany
  - Medical Faculty, Heidelberg University, Heidelberg 69120, Germany
- Ivo Wolf
  - Faculty of Computer Science, Mannheim University of Applied Sciences, Mannheim 68163, Germany
|
11
|
Nandi B, Patel G, Das S. Prediction of maximum scour depth at clear water conditions: Multivariate and robust comparative analysis between empirical equations and machine learning approaches using extensive reference metadata. J Environ Manage 2024; 354:120349. [PMID: 38401497 DOI: 10.1016/j.jenvman.2024.120349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2023] [Revised: 01/02/2024] [Accepted: 02/08/2024] [Indexed: 02/26/2024]
Abstract
Flow obstructed by bridge piers can increase sediment transport, leading to local scour. This local scour poses a risk to the stability of bridge structures, which could lead to structural failures. There are two main approaches for evaluating the scour depth (ds) of bridge piers. The first is based on understanding hydraulic phenomena and developing relationships with properties affecting scour. The second uses data-driven soft computing models that lack physical interpretations but rely on algorithms to predict outcomes. Researchers choose between these methods based on their goals and resources. This study aims to create innovative ensemble frameworks comprising support vector machine for regression (SVMR), random forest regression (RFR), and reduced error pruning tree (REPTree) as base learners, alongside bagging regression tree (BRT) and stochastic gradient boosting (SGB) as meta learners. These ensembles were developed to analyse maximum scour depths (dsm) in clear water conditions, utilizing experimental data from 35 literature sources published over the last 63 years. The performance of each machine learning (ML) approach was assessed using statistical performance indicators. The proposed model was also compared with the top six empirical equations with strong predictive ability. Results show that among these empirical equations, the equation from Nandi and Das (2023) performs best. In the performance evaluation on the training, testing, and entire datasets, SGB (REPTree), BRT (SVMR-PUK), and SGB (REPTree), respectively, exhibited the highest performance, ranking first among all ML models and empirical equations. Sensitivity analysis identified sediment gradation and flow intensity as the most influential variables for predicting dsm during the training and testing phases, respectively.
Affiliation(s)
- Buddhadev Nandi
  - School of Water Resources Engineering, Jadavpur University, Kolkata 700032, India.
- Gaurav Patel
  - School of Water Resources Engineering, Jadavpur University, Kolkata 700032, India.
- Subhasish Das
  - School of Water Resources Engineering, Jadavpur University, Kolkata 700032, India.
|
12
|
Shome M, MacKenzie TMG, Subbareddy SR, Snyder MP. The Importance, Challenges, and Possible Solutions for Sharing Proteomics Data While Safeguarding Individuals' Privacy. Mol Cell Proteomics 2024; 23:100731. [PMID: 38331191 PMCID: PMC10915627 DOI: 10.1016/j.mcpro.2024.100731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Revised: 01/28/2024] [Accepted: 02/05/2024] [Indexed: 02/10/2024] Open
Abstract
Proteomics data sharing has profound benefits at the individual level as well as at the community level. While data sharing has increased over the years, mostly due to journal and funding agency requirements, the reluctance of researchers with regard to data sharing is evident, as many share only the bare minimum dataset required to publish an article. In many cases, proper metadata is missing, essentially making the dataset useless. This behavior can be explained by a lack of incentives, insufficient awareness, or a lack of clarity surrounding ethical issues. Through adequate training at research institutes, researchers can realize the benefits associated with data sharing and can accelerate the norm of data sharing for the field of proteomics, as has been the standard in genomics for decades. In this article, we have put together various repository options available for proteomics data. We have also added pros and cons of those repositories to help researchers select the repository most suitable for their data submission. It is also important to note that a few types of proteomics data have the potential to re-identify an individual in certain scenarios. In such cases, extra caution should be taken to remove any personal identifiers before sharing on public repositories. Data sets that will be useless without personal identifiers need to be shared in a controlled access repository so that only authorized researchers can access the data and personal identifiers are kept safe.
Affiliation(s)
- Mahasish Shome
  - Department of Genetics, Stanford University, Palo Alto, California, USA
- Tim M G MacKenzie
  - Department of Genetics, Stanford University, Palo Alto, California, USA
- Michael P Snyder
  - Department of Genetics, Stanford University, Palo Alto, California, USA.
|
13
|
Stoinski LM, Perkuhn J, Hebart MN. THINGSplus: New norms and metadata for the THINGS database of 1854 object concepts and 26,107 natural object images. Behav Res Methods 2024; 56:1583-1603. [PMID: 37095326 PMCID: PMC10991023 DOI: 10.3758/s13428-023-02110-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/13/2023] [Indexed: 04/26/2023]
Abstract
To study visual and semantic object representations, the need for well-curated object concepts and images has grown significantly over the past years. To address this, we have previously developed THINGS, a large-scale database of 1854 systematically sampled object concepts with 26,107 high-quality naturalistic images of these concepts. With THINGSplus, we significantly extend THINGS by adding concept- and image-specific norms and metadata for all 1854 concepts and one copyright-free image example per concept. Concept-specific norms were collected for the properties of real-world size, manmadeness, preciousness, liveliness, heaviness, naturalness, ability to move or be moved, graspability, holdability, pleasantness, and arousal. Further, we provide 53 superordinate categories as well as typicality ratings for all their members. Image-specific metadata includes a nameability measure, based on human-generated labels of the objects depicted in the 26,107 images. Finally, we identified one new public domain image per concept. Property (M = 0.97, SD = 0.03) and typicality ratings (M = 0.97, SD = 0.01) demonstrate excellent consistency, with the subsequently collected arousal ratings as the only exception (r = 0.69). Our property (M = 0.85, SD = 0.11) and typicality (r = 0.72, 0.74, 0.88) data correlated strongly with external norms, again with the lowest validity for arousal (M = 0.41, SD = 0.08). To summarize, THINGSplus provides a large-scale, externally validated extension to existing object norms and an important extension to THINGS, allowing detailed selection of stimuli and control variables for a wide range of research interested in visual object processing, language, and semantic memory.
Affiliation(s)
- Laura M Stoinski
  - Max Planck Institute for Human Cognitive & Brain Sciences, Leipzig, Germany.
- Jonas Perkuhn
  - Max Planck Institute for Human Cognitive & Brain Sciences, Leipzig, Germany
- Martin N Hebart
  - Max Planck Institute for Human Cognitive & Brain Sciences, Leipzig, Germany
  - Justus Liebig University, Gießen, Germany
|
14
|
Moresis A, Restivo L, Bromilow S, Flik G, Rosati G, Scorrano F, Tsoory M, O'Connor EC, Gaburro S, Bannach-Brown A. A minimal metadata set (MNMS) to repurpose nonclinical in vivo data for biomedical research. Lab Anim (NY) 2024; 53:67-79. [PMID: 38438748 PMCID: PMC10912024 DOI: 10.1038/s41684-024-01335-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Accepted: 01/31/2024] [Indexed: 03/06/2024]
Abstract
Although biomedical research is experiencing a data explosion, the accumulation of vast quantities of data alone does not guarantee a primary objective for science: building upon existing knowledge. Data collected that lack appropriate metadata cannot be fully interrogated or integrated into new research projects, leading to wasted resources and missed opportunities for data repurposing. This issue is particularly acute for research using animals, where concerns regarding data reproducibility and ensuring animal welfare are paramount. Here, to address this problem, we propose a minimal metadata set (MNMS) designed to enable the repurposing of in vivo data. MNMS aligns with an existing validated guideline for reporting in vivo data (ARRIVE 2.0) and contributes to making in vivo data FAIR-compliant. Scenarios where MNMS should be implemented in diverse research environments are presented, highlighting opportunities and challenges for data repurposing at different scales. We conclude with a 'call for action' to key stakeholders in biomedical research to adopt and apply MNMS to accelerate both the advancement of knowledge and the betterment of animal welfare.
Affiliation(s)
- Anastasios Moresis
  - Roche Pharma Research and Early Development, Data & Analytics, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Basel, Switzerland
- Leonardo Restivo
  - Neuro-Behavioral Analysis Unit, Faculty of Biology & Medicine, University of Lausanne, Lausanne, Switzerland
- Sophie Bromilow
  - Group Legal Department, F. Hoffmann-La Roche Ltd, Basel, Switzerland
- Gunnar Flik
  - Discovery, Charles River Laboratories, Groningen, the Netherlands
- Fabrizio Scorrano
  - Emerging Technologies, Comparative Medicine, Novartis International AG, Basel, Switzerland
- Michael Tsoory
  - Behavioral and Physiological Phenotyping Unit, Department of Veterinary Resources, Weizmann Institute of Science, Rehovot, Israel
- Eoin C O'Connor
  - Roche Pharma Research and Early Development, Neuroscience & Rare Diseases, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Basel, Switzerland.
- Alexandra Bannach-Brown
  - QUEST Center for Responsible Research, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany.
|
15
|
Rule A, Kannampallil T, Hribar MR, Dziorny AC, Thombley R, Apathy NC, Adler-Milstein J. Guidance for reporting analyses of metadata on electronic health record use. J Am Med Inform Assoc 2024; 31:784-789. [PMID: 38123497 PMCID: PMC10873840 DOI: 10.1093/jamia/ocad254] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Revised: 12/14/2023] [Accepted: 12/18/2023] [Indexed: 12/23/2023] Open
Abstract
INTRODUCTION: Research on how people interact with electronic health records (EHRs) increasingly involves the analysis of metadata on EHR use. These metadata can be recorded unobtrusively and capture EHR use at a scale unattainable through direct observation or self-reports. However, there is substantial variation in how metadata on EHR use are recorded, analyzed and described, limiting understanding, replication, and synthesis across studies. RECOMMENDATIONS: In this perspective, we provide guidance to those working with EHR use metadata by describing 4 common types, how they are recorded, and how they can be aggregated into higher-level measures of EHR use. We also describe guidelines for reporting analyses of EHR use metadata (or measures of EHR use derived from them) to foster clarity, standardization, and reproducibility in this emerging and critical area of research.
Affiliation(s)
- Adam Rule
  - Information School, University of Wisconsin-Madison, Madison, WI 53706, United States
- Thomas Kannampallil
  - Department of Anesthesiology, Washington University School of Medicine, St Louis, MO 63110, United States
  - Institute for Informatics, Data Science and Biostatistics, Washington University School of Medicine, St Louis, MO 63110, United States
- Michelle R Hribar
  - Office of Data Science and Health Informatics, National Eye Institute, National Institute of Health, Bethesda, MD 20892, United States
  - Department of Ophthalmology, Casey Eye Institute, Portland, OR 97239, United States
  - Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR 97239, United States
- Adam C Dziorny
  - Department of Pediatrics, University of Rochester School of Medicine, Rochester, NY 14642, United States
- Robert Thombley
  - Department of Medicine, Center for Clinical Informatics and Improvement Research, University of California, San Francisco, San Francisco, CA 94118, United States
- Nate C Apathy
  - National Center for Human Factors in Healthcare, MedStar Health Research Institute, Washington, DC 20782, United States
  - Center for Biomedical Informatics, Regenstrief Institute Inc, Indianapolis, IN 46202, United States
- Julia Adler-Milstein
  - Department of Medicine, Center for Clinical Informatics and Improvement Research, University of California, San Francisco, San Francisco, CA 94118, United States
|
16
|
Levitas D, Hayashi S, Vinci-Booher S, Heinsfeld A, Bhatia D, Lee N, Galassi A, Niso G, Pestilli F. ezBIDS: Guided standardization of neuroimaging data interoperable with major data archives and platforms. Sci Data 2024; 11:179. [PMID: 38332144 PMCID: PMC10853279 DOI: 10.1038/s41597-024-02959-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Accepted: 01/12/2024] [Indexed: 02/10/2024] Open
Abstract
Data standardization promotes a common framework through which researchers can utilize others' data and is one of the leading methods neuroimaging researchers use to share and replicate findings. As of today, standardizing datasets requires technical expertise such as coding and knowledge of file formats. We present ezBIDS, a tool for converting neuroimaging data and associated metadata to the Brain Imaging Data Structure (BIDS) standard. ezBIDS contains four major features: (1) No installation or programming requirements. (2) Handling of both imaging and task events data and metadata. (3) Semi-automated inference and guidance for adherence to BIDS. (4) Multiple data management options: download BIDS data to local system, or transfer to OpenNeuro.org or to brainlife.io. In sum, ezBIDS requires neither coding proficiency nor knowledge of BIDS, and is the first BIDS tool to offer guided standardization, support for task events conversion, and interoperability with OpenNeuro.org and brainlife.io.
Affiliation(s)
- Daniel Levitas
  - Department of Psychology, Department of Neuroscience, Center for Perceptual Systems, Center for Learning and Memory, Center for Aging Population Sciences, University of Texas, Austin, TX, 78712, USA
- Soichi Hayashi
  - Department of Psychology, Department of Neuroscience, Center for Perceptual Systems, Center for Learning and Memory, Center for Aging Population Sciences, University of Texas, Austin, TX, 78712, USA
- Sophia Vinci-Booher
  - Department of Psychology and Human Development, Peabody College, Vanderbilt University, Nashville, TN, 37203, USA
- Anibal Heinsfeld
  - Department of Psychology, Department of Neuroscience, Center for Perceptual Systems, Center for Learning and Memory, Center for Aging Population Sciences, University of Texas, Austin, TX, 78712, USA
- Dheeraj Bhatia
  - Department of Psychology, Department of Neuroscience, Center for Perceptual Systems, Center for Learning and Memory, Center for Aging Population Sciences, University of Texas, Austin, TX, 78712, USA
- Nicholas Lee
  - Department of Psychology, Department of Neuroscience, Center for Perceptual Systems, Center for Learning and Memory, Center for Aging Population Sciences, University of Texas, Austin, TX, 78712, USA
- Anthony Galassi
  - Center for Multimodal Neuroimaging, National Institute of Mental Health, Bethesda, MD, USA
- Franco Pestilli
  - Department of Psychology, Department of Neuroscience, Center for Perceptual Systems, Center for Learning and Memory, Center for Aging Population Sciences, University of Texas, Austin, TX, 78712, USA.
|
17
|
González-Rodríguez N, Areán-Ulloa E, Fernández-Leiro R. A web-based dashboard for RELION metadata visualization. Acta Crystallogr D Struct Biol 2024; 80:93-100. [PMID: 38265874 PMCID: PMC10836394 DOI: 10.1107/s2059798323010902] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 12/20/2023] [Indexed: 01/26/2024] Open
Abstract
Cryo-electron microscopy (cryo-EM) has witnessed radical progress in the past decade, driven by developments in hardware and software. While current software packages include processing pipelines that simplify the image-processing workflow, they do not prioritize the in-depth analysis of crucial metadata, limiting troubleshooting for challenging data sets. The widely used RELION software package lacks a graphical native representation of the underlying metadata. Here, two web-based tools are introduced: relion_live.py, which offers real-time feedback on data collection, aiding swift decision-making during data acquisition, and relion_analyse.py, a graphical interface to represent RELION projects by plotting essential metadata including interactive data filtration and analysis. A useful script for estimating ice thickness and data quality during movie pre-processing is also presented. These tools empower researchers to analyse data efficiently and allow informed decisions during data collection and processing.
Affiliation(s)
- Nayim González-Rodríguez
  - Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Emma Areán-Ulloa
  - Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro 3, 28029 Madrid, Spain
  - Department of Cell and Chemical Biology, Leiden University Medical Center, Leiden, The Netherlands
- Rafael Fernández-Leiro
  - Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro 3, 28029 Madrid, Spain
|
18
|
Wolfe SR, Lafuente B, Keller RM, Detweiler AM, Bristow TF, Parenteau MN, Boydstun K, Dateo CE, Des Marais DJ, Jahnke LL, Rojo S, Stone N, Vorobets M. Enabling Data Discovery with the Astrobiology Resource Metadata Standard. Astrobiology 2024; 24:131-137. [PMID: 38393827 PMCID: PMC10902265 DOI: 10.1089/ast.2023.0067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2024]
Abstract
As scientific investigations increasingly adopt Open Science practices, reuse of data becomes paramount. However, despite decades of progress in internet search tools, finding relevant astrobiology datasets for an envisioned investigation remains challenging due to the precise and atypical needs of the astrobiology researcher. In response, we have developed the Astrobiology Resource Metadata Standard (ARMS), a metadata standard designed to uniformly describe astrobiology "resources," that is, virtually any product of astrobiology research. Those resources include datasets, physical samples, software (modeling codes and scripts), publications, websites, images, videos, presentations, and so on. ARMS has been formulated to describe astrobiology resources generated by individual scientists or smaller scientific teams, rather than larger mission teams who may be required to use more complex archival metadata schemes. In the following, we discuss the participatory development process, give an overview of the metadata standard, describe its current use in practice, and close with a discussion of additional possible uses and extensions.
Affiliation(s)
- Shawn R Wolfe
  - NASA Ames Research Center, Moffett Field, California, USA
- Angela M Detweiler
  - Bay Area Environmental Research Institute, Moffett Field, California, USA
- Kevin Boydstun
  - NASA Ames Research Center, Moffett Field, California, USA
- Linda L Jahnke
  - NASA Ames Research Center, Moffett Field, California, USA
- Sara Rojo
  - NASA Ames Research Center, Moffett Field, California, USA
- Mark Vorobets
  - NASA Ames Research Center, Moffett Field, California, USA
|
19
|
Isasa I, Hernandez M, Epelde G, Londoño F, Beristain A, Larrea X, Alberdi A, Bamidis P, Konstantinidis E. Comparative assessment of synthetic time series generation approaches in healthcare: leveraging patient metadata for accurate data synthesis. BMC Med Inform Decis Mak 2024; 24:27. [PMID: 38291386 PMCID: PMC10826010 DOI: 10.1186/s12911-024-02427-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Accepted: 01/16/2024] [Indexed: 02/01/2024] Open
Abstract
BACKGROUND: Synthetic data is an emerging approach for addressing legal and regulatory concerns in biomedical research that deals with personal and clinical data, whether as a single tool or through its combination with other privacy enhancing technologies. Generating uncompromised synthetic data could significantly benefit external researchers performing secondary analyses by providing unlimited access to information while fulfilling pertinent regulations. However, the original data to be synthesized (e.g., data acquired in Living Labs) may consist of subjects' metadata (static) and a longitudinal component (set of time-dependent measurements), making it challenging to produce coherent synthetic counterparts. METHODS: Three synthetic time series generation approaches were defined and compared in this work: only generating the metadata and coupling it with the real time series from the original data (A1), generating both metadata and time series separately to join them afterwards (A2), and jointly generating both metadata and time series (A3). The comparative assessment of the three approaches was carried out using two different synthetic data generation models: the Wasserstein GAN with Gradient Penalty (WGAN-GP) and the DöppelGANger (DGAN). The experiments were performed with three different healthcare-related longitudinal datasets: Treadmill Maximal Effort Test (TMET) measurements from the University of Malaga (1), a hypotension subset derived from the MIMIC-III v1.4 database (2), and a lifelogging dataset named PMData (3). RESULTS: Three pivotal dimensions were assessed on the generated synthetic data: resemblance to the original data (1), utility (2), and privacy level (3). The optimal approach fluctuates based on the assessed dimension and metric. CONCLUSION: The initial characteristics of the datasets to be synthesized play a crucial role in determining the best approach. Coupling synthetic metadata with real time series (A1), as well as jointly generating synthetic time series and metadata (A3), are both competitive methods, while separately generating time series and metadata (A2) appears to perform more poorly overall.
Affiliation(s)
- Imanol Isasa
  - Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
- Mikel Hernandez
  - Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
  - Computer Science and Artificial Intelligence Department, Computer Science Faculty, University of the Basque Country (UPV/EHU), Donostia - San Sebastian, Spain
- Gorka Epelde
  - Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain.
  - eHealth Group, Biogipuzkoa Health Research Institute, Donostia-San Sebastian, Spain.
- Francisco Londoño
  - Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
- Andoni Beristain
  - Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
  - Computer Science and Artificial Intelligence Department, Computer Science Faculty, University of the Basque Country (UPV/EHU), Donostia - San Sebastian, Spain
  - eHealth Group, Biogipuzkoa Health Research Institute, Donostia-San Sebastian, Spain
- Xabat Larrea
  - Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
  - Biomedical Engineering Department, Mondragon University, Arrasate-Mondragon, Spain
- Ane Alberdi
  - Biomedical Engineering Department, Mondragon University, Arrasate-Mondragon, Spain
- Panagiotis Bamidis
  - Laboratory of Medical Physics and Digital Innovation, School of Medicine, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Evdokimos Konstantinidis
  - Laboratory of Medical Physics and Digital Innovation, School of Medicine, Aristotle University of Thessaloniki, Thessaloniki, Greece
  - European Network of Living Labs (ENoLL), Brussels, Belgium
|
20
|
Matley JK, Klinard NV, Martins AB, Oakley-Cogan A, Huveneers C, Vandergoot CS, Fisk AT. TrackdAT, an acoustic telemetry metadata dataset to support aquatic animal tracking research. Sci Data 2024; 11:143. [PMID: 38291027 PMCID: PMC10828395 DOI: 10.1038/s41597-024-02969-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Accepted: 01/15/2024] [Indexed: 02/01/2024] Open
Abstract
Data on the movement and space use of aquatic animals are crucial to understand complex interactions among biotic and abiotic components of ecosystems and facilitate effective conservation and management. Acoustic telemetry (AT) is a leading method for studying the movement ecology of aquatic animals worldwide, yet the ability to efficiently access study information from AT research is currently lacking, limiting advancements in its application. Here, we describe TrackdAT, an open-source metadata dataset where AT research parameters are catalogued to provide scientists, managers, and other stakeholders with the ability to efficiently identify and evaluate existing peer-reviewed research. Extracted metadata encompasses key information about biological and technical aspects of research, providing a comprehensive summary of existing AT research. TrackdAT currently hosts information from 2,412 journal articles published from 1969 to 2022 spanning 614 species and 380,289 tagged animals. TrackdAT has the potential to enable regional and global mobilization of knowledge, increased opportunities for collaboration, greater stakeholder engagement, and optimization of future ecological research.
Affiliation(s)
- Jordan K Matley
  - College of Science and Engineering, Flinders University, Bedford Park, SA, 5042, Australia.
- Natalie V Klinard
  - Department of Biology, Dalhousie University, Halifax, NS, B3H 4R2, Canada
- Arun Oakley-Cogan
  - Department of Biology, Dalhousie University, Halifax, NS, B3H 4R2, Canada
- Charlie Huveneers
  - College of Science and Engineering, Flinders University, Bedford Park, SA, 5042, Australia
- Aaron T Fisk
  - Great Lakes Institute for Environment Research, University of Windsor, Windsor, ON, N9B 3P4, Canada
|
21
|
Basiri R, Manji K, LeLievre PM, Toole J, Kim F, Khan SS, Popovic MR. Protocol for metadata and image collection at diabetic foot ulcer clinics: enabling research in wound analytics and deep learning. Biomed Eng Online 2024; 23:12. [PMID: 38287324 PMCID: PMC10826077 DOI: 10.1186/s12938-024-01210-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Accepted: 01/22/2024] [Indexed: 01/31/2024] Open
Abstract
BACKGROUND: The escalating impact of diabetes and its complications, including diabetic foot ulcers (DFUs), presents global challenges in quality of life, economics, and resources, affecting around half a billion people. DFU healing is hindered by hyperglycemia-related issues and diverse diabetes-related physiological changes, necessitating ongoing personalized care. Artificial intelligence and clinical research strive to address these challenges by facilitating early detection and efficient treatments despite resource constraints. This study establishes a standardized framework for DFU data collection, introducing a dedicated case report form, a comprehensive dataset named Zivot with patient population clinical feature breakdowns, and a baseline for DFU detection using this dataset and a UNet architecture. RESULTS: Following this protocol, we created the Zivot dataset consisting of 269 patients with active DFUs, and about 3700 RGB images and corresponding thermal and depth maps for the DFUs. The effectiveness of collecting a consistent and clean dataset was demonstrated using a bounding box prediction deep learning network that was constructed with EfficientNet as the feature extractor and UNet architecture. The network was trained on the Zivot dataset, and the evaluation metrics showed promising values of 0.79 and 0.86 for F1-score and mAP segmentation metrics. CONCLUSIONS: This work and the Zivot database offer a foundation for further exploration of holistic and multimodal approaches to DFU research.
Affiliation(s)
- Reza Basiri
  - Institute of Biomedical Engineering, University of Toronto, Toronto, Canada.
  - KITE Research Institute, Toronto Rehabilitation Institute - University Health Network, Toronto, Canada.
- Karim Manji
  - Zivot Limb Preservation Centre, Peter Lougheed Centre, Calgary, Canada
  - Department of Surgery, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Philip M LeLievre
  - Zivot Limb Preservation Centre, Peter Lougheed Centre, Calgary, Canada
  - Department of Surgery, Cumming School of Medicine, University of Calgary, Calgary, Canada
- John Toole
  - Zivot Limb Preservation Centre, Peter Lougheed Centre, Calgary, Canada
  - Department of Surgery, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Faith Kim
  - Faculty of Kinesiology, University of Calgary, Calgary, Canada
- Shehroz S Khan
  - Institute of Biomedical Engineering, University of Toronto, Toronto, Canada
  - KITE Research Institute, Toronto Rehabilitation Institute - University Health Network, Toronto, Canada
- Milos R Popovic
  - Institute of Biomedical Engineering, University of Toronto, Toronto, Canada
  - KITE Research Institute, Toronto Rehabilitation Institute - University Health Network, Toronto, Canada
|
22
|
Ranallo P, Southwell B, Tignanelli C, Johnson SG, Krueger R, Sevareid-Groth T, Carvel A, Melton GB. Promoting Learning Health System Cycles by Optimizing EHR Data Clinical Concept Encoding Processes. Stud Health Technol Inform 2024; 310:68-73. [PMID: 38269767 DOI: 10.3233/shti230929] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2024]
Abstract
Electronic health records (EHRs) and other real-world data (RWD) are critical to accelerating and scaling care improvement and transformation. To efficiently leverage it for secondary uses, EHR/RWD should be optimally managed and mapped to industry standard concepts (ISCs). Inherent challenges in concept encoding usually result in inefficient and costly workflows and resultant metadata representation structures outside the EHR. Using three related projects to map data to ISCs, we describe the development of standard, repeatable processes for precisely and unambiguously representing EHR data using appropriate ISCs within the EHR platform lifecycle and mappings specific to SNOMED-CT for Demographics, Specialty and Services. Mappings in these 3 areas resulted in ISC mappings of 779 data elements requiring 90 new concept requests to SNOMED-CT and 738 new ISCs mapped into the workflow within an accessible, enterprise-wide EHR resource with supporting processes.
Affiliation(s)
- Adam Carvel
  - Fairview Health Services, Minneapolis, MN USA
- Genevieve B Melton
  - Fairview Health Services, Minneapolis, MN USA
  - University of Minnesota, Minneapolis, MN USA
|
23
|
Stellmach C, Muzoora MR. How to Assess FAIRness of Your Data - A Summary of Testing Two FAIR Validators. Stud Health Technol Inform 2024; 310:154-158. [PMID: 38269784 DOI: 10.3233/shti230946] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2024]
Abstract
Decision-making in healthcare is heavily reliant on data that is findable, accessible, interoperable and reusable (FAIR). Evolving advancements in genomics also heavily rely on FAIR data to steer reliable research for the future. For practical purposes, ensuring FAIRness of a clinical data set can be challenging but could be aided by using FAIR validators. The study describes the test of two open-access web-tools in their demo versions to determine the FAIR levels of three submitted genomic data files with different formats (JSON, TXT, CSV). The F-UJI tool and FAIR-Checker tools provided similar FAIR scores for the three submitted files. However, the F-UJI tool assigned a total rating whereas the FAIR-Checker gave scores clustered by FAIR principles. Neither tool was suited to determine FAIR levels of a FHIR® JSON metadata file. Despite their early developmental status, FAIR validator tools have great potential to assist clinicians in the FAIRification of their research data.
|
24
|
Klopfenstein SAI, Sass J, Vorisek CN, Jorczik F, Schmidt CO, Löbe M, Golebiewski M, Abaza H, Thun S. Bringing Communities Together: Mapping the Investigation-Study-Assay-Model (ISA) to Fast Healthcare Interoperability Resources (FHIR). Stud Health Technol Inform 2024; 310:18-22. [PMID: 38269757 DOI: 10.3233/shti230919] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2024]
Abstract
Adhering to FAIR principles (findability, accessibility, interoperability, reusability) ensures sustainability and reliable exchange of data and metadata. Research communities need common infrastructures and information models to collect, store, manage and work with data and metadata. The German initiative NFDI4Health created a metadata schema and an infrastructure integrating existing platforms based on different information models and standards. To ensure system compatibility and enhance data integration possibilities, we mapped the Investigation-Study-Assay (ISA) model to Fast Healthcare Interoperability Resources (FHIR). We present the mapping in FHIR logical models, a resulting FHIR resources' network and challenges that we encountered. Challenges mainly related to ISA's genericness, and to different structures and datatypes used in ISA and FHIR. Mapping ISA to FHIR is feasible but requires further analyses of example data and adaptations to better specify target FHIR elements, and enable possible automatized conversions from ISA to FHIR.
Affiliation(s)
- Sophie A I Klopfenstein
  - Core Facility Digital Medicine and Interoperability, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Germany
  - Institute for Medical Informatics, Charité - Universitätsmedizin Berlin, Germany
- Julian Sass
  - Core Facility Digital Medicine and Interoperability, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Germany
- Carina N Vorisek
  - Core Facility Digital Medicine and Interoperability, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Germany
- Felix Jorczik
  - Core Facility Digital Medicine and Interoperability, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Germany
- Matthias Löbe
  - Institute for Medical Informatics (IMISE), University of Leipzig
- Sylvia Thun
  - Institute for Medical Informatics, Charité - Universitätsmedizin Berlin, Germany
|
25
|
Hahn U, Modersohn L, Faller J, Lohr C. Final Report on the German Clinical Reference Corpus 3000PA. Stud Health Technol Inform 2024; 310:599-603. [PMID: 38269879 DOI: 10.3233/shti231035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2024]
Abstract
We here report on one of the outcomes of a large-scale German research program, the Medical Informatics Initiative (MII), aiming at the development of a solid data and software infrastructure for German-language clinical natural language processing. Within this framework, we have developed 3000PA, a national clinical reference corpus composed of patient records from three clinical university sites and annotated with a multitude of semantic annotation layers (including medical named entities, semantic and temporal relations between entities, as well as certainty and negation information related to entities and relations). This non-sharable corpus has been complemented by three sharable ones (JSYNCC, GGPONC, and GRASCCO). Overall, 3000PA, JSYNCC and GRASCCO feature about 2.1 million metadata points.
Affiliation(s)
- Udo Hahn
  - Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany
- Luise Modersohn
  - Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany
  - Medizinische Informatik, TU München, München, Germany
- Jakob Faller
  - Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany
  - Universitätsklinikum Jena, Jena, Germany
- Christina Lohr
  - Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany
  - Institute for Medical Informatics, Statistics and Epidemiology (IMISE), Universität Leipzig, Leipzig, Germany
|
26
|
Bönisch C, Kesztyüs D, Kesztyüs T. FAIR+R: Making Clinical Data Reliable Through Qualitative Metadata. Stud Health Technol Inform 2024; 310:99-103. [PMID: 38269773 DOI: 10.3233/shti230935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2024]
Abstract
Metadata are often the first access to data repositories for researchers within secondary use. Through automatic metadata generation and metadata harvesting the amount of data about data has been growing ever since. In order to make data not only FAIR but also reliable, the aspect of metadata quality has to be considered. But as earlier assessments of metadata of different repositories showed, metadata quality still lacks behind its capability. Providing an extensive literature review the authors conclude nine measures to assess metadata in relation to clinical care repositories, such as Medical Data Integration Centers (MeDICs). Proceeding from these measures the authors propose an addition of the FAIR Guiding Principles by adding a fifth block for Reliability including three principles, that resulted from the measures presented. The results form the basis for the future work of an assessment of metadata, that is stored in a MeDIC.
Affiliation(s)
- Caroline Bönisch
  - Medical Data Integration Center, Department of Medical Informatics, University Medical Center Göttingen, Robert-Koch-Str. 40, 37075 Göttingen, Germany
- Dorothea Kesztyüs
  - Medical Data Integration Center, Department of Medical Informatics, University Medical Center Göttingen, Robert-Koch-Str. 40, 37075 Göttingen, Germany
- Tibor Kesztyüs
  - Medical Data Integration Center, Department of Medical Informatics, University Medical Center Göttingen, Robert-Koch-Str. 40, 37075 Göttingen, Germany
|
27
|
Webel H, Perez-Riverol Y, Nielsen AB, Rasmussen S. Mass spectrometry-based proteomics data from thousands of HeLa control samples. Sci Data 2024; 11:112. [PMID: 38263211 PMCID: PMC10806275 DOI: 10.1038/s41597-024-02922-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Accepted: 01/05/2024] [Indexed: 01/25/2024] Open
Abstract
Here we provide a curated, large scale, label free mass spectrometry-based proteomics data set derived from HeLa cell lines for general purpose machine learning and analysis. Data access and filtering is a tedious task, which takes up considerable amounts of time for researchers. Therefore we provide machine based metadata for easy selection and overview along the 7,444 raw files and MaxQuant search output. For convenience, we provide three filtered and aggregated development datasets on the protein groups, peptides and precursors level. Next to providing easy to access training data, we provide a SDRF file annotating each raw file with instrument settings allowing automated reprocessing. We encourage others to enlarge this data set by instrument runs of further HeLa samples from different machine types by providing our workflows and analysis scripts.
Affiliation(s)
- Henry Webel
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark
- Yasset Perez-Riverol
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Annelaura Bach Nielsen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark
- Simon Rasmussen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark.
  - The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
|
28
|
Polyanskiy MN. Refractiveindex.info database of optical constants. Sci Data 2024; 11:94. [PMID: 38238330 PMCID: PMC10796781 DOI: 10.1038/s41597-023-02898-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Accepted: 12/27/2023] [Indexed: 01/22/2024] Open
Abstract
We introduce the refractiveindex.info database, a comprehensive open-source repository containing optical constants for a wide array of materials, and describe in detail the underlying dataset. This collection, derived from a meticulous compilation of data sourced from peer-reviewed publications, manufacturers' datasheets, and authoritative texts, aims to advance research in optics and photonics. The data is stored using a YAML-based format, ensuring integrity, consistency, and ease of access. Each record is accompanied by detailed metadata, facilitating a comprehensive understanding and efficient utilization of the data. In this descriptor, we outline the data curation protocols and the file format used for data records, and briefly demonstrate how the data can be organized in a user-friendly fashion akin to the books in a traditional library.
Affiliation(s)
- Mikhail N Polyanskiy
  - Brookhaven National Laboratory, Accelerator Test Facility, Upton, NY, 11973, USA.
|
29
|
Khan H, Mosa ASM, Paka V, Rana MKZ, Mandhadi V, Islam S, Xu H, McClay JC, Sarker S, Rao P, Waitman LR. Mapping Clinical Documents to the Logical Observation Identifiers, Names and Codes (LOINC) Document Ontology using Electronic Health Record Systems Structured Metadata. AMIA Annu Symp Proc 2024; 2023:1017-1026. [PMID: 38222329 PMCID: PMC10785913] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/16/2024]
Abstract
As Electronic Health Record (EHR) systems increase in usage, organizations struggle to maintain and categorize clinical documentation so it can be used for clinical care and research. While prior research has often employed natural language processing techniques to categorize free text documents, there are shortcomings relative to computational scalability and the lack of key metadata within notes' text. This study presents a framework that can allow institutions to map their notes to the LOINC document ontology using a Bag of Words approach. After preliminary manual value-set mapping, an automated pipeline that leverages key dimensions of metadata from structured EHR fields aligns the notes with the dimensions of the document ontology. This framework resulted in 73.4% coverage of EHR documents, while also mapping 132 million notes in less than 2 hours; an order of magnitude more efficient than NLP based methods.
Affiliation(s)
- Huzaifa Khan
  - MU Institute of Data Science and Informatics, University of Missouri-Columbia
  - Department of Health Management and Informatics, School of Medicine, University of Missouri-Columbia
- Abu Saleh Mohammad Mosa
  - Department of Health Management and Informatics, School of Medicine, University of Missouri-Columbia
- Vyshnavi Paka
  - Department of Health Management and Informatics, School of Medicine, University of Missouri-Columbia
- Md Kamruz Zaman Rana
  - Department of Health Management and Informatics, School of Medicine, University of Missouri-Columbia
- Vasanthi Mandhadi
  - Department of Health Management and Informatics, School of Medicine, University of Missouri-Columbia
- Soliman Islam
  - Department of Health Management and Informatics, School of Medicine, University of Missouri-Columbia
- Hua Xu
  - Yale University, New Haven, CT, USA
  - OHDSI Consortium, Natural Language Processing Working Group
- James C McClay
  - Department of Health Management and Informatics, School of Medicine, University of Missouri-Columbia
- Sraboni Sarker
  - Department of Electrical and Computer Science, School of Engineering, University of Missouri-Columbia
- Praveen Rao
  - Department of Electrical and Computer Science, School of Engineering, University of Missouri-Columbia
- Lemuel R Waitman
  - Department of Health Management and Informatics, School of Medicine, University of Missouri-Columbia
|
30
|
Kiran A, Hanachi M, Alsayed N, Fassatoui M, Oduaran OH, Allali I, Maslamoney S, Meintjes A, Zass L, Rocha JD, Kefi R, Benkahla A, Ghedira K, Panji S, Mulder N, Fadlelmola FM, Souiai O. The African Human Microbiome Portal: a public web portal of curated metagenomic metadata. Database (Oxford) 2024; 2024:baad092. [PMID: 38204360 PMCID: PMC10782148 DOI: 10.1093/database/baad092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Revised: 11/03/2023] [Accepted: 12/21/2023] [Indexed: 01/12/2024]
Abstract
There is growing evidence that comprehensive and harmonized metadata are fundamental for effective public data reusability. However, it is often challenging to extract accurate metadata from public repositories. Of particular concern is the metagenomic data related to African individuals, which often omit important information about the particular features of these populations. As part of a collaborative consortium, H3ABioNet, we created a web portal, namely the African Human Microbiome Portal (AHMP), exclusively dedicated to metadata related to African human microbiome samples. Metadata were collected from various public repositories prior to cleaning, curation and harmonization according to a pre-established guideline and using ontology terms. These metadata sets can be accessed at https://microbiome.h3abionet.org/. This web portal is open access and offers an interactive visualization of 14 889 records from 70 bioprojects associated with 72 peer reviewed research articles. It also offers the ability to download harmonized metadata according to the user's applied filters. The AHMP thereby supports metadata search and retrieve operations, facilitating, thus, access to relevant studies linked to the African Human microbiome. Database URL: https://microbiome.h3abionet.org/.
Affiliation(s)
- Mariem Hanachi
  - Laboratory of Bioinformatics, Biomathematics and Biostatistics (LR16IPT09), Institute Pasteur of Tunis, University Tunis El Manar, Tunis 1002, Tunisia
  - Faculty of Science of Bizerte, University of Carthage, Tunis, Tunisia
- Nihad Alsayed
  - Kush Centre for Genomics and Biomedical Informatics, Biotechnology Perspectives Organization, Khartoum, Sudan
- Meriem Fassatoui
  - Laboratory of Biomedical Genomics & Oncogenetics, Institut Pasteur de Tunis, University Tunis El Manar, Tunis 1002, Tunisia
- Ovokeraye H Oduaran
  - The Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand, Johannesburg, South Africa
- Imane Allali
  - Laboratory of Human Pathologies Biology, Department of Biology, Faculty of Sciences, Mohammed V University in Rabat, Rabat, Morocco
- Suresh Maslamoney
  - Computational Biology Division, Department of Integrative Biomedical Sciences and Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town 7925, South Africa
- Ayton Meintjes
  - Computational Biology Division, Department of Integrative Biomedical Sciences and Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town 7925, South Africa
- Lyndon Zass
  - Computational Biology Division, Department of Integrative Biomedical Sciences and Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town 7925, South Africa
- Jorge Da Rocha
  - The Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand, Johannesburg, South Africa
- Rym Kefi
  - Laboratory of Biomedical Genomics & Oncogenetics, Institut Pasteur de Tunis, University Tunis El Manar, Tunis 1002, Tunisia
- Alia Benkahla
  - Laboratory of Bioinformatics, Biomathematics and Biostatistics (LR16IPT09), Institute Pasteur of Tunis, University Tunis El Manar, Tunis 1002, Tunisia
- Kais Ghedira
  - Laboratory of Bioinformatics, Biomathematics and Biostatistics (LR16IPT09), Institute Pasteur of Tunis, University Tunis El Manar, Tunis 1002, Tunisia
- Sumir Panji
  - Computational Biology Division, Department of Integrative Biomedical Sciences and Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town 7925, South Africa
- Nicola Mulder
  - Computational Biology Division, Department of Integrative Biomedical Sciences and Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town 7925, South Africa
- Faisal M Fadlelmola
  - Kush Centre for Genomics and Biomedical Informatics, Biotechnology Perspectives Organization, Khartoum, Sudan
- Oussema Souiai
  - Laboratory of Bioinformatics, Biomathematics and Biostatistics (LR16IPT09), Institute Pasteur of Tunis, University Tunis El Manar, Tunis 1002, Tunisia
  - Malawi-Liverpool-Wellcome Trust, Blantyre 3, Malawi
  - Institute of Infection, Veterinary and Ecological Sciences, University of Liverpool, Liverpool CH64 7TE, UK
|
31
|
Ara T, Kodama Y, Tokimatsu T, Fukuda A, Kosuge T, Mashima J, Tanizawa Y, Tanjo T, Ogasawara O, Fujisawa T, Nakamura Y, Arita M. DDBJ update in 2023: the MetaboBank for metabolomics data and associated metadata. Nucleic Acids Res 2024; 52:D67-D71. [PMID: 37971299 PMCID: PMC10767850 DOI: 10.1093/nar/gkad1046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Revised: 10/21/2023] [Accepted: 10/27/2023] [Indexed: 11/19/2023] Open
Abstract
The Bioinformation and DNA Data Bank of Japan (DDBJ) Center (https://www.ddbj.nig.ac.jp) provides database archives that cover a wide range of fields in life sciences. As a founding member of the International Nucleotide Sequence Database Collaboration (INSDC), DDBJ accepts and distributes nucleotide sequence data as well as their study and sample information along with the National Center for Biotechnology Information in the United States and the European Bioinformatics Institute (EBI). Besides INSDC databases, the DDBJ Center provides databases for functional genomics (GEA: Genomic Expression Archive), metabolomics (MetaboBank) and human genetic and phenotypic data (JGA: Japanese Genotype-phenotype Archive). These database systems have been built on the National Institute of Genetics (NIG) supercomputer, which is also open for domestic life science researchers to analyze large-scale sequence data. This paper reports recent updates on the archival databases and the services of the DDBJ Center, highlighting the newly redesigned MetaboBank. MetaboBank uses BioProject and BioSample in its metadata description making it suitable for multi-omics large studies. Its collaboration with MetaboLights at EBI brings synergy in locating and reusing public data.
Affiliation(s)
- Takeshi Ara
  - Bioinformation and DDBJ Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Yuichi Kodama
  - Bioinformation and DDBJ Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Toshiaki Tokimatsu
  - Bioinformation and DDBJ Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Asami Fukuda
  - Bioinformation and DDBJ Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Takehide Kosuge
  - Bioinformation and DDBJ Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Jun Mashima
  - Bioinformation and DDBJ Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Yasuhiro Tanizawa
  - Bioinformation and DDBJ Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Tomoya Tanjo
  - Bioinformation and DDBJ Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Osamu Ogasawara
  - Bioinformation and DDBJ Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Takatomo Fujisawa
  - Bioinformation and DDBJ Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Yasukazu Nakamura
  - Bioinformation and DDBJ Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Masanori Arita
  - Bioinformation and DDBJ Center, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
|
32
|
George N, Fexova S, Fuentes AM, Madrigal P, Bi Y, Iqbal H, Kumbham U, Nolte N, Zhao L, Thanki A, Yu I, Marugan Calles J, Erdos K, Vilmovsky L, Kurri S, Vathrakokoili-Pournara A, Osumi-Sutherland D, Prakash A, Wang S, Tello-Ruiz M, Kumari S, Ware D, Goutte-Gattat D, Hu Y, Brown N, Perrimon N, Vizcaíno JA, Burdett T, Teichmann S, Brazma A, Papatheodorou I. Expression Atlas update: insights from sequencing data at both bulk and single cell level. Nucleic Acids Res 2024; 52:D107-D114. [PMID: 37992296 PMCID: PMC10767917 DOI: 10.1093/nar/gkad1021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Revised: 10/13/2023] [Accepted: 10/30/2023] [Indexed: 11/24/2023] Open
Abstract
Expression Atlas (www.ebi.ac.uk/gxa) and its newest counterpart the Single Cell Expression Atlas (www.ebi.ac.uk/gxa/sc) are EMBL-EBI's knowledgebases for gene and protein expression and localisation in bulk and at single cell level. These resources aim to allow users to investigate their expression in normal tissue (baseline) or in response to perturbations such as disease or changes to genotype (differential) across multiple species. Users are invited to search for genes or metadata terms across species or biological conditions in a standardised consistent interface. Alongside these data, new features in Single Cell Expression Atlas allow users to query metadata through our new cell type wheel search. At the experiment level data can be explored through two types of dimensionality reduction plots, t-distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP), overlaid with either clustering or metadata information to assist users' understanding. Data are also visualised as marker gene heatmaps identifying genes that help confer cluster identity. For some data, additional visualisations are available as interactive cell level anatomograms and cell type gene expression heatmaps.
Collapse
Affiliation(s)
- Nancy George
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Silvie Fexova
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Alfonso Munoz Fuentes
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Pedro Madrigal
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Yalan Bi
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Haider Iqbal
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Upendra Kumbham
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Nadja Francesca Nolte
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Lingyun Zhao
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Anil S Thanki
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Iris D Yu
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Jose C Marugan Calles
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Karoly Erdos
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Liora Vilmovsky
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Sandeep R Kurri
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | | | - David Osumi-Sutherland
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Ananth Prakash
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Shengbo Wang
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Marcela K Tello-Ruiz
- Cold Spring Harbour Laboratory, One Bungtown Road, Cold Spring Harbor, NY 11724, USA
| | - Sunita Kumari
- Cold Spring Harbour Laboratory, One Bungtown Road, Cold Spring Harbor, NY 11724, USA
| | - Doreen Ware
- Cold Spring Harbour Laboratory, One Bungtown Road, Cold Spring Harbor, NY 11724, USA
- USDA ARS NEA, Plant Soil & Nutrition Laboratory Research Unit, Ithaca, NY 14853, USA
| | - Damien Goutte-Gattat
- FlyBase-Cambridge, Department of Physiology, Development and Neuroscience, University of Cambridge Downing Street, Cambridge CB2 3DY, UK
| | - Yanhui Hu
- Perrimon Lab, Department of Genetics, Harvard Medical School, Boston MA 02115, USA
| | - Nick Brown
- FlyBase-Cambridge, Department of Physiology, Development and Neuroscience, University of Cambridge Downing Street, Cambridge CB2 3DY, UK
| | - Norbert Perrimon
- Perrimon Lab, Department of Genetics, Harvard Medical School, Boston MA 02115, USA
- FlyBase-Harvard Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA
| | - Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Tony Burdett
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Sarah Teichmann
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK
| | - Alvis Brazma
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
| | - Irene Papatheodorou
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
|
33
|
Camargo AP, Call L, Roux S, Nayfach S, Huntemann M, Palaniappan K, Ratner A, Chu K, Mukherjee S, Reddy TBK, Chen IM, Ivanova N, Eloe-Fadrosh E, Woyke T, Baltrus D, Castañeda-Barba S, de la Cruz F, Funnell BE, Hall JJ, Mukhopadhyay A, Rocha EC, Stalder T, Top E, Kyrpides N. IMG/PR: a database of plasmids from genomes and metagenomes with rich annotations and metadata. Nucleic Acids Res 2024; 52:D164-D173. [PMID: 37930866 PMCID: PMC10767988 DOI: 10.1093/nar/gkad964] [Received: 08/29/2023] [Revised: 10/09/2023] [Accepted: 10/14/2023] [Indexed: 11/08/2023] Open
Abstract
Plasmids are mobile genetic elements found in many clades of Archaea and Bacteria. They drive horizontal gene transfer, impacting ecological and evolutionary processes within microbial communities, and hold substantial importance in human health and biotechnology. To support plasmid research and provide scientists with data of an unprecedented diversity of plasmid sequences, we introduce the IMG/PR database, a new resource encompassing 699 973 plasmid sequences derived from genomes, metagenomes and metatranscriptomes. IMG/PR is the first database to provide data on plasmids that were systematically identified from diverse microbiome samples. IMG/PR plasmids are associated with rich metadata that includes geographical and ecosystem information, host taxonomy, similarity to other plasmids, functional annotation, and presence of genes involved in conjugation and antibiotic resistance. The database offers diverse methods for exploring its extensive plasmid collection, enabling users to navigate plasmids through metadata-centric queries, plasmid comparisons and BLAST searches. The web interface for IMG/PR is accessible at https://img.jgi.doe.gov/pr. Plasmid metadata and sequences can be downloaded from https://genome.jgi.doe.gov/portal/IMG_PR.
Affiliation(s)
- Antonio Pedro Camargo
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Lee Call
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Simon Roux
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Stephen Nayfach
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Marcel Huntemann
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | | | - Anna Ratner
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Ken Chu
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Supratim Mukherjee
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - T B K Reddy
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - I-Min A Chen
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Natalia N Ivanova
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Emiley A Eloe-Fadrosh
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Tanja Woyke
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - David A Baltrus
- School of Plant Sciences, University of Arizona, Tucson AZ, USA
- School of Animal and Comparative Biomedical Sciences, University of Arizona, Tucson AZ, USA
| | | | - Fernando de la Cruz
- Instituto de Biomedicina y Biotecnología de Cantabria (Consejo Superior de Investigaciones Científicas – Universidad de Cantabria), Cantabria, Spain
| | - Barbara E Funnell
- Department of Molecular Genetics, University of Toronto, Toronto, ON M5G 1M1, Canada
| | - James P J Hall
- Department of Evolution, Ecology and Behaviour, Institute of Infection, Veterinary and Ecological Sciences, University of Liverpool, Liverpool L69 7ZB, UK
| | - Aindrila Mukhopadhyay
- Joint BioEnergy Institute, Emeryville, CA 94608, USA
- Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Eduardo P C Rocha
- Institut Pasteur, Université de Paris Cité, CNRS UMR3525, Microbial Evolutionary Genomics, Paris, France
| | - Thibault Stalder
- Department of Biological Sciences, University of Idaho, Moscow, ID 83844, USA
| | - Eva Top
- Department of Biological Sciences, University of Idaho, Moscow, ID 83844, USA
| | - Nikos C Kyrpides
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
|
34
|
Lu K, Pan Y, Shen J, Yang L, Zhan C, Liang S, Tai S, Wan L, Li T, Cheng T, Ma B, Pan G, He N, Lu C, Westhof E, Xiang Z, Han MJ, Tong X, Dai F. SilkMeta: a comprehensive platform for sharing and exploiting pan-genomic and multi-omic silkworm data. Nucleic Acids Res 2024; 52:D1024-D1032. [PMID: 37941143 PMCID: PMC10767832 DOI: 10.1093/nar/gkad956] [Received: 07/27/2023] [Revised: 10/03/2023] [Accepted: 10/13/2023] [Indexed: 11/10/2023] Open
Abstract
The silkworm Bombyx mori is a domesticated insect that serves as an animal model for research and agriculture. The silkworm super-pan-genome dataset, which we published last year, is a unique resource for the study of global genomic diversity and phenotype-genotype association. Here we present SilkMeta (http://silkmeta.org.cn), a comprehensive database covering the available silkworm pan-genome and multi-omics data. The database contains 1082 short-read genomes, 546 long-read assembled genomes, 1168 transcriptomes, 294 phenotype characterizations (phenome), tens of millions of variations (variome), 7253 long non-coding RNAs (lncRNAs), 18 717 full length transcripts and a set of population statistics. We have compiled publications on functional genomics research and genetic stock deciphering (mutant map). A range of bioinformatics tools is also provided for data visualization and retrieval. The large batch of omics data and tools were integrated in twelve functional modules that provide useful strategies and data for comparative and functional genomics research. The interactive bioinformatics platform SilkMeta will benefit not only the silkworm but also the insect biology communities.
Affiliation(s)
- Kunpeng Lu
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
- Key Laboratory of Sericultural Biology and Genetic Breeding, Ministry of Agriculture and Rural Affairs, College of Sericulture, Textile and Biomass Sciences, Southwest University, Chongqing 400715, China
| | - Yifei Pan
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
| | - Jianghong Shen
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
| | - Lin Yang
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
| | - Chengyu Zhan
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
| | - Shubo Liang
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
| | | | - Linrong Wan
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
| | - Tian Li
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
| | - Tingcai Cheng
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
| | - Bi Ma
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
| | - Guoqing Pan
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
| | - Ningjia He
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
| | - Cheng Lu
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
| | - Eric Westhof
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
- Architecture et Réactivité de l’ARN, Institut de Biologie Moléculaire et Cellulaire, UPR9002 CNRS, Université de Strasbourg, Strasbourg 67084, France
| | - Zhonghuai Xiang
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
| | - Min-Jin Han
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
- Key Laboratory of Sericultural Biology and Genetic Breeding, Ministry of Agriculture and Rural Affairs, College of Sericulture, Textile and Biomass Sciences, Southwest University, Chongqing 400715, China
| | - Xiaoling Tong
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
- Key Laboratory of Sericultural Biology and Genetic Breeding, Ministry of Agriculture and Rural Affairs, College of Sericulture, Textile and Biomass Sciences, Southwest University, Chongqing 400715, China
| | - Fangyin Dai
- State Key Laboratory of Resource Insects, Institute of Sericulture and Systems Biology, Southwest University, Chongqing 400715, China
- Key Laboratory of Sericultural Biology and Genetic Breeding, Ministry of Agriculture and Rural Affairs, College of Sericulture, Textile and Biomass Sciences, Southwest University, Chongqing 400715, China
|
35
|
Yurekten O, Payne T, Tejera N, Amaladoss FX, Martin C, Williams M, O’Donovan C. MetaboLights: open data repository for metabolomics. Nucleic Acids Res 2024; 52:D640-D646. [PMID: 37971328 PMCID: PMC10767962 DOI: 10.1093/nar/gkad1045] [Received: 09/15/2023] [Revised: 10/16/2023] [Accepted: 10/26/2023] [Indexed: 11/19/2023] Open
Abstract
MetaboLights is a global database for metabolomics studies including the raw experimental data and the associated metadata. The database is cross-species and cross-technique and covers metabolite structures and their reference spectra as well as their biological roles and locations where available. MetaboLights is the recommended metabolomics repository for a number of leading journals and ELIXIR, the European infrastructure for life science information. In this article, we describe the continued growth and diversity of submissions and the significant developments in recent years. In particular, we highlight MetaboLights Labs, our new Galaxy Project instance with repository-scale standardized workflows, and how data public on MetaboLights are being reused by the community. Metabolomics resources and data are available under the EMBL-EBI's Terms of Use at https://www.ebi.ac.uk/metabolights and under Apache 2.0 at https://github.com/EBI-Metabolights.
Affiliation(s)
- Ozgur Yurekten
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Thomas Payne
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Noemi Tejera
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Felix Xavier Amaladoss
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Callum Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Mark Williams
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Claire O’Donovan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
36
|
Niehues A, de Visser C, Hagenbeek FA, Kulkarni P, Pool R, Karu N, Kindt ASD, Singh G, Vermeiren RRJM, Boomsma DI, van Dongen J, 't Hoen PAC, van Gool AJ. A multi-omics data analysis workflow packaged as a FAIR Digital Object. Gigascience 2024; 13:giad115. [PMID: 38217405 PMCID: PMC10787363 DOI: 10.1093/gigascience/giad115] [Received: 06/13/2023] [Revised: 11/14/2023] [Accepted: 12/10/2023] [Indexed: 01/15/2024] Open
Abstract
BACKGROUND Applying good data management and FAIR (Findable, Accessible, Interoperable, and Reusable) data principles in research projects can help disentangle knowledge discovery, study result reproducibility, and data reuse in future studies. Based on the concepts of the original FAIR principles for research data, FAIR principles for research software were recently proposed. FAIR Digital Objects enable discovery and reuse of Research Objects, including computational workflows for both humans and machines. Practical examples can help promote the adoption of FAIR practices for computational workflows in the research community. We developed a multi-omics data analysis workflow implementing FAIR practices to share it as a FAIR Digital Object. FINDINGS We conducted a case study investigating shared patterns between multi-omics data and childhood externalizing behavior. The analysis workflow was implemented as a modular pipeline in the workflow manager Nextflow, including containers with software dependencies. We adhered to software development practices like version control, documentation, and licensing. Finally, the workflow was described with rich semantic metadata, packaged as a Research Object Crate, and shared via WorkflowHub. CONCLUSIONS Along with the packaged multi-omics data analysis workflow, we share our experiences adopting various FAIR practices and creating a FAIR Digital Object. We hope our experiences can help other researchers who develop omics data analysis workflows to turn FAIR principles into practice.
Affiliation(s)
- Anna Niehues
- Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Translational Metabolic Laboratory, Department of Laboratory Medicine, Radboud University Medical Center, 6525 GA Nijmegen, the Netherlands
| | - Casper de Visser
- Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
| | - Fiona A Hagenbeek
- Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
- Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
| | - Purva Kulkarni
- Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Translational Metabolic Laboratory, Department of Laboratory Medicine, Radboud University Medical Center, 6525 GA Nijmegen, the Netherlands
- Department of Human Genetics, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
| | - René Pool
- Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
- Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
| | - Naama Karu
- Metabolomics and Analytics Centre, Leiden Academic Centre for Drug Research, Leiden University, 2333 AL Leiden, The Netherlands
| | - Alida S D Kindt
- Metabolomics and Analytics Centre, Leiden Academic Centre for Drug Research, Leiden University, 2333 AL Leiden, The Netherlands
| | - Gurnoor Singh
- Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
| | - Robert R J M Vermeiren
- Department of Child and Adolescent Psychiatry, LUMC-Curium, Leiden University Medical Center, 2342 AK Oegstgeest, The Netherlands
| | - Dorret I Boomsma
- Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
- Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
- Amsterdam Reproduction & Development (AR&D) Research Institute, 1081 BT Amsterdam, The Netherlands
| | - Jenny van Dongen
- Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
- Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
- Amsterdam Reproduction & Development (AR&D) Research Institute, 1081 BT Amsterdam, The Netherlands
| | - Peter A C 't Hoen
- Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
| | - Alain J van Gool
- Translational Metabolic Laboratory, Department of Laboratory Medicine, Radboud University Medical Center, 6525 GA Nijmegen, the Netherlands
|
37
|
Kawakami R, Wright KD, Scharre DW, Ning X. Detection of Cognitive Impairment From eSAGE Metadata Using Machine Learning. Alzheimer Dis Assoc Disord 2024; 38:22-27. [PMID: 38109352 DOI: 10.1097/wad.0000000000000593] [Received: 04/20/2023] [Accepted: 10/27/2023] [Indexed: 12/20/2023]
Abstract
OBJECTIVE Using the metadata collected in the digital version of the Self-Administered Gerocognitive Examination (eSAGE), we aim to improve the prediction of mild cognitive impairment (MCI) and dementia (DM) by applying machine learning methods. PATIENTS AND METHODS eSAGE scores and metadata were obtained from a total of 66 patients with a diagnosis of normal cognition (NC), MCI, or DM. Each eSAGE question was scored, and behavioral features (metadata) such as the time spent on each test page, drawing speed, and average stroke length were extracted for each patient. Logistic regression (LR) and gradient boosting models were trained using these features to detect cognitive impairment (CI). Performance was evaluated using 10-fold cross-validation, with accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC) as evaluation metrics. RESULTS LR with feature selection achieved an AUC of 89.51%, a recall of 87.56%, and an F1 of 85.07% using both behavioral and scoring features. LR using scores and metadata also achieved an AUC of 84.00% in detecting MCI from NC, and an AUC of 98.12% in detecting DM from NC. Average stroke length was particularly useful for prediction; when combined with 4 other scoring features, LR achieved an even better AUC of 92.06% in detecting CI. CONCLUSIONS eSAGE scores and metadata are predictive of CI. With machine learning methods, the metadata could be combined with scores to enable more accurate detection of CI.
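The evaluation metrics named in this abstract (precision, recall, F1 score, and AUC) can be illustrated with a minimal pure-Python sketch. This is not the authors' code: the function names `precision_recall_f1` and `roc_auc`, the toy labels, and the 0.5 decision threshold are all assumptions made for illustration, and the study's own pipeline (logistic regression, feature selection, 10-fold cross-validation) is not reproduced here.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for binary labels (1 = impaired, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 impaired and 3 normal cases with model scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

In a cross-validated setting, such metrics would typically be computed on each held-out fold and then averaged; the pairwise AUC above is quadratic in the number of samples, which is acceptable only for small cohorts such as the 66 patients mentioned.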
Affiliation(s)
| | | | | | - Xia Ning
- Department of Computer Science and Engineering
- Department of Biomedical Informatics
- Translational Data Analytics Institute, The Ohio State University, Columbus, OH
|
38
|
Rahrooh A, Garlid AO, Bartlett K, Coons W, Petousis P, Hsu W, Bui AAT. Towards a framework for interoperability and reproducibility of predictive models. J Biomed Inform 2024; 149:104551. [PMID: 38000765 DOI: 10.1016/j.jbi.2023.104551] [Received: 04/05/2023] [Revised: 08/28/2023] [Accepted: 11/19/2023] [Indexed: 11/26/2023]
Abstract
The development and deployment of machine learning (ML) models for biomedical research and healthcare currently lacks standard methodologies. Although tools for model replication are numerous, without a unifying blueprint it remains difficult to scientifically reproduce predictive ML models for any number of reasons (e.g., assumptions regarding data distributions and preprocessing, unclear test metrics, etc.) and ultimately, questions around generalizability and transportability are not readily answered. To facilitate scientific reproducibility, we built upon the Predictive Model Markup Language (PMML) to capture essential information. As a key component of the PREdictive Model Index and Exchange REpository (PREMIERE) platform, we present the Automated Metadata Pipeline (AMP) for conversion of a given predictive ML model into an extended PMML file that autocompletes an ML-based checklist, assessing model elements for interoperability and reproducibility. We demonstrate this pipeline on multiple test cases with three different ML algorithms and health-related datasets, providing a foundation for future predictive model reproducibility, sharing, and comparison.
Affiliation(s)
- Al Rahrooh
- Medical & Imaging Informatics (MII) Group, University of California Los Angeles (UCLA), Los Angeles, CA, USA
| | - Anders O Garlid
- Medical & Imaging Informatics (MII) Group, University of California Los Angeles (UCLA), Los Angeles, CA, USA
| | - Kelly Bartlett
- Medical & Imaging Informatics (MII) Group, University of California Los Angeles (UCLA), Los Angeles, CA, USA
| | - Warren Coons
- Medical & Imaging Informatics (MII) Group, University of California Los Angeles (UCLA), Los Angeles, CA, USA
| | - Panayiotis Petousis
- Clinical and Translational Science Institute (CTSI), University of California Los Angeles (UCLA), Los Angeles, CA, USA
| | - William Hsu
- Medical & Imaging Informatics (MII) Group, University of California Los Angeles (UCLA), Los Angeles, CA, USA
| | - Alex A T Bui
- Medical & Imaging Informatics (MII) Group, University of California Los Angeles (UCLA), Los Angeles, CA, USA; Clinical and Translational Science Institute (CTSI), University of California Los Angeles (UCLA), Los Angeles, CA, USA
|
39
|
Schwedhelm C, Nimptsch K, Ahrens W, Hasselhorn HM, Jöckel KH, Katzke V, Kluttig A, Linkohr B, Mikolajczyk R, Nöthlings U, Perrar I, Peters A, Schmidt CO, Schmidt B, Schulze MB, Stang A, Zeeb H, Pischon T. Chronic disease outcome metadata from German observational studies - public availability and FAIR principles. Sci Data 2023; 10:868. [PMID: 38052810 PMCID: PMC10698176 DOI: 10.1038/s41597-023-02726-7] [Received: 06/12/2023] [Accepted: 11/07/2023] [Indexed: 12/07/2023] Open
Abstract
Metadata from epidemiological studies, including chronic disease outcome metadata (CDOM), need to be findable to allow interpretability and reusability. We propose a comprehensive metadata schema and used it to assess public availability and findability of CDOM from German population-based observational studies participating in the consortium National Research Data Infrastructure for Personal Health Data (NFDI4Health). Additionally, principal investigators from the included studies completed a checklist evaluating consistency with FAIR principles (Findability, Accessibility, Interoperability, Reusability) within their studies. Overall, six of sixteen studies had complete publicly available CDOM. The most frequent CDOM source was scientific publications, and the most frequently missing metadata were codes of the International Classification of Diseases, Tenth Revision (ICD-10). Principal investigators' main perceived barriers for consistency with FAIR principles were limited human and financial resources. Our results reveal that CDOM from German population-based studies have incomplete availability and limited findability. There is a need to make CDOM publicly available in searchable platforms or metadata catalogues to improve their FAIRness, which requires human and financial resources.
Affiliation(s)
- Carolina Schwedhelm
- Molecular Epidemiology Research Group, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, 13125, Germany
| | - Katharina Nimptsch
- Molecular Epidemiology Research Group, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, 13125, Germany
| | - Wolfgang Ahrens
- Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen, 28359, Germany
- Institute of Statistics, Faculty of Mathematics and Computer Science, University of Bremen, Bremen, 28334, Germany
| | - Hans Martin Hasselhorn
- Department of Occupational Health Science, University of Wuppertal, Wuppertal, 42119, Germany
| | - Karl-Heinz Jöckel
- Institute for Medical Informatics, Biometry and Epidemiology, University Hospital of Essen, Essen, 45122, Germany
| | - Verena Katzke
- Division of Cancer Epidemiology, German Cancer Research Center (DKFZ), Heidelberg, 69120, Germany
| | - Alexander Kluttig
- Institute of Medical Epidemiology, Biometrics, and Informatics, Interdisciplinary Center for Health Sciences, Medical Faculty of the Martin-Luther-University Halle-Wittenberg, Halle (Saale), 06112, Germany
| | - Birgit Linkohr
- Institute of Epidemiology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, 85764, Germany
| | - Rafael Mikolajczyk
- Institute of Medical Epidemiology, Biometrics, and Informatics, Interdisciplinary Center for Health Sciences, Medical Faculty of the Martin-Luther-University Halle-Wittenberg, Halle (Saale), 06112, Germany
- DZPG (German Center for Mental Health), partner site Halle-Jena-Magdeburg, 07743, Jena, Germany
| | - Ute Nöthlings
- Institute of Nutrition and Food Sciences, Nutritional Epidemiology, University of Bonn, Bonn, 53115, Germany
| | - Ines Perrar
- Institute of Nutrition and Food Sciences, Nutritional Epidemiology, University of Bonn, Bonn, 53115, Germany
| | - Annette Peters
- Institute of Epidemiology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, 85764, Germany
- Institute for Medical Information Processing, Biometry and Epidemiology, Department of Epidemiology, Medical Faculty of the Ludwig-Maximilians-Universität München, Munich, 81377, Germany
| | - Carsten O Schmidt
- Institute for Community Medicine, University Medicine Greifswald, Greifswald, 17489, Germany
| | - Börge Schmidt
- Institute for Medical Informatics, Biometry and Epidemiology, University Hospital of Essen, Essen, 45122, Germany
| | - Matthias B Schulze
- Department of Molecular Epidemiology, German Institute of Human Nutrition Potsdam Rehbruecke, Nuthetal, 14558, Germany
- Institute of Nutritional Science, University of Potsdam, Nuthetal, 14558, Germany
| | - Andreas Stang
- Institute for Medical Informatics, Biometry and Epidemiology, University Hospital of Essen, Essen, 45122, Germany
- Department of Epidemiology, School of Public Health, Boston University, Boston, MA, 02118, USA
| | - Hajo Zeeb
- Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen, 28359, Germany
- Faculty 11 - Human and Health Sciences, University of Bremen, Bremen, 28359, Germany
| | - Tobias Pischon
- Molecular Epidemiology Research Group, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, 13125, Germany
- Biobank Technology Platform, Max-Delbrueck-Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, 13125, Germany
- Core Facility Biobank, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, 13125, Germany
- Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, 10117, Germany
|
40
|
Lu S, Liu J, Wang X, Zhou Y. Collaborative Multi-Metadata Fusion to Improve the Classification of Lumbar Disc Herniation. IEEE Trans Med Imaging 2023; 42:3590-3601. [PMID: 37432809 DOI: 10.1109/tmi.2023.3294248] [Indexed: 07/13/2023]
Abstract
Computed tomography (CT) images are the most commonly used radiographic imaging modality for detecting and diagnosing lumbar diseases. Despite many outstanding advances, computer-aided diagnosis (CAD) of lumbar disc disease remains challenging due to the complexity of pathological abnormalities and poor discrimination between different lesions. Therefore, we propose a Collaborative Multi-Metadata Fusion classification network (CMMF-Net) to address these challenges. The network consists of a feature selection model and a classification model. We propose a novel Multi-scale Feature Fusion (MFF) module that can improve the edge learning ability of the network region of interest (ROI) by fusing features of different scales and dimensions. We also propose a new loss function to improve the convergence of the network to the internal and external edges of the intervertebral disc. Subsequently, we use the ROI bounding box from the feature selection model to crop the original image and calculate the distance feature matrix. We then concatenate the cropped CT images, multiscale fusion features, and distance feature matrices and input them into the classification network. Next, the model outputs the classification results and the class activation map (CAM). Finally, the CAM of the original image size is returned to the feature selection network during the upsampling process to achieve collaborative model training. Extensive experiments demonstrate the effectiveness of our method. The model achieved 91.32% accuracy in the lumbar spine disease classification task. In the labelled lumbar disc segmentation task, the Dice coefficient reaches 94.39%. The classification accuracy in the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) reaches 91.82%.
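For readers unfamiliar with the Dice coefficient reported for the segmentation task, the following is a minimal sketch of the standard definition, 2|A∩B| / (|A| + |B|), applied to flattened binary masks. The function name `dice_coefficient` and the toy masks are hypothetical, and the convention of returning 1.0 when both masks are empty is an assumption; the paper's own evaluation code is not shown in the abstract.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks (nonzero = foreground)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 2x2 masks flattened row by row.
predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 1, 0]
dice = dice_coefficient(predicted_mask, reference_mask)
```

A value of 1.0 means the predicted and reference regions coincide exactly, and 0.0 means they do not overlap at all.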
|
41
|
Gorade V, Mittal S, Singhal R. PaCL: Patient-aware contrastive learning through metadata refinement for generalized early disease diagnosis. Comput Biol Med 2023; 167:107569. [PMID: 37865984 DOI: 10.1016/j.compbiomed.2023.107569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Revised: 09/13/2023] [Accepted: 10/10/2023] [Indexed: 10/24/2023]
Abstract
Early diagnosis plays a pivotal role in effectively treating numerous diseases, especially in healthcare scenarios where prompt and accurate diagnoses are essential. Contrastive learning (CL) has emerged as a promising approach for medical tasks, offering advantages over traditional supervised learning methods. However, in healthcare, patient metadata contains valuable clinical information that can enhance representations, yet existing CL methods often overlook this data. In this study, we propose a novel approach that leverages both clinical information and imaging data in contrastive learning to enhance model generalization and interpretability. Furthermore, existing contrastive methods may be prone to sampling bias, which can lead to the model capturing spurious relationships and exhibiting unequal performance across protected subgroups frequently encountered in medical settings. To address these limitations, we introduce Patient-aware Contrastive Learning (PaCL), featuring an inter-class separability objective (IeSO) and an intra-class diversity objective (IaDO). IeSO harnesses rich clinical information to refine samples, while IaDO ensures the necessary diversity among samples to prevent class collapse. We demonstrate the effectiveness of PaCL both theoretically through causal refinements and empirically across six real-world medical imaging tasks spanning three imaging modalities: ophthalmology, radiology, and dermatology. Notably, PaCL outperforms previous techniques across all six tasks.
|
42
|
Csibra E, Stan GB. Parsley: a web app for parsing data from plate readers. Bioinformatics 2023; 39:btad733. [PMID: 38048610 PMCID: PMC10715767 DOI: 10.1093/bioinformatics/btad733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 10/16/2023] [Accepted: 12/01/2023] [Indexed: 12/06/2023] Open
Abstract
SUMMARY: As demand for the automation of biological assays has increased over recent years, the range of measurement types implemented by multiwell plate readers has broadened and the list of published software packages that caters to their analysis has grown. However, most plate readers export data in esoteric formats with little or no metadata, while most analytical software packages are built to work with tidy data accompanied by associated metadata. 'Parser' functions are therefore required to prepare raw data for analysis. Such functions are instrument- and data type-specific, and to date, no generic tool exists that can parse data from multiple data types or multiple plate readers, despite the potential for such a tool to speed up access to analysed data and remove an important barrier for less confident coders. We have developed the interactive web application, Parsley, to bridge this gap. Unlike conventional programmatic parser functions, Parsley makes few assumptions about exported data, instead employing user inputs to identify and extract data from data files. In doing so, it is designed to enable any user to parse plate reader data and can handle a wide variety of instruments (10+) and data types (53+). Parsley is freely available via a web interface, enabling access to its unique plate reader data parsing functionality, without the need to install software or write code. AVAILABILITY AND IMPLEMENTATION: The Parsley web application can be accessed at: https://gbstan.shinyapps.io/parsleyapp/. The source code is available at: https://github.com/ec363/parsleyapp and is archived on Zenodo: https://zenodo.org/records/10011752.
Affiliation(s)
- Eszter Csibra
  - Department of Bioengineering, Imperial College Centre for Synthetic Biology (IC-CSynB), Imperial College London, London SW7 2AY, United Kingdom
- Guy-Bart Stan
  - Department of Bioengineering, Imperial College Centre for Synthetic Biology (IC-CSynB), Imperial College London, London SW7 2AY, United Kingdom
|
43
|
Spinellis D. Open reproducible scientometric research with Alexandria3k. PLoS One 2023; 18:e0294946. [PMID: 38032908 PMCID: PMC10688655 DOI: 10.1371/journal.pone.0294946] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2023] [Accepted: 11/11/2023] [Indexed: 12/02/2023] Open
Abstract
Considerable scientific work involves locating, analyzing, systematizing, and synthesizing other publications, often with the help of online scientific publication databases and search engines. However, use of online sources suffers from a lack of repeatability and transparency, as well as from technical restrictions. Alexandria3k is a Python software package and an associated command-line tool that can populate embedded relational databases with slices from the complete set of several open publication metadata sets. These can then be employed for reproducible processing and analysis through versatile and performant queries. We demonstrate the software's utility by visualizing the evolution of publications in diverse scientific fields and relationships among them, by outlining scientometric facts associated with COVID-19 research, and by replicating commonly-used bibliometric measures and findings regarding scientific productivity, impact, and disruption.
Affiliation(s)
- Diomidis Spinellis
  - Department of Management Science and Technology, Athens University of Economics and Business, Athens, Greece
  - Department of Software Technology, Delft University of Technology, Delft, The Netherlands
|
44
|
Lo CM, Syu ZS. Analyzing drama metadata through machine learning to gain insights into social information dissemination patterns. PLoS One 2023; 18:e0288932. [PMID: 38032993 PMCID: PMC10688626 DOI: 10.1371/journal.pone.0288932] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Accepted: 07/03/2023] [Indexed: 12/02/2023] Open
Abstract
TV drama, through synchronization with social phenomena, allows the audience to resonate with the characters and desire to watch the next episode. In particular, drama ratings can be the criterion for advertisers to invest in ad placement and a predictor of subsequent economic efficiency in the surrounding areas. To identify the dissemination patterns of social information about dramas, this study used machine learning to predict drama ratings and the contribution of various drama metadata, including broadcast year, broadcast season, TV stations, day of the week, broadcast time slot, genre, screenwriters, status as an original work or sequel, actors and facial features on posters. A total of 800 Japanese TV dramas broadcast during prime time between 2003 and 2020 were collected for analysis. Four machine learning classifiers, including naïve Bayes, artificial neural network, support vector machine, and random forest, were used to combine the metadata. With facial features, the accuracy of the random forest model increased from 75.80% to 77.10%, which shows that poster information can improve the accuracy of the overall predicted ratings. Using only posters to predict ratings with a convolutional neural network still obtained an accuracy rate of 71.70%. More insights about the correlations between drama metadata and social information dissemination patterns were explored.
Affiliation(s)
- Chung-Ming Lo
  - Graduate Institute of Library, Information and Archival Studies, National Chengchi University, Taipei, Taiwan
- Zih-Sin Syu
  - Graduate Institute of Library, Information and Archival Studies, National Chengchi University, Taipei, Taiwan
|
45
|
Boiński TM. Photos and rendered images of LEGO bricks. Sci Data 2023; 10:811. [PMID: 37980420 PMCID: PMC10657460 DOI: 10.1038/s41597-023-02682-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2023] [Accepted: 10/23/2023] [Indexed: 11/20/2023] Open
Abstract
The paper describes a collection of datasets containing both LEGO brick renders and real photos. The datasets contain around 155,000 photos and nearly 1,500,000 renders. The renders aim to simulate real-life photos of LEGO bricks, allowing faster creation of extensive datasets. The datasets are publicly available via the Gdansk University of Technology "Most Wiedzy" institutional repository. The source files of all tools used during the creation of the dataset were made publicly available via GitHub repositories. The images, both photos and renders, were annotated with the unique brick ID and category from the official LEGO catalog. The proposed datasets are stored in easy-to-read formats and are labeled via directory structure, allowing easy manipulation and conversion of metadata to other formats.
Affiliation(s)
- Tomasz Maria Boiński
  - Gdańsk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Gdańsk, 80-233, Poland.
|
46
|
Zhao K, Farrell K, Mashiku M, Abay D, Tang K, Oberste MS, Burns CC. A search-based geographic metadata curation pipeline to refine sequencing institution information and support public health. Front Public Health 2023; 11:1254976. [PMID: 38035280 PMCID: PMC10683794 DOI: 10.3389/fpubh.2023.1254976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2023] [Accepted: 10/19/2023] [Indexed: 12/02/2023] Open
Abstract
Background: The National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) has amassed a vast reservoir of genetic data since its inception in 2007. These public data hold immense potential for supporting pathogen surveillance and control. However, the lack of standardized metadata and inconsistent submission practices in SRA may impede the data's utility in public health. Methods: To address this issue, we introduce the Search-based Geographic Metadata Curation (SGMC) pipeline. SGMC utilized Python and web scraping to extract geographic data of sequencing institutions from NCBI SRA in the Cloud and its website. It then harnessed ChatGPT to refine the sequencing institution and location assignments. To illustrate the pipeline's utility, we examined the geographic distribution of the sequencing institutions and their countries relevant to polio eradication and categorized them. Results: SGMC successfully identified 7,649 sequencing institutions and their global locations from a random selection of 2,321,044 SRA accessions. These institutions were distributed across 97 countries, with strong representation in the United States, the United Kingdom and China. However, there was a lack of data from African, Central Asian, and Central American countries, indicating potential disparities in sequencing capabilities. Comparison with manually curated data for U.S. institutions reveals SGMC's accuracy rates of 94.8% for institutions, 93.1% for countries, and 74.5% for geographic coordinates. Conclusion: SGMC may represent a novel approach using a generative AI model to enhance geographic data (country and institution assignments) for large numbers of samples within SRA datasets. This information can be utilized to bolster public health endeavors.
Affiliation(s)
- Kun Zhao
  - Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, United States
- Katie Farrell
  - Cherokee Nation Businesses, Contracting Agency to the Division of Viral Diseases, Centers for Disease Control and Prevention, Catoosa, OK, United States
- Melchizedek Mashiku
  - Cherokee Nation Businesses, Contracting Agency to the Division of Viral Diseases, Centers for Disease Control and Prevention, Catoosa, OK, United States
- Dawit Abay
  - Cherokee Nation Businesses, Contracting Agency to the Division of Viral Diseases, Centers for Disease Control and Prevention, Catoosa, OK, United States
- Kevin Tang
  - Division of Scientific Resources, National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, GA, United States
- M Steven Oberste
  - Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, United States
- Cara C Burns
  - Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, United States
|
47
|
Mallya P, Stevens LM, Zhao J, Hong C, Henao R, Economou-Zavlanos N, Wojdyla DM, Schibler T, Manchanda V, Pencina MJ, Hall JL. Facilitating Harmonization of Variables in Framingham, MESA, ARIC, and REGARDS Studies Through a Metadata Repository. Circ Cardiovasc Qual Outcomes 2023; 16:e009938. [PMID: 37850400 PMCID: PMC10841164 DOI: 10.1161/circoutcomes.123.009938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2023]
Abstract
BACKGROUND: High-quality research in cardiovascular prevention, as in other fields, requires inclusion of a broad range of data sets from different sources. Integrating and harmonizing different data sources are essential to increase generalizability, sample size, and representation of understudied populations, strengthening the evidence for the scientific questions being addressed. METHODS: Here, we describe an effort to build an open-access repository and interactive online portal for researchers to access the metadata and code harmonizing data from 4 well-known cohort studies: the REGARDS (Reasons for Geographic and Racial Differences in Stroke) study, FHS (Framingham Heart Study), MESA (Multi-Ethnic Study of Atherosclerosis), and ARIC (Atherosclerosis Risk in Communities) study. We introduce a methodology and a framework used for preprocessing and harmonizing variables from multiple studies. RESULTS: We provide a real-case study and step-by-step guidance to demonstrate the practical utility of our repository and interactive web page. In addition to our successful development of such an open-access repository and interactive web page, this exercise in harmonizing data from multiple cohort studies has revealed several key themes. These themes include the importance of careful preprocessing and harmonization of variables, the value of creating an open-access repository to facilitate collaboration and reproducibility, and the potential for using harmonized data to address important scientific questions and disparities in cardiovascular disease research. CONCLUSIONS: By integrating and harmonizing these large-scale cohort studies, such a repository may improve the statistical power and representation of understudied cohorts, enabling development and validation of risk prediction models, identification and investigation of risk factors, and creating a platform for racial disparities research. REGISTRATION: URL: https://precision.heart.org/duke-ninds.
Affiliation(s)
- Pratheek Mallya
  - American Heart Association, Dallas, TX (P.M., J.Z., V.M., J.L.H.)
- Laura M Stevens
  - University of Colorado Anschutz Medical School, Aurora (L.M.S.)
- Juan Zhao
  - American Heart Association, Dallas, TX (P.M., J.Z., V.M., J.L.H.)
- Chuan Hong
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC (C.H., R.H., M.P.)
  - Duke Clinical Research Institute, Durham, NC (C.H., R.H., D.W., T.S.)
- Ricardo Henao
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC (C.H., R.H., M.P.)
  - Duke Clinical Research Institute, Durham, NC (C.H., R.H., D.W., T.S.)
- Daniel M Wojdyla
  - Duke Clinical Research Institute, Durham, NC (C.H., R.H., D.W., T.S.)
- Tony Schibler
  - Duke Clinical Research Institute, Durham, NC (C.H., R.H., D.W., T.S.)
- Vihaan Manchanda
  - American Heart Association, Dallas, TX (P.M., J.Z., V.M., J.L.H.)
- Michael J Pencina
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC (C.H., R.H., M.P.)
- Jennifer L Hall
  - American Heart Association, Dallas, TX (P.M., J.Z., V.M., J.L.H.)
|
48
|
Mackenzie A, Lewis E, Loveland J. Successes and challenges in extracting information from DICOM image databases for audit and research. Br J Radiol 2023; 96:20230104. [PMID: 37698251 PMCID: PMC10607388 DOI: 10.1259/bjr.20230104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Revised: 05/05/2023] [Accepted: 05/11/2023] [Indexed: 09/13/2023] Open
Abstract
In radiography, much valuable associated data (metadata) is generated during image acquisition. The current setup of picture archive and communication systems (PACS) can make extraction of this metadata difficult, especially as it is typically stored with the image. The aim of this work is to examine the current challenges in extracting image metadata and to discuss the potential benefits of using this rich information. This work focuses on breast screening, though the conclusions are applicable to other modalities.
The data stored in PACS contain information, currently underutilised, that is of great benefit for auditing and improving imaging and radiographic practice. From the literature, we present examples of the potential clinical benefit, such as audits of dose and radiographic practice, as well as more advanced research highlighting the effects of radiographic practice, e.g. cancer detection rates affected by imaging technology.
This review considers the challenges in extracting data, namely:
- The search tools for data on most PACS are inadequate, being both time-consuming and limited in the elements that can be searched.
- Security and information governance considerations
- Anonymisation of data if required
- Data curation
The review describes some solutions that have been successfully implemented:
- Retrospective extraction: direct query on PACS
- Extracting data prospectively
- Use of structured reports
- Use of trusted research environments
Ultimately, the data access process will be made easier by inclusion during PACS procurement. Auditing data from PACS can be used to improve quality of imaging and workflow, all of which will be a clinical benefit to patients.
Affiliation(s)
- John Loveland
  - NCCPM, Royal Surrey NHS Foundation Trust, Guildford, United Kingdom
|
49
|
Hasan MAM, Maniruzzaman M, Shin J. Gene Expression and Metadata Based Identification of Key Genes for Hepatocellular Carcinoma Using Machine Learning and Statistical Models. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:3786-3799. [PMID: 37812547 DOI: 10.1109/tcbb.2023.3322753] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 10/11/2023]
Abstract
Biomarkers associated with hepatocellular carcinoma (HCC) are of great importance to better understand biological response mechanisms to internal or external intervention. The study aimed to identify key candidate genes for HCC using machine learning (ML) and statistics-based bioinformatics models. Differentially expressed genes (DEGs) were identified using limma, and the genes common to the DEGs identified from four datasets were then selected. After that, protein-protein interaction networks were constructed using STRING, and Cytoscape was then used to determine hub genes, significant modules, and their associated genes. Simultaneously, three ML-based techniques, namely support vector machine (SVM), least absolute shrinkage and selection operator-logistic regression (LASSO-LR), and partial least squares-discriminant analysis (PLS-DA), were implemented to determine the discriminative genes of HCC from common DEGs. Moreover, metadata of hub genes were formed by listing all hub genes from existing studies to incorporate other findings in our analysis. Finally, seven key candidate genes (ASPM, CCNB1, CDK1, DLGAP5, KIF20A, MT1X, and TOP2A) were identified by intersecting common genes among hub genes, significant module genes, discriminative genes from SVM, LASSO-LR, and PLS-DA, and meta hub genes from existing studies. Another three independent test datasets were also used to validate these seven key candidate genes using AUC, computed from ROC.
|
50
|
Mehari T, Strodthoff N. Towards Quantitative Precision for ECG Analysis: Leveraging State Space Models, Self-Supervision and Patient Metadata. IEEE J Biomed Health Inform 2023; 27:5326-5334. [PMID: 37656655 DOI: 10.1109/jbhi.2023.3310989] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/03/2023]
Abstract
Deep learning has emerged as the preferred modeling approach for automatic ECG analysis. In this study, we investigate three elements aimed at improving the quantitative accuracy of such systems. These components consistently enhance performance beyond the existing state-of-the-art, which is predominantly based on convolutional models. Firstly, we explore more expressive architectures by exploiting structured state space models (SSMs). These models have shown promise in capturing long-term dependencies in time series data. By incorporating SSMs into our approach, we not only achieve better performance, but also gain insights into long-standing questions in the field. Specifically, for standard diagnostic tasks, we find no advantage in using higher sampling rates such as 500 Hz compared to 100 Hz. Similarly, extending the input size of the model beyond 3 seconds does not lead to significant improvements. Secondly, we demonstrate that self-supervised learning using contrastive predictive coding can further improve the performance of SSMs. By leveraging self-supervision, we enable the model to learn more robust and representative features, leading to improved analysis accuracy. Lastly, we depart from synthetic benchmarking scenarios and incorporate basic demographic metadata alongside the ECG signal as input. This inclusion of patient metadata departs from the conventional practice of relying solely on the signal itself. Remarkably, this addition consistently yields positive effects on predictive performance. We firmly believe that all three components should be considered when developing next-generation ECG analysis algorithms.
|