1
|
Guimarães PAS, Carvalho MGR, Ruiz JC. A computational framework for extracting biological insights from SRA cancer data. Sci Rep 2025; 15:8117. [PMID: 40057525 PMCID: PMC11890766 DOI: 10.1038/s41598-025-91781-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2024] [Accepted: 02/24/2025] [Indexed: 05/13/2025]
Abstract
The integration of sequenced samples and clinical data from independent yet related studies in public-domain databases, such as the Sequence Read Archive (SRA), has the potential to increase sample sizes and enhance the statistical power needed for more precise bioinformatic analysis. Data mining and sample grouping are the starting points in this process and still present several challenges, including the mix of structured and unstructured data, missing deposited data, and varying experimental conditions and techniques across studies. Designed to address the main challenges of data mining and sample grouping for biomarker research, the proposed methodology employs a computational approach that integrates relational database construction, text and data mining, natural language processing, network analysis, and PubMed publication searches, and combines the MeSH, TTD, and WordNet databases to identify groups of samples with the same characteristics. As a result, it identifies and illustrates relationships among sample collections, aiming to discover potential cancer biomarkers. In colorectal cancer (CRC) and acute lymphoblastic leukemia (ALL) case studies, this methodology effectively navigates SRA metadata, retrieving, extracting, and integrating data. It highlights significant connections between samples and patient clinical data, revealing important biological insights. The study grouped 2,737 (CRC) and 3,655 (ALL) samples into potential comparison groups, demonstrating the method's power in identifying relationships and aiding biomarker discovery.
Affiliation(s)
- Paul Anderson Souza Guimarães
- Grupo Informática de Biossistemas, Bioengenharia e Genômica, Instituto René Rachou, Fiocruz Minas, Av. Augusto de Lima, 1715, Barro Preto, Belo Horizonte, MG, Brazil
- Biologia Computacional e Sistemas (BCS), Instituto Oswaldo Cruz (IOC), Fiocruz, Rio de Janeiro, Brazil
| | - Maria Gabriela Reis Carvalho
- Grupo Informática de Biossistemas, Bioengenharia e Genômica, Instituto René Rachou, Fiocruz Minas, Av. Augusto de Lima, 1715, Barro Preto, Belo Horizonte, MG, Brazil.
- Biologia Computacional e Sistemas (BCS), Instituto Oswaldo Cruz (IOC), Fiocruz, Rio de Janeiro, Brazil.
| | - Jeronimo Conceição Ruiz
- Grupo Informática de Biossistemas, Bioengenharia e Genômica, Instituto René Rachou, Fiocruz Minas, Av. Augusto de Lima, 1715, Barro Preto, Belo Horizonte, MG, Brazil.
- Biologia Computacional e Sistemas (BCS), Instituto Oswaldo Cruz (IOC), Fiocruz, Rio de Janeiro, Brazil.
| |
|
2
|
Rymuza J, Sun Y, Zheng G, LeRoy N, Murach M, Phan N, Zhang A, Sheffield N. Methods for constructing and evaluating consensus genomic interval sets. Nucleic Acids Res 2024; 52:10119-10131. [PMID: 39180401 PMCID: PMC11417377 DOI: 10.1093/nar/gkae685] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/03/2023] [Revised: 07/05/2024] [Accepted: 07/29/2024] [Indexed: 08/26/2024]
Abstract
The amount of genomic region data continues to increase. Integrating across diverse genomic region sets requires consensus regions, which enable comparing regions across experiments, but also by necessity lose precision in region definitions. We require methods to assess this loss of precision and build optimal consensus region sets. Here, we introduce the concept of flexible intervals and propose three novel methods for building consensus region sets, or universes: a coverage cutoff method, a likelihood method, and a Hidden Markov Model. We then propose three novel measures for evaluating how well a proposed universe fits a collection of region sets: a base-level overlap score, a region boundary distance score, and a likelihood score. We apply our methods and evaluation approaches to several collections of region sets and show how these methods can be used to evaluate fit of universes and build optimal universes. We describe scenarios where the common approach of merging regions to create consensus leads to undesirable outcomes and provide principled alternatives that provide interoperability of interval data while minimizing loss of resolution.
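As a rough illustration of the coverage cutoff idea named in this abstract, the sketch below builds a consensus universe by keeping every base covered by at least `cutoff` region sets. It is a minimal reading of the abstract, not the authors' implementation: the function names, the half-open `(start, end)` convention, and the single-chromosome simplification are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # assumed half-open [start, end) on one chromosome


def merge_regions(regions: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so a set covers each base at most once."""
    merged: List[Interval] = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_cutoff_universe(region_sets: List[List[Interval]], cutoff: int) -> List[Interval]:
    """Return intervals where at least `cutoff` region sets overlap (hypothetical helper)."""
    if cutoff < 1:
        raise ValueError("cutoff must be at least 1")
    # Net change in coverage depth at each coordinate.
    delta: Dict[int, int] = defaultdict(int)
    for regions in region_sets:
        for start, end in merge_regions(regions):
            delta[start] += 1
            delta[end] -= 1
    universe: List[Interval] = []
    depth = 0
    open_start = 0
    for pos in sorted(delta):
        new_depth = depth + delta[pos]
        if depth < cutoff <= new_depth:
            open_start = pos  # coverage rises to the cutoff: open a consensus region
        elif new_depth < cutoff <= depth:
            universe.append((open_start, pos))  # coverage falls below the cutoff: close it
        depth = new_depth
    return universe


example_sets = [[(0, 10)], [(5, 15)], [(8, 20)]]
strict_universe = coverage_cutoff_universe(example_sets, cutoff=2)
union_universe = coverage_cutoff_universe(example_sets, cutoff=1)
```

With `cutoff=1` the result is the plain union (the merging approach the abstract cautions about), while higher cutoffs trade sensitivity for tighter region boundaries.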
Affiliation(s)
- Julia Rymuza
- Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| | - Yuchen Sun
- Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Guangtao Zheng
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Nathan J LeRoy
- Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
| | - Maria Murach
- Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| | - Neil Phan
- Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Aidong Zhang
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
| | - Nathan C Sheffield
- Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| |
|
3
|
Zheng G, Rymuza J, Gharavi E, LeRoy N, Zhang A, Sheffield N. Methods for evaluating unsupervised vector representations of genomic regions. NAR Genom Bioinform 2024; 6:lqae086. [PMID: 39131817 PMCID: PMC11316252 DOI: 10.1093/nargab/lqae086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Revised: 06/14/2024] [Accepted: 07/29/2024] [Indexed: 08/13/2024]
Abstract
Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of the supervision from curated metadata and can condense rich biological knowledge from publicly available data to region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results. To bridge this gap, we propose four evaluation metrics: the cluster tendency score (CTS), the reconstruction score (RCS), the genome distance scaling score (GDSS), and the neighborhood preserving score (NPS). The CTS and RCS statistically quantify how well region embeddings can be clustered and how well the embeddings preserve information in training data. The GDSS and NPS exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings in a set. We demonstrate the utility of these statistical and biological scores for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.
Affiliation(s)
- Guangtao Zheng
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Julia Rymuza
- Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| | - Erfaneh Gharavi
- Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
| | - Nathan J LeRoy
- Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
| | - Aidong Zhang
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
| | - Nathan C Sheffield
- Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| |
|
4
|
LeRoy N, Smith J, Zheng G, Rymuza J, Gharavi E, Brown D, Zhang A, Sheffield N. Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings. NAR Genom Bioinform 2024; 6:lqae073. [PMID: 38974799 PMCID: PMC11224678 DOI: 10.1093/nargab/lqae073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 04/29/2024] [Accepted: 06/20/2024] [Indexed: 07/09/2024]
Abstract
Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) are now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach by building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. We implemented our approach in scEmbed, an unsupervised machine-learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed performs well in terms of clustering ability and has the key advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, models pre-trained on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is implemented in Python, is open source, and is available on GitHub at https://github.com/databio/geniml. Pre-trained models from this work are publicly available on Hugging Face at https://huggingface.co/databio.
Affiliation(s)
- Nathan J LeRoy
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
| | - Jason P Smith
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| | - Guangtao Zheng
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Julia Rymuza
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| | - Erfaneh Gharavi
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
| | - Donald E Brown
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Aidong Zhang
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
| | - Nathan C Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| |
|
5
|
Lott MJ, Frankham GJ, Eldridge MDB, Alquezar‐Planas DE, Donnelly L, Zenger KR, Leigh KA, Kjeldsen SR, Field MA, Lemon J, Lunney D, Crowther MS, Krockenberger MB, Fisher M, Neaves LE. Reversing the decline of threatened koala (Phascolarctos cinereus) populations in New South Wales: Using genomics to enhance conservation outcomes. Ecol Evol 2024; 14:e11700. [PMID: 39091325 PMCID: PMC11289790 DOI: 10.1002/ece3.11700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Revised: 06/17/2024] [Accepted: 06/24/2024] [Indexed: 08/04/2024]
Abstract
Genetic management is a critical component of threatened species conservation. Understanding spatial patterns of genetic diversity is essential for evaluating the resilience of fragmented populations to accelerating anthropogenic threats. Nowhere is this more relevant than on the Australian continent, which is experiencing an ongoing loss of biodiversity that exceeds that of any other developed nation. Using a proprietary genome complexity reduction-based method (DArTSeq), we generated a data set of 3239 high-quality Single Nucleotide Polymorphisms (SNPs) to investigate spatial patterns and indices of genetic diversity in the koala (Phascolarctos cinereus), a highly specialised folivorous marsupial that is experiencing rapid and widespread population declines across much of its former range. Our findings demonstrate that current management divisions across the state of New South Wales (NSW) do not fully represent the distribution of genetic diversity among extant koala populations, and that care must be taken to ensure that translocation paradigms based on these frameworks do not inadvertently restrict gene flow between populations and regions that were historically interconnected. We also recommend that koala populations should be prioritised for conservation action based on the scale and severity of the threatening processes that they currently face, rather than placing too much emphasis on their perceived value (e.g., as reservoirs of potentially adaptive alleles), as our data indicate that existing genetic variation in koalas is primarily partitioned among individual animals. As such, the extirpation of koalas from any part of their range represents a potentially critical reduction of genetic diversity for this iconic Australian species.
Affiliation(s)
- Matthew J. Lott
- Australian Museum Research Institute, Sydney, New South Wales, Australia
| | | | | | | | - Lily Donnelly
- Molecular Ecology and Evolutionary Laboratory, College of Science and Engineering, James Cook University, Townsville, Queensland, Australia
| | - Kyall R. Zenger
- Centre for Sustainable Tropical Fisheries and Aquaculture, College of Science and Engineering, James Cook University, Townsville, Queensland, Australia
| | - Kellie A. Leigh
- Science for Wildlife Ltd, Mount Victoria, New South Wales, Australia
| | - Shannon R. Kjeldsen
- Molecular Ecology and Evolutionary Laboratory, College of Science and Engineering, James Cook University, Townsville, Queensland, Australia
- Centre for Sustainable Tropical Fisheries and Aquaculture, College of Science and Engineering, James Cook University, Townsville, Queensland, Australia
- Centre for Tropical Bioinformatics and Molecular Biology, James Cook University, Townsville, Queensland, Australia
| | - Matt A. Field
- Centre for Tropical Bioinformatics and Molecular Biology, James Cook University, Townsville, Queensland, Australia
- Immunogenomics Lab, Garvan Institute of Medical Research, Darlinghurst, New South Wales, Australia
| | - John Lemon
- JML Environmental Consultants, Armidale, New South Wales, Australia
- School of Environmental and Rural Science, University of New England, Armidale, New South Wales, Australia
| | - Daniel Lunney
- Australian Museum Research Institute, Sydney, New South Wales, Australia
- Department of Planning and Environment, Parramatta, New South Wales, Australia
- School of Life and Environmental Sciences, University of Sydney, Camperdown, New South Wales, Australia
| | - Mathew S. Crowther
- School of Life and Environmental Sciences, University of Sydney, Camperdown, New South Wales, Australia
| | - Mark B. Krockenberger
- Sydney School of Veterinary Science, University of Sydney, Camperdown, New South Wales, Australia
| | - Mark Fisher
- 3D Ecology Mapping, Emerald Beach, New South Wales, Australia
| | - Linda E. Neaves
- Fenner School of Environment and Society, The Australian National University, Canberra, Australian Capital Territory, Australia
| |
|
6
|
LeRoy NJ, Khoroshevskyi O, O’Brien A, Stępień R, Arslan A, Sheffield NC. PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata. bioRxiv 2024:2023.08.15.551388. [PMID: 37645717 PMCID: PMC10462087 DOI: 10.1101/2023.08.15.551388] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/31/2023]
Abstract
Background As biological data increases, we need additional infrastructure to share it and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important, and in some ways has a wider scope than sharing data itself. Results Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to find existing biological research data, or to share new data. Availability https://pephub.databio.org.
Affiliation(s)
- Nathan J. LeRoy
- Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, 22904, Charlottesville VA
| | - Oleksandr Khoroshevskyi
- Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
| | - Aaron O’Brien
- Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
| | - Rafał Stępień
- Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
| | - Alip Arslan
- Department of Computer Science, School of Engineering, University of Virginia, 22908, Charlottesville VA
| | - Nathan C. Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
- School of Data Science, University of Virginia, 22904, Charlottesville VA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, 22904, Charlottesville VA
- Department of Public Health Sciences, School of Medicine, University of Virginia, 22908, Charlottesville VA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, 22908, Charlottesville VA
- Child Health Research Center, School of Medicine, University of Virginia, 22908, Charlottesville VA
| |
|
7
|
Gharavi E, LeRoy NJ, Zheng G, Zhang A, Brown DE, Sheffield NC. Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets. Bioengineering (Basel) 2024; 11:263. [PMID: 38534537 PMCID: PMC10967841 DOI: 10.3390/bioengineering11030263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 02/20/2024] [Accepted: 02/22/2024] [Indexed: 03/28/2024]
Abstract
As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
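The retrieval tasks described in this abstract all reduce to ranking stored embeddings by their distance to a query embedding. The sketch below shows that ranking step only, under stated assumptions: the vectors are made-up toy values rather than trained co-embeddings, cosine similarity is one possible choice of distance, and the names `cosine_similarity` and `nearest_region_sets` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_region_sets(
    query: Sequence[float],
    database: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[str]:
    """Return the names of the k database entries most similar to the query embedding."""
    ranked = sorted(
        database,
        key=lambda name: cosine_similarity(query, database[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional embeddings; real region-set embeddings would come from a trained model.
toy_database = {
    "set_a": [1.0, 0.0],
    "set_b": [0.0, 1.0],
    "set_c": [1.0, 1.0],
}
top_hits = nearest_region_sets([1.0, 0.1], toy_database, k=2)
```

Because labels and region sets share one embedding space in the approach the abstract describes, the same ranking function could in principle serve text-to-region-set, region-set-to-label, and region-set-to-region-set queries; only the source of the query vector changes.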
Affiliation(s)
- Erfaneh Gharavi
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
| | - Nathan J. LeRoy
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
| | - Guangtao Zheng
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Aidong Zhang
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Donald E. Brown
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Nathan C. Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| |
|
8
|
LeRoy NJ, Khoroshevskyi O, O’Brien A, Stępień R, Arslan A, Sheffield NC. PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata. Gigascience 2024; 13:giae033. [PMID: 38991851 PMCID: PMC11238423 DOI: 10.1093/gigascience/giae033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 02/07/2024] [Accepted: 05/21/2024] [Indexed: 07/13/2024]
Abstract
BACKGROUND As biological data increase, we need additional infrastructure to share them and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important and in some ways has a wider scope than sharing data themselves. RESULTS Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural-language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural-language search. PEPhub thus provides a fast and user-friendly way to find existing biological research data or to share new data. AVAILABILITY https://pephub.databio.org.
Affiliation(s)
- Nathan J LeRoy
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
| | - Oleksandr Khoroshevskyi
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| | - Aaron O’Brien
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| | - Rafał Stępień
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| | - Alip Arslan
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Nathan C Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
| |
|
9
|
Sheffield NC, LeRoy NJ, Khoroshevskyi O. Challenges to sharing sample metadata in computational genomics. Front Genet 2023; 14:1154198. [PMID: 37287537 PMCID: PMC10243526 DOI: 10.3389/fgene.2023.1154198] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/02/2023] [Accepted: 05/09/2023] [Indexed: 06/09/2023]
Affiliation(s)
- Nathan C. Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- School of Data Science, University of Virginia, Charlottesville, VA, United States
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA, United States
| | - Nathan J. LeRoy
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
| | - Oleksandr Khoroshevskyi
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
| |
|