1
Brown GS, Wengler J, Fabelico AJS, Muir A, Tubbs A, Warren A, Millett AN, Yu XX, Pavlidis P, Rogic S, Piccolo SR. Using semantic search to find publicly available gene-expression datasets. bioRxiv: the preprint server for biology 2025:2025.03.13.643153. [PMID: 40161731 PMCID: PMC11952526 DOI: 10.1101/2025.03.13.643153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
Abstract
Millions of high-throughput, molecular datasets have been shared in public repositories. Researchers can reuse such data to validate their own findings and explore novel questions. A frequent goal is to find multiple datasets that address similar research topics and to either combine them directly or integrate inferences from them. However, a major challenge is finding relevant datasets due to the vast number of candidates, inconsistencies in their descriptions, and a lack of semantic annotations. This challenge is first among the FAIR principles for scientific data. Here we focus on dataset discovery within Gene Expression Omnibus (GEO), a repository containing hundreds of thousands of data series. GEO supports queries based on keywords, ontology terms, and other annotations. However, reviewing these results is time-consuming and tedious, and it often misses relevant datasets. We hypothesized that language models could address this problem by summarizing dataset descriptions as numeric representations (embeddings). Assuming a researcher has previously found some relevant datasets, we evaluated the potential to find additional relevant datasets. For six human medical conditions, we used 30 models to generate embeddings for datasets that human curators had previously associated with the conditions and identified other datasets with the most similar descriptions. This approach was often, but not always, more effective than GEO's search engine. Our top-performing models were trained on general corpora, used contrastive-learning strategies, and used relatively large embeddings. Our findings suggest that language models have the potential to improve dataset discovery, perhaps in combination with existing search tools.
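The retrieval idea in this abstract (embed dataset descriptions, then look for candidates whose embeddings are closest to those of already-known relevant datasets) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-dimensional vectors are toy stand-ins for real model embeddings, the series names are hypothetical, and averaging the seed embeddings into one query vector is just one plausible way to combine them, not necessarily what the authors did.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def rank_candidates(seed_embeddings, candidate_embeddings):
    """Rank candidates by similarity to the centroid of the seed embeddings.

    Assumption: seeds are combined by simple averaging (illustrative only).
    """
    query = centroid(seed_embeddings)
    scored = [(name, cosine(query, vec))
              for name, vec in candidate_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (hypothetical); a real model would produce hundreds of dimensions.
seeds = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
candidates = {
    "series_A": [0.85, 0.15, 0.05],
    "series_B": [0.0, 0.1, 0.9],
    "series_C": [0.5, 0.5, 0.1],
}
ranking = rank_candidates(seeds, candidates)
for name, score in ranking:
    print(f"{name}\t{score:.3f}")
```

In practice the top of such a ranking would still need manual review, which matches the abstract's suggestion that embedding search may work best alongside existing keyword search.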
Affiliation(s)
- Grace S. Brown
  - Department of Biology, Brigham Young University, Provo, Utah, USA
- James Wengler
  - Department of Biology, Brigham Young University, Provo, Utah, USA
  - Institute of Biosciences and Technology, Texas A&M Health Science Center, Houston, TX, USA
- Abigail Muir
  - Department of Biology, Brigham Young University, Provo, Utah, USA
- Anna Tubbs
  - Department of Biology, Brigham Young University, Provo, Utah, USA
- Amanda Warren
  - Department of Biology, Brigham Young University, Provo, Utah, USA
- Alexandra N. Millett
  - Department of Psychiatry, University of British Columbia, Vancouver, British Columbia, Canada
  - Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
- Xinrui Xiang Yu
  - Department of Psychiatry, University of British Columbia, Vancouver, British Columbia, Canada
  - Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
- Paul Pavlidis
  - Department of Psychiatry, University of British Columbia, Vancouver, British Columbia, Canada
  - Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
- Sanja Rogic
  - Department of Psychiatry, University of British Columbia, Vancouver, British Columbia, Canada
  - Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
2
Stroggilos R, Tserga A, Zoidakis J, Vlahou A, Makridakis M. Tissue proteomics repositories for data reanalysis. Mass Spectrometry Reviews 2024; 43:1270-1284. [PMID: 37534389 DOI: 10.1002/mas.21860] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/14/2023] [Revised: 07/17/2023] [Accepted: 07/18/2023] [Indexed: 08/04/2023]
Abstract
We are approaching the third decade since the establishment of the very first proteomics repositories back in the mid-'00s. New experimental approaches and technologies continuously enrich the field while producing vast amounts of mass spectrometry data. Together with initiatives to establish standard terminology and file formats, proteomics is rapidly transforming into a mature component of systems biology. Here we describe the ProteomeXchange consortium repositories. We specifically search, collect and evaluate public human tissue datasets (categorized as "complete" by the repository) submitted in 2015-2022, to both map the existing information and assess the data set reusability. Human tissue data are variably represented in the repositories reviewed, ranging between 10% and 25% of the total data submitted, with cancers being the most represented, followed by neuronal and cardiovascular diseases. About half of the retrieved data sets were found to lack annotations or metadata necessary to directly replicate the analysis. This poses a rough challenge to data reusability and highlights the need to increase awareness of the mage-tab file format for metadata in the community. Overall, proteomics repositories have evolved greatly over the past 7 years, as they have grown in size and become equipped with various powerful applications and tools that enable data searching and analytical tasks. However, to make the most of this potential, priority must be given to finding ways to secure detailed metadata for each submission, which is likely the next major milestone for proteomics repositories.
Affiliation(s)
- Rafael Stroggilos
  - Biomedical Research Foundation, Academy of Athens, Department of Biotechnology, Athens, Greece
- Aggeliki Tserga
  - Biomedical Research Foundation, Academy of Athens, Department of Biotechnology, Athens, Greece
- Jerome Zoidakis
  - Biomedical Research Foundation, Academy of Athens, Department of Biotechnology, Athens, Greece
- Antonia Vlahou
  - Biomedical Research Foundation, Academy of Athens, Department of Biotechnology, Athens, Greece
- Manousos Makridakis
  - Biomedical Research Foundation, Academy of Athens, Department of Biotechnology, Athens, Greece
3
Marino GB, Clarke DJ, Lachmann A, Deng EZ, Ma’ayan A. RummaGEO: Automatic mining of human and mouse gene sets from GEO. Patterns (New York, N.Y.) 2024; 5:101072. [PMID: 39569206 PMCID: PMC11573963 DOI: 10.1016/j.patter.2024.101072] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/30/2024] [Revised: 07/22/2024] [Accepted: 09/11/2024] [Indexed: 11/22/2024]
Abstract
The Gene Expression Omnibus (GEO) has millions of samples from thousands of studies. While users of GEO can search the metadata describing studies, there is a need for methods to search GEO at the data level. RummaGEO is a gene expression signature search engine for human and mouse RNA sequencing perturbation studies extracted from GEO. To develop RummaGEO, we automatically identified groups of samples and computed differential expressions to extract gene sets from each study. The contents of RummaGEO are served for gene set, PubMed, and metadata search. Next, we analyzed the contents of RummaGEO to identify patterns and perform global analyses. Overall, RummaGEO provides a resource that is enabling users to identify relevant GEO studies based on their own gene expression results. Users of RummaGEO can incorporate RummaGEO into their analysis workflows for integrative analyses and hypothesis generation.
Affiliation(s)
- Giacomo B. Marino
  - Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Daniel J.B. Clarke
  - Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Alexander Lachmann
  - Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Eden Z. Deng
  - Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Avi Ma’ayan
  - Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
4
Marino GB, Clarke DJB, Deng EZ, Ma’ayan A. RummaGEO: Automatic Mining of Human and Mouse Gene Sets from GEO. bioRxiv: the preprint server for biology 2024:2024.04.09.588712. [PMID: 38645198 PMCID: PMC11030343 DOI: 10.1101/2024.04.09.588712] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/23/2024]
Abstract
The Gene Expression Omnibus (GEO) is a major open biomedical research repository for transcriptomics and other omics datasets. It currently contains millions of gene expression samples from tens of thousands of studies collected by many biomedical research laboratories from around the world. While users of the GEO repository can search the metadata describing studies for locating relevant datasets, there are currently no methods or resources that facilitate global search of GEO at the data level. To address this shortcoming, we developed RummaGEO, a webserver application that enables gene expression signature search of a large collection of human and mouse RNA-seq studies deposited into GEO. To develop the search engine, we performed offline automatic identification of sample conditions from the uniformly aligned GEO studies available from ARCHS4. We then computed differential expression signatures to extract gene sets from these studies. In total, RummaGEO currently contains 135,264 human and 158,062 mouse gene sets extracted from 23,395 GEO studies. Next, we analyzed the contents of the RummaGEO database to identify statistical patterns and perform various global analyses. The contents of the RummaGEO database are provided as a web-server search engine with signature search, PubMed search, and metadata search functionalities. Overall, RummaGEO provides an unprecedented resource for the biomedical research community enabling hypothesis generation for many future studies. The RummaGEO search engine is available from: https://rummageo.com/.
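To make the notion of searching a library of extracted gene sets with a user's own gene list concrete, here is a minimal sketch. It is not RummaGEO's implementation: the abstract does not state the scoring method, so Jaccard overlap is swapped in as a simple stand-in, and the library entries, gene identifiers, and the `search` helper are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two collections of gene identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def search(query_genes, library, top_k=2):
    """Return the top_k library gene sets most similar to the query.

    Assumption: similarity is plain Jaccard overlap (illustrative stand-in).
    """
    scored = [(name, jaccard(query_genes, genes))
              for name, genes in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical library: each key stands for a gene set extracted from one study.
library = {
    "study_1_up": {"G1", "G2", "G3", "G4"},
    "study_2_down": {"G3", "G4", "G5"},
    "study_3_up": {"G8", "G9"},
}
hits = search({"G2", "G3", "G4"}, library)
print(hits)
```

A production search over hundreds of thousands of gene sets would likely use a statistical enrichment test and an indexed data structure rather than a linear scan.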
Affiliation(s)
- Giacomo B. Marino
  - Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York 10029, NY USA
- Daniel J. B. Clarke
  - Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York 10029, NY USA
- Eden Z. Deng
  - Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York 10029, NY USA
- Avi Ma’ayan
  - Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York 10029, NY USA
5
Sheffield NC, LeRoy NJ, Khoroshevskyi O. Challenges to sharing sample metadata in computational genomics. Front Genet 2023; 14:1154198. [PMID: 37287537 PMCID: PMC10243526 DOI: 10.3389/fgene.2023.1154198] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/02/2023] [Accepted: 05/09/2023] [Indexed: 06/09/2023] Open
Affiliation(s)
- Nathan C. Sheffield
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
  - School of Data Science, University of Virginia, Charlottesville, VA, United States
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA, United States
  - Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA, United States
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Nathan J. LeRoy
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Oleksandr Khoroshevskyi
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
6
Yang M, Wu Y, Yang XB, Liu T, Zhang Y, Zhuo Y, Luo Y, Zhang N. Establishing a prediction model of severe acute mountain sickness using machine learning of support vector machine recursive feature elimination. Sci Rep 2023; 13:4633. [PMID: 36944699 PMCID: PMC10030784 DOI: 10.1038/s41598-023-31797-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2023] [Accepted: 03/17/2023] [Indexed: 03/23/2023] Open
Abstract
Severe acute mountain sickness (sAMS) can be life-threatening, but little is known about its genetic basis. The study was aimed to explore the genetic susceptibility of sAMS for the purpose of prediction, using microarray data from 112 peripheral blood mononuclear cell (PBMC) samples of 21 subjects, who were exposed to very high altitude (5260 m), low barometric pressure (406 mmHg), and hypobaric hypoxia (VLH) at various timepoints. We found that exposure to VLH activated gene expression in leukocytes, resulting in an inverted CD4/CD8 ratio that interacted with other phenotypic risk factors at the genetic level. A total of 2286 underlying risk genes were input into the support vector machine recursive feature elimination (SVM-RFE) system for machine learning, and a model with satisfactory predictive accuracy and clinical applicability was established for sAMS screening using ten featured genes with significant predictive power. Five featured genes (EPHB3, DIP2B, RHEBL1, GALNT13, and SLC8A2) were identified upstream of hypoxia- and/or inflammation-related pathways mediated by microRNAs as potential biomarkers for sAMS. The established prediction model of sAMS holds promise for clinical application as a genetic screening tool for sAMS.
Affiliation(s)
- Min Yang
  - Department of Traditional Chinese Medicine, Rheumatology Center of Integrated Medicine, The General Hospital of Western Theater Command, PLA, Chengdu, 610083, China
- Yang Wu
  - Department of Traditional Chinese Medicine, Rheumatology Center of Integrated Medicine, The General Hospital of Western Theater Command, PLA, Chengdu, 610083, China
- Xing-Biao Yang
  - Department of Traditional Chinese Medicine, Rheumatology Center of Integrated Medicine, The General Hospital of Western Theater Command, PLA, Chengdu, 610083, China
- Tao Liu
  - Department of Traditional Chinese Medicine, Rheumatology Center of Integrated Medicine, The General Hospital of Western Theater Command, PLA, Chengdu, 610083, China
- Ya Zhang
  - Department of Traditional Chinese Medicine, Rheumatology Center of Integrated Medicine, The General Hospital of Western Theater Command, PLA, Chengdu, 610083, China
- Yue Zhuo
  - Department of Traditional Chinese Medicine, Rheumatology Center of Integrated Medicine, The General Hospital of Western Theater Command, PLA, Chengdu, 610083, China
- Yong Luo
  - Department of Traditional Chinese Medicine, Rheumatology Center of Integrated Medicine, The General Hospital of Western Theater Command, PLA, Chengdu, 610083, China
- Nan Zhang
  - Department of Hematology, The General Hospital of Western Theater Command, PLA, Chengdu, 610083, China
7
Chao A, Grossman J, Carberry C, Lai Y, Williams AJ, Minucci JM, Purucker ST, Szilagyi J, Lu K, Boggess K, Fry RC, Sobus JR, Rager JE. Integrative exposomic, transcriptomic, epigenomic analyses of human placental samples links understudied chemicals to preeclampsia. Environment International 2022; 167:107385. [PMID: 35952468 PMCID: PMC9552572 DOI: 10.1016/j.envint.2022.107385] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 01/31/2022] [Revised: 06/22/2022] [Accepted: 06/27/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND Environmental health research has recently undergone a dramatic shift, with ongoing technological advancements allowing for broader coverage of exposure and molecular biology signatures. Approaches to integrate such measures are still needed to increase understanding between systems-level exposure and biology. OBJECTIVES We address this gap by evaluating placental tissues to identify novel chemical-biological interactions associated with preeclampsia. This study tests the hypothesis that understudied chemicals are present in the human placenta and associated with preeclampsia-relevant disruptions, including overall case status (preeclamptic vs. normotensive patients) and underlying transcriptomic/epigenomic signatures. METHODS A non-targeted analysis based on high-resolution mass spectrometry was used to analyze placental tissues from a cohort of 35 patients with preeclampsia (n = 18) and normotensive (n = 17) pregnancies. Molecular feature data were prioritized for confirmation based on association with preeclampsia case status and confidence of chemical identification. All molecular features were evaluated for relationships to mRNA, microRNA, and CpG methylation (i.e., multi-omic) signature alterations involved in preeclampsia. RESULTS A total of 183 molecular features were identified with significantly differentiated abundance in placental extracts of preeclamptic patients; these features clustered into distinct chemical groupings using unsupervised methods. Of these features, 53 were identified (mapping to 40 distinct chemicals) using chemical standards, fragmentation spectra, and chemical metadata. In general, human metabolites had the largest feature intensities and strongest associations with preeclampsia-relevant multi-omic changes. Exogenous drugs were second most abundant and had fewer associations with multi-omic changes. Other exogenous chemicals (non-drugs) were least abundant and had the fewest associations with multi-omic changes. 
CONCLUSIONS These global data trends suggest that human metabolites are heavily intertwined with biological processes involved in preeclampsia etiology, while exogenous chemicals may still impact select transcriptomic/epigenomic processes. This study serves as a demonstration of merging systems exposures with systems biology to better understand chemical-disease relationships.
Affiliation(s)
- Alex Chao
  - U.S. Environmental Protection Agency, Office of Research and Development, Center for Computational Toxicology and Exposure, Chemical Characterization and Exposure Division, Research Triangle Park, NC, USA
- Celeste Carberry
  - Department of Environmental Sciences and Engineering, Gillings School of Global Public Health, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - The Institute for Environmental Health Solutions, Gillings School of Global Public Health, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Yunjia Lai
  - Department of Environmental Sciences and Engineering, Gillings School of Global Public Health, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Antony J. Williams
  - U.S. Environmental Protection Agency, Office of Research and Development, Center for Computational Toxicology and Exposure, Chemical Characterization and Exposure Division, Research Triangle Park, NC, USA
- Jeffrey M. Minucci
  - U.S. Environmental Protection Agency, Office of Research and Development, Center for Public Health and Environmental Assessment, Public Health and Environmental Systems Division, Research Triangle Park, NC, USA
- S. Thomas Purucker
  - U.S. Environmental Protection Agency, Office of Research and Development, Center for Computational Toxicology and Exposure, Great Lakes Toxicology and Ecology Division, Research Triangle Park, NC, USA
- John Szilagyi
  - Department of Environmental Sciences and Engineering, Gillings School of Global Public Health, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - The Institute for Environmental Health Solutions, Gillings School of Global Public Health, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Kun Lu
  - Department of Environmental Sciences and Engineering, Gillings School of Global Public Health, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - The Institute for Environmental Health Solutions, Gillings School of Global Public Health, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Curriculum in Toxicology and Environmental Medicine, School of Medicine, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Kim Boggess
  - Department of Obstetrics and Gynecology, Division of Maternal Fetal Medicine, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Rebecca C. Fry
  - Department of Environmental Sciences and Engineering, Gillings School of Global Public Health, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - The Institute for Environmental Health Solutions, Gillings School of Global Public Health, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Curriculum in Toxicology and Environmental Medicine, School of Medicine, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Jon R. Sobus
  - U.S. Environmental Protection Agency, Office of Research and Development, Center for Computational Toxicology and Exposure, Chemical Characterization and Exposure Division, Research Triangle Park, NC, USA
- Julia E. Rager
  - Department of Environmental Sciences and Engineering, Gillings School of Global Public Health, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - The Institute for Environmental Health Solutions, Gillings School of Global Public Health, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Curriculum in Toxicology and Environmental Medicine, School of Medicine, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
8
Ko YK, Gim JA. New Drug Development and Clinical Trial Design by Applying Genomic Information Management. Pharmaceutics 2022; 14:1539. [PMID: 35893795 PMCID: PMC9330622 DOI: 10.3390/pharmaceutics14081539] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/01/2022] [Revised: 07/19/2022] [Accepted: 07/22/2022] [Indexed: 02/04/2023] Open
Abstract
Depending on the patients' genotype, the same drug may have different efficacies or side effects. With the cost of genomic analysis decreasing and reliability of analysis methods improving, vast amount of genomic information has been made available. Several studies in pharmacology have been based on genomic information to select the optimal drug, determine the dose, predict efficacy, and prevent side effects. This paper reviews the tissue specificity and genomic information of cancer. If the tissue specificity of cancer is low, cancer is induced in various organs based on a single gene mutation. Basket trials can be performed for carcinomas with low tissue specificity, confirming the efficacy of one drug for a single gene mutation in various carcinomas. Conversely, if the tissue specificity of cancer is high, cancer is induced in only one organ based on a single gene mutation. An umbrella trial can be performed for carcinomas with a high tissue specificity. Some drugs are effective for patients with a specific genotype. A companion diagnostic strategy that prescribes a specific drug for patients selected with a specific genotype is also reviewed. Genomic information is used in pharmacometrics to identify the relationship among pharmacokinetics, pharmacodynamics, and biomarkers of disease treatment effects. Utilizing genomic information, sophisticated clinical trials can be designed that will be better suited to the patients of specific genotypes. Genomic information also provides prospects for innovative drug development. Through proper genomic information management, factors relating to drug response and effects can be determined by selecting the appropriate data for analysis and by understanding the structure of the data. Selecting pre-processing and appropriate machine-learning libraries for use as machine-learning input features is also necessary. Professional curation of the output result is also required. 
Personalized medicine can be realized using a genome-based customized clinical trial design.
Affiliation(s)
- Young Kyung Ko
  - Division of Pulmonary, Allergy and Critical Care Medicine, Department of Internal Medicine, Korea University Guro Hospital, Seoul 08308, Korea
- Jeong-An Gim
  - Medical Science Research Center, College of Medicine, Korea University Guro Hospital, Seoul 08308, Korea
9
Serna Garcia G, Leone M, Bernasconi A, Carman MJ. GeMI: interactive interface for transformer-based Genomic Metadata Integration. Database (Oxford) 2022; 2022:6600540. [PMID: 35657113 PMCID: PMC9216561 DOI: 10.1093/database/baac036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2022] [Revised: 03/26/2022] [Accepted: 04/26/2022] [Indexed: 11/15/2022]
Abstract
The Gene Expression Omnibus (GEO) is a public archive containing >4 million digital samples from functional genomics experiments collected over almost two decades. The accompanying metadata describing the experiments suffer from redundancy, inconsistency and incompleteness due to the prevalence of free text and the lack of well-defined data formats and their validation. To remedy this situation, we created Genomic Metadata Integration (GeMI; http://gmql.eu/gemi/), a web application that learns to automatically extract structured metadata (in the form of key-value pairs) from the plain text descriptions of GEO experiments. The extracted information can then be indexed for structured search and used for various downstream data mining activities. GeMI works in continuous interaction with its users. The natural language processing transformer-based model at the core of our system is a fine-tuned version of the Generative Pre-trained Transformer 2 (GPT2) model that is able to learn continuously from the feedback of the users thanks to an active learning framework designed for the purpose. As a part of such a framework, a machine learning interpretation mechanism (that exploits saliency maps) allows the users to understand easily and quickly whether the predictions of the model are correct and improves the overall usability. GeMI’s ability to extract attributes not explicitly mentioned (such as sex, tissue type, cell type, ethnicity and disease) allows researchers to perform specific queries and classification of experiments, which was previously possible only after spending time and resources with tedious manual annotation. The usefulness of GeMI is demonstrated on practical research use cases.
Database URL
http://gmql.eu/gemi/
Affiliation(s)
- Giuseppe Serna Garcia
  - Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
- Michele Leone
  - Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
- Anna Bernasconi
  - Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
- Mark J Carman
  - Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
10
Drug repurposing in silico screening platforms. Biochem Soc Trans 2022; 50:747-758. [PMID: 35285479 DOI: 10.1042/bst20200967] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 12/09/2021] [Revised: 02/08/2022] [Accepted: 02/21/2022] [Indexed: 12/15/2022]
Abstract
Over the last decade, for the first time, substantial efforts have been directed at the development of dedicated in silico platforms for drug repurposing, including initiatives targeting cancers and conditions as diverse as cryptosporidiosis, dengue, dental caries, diabetes, herpes, lupus, malaria, tuberculosis and Covid-19 related respiratory disease. This review outlines some of the exciting advances in the specific applications of in silico approaches to the challenge of drug repurposing and focuses particularly on where these efforts have resulted in the development of generic platform technologies of broad value to researchers involved in programmatic drug repurposing work. Recent advances in molecular docking methodologies and validation approaches, and their combination with machine learning or deep learning approaches are continually enhancing the precision of repurposing efforts. The meaningful integration of better understanding of molecular mechanisms with molecular pathway data and knowledge of disease networks is widening the scope for discovery of repurposing opportunities. The power of Artificial Intelligence is being gainfully exploited to advance progress in an integrated science that extends from the sub-atomic to the whole system level. There are many promising emerging developments but there are remaining challenges to be overcome in the successful integration of the new advances in useful platforms. In conclusion, the essential component requirements for development of powerful and well optimised drug repurposing screening platforms are discussed.
11
Fanidis D, Moulos P, Aidinis V. Fibromine is a multi-omics database and mining tool for target discovery in pulmonary fibrosis. Sci Rep 2021; 11:21712. [PMID: 34741074 PMCID: PMC8571330 DOI: 10.1038/s41598-021-01069-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/06/2021] [Accepted: 10/21/2021] [Indexed: 11/22/2022] Open
Abstract
Idiopathic pulmonary fibrosis is a lethal lung fibroproliferative disease with limited therapeutic options. Differential expression profiling of affected sites has been instrumental for involved pathogenetic mechanisms dissection and therapeutic targets discovery. However, there have been limited efforts to comparatively analyse/mine the numerous related publicly available datasets, to fully exploit their potential on the validation/creation of novel research hypotheses. In this context and towards that goal, we present Fibromine, an integrated database and exploration environment comprising of consistently re-analysed, manually curated transcriptomic and proteomic pulmonary fibrosis datasets covering a wide range of experimental designs in both patients and animal models. Fibromine can be accessed via an R Shiny application (http://www.fibromine.com/Fibromine) which offers dynamic data exploration and real-time integration functionalities. Moreover, we introduce a novel benchmarking system based on transcriptomic datasets underlying characteristics, resulting to dataset accreditation aiming to aid the user on dataset selection. Cell specificity of gene expression can be visualised and/or explored in several scRNA-seq datasets, in an effort to link legacy data with this cutting-edge methodology and paving the way to their integration. Several use case examples are presented, that, importantly, can be reproduced on-the-fly by a non-specialist user, the primary target and potential user of this endeavour.
Affiliation(s)
- Dionysios Fanidis
  - Institute for Bioinnovation, Biomedical Sciences Research Center "Alexander Fleming", 16672, Athens, Greece
- Panagiotis Moulos
  - Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center "Alexander Fleming", 16672, Athens, Greece
- Vassilis Aidinis
  - Institute for Bioinnovation, Biomedical Sciences Research Center "Alexander Fleming", 16672, Athens, Greece
|
12
|
Shang X, Shi LE, Taule D, Zhu ZZ. A Novel miRNA-mRNA Axis Involves in Regulating Transcriptional Disorders in Pancreatic Adenocarcinoma. Cancer Manag Res 2021; 13:5989-6004. [PMID: 34377019 PMCID: PMC8349199 DOI: 10.2147/cmar.s316935] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/21/2021] [Accepted: 07/10/2021] [Indexed: 12/11/2022]
Abstract
Background: Currently, there is still a lack of understanding about the mechanism and therapeutic targets of pancreatic adenocarcinoma (PAAD). The potential of miRNA-mRNA networks for the identification of regulatory mechanisms involved in PAAD development remains unexplored. Methods: We compared differentially expressed miRNAs (DEMIs) and differentially expressed genes (DEGs) in PAAD and normal tissues from the Gene Expression Omnibus (GEO) database. Transcription factors (TFs) were obtained from FunRich. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses of DEGs and DEMIs were implemented using the Database for Annotation, Visualization and Integrated Discovery (DAVID). Then, key miRNAs and targeted mRNAs were identified by assessment of their expression and prognosis in UALCAN and Kaplan-Meier plotters. In the last step, the candidate miRNA-mRNA selected was confirmed by real-time quantitative polymerase chain reaction (qRT-PCR). Results: We distinguished 62 significant DEMIs, 1314 upregulated DEGs, and 1110 downregulated DEGs. The top 10 TFs were identified. In total, 160 hub genes were obtained by intersecting the set of 2224 predicted targets with the set of significant DEGs, and we selected 8 key miRNAs. Furthermore, low expression of miR-455-3p in PAAD tissue was closely connected with poor prognosis, and only 5 target mRNAs were predicted to be increased in PAAD tissue with poor prognosis. Therefore, a novel miRNA-hub gene regulatory network in PAAD was constructed. Finally, in vitro experiments indicated that miR-455-3p expression was decreased in PAAD samples. HOXC4, DLG4, DYNLL1 and FBXO45 were validated by qRT-PCR as highly probable targets of miR-455-3p. Conclusion: A novel miRNA-mRNA axis has been discovered that may be involved in the regulation of transcriptional disorders and may affect the survival of PAAD patients, which would provide a novel strategy for the treatment of PAAD.
Affiliation(s)
- Xin Shang
  - The First Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong, People's Republic of China
- Lan-Er Shi
  - The First Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong, People's Republic of China
- Dina Taule
  - The First Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong, People's Republic of China
- Zhang-Zhi Zhu
  - The First Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong, People's Republic of China
|
13
|
Ghosh S, Börsch A, Ghosh S, Zavolan M. The transcriptional landscape of a hepatoma cell line grown on scaffolds of extracellular matrix proteins. BMC Genomics 2021; 22:238. [PMID: 33823809 PMCID: PMC8025518 DOI: 10.1186/s12864-021-07532-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2020] [Accepted: 03/14/2021] [Indexed: 11/10/2022]
Abstract
Background: The behavior of cells in vivo is complex and highly dynamic, as it results from an interplay between intercellular matrix proteins with surface receptors and other microenvironmental cues. Although the effects of the cellular niche have been investigated for a number of cell types using different molecular approaches, comprehensive assessments of how the global transcriptome responds to 3D scaffolds composed of various extracellular matrix (ECM) constituents at different concentrations are still lacking. Results: In this study, we explored the effects of two diverse extracellular matrix (ECM) components, Collagen I and Matrigel, on the transcriptional profile of cells in a cell culture system. Culturing Huh-7 cells on traditional cell culture plates (Control) or on the ECM components at different concentrations to modulate microenvironment properties, we have generated transcriptomics data that may be further explored to understand the differentiation and growth potential of this cell type for the development of 3D cultures. Our analysis infers transcription factors that are most responsible for the transcriptome response to the extracellular cues. Conclusion: Our data indicates that the Collagen I substrate induces a robust transcriptional response in the Huh-7 cells, distinct from that induced by Matrigel. Enhanced hepatocyte markers (ALB and miR-122) reveal a potentially robust remodelling towards primary hepatocytes. Our results aid in defining the appropriate culture and transcription pathways while using hepatoma cell lines. As systems mimicking the in vivo structure and function of liver cells are still being developed, our study could potentially circumvent bottlenecks of limited availability of primary hepatocytes for preclinical studies of drug targets. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-021-07532-2.
Affiliation(s)
- Souvik Ghosh
  - Biozentrum, University of Basel, Basel, Switzerland
- Anastasiya Börsch
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Mihaela Zavolan
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
|
14
|
Patra BG, Soltanalizadeh B, Deng N, Wu L, Maroufy V, Wu C, Zheng WJ, Roberts K, Wu H, Yaseen A. An informatics research platform to make public gene expression time-course datasets reusable for more scientific discoveries. Database (Oxford) 2020; 2020:baaa074. [PMID: 33247935 PMCID: PMC7698665 DOI: 10.1093/database/baaa074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2020] [Revised: 07/17/2020] [Accepted: 08/10/2020] [Indexed: 11/13/2022]
Abstract
The exponential growth of genomic/genetic data in the era of Big Data demands new solutions for making these data findable, accessible, interoperable and reusable. In this article, we present a web-based platform named Gene Expression Time-Course Research (GETc) Platform that enables the discovery and visualization of time-course gene expression data and analytical results from the NIH/NCBI-sponsored Gene Expression Omnibus (GEO). The analytical results are produced from an analytic pipeline based on the ordinary differential equation model. Furthermore, in order to extract scientific insights from these results and disseminate the scientific findings, close and efficient collaborations between domain-specific experts from biomedical and scientific fields and data scientists are required. Therefore, GETc provides several recommendation functions and tools to facilitate effective collaborations. The GETc platform is a very useful tool for researchers from the biomedical genomics community to present and communicate large numbers of analysis results from GEO. It is generalizable and broadly applicable across different biomedical research areas. GETc is a user-friendly and efficient web-based platform freely accessible at http://genestudy.org/.
Affiliation(s)
- Braja Gopal Patra
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler Street, Houston, TX 77030, USA
- Babak Soltanalizadeh
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler Street, Houston, TX 77030, USA
- Nan Deng
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler Street, Houston, TX 77030, USA
- Leqing Wu
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler Street, Houston, TX 77030, USA
- Vahed Maroufy
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler Street, Houston, TX 77030, USA
- Canglin Wu
  - TechWave International, Inc., Houston, TX, USA
- W Jim Zheng
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St. Suite 600, Houston, TX 77030, USA
- Kirk Roberts
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St. Suite 600, Houston, TX 77030, USA
- Hulin Wu
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler Street, Houston, TX 77030, USA
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St. Suite 600, Houston, TX 77030, USA
- Ashraf Yaseen
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler Street, Houston, TX 77030, USA
|
15
|
Screening and Functional Prediction of Key Candidate Genes in Hepatitis B Virus-Associated Hepatocellular Carcinoma. BIOMED RESEARCH INTERNATIONAL 2020; 2020:7653506. [PMID: 33102593 PMCID: PMC7568806 DOI: 10.1155/2020/7653506] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/08/2020] [Revised: 07/14/2020] [Accepted: 08/03/2020] [Indexed: 02/06/2023]
Abstract
Background: The molecular mechanism by which hepatitis B virus (HBV) induces hepatocellular carcinoma (HCC) is still unknown. The genomic expression profile and bioinformatics methods were used to investigate the potential pathogenesis and therapeutic targets for HBV-associated HCC (HBV-HCC). Methods: The microarray dataset GSE55092 was downloaded from the Gene Expression Omnibus (GEO) database. The data was analyzed by the bioinformatics software to find differentially expressed genes (DEGs). Gene Ontology (GO) enrichment analysis, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis, ingenuity pathway analysis (IPA), and protein-protein interaction (PPI) network analysis were then performed on DEGs. The hub genes were identified using Centiscape2.2 and Molecular Complex Detection (MCODE) in the Cytoscape software (Cytoscape_v3.7.2). The survival data of these hub genes was downloaded from the Gene Expression Profiling Interactive Analysis (GEPIA). Results: A total of 2264 mRNA transcripts were differentially expressed, including 764 upregulated and 1500 downregulated in tumor tissues. GO analysis revealed that these DEGs were related to the small-molecule metabolic process, xenobiotic metabolic process, and cellular nitrogen compound metabolic process. KEGG pathway analysis revealed that metabolic pathways, complement and coagulation cascades, and chemical carcinogenesis were involved. Diseases and biofunctions showed that DEGs were mainly associated with the following diseases or biological function abnormalities: cancer, organismal injury and abnormalities, gastrointestinal disease, and hepatic system disease. The top 10 upstream regulators were predicted to be activated or inhibited by Z-score and identified 25 networks. The 10 genes with the highest degree of connectivity were defined as the hub genes. Cox regression revealed that all the 10 genes (CDC20, BUB1B, KIF11, TTK, EZH2, ZWINT, NDC80, TPX2, MELK, and KIF20A) were related to the overall survival. Conclusion: Our study provided a registry of genes that play important roles in regulating the development of HBV-HCC, assisting us in understanding the molecular mechanisms that underlie the carcinogenesis and progression of HCC.
|
16
|
Patra BG, Roberts K, Wu H. A content-based dataset recommendation system for researchers-a case study on Gene Expression Omnibus (GEO) repository. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020; 2020:1. [PMID: 33002137 PMCID: PMC7659921 DOI: 10.1093/database/baaa064] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 01/24/2020] [Revised: 07/19/2020] [Accepted: 07/27/2020] [Indexed: 11/13/2022]
Abstract
It is a growing trend among researchers to make their data publicly available for experimental reproducibility and data reusability. Sharing data with fellow researchers helps in increasing the visibility of the work. On the other hand, there are researchers who are inhibited by the lack of data resources. To overcome this challenge, many repositories and knowledge bases have been established to date to ease data sharing. Further, in the past two decades, there has been an exponential increase in the number of datasets added to these dataset repositories. However, most of these repositories are domain-specific, and none of them can recommend datasets to researchers/users. Naturally, it is challenging for a researcher to keep track of all the relevant repositories for potential use. Thus, a dataset recommender system that recommends datasets to a researcher based on previous publications can enhance their productivity and expedite further research. This work adopts an information retrieval (IR) paradigm for dataset recommendation. We hypothesize that two fundamental differences exist between dataset recommendation and PubMed-style biomedical IR beyond the corpus. First, instead of keywords, the query is the researcher, embodied by his or her publications. Second, to filter the relevant datasets from non-relevant ones, researchers are better represented by a set of interests, as opposed to the entire body of their research. This second approach is implemented using a non-parametric clustering technique. These clusters are used to recommend datasets for each researcher using the cosine similarity between the vector representations of publication clusters and datasets. The maximum normalized discounted cumulative gain at 10 (NDCG@10), precision at 10 (p@10) partial and p@10 strict of 0.89, 0.78 and 0.61, respectively, were obtained using the proposed method after manual evaluation by five researchers. To the best of our knowledge, this is the first study of its kind on content-based dataset recommendation. We hope that this system will further promote data sharing, offset the researchers' workload in identifying the right dataset and increase the reusability of biomedical datasets. Database URL: http://genestudy.org/recommends/#/.
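The ranking and evaluation steps named in this abstract (cosine similarity between a publication-cluster vector and dataset vectors, then p@10 and NDCG@10) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy vectors, and the choice to compute the ideal ranking only over the judged results are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_datasets(cluster_vec, dataset_vecs, k=10):
    """Return the ids of the k datasets most similar to one publication-cluster vector.

    dataset_vecs is a hypothetical mapping {dataset_id: vector}.
    """
    scored = [(ds_id, cosine_similarity(cluster_vec, vec))
              for ds_id, vec in dataset_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [ds_id for ds_id, _ in scored[:k]]


def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked ids that were judged relevant."""
    hits = sum(1 for ds_id in ranked_ids[:k] if ds_id in relevant_ids)
    return hits / k


def ndcg_at_k(gains, k=10):
    """NDCG@k for graded relevance judgments listed in ranked order.

    Assumption: the ideal ordering is taken over the judged results only.
    """
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Toy usage with made-up two-dimensional vectors.
toy_datasets = {"ds_a": [0.0, 1.0], "ds_b": [1.0, 0.0], "ds_c": [1.0, 1.0]}
top = rank_datasets([1.0, 0.0], toy_datasets, k=2)
print(top, precision_at_k(top, {"ds_b"}, k=2), ndcg_at_k([1, 3], k=10))
```

In this sketch a "strict" versus "partial" p@10 would differ only in which judgments are placed in `relevant_ids`; how the study defined those two categories is not stated in the abstract.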
Affiliation(s)
- Braja Gopal Patra
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler Street, Suite E-833, Houston, TX, 77030, USA
- Kirk Roberts
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St. Suite 600, Houston, TX, 77030, USA
- Hulin Wu
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler Street, Suite E-833, Houston, TX, 77030, USA
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St. Suite 600, Houston, TX, 77030, USA
|
17
|
A transcriptomic study of Williams-Beuren syndrome associated genes in mouse embryonic stem cells. Sci Data 2019; 6:262. [PMID: 31695049 PMCID: PMC6834640 DOI: 10.1038/s41597-019-0281-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/24/2019] [Accepted: 10/11/2019] [Indexed: 02/07/2023]
Abstract
Williams-Beuren syndrome (WBS) is a relatively rare disease caused by the deletion of 1.5 to 1.8 Mb on chromosome 7 which contains approximately 28 genes. This multisystem disorder is mainly characterized by supravalvular aortic stenosis, mental retardation, and distinctive facial features. We generated mouse embryonic stem (ES) cell clones expressing each of the 4 human WBS genes (WBSCR1, GTF2I, GTF2IRD1 and GTF2IRD2) found in the specific deleted region 7q11.23 causative of the WBS. We generated at least three stable clones for each gene with stable integration in the ROSA26 locus of a tetracycline-inducible upstream of the coding sequence of the gene tagged with a 3xFLAG epitope. Three clones for each gene were transcriptionally profiled in inducing versus non-inducing conditions for a total of 24 profiles. This small collection of human WBS-ES cell clones represents a resource to facilitate the study of the function of these genes during differentiation.
Measurement(s): transcription profiling assay • regulation of transcription, DNA-templated
Technology Type(s): microarray assay • gene overexpression
Factor Type(s): WBSCR1, GTF2I, GTF2IRD1 and GTF2IRD2
Sample Characteristic - Organism: Homo sapiens
Machine-accessible metadata file describing the reported data: 10.6084/m9.figshare.10003127
|