1
Bernasconi A, Canakoglu A, Masseroli M, Ceri S. META-BASE: A Novel Architecture for Large-Scale Genomic Metadata Integration. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:543-557. [PMID: 32750853 DOI: 10.1109/tcbb.2020.2998954] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 06/11/2023]
Abstract
The integration of genomic metadata is, at the same time, an important, difficult, and well-recognized challenge. It is important because a wealth of public data repositories is available to drive biological and clinical research; combining information from various heterogeneous and widely dispersed sources is paramount to a number of biological discoveries. It is difficult because the domain is complex and there is no agreement among the various metadata definitions, which refer to different vocabularies and ontologies. It is well-recognized in the bioinformatics community because, in common practice, repositories are accessed one by one, learning their specific metadata definitions as a result of long and tedious efforts, and such practice is error-prone. In this paper, we describe META-BASE, an architecture for integrating metadata extracted from a variety of genomic data sources, based upon a structured transformation process. We present a variety of innovative techniques for data extraction, cleaning, normalization and enrichment. We propose a general, open and extensible pipeline that can easily incorporate any number of new data sources, and propose the resulting repository, which already integrates several important sources and is exposed by means of practical user interfaces to respond to biological researchers' needs.
2
Díaz-Santiago E, Claros MG, Yahyaoui R, de Diego-Otero Y, Calvo R, Hoenicka J, Palau F, Ranea JAG, Perkins JR. Decoding Neuromuscular Disorders Using Phenotypic Clusters Obtained From Co-Occurrence Networks. Front Mol Biosci 2021; 8:635074. [PMID: 34046427 PMCID: PMC8147726 DOI: 10.3389/fmolb.2021.635074] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/29/2020] [Accepted: 02/15/2021] [Indexed: 12/19/2022] Open
Abstract
Neuromuscular disorders (NMDs) represent an important subset of rare diseases associated with elevated morbidity and mortality whose diagnosis can take years. Here we present a novel approach using systems biology to produce functionally-coherent phenotype clusters that provide insight into the cellular functions and phenotypic patterns underlying NMDs, using the Human Phenotype Ontology as a common framework. Gene and phenotype information was obtained for 424 NMDs in OMIM and 126 NMDs in Orphanet, and 335 and 216 phenotypes were identified as typical for NMDs, respectively. ‘Elevated serum creatine kinase’ was the most specific to NMDs, in agreement with the clinical test of elevated serum creatine kinase that is conducted on NMD patients. The approach to obtain co-occurring NMD phenotypes was validated based on co-mention in PubMed abstracts. A total of 231 (OMIM) and 150 (Orphanet) clusters of highly connected co-occurrent NMD phenotypes were obtained. In parallel, a tripartite network based on phenotypes, diseases and genes was used to associate NMD phenotypes with functions, an approach also validated by literature co-mention, with KEGG pathways showing proportionally higher overlap than Gene Ontology and Reactome. Phenotype-function pairs were crossed with the co-occurrent NMD phenotype clusters to obtain 40 (OMIM) and 72 (Orphanet) functionally coherent phenotype clusters. As expected, many of these overlapped with known diseases and confirmed existing knowledge. Other clusters revealed interesting new findings, indicating informative phenotypes for differential diagnosis, providing deeper knowledge of NMDs, and pointing towards specific cell dysfunction caused by pleiotropic genes.
This work is an example of reproducible research that i) can help better understand NMDs and support their diagnosis by providing a new tool that exploits existing information to obtain novel clusters of functionally-related phenotypes, and ii) takes us another step towards personalised medicine for NMDs.
Affiliation(s)
- Elena Díaz-Santiago
- Department of Molecular Biology and Biochemistry, Universidad de Málaga, Málaga, Spain
- M Gonzalo Claros
- Department of Molecular Biology and Biochemistry, Universidad de Málaga, Málaga, Spain; CIBER de Enfermedades Raras (CIBERER), Madrid, Spain; Institute of Biomedical Research in Malaga (IBIMA), IBIMA-RARE, Málaga, Spain; Institute for Mediterranean and Subtropical Horticulture "La Mayora" (IHSM-UMA-CSIC), Málaga, Spain
- Raquel Yahyaoui
- Institute of Biomedical Research in Malaga (IBIMA), IBIMA-RARE, Málaga, Spain; Laboratory of Metabolopathies and Neonatal Screening, Málaga Regional University Hospital, Málaga, Spain
- Rocío Calvo
- Institute of Biomedical Research in Malaga (IBIMA), IBIMA-RARE, Málaga, Spain; Laboratory of Metabolopathies and Neonatal Screening, Málaga Regional University Hospital, Málaga, Spain
- Janet Hoenicka
- CIBER de Enfermedades Raras (CIBERER), Madrid, Spain; Sant Joan de Déu Hospital and Research Institute, Barcelona, Spain
- Francesc Palau
- CIBER de Enfermedades Raras (CIBERER), Madrid, Spain; Sant Joan de Déu Hospital and Research Institute, Barcelona, Spain; Hospital Clínic and University of Barcelona School of Medicine and Health Sciences, Barcelona, Spain
- Juan A G Ranea
- Department of Molecular Biology and Biochemistry, Universidad de Málaga, Málaga, Spain; CIBER de Enfermedades Raras (CIBERER), Madrid, Spain; Institute of Biomedical Research in Malaga (IBIMA), IBIMA-RARE, Málaga, Spain
- James R Perkins
- Department of Molecular Biology and Biochemistry, Universidad de Málaga, Málaga, Spain; CIBER de Enfermedades Raras (CIBERER), Madrid, Spain; Institute of Biomedical Research in Malaga (IBIMA), IBIMA-RARE, Málaga, Spain
3
An improved de novo assembling and polishing of Solea senegalensis transcriptome shed light on retinoic acid signalling in larvae. Sci Rep 2020; 10:20654. [PMID: 33244091 PMCID: PMC7691524 DOI: 10.1038/s41598-020-77201-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/19/2020] [Accepted: 11/06/2020] [Indexed: 12/17/2022] Open
Abstract
Senegalese sole is an economically important flatfish species in aquaculture and an attractive model to decipher the molecular mechanisms governing the severe transformations occurring during metamorphosis, where retinoic acid seems to play a key role in tissue remodeling. In this study, a robust sole transcriptome was envisaged by reducing the number of assembled libraries (27 out of 111 available), fine-tuning a new automated and reproducible set of workflows for de novo assembling based on several assemblers, and removing low confidence transcripts after mapping onto a sole female genome draft. From a total of 96 resulting assemblies, two "raw" transcriptomes, one containing only Illumina reads and another with Illumina and GS-FLX reads, were selected to provide SOLSEv5.0, the most informative transcriptome with low redundancy and devoid of most single-exon transcripts. It included both Illumina and GS-FLX reads and consisted of 51,348 transcripts of which 22,684 code for 17,429 different proteins described in databases, where 9527 were predicted as complete proteins. SOLSEv5.0 was used as reference for the study of retinoic acid (RA) signalling in sole larvae using drug treatments (DEAB, a RA synthesis blocker, and TTNPB, a RA-receptor agonist) for 24 and 48 h. Differential expression and functional interpretation were facilitated by an updated version of DEGenes Hunter. Acute exposure of both drugs triggered an intense, specific and transient response at 24 h but with hardly observable differences after 48 h at least in the DEAB treatments. Activation of RA signalling by TTNPB specifically increased the expression of genes in pathways related to RA degradation, retinol storage, carotenoid metabolism, homeostatic response and visual cycle, and also modified the expression of transcripts related to morphogenesis and collagen fibril organisation. 
In contrast, DEAB mainly decreased genes related to retinal production, impairing phototransduction signalling in the retina. A total of 755 transcripts mainly related to lipid metabolism, lipid transport and lipid homeostasis were altered in response to both treatments, indicating non-specific drug responses associated with intestinal absorption. These results indicate that a new assembling and transcript sieving were both necessary to provide a reliable transcriptome to identify the many aspects of RA action during sole development that are of relevance for sole aquaculture.
4
Canakoglu A, Bernasconi A, Colombo A, Masseroli M, Ceri S. GenoSurf: metadata driven semantic search system for integrated genomic datasets. Database: The Journal of Biological Databases and Curation 2020; 2019:5670757. [PMID: 31820804 PMCID: PMC6902006 DOI: 10.1093/database/baz132] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 07/28/2019] [Revised: 10/04/2019] [Accepted: 10/21/2019] [Indexed: 01/18/2023]
Abstract
Many valuable resources developed by world-wide research institutions and consortia describe genomic datasets that are both open and available for secondary research, but their metadata search interfaces are heterogeneous, not interoperable and sometimes with very limited capabilities. We implemented GenoSurf, a multi-ontology semantic search system providing access to a consolidated collection of metadata attributes found in the most relevant genomic datasets; values of 10 attributes are semantically enriched by making use of the most suited available ontologies. The user of GenoSurf provides as input the search terms, sets the desired level of ontological enrichment and obtains as output the identity of matching data files at the various sources. Search is facilitated by drop-down lists of matching values; aggregate counts describing resulting files are updated in real time while the search terms are progressively added. In addition to the consolidated attributes, users can perform keyword-based searches on the original (raw) metadata, which are also imported; GenoSurf supports the interplay of attribute-based and keyword-based search through well-defined interfaces. Currently, GenoSurf integrates about 40 million metadata of several major valuable data sources, including three providers of clinical and experimental data (TCGA, ENCODE and Roadmap Epigenomics) and two sources of annotation data (GENCODE and RefSeq); it can be used as a standalone resource for targeting the genomic datasets at their original sources (identified with their accession IDs and URLs), or as part of an integrated query answering system for performing complex queries over genomic regions and metadata.
Affiliation(s)
- Arif Canakoglu
- Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milan, Italy
- Anna Bernasconi
- Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milan, Italy
- Andrea Colombo
- Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milan, Italy
- Marco Masseroli
- Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milan, Italy
- Stefano Ceri
- Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milan, Italy
5
Dovrolis N, Kolios G, Spyrou GM, Maroulakou I. Computational profiling of the gut-brain axis: microflora dysbiosis insights to neurological disorders. Brief Bioinform 2020; 20:825-841. [PMID: 29186317 DOI: 10.1093/bib/bbx154] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 07/17/2017] [Revised: 10/17/2017] [Indexed: 12/14/2022] Open
Abstract
Almost 2500 years after Hippocrates' observations on health and its direct association to the gastrointestinal tract, a paradigm shift has recently occurred, making the gut and its symbionts (bacteria, fungi, archaea and viruses) a point of convergence for studies. It is nowadays well established that the gut microflora's compositional diversity regulates via its genes (the microbiome) the host's health and provides preliminary insights into disease progression and regulation. The microbiome's involvement is evident in immunological and physiological studies that link changes in its biodiversity to its contributions to the host's phenotype but also in neurological investigations, substantiating the aptly named gut-brain axis. The definitive mechanisms of this last bidirectional interaction will be our main focus because it presents researchers with a new conundrum. In this review, we prospect current literature for computational analysis methodologies that accommodate the need for better understanding of the microbiome-gut-brain interactions and neurological disorder onset and progression, through cross-disciplinary systems biology applications. We will present bioinformatics tools used in exploring these synergies that help build and interpret microbial 16S ribosomal RNA data sets, produced by shotgun and high-throughput sequencing of healthy and neurological disorder samples stored in biological databases. These approaches provide alternative means for researchers to form hypotheses to their inquests faster, cheaper and with precision. The goal of these studies relies on the integration of combined metagenomics and metabolomics assessments. An accurate characterization of the microbiome and its functionality can support new diagnostic, prognostic and therapeutic strategies for neurological disorders, customized for each individual host.
6
Alzubaidi A, Tepper J, Lotfi A. A novel deep mining model for effective knowledge discovery from omics data. Artif Intell Med 2020; 104:101821. [DOI: 10.1016/j.artmed.2020.101821] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/28/2019] [Revised: 01/23/2020] [Accepted: 02/17/2020] [Indexed: 10/24/2022]
7
Greshake Tzovaras B, Tzovara A. The Personal Data Is Political. Philosophical Studies Series 2019. [DOI: 10.1007/978-3-030-04363-6_8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 12/12/2022]
8
Corpas M, Kovalevskaya NV, McMurray A, Nielsen FGG. A FAIR guide for data providers to maximise sharing of human genomic data. PLoS Comput Biol 2018; 14:e1005873. [PMID: 29543799 PMCID: PMC5854239 DOI: 10.1371/journal.pcbi.1005873] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Indexed: 11/18/2022] Open
Abstract
It is generally acknowledged that, for reproducibility and progress of human genomic research, data sharing is critical. For every sharing transaction, a successful data exchange is produced between a data consumer and a data provider. Providers of human genomic data (e.g., publicly or privately funded repositories and data archives) fulfil their social contract with data donors when their shareable data conforms to FAIR (findable, accessible, interoperable, reusable) principles. Based on our experiences via Repositive (https://repositive.io), a leading discovery platform cataloguing all shared human genomic datasets, we propose guidelines for data providers wishing to maximise their shared data's FAIRness.
Affiliation(s)
- Manuel Corpas
- Repositive Ltd, Betjeman House, Cambridge, United Kingdom
9
Abstract
Next-Generation Sequencing (NGS) enables the rapid generation of billions of short nucleic acid sequence fragments (i.e., "sequencing reads"). In particular, gene expression profiling using whole transcriptome sequencing (i.e., "RNA-Seq") has been adopted rapidly. Here, we describe an in silico method, seq2HLA, that takes standard RNA-Seq reads as input and determines a sample's (classical and non-classical) HLA class I and class II types as well as HLA expression. We demonstrate the application of seq2HLA using publicly available RNA-Seq data from the Burkitt's lymphoma cell line DAUDI and the choriocarcinoma cell line JEG-3.
10
Abstract
Accessing the massive amount of breast cancer data that are currently publicly available may seem daunting to the brand new graduate student embarking on his/her first project or even to the seasoned lab leader, who may wish to explore a new avenue of investigation. In this review, we provide an overview of data resources focusing on high-throughput data and on cancer-related data resources. Although not intended as an exhaustive list, the information included in this review will provide a jumping-off point with descriptions of and links to the various data resources of interest. The review is divided into six sections: (1) compendia of data resources; (2) biomolecular repository “Hubs”; (3) a list of cancer-related data resources, which provides information on contents of the resource and whether the resource enables upload and analysis of investigator-provided data; (4) a list of seminal publications containing specific breast cancer data, e.g., publications from METABRIC, Sanger, TCGA; (5) a list of journals focused on data science that include cancer-related “Big Data”; and (6) miscellaneous resources.
Affiliation(s)
- Susan E Clare
- Department of Surgery, Feinberg School of Medicine, Northwestern University, 303 E. Superior Street, Lurie 4-113, Chicago, IL 60611
- Pamela L Shaw
- Biosciences & Bioinformatics Librarian, NIH Public Access Compliance Reporter (PACR), Galter Health Sciences Library, Feinberg School of Medicine, Northwestern University, 303 E. Chicago Ave., Room 1-285, Chicago, IL 60611