3
Jou J, Gabdank I, Luo Y, Lin K, Sud P, Myers Z, Hilton JA, Kagda MS, Lam B, O'Neill E, Adenekan P, Graham K, Baymuradov UK, Miyasato SR, Strattan JS, Jolanki O, Lee JW, Litton C, Tanaka FY, Hitz BC, Cherry JM. The ENCODE Portal as an Epigenomics Resource. Curr Protoc Bioinformatics 2020; 68:e89. [PMID: 31751002; PMCID: PMC7307447; DOI: 10.1002/cpbi.89]
Abstract
The Encyclopedia of DNA Elements (ENCODE) web portal hosts genomic data generated by the ENCODE Consortium, Genomics of Gene Regulation, the NIH Roadmap Epigenomics Consortium, and the modENCODE and modERN projects. The goal of the ENCODE project is to build a comprehensive map of the functional elements of the human and mouse genomes. Currently, the portal database stores over 500 TB of raw and processed data from over 15,000 experiments spanning assays that measure gene expression, DNA accessibility, DNA and RNA binding, DNA methylation, and 3D chromatin structure across numerous cell lines, tissue types, and differentiation states with selected genetic and molecular perturbations. The ENCODE portal provides unrestricted access to these data and the relevant metadata as a service to the scientific community. The metadata model captures the details of the experiments, raw and processed data files, and processing pipelines in human- and machine-readable form and enables the user to search for specific data either using a web browser or programmatically via a REST API. Furthermore, ENCODE data can be freely visualized or downloaded for additional analyses. © 2019 The Authors.
- Basic Protocol: Query the portal
- Support Protocol 1: Batch downloading
- Support Protocol 2: Using the cart to download files
- Support Protocol 3: Visualize data
- Alternate Protocol: Query building and programmatic access
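The programmatic access mentioned in this abstract can be illustrated with a minimal Python sketch that builds a search URL and requests JSON. The base URL, the `/search/` path, and the `type`, `assay_title`, `format`, and `limit` parameters are assumptions not stated in the abstract; check them against the portal's own REST API documentation before relying on them. The helper names `build_search_url` and `fetch_json` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed portal host; not given in the abstract.
ENCODE_BASE = "https://www.encodeproject.org"


def build_search_url(filters, limit=25):
    """Build a search URL asking for Experiment records as JSON.

    `filters` maps metadata field names to values; the field names
    used here are assumptions, not taken from the abstract.
    """
    params = {"type": "Experiment", **filters, "format": "json", "limit": limit}
    return f"{ENCODE_BASE}/search/?{urlencode(params)}"


def fetch_json(url):
    """Send a GET request and decode the JSON response body."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Building the URL needs no network access; fetch_json() does.
url = build_search_url({"assay_title": "ChIP-seq"}, limit=5)
print(url)
```

Keeping URL construction separate from the network call lets the same query be pasted into a web browser, which mirrors the two access routes the abstract describes.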
Affiliation(s)
- Jennifer Jou
- Department of Genetics, Stanford University, Stanford, California
- Idan Gabdank
- Department of Genetics, Stanford University, Stanford, California
- Yunhai Luo
- Department of Genetics, Stanford University, Stanford, California
- Khine Lin
- Department of Genetics, Stanford University, Stanford, California
- Paul Sud
- Department of Genetics, Stanford University, Stanford, California
- Zachary Myers
- Department of Genetics, Stanford University, Stanford, California
- Jason A Hilton
- Department of Genetics, Stanford University, Stanford, California
- Bonita Lam
- Department of Genetics, Stanford University, Stanford, California
- Emma O'Neill
- Department of Genetics, Stanford University, Stanford, California
- Philip Adenekan
- Department of Genetics, Stanford University, Stanford, California
- Keenan Graham
- Department of Genetics, Stanford University, Stanford, California
- J Seth Strattan
- Department of Genetics, Stanford University, Stanford, California
- Otto Jolanki
- Department of Genetics, Stanford University, Stanford, California
- Jin-Wook Lee
- Department of Genetics, Stanford University, Stanford, California
- Casey Litton
- Department of Genetics, Stanford University, Stanford, California
- Forrest Y Tanaka
- Department of Genetics, Stanford University, Stanford, California
- Benjamin C Hitz
- Department of Genetics, Stanford University, Stanford, California
- J Michael Cherry
- Department of Genetics, Stanford University, Stanford, California
4
Bernasconi A, Canakoglu A, Masseroli M, Ceri S. The road towards data integration in human genomics: players, steps and interactions. Brief Bioinform 2020; 22:30-44. [PMID: 32496509; DOI: 10.1093/bib/bbaa080] [Received: 11/21/2019] [Revised: 03/09/2020] [Accepted: 04/18/2020]
Abstract
Thousands of new experimental datasets are becoming available every day; in many cases, they are produced within the scope of large cooperative efforts, involving a variety of laboratories spread all over the world, and typically open for public use. Although the potential collective amount of available information is huge, the effective combination of such public sources is hindered by data heterogeneity, as the datasets exhibit a wide variety of notations and formats, concerning both experimental values and metadata. Thus, data integration is becoming a fundamental activity, to be performed prior to data analysis and biological knowledge discovery, consisting of subsequent steps of data extraction, normalization, matching and enrichment; once applied to heterogeneous data sources, it builds multiple perspectives over the genome, leading to the identification of meaningful relationships that could not be perceived by using incompatible data formats. In this paper, we first describe a technological pipeline from data production to data integration; we then propose a taxonomy of genomic data players (based on the distinction between contributors, repository hosts, consortia, integrators and consumers) and apply the taxonomy to describe about 30 important players in genomic data management. We specifically focus on the integrator players and analyse the issues in solving the genomic data integration challenges, as well as evaluate the computational environments that they provide to follow up data integration by means of visualization and analysis tools.
5
Chen Q, Zhang X, Wan Y, Zobel J, Verspoor K. Search Effectiveness in Nonredundant Sequence Databases: Assessments and Solutions. J Comput Biol 2018; 26:605-617. [PMID: 30585742; DOI: 10.1089/cmb.2018.0198]
Abstract
Duplicate sequence records (that is, records having similar or identical sequences) are a challenge when searching biological sequence databases. They significantly increase database search time and can lead to uninformative search results containing similar sequences. Sequence clustering methods have been used to address this issue by grouping similar sequences into clusters. These clusters form a nonredundant database consisting of representatives (one record per cluster) and members (the remaining records in a cluster). In this approach, for nonredundant database search, users search against representatives first and optionally expand search results by exploring member records from matching clusters. Existing studies used Precision and Recall to assess the search effectiveness of nonredundant databases. However, the use of Precision and Recall does not model user behavior in practice and thus may not reflect practical search effectiveness. In this study, we first propose innovative evaluation metrics to measure search effectiveness. The findings are that (1) the Precision of expanded sets is consistently lower than that of representatives, with a decrease of up to 7% at top ranks; and (2) Recall is uninformative because, for most queries, expanded sets return more records than does a search of the original unclustered databases. Motivated by these findings, we propose a solution that returns a user-specified proportion of top similar records, modeled by a ranking function that aggregates sequence and annotation similarities. In experiments undertaken on UniProtKB/Swiss-Prot, the largest expert-curated protein database, we show that our method dramatically reduces the number of returned sequences, increases Precision by 3%, and does not impact effective search time.
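The idea of returning a user-specified proportion of top similar member records can be sketched roughly as below. This is not the authors' ranking function, which the abstract does not specify. The weighted sum of the two similarities, the `weight` parameter, the rounding up of the cut-off, and the function name `expand_top_proportion` are all assumptions made for illustration.

```python
import math


def expand_top_proportion(members, proportion, weight=0.5):
    """Return IDs of the top-ranked share of a cluster's member records.

    members    -- list of (record_id, sequence_similarity, annotation_similarity)
    proportion -- fraction of members to return, between 0 and 1
    weight     -- share given to sequence similarity (assumed weighted sum)
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")

    def score(member):
        # Assumed aggregation: a linear blend of the two similarities.
        _, seq_sim, annot_sim = member
        return weight * seq_sim + (1.0 - weight) * annot_sim

    ranked = sorted(members, key=score, reverse=True)
    keep = math.ceil(proportion * len(ranked))  # round up (assumption)
    return [record_id for record_id, _, _ in ranked[:keep]]


# Hypothetical member records of one matching cluster.
cluster = [("A", 0.9, 0.2), ("B", 0.8, 0.8), ("C", 0.5, 0.5), ("D", 0.95, 0.9)]
print(expand_top_proportion(cluster, 0.5))
```

With `proportion` set to 1.0 this reduces to the plain expansion of a matching cluster, so the proportion acts as a dial between searching representatives only and returning every member.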
Affiliation(s)
- Qingyu Chen
- School of Computing and Information Systems, The University of Melbourne, Parkville, Australia
- Xiuzhen Zhang
- School of Science, RMIT University, Melbourne, Australia
- Yu Wan
- Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Parkville, Australia
- Justin Zobel
- School of Computing and Information Systems, The University of Melbourne, Parkville, Australia
- Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Parkville, Australia