1
Xue B, Khoroshevskyi O, Gomez RA, Sheffield NC. Opportunities and challenges in sharing and reusing genomic interval data. Front Genet 2023; 14:1155809. [PMID: 37020996 PMCID: PMC10067617 DOI: 10.3389/fgene.2023.1155809] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2023] [Accepted: 03/07/2023] [Indexed: 03/22/2023] Open
Affiliation(s)
- Bingjie Xue
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Oleksandr Khoroshevskyi
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- R. Ariel Gomez
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Nathan C. Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA, United States
- School of Data Science, University of Virginia, Charlottesville, VA, United States
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- *Correspondence: Nathan C. Sheffield,
2
Sheffield NC, Bonazzi VR, Bourne PE, Burdett T, Clark T, Grossman RL, Spjuth O, Yates AD. From biomedical cloud platforms to microservices: next steps in FAIR data and analysis. Sci Data 2022; 9:553. [PMID: 36075919 PMCID: PMC9458632 DOI: 10.1038/s41597-022-01619-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2022] [Accepted: 08/08/2022] [Indexed: 11/29/2022] Open
Affiliation(s)
- Nathan C Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Philip E Bourne
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Tony Burdett
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Timothy Clark
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Robert L Grossman
- Center for Translational Data Science, University of Chicago, Chicago, IL 60615, USA
- Ola Spjuth
- Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, 75124 Uppsala, Sweden
- Andrew D Yates
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
3
Sheffield NC, Stolarczyk M, Reuter VP, Rendeiro AF. Linking big biomedical datasets to modular analysis with Portable Encapsulated Projects. Gigascience 2021; 10:6454632. [PMID: 34890448 PMCID: PMC8673555 DOI: 10.1093/gigascience/giab077] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/02/2020] [Revised: 04/20/2021] [Accepted: 11/04/2021] [Indexed: 12/26/2022] Open
Abstract
Background: Organizing and annotating biological sample data is critical in data-intensive bioinformatics. Unfortunately, metadata formats from a data provider are often incompatible with requirements of a processing tool. There is no broadly accepted standard to organize metadata across biological projects and bioinformatics tools, restricting the portability and reusability of both annotated datasets and analysis software. Results: To address this, we present the Portable Encapsulated Project (PEP) specification, a formal specification for biological sample metadata structure. The PEP specification accommodates typical features of data-intensive bioinformatics projects with many biological samples. In addition to standardization, the PEP specification provides descriptors and modifiers for project-level and sample-level metadata, which improve portability across both computing environments and data processing tools. PEPs include a schema validator framework, allowing formal definition of required metadata attributes for data analysis broadly. We have implemented packages for reading PEPs in both Python and R to provide a language-agnostic interface for organizing project metadata. Conclusions: The PEP specification is an important step toward unifying data annotation and processing tools in data-intensive biological research projects. Links to tools and documentation are available at http://pep.databio.org/.
Affiliation(s)
- Nathan C Sheffield
- Center for Public Health Genomics, University of Virginia, VA 22908, USA
- Department of Public Health Sciences, University of Virginia, VA 22908, USA
- Department of Biomedical Engineering, University of Virginia, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, University of Virginia, VA 22908, USA
- Michał Stolarczyk
- Center for Public Health Genomics, University of Virginia, VA 22908, USA
- Vincent P Reuter
- Center for Public Health Genomics, University of Virginia, VA 22908, USA
- Genomics and Computational Biology Graduate Group, University of Pennsylvania, PA 19087, USA
- André F Rendeiro
- Institute for Computational Biomedicine, Weill Cornell Medical College, NY 10021, USA
- Caryl and Israel Englander Institute for Precision Medicine, Weill Cornell Medical College, NY 10021, USA
4
Smith JP, Corces MR, Xu J, Reuter VP, Chang HY, Sheffield NC. PEPATAC: an optimized pipeline for ATAC-seq data analysis with serial alignments. NAR Genom Bioinform 2021; 3:lqab101. [PMID: 34859208 PMCID: PMC8632735 DOI: 10.1093/nargab/lqab101] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 02/08/2021] [Revised: 09/30/2021] [Accepted: 11/15/2021] [Indexed: 12/18/2022] Open
Abstract
As chromatin accessibility data from ATAC-seq experiments continues to expand, there is continuing need for standardized analysis pipelines. Here, we present PEPATAC, an ATAC-seq pipeline that is easily applied to ATAC-seq projects of any size, from one-off experiments to large-scale sequencing projects. PEPATAC leverages unique features of ATAC-seq data to optimize for speed and accuracy, and it provides several unique analytical approaches. Output includes convenient quality control plots, summary statistics, and a variety of generally useful data formats to set the groundwork for subsequent project-specific data analysis. Downstream analysis is simplified by a standard definition format, modularity of components, and metadata APIs in R and Python. It is restartable, fault-tolerant, and can be run on local hardware, using any cluster resource manager, or in provided Linux containers. We also demonstrate the advantage of aligning to the mitochondrial genome serially, which improves the accuracy of alignment statistics and quality control metrics. PEPATAC is a robust and portable first step for any ATAC-seq project. BSD2-licensed code and documentation are available at https://pepatac.databio.org.
Affiliation(s)
- Jason P Smith
- Center for Public Health Genomics, University of Virginia, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, University of Virginia, VA 22908, USA
- M Ryan Corces
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94304, USA
- Jin Xu
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94304, USA
- Vincent P Reuter
- Genomics and Computational Biology Graduate Group, University of Pennsylvania, PA 19087, USA
- Howard Y Chang
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94304, USA
- Nathan C Sheffield
- Center for Public Health Genomics, University of Virginia, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, University of Virginia, VA 22908, USA
- Department of Public Health Sciences, University of Virginia, VA 22908, USA
- Department of Biomedical Engineering, University of Virginia, VA 22908, USA