1
Hitz BC, Lee JW, Jolanki O, Kagda MS, Graham K, Sud P, Gabdank I, Strattan JS, Sloan CA, Dreszer T, Rowe LD, Podduturi NR, Malladi VS, Chan ET, Davidson JM, Ho M, Miyasato S, Simison M, Tanaka F, Luo Y, Whaling I, Hong EL, Lee BT, Sandstrom R, Rynes E, Nelson J, Nishida A, Ingersoll A, Buckley M, Frerker M, Kim DS, Boley N, Trout D, Dobin A, Rahmanian S, Wyman D, Balderrama-Gutierrez G, Reese F, Durand NC, Dudchenko O, Weisz D, Rao SSP, Blackburn A, Gkountaroulis D, Sadr M, Olshansky M, Eliaz Y, Nguyen D, Bochkov I, Shamim MS, Mahajan R, Aiden E, Gingeras T, Heath S, Hirst M, Kent WJ, Kundaje A, Mortazavi A, Wold B, Cherry JM. The ENCODE Uniform Analysis Pipelines. Res Sq 2023:rs.3.rs-3111932. [PMID: 37503119 PMCID: PMC10371165 DOI: 10.21203/rs.3.rs-3111932/v1] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 07/29/2023]
Abstract
The Encyclopedia of DNA elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19000 functional genomics experiments across more than 1000 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility as well as allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/) is publicly available in GitHub, with images available on Dockerhub (https://hub.docker.com), enabling access to a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud allows small labs the ability to use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections - a prerequisite for successful integrative analyses.
Affiliation(s)
- Benjamin C Hitz
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Jin-Wook Lee
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Otto Jolanki
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Meenakshi S Kagda
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Keenan Graham
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Paul Sud
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Idan Gabdank
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- J Seth Strattan
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Cricket A Sloan
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Timothy Dreszer
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Laurence D Rowe
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Nikhil R Podduturi
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Venkat S Malladi
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Esther T Chan
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Jean M Davidson
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Marcus Ho
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Stuart Miyasato
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Matt Simison
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Forrest Tanaka
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Yunhai Luo
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Ian Whaling
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Eurie L Hong
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Brian T Lee
  - Genomics Institute, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Richard Sandstrom
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Eric Rynes
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Jemma Nelson
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Andrew Nishida
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Alyssa Ingersoll
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Michael Buckley
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Mark Frerker
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Daniel S Kim
  - Department of Genetics, Department of Computer Science, Stanford University, 240 Pasteur Drive, Palo Alto, CA 94304, USA
- Nathan Boley
  - Department of Genetics, Department of Computer Science, Stanford University, 240 Pasteur Drive, Palo Alto, CA 94304, USA
- Diane Trout
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- Alex Dobin
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Sorena Rahmanian
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697, USA
- Dana Wyman
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697, USA
- Fairlie Reese
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697, USA
- Neva C Durand
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
  - Department of Computer Science, Rice University, Houston, TX 77030, USA
- Olga Dudchenko
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- David Weisz
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Suhas S P Rao
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Medicine, University of California San Francisco, San Francisco, CA 94143, USA
- Alyssa Blackburn
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
- Dimos Gkountaroulis
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
- Mahdi Sadr
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Moshe Olshansky
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Yossi Eliaz
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Dat Nguyen
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Ivan Bochkov
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Muhammad Saad Shamim
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
  - Department of Bioengineering, Rice University, Houston, TX 77030, USA
  - Medical Scientist Training Program, Baylor College of Medicine, Houston, TX 77030, USA
- Ragini Mahajan
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
  - Department of BioSciences, Rice University, Houston, TX 77005, USA
- Erez Aiden
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
- Tom Gingeras
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Simon Heath
  - CNAG-CRG, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology (BIST), Barcelona, Spain
  - Universitat Pompeu Fabra, Barcelona, Spain
- Martin Hirst
  - Michael Smith Laboratories, University of British Columbia, British Columbia, Canada
- W James Kent
  - Genomics Institute, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Anshul Kundaje
  - Department of Genetics, Department of Computer Science, Stanford University, 240 Pasteur Drive, Palo Alto, CA 94304, USA
- Ali Mortazavi
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697, USA
- Barbara Wold
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- J Michael Cherry
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
2
Hitz BC, Lee JW, Jolanki O, Kagda MS, Graham K, Sud P, Gabdank I, Strattan JS, Sloan CA, Dreszer T, Rowe LD, Podduturi NR, Malladi VS, Chan ET, Davidson JM, Ho M, Miyasato S, Simison M, Tanaka F, Luo Y, Whaling I, Hong EL, Lee BT, Sandstrom R, Rynes E, Nelson J, Nishida A, Ingersoll A, Buckley M, Frerker M, Kim DS, Boley N, Trout D, Dobin A, Rahmanian S, Wyman D, Balderrama-Gutierrez G, Reese F, Durand NC, Dudchenko O, Weisz D, Rao SSP, Blackburn A, Gkountaroulis D, Sadr M, Olshansky M, Eliaz Y, Nguyen D, Bochkov I, Shamim MS, Mahajan R, Aiden E, Gingeras T, Heath S, Hirst M, Kent WJ, Kundaje A, Mortazavi A, Wold B, Cherry JM. The ENCODE Uniform Analysis Pipelines. bioRxiv 2023:2023.04.04.535623. [PMID: 37066421 PMCID: PMC10104020 DOI: 10.1101/2023.04.04.535623] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Indexed: 04/18/2023]
Abstract
The Encyclopedia of DNA elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19000 functional genomics experiments across more than 1000 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility as well as allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/) is publicly available in GitHub, with images available on Dockerhub (https://hub.docker.com), enabling access to a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud allows small labs the ability to use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections - a prerequisite for successful integrative analyses.
Affiliation(s)
- Benjamin C Hitz
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Jin-Wook Lee
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Otto Jolanki
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Meenakshi S Kagda
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Keenan Graham
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Paul Sud
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Idan Gabdank
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- J Seth Strattan
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Cricket A Sloan
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Timothy Dreszer
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Laurence D Rowe
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Nikhil R Podduturi
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Venkat S Malladi
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Esther T Chan
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Jean M Davidson
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Marcus Ho
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Stuart Miyasato
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Matt Simison
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Forrest Tanaka
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Yunhai Luo
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Ian Whaling
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Eurie L Hong
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Brian T Lee
  - Genomics Institute, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Richard Sandstrom
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Eric Rynes
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Jemma Nelson
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Andrew Nishida
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Alyssa Ingersoll
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Michael Buckley
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Mark Frerker
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Daniel S Kim
  - Dept. of Genetics, Dept. of Computer Science, Stanford University, 240 Pasteur Drive, Palo Alto, CA 94304, USA
- Nathan Boley
  - Dept. of Genetics, Dept. of Computer Science, Stanford University, 240 Pasteur Drive, Palo Alto, CA 94304, USA
- Diane Trout
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- Alex Dobin
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Sorena Rahmanian
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697, USA
- Dana Wyman
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697, USA
- Fairlie Reese
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697, USA
- Neva C Durand
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
  - Department of Computer Science, Rice University, Houston, TX 77030, USA
- Olga Dudchenko
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- David Weisz
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Suhas S P Rao
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Medicine, University of California San Francisco, San Francisco, CA 94143, USA
- Alyssa Blackburn
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
- Dimos Gkountaroulis
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
- Mahdi Sadr
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Moshe Olshansky
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Yossi Eliaz
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Dat Nguyen
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Ivan Bochkov
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Muhammad Saad Shamim
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
  - Department of Bioengineering, Rice University, Houston, TX 77030, USA
  - Medical Scientist Training Program, Baylor College of Medicine, Houston, TX 77030, USA
- Ragini Mahajan
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
  - Department of BioSciences, Rice University, Houston, TX 77005, USA
- Erez Aiden
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
- Tom Gingeras
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Simon Heath
  - CNAG-CRG, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology (BIST), Barcelona, Spain
  - Universitat Pompeu Fabra, Barcelona, Spain
- Martin Hirst
  - Michael Smith Laboratories, University of British Columbia, British Columbia, Canada
- W James Kent
  - Genomics Institute, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Anshul Kundaje
  - Dept. of Genetics, Dept. of Computer Science, Stanford University, 240 Pasteur Drive, Palo Alto, CA 94304, USA
- Ali Mortazavi
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697, USA
- Barbara Wold
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- J Michael Cherry
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
3
Gafni E, Harvey A, Jaroszewicz A, Solari OS, Landolin J, Barbirou M, Miller A, Tonellato PJ, Kundaje A, Jeffrey SS, Curtis C, Sledge GW, Giresi P, Boley N. Abstract 2105: Cell-free DNA fragments inform epigenomic mechanisms for early detection of breast cancer. Cancer Res 2021. [DOI: 10.1158/1538-7445.am2021-2105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Introduction: Chromatin accessibility and cell-free DNA fragmentation patterns can be used to identify epigenomic mechanisms (Sharma et al. 2010) and infer cell-types contributing to cfDNA in pathological states such as cancer (Snyder et al. 2016; Ulz et al. 2017). We describe results from a novel blood-based cell-free DNA (cfDNA) assay using epigenomic signatures that have high sensitivity for detecting early stages of breast cancer, a cancer type that is characterized by low tumor burden (Phallen et al. 2017). We present the results from a prospective, case-control study demonstrating improved sensitivity to the screening mammogram and other published blood-based assays.
Methods: Assay performance was evaluated using a case-control study design enrolling 123 total subjects (58% Healthy, 18% Stage I, 13% Stage II, 11% Stage III). Cases were defined as subjects with a confirmatory diagnosis of invasive breast cancer, at any stage, by tissue biopsy. Controls were composed of subjects with either a negative finding by mammography (BI-RADS 1 or 2) or self-declared cancer-free. Whole blood samples were collected in Streck BCT tubes and shipped to a central laboratory for processing. Total cell-free DNA was extracted from plasma and prepped for next-generation sequencing. Sequencing libraries were enriched using a custom panel targeting genomic regions with distinct epigenomic activity in breast cancer. We trained a neural net to predict regulatory events in each of these regions, and then identified those events that were predictive of the presence of breast cancer. Final classification was performed by logistic regression over the predicted regulatory events.
Results: Performance was tested using a held-out test set and achieved an overall sensitivity of 92.5% (95% CI: 88.1%, 97%) at specificity of 88.9% with an overall AUC of 95.8%. Performance of screening mammography is reported to be 86.9% (95% CI: 86.3%, 87.6%) sensitive at 88.9% specificity on data obtained from six Breast Cancer Surveillance Consortium (BCSC) registries on 792808 women (Lehman et al. 2017).
Conclusion: These results support the utility for detecting epigenomic signals from cell-free DNA to enhance early detection of breast cancer. A prospective breast cancer screening study in a larger cohort is needed to further validate performance.
Citation Format: Erik Gafni, Adam Harvey, Artur Jaroszewicz, Omid Shams Solari, Jane Landolin, Mouadh Barbirou, Amanda Miller, Peter J. Tonellato, Anshul Kundaje, Stefanie S. Jeffrey, Christina Curtis, George W. Sledge, Paul Giresi, Nathan Boley. Cell-free DNA fragments inform epigenomic mechanisms for early detection of breast cancer [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2021; 2021 Apr 10-15 and May 17-21. Philadelphia (PA): AACR; Cancer Res 2021;81(13_Suppl):Abstract nr 2105.
Affiliation(s)
- Erik Gafni
  - Ravel Biotechnology Inc., San Francisco, CA
- Adam Harvey
  - Ravel Biotechnology Inc., San Francisco, CA
- Mouadh Barbirou
  - Biomedical Informatics, University of Missouri, Columbia, MO
- Amanda Miller
  - Biomedical Informatics, University of Missouri, Columbia, MO
- Paul Giresi
  - Ravel Biotechnology Inc., San Francisco, CA
4
Dachet F, Brown JB, Valyi-Nagy T, Narayan KD, Serafini A, Boley N, Gingeras TR, Celniker SE, Mohapatra G, Loeb JA. Selective time-dependent changes in activity and cell-specific gene expression in human postmortem brain. Sci Rep 2021; 11:6078. [PMID: 33758256 PMCID: PMC7988150 DOI: 10.1038/s41598-021-85801-6] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 09/24/2020] [Accepted: 02/24/2021] [Indexed: 12/15/2022]
Abstract
As a means to understand human neuropsychiatric disorders from human brain samples, we compared the transcription patterns and histological features of postmortem brain to fresh human neocortex isolated immediately following surgical removal. Compared to a number of neuropsychiatric disease-associated postmortem transcriptomes, the fresh human brain transcriptome had an entirely unique transcriptional pattern. To understand this difference, we measured genome-wide transcription as a function of time after fresh tissue removal to mimic the postmortem interval. Within a few hours, a selective reduction in the number of neuronal activity-dependent transcripts occurred with relative preservation of housekeeping genes commonly used as a reference for RNA normalization. Gene clustering indicated a rapid reduction in neuronal gene expression with a reciprocal time-dependent increase in astroglial and microglial gene expression that continued to increase for at least 24 h after tissue resection. Predicted transcriptional changes were confirmed histologically on the same tissue demonstrating that while neurons were degenerating, glial cells underwent an outgrowth of their processes. The rapid loss of neuronal genes and reciprocal expression of glial genes highlights highly dynamic transcriptional and cellular changes that occur during the postmortem interval. Understanding these time-dependent changes in gene expression in post mortem brain samples is critical for the interpretation of research studies on human brain disorders.
Affiliation(s)
- Fabien Dachet
  - University of Illinois at Chicago, Chicago, IL, 60612, USA
- James B Brown
  - Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Anna Serafini
  - University of Illinois at Chicago, Chicago, IL, 60612, USA
- Nathan Boley
  - University of California, Berkeley, CA, 94720, USA
- Jeffrey A Loeb
  - University of Illinois at Chicago, Chicago, IL, 60612, USA
5
Ursu O, Boley N, Taranova M, Wang YXR, Yardimci GG, Stafford Noble W, Kundaje A. GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs. Bioinformatics 2018; 34:2701-2707. [PMID: 29554289 DOI: 10.1093/bioinformatics/bty164] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 10/18/2017] [Accepted: 03/15/2018] [Indexed: 02/04/2023]
Abstract
Motivation: The three-dimensional organization of chromatin plays a critical role in gene regulation and disease. High-throughput chromosome conformation capture experiments such as Hi-C are used to obtain genome-wide maps of three-dimensional chromatin contacts. However, robust estimation of data quality and systematic comparison of these contact maps is challenging due to the multi-scale, hierarchical structure of chromatin contacts and the resulting properties of experimental noise in the data. Measuring concordance of contact maps is important for assessing reproducibility of replicate experiments and for modeling variation between different cellular contexts.
Results: We introduce a concordance measure called DIfferences between Smoothed COntact maps (GenomeDISCO) for assessing the similarity of a pair of contact maps obtained from chromosome conformation capture experiments. The key idea is to smooth contact maps using random walks on the contact map graph, before estimating concordance. We use simulated datasets to benchmark GenomeDISCO's sensitivity to different types of noise that affect chromatin contact maps. When applied to a large collection of Hi-C datasets, GenomeDISCO accurately distinguishes biological replicates from samples obtained from different cell types. GenomeDISCO also generalizes to other chromosome conformation capture assays, such as HiChIP.
Availability and implementation: Software implementing GenomeDISCO is available at https://github.com/kundajelab/genomedisco.
Supplementary information: Supplementary data are available at Bioinformatics online.
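The key idea stated in the abstract (smooth each contact map with a random walk on its graph, then compare) can be illustrated with a minimal Python sketch. This is not the GenomeDISCO implementation: the function names, the default of three walk steps, and the scoring rule (one minus the mean per-row L1 difference) are assumptions chosen for illustration, and the published software should be consulted for the actual method.

```python
def row_normalize(contacts):
    """Convert a square contact-count matrix into a random-walk transition matrix."""
    out = []
    for row in contacts:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def matmul(a, b):
    """Plain-Python product of two square matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def smooth(contacts, steps):
    """Smooth a contact map by taking `steps` random-walk steps on its graph."""
    p = row_normalize(contacts)
    result = p
    for _ in range(steps - 1):
        result = matmul(result, p)
    return result


def concordance(map_a, map_b, steps=3):
    """Illustrative score (assumed form): 1 minus the mean per-row L1 difference.

    Each non-empty row of a smoothed map sums to 1, so a per-row L1 difference
    is at most 2 and the score lies between -1 and 1.
    """
    sa, sb = smooth(map_a, steps), smooth(map_b, steps)
    l1 = sum(abs(x - y) for ra, rb in zip(sa, sb) for x, y in zip(ra, rb))
    return 1.0 - l1 / len(map_a)


# Hypothetical toy contact maps over three genomic bins.
replicate_1 = [[0, 5, 1], [5, 0, 5], [1, 5, 0]]
other_sample = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]
print(concordance(replicate_1, replicate_1))
print(concordance(replicate_1, other_sample))
```

Identical maps score exactly 1 under this rule, and maps with different contact structure score lower. Smoothing before comparison means that sparse, noisy counts are compared through their multi-step neighborhoods rather than entry by entry.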
Affiliation(s)
- Oana Ursu
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Nathan Boley
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Maryna Taranova
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Y X Rachel Wang
  - Department of Statistics, Stanford University, Stanford, CA, USA
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, WA, USA
  - Department of Computer Science and Engineering, University of Washington, WA, USA
- Anshul Kundaje
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - Department of Computer Science, Stanford University, Stanford, CA, USA
|
6
|
Wan N, Weinberg D, Liu TY, Niehaus K, Ariazi EA, Delubac D, Kannan A, White B, Bailey M, Bertin M, Boley N, Bowen D, Cregg J, Drake AM, Ennis R, Fransen S, Gafni E, Hansen L, Liu Y, Otte GL, Pecson J, Rice B, Sanderson GE, Sharma A, St John J, Tang C, Tzou A, Young L, Putcha G, Haque IS. Machine learning enables detection of early-stage colorectal cancer by whole-genome sequencing of plasma cell-free DNA. BMC Cancer 2019; 19:832. [PMID: 31443703 PMCID: PMC6708173 DOI: 10.1186/s12885-019-6003-8] [Citation(s) in RCA: 91] [Impact Index Per Article: 18.2] [Received: 02/10/2019] [Accepted: 07/31/2019] [Indexed: 02/06/2023]
Abstract
Background: Blood-based methods using cell-free DNA (cfDNA) are under development as an alternative to existing screening tests. However, early-stage detection of cancer using tumor-derived cfDNA has proven challenging because of the small proportion of cfDNA derived from tumor tissue in early-stage disease. A machine learning approach to discover signatures in cfDNA, potentially reflective of both tumor and non-tumor contributions, may represent a promising direction for the early detection of cancer. Methods: Whole-genome sequencing was performed on cfDNA extracted from plasma samples (N = 546 colorectal cancer and 271 non-cancer controls). Reads aligning to protein-coding gene bodies were extracted, and read counts were normalized. cfDNA tumor fraction was estimated using IchorCNA. Machine learning models were trained using k-fold cross-validation and confounder-based cross-validations to assess generalization performance. Results: In a colorectal cancer cohort heavily weighted towards early-stage cancer (80% stage I/II), we achieved a mean AUC of 0.92 (95% CI 0.91–0.93) with a mean sensitivity of 85% (95% CI 83–86%) at 85% specificity. Sensitivity generally increased with tumor stage and increasing tumor fraction. Stratification by age, sequencing batch, and institution demonstrated the impact of these confounders and provided a more accurate assessment of generalization performance. Conclusions: A machine learning approach using cfDNA achieved high sensitivity and specificity in a large, predominantly early-stage, colorectal cancer cohort. The possibility of systematic technical and institution-specific biases warrants similar confounder analyses in other studies. Prospective validation of this machine learning method and evaluation of a multi-analyte approach are underway. Electronic supplementary material: The online version of this article (10.1186/s12885-019-6003-8) contains supplementary material, which is available to authorized users.
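The abstract reports sensitivity at a fixed specificity. As a rough illustration of how such an operating point can be computed from classifier scores, here is a minimal sketch. It is not the study's code: the function name `sensitivity_at_specificity`, the thresholding rule, and the toy scores are assumptions made for the example.

```python
import numpy as np

def sensitivity_at_specificity(scores, labels, specificity=0.85):
    """Fraction of cases called positive at a threshold set on controls.

    Assumed rule: the threshold is the smallest control score such that
    at least `specificity` of controls score at or below it, and a sample
    is called positive when its score is strictly above the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)  # True = cancer, False = control
    controls = np.sort(scores[~labels])
    cases = scores[labels]
    k = int(np.ceil(specificity * controls.size)) - 1
    threshold = controls[max(k, 0)]
    return float((cases > threshold).mean())

# Toy data: four controls followed by four cases (made-up scores).
scores = [0.1, 0.2, 0.3, 0.4, 0.15, 0.25, 0.5, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(sensitivity_at_specificity(scores, labels, specificity=0.5))
```

Setting the threshold on controls alone guarantees that the realized specificity is never below the requested value, so the sensitivity read off at that threshold is not inflated by extra false positives.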
|
7
|
St John JA, Gafni E, White B, Kannan A, Hansen L, Jaroszewicz A, Kundaje A, Boley N. Abstract 4349: Predicting gene expression from plasma cell-free DNA using both the fragment length and fragment position. Cancer Res 2019. [DOI: 10.1158/1538-7445.am2019-4349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
The ability to use a blood sample to determine the transcriptional state of cells that are releasing DNA into the bloodstream of a patient may be helpful in a variety of clinical applications. Here we present a case study of a gene expression prediction model that uses cell-free DNA (cfDNA) fragment coverage data generated by high-throughput sequencing to predict which genes are highly or lowly expressed in the cells contributing to that cfDNA. We evaluated a number of models, including a convolutional neural network that takes cfDNA fragment information (the density of both fragment midpoint and length by genomic position) over a transcription start site (TSS) as input, and outputs a predicted probability of whether that gene is highly expressed in cfDNA-producing cells. When we trained the convolutional model on a set of 554 genes with TSSs that were either constitutively expressed or unexpressed across leukocyte samples from the NIH Roadmap Epigenome Mapping Consortium, we achieved ~0.97 AUC in cross validation. With other models and splits of the data, we observed AUCs ranging from 0.95 to 0.99 on this gene-expression task. Next, we were interested in whether this trained model could answer specific clinical questions. For example, we hypothesized that we should see an increased influence of colon gene expression profiles in colorectal cancer patients with a higher fraction of circulating tumor DNA. To test this hypothesis, we applied our models to a set of genes with colon-specific expression, which generated a list of probabilities of each gene being expressed in each sample. We then applied simple models on these lists of probabilities to predict whether a patient had CRC or was healthy. This yielded cross validation AUCs between 0.85 and 0.95 across many of the models we tested in differentiating healthy patients from colorectal cancer patients with tumor fraction over 5%.
These results suggest a path forward for modeling transcriptional states using cfDNA sequencing data, which will enable greater insights from cfDNA that could augment those provided by other analytes.
Citation Format: John A. St John, Erik Gafni, Brandon White, Ajay Kannan, Loren Hansen, Artur Jaroszewicz, Anshul Kundaje, Nathan Boley. Predicting gene expression from plasma cell-free DNA using both the fragment length and fragment position [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr 4349.
Affiliation(s)
- Erik Gafni: Freenome, Inc., South San Francisco, CA
|
8
|
Chen ZX, Sturgill D, Qu J, Jiang H, Park S, Boley N, Suzuki AM, Fletcher AR, Plachetzki DC, FitzGerald PC, Artieri CG, Atallah J, Barmina O, Brown JB, Blankenburg KP, Clough E, Dasgupta A, Gubbala S, Han Y, Jayaseelan JC, Kalra D, Kim YA, Kovar CL, Lee SL, Li M, Malley JD, Malone JH, Mathew T, Mattiuzzo NR, Munidasa M, Muzny DM, Ongeri F, Perales L, Przytycka TM, Pu LL, Robinson G, Thornton RL, Saada N, Scherer SE, Smith HE, Vinson C, Warner CB, Worley KC, Wu YQ, Zou X, Cherbas P, Kellis M, Eisen MB, Piano F, Kionte K, Fitch DH, Sternberg PW, Cutter AD, Duff MO, Hoskins RA, Graveley BR, Gibbs RA, Bickel PJ, Kopp A, Carninci P, Celniker SE, Oliver B, Richards S. Comparative validation of the D. melanogaster modENCODE transcriptome annotation. Genome Res 2015; 24:1209-23. [PMID: 24985915 PMCID: PMC4079975 DOI: 10.1101/gr.159384.113] [Citation(s) in RCA: 111] [Impact Index Per Article: 12.3] [Indexed: 01/08/2023]
Abstract
Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data. Since transcriptional activity that has been evolutionarily conserved is likely to have an advantageous function, we have performed large-scale interspecific comparisons to increase confidence in predicted annotations. To support comparative genomics, we filled in divergence gaps in the Drosophila phylogeny by generating draft genomes for eight new species. For comparative transcriptome analysis, we generated mRNA expression profiles on 81 samples from multiple tissues and developmental stages of 15 Drosophila species, and we performed cap analysis of gene expression in D. melanogaster and D. pseudoobscura. We also describe conservation of four distinct core promoter structures composed of combinations of elements at three positions. Overall, each type of genomic feature shows a characteristic divergence rate relative to neutral models, highlighting the value of multispecies alignment in annotating a target genome that should prove useful in the annotation of other high priority genomes, especially human and other mammalian genomes that are rich in noncoding sequences. We report that the vast majority of elements in the annotation are evolutionarily conserved, indicating that the annotation will be an important springboard for functional genetic testing by the Drosophila community.
Affiliation(s)
- Zhen-Xia Chen: National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- David Sturgill: National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Jiaxin Qu: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Huaiyang Jiang: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Soo Park: Department of Genome Dynamics, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
- Nathan Boley: Department of Statistics, University of California, Berkeley, California 94720, USA
- Ana Maria Suzuki: Technology Development Group, RIKEN Omics Science Center and RIKEN Center for Life Science Technologies, Division of Genomic Technologies, Yokohama City, Kanagawa, Japan 230-0045
- Anthony R Fletcher: Division of Computational Bioscience, Center For Information Technology, National Institutes of Health, Bethesda, Maryland 20814, USA
- David C Plachetzki: Department of Evolution and Ecology, University of California, Davis, California 95616, USA
- Peter C FitzGerald: National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Carlo G Artieri: National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Joel Atallah: Department of Evolution and Ecology, University of California, Davis, California 95616, USA
- Olga Barmina: Department of Evolution and Ecology, University of California, Davis, California 95616, USA
- James B Brown: Department of Statistics, University of California, Berkeley, California 94720, USA
- Kerstin P Blankenburg: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Emily Clough: National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Abhijit Dasgupta: Clinical Trials and Outcomes Branch, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Sai Gubbala: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Yi Han: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Joy C Jayaseelan: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Divya Kalra: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Yoo-Ah Kim: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20892, USA
- Christie L Kovar: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Sandra L Lee: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Mingmei Li: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- James D Malley: Division of Computational Bioscience, Center For Information Technology, National Institutes of Health, Bethesda, Maryland 20814, USA
- John H Malone: National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Tittu Mathew: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Nicolas R Mattiuzzo: National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Mala Munidasa: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Donna M Muzny: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Fiona Ongeri: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Lora Perales: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Teresa M Przytycka: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20892, USA
- Ling-Ling Pu: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Garrett Robinson: Department of Statistics, University of California, Berkeley, California 94720, USA
- Rebecca L Thornton: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Nehad Saada: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Steven E Scherer: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Harold E Smith: National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Charles Vinson: National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Crystal B Warner: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Kim C Worley: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Yuan-Qing Wu: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Xiaoyan Zou: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Peter Cherbas: Department of Biology, Indiana University, Bloomington, Indiana 47405, USA
- Manolis Kellis: Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Michael B Eisen: Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
- Fabio Piano: Department of Biology, New York University, New York, New York 10003, USA
- Karin Kionte: Department of Biology, New York University, New York, New York 10003, USA
- David H Fitch: Department of Biology, New York University, New York, New York 10003, USA
- Paul W Sternberg: HHMI and Division of Biology, California Institute of Technology, Pasadena, California 91125, USA
- Asher D Cutter: Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, M5S 3B2, Canada
- Michael O Duff: Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, Connecticut 06030-6403, USA
- Roger A Hoskins: Department of Genome Dynamics, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
- Brenton R Graveley: Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, Connecticut 06030-6403, USA
- Richard A Gibbs: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Peter J Bickel: Department of Statistics, University of California, Berkeley, California 94720, USA
- Artyom Kopp: Department of Evolution and Ecology, University of California, Davis, California 95616, USA
- Piero Carninci: Technology Development Group, RIKEN Omics Science Center and RIKEN Center for Life Science Technologies, Division of Genomic Technologies, Yokohama City, Kanagawa, Japan 230-0045
- Susan E Celniker: Department of Genome Dynamics, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
- Brian Oliver: National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Stephen Richards: Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
|
9
|
Brown JB, Boley N, Eisman R, May GE, Stoiber MH, Duff MO, Booth BW, Wen J, Park S, Suzuki AM, Wan KH, Yu C, Zhang D, Carlson JW, Cherbas L, Eads BD, Miller D, Mockaitis K, Roberts J, Davis CA, Frise E, Hammonds AS, Olson S, Shenker S, Sturgill D, Samsonova AA, Weiszmann R, Robinson G, Hernandez J, Andrews J, Bickel PJ, Carninci P, Cherbas P, Gingeras TR, Hoskins RA, Kaufman TC, Lai EC, Oliver B, Perrimon N, Graveley BR, Celniker SE. Diversity and dynamics of the Drosophila transcriptome. Nature 2014; 512:393-9. [PMID: 24670639 PMCID: PMC4152413 DOI: 10.1038/nature12962] [Citation(s) in RCA: 470] [Impact Index Per Article: 47.0] [Received: 04/20/2013] [Accepted: 12/18/2013] [Indexed: 01/10/2023]
Abstract
Animal transcriptomes are dynamic, with each cell type, tissue and organ system expressing an ensemble of transcript isoforms that give rise to substantial diversity. Here we have identified new genes, transcripts and proteins using poly(A)+ RNA sequencing from Drosophila melanogaster in cultured cell lines, dissected organ systems and under environmental perturbations. We found that a small set of mostly neural-specific genes has the potential to encode thousands of transcripts each through extensive alternative promoter usage and RNA splicing. The magnitudes of splicing changes are larger between tissues than between developmental stages, and most sex-specific splicing is gonad-specific. Gonads express hundreds of previously unknown coding and long non-coding RNAs (lncRNAs), some of which are antisense to protein-coding genes and produce short regulatory RNAs. Furthermore, previously identified pervasive intergenic transcription occurs primarily within newly identified introns. The fly transcriptome is substantially more complex than previously recognized, with this complexity arising from combinatorial usage of promoters, splice sites and polyadenylation sites.
|
10
|
Boley N, Stoiber MH, Booth BW, Wan KH, Hoskins RA, Bickel PJ, Celniker SE, Brown JB. Genome-guided transcript assembly by integrative analysis of RNA sequence data. Nat Biotechnol 2014; 32:341-6. [PMID: 24633242 PMCID: PMC4037530 DOI: 10.1038/nbt.2850] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Received: 12/03/2013] [Accepted: 02/11/2014] [Indexed: 01/31/2023]
Abstract
The identification of full length transcripts entirely from short-read RNA sequencing data (RNA-seq) remains a challenge in the annotation of genomes. Here we describe an automated pipeline for genome annotation that integrates RNA-seq and gene-boundary data sets, which we call Generalized RNA Integration Tool, or GRIT. Applying GRIT to Drosophila melanogaster short-read RNA-seq, cap analysis of gene expression (CAGE) and poly(A)-site-seq data collected for the modENCODE project, we recovered the vast majority of previously annotated transcripts and doubled the total number of transcripts cataloged. We found that 20% of protein coding genes encode multiple protein-localization signals and that, in 20-d-old adult fly heads, genes with multiple polyadenylation sites are more common than genes with alternative splicing or alternative promoters. GRIT demonstrates 30% higher precision and recall than the most widely used transcript assembly tools. GRIT will facilitate the automated generation of high-quality genome annotations without the need for extensive manual annotation.
Affiliation(s)
- Nathan Boley: Department of Biostatistics, University of California at Berkeley, Berkeley, CA, USA
- Marcus H. Stoiber: Department of Biostatistics, University of California at Berkeley, Berkeley, CA, USA
- Benjamin W. Booth: Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Kenneth H. Wan: Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Roger A. Hoskins: Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Peter J. Bickel: Department of Statistics, University of California at Berkeley, Berkeley, CA, USA
- Susan E. Celniker: Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- James B. Brown: Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California, USA; Department of Statistics, University of California at Berkeley, Berkeley, CA, USA
|
11
|
Abstract
modENCODE was a 5-year NHGRI-funded project (2007-2012) to map the function of every base in the genomes of worms and flies, characterizing positions of modified histones and other chromatin marks, origins of DNA replication, RNA transcripts and the transcription factor binding sites that control gene expression. Here we describe the Drosophila modENCODE datasets and how best to access and use them for genome-wide and individual gene studies.
Affiliation(s)
- Nathan Boley: Department of Biostatistics, University of California Berkeley, Berkeley, CA, United States
- Kenneth H Wan: Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
- Peter J Bickel: Department of Statistics, University of California Berkeley, Berkeley, CA, United States
- Susan E Celniker: Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
|
12
|
Atherton J, Boley N, Brown B, Ogawa N, Davidson SM, Eisen MB, Biggin MD, Bickel P. A model for sequential evolution of ligands by exponential enrichment (SELEX) data. Ann Appl Stat 2012. [DOI: 10.1214/12-aoas537] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
|
13
|
|
14
|
Hoskins RA, Landolin JM, Brown JB, Sandler JE, Takahashi H, Lassmann T, Yu C, Booth BW, Zhang D, Wan KH, Yang L, Boley N, Andrews J, Kaufman TC, Graveley BR, Bickel PJ, Carninci P, Carlson JW, Celniker SE. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res 2010; 21:182-92. [PMID: 21177961 DOI: 10.1101/gr.112466.110] [Citation(s) in RCA: 167] [Impact Index Per Article: 11.9] [Indexed: 11/24/2022]
Abstract
Core promoters are critical regions for gene regulation in higher eukaryotes. However, the boundaries of promoter regions, the relative rates of initiation at the transcription start sites (TSSs) distributed within them, and the functional significance of promoter architecture remain poorly understood. We produced a high-resolution map of promoters active in the Drosophila melanogaster embryo by integrating data from three independent and complementary methods: 21 million cap analysis of gene expression (CAGE) tags, 1.2 million RNA ligase mediated rapid amplification of cDNA ends (RLM-RACE) reads, and 50,000 cap-trapped expressed sequence tags (ESTs). We defined 12,454 promoters of 8037 genes. Our analysis indicates that, due to non-promoter-associated RNA background signal, previous studies have likely overestimated the number of promoter-associated CAGE clusters by fivefold. We show that TSS distributions form a complex continuum of shapes, and that promoters active in the embryo and adult have highly similar shapes in 95% of cases. This suggests that these distributions are generally determined by static elements such as local DNA sequence and are not modulated by dynamic signals such as histone modifications. Transcription factor binding motifs are differentially enriched as a function of promoter shape, and peaked promoter shape is correlated with both temporal and spatial regulation of gene expression. Our results contribute to the emerging view that core promoters are functionally diverse and control patterning of gene expression in Drosophila and mammals.
Affiliation(s)
- Roger A Hoskins: Department of Genome Dynamics, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
|
15
|
|