551
|
Abstract
This issue of Genome Research presents new results, methods, and tools from The ENCODE Project (ENCyclopedia of DNA Elements), which collectively represents an important step in moving beyond a parts list of the genome and promises to shape the future of genomic research. This collection sheds light on basic biological questions and frames the current debate over the optimization of tools and methodological challenges necessary to compare and interpret large complex data sets focused on how the genome is organized and regulated. In a number of instances, the authors have highlighted the strengths and limitations of current computational and technical approaches, providing the community with useful standards, which should stimulate development of new tools. In many ways, these papers will ripple through the scientific community, as those in pursuit of understanding the “regulatory genome” will heavily traverse the maps and tools. Similarly, the work should have a substantive impact on how genetic variation contributes to specific diseases and traits by providing a compendium of functional elements for follow-up study. The success of these papers should not only be measured by the scope of the scientific insights and tools but also by their ability to attract new talent to mine existing and future data.
Collapse
Affiliation(s)
- Stephen Chanock
- Laboratory of Translational Genomics, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Advanced Technology Center, Bethesda, Maryland 20892-4605, USA.
| |
Collapse
|
552
|
Boyle AP, Hong EL, Hariharan M, Cheng Y, Schaub MA, Kasowski M, Karczewski KJ, Park J, Hitz BC, Weng S, Cherry JM, Snyder M. Annotation of functional variation in personal genomes using RegulomeDB. Genome Res 2013; 22:1790-7. [PMID: 22955989 PMCID: PMC3431494 DOI: 10.1101/gr.137323.112] [Citation(s) in RCA: 2115] [Impact Index Per Article: 176.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
As the sequencing of healthy and disease genomes becomes more commonplace, detailed annotation provides interpretation for individual variation responsible for normal and disease phenotypes. Current approaches focus on direct changes in protein coding genes, particularly nonsynonymous mutations that directly affect the gene product. However, most individual variation occurs outside of genes and, indeed, most markers generated from genome-wide association studies (GWAS) identify variants outside of coding segments. Identification of potential regulatory changes that perturb these sites will lead to a better localization of truly functional variants and interpretation of their effects. We have developed a novel approach and database, RegulomeDB, which guides interpretation of regulatory variants in the human genome. RegulomeDB includes high-throughput, experimental data sets from ENCODE and other sources, as well as computational predictions and manual annotations to identify putative regulatory potential and identify functional variants. These data sources are combined into a powerful tool that scores variants to help separate functional variants from a large pool and provides a small set of putative sites with testable hypotheses as to their function. We demonstrate the applicability of this tool to the annotation of noncoding variants from 69 full sequenced genomes as well as that of a personal genome, where thousands of functionally associated variants were identified. Moreover, we demonstrate a GWAS where the database is able to quickly identify the known associated functional variant and provide a hypothesis as to its function. Overall, we expect this approach and resource to be valuable for the annotation of human genome sequences.
Collapse
Affiliation(s)
- Alan P Boyle
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
553
|
Abstract
In its first production phase, The ENCODE Project Consortium (ENCODE) has generated thousands of genome-scale data sets, resulting in a genomic “parts list” that encompasses transcripts, sites of transcription factor binding, and other functional features that now number in the millions of distinct elements. These data are reshaping many long-held beliefs concerning the information content of the human and other complex genomes, including the very definition of the gene. Here I discuss and place in context many of the leading findings of ENCODE, as well as trends that are shaping the generation and interpretation of ENCODE data. Finally, I consider prospects for the future, including maximizing the accuracy, completeness, and utility of ENCODE data for the community.
Collapse
Affiliation(s)
- John A Stamatoyannopoulos
- Departments of Genome Sciences and Medicine, University of Washington School of Medicine, Seattle, Washington 98195, USA.
| |
Collapse
|
554
|
Affiliation(s)
- Kelly A Frazer
- Moores UCSD Cancer Center, Department of Pediatrics and Rady Children's Hospital, University of California at San Diego, La Jolla, California 92093, USA.
| |
Collapse
|
555
|
Sivley RM, Fish AE, Bush WS. Knowledge-constrained K-medoids Clustering of Regulatory Rare Alleles for Burden Tests. EVOLUTIONARY COMPUTATION, MACHINE LEARNING AND DATA MINING IN BIOINFORMATICS. EVOBIO (CONFERENCE) 2013; 7833:35-42. [PMID: 25541630 PMCID: PMC4274942 DOI: 10.1007/978-3-642-37189-9_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
Rarely occurring genetic variants are hypothesized to influence human diseases, but statistically associating these rare variants to disease is challenging due to a lack of statistical power in most feasibly sized datasets. Several statistical tests have been developed to either collapse multiple rare variants from a genomic region into a single variable (presence/absence) or to tally the number of rare alleles within a region, relating the burden of rare alleles to disease risk. Both these approaches, however, rely on user-specification of a genomic region to generate these collapsed or burden variables, usually an entire gene. Recent studies indicate that most risk variants for common diseases are found within regulatory regions, not genes. To capture the effect of rare alleles within non-genic regulatory regions for burden tests, we contrast a simple sliding window approach with a knowledge-guided k-medoids clustering method to group rare variants into statistically powerful, biologically meaningful windows. We apply these methods to detect genomic regions that alter expression of nearby genes.
Collapse
Affiliation(s)
- R Michael Sivley
- Center for Human Genetics Research, Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
| | | | | |
Collapse
|
556
|
Hoffman MM, Ernst J, Wilder SP, Kundaje A, Harris RS, Libbrecht M, Giardine B, Ellenbogen PM, Bilmes JA, Birney E, Hardison RC, Dunham I, Kellis M, Noble WS. Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Res 2012; 41:827-41. [PMID: 23221638 PMCID: PMC3553955 DOI: 10.1093/nar/gks1284] [Citation(s) in RCA: 380] [Impact Index Per Article: 29.2] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023] Open
Abstract
The ENCODE Project has generated a wealth of experimental information mapping diverse chromatin properties in several human cell lines. Although each such data track is independently informative toward the annotation of regulatory elements, their interrelations contain much richer information for the systematic annotation of regulatory elements. To uncover these interrelations and to generate an interpretable summary of the massive datasets of the ENCODE Project, we apply unsupervised learning methodologies, converting dozens of chromatin datasets into discrete annotation maps of regulatory regions and other chromatin elements across the human genome. These methods rediscover and summarize diverse aspects of chromatin architecture, elucidate the interplay between chromatin activity and RNA transcription, and reveal that a large proportion of the genome lies in a quiescent state, even across multiple cell types. The resulting annotation of non-coding regulatory elements correlate strongly with mammalian evolutionary constraint, and provide an unbiased approach for evaluating metrics of evolutionary constraint in human. Lastly, we use the regulatory annotations to revisit previously uncharacterized disease-associated loci, resulting in focused, testable hypotheses through the lens of the chromatin landscape.
Collapse
Affiliation(s)
- Michael M Hoffman
- Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, WA 98195-5065, USA
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
557
|
Cell specific CD44 expression in breast cancer requires the interaction of AP-1 and NFκB with a novel cis-element. PLoS One 2012; 7:e50867. [PMID: 23226410 PMCID: PMC3511339 DOI: 10.1371/journal.pone.0050867] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2012] [Accepted: 10/29/2012] [Indexed: 01/19/2023] Open
Abstract
Breast cancers contain a heterogeneous population of cells with a small percentage that possess properties similar to those found in stem cells. One of the widely accepted markers of breast cancer stem cells (BCSCs) is the cell surface marker CD44. As a glycoprotein, CD44 is involved in many cellular processes including cell adhesion, migration and proliferation, making it pro-oncogenic by nature. CD44 expression is highly up-regulated in BCSCs, and has been implicated in tumorigenesis and metastasis. However, the genetic mechanism that leads to a high level of CD44 expression in breast cancer cells and BCSCs is not well understood. Here, we identify a novel cis-element of the CD44 directs gene expression in breast cancer cells in a cell type specific manner. We have further identified key trans-acting factor binding sites and nuclear factors AP-1 and NFκB that are involved in the regulation of cell-specific CD44 expression. These findings provide new insight into the complex regulatory mechanism of CD44 expression, which may help identify more effective therapeutic targets against the breast cancer stem cells and metastatic tumors.
Collapse
|
558
|
|
559
|
Ward LD, Kellis M. Interpreting noncoding genetic variation in complex traits and human disease. Nat Biotechnol 2012; 30:1095-106. [PMID: 23138309 PMCID: PMC3703467 DOI: 10.1038/nbt.2422] [Citation(s) in RCA: 353] [Impact Index Per Article: 27.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2012] [Accepted: 10/16/2012] [Indexed: 12/13/2022]
Abstract
Association studies provide genome-wide information about the genetic basis of complex disease, but medical research has primarily focused on protein-coding variants, due to the difficulty of interpreting non-coding mutations. This picture has changed with advances in the systematic annotation of functional non-coding elements. Evolutionary conservation, functional genomics, chromatin state, sequence motifs, and molecular quantitative trait loci all provide complementary information about non-coding function. These functional maps can help prioritize variants on risk haplotypes, filter mutations encountered in the clinic, and perform systems-level analyses to reveal processes underlying disease associations. Advances in predictive modeling can enable dataset integration to reveal pathways shared across loci and alleles, and richer regulatory models can guide the search for epistatic interactions. Lastly, new massively parallel reporter experiments can systematically validate regulatory predictions. Ultimately, advances in regulatory and systems genomics can help unleash the value of whole-genome sequencing for personalized genomic risk assessment, diagnosis, and treatment.
Collapse
Affiliation(s)
- Lucas D Ward
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
| | | |
Collapse
|
560
|
Abstract
The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.
Collapse
|
561
|
Hardison RC. Genome-wide epigenetic data facilitate understanding of disease susceptibility association studies. J Biol Chem 2012; 287:30932-40. [PMID: 22952232 PMCID: PMC3438926 DOI: 10.1074/jbc.r112.352427] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022] Open
Abstract
Complex traits such as susceptibility to diseases are determined in part by variants at multiple genetic loci. Genome-wide association studies can identify these loci, but most phenotype-associated variants lie distal to protein-coding regions and are likely involved in regulating gene expression. Understanding how these genetic variants affect complex traits depends on the ability to predict and test the function of the genomic elements harboring them. Community efforts such as the ENCODE Project provide a wealth of data about epigenetic features associated with gene regulation. These data enable the prediction of testable functions for many phenotype-associated variants.
Collapse
Affiliation(s)
- Ross C Hardison
- Department of Biochemistry and Molecular Biology and Center for Comparative Genomics and Bioinformatics, Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16801, USA.
| |
Collapse
|