1
Ren Y, Xiong W, Feng C, Yu D, Wang X, Yang Q, Yu S, Zhang H, Huo B, Jiang H, Li Z, Wang J, Su YX, Yang P, Liao Y, Zhong Q, Wang J. Multi-omics insights into the molecular signature and prognosis of hypopharyngeal squamous cell carcinoma. Commun Biol 2025; 8:370. [PMID: 40044946 PMCID: PMC11882983 DOI: 10.1038/s42003-025-07700-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2024] [Accepted: 02/07/2025] [Indexed: 03/09/2025] Open
Abstract
Approximately two-thirds of hypopharyngeal squamous cell carcinoma (HPSCC) cases are diagnosed at advanced stages, with the worst prognosis among head and neck squamous cell carcinomas (HNSCCs). Identifying biomarkers for high-risk patients requiring aggressive treatment is crucial. We present mutational, transcriptomic, and proteomic studies of 103 Chinese HPSCC patients and observe a higher prevalence and poorer prognosis in males. Estrogen response pathways are up-regulated, and proteins phosphorylated by protein kinase C (PKC) and cyclin-dependent kinases (CDKs) are aberrantly regulated in HPSCC. We identify aberrant copy number regions including SOX2(3q26.33), FGFR(8p11.23), CCND1(11q13.3), CDKN2A/2B(9p21.3), and MYC(8q24.21). Human papillomavirus (HPV) status combined with highly mutated genes, such as SYNE1 in HPV(-) and MUC4 in HPV(+) patients, were assessed as prognosis markers. A predictive model involving clinical factors and expression of six genes was established and cross-site validated. These findings open new opportunities for stratifying high-risk patients and molecular targets for personalized therapeutic strategies.
Affiliation(s)
- Yanxin Ren
- Department of Head and Neck Surgery, Yunnan Cancer Hospital, The Third Affiliated Hospital of Kunming Medical University, Kunming, China
- Wei Xiong
- Department of Radiotherapy, Yunnan Cancer Hospital, The Third Affiliated Hospital of Kunming Medical University, Kunming, China
- Chun Feng
- Department of Otolaryngology, The First People's Hospital of Yunnan Province, The Affiliated Hospital of Kunming University of Science and Technology, Kunming, China
- Dan Yu
- Division of Applied Oral Sciences & Community Dental Care, Faculty of Dentistry, the University of Hong Kong, Hong Kong SAR, China
- State Key Laboratory of Pharmaceutical Biotechnology, the University of Hong Kong, Hong Kong SAR, China
- Xiaoyan Wang
- Department of Otolaryngology Head and Neck Surgery, Beijing Tongren Hospital, Capital Medical University, Beijing, China
- Key Laboratory of Otolaryngology Head and Neck Surgery (Capital Medical University), Ministry of Education, Beijing, China
- Qing Yang
- Department of Head and Neck Surgery, Yunnan Cancer Hospital, The Third Affiliated Hospital of Kunming Medical University, Kunming, China
- Siting Yu
- Department of Radiotherapy, Yunnan Cancer Hospital, The Third Affiliated Hospital of Kunming Medical University, Kunming, China
- Hongjiang Zhang
- Department of Radiotherapy, Yunnan Cancer Hospital, The Third Affiliated Hospital of Kunming Medical University, Kunming, China
- Bangyun Huo
- Department of Otolaryngology, The First People's Hospital of Yunnan Province, The Affiliated Hospital of Kunming University of Science and Technology, Kunming, China
- Honglu Jiang
- Department of Otolaryngology, The First People's Hospital of Yunnan Province, The Affiliated Hospital of Kunming University of Science and Technology, Kunming, China
- Zuli Li
- Institute for Viral Hepatitis & Department of Infectious Diseases, Second Affiliated Hospital, Chongqing Medical University, Chongqing, China
- Key Laboratory of Molecular Biology of Infectious Diseases, MOE (Ministry of Education), Chongqing, China
- Junlin Wang
- Division of Applied Oral Sciences & Community Dental Care, Faculty of Dentistry, the University of Hong Kong, Hong Kong SAR, China
- Division of Oral and Maxillofacial Surgery, Faculty of Dentistry, the University of Hong Kong, Hong Kong SAR, China
- Yu-Xiong Su
- Division of Oral and Maxillofacial Surgery, Faculty of Dentistry, the University of Hong Kong, Hong Kong SAR, China
- Ping Yang
- Department of Quantitative Health Science, Mayo Clinic, Scottsdale, USA
- Yong Liao
- Institute for Viral Hepatitis & Department of Infectious Diseases, Second Affiliated Hospital, Chongqing Medical University, Chongqing, China
- Key Laboratory of Molecular Biology of Infectious Diseases, MOE (Ministry of Education), Chongqing, China
- Qi Zhong
- Department of Otolaryngology Head and Neck Surgery, Beijing Tongren Hospital, Capital Medical University, Beijing, China.
- Key Laboratory of Otolaryngology Head and Neck Surgery (Capital Medical University), Ministry of Education, Beijing, China.
- Junwen Wang
- Division of Applied Oral Sciences & Community Dental Care, Faculty of Dentistry, the University of Hong Kong, Hong Kong SAR, China.
- State Key Laboratory of Pharmaceutical Biotechnology, the University of Hong Kong, Hong Kong SAR, China.
2
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: trends from three decades of genetic variant impact predictors. Hum Genomics 2024; 18:90. [PMID: 39198917 PMCID: PMC11360829 DOI: 10.1186/s40246-024-00663-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2024] [Accepted: 08/19/2024] [Indexed: 09/01/2024] Open
Abstract
BACKGROUND Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). RESULTS The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past three decades, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 190 VIPs, resulting in a total of 407 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. CONCLUSIONS VIPdb version 2 summarizes 407 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. VIPdb is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
- Arul S Menon
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA
- Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA
- Illumina, Foster City, CA, 94404, USA
- Steven E Brenner
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA.
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA.
- College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA.
- Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA.
3
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: Trends from 25 years of genetic variant impact predictors. bioRxiv 2024:2024.06.25.600283. [PMID: 38979289 PMCID: PMC11230257 DOI: 10.1101/2024.06.25.600283] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/10/2024]
Abstract
Background Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). Results The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past 25 years, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 186 VIPs, resulting in a total of 403 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. Conclusions VIPdb version 2 summarizes 403 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. Availability VIPdb version 2 is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
- Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
- Center for Computational Biology, University of California, Berkeley, California 94720, USA
- Arul S. Menon
- Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
- Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Currently at: Illumina, Foster City, California 94404, USA
- Steven E. Brenner
- Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
- Center for Computational Biology, University of California, Berkeley, California 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
4
Felício D, Alves-Ferreira M, Santos M, Quintas M, Lopes AM, Lemos C, Pinto N, Martins S. Integrating functional scoring and regulatory data to predict the effect of non-coding SNPs in a complex neurological disease. Brief Funct Genomics 2024; 23:138-149. [PMID: 37254524 DOI: 10.1093/bfgp/elad020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Revised: 03/13/2023] [Accepted: 05/09/2023] [Indexed: 06/01/2023] Open
Abstract
Most SNPs associated with complex diseases seem to lie in non-coding regions of the genome; however, their contribution to gene expression and disease phenotype remains poorly understood. Here, we established a workflow to assist in prioritising the functional relevance of non-coding SNPs of candidate genes as susceptibility loci in polygenic neurological disorders. To illustrate the applicability of our workflow, we considered the multifactorial disorder migraine as a model to follow our step-by-step approach. We annotated the overlap of selected SNPs with regulatory elements and assessed their potential impact on gene expression based on publicly available prediction algorithms and functional genomics information. Some migraine risk loci have been hypothesised to reside in non-coding regions and to be implicated in the neurotransmission pathway. In this study, we used a set of 22 non-coding SNPs from neurotransmission and synaptic machinery-related genes previously suggested to be involved in migraine susceptibility based on our candidate gene association studies. After prioritising these SNPs, we focused on non-reported ones that demonstrated high regulatory potential: (1) VAMP2_rs1150 (3' UTR) was predicted as a target of the hsa-mir-5010-3p miRNA, possibly disrupting its own gene expression; (2) STX1A_rs6951030 (proximal enhancer) may affect the binding affinity of zinc-finger transcription factors (namely ZNF423) and disturb TBL2 gene expression; and (3) SNAP25_rs2327264 (distal enhancer) is expected to lie in a binding site of the ONECUT2 transcription factor. This study demonstrated the applicability of our practical workflow to facilitate the prioritisation of potentially relevant non-coding SNPs and predict their functional impact in multifactorial neurological diseases.
Affiliation(s)
- Daniela Felício
- Instituto de Investigação e Inovação em Saúde (i3S), Porto 4200-135, Portugal
- Institute of Molecular Pathology and Immunology of the University of Porto (IPATIMUP), Porto 4200-135, Portugal
- Instituto Ciências Biomédicas Abel Salazar (ICBAS), Universidade do Porto, Porto 4050-313, Portugal
- Miguel Alves-Ferreira
- Instituto de Investigação e Inovação em Saúde (i3S), Porto 4200-135, Portugal
- Instituto Ciências Biomédicas Abel Salazar (ICBAS), Universidade do Porto, Porto 4050-313, Portugal
- Unit for Genetic and Epidemiological Research in Neurological Diseases (UnIGENe), Instituto de Biologia Molecular e Celular (IBMC), Universidade do Porto, Porto 4200-135, Portugal
- Centre for Predictive and Preventive Genetics (CGPP), Instituto de Biologia Molecular e Celular (IBMC), Universidade do Porto, Porto 4200-135, Portugal
- Mariana Santos
- Instituto de Investigação e Inovação em Saúde (i3S), Porto 4200-135, Portugal
- Unit for Genetic and Epidemiological Research in Neurological Diseases (UnIGENe), Instituto de Biologia Molecular e Celular (IBMC), Universidade do Porto, Porto 4200-135, Portugal
- Marlene Quintas
- Instituto de Investigação e Inovação em Saúde (i3S), Porto 4200-135, Portugal
- Instituto Ciências Biomédicas Abel Salazar (ICBAS), Universidade do Porto, Porto 4050-313, Portugal
- Unit for Genetic and Epidemiological Research in Neurological Diseases (UnIGENe), Instituto de Biologia Molecular e Celular (IBMC), Universidade do Porto, Porto 4200-135, Portugal
- Alexandra M Lopes
- Instituto de Investigação e Inovação em Saúde (i3S), Porto 4200-135, Portugal
- Institute of Molecular Pathology and Immunology of the University of Porto (IPATIMUP), Porto 4200-135, Portugal
- Centre for Predictive and Preventive Genetics (CGPP), Instituto de Biologia Molecular e Celular (IBMC), Universidade do Porto, Porto 4200-135, Portugal
- Carolina Lemos
- Instituto de Investigação e Inovação em Saúde (i3S), Porto 4200-135, Portugal
- Instituto Ciências Biomédicas Abel Salazar (ICBAS), Universidade do Porto, Porto 4050-313, Portugal
- Unit for Genetic and Epidemiological Research in Neurological Diseases (UnIGENe), Instituto de Biologia Molecular e Celular (IBMC), Universidade do Porto, Porto 4200-135, Portugal
- Nádia Pinto
- Instituto de Investigação e Inovação em Saúde (i3S), Porto 4200-135, Portugal
- Institute of Molecular Pathology and Immunology of the University of Porto (IPATIMUP), Porto 4200-135, Portugal
- Centro de Matemática da Universidade do Porto (CMUP), Porto 4169-007, Portugal
- Sandra Martins
- Instituto de Investigação e Inovação em Saúde (i3S), Porto 4200-135, Portugal
- Institute of Molecular Pathology and Immunology of the University of Porto (IPATIMUP), Porto 4200-135, Portugal
5
Chen L, Wang Y, Zhao F. Exploiting deep transfer learning for the prediction of functional non-coding variants using genomic sequence. Bioinformatics 2022; 38:3164-3172. [PMID: 35389435 PMCID: PMC9890318 DOI: 10.1093/bioinformatics/btac214] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/03/2021] [Revised: 03/04/2022] [Accepted: 04/06/2022] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION Though genome-wide association studies have identified tens of thousands of variants associated with complex traits and most of them fall within the non-coding regions, they may not be the causal ones. The development of high-throughput functional assays leads to the discovery of experimentally validated non-coding functional variants. However, these validated variants are rare due to technical difficulty and financial cost. The small sample size of validated variants makes it less reliable to develop a supervised machine learning model for achieving a whole genome-wide prediction of non-coding causal variants. RESULTS We exploit a deep transfer learning model, which is based on a convolutional neural network, to improve the prediction of functional non-coding variants (NCVs). To address the challenge of small sample size, the transfer learning model leverages both large-scale generic functional NCVs to improve the learning of low-level features and context-specific functional NCVs to learn high-level features toward the context-specific prediction task. By evaluating the deep transfer learning model on three MPRA datasets and 16 GWAS datasets, we demonstrate that the proposed model outperforms deep learning models without pretraining or retraining. In addition, the deep transfer learning model outperforms 18 existing computational methods in both MPRA and GWAS datasets. AVAILABILITY AND IMPLEMENTATION https://github.com/lichen-lab/TLVar. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Li Chen
- To whom correspondence should be addressed.
- Fengdi Zhao
- Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN 46202, USA; Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
6
Correard S, Hewitson B, van der Lee R, Wasserman WW. RevUP, an online scoring system for regulatory variants implicated in rare diseases. Bioinformatics 2022; 38:2664-2666. [PMID: 35289834 PMCID: PMC9048665 DOI: 10.1093/bioinformatics/btac157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2021] [Revised: 03/03/2022] [Accepted: 03/14/2022] [Indexed: 11/22/2022] Open
Abstract
Summary To address the difficulty in assessing the implication of regulatory variants in diseases, a previously published scoring scheme allows the calculation of the Regulatory Variant Evidence score (RVE-score). The score represents the accumulated evidence for a causative role of a regulatory variant in a disease. Regulatory Evidence for Variants Underlying Phenotypes (RevUP) was built to calculate the RVE-score of regulatory variants, based on the 24 criteria, with a hybrid approach combining information retrieved from public databases and user input. Availability and implementation RevUP is freely available at http://www.revup-classifier.ca. The source code is available at https://github.com/wassermanlab/revup. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Solenne Correard
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
- Brittany Hewitson
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
- Robin van der Lee
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
- Wyeth W Wasserman
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
7
Li X, Yung G, Zhou H, Sun R, Li Z, Hou K, Zhang MJ, Liu Y, Arapoglou T, Wang C, Ionita-Laza I, Lin X. A multi-dimensional integrative scoring framework for predicting functional variants in the human genome. Am J Hum Genet 2022; 109:446-456. [PMID: 35216679 PMCID: PMC8948160 DOI: 10.1016/j.ajhg.2022.01.017] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 05/04/2021] [Accepted: 01/26/2022] [Indexed: 12/26/2022] Open
Abstract
Attempts to identify and prioritize functional DNA elements in coding and non-coding regions, particularly through use of in silico functional annotation data, continue to increase in popularity. However, specific functional roles can vary widely from one variant to another, making it challenging to summarize different aspects of variant function with a one-dimensional rating. Here we propose multi-dimensional annotation-class integrative estimation (MACIE), an unsupervised multivariate mixed-model framework capable of integrating annotations of diverse origin to assess multi-dimensional functional roles for both coding and non-coding variants. Unlike existing one-dimensional scoring methods, MACIE views variant functionality as a composite attribute encompassing multiple characteristics and estimates the joint posterior functional probabilities of each genomic position. This estimate offers more comprehensive and interpretable information in the presence of multiple aspects of functionality. Applied to a variety of independent coding and non-coding datasets, MACIE demonstrates powerful and robust performance in discriminating between functional and non-functional variants. We also show an application of MACIE to fine-mapping and heritability enrichment analysis by using the lipids GWAS summary statistics data from the European Network for Genetic and Genomic Epidemiology Consortium.
Affiliation(s)
- Xihao Li
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Godwin Yung
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Methods, Collaboration and Outreach Group, Genentech/Roche, South San Francisco, CA 94080, USA
- Hufeng Zhou
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Ryan Sun
- Department of Biostatistics, University of Texas M.D. Anderson Cancer Center, Houston, TX 77030, USA
- Zilin Li
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Kangcheng Hou
- Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Martin Jinye Zhang
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
- Yaowu Liu
- School of Statistics, Southwestern University of Finance and Economics, Chengdu, Sichuan, China
- Theodore Arapoglou
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Chen Wang
- Department of Biostatistics, Columbia University Mailman School of Public Health, New York, NY 10032, USA
- Iuliana Ionita-Laza
- Department of Biostatistics, Columbia University Mailman School of Public Health, New York, NY 10032, USA.
- Xihong Lin
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Statistics, Harvard University, Cambridge, MA, 02138, USA.
8
Wang X, Cheng H, Yang Y, Zuo X, Shao L, Yu D, Yang N, Zhang Y, Li R, Wang X, Shen B, Wang J, Shi X, Cao P, Sun L, Han X, Sun Y. The enhancer rare germline variation rs548071605 contributes to lung cancer development. Hum Mutat 2021; 43:200-214. [PMID: 34859522 DOI: 10.1002/humu.24310] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/25/2021] [Revised: 10/21/2021] [Accepted: 11/28/2021] [Indexed: 11/05/2022]
Abstract
Rare germline variations contribute to the missing heritability of human complex diseases including cancers. Given their very low frequency, discovering and testing disease-causing rare germline variations remains challenging. The tag-single nucleotide polymorphism rs17728461 in 22q12.2 is highly associated with lung cancer risk. Here, we identified a functional rare germline variation rs548071605 (A>G) in a p65-responsive enhancer located within 22q12.2. The enhancer significantly promoted lung cancer cell proliferation in vitro and in a xenograft mouse model by upregulating the leukemia inhibitory factor (LIF) gene via the formation of a chromatin loop. Differential expression of LIF and its significant correlation with first progression survival time of patients further supported the lung cancer-driving effects of the 22q-Enh enhancer. Importantly, the rare variation was harbored in the p65 binding sequence and dramatically increased the enhancer activity by increasing responsiveness of the enhancer to p65 and B-cell lymphoma 3 protein, an oncoprotein that assisted the p65 binding. Our study revealed a regulatory rare germline variation with a potential lung cancer-driving role in the 22q12.2 risk region, providing intriguing clues for investigating the "missing heritability" of cancers, and also offered a useful experimental model for identifying causal rare variations.
Affiliation(s)
- Xuchun Wang
- Key Laboratory of Human Functional Genomics of Jiangsu Province, Nanjing Medical University, Nanjing, China; Jiangsu Key Lab of Cancer Biomarkers, Prevention and Treatment, Collaborative Innovation Center for Personalized Cancer Medicine, Nanjing Medical University, Nanjing, China; Department of Cell Biology, Nanjing Medical University, Nanjing, China
- He Cheng
- Key Laboratory of Human Functional Genomics of Jiangsu Province, Nanjing Medical University, Nanjing, China; Jiangsu Key Lab of Cancer Biomarkers, Prevention and Treatment, Collaborative Innovation Center for Personalized Cancer Medicine, Nanjing Medical University, Nanjing, China; Department of Cell Biology, Nanjing Medical University, Nanjing, China
- Yin Yang
- Key Laboratory of Human Functional Genomics of Jiangsu Province, Nanjing Medical University, Nanjing, China; Department of Cell Biology, Nanjing Medical University, Nanjing, China
- Xianglin Zuo
- Key Laboratory of Human Functional Genomics of Jiangsu Province, Nanjing Medical University, Nanjing, China; Department of Cell Biology, Nanjing Medical University, Nanjing, China
- Lipei Shao
- Key Laboratory of Human Functional Genomics of Jiangsu Province, Nanjing Medical University, Nanjing, China; Department of Cell Biology, Nanjing Medical University, Nanjing, China
- Dawei Yu
- Key Laboratory of Human Functional Genomics of Jiangsu Province, Nanjing Medical University, Nanjing, China; Department of Cell Biology, Nanjing Medical University, Nanjing, China
- Nan Yang
- Key Laboratory of Human Functional Genomics of Jiangsu Province, Nanjing Medical University, Nanjing, China
- Yu Zhang
- Key Laboratory of Human Functional Genomics of Jiangsu Province, Nanjing Medical University, Nanjing, China; Department of Cell Biology, Nanjing Medical University, Nanjing, China
- Ruilei Li
- Department of Cancer Biotherapy Center, The Third Affiliated Hospital of Kunming Medical University (Tumor Hospital of Yunnan Province), Kunming, Yunnan, China
- Xinyuan Wang
- Key Laboratory of Human Functional Genomics of Jiangsu Province, Nanjing Medical University, Nanjing, China; Department of Cell Biology, Nanjing Medical University, Nanjing, China
- Bin Shen
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, China
- Jianying Wang
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, China
- Xiao Shi
- Key Laboratory of Human Functional Genomics of Jiangsu Province, Nanjing Medical University, Nanjing, China; Department of Cell Biology, Nanjing Medical University, Nanjing, China
- Pingping Cao
- Key Laboratory of Human Functional Genomics of Jiangsu Province, Nanjing Medical University, Nanjing, China; Department of Cell Biology, Nanjing Medical University, Nanjing, China
- Luan Sun
- Department of Cell Biology, Nanjing Medical University, Nanjing, China
- Xiao Han
- Key Laboratory of Human Functional Genomics of Jiangsu Province, Nanjing Medical University, Nanjing, China
- Yujie Sun
- Key Laboratory of Human Functional Genomics of Jiangsu Province, Nanjing Medical University, Nanjing, China; Jiangsu Key Lab of Cancer Biomarkers, Prevention and Treatment, Collaborative Innovation Center for Personalized Cancer Medicine, Nanjing Medical University, Nanjing, China; Department of Cell Biology, Nanjing Medical University, Nanjing, China
9
Cao Z, Huang Y, Duan R, Jin P, Qin ZS, Zhang S. Disease category-specific annotation of variants using an ensemble learning framework. Brief Bioinform 2021; 23:6394995. [PMID: 34643213 DOI: 10.1093/bib/bbab438] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/09/2021] [Revised: 09/03/2021] [Accepted: 09/22/2021] [Indexed: 02/01/2023] Open
Abstract
Understanding the impact of non-coding sequence variants on complex diseases is an essential problem. We present a novel ensemble learning framework, CASAVA, to predict genomic loci in terms of disease category-specific risk. Using disease-associated variants identified by GWAS as training data, and diverse sequencing-based genomics and epigenomics profiles as features, CASAVA provides risk prediction of 24 major categories of diseases throughout the human genome. Our studies showed that CASAVA scores at a genomic locus provide a reasonable prediction of the disease-specific and disease category-specific risk for non-coding variants located within the locus. Taking MHC2TA and immune system diseases as an example, we demonstrate the potential of CASAVA in revealing variant-disease associations. A website (http://zhanglabtools.org/CASAVA) has been built to facilitate easy access to CASAVA scores.
Affiliation(s)
- Zhen Cao
- NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China.,School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Yanting Huang
- Department of Computer Science, Emory University, Atlanta, GA 30322, USA
| | - Ran Duan
- Department of Software Engineering, Yunnan University, Kunming 650500, China
| | - Peng Jin
- Department of Human Genetics, Emory University School of Medicine, Atlanta, GA 30322, USA
| | - Zhaohui S Qin
- Department of Computer Science, Emory University, Atlanta, GA 30322, USA.,Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA
| | - Shihua Zhang
- NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China.,School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.,Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China.,Key Laboratory of Systems Biology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China
|
10
|
Jin Y, Jiang J, Wang R, Qin ZS. Systematic Evaluation of DNA Sequence Variations on in vivo Transcription Factor Binding Affinity. Front Genet 2021; 12:667866. [PMID: 34567058 PMCID: PMC8458901 DOI: 10.3389/fgene.2021.667866] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/15/2021] [Accepted: 08/02/2021] [Indexed: 02/01/2023]
Abstract
The majority of the single nucleotide variants (SNVs) identified by genome-wide association studies (GWAS) fall outside of the protein-coding regions. Elucidating the functional implications of these variants has been a major challenge. A possible mechanism for functional non-coding variants is that they disrupt the canonical transcription factor (TF) binding sites that affect the in vivo binding of the TF. However, their impact varies since many positions within a TF binding motif are not well conserved. Therefore, simply annotating all variants located in putative TF binding sites may overestimate the functional impact of these SNVs. We conducted a comprehensive survey to study the effect of SNVs on the TF binding affinity. A sequence-based machine learning method was used to estimate the change in binding affinity for each SNV located inside a putative motif site. From the results obtained on 18 TF binding motifs, we found that there is substantial variation in an SNV’s impact on TF binding affinity. We found that only about 20% of SNVs located inside putative TF binding sites are likely to have a significant impact on TF-DNA binding.
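To make the idea of a per-SNV change in binding affinity concrete, the following minimal sketch scores a reference and an alternate allele against a position weight matrix and reports the difference in log-odds. This is not the sequence-based machine learning method used in the study, only a simpler stand-in; the motif counts, the uniform background, the pseudocount, and all function names are hypothetical and chosen purely for illustration.

```python
import math

# Hypothetical 4-position motif: per-position base counts (illustrative only).
COUNTS = [
    {"A": 18, "C": 1, "G": 1, "T": 0},   # strongly conserved A
    {"A": 5, "C": 5, "G": 5, "T": 5},    # uninformative position
    {"A": 0, "C": 0, "G": 19, "T": 1},   # strongly conserved G
    {"A": 2, "C": 14, "G": 2, "T": 2},   # moderately conserved C
]
BACKGROUND = 0.25   # assumed uniform base frequency
PSEUDOCOUNT = 1.0   # assumed smoothing constant


def log_odds(position, base):
    """Log2 odds of `base` at `position` relative to the background."""
    column = COUNTS[position]
    total = sum(column.values()) + 4 * PSEUDOCOUNT
    prob = (column[base] + PSEUDOCOUNT) / total
    return math.log2(prob / BACKGROUND)


def site_score(seq):
    """Sum of per-position log-odds for a site as long as the motif."""
    return sum(log_odds(i, base) for i, base in enumerate(seq))


def snv_delta(ref_site, position, alt_base):
    """Score change when `alt_base` replaces the base at `position`."""
    alt_site = ref_site[:position] + alt_base + ref_site[position + 1:]
    return site_score(alt_site) - site_score(ref_site)
```

In this toy motif, a substitution at the conserved first position lowers the site score sharply, while a substitution at the uninformative second position leaves it unchanged, which mirrors the point that SNVs inside a putative binding site do not all carry the same functional weight.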
Affiliation(s)
- Yutong Jin
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States
| | - Jiahui Jiang
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States
| | - Ruixuan Wang
- College of Environmental Sciences and Engineering, Peking University, Beijing, China
| | - Zhaohui S Qin
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States
|
11
|
Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 2021; 37:2112-2120. [PMID: 33538820 PMCID: PMC11025658 DOI: 10.1093/bioinformatics/btab083] [Citation(s) in RCA: 340] [Impact Index Per Article: 85.0] [Received: 09/10/2020] [Revised: 12/31/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022]
Abstract
MOTIVATION Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. RESULTS To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferrable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. AVAILABILITY AND IMPLEMENTATION The source code, pretrained and finetuned model for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yanrong Ji
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA
| | - Zhihan Zhou
- Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
| | - Han Liu
- Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
| | - Ramana V Davuluri
- Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY 11794, USA
|
12
|
Wang Y, Jiang Y, Yao B, Huang K, Liu Y, Wang Y, Qin X, Saykin AJ, Chen L. WEVar: a novel statistical learning framework for predicting noncoding regulatory variants. Brief Bioinform 2021; 22:6279833. [PMID: 34021560 DOI: 10.1093/bib/bbab189] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/18/2021] [Revised: 04/05/2021] [Accepted: 04/23/2021] [Indexed: 11/15/2022]
Abstract
Understanding the functional consequence of noncoding variants is of great interest. Though genome-wide association studies or quantitative trait locus analyses have identified variants associated with traits or molecular phenotypes, most of them are located in the noncoding regions, making the identification of causal variants a particular challenge. Existing computational approaches developed for prioritizing noncoding variants produce inconsistent and even conflicting results. To address these challenges, we propose a novel statistical learning framework, which directly integrates the precomputed functional scores from representative scoring methods. It maximizes the use of the integrated methods by automatically learning the relative contribution of each method and produces an ensemble score as the final prediction. The framework consists of two modes. The first 'context-free' mode is trained using curated causal regulatory variants from a wide range of contexts and is applicable to predicting regulatory variants of unknown and diverse contexts. The second 'context-dependent' mode further improves the prediction when the training and testing variants are from the same context. By evaluating the framework via both simulation and empirical studies, we demonstrate that it outperforms the integrated scoring methods and that the ensemble score successfully prioritizes experimentally validated regulatory variants in multiple risk loci.
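The general idea of learning each scoring method's relative contribution and emitting a single ensemble score can be sketched with a plain logistic regression over precomputed scores. This is an illustrative stand-in under stated assumptions, not the WEVar model itself: the two scoring methods, the toy scores and labels, the learning rate, and the function names are all hypothetical.

```python
import math

# Toy training set: each row holds precomputed scores from two hypothetical
# scoring methods; the label marks a curated regulatory variant (1) or not (0).
SCORES = [(0.9, 0.5), (0.8, 0.4), (0.7, 0.6), (0.2, 0.5), (0.3, 0.6), (0.1, 0.4)]
LABELS = [1, 1, 1, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_weights(scores, labels, steps=2000, lr=0.5):
    """Fit per-method weights and a bias by batch gradient descent."""
    n_methods = len(scores[0])
    weights = [0.0] * n_methods
    bias = 0.0
    n = len(scores)
    for _ in range(steps):
        grad_w = [0.0] * n_methods
        grad_b = 0.0
        for row, y in zip(scores, labels):
            p = sigmoid(bias + sum(w * s for w, s in zip(weights, row)))
            for j, s in enumerate(row):
                grad_w[j] += (p - y) * s / n
            grad_b += (p - y) / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def ensemble_score(row, weights, bias):
    """Combine one variant's precomputed scores into a single probability."""
    return sigmoid(bias + sum(w * s for w, s in zip(weights, row)))


WEIGHTS, BIAS = fit_weights(SCORES, LABELS)
```

In the toy data only the first method separates the two classes, so the fitted weight for that method dominates; a real application would need held-out evaluation and far more training variants than this sketch uses.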
Affiliation(s)
- Ye Wang
- Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN, 46202, USA
| | - Yuchao Jiang
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, 46202, USA
| | - Bing Yao
- Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, 36849, USA
| | - Kun Huang
- Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC 27599, USA
| | - Yunlong Liu
- Department of Genetics, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Yue Wang
- Department of Human Genetics, Emory University, Atlanta, GA 30322, USA
| | - Xiao Qin
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, 46202, USA
| | - Andrew J Saykin
- Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN, 46202, USA
| | - Li Chen
- Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN, 46202, USA
|
13
|
Sobczyk MK, Richardson TG, Zuber V, Min JL, Gaunt TR, Paternoster L. Triangulating molecular evidence to prioritize candidate causal genes at established atopic dermatitis loci. J Invest Dermatol 2021; 141:2620-2629. [PMID: 33901562 PMCID: PMC8592116 DOI: 10.1016/j.jid.2021.03.027] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 01/08/2021] [Revised: 03/12/2021] [Accepted: 03/25/2021] [Indexed: 01/16/2023]
Abstract
Genome-wide association studies for atopic dermatitis (AD) have identified 25 reproducible loci. We attempt to prioritize candidate causal genes at these loci using extensive molecular resources compiled into a bioinformatics pipeline. We identified a list of 103 molecular resources for AD aetiology, including expression, protein and DNA methylation QTL datasets in skin or immune-relevant tissues which were tested for overlap with GWAS signals. This was combined with functional annotation using regulatory variant prediction, and features such as promoter-enhancer interactions, expression studies and variant fine-mapping. For each gene at each locus, we condensed the evidence into a prioritization score. Across the investigated loci, we detected significant enrichment of genes with adaptive immune regulatory function and epidermal barrier formation among the top prioritized genes. At 8 loci, we were able to prioritize a single candidate gene (IL6R, ADO, PRR5L, IL7R, ETS1, INPP5D, MDM1, TRAF3). In addition, at 6 of the 25 loci, our analysis prioritizes less familiar candidates (SLC22A5, IL2RA, MDM1, DEXI, ADO, STMN3). Our analysis provides support for previously implicated genes at several AD GWAS loci, as well as evidence for plausible additional candidates at others, which may represent potential targets for drug discovery.
Affiliation(s)
- Maria K Sobczyk
- MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol, Bristol, UK
| | - Tom G Richardson
- MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol, Bristol, UK
| | - Verena Zuber
- Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London, UK; MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, Cambridge, UK
| | - Josine L Min
- MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol, Bristol, UK
| | - Tom R Gaunt
- MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol, Bristol, UK
| | - Lavinia Paternoster
- MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol, Bristol, UK.
|
14
|
Zhang S, He Y, Liu H, Zhai H, Huang D, Yi X, Dong X, Wang Z, Zhao K, Zhou Y, Wang J, Yao H, Xu H, Yang Z, Sham PC, Chen K, Li MJ. regBase: whole genome base-wise aggregation and functional prediction for human non-coding regulatory variants. Nucleic Acids Res 2020; 47:e134. [PMID: 31511901 PMCID: PMC6868349 DOI: 10.1093/nar/gkz774] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 08/01/2019] [Accepted: 08/29/2019] [Indexed: 12/19/2022]
Abstract
Predicting the functional or pathogenic regulatory variants in the human non-coding genome facilitates the interpretation of disease causation. While numerous prediction methods are available, their performance is inconsistent or restricted to specific tasks, which creates a demand for comprehensive integration of those methods. Here, we compile a whole-genome base-wise aggregation, regBase, that incorporates the largest collection of prediction scores. Building on different assumptions of causality, we train three composite models to score functional, pathogenic and cancer driver non-coding regulatory variants, respectively. We demonstrate the superior and stable performance of our models using independent benchmarks and show that they succeed in fine-mapping causal regulatory variants at specific loci or at base-wise resolution. We believe that the regBase database, together with the three composite models, will be useful in different areas of human genetic studies, such as annotation-based causal variant fine-mapping, pathogenic variant discovery and cancer driver mutation identification. regBase is freely available at https://github.com/mulinlab/regBase.
Affiliation(s)
- Shijie Zhang
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
| | - Yukun He
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
| | - Huanhuan Liu
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
| | - Haoyu Zhai
- Department of Computer Science, University of Illinois Urbana-Champaign, IL, USA
| | - Dandan Huang
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China.,Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
| | - Xianfu Yi
- School of Biomedical Engineering, Tianjin Medical University, Tianjin, China
| | - Xiaobao Dong
- Department of Genetics, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
| | - Zhao Wang
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
| | - Ke Zhao
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
| | - Yao Zhou
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
| | - Jianhua Wang
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
| | - Hongcheng Yao
- School of Biomedical Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong, China
| | - Hang Xu
- School of Biomedical Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong, China
| | - Zhenglu Yang
- College of Computer Science, Nankai University, Tianjin, China
| | - Pak Chung Sham
- Centre of Genomics Sciences, State Key Laboratory of Brain and Cognitive Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
| | - Kexin Chen
- Department of Epidemiology and Biostatistics, Tianjin Key Laboratory of Molecular Cancer Epidemiology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
| | - Mulin Jun Li
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Key Laboratory of Inflammation Biology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China.,Department of Epidemiology and Biostatistics, Tianjin Key Laboratory of Molecular Cancer Epidemiology, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
|
15
|
Tian R, Pan Y, Etheridge THA, Deshmukh H, Gulick D, Gibson G, Bao G, Lee CM. Pitfalls in Single Clone CRISPR-Cas9 Mutagenesis to Fine-map Regulatory Intervals. Genes (Basel) 2020; 11:E504. [PMID: 32375333 PMCID: PMC7288657 DOI: 10.3390/genes11050504] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 03/19/2020] [Revised: 04/15/2020] [Accepted: 04/22/2020] [Indexed: 12/11/2022]
Abstract
The majority of genetic variants affecting complex traits map to regulatory regions of genes, and typically lie in credible intervals of 100 or more SNPs. Fine mapping of the causal variant(s) at a locus depends on assays that are able to discriminate the effects of polymorphisms or mutations on gene expression. Here, we evaluated a moderate-throughput CRISPR-Cas9 mutagenesis approach, based on replicated measurement of transcript abundance in single-cell clones, by deleting candidate regulatory SNPs, affecting four genes known to be affected by large-effect expression Quantitative Trait Loci (eQTL) in leukocytes, and using Fluidigm qRT-PCR to monitor gene expression in HL60 pro-myeloid human cells. We concluded that there were multiple constraints that rendered the approach generally infeasible for fine mapping. These included the non-targetability of many regulatory SNPs, clonal variability of single-cell derivatives, and expense. Power calculations based on the measured variance attributable to major sources of experimental error indicated that typical eQTL explaining 10% of the variation in expression of a gene would usually require at least eight biological replicates of each clone. Scanning across credible intervals with this approach is not recommended.
Affiliation(s)
- Ruoyu Tian
- Center for Integrative Genomics, Georgia Institute of Technology, Atlanta, GA 30332, USA; (R.T.); (D.G.)
| | - Yidan Pan
- Systems, Synthetic, and Physical Biology, Rice University, Houston, TX 77005, USA;
- Department of Bioengineering, Rice University, Houston, TX 77005, USA; (H.D.); (T.H.A.E.)
| | - Thomas H. A. Etheridge
- Department of Bioengineering, Rice University, Houston, TX 77005, USA; (H.D.); (T.H.A.E.)
| | - Harshavardhan Deshmukh
- Department of Bioengineering, Rice University, Houston, TX 77005, USA; (H.D.); (T.H.A.E.)
| | - Dalia Gulick
- Center for Integrative Genomics, Georgia Institute of Technology, Atlanta, GA 30332, USA; (R.T.); (D.G.)
| | - Greg Gibson
- Center for Integrative Genomics, Georgia Institute of Technology, Atlanta, GA 30332, USA; (R.T.); (D.G.)
| | - Gang Bao
- Systems, Synthetic, and Physical Biology, Rice University, Houston, TX 77005, USA;
- Department of Bioengineering, Rice University, Houston, TX 77005, USA; (H.D.); (T.H.A.E.)
| | - Ciaran M Lee
- APC Microbiome Ireland, University College Cork, Cork T12 YN60, Ireland
|
16
|
Comprehensive functional annotation of susceptibility variants associated with asthma. Hum Genet 2020; 139:1037-1053. [PMID: 32240371 DOI: 10.1007/s00439-020-02151-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 12/19/2019] [Accepted: 03/18/2020] [Indexed: 01/02/2023]
Abstract
Genome-wide association studies (GWAS) have identified hundreds of primarily non-coding disease-susceptibility variants that need further functional interpretation to prioritize and discriminate the disease-relevant variants. We present a comprehensive genome-wide non-coding variant prioritization scheme followed by validation using Pyrosequencing and TaqMan assays in asthma. We implemented a composite Functional Annotation Score (cFAS) to investigate over 32,000 variants consisting of 1525 GWAS-lead asthma-susceptibility variants and their LD proxies (r² ≥ 0.80). The functional annotation pipeline in cFAS revealed 274 variants with a significant score at a 1% false discovery rate. This study implicates a novel locus, 4p16 (SLC26A1), with an eQTL variant (rs11936407), and known loci in 17q12-21 and 5q22, which encode ORM1-like protein 3 (ORMDL3, rs4065275 and rs12936231) and the epithelial gene thymic stromal lymphopoietin (TSLP, rs3806932 and rs10073816), respectively. Follow-up validation analysis through pyrosequencing of CpG sites in and near rs4065275 and rs11936407 showed genotype-dependent hypomethylation in asthma cases compared with healthy controls. Prioritized variants are enriched for asthma-specific histone modifications associated with active chromatin (H3K4me1 and H3K27ac) in T cells, B cells and lung, and for immune-related interferon gamma signaling pathways. Our findings, together with those from prior studies, suggest that SNPs can affect asthma by regulating enhancer activity, and our comprehensive bioinformatics and functional analysis could lead to biological insights into asthma pathogenesis.
|
17
|
Chimusa ER, Dalvie S, Dandara C, Wonkam A, Mazandu GK. Post genome-wide association analysis: dissecting computational pathway/network-based approaches. Brief Bioinform 2020; 20:690-700. [PMID: 29701762 DOI: 10.1093/bib/bby035] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 01/02/2018] [Revised: 04/04/2018] [Indexed: 02/02/2023]
Abstract
Thousands of genetic associations with diseases have been identified by genome-wide association studies (GWASs), which are conceptually a single-marker-based approach. There are potentially many uses of these identified variants, including a better understanding of the pathogenesis of diseases, new leads for studying underlying risk prediction and clinical prediction of treatment. However, because of inadequate power, GWAS might miss disease genes and/or pathways with weak genetic or strong epistatic effects. Driven by the need to extract useful information from GWAS summary statistics, post-GWAS approaches (PGAs) were introduced. Here, we dissect and discuss advances made in pathway/network-based PGAs, with a particular focus on protein-protein interaction networks that leverage GWAS summary statistics by combining effects of multiple loci, subnetworks or pathways to detect genetic signals associated with complex diseases. We conclude with a discussion of research areas where further work on summary statistic-based methods is needed.
Affiliation(s)
- Emile R Chimusa
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Level 3, Wernher and Beit North, Private Bag, Rondebosch, 7700, Anzio road, Observatory Cape Town, South Africa
| | - Shareefa Dalvie
- Department of Psychiatry and Mental Health, University of Cape Town, Observatory, 7925, Cape Town, South Africa
| | - Collet Dandara
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Private Bag, Rondebosch, 7700, Cape Town, South Africa
| | - Ambroise Wonkam
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Private Bag, Rondebosch, 7700, Cape Town, South Africa
| | - Gaston K Mazandu
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Private Bag, Rondebosch, 7700, Cape Town, South Africa; African Institute for Mathematical Sciences, 7945 Muizenberg, Cape Town, South Africa and Computational Biology Division, Department of Integrative Biomedical Sciences, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Medical School, Anzio Road, Observatory, 7925, Cape Town, South Africa
|
18
|
Rojano E, Seoane P, Ranea JAG, Perkins JR. Regulatory variants: from detection to predicting impact. Brief Bioinform 2019; 20:1639-1654. [PMID: 29893792 PMCID: PMC6917219 DOI: 10.1093/bib/bby039] [Citation(s) in RCA: 66] [Impact Index Per Article: 11.0] [Received: 03/13/2018] [Revised: 04/18/2018] [Indexed: 02/01/2023]
Abstract
Variants within non-coding genomic regions can greatly affect disease. In recent years, increasing focus has been given to these variants, and how they can alter regulatory elements, such as enhancers, transcription factor binding sites and DNA methylation regions. Such variants can be considered regulatory variants. Concurrently, much effort has been put into establishing international consortia to undertake large projects aimed at discovering regulatory elements in different tissues, cell lines and organisms, and probing the effects of genetic variants on regulation by measuring gene expression. Here, we describe methods and techniques for discovering disease-associated non-coding variants using sequencing technologies. We then explain the computational procedures that can be used for annotating these variants using the information from the aforementioned projects, and for predicting their putative effects, including potential pathogenicity, based on rule-based and machine learning approaches. We provide the details of techniques to validate these predictions, by mapping chromatin-chromatin and chromatin-protein interactions, and introduce Clustered Regularly Interspaced Short Palindromic Repeats-Associated Protein 9 (CRISPR-Cas9) technology, which has already been used in this field and is likely to have a big impact on its future evolution. We also give examples of regulatory variants associated with multiple complex diseases. This review is aimed at bioinformaticians interested in the characterization of regulatory variants, molecular biologists and geneticists interested in understanding more about the nature and potential role of such variants from a functional point of view, and clinicians who may wish to learn about variants in non-coding genomic regions associated with a given disease and find out what to do next to uncover how they affect the underlying mechanisms.
Affiliation(s)
- Elena Rojano
- Department of Molecular Biology and Biochemistry, University of Malaga (UMA), 29010 Malaga, Spain
| | - Pedro Seoane
- Department of Molecular Biology and Biochemistry, University of Malaga (UMA), 29010 Malaga, Spain
| | - Juan A G Ranea
- CIBER de Enfermedades Raras, ISCIII, Madrid, Spain and Department of Molecular Biology and Biochemistry, University of Malaga (UMA), 29010 Malaga, Spain
| | - James R Perkins
- Research laboratory, IBIMA-Regional University Hospital of Malaga, UMA, Malaga 29009, Spain
|
19
|
Huang D, Yi X, Zhang S, Zheng Z, Wang P, Xuan C, Sham PC, Wang J, Li MJ. GWAS4D: multidimensional analysis of context-specific regulatory variant for human complex diseases and traits. Nucleic Acids Res 2019; 46:W114-W120. [PMID: 29771388 PMCID: PMC6030885 DOI: 10.1093/nar/gky407] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 02/18/2018] [Accepted: 05/03/2018] [Indexed: 01/04/2023]
Abstract
Genome-wide association studies have identified thousands of susceptibility loci for many human complex traits, and yet for most of these associations the true causal variants remain unknown. Tissue/cell type-specific prediction and prioritization of non-coding regulatory variants will facilitate the identification of causal variants and underlying pathogenic mechanisms for particular complex diseases and traits. By leveraging recent large-scale functional genomics/epigenomics data, we develop an intuitive web server, GWAS4D (http://mulinlab.tmu.edu.cn/gwas4d or http://mulinlab.org/gwas4d), that systematically evaluates GWAS signals and identifies context-specific regulatory variants. The updated web server includes six major features: (i) updates the regulatory variant prioritization method with our new algorithm; (ii) incorporates 127 tissue/cell type-specific epigenomes; (iii) integrates motifs of 1480 transcriptional regulators from 13 public resources; (iv) uniformly processes Hi-C data and generates significant interactions at 5 kb resolution across 60 tissues/cell types; (v) adds comprehensive non-coding variant functional annotations; (vi) provides a highly interactive visualization function for SNP-target interaction. Using a GWAS fine-mapped set for 161 coronary artery disease risk loci, we demonstrate that GWAS4D is able to efficiently prioritize disease-causal regulatory variants.
Affiliation(s)
- Dandan Huang
- Department of Pharmacology, Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China.,Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
| | - Xianfu Yi
- School of Biomedical Engineering, Tianjin Medical University, Tianjin, China
| | - Shijie Zhang
- Department of Pharmacology, Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
| | - Zhanye Zheng
- Department of Pharmacology, Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
| | - Panwen Wang
- Department of Health Sciences Research & Center for Individualized Medicine, Mayo Clinic, Scottsdale, USA
| | - Chenghao Xuan
- Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
| | - Pak Chung Sham
- Center for Genomic Sciences, The University of Hong Kong, Hong Kong SAR, China.,Departments of Psychiatry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China.,State Key Laboratory of Brain and Cognitive Sciences, The University of Hong Kong, Hong Kong SAR, China
| | - Junwen Wang
- Department of Health Sciences Research & Center for Individualized Medicine, Mayo Clinic, Scottsdale, USA.,Department of Biomedical Informatics, Arizona State University, Scottsdale, USA
| | - Mulin Jun Li
- Department of Pharmacology, Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
|
20
|
Li MJ, Yao H, Huang D, Liu H, Liu Z, Xu H, Qin Y, Prinz J, Xia W, Wang P, Yan B, Tran NL, Kocher JP, Sham PC, Wang J. mTCTScan: a comprehensive platform for annotation and prioritization of mutations affecting drug sensitivity in cancers. Nucleic Acids Res 2017; 45:W215-W221. [PMID: 28482068 PMCID: PMC5793836 DOI: 10.1093/nar/gkx400] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 02/27/2017] [Accepted: 04/27/2017] [Indexed: 12/25/2022] Open
Abstract
Cancer therapies have experienced rapid progress in recent years, with a number of novel small-molecule kinase inhibitors and monoclonal antibodies now being widely used to treat various types of human cancers. During cancer treatments, mutations can have important effects on drug sensitivity. However, the relationship between tumor genomic profiles and the effectiveness of cancer drugs remains elusive. We introduce the Mutation To Cancer Therapy Scan (mTCTScan) web server (http://jjwanglab.org/mTCTScan), which systematically analyzes mutations affecting cancer drug sensitivity based on individual genomic profiles. The platform was developed by leveraging the latest knowledge on mutation-cancer drug sensitivity associations and the results from large-scale chemical screening using human cancer cell lines. Using an evidence-based scoring scheme that integrates current evidence, mTCTScan is able to prioritize mutations according to their associations with cancer drugs and preclinical compounds. It can also show related drugs/compounds with sensitivity classification by considering the context of the entire genomic profile. In addition, mTCTScan incorporates comprehensive filtering functions and cancer-related annotations to better interpret mutation effects and their association with cancer drugs. This platform will greatly benefit both researchers and clinicians in interrogating mechanisms of mutation-dependent drug response, which will have a significant impact on cancer precision medicine.
Affiliation(s)
- Mulin Jun Li
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China.,Center for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China
| | - Hongcheng Yao
- Center for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China.,School of Biomedical Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China
| | - Dandan Huang
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
| | - Huanhuan Liu
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
| | - Zipeng Liu
- Center for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China
| | - Hang Xu
- Center for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China.,School of Biomedical Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China
| | - Yiming Qin
- Center for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China.,School of Biomedical Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China
| | - Jeanette Prinz
- Department of Health Sciences Research, Center for Individualized Medicine, Mayo Clinic, Scottsdale, AZ 85259, USA
| | - Weiyi Xia
- School of Public Health, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China
| | - Panwen Wang
- Department of Health Sciences Research, Center for Individualized Medicine, Mayo Clinic, Scottsdale, AZ 85259, USA
| | - Bin Yan
- Center for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China.,School of Biomedical Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China
| | - Nhan L Tran
- Department of Cancer Biology, Mayo Clinic, Scottsdale, AZ 85259, USA
| | - Jean-Pierre Kocher
- Department of Health Sciences Research, Center for Individualized Medicine, Mayo Clinic, Scottsdale, AZ 85259, USA
| | - Pak C Sham
- Center for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China.,Departments of Psychiatry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR 999077, China
| | - Junwen Wang
- Department of Health Sciences Research, Center for Individualized Medicine, Mayo Clinic, Scottsdale, AZ 85259, USA.,Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA
|
21
|
Liu L, Sanderford MD, Patel R, Chandrashekar P, Gibson G, Kumar S. Biological relevance of computationally predicted pathogenicity of noncoding variants. Nat Commun 2019; 10:330. [PMID: 30659175 PMCID: PMC6338804 DOI: 10.1038/s41467-018-08270-y] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 04/19/2018] [Accepted: 12/19/2018] [Indexed: 11/15/2022] Open
Abstract
Computational prediction of the phenotypic propensities of noncoding single nucleotide variants typically combines annotation of genomic, functional and evolutionary attributes into a single score. Here, we evaluate if the claimed excellent accuracies of these predictions translate into high rates of success in addressing questions important in biological research, such as fine mapping causal variants, distinguishing pathogenic allele(s) at a given position, and prioritizing variants for genetic risk assessment. A significant disconnect is found to exist between the statistical modelling and biological performance of predictive approaches. We discuss fundamental reasons underlying these deficiencies and suggest that future improvements of computational predictions need to address confounding of allelic, positional and regional effects as well as imbalance of the proportion of true positive variants in candidate lists. Researchers can make use of a variety of computational tools to prioritize genetic variants and predict their pathogenicity. Here, the authors evaluate the performance of six of these tools in three typical biological tasks and find generally low concordance of predictions and experimental confirmation.
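The point about imbalance in candidate lists can be illustrated with a small calculation. The sketch below is not taken from the paper: the sensitivity, specificity and prevalence values are hypothetical, and the function simply applies Bayes' rule to show how the fraction of true positives among predicted positives drops when causal variants are rare in the candidate list.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Fraction of predicted-positive variants that are truly positive (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier with 90% sensitivity and 90% specificity.
# Balanced benchmark set: half of the variants are truly functional.
balanced = positive_predictive_value(0.9, 0.9, 0.5)
# Imbalanced candidate list: 1 truly functional variant per 100 candidates.
imbalanced = positive_predictive_value(0.9, 0.9, 0.01)

print(f"PPV on balanced set:    {balanced:.3f}")
print(f"PPV on imbalanced list: {imbalanced:.3f}")
```

Under these assumed numbers the same classifier looks far less useful on the imbalanced list, which is one way to read the reported gap between statistical and biological performance.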
Affiliation(s)
- Li Liu
- College of Health Solutions, Biodesign Institute, Arizona State University, Tempe, AZ, USA
| | - Maxwell D Sanderford
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
| | - Ravi Patel
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA.,Department of Biology, Temple University, Philadelphia, PA, USA
| | - Pramod Chandrashekar
- College of Health Solutions, Biodesign Institute, Arizona State University, Tempe, AZ, USA
| | - Greg Gibson
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA.
| | - Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA.,Department of Biology, Temple University, Philadelphia, PA, USA.
|
22
|
Khan A, Liu Q, Wang K. iMEGES: integrated mental-disorder GEnome score by deep neural network for prioritizing the susceptibility genes for mental disorders in personal genomes. BMC Bioinformatics 2018; 19:501. [PMID: 30591030 PMCID: PMC6309067 DOI: 10.1186/s12859-018-2469-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND: A range of rare and common genetic variants have been discovered to be potentially associated with mental diseases, but many more have not been uncovered. Powerful integrative methods are needed to systematically prioritize both variants and genes that confer susceptibility to mental diseases in personal genomes of individual patients and to facilitate the development of personalized treatment or therapeutic approaches.
METHODS: Leveraging a deep neural network built on the TensorFlow framework, we developed a computational tool, integrated Mental-disorder GEnome Score (iMEGES), for analyzing whole genome/exome sequencing data on personal genomes. iMEGES takes as input genetic mutations and phenotypic information from a patient with mental disorders, and outputs the rank of whole genome susceptibility variants and the prioritized disease-specific genes for mental disorders by integrating contributions from coding and non-coding variants, structural variants (SVs), known brain expression quantitative trait loci (eQTLs), and epigenetic information from PsychENCODE.
RESULTS: iMEGES was evaluated on multiple datasets of mental disorders, and it achieved better performance than competing approaches when a large training dataset was available.
CONCLUSION: iMEGES can be used in population studies to help prioritize novel genes or variants that might be associated with susceptibility to mental disorders, and also on individual patients to help identify genes or variants related to mental diseases.
Affiliation(s)
- Atlas Khan
- Division of Nephrology, Department of Medicine, College of Physicians and Surgeons, Columbia University, New York, NY 10032 USA
| | - Qian Liu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA
- Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104 USA
|
23
|
A semi-supervised approach for predicting cell-type specific functional consequences of non-coding variation using MPRAs. Nat Commun 2018; 9:5199. [PMID: 30518757 PMCID: PMC6281617 DOI: 10.1038/s41467-018-07349-w] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 05/31/2018] [Accepted: 10/18/2018] [Indexed: 01/21/2023] Open
Abstract
Predicting the functional consequences of genetic variants in non-coding regions is a challenging problem. We propose here a semi-supervised approach, GenoNet, to jointly utilize experimentally confirmed regulatory variants (labeled variants), millions of unlabeled variants genome-wide, and more than a thousand cell/tissue type specific epigenetic annotations to predict functional consequences of non-coding variants. Through the application to several experimental datasets, we demonstrate that the proposed method significantly improves prediction accuracy compared to existing functional prediction methods at the tissue/cell type level, but especially so at the organism level. Importantly, we illustrate how the GenoNet scores can help in fine-mapping at GWAS loci, and in the discovery of disease associated genes in sequencing studies. As more comprehensive lists of experimentally validated variants become available over the next few years, semi-supervised methods like GenoNet can be used to provide increasingly accurate functional predictions for variants genome-wide and across a variety of cell/tissue types. Predicting the functional consequences of non-coding genetic variants is a challenge. Here, He et al. present GenoNet, a semi-supervised method that combines information from experimentally confirmed regulatory variants with cell type- and tissue specific annotation for function prediction.
|
24
|
Wong KC. DNA Motif Recognition Modeling from Protein Sequences. iScience 2018; 7:198-211. [PMID: 30267681 PMCID: PMC6153143 DOI: 10.1016/j.isci.2018.09.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 07/05/2018] [Revised: 08/08/2018] [Accepted: 09/04/2018] [Indexed: 12/31/2022] Open
Abstract
Although the existing works on DNA motif discovery on DNA sequences are plethoric, mechanistic knowledge to infer DNA motifs from protein sequences across multiple DNA-binding domain families without conducting any wet-lab experiments is still lacking. Therefore, the k-spectrum recognition modeling is proposed to address the issues at the highest possible resolutions. The k-spectrum model can capture DNA motif patterns from protein sequences at the resolution in which local sequence context and nucleotide dependency can be taken into account completely. Multiple evaluation metrics are adopted and measured on millions of k-mer binding intensities from 92 proteins across 5 DNA-binding families (i.e., bHLH, bZIP, ETS, Forkhead, and Homeodomain), demonstrating its competitive edges. In addition, it not only can contribute to DNA motif recognition modeling but also can help prioritize the observed or even unobserved binding of single nucleotide variants on transcription factor binding sites in a genome-wide manner.
DNA motif modeling from protein is fundamental for understanding gene regulation. A framework is proposed at the highest possible sequence resolution for the first time. It is validated on millions of k-mer intensities from 92 proteins across 5 families. It can prioritize the unobserved regulatory single nucleotide variants on DNA motifs.
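For readers unfamiliar with the term, a k-mer spectrum is simply the table of counts of every length-k substring in a sequence. The sketch below shows only that generic building block; it is not the paper's k-spectrum recognition model, and the function name and example sequence are illustrative assumptions.

```python
from collections import Counter


def k_spectrum(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer (length-k substring) in the sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    # range() is empty when the sequence is shorter than k, giving an empty spectrum.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Illustrative DNA sequence; the same function works on protein sequences.
spectrum = k_spectrum("ACGTACG", 3)
for kmer, count in sorted(spectrum.items()):
    print(kmer, count)
```

A model built on such spectra can weigh each k-mer individually, which is how local sequence context enters the features.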
Affiliation(s)
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong.
|
25
|
Ioannidis NM, Davis JR, DeGorter MK, Larson NB, McDonnell SK, French AJ, Battle AJ, Hastie TJ, Thibodeau SN, Montgomery SB, Bustamante CD, Sieh W, Whittemore AS. FIRE: functional inference of genetic variants that regulate gene expression. Bioinformatics 2018; 33:3895-3901. [PMID: 28961785 DOI: 10.1093/bioinformatics/btx534] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 03/14/2017] [Accepted: 08/23/2017] [Indexed: 12/18/2022] Open
Abstract
Motivation: Interpreting genetic variation in noncoding regions of the genome is an important challenge for personal genome analysis. One mechanism by which noncoding single nucleotide variants (SNVs) influence downstream phenotypes is through the regulation of gene expression. Methods to predict whether or not individual SNVs are likely to regulate gene expression would aid interpretation of variants of unknown significance identified in whole-genome sequencing studies.
Results: We developed FIRE (Functional Inference of Regulators of Expression), a tool to score both noncoding and coding SNVs based on their potential to regulate the expression levels of nearby genes. FIRE consists of 23 random forests trained to recognize SNVs in cis-expression quantitative trait loci (cis-eQTLs) using a set of 92 genomic annotations as predictive features. FIRE scores discriminate cis-eQTL SNVs from non-eQTL SNVs in the training set with a cross-validated area under the receiver operating characteristic curve (AUC) of 0.807, and discriminate cis-eQTL SNVs shared across six populations of different ancestry from non-eQTL SNVs with an AUC of 0.939. FIRE scores are also predictive of cis-eQTL SNVs across a variety of tissue types.
Availability and implementation: FIRE scores for genome-wide SNVs in hg19/GRCh37 are available for download at https://sites.google.com/site/fireregulatoryvariation/.
Contact: nilah@stanford.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
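The AUC values quoted above have a direct interpretation: the probability that a randomly chosen positive (here, a cis-eQTL SNV) receives a higher score than a randomly chosen negative. The sketch below computes AUC that way on made-up scores and labels; it is a generic illustration, not FIRE's evaluation code, and all names and numbers in it are assumptions.

```python
def roc_auc(scores, labels):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predictor scores and true labels (1 = eQTL SNV, 0 = non-eQTL SNV).
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of variants, so it suits small illustrations; for genome-scale score sets a rank-based computation gives the same value more efficiently.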
Affiliation(s)
| | | | - Marianne K DeGorter
- Department of Genetics
- Department of Pathology, Stanford University School of Medicine, Stanford, CA 94305, USA
| | | | | | - Amy J French
- Department of Laboratory Medicine & Pathology, Mayo Clinic, Rochester, MN 55905, USA
| | - Alexis J Battle
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Trevor J Hastie
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Stephen N Thibodeau
- Department of Laboratory Medicine & Pathology, Mayo Clinic, Rochester, MN 55905, USA
| | - Stephen B Montgomery
- Department of Genetics
- Department of Pathology, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Carlos D Bustamante
- Department of Genetics
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Weiva Sieh
- Department of Health Research & Policy
- Department of Population Health Science & Policy
- Department of Genetics & Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Alice S Whittemore
- Department of Health Research & Policy
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
|
26
|
Backenroth D, He Z, Kiryluk K, Boeva V, Pethukova L, Khurana E, Christiano A, Buxbaum JD, Ionita-Laza I. FUN-LDA: A Latent Dirichlet Allocation Model for Predicting Tissue-Specific Functional Effects of Noncoding Variation: Methods and Applications. Am J Hum Genet 2018; 102:920-942. [PMID: 29727691 PMCID: PMC5986983 DOI: 10.1016/j.ajhg.2018.03.026] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Received: 12/04/2017] [Accepted: 03/21/2018] [Indexed: 10/17/2022] Open
Abstract
We describe a method based on a latent Dirichlet allocation model for predicting functional effects of noncoding genetic variants in a cell-type- and/or tissue-specific way (FUN-LDA). Using this unsupervised approach, we predict tissue-specific functional effects for every position in the human genome in 127 different tissues and cell types. We demonstrate the usefulness of our predictions by using several validation experiments. Using eQTL data from several sources, including the GTEx project, Geuvadis project, and TwinsUK cohort, we show that eQTLs in specific tissues tend to be most enriched among the predicted functional variants in relevant tissues in Roadmap. We further show how these integrated functional scores can be used for (1) deriving the most likely cell or tissue type causally implicated for a complex trait by using summary statistics from genome-wide association studies and (2) estimating a tissue-based correlation matrix of various complex traits. We found large enrichment of heritability in functional components of relevant tissues for various complex traits, and FUN-LDA yielded higher enrichment estimates than existing methods. Finally, using experimentally validated functional variants from the literature and variants possibly implicated in disease by previous studies, we rigorously compare FUN-LDA with state-of-the-art functional annotation methods and show that FUN-LDA has better prediction accuracy and higher resolution than these methods. In particular, our results suggest that tissue- and cell-type-specific functional prediction methods tend to have substantially better prediction accuracy than organism-level prediction methods. Scores for each position in the human genome and for each ENCODE and Roadmap tissue are available online (see Web Resources).
Affiliation(s)
- Daniel Backenroth
- Department of Biostatistics, Columbia University, New York, NY 10032, USA
| | - Zihuai He
- Department of Biostatistics, Columbia University, New York, NY 10032, USA
| | - Krzysztof Kiryluk
- Department of Medicine, Columbia University, New York, NY 10032, USA
| | - Valentina Boeva
- INSERM, U900, 75005 Paris, France; Institut Curie, Mines ParisTech, PSL Research University, 75005 Paris, France
| | - Lynn Pethukova
- Department of Epidemiology, Columbia University, New York, NY 10032, USA; Department of Dermatology, Columbia University, New York, NY 10032, USA
| | - Ekta Khurana
- Department of Physiology and Biophysics, Weill Medical College, Cornell University, New York, NY 10021, USA
| | - Angela Christiano
- Department of Dermatology, Columbia University, New York, NY 10032, USA; Department of Genetics and Development, Columbia University, New York, NY 10032, USA
| | - Joseph D Buxbaum
- Departments of Psychiatry, Neuroscience, and Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Friedman Brain Institute and Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
|
27
|
Ying D, Li MJ, Sham PC, Li M. A powerful approach reveals numerous expression quantitative trait haplotypes in multiple tissues. Bioinformatics 2018; 34:3145-3150. [DOI: 10.1093/bioinformatics/bty318] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 11/01/2017] [Accepted: 04/25/2018] [Indexed: 12/21/2022] Open
Affiliation(s)
- Dingge Ying
- Department of Psychiatry, The Centre for Genomic Sciences, State Key Laboratory of Brain and Cognitive Sciences, The University of Hong Kong, Pokfulam, Hong Kong, China
| | - Mulin Jun Li
- Department of Psychiatry, The Centre for Genomic Sciences, State Key Laboratory of Brain and Cognitive Sciences, The University of Hong Kong, Pokfulam, Hong Kong, China
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
| | - Pak Chung Sham
- Department of Psychiatry, The Centre for Genomic Sciences, State Key Laboratory of Brain and Cognitive Sciences, The University of Hong Kong, Pokfulam, Hong Kong, China
| | - Miaoxin Li
- Department of Psychiatry, The Centre for Genomic Sciences, State Key Laboratory of Brain and Cognitive Sciences, The University of Hong Kong, Pokfulam, Hong Kong, China
- Zhongshan School of Medicine, Center for Disease Genomics, Sun Yat-Sen University, Guangzhou, China
- Key Laboratory of Tropical Disease Control (SYSU), Ministry of Education, Guangzhou, China
|
28
|
Lee PH, Lee C, Li X, Wee B, Dwivedi T, Daly M. Principles and methods of in-silico prioritization of non-coding regulatory variants. Hum Genet 2018; 137:15-30. [PMID: 29288389 PMCID: PMC5892192 DOI: 10.1007/s00439-017-1861-0] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 10/24/2017] [Accepted: 12/14/2017] [Indexed: 12/13/2022]
Abstract
Over more than a decade, genome-wide association studies have made great strides toward the detection of genes and genetic mechanisms underlying complex traits. However, the majority of associated loci reside in non-coding regions that are, in general, functionally uncharacterized. Now, the availability of large-scale tissue and cell type-specific transcriptome and epigenome data enables us to elucidate how non-coding genetic variants can affect gene expression and are associated with phenotypic changes. Here, we provide an overview of this emerging field in human genomics, summarizing available data resources and state-of-the-art analytic methods to facilitate in-silico prioritization of non-coding regulatory mutations. We also highlight the limitations of current approaches and discuss the direction of much-needed future research.
Affiliation(s)
- Phil H Lee
- Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA.
- Quantitative Genomics Program, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
| | - Christian Lee
- Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA
- Department of Life Sciences, Harvard University, Cambridge, MA, USA
| | - Xihao Li
- Quantitative Genomics Program, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Brian Wee
- Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA
| | - Tushar Dwivedi
- Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA
- John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
| | - Mark Daly
- Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
|
29
|
Li M, Li J, Li MJ, Pan Z, Hsu JS, Liu DJ, Zhan X, Wang J, Song Y, Sham PC. Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework. Nucleic Acids Res 2017; 45:e75. [PMID: 28115622 PMCID: PMC5435951 DOI: 10.1093/nar/gkx019] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 10/09/2016] [Accepted: 01/06/2017] [Indexed: 12/21/2022] Open
Abstract
Whole genome sequencing (WGS) is a promising strategy to unravel variants or genes responsible for human diseases and traits. However, there is a lack of robust platforms for a comprehensive downstream analysis. In the present study, we first proposed three novel algorithms (sequence gap-filled gene feature annotation, bit-block encoded genotypes and sectional fast access to text lines) to address three fundamental problems. The three algorithms then formed the infrastructure of a robust parallel computing framework, KGGSeq, for integrating downstream analysis functions for whole genome sequencing data. KGGSeq has been equipped with a comprehensive set of analysis functions for quality control, filtration, annotation, pathogenic prediction and statistical tests. In tests with whole genome sequencing data from the 1000 Genomes Project, KGGSeq annotated several thousand more reliable non-synonymous variants than other widely used tools (e.g. ANNOVAR and SNPEff). It took only around half an hour on a small server with 10 CPUs to access genotypes of ∼60 million variants of 2504 subjects, while a popular alternative tool required around one day. KGGSeq's bit-block genotype format used 1.5% or less space to flexibly represent phased or unphased genotypes with multiple alleles and calculated genotypic correlation over 1000 times faster.
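The space saving from storing genotypes as bits rather than text can be illustrated with a simple 2-bit packing scheme. This is only a sketch of the general idea under assumed conventions (codes 0, 1, 2 for the number of alternate alleles and 3 for missing, four genotypes per byte); it is not KGGSeq's actual bit-block format, which also covers phased and multi-allelic genotypes.

```python
def pack_genotypes(genotypes):
    """Pack genotype codes (0, 1, 2 alternate alleles; 3 = missing) at 2 bits each."""
    packed = bytearray((len(genotypes) + 3) // 4)  # four genotypes per byte
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2, 3):
            raise ValueError(f"invalid genotype code: {g}")
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from the packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]


# Hypothetical genotype calls for six subjects at one variant.
calls = [0, 1, 2, 3, 2, 0]
packed = pack_genotypes(calls)
print(len(packed), "bytes for", len(calls), "genotypes")
print(unpack_genotypes(packed, len(calls)))
```

Beyond the storage saving, a packed layout lets counts and correlations be computed with bitwise operations over whole machine words, which is the usual source of large speed-ups in this kind of design.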
Affiliation(s)
- Miaoxin Li
- Department of Medical Genetics, Center for Genome Research, Center for Precision Medicine, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, 510080, China.,The Centre for Genomic Sciences, the University of Hong Kong, Pokfulam, Hong Kong.,Department of Psychiatry, the University of Hong Kong, Pokfulam, Hong Kong.,State Key Laboratory for Cognitive and Brain Sciences, the University of Hong Kong, Pokfulam, Hong Kong
| | - Jiang Li
- School of Biomedical Sciences, the University of Hong Kong, Pokfulam, Hong Kong
| | - Mulin Jun Li
- The Centre for Genomic Sciences, the University of Hong Kong, Pokfulam, Hong Kong
| | - Zhicheng Pan
- Department of Psychiatry, the University of Hong Kong, Pokfulam, Hong Kong
| | - Jacob Shujui Hsu
- Department of Psychiatry, the University of Hong Kong, Pokfulam, Hong Kong
| | - Dajiang J Liu
- Division of Biostatistics and Bioinformatics, Department of Public Health Sciences, Penn State College of Medicine, Hershey, PA 17033, USA.,Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA 17033, USA
| | - Xiaowei Zhan
- Quantitative Biomedical Research Center, Department of Clinical Science, Center for the Genetics of Host Defense, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
| | - Junwen Wang
- The Centre for Genomic Sciences, the University of Hong Kong, Pokfulam, Hong Kong.,Department of Health Sciences Research and Center for Individualized Medicine, Mayo Clinic, Scottsdale, AZ 85259, USA.,Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA
| | - Youqiang Song
- The Centre for Genomic Sciences, the University of Hong Kong, Pokfulam, Hong Kong.,School of Biomedical Sciences, the University of Hong Kong, Pokfulam, Hong Kong
| | - Pak Chung Sham
- The Centre for Genomic Sciences, the University of Hong Kong, Pokfulam, Hong Kong.,Department of Psychiatry, the University of Hong Kong, Pokfulam, Hong Kong.,State Key Laboratory for Cognitive and Brain Sciences, the University of Hong Kong, Pokfulam, Hong Kong
|
30
|
Li MJ, Li M, Liu Z, Yan B, Pan Z, Huang D, Liang Q, Ying D, Xu F, Yao H, Wang P, Kocher JPA, Xia Z, Sham PC, Liu JS, Wang J. cepip: context-dependent epigenomic weighting for prioritization of regulatory variants and disease-associated genes. Genome Biol 2017; 18:52. [PMID: 28302177 PMCID: PMC5356314 DOI: 10.1186/s13059-017-1177-3] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 07/31/2016] [Accepted: 02/21/2017] [Indexed: 02/06/2023] Open
Abstract
It remains challenging to predict regulatory variants in particular tissues or cell types due to highly context-specific gene regulation. By connecting large-scale epigenomic profiles to expression quantitative trait loci (eQTLs) in a wide range of human tissues/cell types, we identify critical chromatin features that predict variant regulatory potential. We present cepip, a joint likelihood framework, for estimating a variant’s regulatory probability in a context-dependent manner. Our method exhibits significant GWAS signal enrichment and is superior to existing cell type-specific methods. Furthermore, using phenotypically relevant epigenomes to weight the GWAS single-nucleotide polymorphisms, we improve the statistical power of the gene-based association test.
Affiliation(s)
- Mulin Jun Li
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China.,Centre for Genomic Sciences, The University of Hong Kong, Hong Kong SAR, China.,Department of Statistics, Harvard University, Cambridge, Boston, MA, 02138-2901, USA
| | - Miaoxin Li
- Department of Medical Genetics, Center for Genome Research, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, China.,Centre for Genomic Sciences, The University of Hong Kong, Hong Kong SAR, China.,Department of Psychiatry, The University of Hong Kong, Hong Kong SAR, China.,Centre for Reproduction, Development and Growth, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
| | - Zipeng Liu
- Centre for Genomic Sciences, The University of Hong Kong, Hong Kong SAR, China.,Department of Anaesthesiology, The University of Hong Kong, Hong Kong SAR, China
| | - Bin Yan
- Centre for Genomic Sciences, The University of Hong Kong, Hong Kong SAR, China.,School of Biomedical Sciences, The University of Hong Kong, Hong Kong SAR, China
| | - Zhicheng Pan
- Centre for Genomic Sciences, The University of Hong Kong, Hong Kong SAR, China.,Department of Microbiology, Immunology and Molecular Genetics, University of California, Los Angeles, CA, 90095, USA
| | - Dandan Huang
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
| | - Qian Liang
- Department of Pharmacology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
| | - Dingge Ying
- Centre for Genomic Sciences, The University of Hong Kong, Hong Kong SAR, China
| | - Feng Xu
- Centre for Genomic Sciences, The University of Hong Kong, Hong Kong SAR, China.,School of Biomedical Sciences, The University of Hong Kong, Hong Kong SAR, China
| | - Hongcheng Yao
- Centre for Genomic Sciences, The University of Hong Kong, Hong Kong SAR, China.,School of Biomedical Sciences, The University of Hong Kong, Hong Kong SAR, China
| | - Panwen Wang
- Department of Health Sciences Research & Center for Individualized Medicine, Mayo Clinic, Scottsdale, AZ, 85259, USA
| | - Jean-Pierre A Kocher
- Department of Health Sciences Research & Center for Individualized Medicine, Mayo Clinic, Scottsdale, AZ, 85259, USA
| | - Zhengyuan Xia
- Department of Anaesthesiology, The University of Hong Kong, Hong Kong SAR, China
| | - Pak Chung Sham
- Centre for Genomic Sciences, The University of Hong Kong, Hong Kong SAR, China.,Department of Psychiatry, The University of Hong Kong, Hong Kong SAR, China
| | - Jun S Liu
- Department of Statistics, Harvard University, Cambridge, Boston, MA, 02138-2901, USA
| | - Junwen Wang
- Department of Health Sciences Research & Center for Individualized Medicine, Mayo Clinic, Scottsdale, AZ, 85259, USA.,Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, 85259, USA
| |
|