1
|
Karasawa T, Koshikawa S. Evolution of gene regulatory networks in insects. CURRENT OPINION IN INSECT SCIENCE 2025; 69:101365. [PMID: 40348447 DOI: 10.1016/j.cois.2025.101365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2024] [Revised: 10/20/2024] [Accepted: 03/07/2025] [Indexed: 05/14/2025]
Abstract
Changes in gene regulatory networks (GRNs) underlying the evolution of traits have been intensively studied, with insects providing excellent model cases. In studies using Drosophila, butterflies, and other insects, several well-known cases have shown that changes in the cis-regulatory region of a gene controlling a trait can result in the co-option of the gene for a role different from that in its original developmental context. When the expression of a regulatory gene that controls the expression of multiple downstream genes is altered, the expression of these downstream genes changes accordingly, representing the simplest form of GRN co-option. Many studies have explored the applicability of this model to the acquisition of new traits, yielding substantial insights. However, no study has yet comprehensively elucidated the co-option of a GRN or the evolution of a network architecture, including associated genes and their regulatory relationships. In the near future, the use of single-cell multiomics and machine learning will allow for larger-scale data analysis, leading to a better understanding of the evolution of traits through the evolution of GRNs.
Affiliation(s)
- Takumi Karasawa
- Graduate School of Environmental Science, Hokkaido University, N10W5 Kita-ku, Sapporo, Hokkaido 060-0810, Japan
| | - Shigeyuki Koshikawa
- Graduate School of Environmental Science, Hokkaido University, N10W5 Kita-ku, Sapporo, Hokkaido 060-0810, Japan; Faculty of Environmental Earth Science, Hokkaido University, N10W5 Kita-ku, Sapporo, Hokkaido 060-0810, Japan.
| |
|
2
|
Phan MHQ, Zehnder T, Puntieri F, Magg A, Majchrzycka B, Antonović M, Wieler H, Lo BW, Baranasic D, Lenhard B, Müller F, Vingron M, Ibrahim DM. Conservation of regulatory elements with highly diverged sequences across large evolutionary distances. Nat Genet 2025:10.1038/s41588-025-02202-5. [PMID: 40425826 DOI: 10.1038/s41588-025-02202-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/18/2024] [Accepted: 04/22/2025] [Indexed: 05/29/2025]
Abstract
Developmental gene expression is a remarkably conserved process, yet most cis-regulatory elements (CREs) lack sequence conservation, especially at larger evolutionary distances. Some evidence suggests that CREs at the same genomic position remain functionally conserved independent of sequence conservation. However, the extent of such positional conservation remains unclear. Here, we profiled the regulatory genome in mouse and chicken embryonic hearts at equivalent developmental stages and found that most CREs lack sequence conservation. To identify positionally conserved CREs, we introduced the synteny-based algorithm interspecies point projection, which identifies up to fivefold more orthologs than alignment-based approaches. We termed positionally conserved orthologs 'indirectly conserved' and showed that they exhibited chromatin signatures and sequence composition similar to sequence-conserved CREs but greater shuffling of transcription factor binding sites between orthologs. Finally, we validated indirectly conserved chicken enhancers using in vivo reporter assays in mouse. By overcoming alignment-based limitations, we revealed widespread functional conservation of sequence-divergent CREs.
Affiliation(s)
- Mai H Q Phan
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Center for Regenerative Therapies, Berlin, Germany
- Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Tobias Zehnder
- Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Fiona Puntieri
- Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Andreas Magg
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Center for Regenerative Therapies, Berlin, Germany
- Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Blanka Majchrzycka
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Center for Regenerative Therapies, Berlin, Germany
- Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Milan Antonović
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Center for Regenerative Therapies, Berlin, Germany
- Max Planck Institute for Molecular Genetics, Berlin, Germany
- Institute of Chemistry and Biochemistry, Freie Universität Berlin, Berlin, Germany
| | - Hannah Wieler
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Center for Regenerative Therapies, Berlin, Germany
- Max Planck Institute for Molecular Genetics, Berlin, Germany
- Institute of Chemistry and Biochemistry, Freie Universität Berlin, Berlin, Germany
| | - Bai-Wei Lo
- Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Damir Baranasic
- Division of Electronics, Ruder Boskovic Institute, Zagreb, Croatia
- MRC Laboratory of Medical Sciences, London, UK
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital Campus, London, UK
| | - Boris Lenhard
- MRC Laboratory of Medical Sciences, London, UK
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital Campus, London, UK
| | - Ferenc Müller
- Department of Cancer and Genomic Sciences, Birmingham Centre for Genome Biology, School of Medical Sciences, College of Medicine and Health, University of Birmingham, Birmingham, UK
| | - Martin Vingron
- Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Daniel M Ibrahim
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Center for Regenerative Therapies, Berlin, Germany.
- Max Planck Institute for Molecular Genetics, Berlin, Germany.
| |
|
3
|
Jin C, Wang X, Yang J, Kim S, Hudgins AD, Gamliel A, Pei M, Contreras D, Devos M, Guo Q, Vijg J, Conti M, Hoeijmakers J, Campisi J, Lobo R, Williams Z, Rosenfeld MG, Suh Y. Molecular and genetic insights into human ovarian aging from single-nuclei multi-omics analyses. NATURE AGING 2025; 5:275-290. [PMID: 39578560 PMCID: PMC11839473 DOI: 10.1038/s43587-024-00762-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Accepted: 10/25/2024] [Indexed: 11/24/2024]
Abstract
The ovary is the first organ to age in the human body, affecting both fertility and overall health. However, the biological mechanisms underlying human ovarian aging remain poorly understood. Here we present a comprehensive single-nuclei multi-omics atlas of four young (ages 23-29 years) and four reproductively aged (ages 49-54 years) human ovaries. Our analyses reveal coordinated changes in transcriptomes and chromatin accessibilities across cell types in the ovary during aging, notably mTOR signaling being a prominent ovary-specific aging pathway. Cell-type-specific regulatory networks reveal enhanced activity of the transcription factor CEBPD across cell types in the aged ovary. Integration of our multi-omics data with genetic variants associated with age at natural menopause demonstrates a global impact of functional variants on gene regulatory networks across ovarian cell types. We nominate functional non-coding regulatory variants, their target genes and ovarian cell types and regulatory mechanisms. This atlas provides a valuable resource for understanding the cellular, molecular and genetic basis of human ovarian aging.
Affiliation(s)
- Chen Jin
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA.
| | - Xizhe Wang
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
| | - Jiping Yang
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
| | - Seungsoo Kim
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
| | - Adam D Hudgins
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
| | - Amir Gamliel
- Howard Hughes Medical Institute, Department and School of Medicine, University of California San Diego, La Jolla, CA, USA
| | - Mingzhuo Pei
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
| | - Daniela Contreras
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
| | - Melody Devos
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
| | - Qinghua Guo
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
| | - Jan Vijg
- Department of Genetics, Albert Einstein College of Medicine, New York, NY, USA
| | - Marco Conti
- Center for Reproductive Sciences, University of California, San Francisco, San Francisco, CA, USA
- Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, University of California, San Francisco, San Francisco, CA, USA
- Department of Obstetrics and Gynecology and Reproductive Sciences, University of California, San Francisco, San Francisco, CA, USA
| | - Jan Hoeijmakers
- Department of Molecular Genetics, Erasmus University Medical Center Rotterdam, Rotterdam, The Netherlands
- Princess Máxima Center for Pediatric Oncology, Oncode Institute, Utrecht, The Netherlands
- Institute for Genome Stability in Ageing and Disease, Cologne Excellence Cluster for Cellular Stress Responses in Aging-Associated Diseases (CECAD), University Hospital of Cologne, Cologne, Germany
| | - Judith Campisi
- Buck Institute for Research on Aging, Novato, CA, USA
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Rogerio Lobo
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
| | - Zev Williams
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
| | - Michael G Rosenfeld
- Howard Hughes Medical Institute, Department and School of Medicine, University of California San Diego, La Jolla, CA, USA
| | - Yousin Suh
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA.
- Department of Genetics and Development, Columbia University Irving Medical Center, New York, NY, USA.
| |
|
4
|
Yuan J, Dong K, Wu H, Zeng X, Liu X, Liu Y, Dai J, Yin J, Chen Y, Guo Y, Luo W, Liu N, Sun Y, Zhang S, Su B. Single-nucleus multi-omics analyses reveal cellular and molecular innovations in the anterior cingulate cortex during primate evolution. CELL GENOMICS 2024; 4:100703. [PMID: 39631404 DOI: 10.1016/j.xgen.2024.100703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2024] [Revised: 08/17/2024] [Accepted: 11/07/2024] [Indexed: 12/07/2024]
Abstract
The anterior cingulate cortex (ACC) of the human brain is involved in higher-level cognitive functions such as emotion and self-awareness. We generated profiles of human and macaque ACC gene expression and chromatin accessibility at single-nucleus resolution. We characterized the conserved patterns of gene expression, chromatin accessibility, and transcription factor binding in different cell types. Combining the published mouse data, we discovered the molecular identities and cell-lineage origin of the primate von Economo neurons (VENs). Our in vitro and in vivo experiments identified a group of primate-shared and human-specific VEN marker genes, such as PCSK6, ADAMTSL3, and CDHR3, potentially contributing to VEN morphogenesis. We demonstrated that the human-specific sequence changes account for the cellular and functional innovations in the ACC during primate evolution and human origin. These findings provide new insights into understanding the cellular composition and molecular regulation of ACC and its evolutionary role in shaping human-owned higher cognitive skills.
Affiliation(s)
- Jiamiao Yuan
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, P.R. China; Yunnan Key Laboratory of Integrative Anthropology, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650107, China; National Key Laboratory of Genetic Evolution and Animal Model, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; National Resource Center for Non-Human Primates, Kunming Primate Research Center, and National Research Facility for Phenotypic & Genetic Analysis of Model Animals (Primate Facility), Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650107, China
| | - Kangning Dong
- School of Mathematics, Renmin University of China, Beijing 100872, China; NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Haixu Wu
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, P.R. China; Yunnan Key Laboratory of Integrative Anthropology, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650107, China; National Key Laboratory of Genetic Evolution and Animal Model, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; National Resource Center for Non-Human Primates, Kunming Primate Research Center, and National Research Facility for Phenotypic & Genetic Analysis of Model Animals (Primate Facility), Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650107, China; Kunming College of Life Science, University of Chinese Academy of Sciences, Beijing 100101, P.R. China
| | - Xuerui Zeng
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, P.R. China; Yunnan Key Laboratory of Integrative Anthropology, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650107, China; National Key Laboratory of Genetic Evolution and Animal Model, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; National Resource Center for Non-Human Primates, Kunming Primate Research Center, and National Research Facility for Phenotypic & Genetic Analysis of Model Animals (Primate Facility), Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650107, China; Kunming College of Life Science, University of Chinese Academy of Sciences, Beijing 100101, P.R. China
| | - Xingyan Liu
- NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Yan Liu
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, P.R. China; Yunnan Key Laboratory of Integrative Anthropology, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650107, China; National Key Laboratory of Genetic Evolution and Animal Model, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; National Resource Center for Non-Human Primates, Kunming Primate Research Center, and National Research Facility for Phenotypic & Genetic Analysis of Model Animals (Primate Facility), Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650107, China; Kunming College of Life Science, University of Chinese Academy of Sciences, Beijing 100101, P.R. China
| | - Jiapei Dai
- Wuhan Institute for Neuroscience and Neuroengineering, South-Central Minzu University, Wuhan 430074, China; Chinese Brain Bank Center, South-Central Minzu University, Wuhan 430074, China
| | - Jichao Yin
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, P.R. China; Yunnan Key Laboratory of Integrative Anthropology, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650107, China; National Key Laboratory of Genetic Evolution and Animal Model, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; National Resource Center for Non-Human Primates, Kunming Primate Research Center, and National Research Facility for Phenotypic & Genetic Analysis of Model Animals (Primate Facility), Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650107, China; Kunming College of Life Science, University of Chinese Academy of Sciences, Beijing 100101, P.R. China
| | - Yongjie Chen
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, P.R. China; Yunnan Key Laboratory of Integrative Anthropology, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650107, China; National Key Laboratory of Genetic Evolution and Animal Model, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; National Resource Center for Non-Human Primates, Kunming Primate Research Center, and National Research Facility for Phenotypic & Genetic Analysis of Model Animals (Primate Facility), Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650107, China; Kunming College of Life Science, University of Chinese Academy of Sciences, Beijing 100101, P.R. China
| | - Yongbo Guo
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, P.R. China; National Key Laboratory of Genetic Evolution and Animal Model, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China
| | - Wenhao Luo
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, P.R. China; National Key Laboratory of Genetic Evolution and Animal Model, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China
| | - Na Liu
- Wuhan Institute for Neuroscience and Neuroengineering, South-Central Minzu University, Wuhan 430074, China; Chinese Brain Bank Center, South-Central Minzu University, Wuhan 430074, China
| | - Yan Sun
- Wuhan Institute for Neuroscience and Neuroengineering, South-Central Minzu University, Wuhan 430074, China; Chinese Brain Bank Center, South-Central Minzu University, Wuhan 430074, China
| | - Shihua Zhang
- NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China; Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China.
| | - Bing Su
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, P.R. China; Yunnan Key Laboratory of Integrative Anthropology, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650107, China; National Key Laboratory of Genetic Evolution and Animal Model, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; National Resource Center for Non-Human Primates, Kunming Primate Research Center, and National Research Facility for Phenotypic & Genetic Analysis of Model Animals (Primate Facility), Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650107, China; Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China.
| |
|
5
|
Vorontsov IE, Kozin I, Abramov S, Boytsov A, Jolma A, Albu M, Ambrosini G, Faltejskova K, Gralak AJ, Gryzunov N, Inukai S, Kolmykov S, Kravchenko P, Kribelbauer-Swietek JF, Laverty KU, Nozdrin V, Patel ZM, Penzar D, Plescher ML, Pour SE, Razavi R, Yang AWH, Yevshin I, Zinkevich A, Weirauch MT, Bucher P, Deplancke B, Fornes O, Grau J, Grosse I, Kolpakov FA, Makeev VJ, Hughes TR, Kulakovskiy IV. Cross-platform DNA motif discovery and benchmarking to explore binding specificities of poorly studied human transcription factors. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.11.11.619379. [PMID: 39605530 PMCID: PMC11601219 DOI: 10.1101/2024.11.11.619379] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/29/2024]
Abstract
A DNA sequence pattern, or "motif", is an essential representation of DNA-binding specificity of a transcription factor (TF). Any particular motif model has potential flaws due to shortcomings of the underlying experimental data and computational motif discovery algorithm. As a part of the Codebook/GRECO-BIT initiative, here we evaluated at large scale the cross-platform recognition performance of positional weight matrices (PWMs), which remain popular motif models in many practical applications. We applied ten different DNA motif discovery tools to generate PWMs from the "Codebook" data comprised of 4,237 experiments from five different platforms profiling the DNA-binding specificity of 394 human proteins, focusing on understudied transcription factors of different structural families. For many of the proteins, there was no prior knowledge of a genuine motif. By benchmarking-supported human curation, we constructed an approved subset of experiments comprising about 30% of all experiments and 50% of tested TFs which displayed consistent motifs across platforms and replicates. We present the Codebook Motif Explorer (https://mex.autosome.org), a detailed online catalog of DNA motifs, including the top-ranked PWMs, and the underlying source and benchmarking data. We demonstrate that in the case of high-quality experimental data, most of the popular motif discovery tools detect valid motifs and generate PWMs, which perform well both on genomic and synthetic data. Yet, for each of the algorithms, there were problematic combinations of proteins and platforms, and the basic motif properties such as nucleotide composition and information content offered little help in detecting such pitfalls. By combining multiple PWMs in decision trees, we demonstrate how our setup can be readily adapted to train and test binding specificity models more complex than PWMs. Overall, our study provides a rich motif catalog as a solid baseline for advanced models and highlights the power of the multi-platform multi-tool approach for reliable mapping of DNA binding specificities.
Affiliation(s)
- Ilya E Vorontsov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
- Life Improvement by Future Technologies (LIFT) Center, 121205, Moscow, Russia
| | - Ivan Kozin
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia
| | - Sergey Abramov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
- Altius Institute for Biomedical Sciences, 98121, Seattle, WA, USA
| | - Alexandr Boytsov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
- Altius Institute for Biomedical Sciences, 98121, Seattle, WA, USA
| | - Arttu Jolma
- Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
| | - Mihai Albu
- Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
| | | | - Katerina Faltejskova
- Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, 160 00 Praha 6, Czech Republic
- Computer Science Institute, Faculty of Mathematics and Physics, Charles University, 118 00 Praha 1, Czech Republic
| | - Antoni J Gralak
- Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
| | - Nikita Gryzunov
- Life Improvement by Future Technologies (LIFT) Center, 121205, Moscow, Russia
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia
| | - Sachi Inukai
- Chugai Pharmaceutical Co., Ltd, Tokyo, 103-8324, Japan
| | - Semyon Kolmykov
- Department of Computational Biology, Sirius University of Science and Technology, 354340, Sirius, Krasnodar region, Russia
| | | | - Judith F Kribelbauer-Swietek
- Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
| | - Kaitlin U Laverty
- Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
| | - Vladimir Nozdrin
- Life Improvement by Future Technologies (LIFT) Center, 121205, Moscow, Russia
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia
| | - Zain M Patel
- Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
| | - Dmitry Penzar
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
| | - Marie-Luise Plescher
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, 06099, Halle, Germany
| | - Sara E Pour
- Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
| | - Rozita Razavi
- Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
| | - Ally W H Yang
- Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
| | | | - Arsenii Zinkevich
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia
| | | | - Philipp Bucher
- Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
| | - Bart Deplancke
- Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
| | - Oriol Fornes
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC V5Z 4H4, Canada
| | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, 06099, Halle, Germany
| | - Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, 06099, Halle, Germany
| | - Fedor A Kolpakov
- Department of Computational Biology, Sirius University of Science and Technology, 354340, Sirius, Krasnodar region, Russia
- Bioinformatics Laboratory, Federal Research Center for Information and Computational Technologies, 630090, Novosibirsk, Russia
| | - Vsevolod J Makeev
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
- Moscow Center for Advanced Studies, 123592, Moscow, Russia
| | - Timothy R Hughes
- Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
| | - Ivan V Kulakovskiy
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
- Life Improvement by Future Technologies (LIFT) Center, 121205, Moscow, Russia
- Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Russia
| |
|
6
|
Wang S, Wang W. Interpretable prediction of mRNA abundance from promoter sequence using contextual regression models. NAR Genom Bioinform 2024; 6:lqae055. [PMID: 38807713 PMCID: PMC11131020 DOI: 10.1093/nargab/lqae055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 04/08/2024] [Accepted: 05/12/2024] [Indexed: 05/30/2024]
Abstract
While machine learning models have been successfully applied to predicting gene expression from promoter sequences, it remains a great challenge to derive an intuitive interpretation of the model and reveal DNA motif grammar such as motif cooperation and distance constraints between motif sites. Previous interpretation approaches are often time-consuming or have difficulty learning the combinatorial rules. In this work, we designed interpretable neural network models to predict the mRNA expression levels from DNA sequences. By applying the Contextual Regression framework we developed, we extracted weighted features to cluster samples into different groups, which have different gene expression levels. We performed motif analysis in each cluster and found motifs with active or repressive regulation on gene expression. By comparing the co-occurrence locations of discovered motifs, we also uncovered multiple grammars of motif combination including communities of cooperative motifs and distance constraints between motif pairs. These results revealed new insights into the regulatory architecture of promoter sequences.
Affiliation(s)
- Song Wang
- Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA 92093-0359, USA
| | - Wei Wang
- Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA 92093-0359, USA
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA 92093-0359, USA
| |
|
7
|
Lai Y, Ramírez-Pardo I, Isern J, An J, Perdiguero E, Serrano AL, Li J, García-Domínguez E, Segalés J, Guo P, Lukesova V, Andrés E, Zuo J, Yuan Y, Liu C, Viña J, Doménech-Fernández J, Gómez-Cabrera MC, Song Y, Liu L, Xu X, Muñoz-Cánoves P, Esteban MA. Multimodal cell atlas of the ageing human skeletal muscle. Nature 2024; 629:154-164. [PMID: 38649488 PMCID: PMC11062927 DOI: 10.1038/s41586-024-07348-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2022] [Accepted: 03/25/2024] [Indexed: 04/25/2024]
Abstract
Muscle atrophy and functional decline (sarcopenia) are common manifestations of frailty and are critical contributors to morbidity and mortality in older people [1]. Deciphering the molecular mechanisms underlying sarcopenia has major implications for understanding human ageing [2]. Yet, progress has been slow, partly due to the difficulties of characterizing skeletal muscle niche heterogeneity (whereby myofibres are the most abundant) and obtaining well-characterized human samples [3,4]. Here we generate a single-cell/single-nucleus transcriptomic and chromatin accessibility map of human limb skeletal muscles encompassing over 387,000 cells/nuclei from individuals aged 15 to 99 years with distinct fitness and frailty levels. We describe how cell populations change during ageing, including the emergence of new populations in older people, and the cell-specific and multicellular network features (at the transcriptomic and epigenetic levels) associated with these changes. On the basis of cross-comparison with genetic data, we also identify key elements of chromatin architecture that mark susceptibility to sarcopenia. Our study provides a basis for identifying targets in the skeletal muscle that are amenable to medical, pharmacological and lifestyle interventions in late life.
Affiliation(s)
- Yiwei Lai
- BGI Research, Hangzhou, China
- BGI Research, Shenzhen, China
| | - Ignacio Ramírez-Pardo
- Department of Medicine and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Altos Labs, San Diego Institute of Science, San Diego, CA, USA
| | - Joan Isern
- Altos Labs, San Diego Institute of Science, San Diego, CA, USA
| | - Juan An
- BGI Research, Hangzhou, China
- BGI Research, Shenzhen, China
- Laboratory of Integrative Biology, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, China
- School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
| | - Eusebio Perdiguero
- Department of Medicine and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Altos Labs, San Diego Institute of Science, San Diego, CA, USA
| | - Antonio L Serrano
- Department of Medicine and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Altos Labs, San Diego Institute of Science, San Diego, CA, USA
| | - Jinxiu Li
- BGI Research, Hangzhou, China
- BGI Research, Shenzhen, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Esther García-Domínguez
- Freshage Research Group, Department of Physiology, Faculty of Medicine, University of Valencia and CIBERFES, Fundación Investigación Hospital Clínico Universitario/INCLIVA, Valencia, Spain
| | - Jessica Segalés
- Department of Medicine and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Pengcheng Guo
- BGI Research, Hangzhou, China
- BGI Research, Shenzhen, China
- State Key Laboratory for Diagnosis and Treatment of Severe Zoonotic Infectious Diseases, Key Laboratory for Zoonosis Research of the Ministry of Education, Institute of Zoonosis, College of Veterinary Medicine, Jilin University, Jilin, China
| | - Vera Lukesova
- Department of Medicine and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Eva Andrés
- Department of Medicine and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Jing Zuo
- BGI Research, Hangzhou, China
- BGI Research, Shenzhen, China
| | - Yue Yuan
- BGI Research, Hangzhou, China
- BGI Research, Shenzhen, China
| | - Chuanyu Liu
- BGI Research, Hangzhou, China
- BGI Research, Shenzhen, China
| | - José Viña
- Freshage Research Group, Department of Physiology, Faculty of Medicine, University of Valencia and CIBERFES, Fundación Investigación Hospital Clínico Universitario/INCLIVA, Valencia, Spain
| | - Julio Doménech-Fernández
- Servicio de Cirugía Ortopédica y Traumatología, Hospital Arnau de Vilanova y Hospital de Liria and Health Care Department Arnau-Lliria, Valencia, Spain
- Department of Orthopedic Surgery, Clinica Universidad de Navarra, Pamplona, Spain
| | - Mari Carmen Gómez-Cabrera
- Freshage Research Group, Department of Physiology, Faculty of Medicine, University of Valencia and CIBERFES, Fundación Investigación Hospital Clínico Universitario/INCLIVA, Valencia, Spain
| | - Yancheng Song
- Department of Orthopedics, The First Affiliated Hospital of Guangdong Pharmaceutical University, Guangzhou, China
| | - Longqi Liu
- BGI Research, Hangzhou, China
- BGI Research, Shenzhen, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Xun Xu
- BGI Research, Hangzhou, China
- BGI Research, Shenzhen, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Pura Muñoz-Cánoves
- Department of Medicine and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain.
- Altos Labs, San Diego Institute of Science, San Diego, CA, USA.
- ICREA, Barcelona, Spain.
| | - Miguel A Esteban
- BGI Research, Hangzhou, China.
- BGI Research, Shenzhen, China.
- Laboratory of Integrative Biology, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, China.
- State Key Laboratory for Diagnosis and Treatment of Severe Zoonotic Infectious Diseases, Key Laboratory for Zoonosis Research of the Ministry of Education, Institute of Zoonosis, College of Veterinary Medicine, Jilin University, Jilin, China.
- The Fifth Affiliated Hospital of Guangzhou Medical University-BGI Research Center for Integrative Biology, The Fifth Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.
| |
|
8
|
Hu W, Li M, Xiao H, Guan L. Essential genes identification model based on sequence feature map and graph convolutional neural network. BMC Genomics 2024; 25:47. [PMID: 38200437 PMCID: PMC10777564 DOI: 10.1186/s12864-024-09958-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2023] [Accepted: 01/01/2024] [Indexed: 01/12/2024] Open
Abstract
BACKGROUND: Essential genes encode functions that play a vital role in the life activities of organisms, encompassing growth, development, immune system functioning, and cell structure maintenance. Conventional experimental techniques for identifying essential genes are resource-intensive and time-consuming, and the accuracy of current machine learning models needs further enhancement. Therefore, it is crucial to develop a robust computational model to accurately predict essential genes. RESULTS: In this study, we introduce GCNN-SFM, a computational model for identifying essential genes in organisms, based on graph convolutional neural networks (GCNN). GCNN-SFM integrates a graph convolutional layer, a convolutional layer, and a fully connected layer to model and extract features from gene sequences of essential genes. Initially, the gene sequence is transformed into a feature map using coding techniques. Subsequently, a multi-layer GCN is employed to perform graph convolution operations, effectively capturing both local and global features of the gene sequence. Further feature extraction is performed, followed by integrating convolution and fully-connected layers to generate prediction results for essential genes. The gradient descent algorithm is utilized to iteratively update the cross-entropy loss function, thereby enhancing the accuracy of the prediction results. Meanwhile, model parameters are tuned to determine the optimal parameter combination that yields the best prediction performance during training. CONCLUSIONS: Experimental evaluation demonstrates that GCNN-SFM surpasses various advanced essential gene prediction models and achieves an average accuracy of 94.53%. This study presents a novel and effective approach for identifying essential genes, which has significant implications for biology and genomics research.
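The pipeline described in this abstract (sequence to feature map, graph convolution, prediction scored with cross-entropy) can be illustrated with a minimal sketch. This is not the GCNN-SFM implementation: the abstract does not say how the graph is built or which encoding is used, so the one-hot encoding, the chain graph linking adjacent nucleotides, the mean pooling, and all function names below are assumptions made for illustration. The intermediate convolutional layer is omitted, and the graph convolution uses the standard symmetric-normalized propagation rule.

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as an (L, 4) feature map (assumed encoding)."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, index[base]] = 1.0
    return x

def chain_adjacency(n):
    """Adjacency of a path graph linking neighbouring positions (assumed graph)."""
    a = np.zeros((n, n))
    i = np.arange(n - 1)
    a[i, i + 1] = 1.0
    a[i + 1, i] = 1.0
    return a

def gcn_layer(a, h, w):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = a + np.eye(a.shape[0])               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ h @ w, 0.0)

def predict_essential_probability(seq, w_gcn, w_out):
    """Graph convolution, mean pooling, then a logistic output unit."""
    h = gcn_layer(chain_adjacency(len(seq)), one_hot(seq), w_gcn)
    logit = h.mean(axis=0) @ w_out
    return 1.0 / (1.0 + np.exp(-logit))

def binary_cross_entropy(p, y):
    """Loss that gradient descent would minimize during training."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
w_gcn = rng.normal(size=(4, 8))   # untrained, random weights
w_out = rng.normal(size=8)
p = predict_essential_probability("ATGGCGTACGTTAG", w_gcn, w_out)
loss = binary_cross_entropy(p, y=1)
```

With random weights the output probability is meaningless; the sketch only shows how the pieces fit together. Training would adjust `w_gcn` and `w_out` by gradient descent to reduce the cross-entropy loss over labelled sequences.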
Affiliation(s)
- Wenxing Hu
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
| | - Mengshan Li
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China.
| | - Haiyang Xiao
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
| | - Lixin Guan
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
| |
|
9
|
Jiang D, Zhang J. Ascertainment Bias in the Genomic Test of Positive Selection on Regulatory Sequences. Mol Biol Evol 2024; 41:msad284. [PMID: 38149460 PMCID: PMC10766478 DOI: 10.1093/molbev/msad284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2023] [Revised: 11/12/2023] [Accepted: 12/22/2023] [Indexed: 12/28/2023] Open
Abstract
Evolution of gene expression mediated by cis-regulatory changes is thought to be an important contributor to organismal adaptation, but identifying adaptive cis-regulatory changes is challenging due to the difficulty in knowing the expectation under no positive selection. A new approach for detecting positive selection on transcription factor binding sites (TFBSs) was recently developed, thanks to the application of machine learning in predicting transcription factor (TF) binding affinities of DNA sequences. Given a TFBS sequence from a focal species and the corresponding inferred ancestral sequence that differs from the former at n sites, one can predict the TF-binding affinities of many n-step mutational neighbors of the ancestral sequence and obtain a null distribution of the derived binding affinity, which allows testing whether the binding affinity of the real derived sequence deviates significantly from the null distribution. Applying this test genomically to all experimentally identified binding sites of 3 TFs in humans, a recent study reported positive selection for elevated binding affinities of TFBSs. Here, we show that this genomic test suffers from an ascertainment bias because, even in the absence of positive selection for strengthened binding, the binding affinities of known human TFBSs are more likely to have increased than decreased in evolution. We demonstrate by computer simulation that this bias inflates the false positive rate of the selection test. We propose several methods to mitigate the ascertainment bias and show that almost all previously reported positive selection signals disappear when these methods are applied.
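The null-distribution test summarized in this abstract can be sketched as follows. This is a simplified illustration rather than the published procedure: `toy_affinity` is a hypothetical stand-in for a trained machine learning affinity predictor, the motif `TGACTCA` is an arbitrary example, and neighbors are drawn by mutating n distinct sites uniformly at random. The sketch does not correct for the ascertainment bias the paper describes.

```python
import random

BASES = "ACGT"

def toy_affinity(seq, motif="TGACTCA"):
    """Hypothetical affinity score: fraction of positions matching a motif."""
    return sum(a == b for a, b in zip(seq, motif)) / len(motif)

def random_n_step_neighbor(ancestral, n, rng):
    """Change exactly n distinct sites of the ancestral sequence."""
    seq = list(ancestral)
    for i in rng.sample(range(len(seq)), n):
        seq[i] = rng.choice([b for b in BASES if b != seq[i]])
    return "".join(seq)

def selection_test(ancestral, derived, affinity=toy_affinity,
                   n_null=10000, seed=0):
    """Compare the observed affinity change with a null of random n-step neighbors."""
    if len(ancestral) != len(derived):
        raise ValueError("sequences must have equal length")
    n = sum(a != b for a, b in zip(ancestral, derived))
    if n == 0:
        raise ValueError("sequences are identical; nothing to test")
    rng = random.Random(seed)
    base = affinity(ancestral)
    observed = affinity(derived) - base
    null = [affinity(random_n_step_neighbor(ancestral, n, rng)) - base
            for _ in range(n_null)]
    # One-sided empirical p-value for increased affinity (+1 correction).
    p_value = (1 + sum(d >= observed for d in null)) / (n_null + 1)
    return n, observed, p_value

n_sites, delta, p_value = selection_test("AGACTAA", "TGACTCA")
```

A small p-value here only says that the derived sequence binds more strongly than most random n-step neighbors. As the abstract argues, sites ascertained because they bind in the focal species are expected to show this pattern even without positive selection, so the result should not be read as evidence of selection on its own.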
Affiliation(s)
- Daohan Jiang
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
- Present address: Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
| | - Jianzhi Zhang
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
| |
|
10
|
Zhao Z, D’Oliveira Albanus R, Taylor H, Tang X, Han Y, Orchard P, Varshney A, Zhang T, Manickam N, Erdos M, Narisu N, Taylor L, Saavedra X, Zhong A, Li B, Zhou T, Naji A, Liu C, Collins F, Parker SCJ, Chen S. An integrative single-cell multi-omics profiling of human pancreatic islets identifies T1D associated genes and regulatory signals. Research Square 2023:rs.3.rs-3343318. [PMID: 37886586 PMCID: PMC10602166 DOI: 10.21203/rs.3.rs-3343318/v1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/28/2023]
Abstract
Genome wide association studies (GWAS) have identified over 100 signals associated with type 1 diabetes (T1D). However, translating any given T1D GWAS signal into mechanistic insights, including putative causal variants and the context (cell type and cell state) in which they function, has been limited. Here, we present a comprehensive multi-omic integrative analysis of single-cell/nucleus resolution profiles of gene expression and chromatin accessibility in healthy and autoantibody+ (AAB+) human islets, as well as islets under multiple T1D stimulatory conditions. We broadly nominate effector cell types for all T1D GWAS signals. We further nominated higher-resolution contexts, including effector cell types, regulatory elements, and genes for three independent T1D risk variants acting through islet cells within the pancreas at the DLK1/MEG3, RASGRP1, and TOX loci. Subsequently, we created isogenic gene knockouts DLK1-/-, RASGRP1-/-, and TOX-/-, and the corresponding regulatory region knockout, RASGRP1Δ, and DLK1Δ hESCs. Loss of RASGRP1 or DLK1, as well as knockout of the regulatory region of RASGRP1 or DLK1, increased β cell apoptosis. Additionally, pancreatic β cells derived from isogenic hESCs carrying the risk allele of rs3783355A/A exhibited increased β cell death. Finally, RNA-seq and ATAC-seq identified five genes upregulated in both RASGRP1-/- and DLK1-/- β-like cells, four of which are associated with T1D. Together, this work reports an integrative approach for combining single cell multi-omics, GWAS, and isogenic hESC-derived β-like cells to prioritize the T1D associated signals and their underlying context-specific cell types, genes, SNPs, and regulatory elements, to illuminate biological functions and molecular mechanisms.
Affiliation(s)
- Zeping Zhao
- Department of Surgery, Weill Cornell Medicine, 1300 York Ave, New York, NY, 10065, USA
- Center for Genomic Health, Weill Cornell Medicine, 1300 York Ave, New York, NY 10065, USA
| | | | - Henry Taylor
- Center for Precision Health Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Xuming Tang
- Department of Surgery, Weill Cornell Medicine, 1300 York Ave, New York, NY, 10065, USA
- Center for Genomic Health, Weill Cornell Medicine, 1300 York Ave, New York, NY 10065, USA
| | - Yuling Han
- Department of Surgery, Weill Cornell Medicine, 1300 York Ave, New York, NY, 10065, USA
- Center for Genomic Health, Weill Cornell Medicine, 1300 York Ave, New York, NY 10065, USA
| | - Peter Orchard
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Arushi Varshney
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Tuo Zhang
- Stem Cell Research Facility, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065, USA
| | - Nandini Manickam
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Mike Erdos
- Center for Precision Health Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Narisu Narisu
- Center for Precision Health Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Leland Taylor
- Center for Precision Health Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Xiaxia Saavedra
- Department of Surgery, Weill Cornell Medicine, 1300 York Ave, New York, NY, 10065, USA
| | - Aaron Zhong
- Genomic Resource Core Facility, Weill Cornell Medical College, NY 10065, USA
| | - Bo Li
- Department of Surgery, Weill Cornell Medicine, 1300 York Ave, New York, NY, 10065, USA
| | - Ting Zhou
- Genomic Resource Core Facility, Weill Cornell Medical College, NY 10065, USA
| | - Ali Naji
- Department of Surgery, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA
| | - Chengyang Liu
- Department of Surgery, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA
| | - Francis Collins
- Center for Precision Health Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Stephen CJ Parker
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
| | - Shuibing Chen
- Department of Surgery, Weill Cornell Medicine, 1300 York Ave, New York, NY, 10065, USA
- Center for Genomic Health, Weill Cornell Medicine, 1300 York Ave, New York, NY 10065, USA
| |
|
11
|
Gaulton KJ, Preissl S, Ren B. Interpreting non-coding disease-associated human variants using single-cell epigenomics. Nat Rev Genet 2023; 24:516-534. [PMID: 37161089 PMCID: PMC10629587 DOI: 10.1038/s41576-023-00598-6] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Accepted: 03/27/2023] [Indexed: 05/11/2023]
Abstract
Genome-wide association studies (GWAS) have linked hundreds of thousands of sequence variants in the human genome to common traits and diseases. However, translating this knowledge into a mechanistic understanding of disease-relevant biology remains challenging, largely because such variants are predominantly in non-protein-coding sequences that still lack functional annotation at cell-type resolution. Recent advances in single-cell epigenomics assays have enabled the generation of cell type-, subtype- and state-resolved maps of the epigenome in heterogeneous human tissues. These maps have facilitated cell type-specific annotation of candidate cis-regulatory elements and their gene targets in the human genome, enhancing our ability to interpret the genetic basis of common traits and diseases.
Affiliation(s)
- Kyle J Gaulton
- Department of Paediatrics, Paediatric Diabetes Research Center, University of California San Diego School of Medicine, La Jolla, CA, USA.
| | - Sebastian Preissl
- Center for Epigenomics, University of California San Diego School of Medicine, La Jolla, CA, USA.
- Institute of Experimental and Clinical Pharmacology and Toxicology, Faculty of Medicine, University of Freiburg, Freiburg, Germany.
| | - Bing Ren
- Center for Epigenomics, University of California San Diego School of Medicine, La Jolla, CA, USA.
- Department of Cellular and Molecular Medicine, University of California San Diego School of Medicine, La Jolla, CA, USA.
- Ludwig Institute for Cancer Research, La Jolla, CA, USA.
| |
|
12
|
Ober-Reynolds B, Wang C, Ko JM, Rios EJ, Aasi SZ, Davis MM, Oro AE, Greenleaf WJ. Integrated single-cell chromatin and transcriptomic analyses of human scalp identify gene-regulatory programs and critical cell types for hair and skin diseases. Nat Genet 2023; 55:1288-1300. [PMID: 37500727 PMCID: PMC11190942 DOI: 10.1038/s41588-023-01445-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 09/01/2022] [Accepted: 06/17/2023] [Indexed: 07/29/2023]
Abstract
Genome-wide association studies have identified many loci associated with hair and skin disease, but identification of causal variants requires deciphering of gene-regulatory networks in relevant cell types. We generated matched single-cell chromatin profiles and transcriptomes from scalp tissue from healthy controls and patients with alopecia areata, identifying diverse cell types of the hair follicle niche. By interrogating these datasets at multiple levels of cellular resolution, we infer 50-100% more enhancer-gene links than previous approaches and show that aggregate enhancer accessibility for highly regulated genes predicts expression. We use these gene-regulatory maps to prioritize cell types, genes and causal variants implicated in the pathobiology of androgenetic alopecia (AGA), eczema and other complex traits. AGA genome-wide association study signals are enriched in dermal papilla regulatory regions, supporting the role of these cells as drivers of AGA pathogenesis. Finally, we train machine learning models to nominate single-nucleotide polymorphisms that affect gene expression through disruption of transcription factor binding, predicting candidate functional single-nucleotide polymorphisms for AGA and eczema.
Affiliation(s)
| | - Chen Wang
- Department of Dermatology, School of Medicine, Stanford University, Stanford, CA, USA
- Division of Dermatology, Department of Medicine, Santa Clara Valley Medical Center, San Jose, CA, USA
- Institute of Immunity, Transplantation and Infection, School of Medicine, Stanford University, Stanford, CA, USA
| | - Justin M Ko
- Department of Dermatology, School of Medicine, Stanford University, Stanford, CA, USA
| | - Eon J Rios
- Department of Dermatology, School of Medicine, Stanford University, Stanford, CA, USA
- Division of Dermatology, Department of Medicine, Santa Clara Valley Medical Center, San Jose, CA, USA
| | - Sumaira Z Aasi
- Department of Dermatology, School of Medicine, Stanford University, Stanford, CA, USA
| | - Mark M Davis
- Institute of Immunity, Transplantation and Infection, School of Medicine, Stanford University, Stanford, CA, USA
- Department of Microbiology and Immunology, School of Medicine, Stanford University, Stanford, CA, USA
- Howard Hughes Medical Institute, School of Medicine, Stanford University, Stanford, CA, USA
| | - Anthony E Oro
- Department of Dermatology, School of Medicine, Stanford University, Stanford, CA, USA
- Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA
| | - William J Greenleaf
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA.
- Department of Applied Physics, Stanford University, Stanford, CA, USA.
- Chan Zuckerberg Biohub, San Francisco, CA, USA.
| |
|
13
|
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023] Open
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
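One classical way to represent and locate a TFBS motif, of the kind such surveys cover, is a position weight matrix (PWM) scanned along a sequence. The sketch below is a generic illustration and is not taken from the survey: the example binding sites, the uniform background of 0.25, the pseudocount of 1, and the function names are all assumptions.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned binding sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background)
                    for b in BASES})
    return pwm

def scan(seq, pwm):
    """Score every window of the sequence; returns (offset, score) pairs."""
    width = len(pwm)
    return [(i, sum(pwm[j][seq[i + j]] for j in range(width)))
            for i in range(len(seq) - width + 1)]

def best_hit(seq, pwm):
    """Return the highest-scoring (offset, score) pair."""
    return max(scan(seq, pwm), key=lambda hit: hit[1])

# Hypothetical aligned sites used only to illustrate the calculation.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_pwm(sites)
offset, score = best_hit("CCCTGACTCAGGG", pwm)
```

In this example the best window starts at offset 3, where the sequence contains the consensus `TGACTCA`. Motif discovery methods go further by estimating the matrix itself from unaligned sequences, which this sketch does not attempt.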
Affiliation(s)
- Manuel Tognon
- Computer Science Department, University of Verona, Verona, Italy
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Rosalba Giugno
- Computer Science Department, University of Verona, Verona, Italy
| | - Luca Pinello
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America
| |
|
14
|
Smith GD, Ching WH, Cornejo-Páramo P, Wong ES. Decoding enhancer complexity with machine learning and high-throughput discovery. Genome Biol 2023; 24:116. [PMID: 37173718 PMCID: PMC10176946 DOI: 10.1186/s13059-023-02955-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 07/18/2022] [Accepted: 04/28/2023] [Indexed: 05/15/2023] Open
Abstract
Enhancers are genomic DNA elements controlling spatiotemporal gene expression. Their flexible organization and functional redundancies make deciphering their sequence-function relationships challenging. This article provides an overview of the current understanding of enhancer organization and evolution, with an emphasis on factors that influence these relationships. Technological advancements, particularly in machine learning and synthetic biology, are discussed in light of how they provide new ways to understand this complexity. Exciting opportunities lie ahead as we continue to unravel the intricacies of enhancer function.
Affiliation(s)
- Gabrielle D Smith
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
| | - Wan Hern Ching
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
| | - Paola Cornejo-Páramo
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
| | - Emily S Wong
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia.
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia.
| |
|
15
|
Kshirsagar M, Yuan H, Ferres JL, Leslie C. BindVAE: Dirichlet variational autoencoders for de novo motif discovery from accessible chromatin. Genome Biol 2022; 23:174. [PMID: 35971180 PMCID: PMC9380350 DOI: 10.1186/s13059-022-02723-w] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 09/07/2021] [Accepted: 06/28/2022] [Indexed: 11/10/2022] Open
Abstract
We present a novel unsupervised deep learning approach called BindVAE, based on Dirichlet variational autoencoders, for jointly decoding multiple TF binding signals from open chromatin regions. BindVAE can disentangle an input DNA sequence into distinct latent factors that encode cell-type specific in vivo binding signals for individual TFs, composite patterns for TFs involved in cooperative binding, and genomic context surrounding the binding sites. On the task of retrieving the motifs of expressed TFs in a given cell type, BindVAE is competitive with existing motif discovery approaches.
Affiliation(s)
| | - Han Yuan
- Calico Life Sciences, South San Francisco, CA, USA
| | | | | |
|
16
|
Turner AW, Hu SS, Mosquera JV, Ma WF, Hodonsky CJ, Wong D, Auguste G, Song Y, Sol-Church K, Farber E, Kundu S, Kundaje A, Lopez NG, Ma L, Ghosh SKB, Onengut-Gumuscu S, Ashley EA, Quertermous T, Finn AV, Leeper NJ, Kovacic JC, Björkegren JLM, Zang C, Miller CL. Single-nucleus chromatin accessibility profiling highlights regulatory mechanisms of coronary artery disease risk. Nat Genet 2022; 54:804-816. [PMID: 35590109 PMCID: PMC9203933 DOI: 10.1038/s41588-022-01069-0] [Citation(s) in RCA: 42] [Impact Index Per Article: 14.0] [Received: 06/04/2021] [Accepted: 03/31/2022] [Indexed: 12/24/2022]
Abstract
Coronary artery disease (CAD) is a complex inflammatory disease involving genetic influences across cell types. Genome-wide association studies have identified over 200 loci associated with CAD, where the majority of risk variants reside in noncoding DNA sequences impacting cis-regulatory elements. Here, we applied single-nucleus assay for transposase-accessible chromatin with sequencing to profile 28,316 nuclei across coronary artery segments from 41 patients with varying stages of CAD, which revealed 14 distinct cellular clusters. We mapped ~320,000 accessible sites across all cells, identified cell-type-specific elements and transcription factors, and prioritized functional CAD risk variants. We identified elements in smooth muscle cell transition states (for example, fibromyocytes) and functional variants predicted to alter smooth muscle cell- and macrophage-specific regulation of MRAS (3q22) and LIPA (10q23), respectively. We further nominated key driver transcription factors such as PRDM16 and TBX2. Together, this single-nucleus atlas provides a critical step towards interpreting regulatory mechanisms across the continuum of CAD risk.
Affiliation(s)
- Adam W Turner
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
| | - Shengen Shawn Hu
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
| | - Jose Verdezoto Mosquera
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA
| | - Wei Feng Ma
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- Medical Scientist Training Program, University of Virginia, Charlottesville, VA, USA
- Department of Pathology, University of Virginia, Charlottesville, VA, USA
| | - Chani J Hodonsky
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- Robert M. Berne Cardiovascular Research Center, University of Virginia, Charlottesville, VA, USA
| | - Doris Wong
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA
- Robert M. Berne Cardiovascular Research Center, University of Virginia, Charlottesville, VA, USA
| | - Gaëlle Auguste
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
| | - Yipei Song
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
| | - Katia Sol-Church
- Department of Pathology, University of Virginia, Charlottesville, VA, USA
- Genome Analysis & Technology Core, University of Virginia, Charlottesville, VA, USA
| | - Emily Farber
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- Genome Sciences Laboratory, University of Virginia, Charlottesville, VA, USA
| | - Soumya Kundu
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Anshul Kundaje
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Nicolas G Lopez
- Division of Vascular Surgery, Department of Surgery, Stanford University, Stanford, CA, USA
| | - Lijiang Ma
- Department of Genetics and Genomic Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | | | - Suna Onengut-Gumuscu
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- Genome Sciences Laboratory, University of Virginia, Charlottesville, VA, USA
- Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA
| | - Euan A Ashley
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Division of Cardiovascular Medicine, Department of Medicine, Stanford University, Stanford, CA, USA
| | - Thomas Quertermous
- Division of Cardiovascular Medicine, Department of Medicine, Stanford University, Stanford, CA, USA
| | | | - Nicholas J Leeper
- Division of Vascular Surgery, Department of Surgery, Stanford University, Stanford, CA, USA
| | - Jason C Kovacic
- Cardiovascular Research Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Victor Chang Cardiac Research Institute, Darlinghurst, New South Wales, Australia
- St. Vincent's Clinical School, University of New South Wales, Sydney, New South Wales, Australia
| | - Johan L M Björkegren
- Department of Genetics and Genomic Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Integrated Cardio Metabolic Centre, Department of Medicine, Karolinska Institutet, Huddinge, Sweden
| | - Chongzhi Zang
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA
- Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA
| | - Clint L Miller
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA
- Robert M. Berne Cardiovascular Research Center, University of Virginia, Charlottesville, VA, USA
- Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA
| |
|
17
|
Lawler AJ, Ramamurthy E, Brown AR, Shin N, Kim Y, Toong N, Kaplow IM, Wirthlin M, Zhang X, Phan BN, Fox GA, Wade K, He J, Ozturk BE, Byrne LC, Stauffer WR, Fish KN, Pfenning AR. Machine learning sequence prioritization for cell type-specific enhancer design. eLife 2022; 11:e69571. [PMID: 35576146 PMCID: PMC9110026 DOI: 10.7554/elife.69571] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/20/2021] [Accepted: 04/25/2022] [Indexed: 11/22/2022]
Abstract
Recent discoveries of extreme cellular diversity in the brain warrant rapid development of technologies to access specific cell populations within heterogeneous tissue. Available approaches for engineering targeted technologies for new neuron subtypes are low yield, involving intensive transgenic strain or virus screening. Here, we present Specific Nuclear-Anchored Independent Labeling (SNAIL), an improved virus-based strategy for cell labeling and nuclear isolation from heterogeneous tissue. SNAIL works by leveraging machine learning and other computational approaches to identify DNA sequence features that confer cell type-specific gene activation and then making a probe that drives an affinity purification-compatible reporter gene. As a proof of concept, we designed and validated two novel SNAIL probes that target parvalbumin-expressing (PV+) neurons. Nuclear isolation using SNAIL in wild-type mice is sufficient to capture characteristic open chromatin features of PV+ neurons in the cortex, striatum, and external globus pallidus. The SNAIL framework also has high utility for multispecies cell probe engineering; expression from a mouse PV+ SNAIL enhancer sequence was enriched in PV+ neurons of the macaque cortex. Expansion of this technology has broad applications in cell type-specific observation, manipulation, and therapeutics across species and disease models.
Affiliation(s)
- Alyssa J Lawler
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Biological Sciences Department, Mellon College of Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Easwaran Ramamurthy
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Ashley R Brown
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Naomi Shin
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Yeonju Kim
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Noelle Toong
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Irene M Kaplow
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Morgan Wirthlin
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Xiaoyu Zhang
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - BaDoi N Phan
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
- Medical Scientist Training Program, University of Pittsburgh, Pittsburgh, United States
| | - Grant A Fox
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Kirsten Wade
- Department of Psychiatry, Translational Neuroscience Program, University of Pittsburgh, Pittsburgh, United States
| | - Jing He
- Department of Neurobiology, University of Pittsburgh, Pittsburgh, United States
- Systems Neuroscience Center, Brain Institute, Center for Neuroscience, Center for the Neural Basis of Cognition, Pittsburgh, United States
| | - Bilge Esin Ozturk
- Department of Ophthalmology, University of Pittsburgh, Pittsburgh, United States
| | - Leah C Byrne
- Department of Neurobiology, University of Pittsburgh, Pittsburgh, United States
- Department of Ophthalmology, University of Pittsburgh, Pittsburgh, United States
- Division of Experimental Retinal Therapies, Department of Clinical Sciences & Advanced Medicine, School of Veterinary Medicine, University of Pennsylvania, Philadelphia, United States
- Department of Bioengineering, University of Pittsburgh, Pittsburgh, United States
| | - William R Stauffer
- Department of Neurobiology, University of Pittsburgh, Pittsburgh, United States
| | - Kenneth N Fish
- Department of Psychiatry, Translational Neuroscience Program, University of Pittsburgh, Pittsburgh, United States
| | - Andreas R Pfenning
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| |
|
18
|
Peng L, Tan J, Tian X, Zhou L. EnANNDeep: An Ensemble-based lncRNA-protein Interaction Prediction Framework with Adaptive k-Nearest Neighbor Classifier and Deep Models. Interdiscip Sci 2022; 14:209-232. [PMID: 35006529 DOI: 10.1007/s12539-021-00483-y] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 07/06/2021] [Revised: 09/14/2021] [Accepted: 09/15/2021] [Indexed: 01/08/2023]
Abstract
Prediction of lncRNA-protein interactions (LPIs) can deepen the understanding of many important biological processes. Artificial intelligence methods have reported many possible LPIs. However, most computational techniques were evaluated mainly on one dataset, which may produce prediction bias. More importantly, they were validated only under cross validation on lncRNA-protein pairs and did not consider performance under cross validation on lncRNAs and on proteins, and thus fail to find related proteins/lncRNAs for a new lncRNA/protein. Using an ensemble learning framework (EnANNDeep) composed of an adaptive k-nearest neighbor classifier and deep models, this study focuses on systematically finding underlying linkages between lncRNAs and proteins. First, five LPI-related datasets are arranged. Second, multiple source features are integrated to depict an lncRNA-protein pair. Third, an adaptive k-nearest neighbor classifier, a deep neural network, and a deep forest are each designed to score unknown lncRNA-protein pairs. Finally, interaction probabilities from the three predictors are integrated based on a soft voting technique. In comparison with five classical LPI identification models (SFPEL, PMDKN, CatBoost, PLIPCOM, and LPI-SKF) under fivefold cross validations on lncRNAs, proteins, and LPIs, EnANNDeep computes the best average AUCs of 0.8660, 0.8775, and 0.9166, respectively, and the best average AUPRs of 0.8545, 0.8595, and 0.9054, respectively, indicating its superior LPI prediction ability. Case study analyses indicate that SNHG10 may have dense linkage with Q15717. In the ensemble framework, the adaptive k-nearest neighbor classifier can separately pick the most appropriate k for each query lncRNA-protein pair. More importantly, deep models including the deep neural network and deep forest can effectively learn the representative features of lncRNAs and proteins.
Affiliation(s)
- Lihong Peng
- School of Computer Science, Hunan University of Technology, Zhuzhou, China
- College of Life Sciences and Chemistry, Hunan University of Technology, Zhuzhou, China
| | - Jingwei Tan
- School of Computer Science, Hunan University of Technology, Zhuzhou, China
| | - Xiongfei Tian
- School of Computer Science, Hunan University of Technology, Zhuzhou, China
| | - Liqian Zhou
- School of Computer Science, Hunan University of Technology, Zhuzhou, China
| |
|
19
|
Orchard P, Manickam N, Ventresca C, Vadlamudi S, Varshney A, Rai V, Kaplan J, Lalancette C, Mohlke KL, Gallagher K, Burant CF, Parker SCJ. Human and rat skeletal muscle single-nuclei multi-omic integrative analyses nominate causal cell types, regulatory elements, and SNPs for complex traits. Genome Res 2021; 31:2258-2275. [PMID: 34815310 PMCID: PMC8647829 DOI: 10.1101/gr.268482.120] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Received: 07/08/2020] [Accepted: 09/16/2021] [Indexed: 12/12/2022]
Abstract
Skeletal muscle accounts for the largest proportion of human body mass, on average, and is a key tissue in complex diseases and mobility. It is composed of several different cell and muscle fiber types. Here, we optimize single-nucleus ATAC-seq (snATAC-seq) to map skeletal muscle cell-specific chromatin accessibility landscapes in frozen human and rat samples, and single-nucleus RNA-seq (snRNA-seq) to map cell-specific transcriptomes in human. We additionally perform multi-omics profiling (gene expression and chromatin accessibility) on human and rat muscle samples. We capture type I and type II muscle fiber signatures, which are generally missed by existing single-cell RNA-seq methods. We perform cross-modality and cross-species integrative analyses on 33,862 nuclei and identify seven cell types ranging in abundance from 59.6% to 1.0% of all nuclei. We introduce a regression-based approach to infer cell types by comparing transcription start site-distal ATAC-seq peaks to reference enhancer maps and show consistency with RNA-based marker gene cell type assignments. We find heterogeneity in enrichment of genetic variants linked to complex phenotypes from the UK Biobank and diabetes genome-wide association studies in cell-specific ATAC-seq peaks, with the most striking enrichment patterns in muscle mesenchymal stem cells (∼3.5% of nuclei). Finally, we overlay these chromatin accessibility maps on GWAS data to nominate causal cell types, SNPs, transcription factor motifs, and target genes for type 2 diabetes signals. These chromatin accessibility profiles for human and rat skeletal muscle cell types are a useful resource for nominating causal GWAS SNPs and cell types.
Affiliation(s)
- Peter Orchard
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
| | - Nandini Manickam
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
| | - Christa Ventresca
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Department of Human Genetics, University of Michigan, Ann Arbor, Michigan 48109, USA
| | - Swarooparani Vadlamudi
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
| | - Arushi Varshney
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
| | - Vivek Rai
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
| | - Jeremy Kaplan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
| | - Claudia Lalancette
- Epigenomics Core, University of Michigan, Ann Arbor, Michigan 48109, USA
| | - Karen L Mohlke
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
| | - Katherine Gallagher
- Department of Surgery, University of Michigan, Ann Arbor, Michigan 48109, USA
- Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan 48109, USA
| | - Charles F Burant
- Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan 48109, USA
| | - Stephen C J Parker
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Department of Human Genetics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, USA
| |
|
20
|
Sun Y, Li H, Zheng L, Li J, Hong Y, Liang P, Kwok LY, Zuo Y, Zhang W, Zhang H. iProbiotics: a machine learning platform for rapid identification of probiotic properties from whole-genome primary sequences. Brief Bioinform 2021; 23:6444315. [PMID: 34849572 DOI: 10.1093/bib/bbab477] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 08/12/2021] [Revised: 09/28/2021] [Accepted: 10/15/2021] [Indexed: 12/13/2022]
Abstract
Lactic acid bacteria consortia are commonly present in food, and some of these bacteria possess probiotic properties. However, discovery and experimental validation of probiotics require extensive time and effort. Therefore, it is of great interest to develop effective screening methods for identifying probiotics. Advances in sequencing technology have generated massive genomic data, enabling us to create a machine learning-based platform for this purpose. This study first selected a comprehensive probiotics genome dataset from the probiotic database (PROBIO) and literature surveys. Then, k-mer (from 2 to 8) compositional analysis was performed, revealing diverse oligonucleotide composition in strain genomes and apparently more probiotic (P-) features in probiotic genomes than in non-probiotic genomes. To reduce noise and improve computational efficiency, 87 376 k-mers were refined by an incremental feature selection (IFS) method, and the model achieved the maximum accuracy level at 184 core features, with a high prediction accuracy (97.77%) and area under the curve (98.00%). Functional genomic analysis using annotations from gene ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and Rapid Annotation using Subsystem Technology (RAST) databases, as well as analysis of genes associated with host gastrointestinal survival/settlement, carbohydrate utilization, drug resistance and virulence factors, revealed that the distribution of P-features was biased toward genes/pathways related to probiotic function. Our results suggest that the role of probiotics is not determined by a single gene, but by a combination of k-mer genomic components, providing new insights into the identification and underlying mechanisms of probiotics. This work created a novel and free online bioinformatic tool, iProbiotics, which would facilitate rapid screening for probiotics.
Affiliation(s)
- Yu Sun
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of life sciences, Inner Mongolia University, Hohhot 010070, China
| | - Haicheng Li
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of life sciences, Inner Mongolia University, Hohhot 010070, China
| | - Lei Zheng
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of life sciences, Inner Mongolia University, Hohhot 010070, China
| | - Jinzhao Li
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of life sciences, Inner Mongolia University, Hohhot 010070, China
| | - Yan Hong
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of life sciences, Inner Mongolia University, Hohhot 010070, China
| | - Pengfei Liang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of life sciences, Inner Mongolia University, Hohhot 010070, China
| | - Lai-Yu Kwok
- Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot 010018, China
| | - Yongchun Zuo
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of life sciences, Inner Mongolia University, Hohhot 010070, China
| | - Wenyi Zhang
- Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot 010018, China
| | - Heping Zhang
- Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot 010018, China
| |
|
21
|
Benner P, Vingron M. Quantifying the tissue-specific regulatory information within enhancer DNA sequences. NAR Genom Bioinform 2021; 3:lqab095. [PMID: 34729474 PMCID: PMC8557370 DOI: 10.1093/nargab/lqab095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2021] [Revised: 09/23/2021] [Accepted: 09/28/2021] [Indexed: 12/04/2022]
Abstract
Recent efforts to measure epigenetic marks across a wide variety of different cell types and tissues provide insights into the cell type-specific regulatory landscape. We use these data to study whether there exists a correlate of epigenetic signals in the DNA sequence of enhancers and explore with computational methods to what degree such sequence patterns can be used to predict cell type-specific regulatory activity. By constructing classifiers that predict in which tissues enhancers are active, we are able to identify sequence features that might be recognized by the cell in order to regulate gene expression. While classification performances vary greatly between tissues, we show examples where our classifiers correctly predict tissue-specific regulation from sequence alone. We also show that many of the informative patterns indeed harbor transcription factor footprints.
Affiliation(s)
- Philipp Benner
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
| | - Martin Vingron
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
| |
|
22
|
Patel ZM, Hughes TR. Global properties of regulatory sequences are predicted by transcription factor recognition mechanisms. Genome Biol 2021; 22:285. [PMID: 34620190 PMCID: PMC8496038 DOI: 10.1186/s13059-021-02503-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/31/2020] [Accepted: 09/16/2021] [Indexed: 01/07/2023]
Abstract
Background: Mammalian genomes contain millions of putative regulatory sequences, which are delineated by binding of multiple transcription factors. The degree to which spacing and orientation constraints among transcription factor binding sites contribute to the recognition and identity of regulatory sequence is an unresolved but important question that impacts our understanding of genome function and evolution. Global mechanisms that underlie phenomena including the size of regulatory sequences, their uniqueness, and their evolutionary turnover remain poorly described.
Results: Here, we ask whether models incorporating different degrees of spacing and orientation constraints among transcription factor binding sites are broadly consistent with several global properties of regulatory sequence. These properties include length, sequence diversity, turnover rate, and dominance of specific TFs in regulatory site identity and cell type specification. Models with and without spacing and orientation constraints are generally consistent with all observed properties of regulatory sequence, and with regulatory sequences being fundamentally small (~1 nucleosome). Uniqueness of regulatory regions and their rapid evolutionary turnover are expected under all models examined. An intriguing issue we identify is that the complexity of eukaryotic regulatory sites must scale with the number of active transcription factors, in order to accomplish observed specificity.
Conclusions: Models of transcription factor binding with or without spacing and orientation constraints predict that regulatory sequences should be fundamentally short, unique, and turn over rapidly. We posit that the existence of master regulators may be, in part, a consequence of evolutionary pressure to limit the complexity and increase evolvability of regulatory sites.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-021-02503-y.
Affiliation(s)
- Zain M Patel
- Donnelly Centre for Cellular and Biomolecular Research and Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 3E1, Canada
| | - Timothy R Hughes
- Donnelly Centre for Cellular and Biomolecular Research and Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 3E1, Canada
| |
|
23
|
Karollus A, Avsec Ž, Gagneur J. Predicting mean ribosome load for 5'UTR of any length using deep learning. PLoS Comput Biol 2021; 17:e1008982. [PMID: 33970899 PMCID: PMC8136849 DOI: 10.1371/journal.pcbi.1008982] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/14/2020] [Revised: 05/20/2021] [Accepted: 04/19/2021] [Indexed: 01/07/2023]
Abstract
The 5’ untranslated region plays a key role in regulating mRNA translation and consequently protein abundance. Therefore, accurate modeling of 5’UTR regulatory sequences shall provide insights into translational control mechanisms and help interpret genetic variants. Recently, a model was trained on a massively parallel reporter assay to predict mean ribosome load (MRL), a proxy for translation rate, directly from 5’UTR sequence with a high degree of accuracy. However, this model is restricted to sequence lengths investigated in the reporter assay and therefore cannot be applied to the majority of human sequences without a substantial loss of information. Here, we introduced frame pooling, a novel neural network operation that enabled the development of an MRL prediction model for 5’UTRs of any length. Our model shows state-of-the-art performance on fixed length randomized sequences, while offering better generalization performance on longer sequences and on a variety of translation-related genome-wide datasets. Variant interpretation is demonstrated on a 5’UTR variant of the gene HBB associated with beta-thalassemia. Frame pooling could find applications in other bioinformatics predictive tasks. Moreover, our model, released open source, could help pinpoint pathogenic genetic variants.
The human genome carries a complex code. It consists of genes, which provide blueprints to assemble proteins, and regulatory elements, which control when, where, and how often particular genes are transcribed and translated into protein. To read the genome correctly and specifically to find the causes of inherited diseases, we need to be able to find and interpret these regulatory elements. Here, we focus on particular regions of the genome, the so-called 5’ untranslated regions, which play an important role in determining how often a transcribed gene is translated into protein. We develop deep learning models which can quantitatively interpret regulatory elements in human 5’ untranslated regions and use this information to predict a proxy of the translation efficiency. Our model generalizes a previous model to 5’ untranslated regions of any length, just as they are encountered in natural human genes. Because this model requires only the sequence as input, it can give estimates for the impact of mutations in the sequence, even if these particular mutations are very rare or entirely novel. Such estimates could help pinpoint mutations that disrupt the normal functioning of gene regulation, which could be used to better diagnose patients suffering from rare genetic disorders.
Affiliation(s)
- Alexander Karollus
- Department of Informatics, Technical University of Munich, Garching, Germany
| | - Žiga Avsec
- Department of Informatics, Technical University of Munich, Garching, Germany
- Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-Universität München, Munich, Germany
| | - Julien Gagneur
- Department of Informatics, Technical University of Munich, Garching, Germany
- Institute of Human Genetics, Technical University of Munich, Munich, Germany
- Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany
| |
|
24
|
Blakely D, Collins E, Singh R, Norton A, Lanchantin J, Qi Y. FastSK: fast sequence analysis with gapped string kernels. Bioinformatics 2020; 36:i857-i865. [PMID: 33381828 DOI: 10.1093/bioinformatics/btaa817] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Accepted: 09/08/2020] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task's alphabet size.
RESULTS: In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines.
AVAILABILITY AND IMPLEMENTATION: Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Derrick Blakely
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Eamon Collins
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Ritambhara Singh
- Center for Computational Molecular Biology, Brown University, Providence, RI, USA
| | - Andrew Norton
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Jack Lanchantin
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Yanjun Qi
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| |
|
25
|
Krützfeldt LM, Schubach M, Kircher M. The impact of different negative training data on regulatory sequence predictions. PLoS One 2020; 15:e0237412. [PMID: 33259518 PMCID: PMC7707526 DOI: 10.1371/journal.pone.0237412] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 07/24/2020] [Accepted: 11/12/2020] [Indexed: 01/08/2023]
Abstract
Regulatory regions, like promoters and enhancers, cover an estimated 5–15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding, and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training datasets, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell-type, predicting cell-type specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing overall best results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need for hyperparameter optimization.
Affiliation(s)
- Louisa-Marie Krützfeldt
- Charité–Universitätsmedizin Berlin, Berlin, Germany
- Berlin Institute of Health (BIH), Berlin, Germany
| | - Max Schubach
- Charité–Universitätsmedizin Berlin, Berlin, Germany
- Berlin Institute of Health (BIH), Berlin, Germany
| | - Martin Kircher
- Charité–Universitätsmedizin Berlin, Berlin, Germany
- Berlin Institute of Health (BIH), Berlin, Germany
| |
|
26
|
Ahmed S, Hossain Z, Uddin M, Taherzadeh G, Sharma A, Shatabda S, Dehzangi A. Accurate prediction of RNA 5-hydroxymethylcytosine modification by utilizing novel position-specific gapped k-mer descriptors. Comput Struct Biotechnol J 2020; 18:3528-3538. [PMID: 33304452 PMCID: PMC7701324 DOI: 10.1016/j.csbj.2020.10.032] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 07/24/2020] [Revised: 10/30/2020] [Accepted: 10/30/2020] [Indexed: 12/13/2022] Open
Abstract
RNA modification is an essential step towards generation of new RNA structures. Such modification is potentially able to modify RNA function or its stability. Among different modifications, 5-Hydroxymethylcytosine (5hmC) modification of RNA exhibits significant potential for a series of biological processes. Understanding the distribution of 5hmC in RNA is essential to determine its biological functionality. Although conventional sequencing techniques allow broad identification of 5hmC, they are both time-consuming and resource-intensive. In this study, we propose a new computational tool called iRNA5hmC-PS to tackle this problem. To build iRNA5hmC-PS we extract a set of novel sequence-based features called Position-Specific Gapped k-mer (PSG k-mer) to obtain maximum sequential information. Our feature analysis shows that our proposed PSG k-mer features contain vital information for the identification of 5hmC sites. We also use a group-wise feature importance calculation strategy to select a small subset of features containing maximum discriminative information. Our experimental results demonstrate that iRNA5hmC-PS is able to dramatically enhance prediction performance. iRNA5hmC-PS achieves 78.3% prediction performance, which is 12.8% better than that reported in previous studies. iRNA5hmC-PS is publicly available as an online tool at http://103.109.52.8:81/iRNA5hmC-PS. Its benchmark dataset, source codes, and documentation are available at https://github.com/zahid6454/iRNA5hmC-PS.
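To give a sense of what a gapped k-mer feature with positional information might look like, the sketch below extracts nucleotide pairs separated by a fixed gap, once as position-independent counts and once keyed by start position. This is only an illustrative reading of the general idea: the function names, the pair-based form, and the toy RNA sequence are assumptions, and the abstract does not give the authors' exact PSG k-mer definition.

```python
from collections import Counter


def gapped_pair_counts(seq, gap):
    """Count pairs (seq[i], seq[i + gap + 1]) with `gap` ignored bases between them."""
    counts = Counter()
    for i in range(len(seq) - gap - 1):
        counts[(seq[i], seq[i + gap + 1])] += 1
    return counts


def position_specific_gapped_pairs(seq, gap):
    """Map each start position to its gapped pair, written with '_' for gap bases."""
    return {
        i: seq[i] + "_" * gap + seq[i + gap + 1]
        for i in range(len(seq) - gap - 1)
    }


# Toy RNA sequence, invented for this example.
seq = "ACGUAC"
counts = gapped_pair_counts(seq, gap=1)
by_position = position_specific_gapped_pairs(seq, gap=1)
```

The position-keyed form keeps where each gapped pair occurs relative to the sequence start, which is the kind of information a position-specific descriptor would retain and a plain count would discard.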
Affiliation(s)
- Sajid Ahmed
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Zahid Hossain
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Mahtab Uddin
- Department of Natural Science, United International University, Dhaka, Bangladesh
| | - Ghazaleh Taherzadeh
- Institute for Bioscience and Biotechnology Research, University of Maryland, College Park, MD 20742, USA
| | - Alok Sharma
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD 4111, Australia; Department of Medical Science Mathematics, Tokyo Medical and Dental University (TMDU), Tokyo, Japan; Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan; School of Engineering and Physics, University of the South Pacific, Suva, Fiji
| | - Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Abdollah Dehzangi
- Department of Computer Science, Rutgers University, Camden, NJ 08102, USA; Center for Computational and Integrative Biology, Rutgers University, Camden, NJ 08102, USA
| |
|
27
|
Wang C, Li J. A Deep Learning Framework Identifies Pathogenic Noncoding Somatic Mutations from Personal Prostate Cancer Genomes. Cancer Res 2020; 80:4644-4654. [PMID: 32907840 DOI: 10.1158/0008-5472.can-20-1791] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 05/28/2020] [Revised: 07/15/2020] [Accepted: 09/02/2020] [Indexed: 11/16/2022]
Abstract
Our understanding of noncoding mutations in cancer genomes has been derived primarily from mutational recurrence analysis by aggregating clinical samples on a large scale. These cohort-based approaches cannot directly identify individual pathogenic noncoding mutations from personal cancer genomes. Therefore, although most somatic mutations are localized in the noncoding cancer genome, their effects on driving tumorigenesis and progression have not been systematically explored and noncoding somatic alleles have not been leveraged in current clinical practice to guide personalized screening, diagnosis, and treatment. Here, we present a deep learning framework to capture pathogenic noncoding mutations in personal cancer genomes, which perturb gene regulation by altering chromatin architecture. We deployed the system specifically for localized prostate cancer by integrating large-scale prostate cancer genomes and the prostate-specific epigenome. We exhaustively evaluated somatic mutations in each patient's genome and agnostically identified thousands of somatic alleles altering the prostate epigenome. Functional genomic analyses subsequently demonstrated that affected genes displayed differential expression in prostate tumor samples, were vulnerable to expression alterations, and were convergent onto androgen receptor-mediated signaling pathways. Accumulation of pathogenic regulatory mutations in these affected genes was predictive of clinical observations, suggesting potential clinical utility of this approach. Overall, the deep learning framework has significantly expanded our view of somatic mutations in the vast noncoding genome, uncovered novel genes in localized prostate cancer, and will foster the development of personalized screening and therapeutic strategies for prostate cancer. 
SIGNIFICANCE: This study's characterization of the noncoding genome in prostate cancer reveals mutational signatures predictive of clinical observations, which may serve as a powerful prognostic tool in this disease.
Affiliation(s)
- Cheng Wang
- The Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, The Parker Institute for Cancer Immunotherapy, The Bakar Computational Health Sciences Institute, Department of Neurology, School of Medicine, University of California, San Francisco, San Francisco, California
| | - Jingjing Li
- The Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, The Parker Institute for Cancer Immunotherapy, The Bakar Computational Health Sciences Institute, Department of Neurology, School of Medicine, University of California, San Francisco, San Francisco, California.
| |
|
28
|
Corces MR, Shcherbina A, Kundu S, Gloudemans MJ, Frésard L, Granja JM, Louie BH, Eulalio T, Shams S, Bagdatli ST, Mumbach MR, Liu B, Montine KS, Greenleaf WJ, Kundaje A, Montgomery SB, Chang HY, Montine TJ. Single-cell epigenomic analyses implicate candidate causal variants at inherited risk loci for Alzheimer's and Parkinson's diseases. Nat Genet 2020; 52:1158-1168. [PMID: 33106633 PMCID: PMC7606627 DOI: 10.1038/s41588-020-00721-x] [Citation(s) in RCA: 241] [Impact Index Per Article: 48.2] [Received: 12/27/2019] [Accepted: 09/18/2020] [Indexed: 02/06/2023]
Abstract
Genome-wide association studies of neurological diseases have identified thousands of variants associated with disease phenotypes. However, most of these variants do not alter coding sequences, making it difficult to assign their function. Here, we present a multi-omic epigenetic atlas of the adult human brain through profiling of single-cell chromatin accessibility landscapes and three-dimensional chromatin interactions of diverse adult brain regions across a cohort of cognitively healthy individuals. We developed a machine-learning classifier to integrate this multi-omic framework and predict dozens of functional SNPs for Alzheimer's and Parkinson's diseases, nominating target genes and cell types for previously orphaned loci from genome-wide association studies. Moreover, we dissected the complex inverted haplotype of the MAPT (encoding tau) Parkinson's disease risk locus, identifying putative ectopic regulatory interactions in neurons that may mediate this disease association. This work expands understanding of inherited variation and provides a roadmap for the epigenomic dissection of causal regulatory variation in disease.
Affiliation(s)
- M Ryan Corces
- Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
| | - Anna Shcherbina
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Soumya Kundu
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Michael J Gloudemans
- Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
| | - Laure Frésard
- Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
| | - Jeffrey M Granja
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Program in Biophysics, Stanford University, Stanford, CA, USA
| | - Bryan H Louie
- Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
| | - Tiffany Eulalio
- Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
| | - Shadi Shams
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - S Tansu Bagdatli
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Maxwell R Mumbach
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Boxiang Liu
- Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biology, Stanford University, Stanford, CA, USA
- Baidu Research, Sunnyvale, CA, USA
| | - Kathleen S Montine
- Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
| | - William J Greenleaf
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Department of Applied Physics, Stanford University, Stanford, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| | - Anshul Kundaje
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Stephen B Montgomery
- Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Howard Y Chang
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA.
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA.
- Program in Epithelial Biology, Stanford University, Stanford, CA, USA.
- Howard Hughes Medical Institute, Stanford University, Stanford, CA, USA.
| | - Thomas J Montine
- Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA.
| |
|
29
|
Wekesa JS, Meng J, Luan Y. A deep learning model for plant lncRNA-protein interaction prediction with graph attention. Mol Genet Genomics 2020; 295:1091-1102. [DOI: 10.1007/s00438-020-01682-w] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 04/04/2020] [Accepted: 05/01/2020] [Indexed: 02/06/2023]
|
30
|
Wekesa JS, Meng J, Luan Y. Multi-feature fusion for deep learning to predict plant lncRNA-protein interaction. Genomics 2020; 112:2928-2936. [PMID: 32437848 DOI: 10.1016/j.ygeno.2020.05.005] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 12/17/2019] [Revised: 04/22/2020] [Accepted: 05/05/2020] [Indexed: 12/28/2022]
Abstract
Long non-coding RNAs (lncRNAs) play key roles in regulating cellular biological processes through diverse molecular mechanisms, including binding to RNA-binding proteins. The majority of plant lncRNAs are functionally uncharacterized; thus, accurate prediction of plant lncRNA-protein interaction is imperative for subsequent functional studies. We present an integrative model, namely DRPLPI. Its uniqueness is that it predicts by multi-feature fusion. Structural and four groups of sequence features are used, including tri-nucleotide composition, gapped k-mer, recursive complement and binary profile. We design a multi-head self-attention long short-term memory encoder-decoder network to extract generative high-level features. To obtain robust results, DRPLPI combines categorical boosting and extra trees into a single meta-learner. Experiments on Zea mays and Arabidopsis thaliana obtained 0.9820 and 0.9652 area under precision/recall curve (AUPRC), respectively. The proposed method shows significant enhancement in the prediction performance compared with existing state-of-the-art methods.
Affiliation(s)
- Jael Sanyanda Wekesa
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China; School of Computing and Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi 62000-00200, Kenya
| | - Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China.
| | - Yushi Luan
- School of Bioengineering, Dalian University of Technology, Dalian, Liaoning 116023, China
| |
|
31
|
Phuycharoen M, Zarrineh P, Bridoux L, Amin S, Losa M, Chen K, Bobola N, Rattray M. Uncovering tissue-specific binding features from differential deep learning. Nucleic Acids Res 2020; 48:e27. [PMID: 31974574 PMCID: PMC7049686 DOI: 10.1093/nar/gkaa009] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 04/29/2019] [Revised: 11/04/2019] [Accepted: 01/07/2020] [Indexed: 01/24/2023] Open
Abstract
Transcription factors (TFs) can bind DNA in a cooperative manner, enabling a mutual increase in occupancy. Through this type of interaction, alternative binding sites can be preferentially bound in different tissues to regulate tissue-specific expression programmes. Recently, deep learning models have become state-of-the-art in various pattern analysis tasks, including applications in the field of genomics. We therefore investigate the application of convolutional neural network (CNN) models to the discovery of sequence features determining cooperative and differential TF binding across tissues. We analyse ChIP-seq data from MEIS, TFs which are broadly expressed across mouse branchial arches, and HOXA2, which is expressed in the second and more posterior branchial arches. By developing models predictive of MEIS differential binding in all three tissues, we are able to accurately predict HOXA2 co-binding sites. We evaluate transfer-like and multitask approaches to regularizing the high-dimensional classification task with a larger regression dataset, allowing for the creation of deeper and more accurate models. We test the performance of perturbation and gradient-based attribution methods in identifying the HOXA2 sites from differential MEIS data. Our results show that deep regularized models significantly outperform shallow CNNs as well as k-mer methods in the discovery of tissue-specific sites bound in vivo.
Affiliation(s)
- Mike Phuycharoen
- Department of Computer Science, The University of Manchester, Oxford Rd, Manchester M13 9PL, UK
| | - Peyman Zarrineh
- School of Health Sciences, The University of Manchester, Oxford Rd, Manchester M13 9PL, UK
| | - Laure Bridoux
- School of Medical Sciences, The University of Manchester, Oxford Rd, Manchester M13 9PL, UK
| | - Shilu Amin
- School of Medical Sciences, The University of Manchester, Oxford Rd, Manchester M13 9PL, UK
| | - Marta Losa
- Department of Orofacial Sciences and Department of Anatomy, University of California San Francisco, 513 Parnassus Avenue, HSW 740, San Francisco, CA 94143, USA
| | - Ke Chen
- Department of Computer Science, The University of Manchester, Oxford Rd, Manchester M13 9PL, UK
| | - Nicoletta Bobola
- School of Medical Sciences, The University of Manchester, Oxford Rd, Manchester M13 9PL, UK
| | - Magnus Rattray
- School of Health Sciences, The University of Manchester, Oxford Rd, Manchester M13 9PL, UK
| |
|