1
Wu Q, Li Y, Wang Q, Zhao X, Sun D, Liu B. Identification of DNA motif pairs on paired sequences based on composite heterogeneous graph. Front Genet 2024; 15:1424085. [PMID: 38952710 PMCID: PMC11215013 DOI: 10.3389/fgene.2024.1424085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2024] [Accepted: 05/22/2024] [Indexed: 07/03/2024] Open
Abstract
Motivation: The interaction between DNA motifs (DNA motif pairs) influences gene expression through partnership or competition in the process of gene regulation. Potential chromatin interactions between different DNA motifs have been implicated in various diseases. However, current methods for identifying DNA motif pairs rely on the recognition of single DNA motifs or on probabilities, which may result in locally optimal solutions and can be sensitive to the choice of initial values. A method for precisely identifying DNA motif pairs is still lacking. Results: Here, we propose a novel computational method for predicting DNA Motif Pairs based on Composite Heterogeneous Graph (MPCHG). This approach leverages a composite heterogeneous graph model to identify DNA motif pairs on paired sequences. Compared with existing methods, MPCHG greatly improves the accuracy of motif prediction. Furthermore, the predicted DNA motifs show higher DNase accessibility than the background sequences. Notably, the two DNA motifs forming a pair exhibit functional consistency. Importantly, the interacting TF pairs obtained from the predicted DNA motif pairs were significantly enriched with known interacting TF pairs, suggesting their potential contribution to chromatin interactions. Collectively, we believe that these identified DNA motif pairs hold substantial implications for revealing gene transcriptional regulation under long-range chromatin interactions.
Affiliation(s)
- Qiuqin Wu
  - School of Mathematics, Shandong University, Jinan, China
- Yang Li
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
- Qi Wang
  - School of Mathematics, Shandong University, Jinan, China
- Xiaoyu Zhao
  - School of Mathematics, Shandong University, Jinan, China
- Duanchen Sun
  - School of Mathematics, Shandong University, Jinan, China
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan, China
2
Li Y, Wang Y, Wang C, Ma A, Ma Q, Liu B. A weighted two-stage sequence alignment framework to identify motifs from ChIP-exo data. Patterns (New York, N.Y.) 2024; 5:100927. [PMID: 38487805 PMCID: PMC10935504 DOI: 10.1016/j.patter.2024.100927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 08/18/2023] [Accepted: 01/10/2024] [Indexed: 03/17/2024]
Abstract
In this study, we introduce TESA (weighted two-stage alignment), an innovative motif prediction tool that refines the identification of DNA-binding protein motifs, essential for deciphering transcriptional regulatory mechanisms. Unlike traditional algorithms that rely solely on sequence data, TESA integrates the high-resolution chromatin immunoprecipitation (ChIP) signal, specifically from ChIP-exonuclease (ChIP-exo), by assigning weights to sequence positions, thereby enhancing motif discovery. TESA employs a nuanced approach combining a binomial distribution model with a graph model, further supported by a "bookend" model, to improve the accuracy of predicting motifs of varying lengths. Our evaluation, utilizing an extensive compilation of 90 prokaryotic ChIP-exo datasets from proChIPdb and 167 H. sapiens datasets, compared TESA's performance against seven established tools. The results indicate TESA's improved precision in motif identification, suggesting its valuable contribution to the field of genomic research.
Affiliation(s)
- Yang Li
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Yizhong Wang
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Cankun Wang
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Anjun Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Qin Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, USA
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
3
Wang Y, Li Y, Wang C, Lio CWJ, Ma Q, Liu B. CEMIG: prediction of the cis-regulatory motif using the de Bruijn graph from ATAC-seq. Brief Bioinform 2023; 25:bbad505. [PMID: 38189539 PMCID: PMC10772951 DOI: 10.1093/bib/bbad505] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/22/2023] [Revised: 11/21/2023] [Accepted: 12/03/2023] [Indexed: 01/09/2024] Open
Abstract
Sequence motif discovery algorithms enhance the identification of novel deoxyribonucleic acid sequences with pivotal biological significance, especially transcription factor (TF)-binding motifs. The advent of assay for transposase-accessible chromatin using sequencing (ATAC-seq) has broadened the toolkit for motif characterization. Nonetheless, prevailing computational approaches have focused on delineating TF-binding footprints, with motif discovery receiving less attention. Herein, we present Cis rEgulatory Motif Influence using de Bruijn Graph (CEMIG), an algorithm leveraging de Bruijn and Hamming distance graph paradigms to predict and map motif sites. Assessment on 129 ATAC-seq datasets from the Cistrome Data Browser demonstrates CEMIG's exceptional performance, surpassing three established methodologies on four evaluative metrics. CEMIG accurately identifies both cell-type-specific and common TF motifs within GM12878 and K562 cell lines, demonstrating its comparative genomic capabilities in the identification of evolutionary conservation and cell-type specificity. In-depth transcriptional and functional genomic studies have validated the functional relevance of CEMIG-identified motifs across various cell types. CEMIG is available at https://github.com/OSU-BMBL/CEMIG, developed in C++ to ensure cross-platform compatibility with Linux, macOS and Windows operating systems.
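As a rough illustration of the two graph ideas named in this abstract, the Python sketch below builds a de Bruijn graph from the k-mers of a few sequences and links k-mers that lie within a small Hamming distance of each other. It is not CEMIG's implementation (which the abstract states is written in C++); the function names, the value of k, and the distance threshold are assumptions made for this example only.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_edges(seqs, k):
    """Count de Bruijn edges: each k-mer links its (k-1)-mer prefix to its (k-1)-mer suffix."""
    edges = defaultdict(int)
    for seq in seqs:
        for kmer in kmers(seq, k):
            edges[(kmer[:-1], kmer[1:])] += 1
    return dict(edges)


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_graph(kmer_set, max_dist=1):
    """Undirected edges between k-mers that differ in at most max_dist positions (hypothetical threshold)."""
    nodes = sorted(kmer_set)
    return {(a, b) for a, b in combinations(nodes, 2) if hamming(a, b) <= max_dist}
```

In a motif-finding setting, frequently traversed de Bruijn paths suggest over-represented words, and the Hamming graph groups near-identical words that may be instances of one degenerate motif. How CEMIG scores and combines the two graphs is not described in the abstract and is not modeled here.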
Affiliation(s)
- Yizhong Wang
  - School of Mathematics, Shandong University, Jinan, 250100, China
- Yang Li
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Cankun Wang
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Chan-Wang Jerry Lio
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH, 43210, USA
- Qin Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH, 43210, USA
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan, 250100, China
4
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023] Open
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
Affiliation(s)
- Manuel Tognon
  - Computer Science Department, University of Verona, Verona, Italy
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Rosalba Giugno
  - Computer Science Department, University of Verona, Verona, Italy
- Luca Pinello
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
  - Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America
5
Structural remodeling of AAA+ ATPase p97 by adaptor protein ASPL facilitates posttranslational methylation by METTL21D. Proc Natl Acad Sci U S A 2023; 120:e2208941120. [PMID: 36656859 PMCID: PMC9942839 DOI: 10.1073/pnas.2208941120] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 01/20/2023] Open
Abstract
p97 is an essential AAA+ ATPase that extracts and unfolds substrate proteins from membranes and protein complexes. Through its mode of action, p97 contributes to various cellular processes, such as membrane fusion, ER-associated protein degradation, DNA repair, and many others. Diverse p97 functions and protein interactions are regulated by a large number of adaptor proteins. Alveolar soft part sarcoma locus (ASPL) is a unique adaptor protein that regulates p97 by disassembling functional p97 hexamers to smaller entities. An alternative mechanism to regulate the activity and interactions of p97 is by posttranslational modifications (PTMs). Although more than 140 PTMs have been identified in p97, only a handful of those have been described in detail. Here we present structural and biochemical data to explain how the p97-remodeling adaptor protein ASPL enables the metastasis promoting methyltransferase METTL21D to bind and trimethylate p97 at a single lysine side chain, which is deeply buried inside functional p97 hexamers. The crystal structure of a heterotrimeric p97:ASPL:METTL21D complex in the presence of cofactors ATP and S-adenosyl homocysteine reveals how structural remodeling by ASPL exposes the crucial lysine residue of p97 to facilitate its trimethylation by METTL21D. The structure also uncovers a role of the second region of homology (SRH) present in the first ATPase domain of p97 in binding of a modifying enzyme to the AAA+ ATPase. Investigation of this interaction in the human, fish, and plant reveals fine details on the mechanism and significance of p97 trimethylation by METTL21D across different organisms.
6
Qi Z, Jung C, Bandilla P, Ludwig C, Heron M, Sophie Kiesel A, Museridze M, Philippou‐Massier J, Nikolov M, Renna Max Schnepf A, Unnerstall U, Ceolin S, Mühlig B, Gompel N, Soeding J, Gaul U. Large-scale analysis of Drosophila core promoter function using synthetic promoters. Mol Syst Biol 2022; 18:e9816. [PMID: 35156763 PMCID: PMC8842121 DOI: 10.15252/msb.20209816] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/26/2020] [Revised: 01/11/2022] [Accepted: 01/13/2022] [Indexed: 02/02/2023] Open
Abstract
The core promoter plays a central role in setting metazoan gene expression levels, but how exactly it "computes" expression remains poorly understood. To dissect its function, we carried out a comprehensive structure-function analysis in Drosophila. First, we performed a genome-wide bioinformatic analysis, providing an improved picture of the sequence motif architecture. We then measured the activities of ~3,000 mutational variants of synthetic promoters, with and without an external stimulus (hormonal activation), at large scale and with high accuracy using robotics and a dual luciferase reporter assay. We observed a strong impact on activity of the different types of mutations, including knockout of individual sequence motifs and motif combinations, variations of motif strength, nucleosome positioning, and flanking sequences. A linear combination of the individual motif features largely accounts for the combinatorial effects on core promoter activity. These findings shed new light on the quantitative assessment of gene expression in metazoans.
Affiliation(s)
- Zhan Qi
  - Department of Biochemistry, Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich, Germany
- Christophe Jung
  - Department of Biochemistry, Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich, Germany
- Peter Bandilla
  - Department of Biochemistry, Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich, Germany
- Claudia Ludwig
  - Department of Biochemistry, Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich, Germany
- Mark Heron
  - Department of Biochemistry, Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich, Germany
- Anja Sophie Kiesel
  - Department of Biochemistry, Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich, Germany
- Mariam Museridze
  - Department of Biology II, Evolutionary Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
- Julia Philippou-Massier
  - Department of Biochemistry, Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich, Germany
- Miroslav Nikolov
  - Department of Biochemistry, Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich, Germany
- Alessio Renna Max Schnepf
  - Department of Biochemistry, Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich, Germany
- Ulrich Unnerstall
  - Department of Biochemistry, Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich, Germany
- Stefano Ceolin
  - Department of Biology II, Evolutionary Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
- Bettina Mühlig
  - Department of Biology II, Evolutionary Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
- Nicolas Gompel
  - Department of Biology II, Evolutionary Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
- Johannes Soeding
  - Department of Biochemistry, Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich, Germany
  - Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
- Ulrike Gaul
  - Department of Biochemistry, Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich, Germany
7
Zhang S, Ma A, Zhao J, Xu D, Ma Q, Wang Y. Assessing deep learning methods in cis-regulatory motif finding based on genomic sequencing data. Brief Bioinform 2022; 23:bbab374. [PMID: 34607350 PMCID: PMC8769700 DOI: 10.1093/bib/bbab374] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 05/19/2021] [Revised: 08/22/2021] [Accepted: 08/23/2021] [Indexed: 12/28/2022] Open
Abstract
Identifying cis-regulatory motifs from genomic sequencing data (e.g. ChIP-seq and CLIP-seq) is crucial in identifying transcription factor (TF) binding sites and inferring gene regulatory mechanisms for any organism. Since 2015, deep learning (DL) methods have been widely applied to identify TF binding sites and predict motif patterns, with the strengths of offering a scalable, flexible and unified computational approach for highly accurate predictions. As far as we know, 20 DL methods have been developed. However, without a clear and systematic assessment, users will struggle to choose the most appropriate tool for their specific studies. In this manuscript, we evaluated 20 DL methods for cis-regulatory motif prediction using 690 ENCODE ChIP-seq, 126 cancer ChIP-seq and 55 RNA CLIP-seq data. Four metrics were investigated, including the accuracy of motif finding, the performance of DNA/RNA sequence classification, algorithm scalability and tool usability. The assessment results demonstrated the high complementarity of the existing DL methods. It was determined that the most suitable model should primarily depend on the data size and type and the method's outputs.
Affiliation(s)
- Shuangquan Zhang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
- Anjun Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Jing Zhao
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Dong Xu
  - Department of Electrical Engineering and Computer Science, and Christopher S. Bond Life Science Center, University of Missouri, MO, 65211, USA
- Qin Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Yan Wang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
  - School of Artificial Intelligence, Jilin University, Changchun, 130012, China
8
Zhang Q, Wang D, Han K, Huang DS. Predicting TF-DNA Binding Motifs from ChIP-seq Datasets Using the Bag-Based Classifier Combined With a Multi-Fold Learning Scheme. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1743-1751. [PMID: 32946398 DOI: 10.1109/tcbb.2020.3025007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/11/2023]
Abstract
The rapid development of high-throughput sequencing technology provides unique opportunities for studying transcription factor binding sites, but also brings new computational challenges. Recently, a series of discriminative motif discovery (DMD) methods have been proposed and offer promising solutions for addressing these challenges. However, because of the huge computation cost, most of them have to choose approximate schemes that either sacrifice the accuracy of motif representation or tune motif parameters indirectly. In this paper, we propose a bag-based classifier combined with a multi-fold learning scheme (BCMF) to discover motifs from ChIP-seq datasets. First, BCMF naturally formulates input sequences as labeled bags. Then, a bag-based classifier, combined with a bag feature extraction strategy, is applied to construct the objective function, and a multi-fold learning scheme is used to solve it. Compared with the existing DMD tools, BCMF features three improvements: 1) learning the position weight matrix (PWM) directly in a continuous space; 2) representing a positive bag with a feature fused from its k "most positive" patterns; 3) applying a more advanced learning scheme. The experimental results on 134 ChIP-seq datasets show that BCMF substantially outperforms existing DMD methods (including DREME, HOMER, XXmotif, motifRG, EDCOD and our previous work).
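To make the "bag" idea in this abstract more concrete, the Python sketch below treats one sequence as a bag of overlapping windows, scores every window against a position weight matrix, and summarizes the bag by its k highest-scoring windows. The abstract does not say how BCMF fuses the k "most positive" patterns, so averaging the top-k scores is an assumption of this example, as are the function names and the toy matrix.

```python
# Toy log-odds PWM of width 2 (made-up values for illustration only):
# one dict per motif position, mapping base -> score.
TOY_PWM = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]


def window_scores(seq, pwm):
    """Score every window of len(pwm) bases by summing per-position PWM scores."""
    width = len(pwm)
    return [
        sum(pwm[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]


def bag_feature(seq, pwm, k=2):
    """Summarize a sequence (bag) by the mean of its k best window scores.

    The mean is an assumed fusion rule, not taken from the source.
    """
    scores = sorted(window_scores(seq, pwm), reverse=True)
    if not scores:
        raise ValueError("sequence is shorter than the motif")
    top = scores[:k]
    return sum(top) / len(top)
```

With k = 1 this reduces to the common "best site per sequence" score; larger k lets several plausible sites in one sequence contribute to the feature that a classifier would see.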
9
Sohrabi-Jahromi S, Söding J. Thermodynamic modeling reveals widespread multivalent binding by RNA-binding proteins. Bioinformatics 2021; 37:i308-i316. [PMID: 34252974 PMCID: PMC8275352 DOI: 10.1093/bioinformatics/btab300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Understanding how proteins recognize their RNA targets is essential to elucidate regulatory processes in the cell. Many RNA-binding proteins (RBPs) form complexes or have multiple domains that allow them to bind to RNA in a multivalent, cooperative manner. They can thereby achieve higher specificity and affinity than proteins with a single RNA-binding domain. However, current approaches to de novo discovery of RNA binding motifs do not take multivalent binding into account. RESULTS We present Bipartite Motif Finder (BMF), which is based on a thermodynamic model of RBPs with two cooperatively binding RNA-binding domains. We show that bivalent binding is a common strategy among RBPs, yielding higher affinity and sequence specificity. We furthermore illustrate that the spatial geometry between the binding sites can be learned from bound RNA sequences. These discovered bipartite motifs are consistent with previously known motifs and binding behaviors. Our results demonstrate the importance of multivalent binding for RNA-binding proteins and highlight the value of bipartite motif models in representing the multivalency of protein-RNA interactions. AVAILABILITY AND IMPLEMENTATION BMF source code is available at https://github.com/soedinglab/bipartite_motif_finder under a GPL license. The BMF web server is accessible at https://bmf.soedinglab.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Salma Sohrabi-Jahromi
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen 37077, Germany
- Johannes Söding
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen 37077, Germany
  - Campus-Institut Data Science (CIDAS), Göttingen 37077, Germany
10
Ni P, Su Z. Accurate prediction of cis-regulatory modules reveals a prevalent regulatory genome of humans. NAR Genom Bioinform 2021; 3:lqab052. [PMID: 34159315 PMCID: PMC8210889 DOI: 10.1093/nargab/lqab052] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 02/26/2021] [Revised: 05/01/2021] [Accepted: 06/14/2021] [Indexed: 02/07/2023] Open
Abstract
cis-regulatory modules (CRMs) formed by clusters of transcription factor (TF) binding sites (TFBSs) are as important as coding sequences in specifying phenotypes of humans. It is essential to categorize all CRMs and constituent TFBSs in the genome. In contrast to most existing methods that predict CRMs in specific cell types using epigenetic marks, we predict a largely cell type agnostic but more comprehensive map of CRMs and constituent TFBSs in the genome by integrating all available TF ChIP-seq datasets. Our method is able to partition 77.47% of genome regions covered by available 6092 datasets into a CRM candidate (CRMC) set (56.84%) and a non-CRMC set (43.16%). Intriguingly, the predicted CRMCs are under strong evolutionary constraints, while the non-CRMCs are largely selectively neutral, strongly suggesting that the CRMCs are likely cis-regulatory, while the non-CRMCs are not. Our predicted CRMs are under stronger evolutionary constraints than three state-of-the-art predictions (GeneHancer, EnhancerAtlas and ENCODE phase 3) and substantially outperform them for recalling VISTA enhancers and non-coding ClinVar variants. We estimated that the human genome might encode about 1.47M CRMs and 68M TFBSs, comprising about 55% and 22% of the genome, respectively; for both of which, we predicted 80%. Therefore, the cis-regulatory genome appears to be more prevalent than originally thought.
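The percentages in this abstract can be related by a short calculation. It assumes that 56.84% and 43.16% (which sum to 100%) are shares of the covered 77.47% of the genome rather than of the whole genome; that reading is an interpretation of the abstract, not a statement from it.

```latex
\begin{align*}
\text{CRMC share of genome} &\approx 0.7747 \times 0.5684 \approx 0.4403 \\
\text{non-CRMC share of genome} &\approx 0.7747 \times 0.4316 \approx 0.3344 \\
\text{CRMC share relative to the estimated CRM content} &\approx 0.4403 / 0.55 \approx 0.80
\end{align*}
```

Under that assumption, and if the predicted CRMs occupy roughly the CRMC share, the last line is consistent with the abstract's statement that about 80% of the estimated 55% CRM content was predicted.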
Affiliation(s)
- Pengyu Ni
  - Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
- Zhengchang Su
  - Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
11
Sahlén P, Spalinskas R, Asad S, Mahapatra KD, Höjer P, Anil A, Eisfeldt J, Srivastava A, Nikamo P, Mukherjee A, Kim KH, Bergman O, Ståhle M, Sonkoly E, Pivarcsi A, Wahlgren CF, Nordenskjöld M, Taylan F, Bradley M, Tapia-Páez I. Chromatin interactions in differentiating keratinocytes reveal novel atopic dermatitis- and psoriasis-associated genes. J Allergy Clin Immunol 2020; 147:1742-1752. [PMID: 33069716 DOI: 10.1016/j.jaci.2020.09.035] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 04/17/2020] [Revised: 08/14/2020] [Accepted: 09/17/2020] [Indexed: 12/30/2022]
Abstract
BACKGROUND Hundreds of variants associated with atopic dermatitis (AD) and psoriasis, 2 common inflammatory skin disorders, have previously been discovered through genome-wide association studies (GWASs). The majority of these variants are in noncoding regions, and their target genes remain largely unclear. OBJECTIVE We sought to understand the effects of these noncoding variants on the development of AD and psoriasis by linking them to the genes that they regulate. METHODS We constructed genomic 3-dimensional maps of human keratinocytes during differentiation by using targeted chromosome conformation capture (Capture Hi-C) targeting more than 20,000 promoters and 214 GWAS variants and combined these data with transcriptome and epigenomic data sets. We validated our results with reporter assays, clustered regularly interspaced short palindromic repeats activation, and examination of patient gene expression from previous studies. RESULTS We identified 118 target genes of 82 AD and psoriasis GWAS variants. Differential expression of 58 of the 118 target genes (49%) occurred in either AD or psoriatic lesions, many of which were not previously linked to any skin disease. We highlighted the genes AFG1L, CLINT1, ADO, LINC00302, and RP1-140J1.1 and provided further evidence for their potential roles in AD and psoriasis. CONCLUSIONS Our work focused on skin barrier pathology through investigation of the interaction profile of GWAS variants during keratinocyte differentiation. We have provided a catalogue of candidate genes that could modulate the risk of AD and psoriasis. Given that only 35% of the target genes are the gene nearest to the known GWAS variants, we expect that our work will contribute to the discovery of novel pathways involved in AD and psoriasis.
Affiliation(s)
- Pelin Sahlén
  - KTH Royal Institute of Technology, School of Chemistry, Biotechnology and Health, Science for Life Laboratory, Stockholm, Sweden
- Rapolas Spalinskas
  - KTH Royal Institute of Technology, School of Chemistry, Biotechnology and Health, Science for Life Laboratory, Stockholm, Sweden
- Samina Asad
  - Dermatology and Venereology Division, Department of Medicine Solna, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden
- Kunal Das Mahapatra
  - Dermatology and Venereology Division, Department of Medicine Solna, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden
- Pontus Höjer
  - KTH Royal Institute of Technology, School of Chemistry, Biotechnology and Health, Science for Life Laboratory, Stockholm, Sweden
- Anandashankar Anil
  - KTH Royal Institute of Technology, School of Chemistry, Biotechnology and Health, Science for Life Laboratory, Stockholm, Sweden
- Jesper Eisfeldt
  - Department of Molecular Medicine and Surgery Center for Molecular Medicine, Karolinska Institutet, Stockholm, Sweden
  - Department of Clinical Genetics, Karolinska University Hospital, Stockholm, Sweden
- Ankit Srivastava
  - Dermatology and Venereology Division, Department of Medicine Solna, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden
- Pernilla Nikamo
  - Dermatology and Venereology Division, Department of Medicine Solna, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden
- Anaya Mukherjee
  - Dermatology and Venereology Division, Department of Medicine Solna, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden
- Kyu-Han Kim
  - Basic Research and Innovation Division, Research and Development Unit, AmorePacific Corporation, Yongin-si, Korea
- Otto Bergman
  - Division of Cardiovascular Medicine, Center for Molecular Medicine, Department of Medicine Solna, Karolinska Institutet, Stockholm, Karolinska University Hospital, Solna, Sweden
- Mona Ståhle
  - Dermatology and Venereology Division, Department of Medicine Solna, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden
- Enikö Sonkoly
  - Dermatology and Venereology Division, Department of Medicine Solna, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden
  - Dermatology Unit, Karolinska University Hospital, Stockholm, Sweden
  - Department of Cell and Molecular Biology, Karolinska Institutet, Stockholm, Sweden
- Andor Pivarcsi
  - Dermatology and Venereology Division, Department of Medicine Solna, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden
  - Department of Cell and Molecular Biology, Karolinska Institutet, Stockholm, Sweden
  - Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
- Carl-Fredrik Wahlgren
  - Dermatology and Venereology Division, Department of Medicine Solna, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden
- Magnus Nordenskjöld
  - Department of Molecular Medicine and Surgery Center for Molecular Medicine, Karolinska Institutet, Stockholm, Sweden
  - Department of Clinical Genetics, Karolinska University Hospital, Stockholm, Sweden
- Fulya Taylan
  - Department of Molecular Medicine and Surgery Center for Molecular Medicine, Karolinska Institutet, Stockholm, Sweden
  - Department of Clinical Genetics, Karolinska University Hospital, Stockholm, Sweden
- Maria Bradley
  - Dermatology and Venereology Division, Department of Medicine Solna, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden
  - Dermatology Unit, Karolinska University Hospital, Stockholm, Sweden
- Isabel Tapia-Páez
  - Dermatology and Venereology Division, Department of Medicine Solna, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden
12
|
Li Y, Ni P, Zhang S, Li G, Su Z. ProSampler: an ultrafast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery. Bioinformatics 2020; 35:4632-4639. [PMID: 31070745 DOI: 10.1093/bioinformatics/btz290] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 10/03/2018] [Revised: 03/29/2019] [Accepted: 04/18/2019] [Indexed: 01/25/2023] Open
Abstract
MOTIVATION: The availability of numerous ChIP-seq datasets for transcription factors (TF) has provided an unprecedented opportunity to identify all TF binding sites in genomes. However, the progress has been hindered by the lack of a highly efficient and accurate tool to find not only the target motifs, but also cooperative motifs in very big datasets. RESULTS: We herein present an ultrafast and accurate motif-finding algorithm, ProSampler, based on a novel numeration method and Gibbs sampler. ProSampler runs orders of magnitude faster than the fastest existing tools while often more accurately identifying motifs of both the target TFs and cooperators. Thus, ProSampler can greatly facilitate the efforts to identify the entire cis-regulatory code in genomes. AVAILABILITY AND IMPLEMENTATION: Source code and binaries are freely available for download at https://github.com/zhengchangsulab/prosampler. It was implemented in C++ and supported on Linux, macOS and MS Windows platforms. SUPPLEMENTARY INFORMATION: Supplementary materials are available at Bioinformatics online.
Affiliation(s)
- Yang Li
- School of Mathematics, Shandong University, Jinan 250100, China; Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Pengyu Ni
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Shaoqiang Zhang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
- Guojun Li
- School of Mathematics, Shandong University, Jinan 250100, China; Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Zhengchang Su
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA

13
Li Q, Sapkota M, van der Knaap E. Perspectives of CRISPR/Cas-mediated cis-engineering in horticulture: unlocking the neglected potential for crop improvement. Horticulture Research 2020; 7:36. [PMID: 32194972 PMCID: PMC7072075 DOI: 10.1038/s41438-020-0258-8] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 11/13/2019] [Revised: 01/09/2020] [Accepted: 02/11/2020] [Indexed: 05/14/2023]
Abstract
Directed breeding of horticultural crops is essential for increasing yield, nutritional content, and consumer-valued characteristics such as shape and color of the produce. However, limited genetic diversity restricts the amount of crop improvement that can be achieved through conventional breeding approaches. Natural genetic changes in cis-regulatory regions of genes play important roles in shaping phenotypic diversity by altering their expression. Utilization of CRISPR/Cas editing in crop species can accelerate crop improvement through the introduction of genetic variation in a targeted manner. The advent of CRISPR/Cas-mediated cis-regulatory region engineering (cis-engineering) provides a more refined method for modulating gene expression and creating phenotypic diversity to benefit crop improvement. Here, we focus on the current applications of CRISPR/Cas-mediated cis-engineering in horticultural crops. We describe strategies and limitations for its use in crop improvement, including de novo cis-regulatory element (CRE) discovery, precise genome editing, and transgene-free genome editing. In addition, we discuss the challenges and prospects regarding current technologies and achievements. CRISPR/Cas-mediated cis-engineering is a critical tool for generating horticultural crops that are better able to adapt to climate change and providing food for an increasing world population.
Affiliation(s)
- Qiang Li
- College of Horticultural Science and Engineering, Shandong Agricultural University, Tai’an, China
- Center for Applied Genetic Technologies, University of Georgia, Athens, GA USA
- Manoj Sapkota
- Institute for Plant Breeding, Genetics and Genomics, University of Georgia, Athens, GA USA
- Esther van der Knaap
- Center for Applied Genetic Technologies, University of Georgia, Athens, GA USA
- Institute for Plant Breeding, Genetics and Genomics, University of Georgia, Athens, GA USA
- Department of Horticulture, University of Georgia, Athens, GA USA

14
Zhang Q, Zhu L, Bao W, Huang DS. Weakly-Supervised Convolutional Neural Network Architecture for Predicting Protein-DNA Binding. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:679-689. [PMID: 30106688 DOI: 10.1109/tcbb.2018.2864203] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 06/08/2023]
Abstract
Although convolutional neural networks (CNN) have outperformed conventional methods in predicting the sequence specificities of protein-DNA binding in recent years, they do not take full advantage of the intrinsic weakly-supervised information in DNA sequences, namely that a bound sequence may contain multiple TFBSs. Here, we propose a weakly-supervised convolutional neural network architecture (WSCNN), combining multiple-instance learning (MIL) with CNN, to further boost the performance of predicting protein-DNA binding. WSCNN first divides each DNA sequence into multiple overlapping subsequences (instances) with a sliding window, then separately models each instance using CNN, and finally fuses the predicted scores of all instances in the same bag using four fusion methods: Max, Average, Linear Regression, and Top-Bottom Instances. The experimental results on in vivo and in vitro datasets illustrate the performance of the proposed approach. Moreover, models built on in vitro data using WSCNN can predict in vivo protein-DNA binding with good accuracy. In addition, we give a quantitative analysis of the importance of the reverse-complement mode in predicting in vivo protein-DNA binding, and explain, through a series of experiments, why advanced pooling layers are not used directly to combine MIL with CNN.
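The instance-and-bag idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_score` is a hypothetical stand-in for the per-instance CNN (it simply returns the G/C fraction), the window and stride values are arbitrary, and only the Max and Average fusion rules are shown; Linear Regression and Top-Bottom Instances are omitted.

```python
from statistics import mean


def sliding_instances(seq, window, stride):
    """Split a sequence (the bag) into overlapping subsequences (instances)."""
    if window >= len(seq):
        return [seq]
    # Note: a tail shorter than `window` is dropped when the stride
    # does not line up with the end of the sequence.
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]


def toy_score(instance):
    """Hypothetical placeholder for the per-instance CNN: G/C fraction."""
    return sum(base in "GC" for base in instance) / len(instance)


def fuse(scores, method="max"):
    """Fuse instance scores into one bag-level score."""
    if method == "max":
        return max(scores)
    if method == "average":
        return mean(scores)
    raise ValueError(f"unsupported fusion method: {method}")


bag = sliding_instances("ACGTGCGCAT", window=4, stride=2)
scores = [toy_score(inst) for inst in bag]
bag_max = fuse(scores, "max")
bag_avg = fuse(scores, "average")
```

Max fusion lets a single strongly scoring instance decide the bag label, while Average fusion dilutes it across all windows; which behaves better on real data is an empirical question that this sketch does not address.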

15
Hashim FA, Houssein EH, Hussain K, Mabrouk MS, Al-Atabany W. A modified Henry gas solubility optimization for solving motif discovery problem. Neural Comput Appl 2019. [DOI: 10.1007/s00521-019-04611-0] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Indexed: 10/25/2022]

16
Yu Q, Zhao X, Huo H. A new algorithm for DNA motif discovery using multiple sample sequence sets. J Bioinform Comput Biol 2019; 17:1950021. [PMID: 31617465 DOI: 10.1142/s0219720019500215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
DNA motif discovery plays an important role in understanding the mechanisms of gene regulation. Most existing motif discovery algorithms can identify motifs in an efficient and effective manner when dealing with small datasets. However, large datasets generated by high-throughput sequencing technologies pose a huge challenge: it is too time-consuming to process the entire dataset, but if only a small sample sequence set is processed, it is difficult to identify infrequent motifs. In this paper, we propose a new DNA motif discovery algorithm that first divides the input dataset into multiple sample sequence sets, then refines the initial motifs of each sample sequence set with the expectation maximization method, and finally combines the results from all sample sequence sets. In addition, we design a new initial motif generation method that uses the entire dataset, which helps to identify infrequent motifs. The experimental results on simulated data show that the proposed algorithm has better time performance on large datasets and identifies infrequent motifs more accurately than the compared algorithms. We have also verified the validity of the proposed algorithm on real data.
Affiliation(s)
- Qiang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, 710071, P. R. China
- Xiang Zhao
- School of Computer Science and Technology, Xidian University, Xi'an, 710071, P. R. China
- Hongwei Huo
- School of Computer Science and Technology, Xidian University, Xi'an, 710071, P. R. China

17
Kiesel A, Roth C, Ge W, Wess M, Meier M, Söding J. The BaMM web server for de-novo motif discovery and regulatory sequence analysis. Nucleic Acids Res 2019; 46:W215-W220. [PMID: 29846656 PMCID: PMC6030882 DOI: 10.1093/nar/gky431] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 02/14/2018] [Accepted: 05/09/2018] [Indexed: 12/25/2022] Open
Abstract
The BaMM web server offers four tools: (i) de-novo discovery of enriched motifs in a set of nucleotide sequences, (ii) scanning a set of nucleotide sequences with motifs to find motif occurrences, (iii) searching with an input motif for similar motifs in our BaMM database with motifs for >1000 transcription factors, trained from the GTRD ChIP-seq database and (iv) browsing and keyword searching the motif database. In contrast to most other servers, we represent sequence motifs not by position weight matrices (PWMs) but by Bayesian Markov Models (BaMMs) of order 4, which we showed previously to perform substantially better in ROC analyses than PWMs or first order models. To address the inadequacy of P- and E-values as measures of motif quality, we introduce the AvRec score, the average recall over the TP-to-FP ratio between 1 and 100. The BaMM server is freely accessible without registration at https://bammmotif.mpibpc.mpg.de.
Affiliation(s)
- Anja Kiesel
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Christian Roth
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Wanwan Ge
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Maximilian Wess
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Markus Meier
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Johannes Söding
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany

18
Lidschreiber M, Easter AD, Battaglia S, Rodríguez-Molina JB, Casañal A, Carminati M, Baejen C, Grzechnik P, Maier KC, Cramer P, Passmore LA. The APT complex is involved in non-coding RNA transcription and is distinct from CPF. Nucleic Acids Res 2019; 46:11528-11538. [PMID: 30247719 PMCID: PMC6265451 DOI: 10.1093/nar/gky845] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 05/14/2018] [Accepted: 09/11/2018] [Indexed: 11/15/2022] Open
Abstract
The 3'-ends of eukaryotic pre-mRNAs are processed in the nucleus by a large multiprotein complex, the cleavage and polyadenylation factor (CPF). CPF cleaves RNA, adds a poly(A) tail and signals transcription termination. CPF harbors four enzymatic activities essential for these processes, but how these are coordinated remains poorly understood. Several subunits of CPF, including two protein phosphatases, are also found in the related 'associated with Pta1' (APT) complex, but the relationship between CPF and APT is unclear. Here, we show that the APT complex is physically distinct from CPF. The 21 kDa Syc1 protein is associated only with APT, and not with CPF, and is therefore the defining subunit of APT. Using ChIP-seq, PAR-CLIP and RNA-seq, we show that Syc1/APT has distinct, but possibly overlapping, functions from those of CPF. Syc1/APT plays a more important role in sn/snoRNA production whereas CPF processes the 3'-ends of protein-coding pre-mRNAs. These results define distinct protein machineries for synthesis of mature eukaryotic protein-coding and non-coding RNAs.
Affiliation(s)
- Michael Lidschreiber
- Department of Molecular Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany; Karolinska Institutet, Department of Biosciences and Nutrition, Center for Innovative Medicine and Science for Life Laboratory, Novum, Hälsovägen 7, 141 83 Huddinge, Sweden
- Sofia Battaglia
- Department of Molecular Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
- Ana Casañal
- MRC Laboratory of Molecular Biology, Cambridge CB2 0QH, UK
- Carlo Baejen
- Department of Molecular Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
- Pawel Grzechnik
- School of Biosciences, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK
- Kerstin C Maier
- Department of Molecular Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
- Patrick Cramer
- Department of Molecular Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany; Karolinska Institutet, Department of Biosciences and Nutrition, Center for Innovative Medicine and Science for Life Laboratory, Novum, Hälsovägen 7, 141 83 Huddinge, Sweden

19
Zhang S, Liang Y, Wang X, Su Z, Chen Y. FisherMP: fully parallel algorithm for detecting combinatorial motifs from large ChIP-seq datasets. DNA Res 2019; 26:231-242. [PMID: 30957858 PMCID: PMC6589551 DOI: 10.1093/dnares/dsz004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/11/2018] [Accepted: 03/05/2019] [Indexed: 11/14/2022] Open
Abstract
Detecting binding motifs of combinatorial transcription factors (TFs) from chromatin immunoprecipitation sequencing (ChIP-seq) experiments is an important and challenging computational problem for understanding gene regulation. Although a number of motif-finding algorithms have been presented, most are either time consuming or have sub-optimal accuracy when processing large-scale datasets. In this article, we present a fully parallelized algorithm for detecting combinatorial motifs from ChIP-seq datasets by using the Fisher combined method and an OpenMP parallel design. Large-scale validations on both synthetic data and 350 ChIP-seq datasets from the ENCODE database showed that FisherMP is not only very fast on large datasets but also highly accurate when compared with multiple popular methods. By using FisherMP, we successfully detected combinatorial motifs of CTCF, YY1, MAZ, STAT3 and USF2 in chromosome X, suggesting that they are functional co-players in gene regulation and chromosomal organization. Integrative and statistical analysis of these TF-binding peaks clearly demonstrates that they are not only highly coordinated with each other, but also correlated with histone modifications. FisherMP can be applied for integrative analysis of binding motifs and for predicting cis-regulatory modules from a large number of ChIP-seq datasets.
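The abstract names the Fisher combined method without spelling it out. As a reminder of the textbook statistic only, the sketch below combines k independent p-values via X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. How FisherMP applies this to motif scoring is not stated in the abstract, so the function name and inputs here are illustrative assumptions.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p). Assumes the p-values are independent
    and lie in (0, 1].
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    half = statistic / 2.0
    # For even degrees of freedom 2k the chi-square survival function has
    # the closed form exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


stat, combined = fisher_combined_pvalue([0.05, 0.05])
```

Two individually borderline p-values of 0.05 combine to roughly 0.017, which is the intuition behind pooling evidence across sources. The closed form avoids a SciPy dependency; `scipy.stats.combine_pvalues` offers the same method where SciPy is available.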
Affiliation(s)
- Shaoqiang Zhang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
- Ying Liang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
- Xiangyun Wang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
- Zhengchang Su
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, NC, USA
- Yong Chen
- Department of Biological Sciences, Center for Systems Biology, the University of Texas at Dallas, Richardson, TX, USA

20
Zhang H, Zhu L, Huang DS. DiscMLA: An Efficient Discriminative Motif Learning Algorithm over High-Throughput Datasets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:1810-1820. [PMID: 27164602 DOI: 10.1109/tcbb.2016.2561930] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 06/05/2023]
Abstract
Transcription factors (TFs) can activate or suppress gene expression by binding to specific sites and hence are crucial regulatory elements for transcription. Recently, a series of discriminative motif finders has been developed, offering a promising strategy for harnessing the large quantities of accumulated high-throughput experimental data. However, in order to achieve high speed, these algorithms have to sacrifice accuracy by employing simplified statistical models during the search process. In this paper, we propose a novel approach named Discriminative Motif Learning via AUC (DiscMLA) to discover motifs on high-throughput datasets. Unlike previous approaches, DiscMLA optimizes a more comprehensive criterion (AUC) during motif searching. In addition, based on an experimental observation of motif identification on large-scale datasets, some novel procedures are designed to accelerate DiscMLA. The experimental results on 52 real-world datasets demonstrate that our approach substantially outperforms previous methods on discriminative motif learning problems. DiscMLA's stability, discriminability, and validity will help to exploit high-throughput datasets and answer many fundamental biological questions.

21
Tran NTL, Huang CH. MODSIDE: a motif discovery pipeline and similarity detector. BMC Genomics 2018; 19:755. [PMID: 30340511 PMCID: PMC6194616 DOI: 10.1186/s12864-018-5148-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/24/2018] [Accepted: 10/08/2018] [Indexed: 01/06/2023] Open
Abstract
Background: Previous studies demonstrate the usefulness of using multiple tools and methods for improving the accuracy of motif detection. Over the past years, numerous motif discovery pipelines have been developed. However, they typically report only the top ranked results either from individual motif finders or from a combination of multiple tools and algorithms. Results: Here we present MODSIDE, a motif discovery pipeline and similarity detector. The pipeline integrates four de novo motif finders: ChIPMunk, MEME, Weeder, and XXmotif. It also incorporates a motif similarity detection tool, MOTIFSIM. MODSIDE was designed to deliver not only the predictive results from individual motif finders but also the comparison results for multiple tools. The results include the common significant motifs from multiple tools, the motifs detected by some tools but not by others, and the best matches for each motif in the motif collection of multiple tools. MODSIDE also possesses other useful features for merging similar motifs and clustering motifs into motif trees. Conclusions: We evaluated MODSIDE and its adopted motif finders on 16 benchmark datasets. The statistical results demonstrate that MODSIDE achieves better accuracy than individual motif finders. We also compared MODSIDE with two popular motif discovery pipelines: MEME-ChIP and RSAT peak-motifs. The comparison results reveal that MODSIDE attains performance similar to that of RSAT peak-motifs but better accuracy than MEME-ChIP. In addition, MODSIDE is able to deliver various comparison results that are not offered by MEME-ChIP, RSAT peak-motifs, and other existing motif discovery pipelines. Electronic supplementary material: The online version of this article (10.1186/s12864-018-5148-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Ngoc Tam L Tran
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, 06269, USA
- Chun-Hsi Huang
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, 06269, USA

22
Martins-Santana L, Nora LC, Sanches-Medeiros A, Lovate GL, Cassiano MHA, Silva-Rocha R. Systems and Synthetic Biology Approaches to Engineer Fungi for Fine Chemical Production. Front Bioeng Biotechnol 2018; 6:117. [PMID: 30338257 PMCID: PMC6178918 DOI: 10.3389/fbioe.2018.00117] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 04/09/2018] [Accepted: 08/02/2018] [Indexed: 01/16/2023] Open
Abstract
Since the advent of systems and synthetic biology, many studies have sought to harness microbes as cell factories through genetic and metabolic engineering approaches. Yeast and filamentous fungi have been successfully harnessed to produce fine and high value-added chemical products. In this review, we present some of the most promising advances from recent years in the use of fungi for this purpose, focusing on the manipulation of fungal strains using systems and synthetic biology tools to improve metabolic flow and the flow of secondary metabolites by pathway redesign. We also review the roles of bioinformatics analysis and predictions in synthetic circuits, highlighting in silico systemic approaches to improve the efficiency of synthetic modules.
Affiliation(s)
- Leonardo Martins-Santana
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Luisa C Nora
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Ananda Sanches-Medeiros
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Gabriel L Lovate
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Murilo H A Cassiano
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Rafael Silva-Rocha
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil

23
Al-Ouran R, Schmidt R, Naik A, Jones J, Drews F, Juedes D, Elnitski L, Welch L. Discovering Gene Regulatory Elements Using Coverage-Based Heuristics. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:1290-1300. [PMID: 26540692 DOI: 10.1109/tcbb.2015.2496261] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
Data mining algorithms and sequencing methods (such as RNA-seq and ChIP-seq) are being combined to discover genomic regulatory motifs that relate to a variety of phenotypes. However, motif discovery algorithms often produce very long lists of putative transcription factor binding sites, hindering the discovery of phenotype-related regulatory elements by making it difficult to select a manageable set of candidate motifs for experimental validation. To address this issue, the authors introduce the motif selection problem and provide coverage-based search heuristics for its solution. Analysis of 203 ChIP-seq experiments from the ENCyclopedia of DNA Elements project shows that our algorithms produce motifs that have high sensitivity and specificity and reveals new insights about the regulatory code of the human genome. The greedy algorithm performs the best, selecting a median of two motifs per ChIP-seq transcription factor group while achieving a median sensitivity of 77 percent.
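The abstract reports that a greedy heuristic performed best but does not give its details. The sketch below shows the generic greedy set-cover idea that coverage-based selection usually builds on: repeatedly pick the motif that covers the most not-yet-covered sequences. The input layout (a dict from motif name to the set of sequence IDs it occurs in), the `target_fraction` stopping rule, and the toy data are assumptions for illustration, not the authors' formulation.

```python
def greedy_motif_selection(coverage, target_fraction=1.0):
    """Greedily select motifs until enough sequences are covered.

    coverage: dict mapping motif name -> set of sequence IDs containing it.
    target_fraction: fraction of coverable sequences to reach before stopping.
    Returns (selected motifs in pick order, set of covered sequence IDs).
    """
    universe = set().union(*coverage.values()) if coverage else set()
    needed = target_fraction * len(universe)
    covered, selected = set(), []
    while len(covered) < needed:
        # Pick the motif with the largest marginal gain in coverage;
        # ties go to the motif that appears first in the dict.
        best = max(coverage, key=lambda m: len(coverage[m] - covered))
        gain = coverage[best] - covered
        if not gain:
            break
        selected.append(best)
        covered |= gain
    return selected, covered


# Hypothetical toy input: four candidate motifs over six sequences.
candidates = {
    "m1": {1, 2, 3, 4},
    "m2": {3, 4, 5},
    "m3": {5, 6},
    "m4": {1},
}
picked, covered_ids = greedy_motif_selection(candidates)
```

On the toy input two motifs suffice to cover all six sequences, which mirrors the abstract's point that a small, manageable motif set can account for most of the data.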

24
Yu Q, Wei D, Huo H. SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets. BMC Bioinformatics 2018; 19:228. [PMID: 29914360 PMCID: PMC6006848 DOI: 10.1186/s12859-018-2242-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 02/02/2018] [Accepted: 06/12/2018] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND: Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. RESULTS: We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D. CONCLUSIONS: We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D', rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
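The qPMS definition in the Background translates directly into a brute-force check for a single candidate string. The sketch below only verifies whether a given l-length string satisfies the quorum condition; it is not SamSelect and not an efficient qPMS solver, and the function names and toy sequences are illustrative assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif, seq, d):
    """True if some l-length window of seq is within d mismatches of motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def is_quorum_motif(motif, sequences, q, d):
    """Check the qPMS condition: motif occurs (up to d mismatches)
    in at least q * t of the t input sequences."""
    hits = sum(occurs(motif, s, d) for s in sequences)
    return hits >= q * len(sequences)


# Hypothetical toy dataset: t = 4 sequences of length n = 8.
seqs = ["ACGTACGT", "TTACGAAA", "GGGGGGGG", "ACCTTTTT"]
ok_075 = is_quorum_motif("ACGT", seqs, q=0.75, d=1)
ok_100 = is_quorum_motif("ACGT", seqs, q=1.0, d=1)
```

Each check costs on the order of t · n · l character comparisons, and a full search must consider many candidate strings, which is consistent with the abstract's observation that a large t makes qPMS expensive and motivates running it on a smaller sample set.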
Affiliation(s)
- Qiang Yu
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
- Dingbang Wei
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
- Hongwei Huo
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China

25
Zhu L, Zhang HB, Huang DS. LMMO: A Large Margin Approach for Refining Regulatory Motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:913-925. [PMID: 28391205 DOI: 10.1109/tcbb.2017.2691325] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 06/07/2023]
Abstract
Although discriminative motif discovery (DMD) methods are promising for eliciting motifs from high-throughput experimental data, they usually have to sacrifice accuracy and may fail to fully leverage the potential of large datasets. Recently, it has been demonstrated that the motifs identified by DMDs can be significantly improved by maximizing the area under the receiver-operating characteristic curve (AUC) metric, which has been widely used in the literature to rank the performance of elicited motifs. However, existing approaches for motif refinement choose to directly maximize the non-convex and discontinuous AUC itself, which is known to be difficult and may lead to suboptimal solutions. In this paper, we propose Large Margin Motif Optimizer (LMMO), a large-margin-type algorithm for refining regulatory motifs. By relaxing the AUC cost function with the surrogate convex hinge loss, we show that the resultant learning problem can be cast as an instance of difference-of-convex (DC) programs, and solve it iteratively using the constrained concave-convex procedure (CCCP). To further save computational time, we combine LMMO with existing techniques for improving the scalability of large-margin-type algorithms, such as the cutting plane method. Experimental evaluations on synthetic and real data illustrate the performance of the proposed approach. The code of LMMO is freely available at: https://github.com/ekffar/LMMO.
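To make the relaxation mentioned above concrete, the sketch below computes the pairwise (rank-based) AUC and a pairwise hinge surrogate on plain score lists. It illustrates only why the hinge loss is a usable upper bound on 1 - AUC; the DC formulation, the CCCP solver and the motif scoring function of LMMO are not reproduced, and the function names, the unit margin and the toy scores are assumptions.

```python
def pairwise_auc(pos, neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half a correct pair."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pairwise_hinge_loss(pos, neg, margin=1.0):
    """Convex surrogate: mean of max(0, margin - (s_pos - s_neg)) over pairs.

    With margin = 1 each term is at least 1 whenever the pair is
    mis-ranked or tied, so the loss upper-bounds 1 - AUC.
    """
    total = sum(max(0.0, margin - (p - n)) for p in pos for n in neg)
    return total / (len(pos) * len(neg))


# Hypothetical scores for bound (positive) and unbound (negative) sequences.
pos_scores = [2.0, 0.5]
neg_scores = [1.0, 0.0]
auc = pairwise_auc(pos_scores, neg_scores)
hinge = pairwise_hinge_loss(pos_scores, neg_scores)
```

The AUC is a step function of the scores and therefore has zero gradient almost everywhere, whereas the hinge surrogate is piecewise linear in the score differences, which is what makes gradient- or cutting-plane-based optimization applicable.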

26
Liu L, Zhi Q, Shen M, Gong FR, Zhou BP, Lian L, Shen B, Chen K, Duan W, Wu MY, Tao M, Li W. FH535, a β-catenin pathway inhibitor, represses pancreatic cancer xenograft growth and angiogenesis. Oncotarget 2018; 7:47145-47162. [PMID: 27323403 PMCID: PMC5216931 DOI: 10.18632/oncotarget.9975] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 02/25/2016] [Accepted: 05/17/2016] [Indexed: 12/30/2022] Open
Abstract
The WNT/β-catenin pathway plays an important role in pancreatic cancer carcinogenesis. We evaluated the correlation between aberrant β-catenin pathway activation and the prognosis of pancreatic cancer, and the potential of applying the β-catenin pathway inhibitor FH535 to pancreatic cancer treatment. Meta-analysis and immunohistochemistry showed that abnormal β-catenin pathway activation was associated with unfavorable outcome. FH535 repressed pancreatic cancer xenograft growth in vivo. Gene Ontology (GO) analysis of microarray data indicated that target genes responding to FH535 participated in stemness maintenance. Real-time PCR and flow cytometry confirmed that FH535 downregulated CD24 and CD44, pancreatic cancer stem cell (CSC) markers, suggesting that FH535 impairs pancreatic CSC stemness. GO analysis of β-catenin chromatin immunoprecipitation sequencing data identified angiogenesis-related gene regulation. Immunohistochemistry showed that higher microvessel density correlated with elevated nuclear β-catenin expression and unfavorable outcome. FH535 repressed the secretion of the proangiogenic cytokines vascular endothelial growth factor (VEGF), interleukin (IL)-6, IL-8, and tumor necrosis factor-α, and also inhibited angiogenesis in vitro and in vivo. Protein and mRNA microarrays revealed that FH535 downregulated the proangiogenic genes ANGPT2, VEGFR3, IFN-γ, PLAUR, THPO, TIMP1, and VEGF. FH535 not only represses pancreatic CSC stemness in vitro, but also remodels the tumor microenvironment by repressing angiogenesis, warranting further clinical investigation.
Affiliation(s)
- Lu Liu
- Department of Oncology, The First Affiliated Hospital of Soochow University, Suzhou, China
| | - Qiaoming Zhi
- Department of General Surgery, The First Affiliated Hospital of Soochow University, Suzhou, China
| | - Meng Shen
- Department of Oncology, The First Affiliated Hospital of Soochow University, Suzhou, China
| | - Fei-Ran Gong
- Department of Hematology, The First Affiliated Hospital of Soochow University, Suzhou, China
| | - Binhua P Zhou
- Markey Cancer Center, University of Kentucky, Lexington, KY, USA.,Departments of Molecular and Cellular Biochemistry, University of Kentucky College of Medicine, Lexington, KY, USA
| | - Lian Lian
- Department of Oncology, The First Affiliated Hospital of Soochow University, Suzhou, China.,Department of Oncology, Suzhou Xiangcheng People's Hospital, Suzhou, China.,Department of Pathology, Suzhou Xiangcheng People's Hospital, Suzhou, China
| | - Bairong Shen
- Center for Systems Biology, Soochow University, Suzhou, China
| | - Kai Chen
- Department of Oncology, The First Affiliated Hospital of Soochow University, Suzhou, China
| | - Weiming Duan
- Department of Oncology, The First Affiliated Hospital of Soochow University, Suzhou, China
| | - Meng-Yao Wu
- Department of Oncology, The First Affiliated Hospital of Soochow University, Suzhou, China
| | - Min Tao
- Department of Oncology, The First Affiliated Hospital of Soochow University, Suzhou, China.,PREMED Key Laboratory for Precision Medicine, Soochow University, Suzhou, China.,Jiangsu Institute of Clinical Immunology, Suzhou, China.,Institute of Medical Biotechnology, Soochow University, Suzhou, China
| | - Wei Li
- Department of Oncology, The First Affiliated Hospital of Soochow University, Suzhou, China.,Markey Cancer Center, University of Kentucky, Lexington, KY, USA.,Center for Systems Biology, Soochow University, Suzhou, China.,PREMED Key Laboratory for Precision Medicine, Soochow University, Suzhou, China.,Jiangsu Institute of Clinical Immunology, Suzhou, China
27
Hickman R, Van Verk MC, Van Dijken AJH, Mendes MP, Vroegop-Vos IA, Caarls L, Steenbergen M, Van der Nagel I, Wesselink GJ, Jironkin A, Talbot A, Rhodes J, De Vries M, Schuurink RC, Denby K, Pieterse CMJ, Van Wees SCM. Architecture and Dynamics of the Jasmonic Acid Gene Regulatory Network. THE PLANT CELL 2017; 29:2086-2105. [PMID: 28827376 PMCID: PMC5635973 DOI: 10.1105/tpc.16.00958] [Citation(s) in RCA: 169] [Impact Index Per Article: 21.1] [Received: 12/22/2016] [Revised: 07/05/2017] [Accepted: 08/17/2017] [Indexed: 05/18/2023]
Abstract
Jasmonic acid (JA) is a critical hormonal regulator of plant growth and defense. To advance our understanding of the architecture and dynamic regulation of the JA gene regulatory network, we performed a high-resolution RNA-seq time series of methyl JA-treated Arabidopsis thaliana at 15 time points over a 16-h period. Computational analysis showed that methyl JA (MeJA) induces a burst of transcriptional activity, generating diverse expression patterns over time that partition into distinct sectors of the JA response targeting specific biological processes. The presence of transcription factor (TF) DNA binding motifs correlated with specific TF activity during temporal MeJA-induced transcriptional reprogramming. Insight into the underlying dynamic transcriptional regulation mechanisms was captured in a chronological model of the JA gene regulatory network. Several TFs, including MYB59 and bHLH27, were uncovered as early network components with a role in pathogen and insect resistance. Analysis of subnetworks surrounding the TFs ORA47, RAP2.6L, MYB59, and ANAC055, using transcriptome profiling of overexpressors and mutants, provided insights into their regulatory role in defined modules of the JA network. Collectively, our work illuminates the complexity of the JA gene regulatory network, pinpoints and validates previously unknown regulators, and provides a valuable resource for functional studies on JA signaling components in plant defense and development.
Affiliation(s)
- Richard Hickman
- Plant-Microbe Interactions, Department of Biology, Utrecht University, 3508 TB, Utrecht, The Netherlands
| | - Marcel C Van Verk
- Plant-Microbe Interactions, Department of Biology, Utrecht University, 3508 TB, Utrecht, The Netherlands
- Bioinformatics, Department of Biology, Utrecht University, 3508 TB, Utrecht, The Netherlands
| | - Anja J H Van Dijken
- Plant-Microbe Interactions, Department of Biology, Utrecht University, 3508 TB, Utrecht, The Netherlands
| | - Marciel Pereira Mendes
- Plant-Microbe Interactions, Department of Biology, Utrecht University, 3508 TB, Utrecht, The Netherlands
| | - Irene A Vroegop-Vos
- Plant-Microbe Interactions, Department of Biology, Utrecht University, 3508 TB, Utrecht, The Netherlands
| | - Lotte Caarls
- Plant-Microbe Interactions, Department of Biology, Utrecht University, 3508 TB, Utrecht, The Netherlands
| | - Merel Steenbergen
- Plant-Microbe Interactions, Department of Biology, Utrecht University, 3508 TB, Utrecht, The Netherlands
| | - Ivo Van der Nagel
- Plant-Microbe Interactions, Department of Biology, Utrecht University, 3508 TB, Utrecht, The Netherlands
| | - Gert Jan Wesselink
- Plant-Microbe Interactions, Department of Biology, Utrecht University, 3508 TB, Utrecht, The Netherlands
| | - Aleksey Jironkin
- Warwick Systems Biology Centre, University of Warwick, Coventry CV4 7AL, United Kingdom
| | - Adam Talbot
- School of Life Sciences, University of Warwick, Coventry CV4 7AL, United Kingdom
- Department of Biology, University of York, York YO10 5DD, United Kingdom
| | - Johanna Rhodes
- Warwick Systems Biology Centre, University of Warwick, Coventry CV4 7AL, United Kingdom
| | - Michel De Vries
- Plant Physiology, Swammerdam Institute for Life Sciences, University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands
| | - Robert C Schuurink
- Plant Physiology, Swammerdam Institute for Life Sciences, University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands
| | - Katherine Denby
- Warwick Systems Biology Centre, University of Warwick, Coventry CV4 7AL, United Kingdom
- School of Life Sciences, University of Warwick, Coventry CV4 7AL, United Kingdom
- Department of Biology, University of York, York YO10 5DD, United Kingdom
| | - Corné M J Pieterse
- Plant-Microbe Interactions, Department of Biology, Utrecht University, 3508 TB, Utrecht, The Netherlands
| | - Saskia C M Van Wees
- Plant-Microbe Interactions, Department of Biology, Utrecht University, 3508 TB, Utrecht, The Netherlands
28
Chandrasekaran U, Yi W, Gupta S, Weng CH, Giannopoulou E, Chinenov Y, Jessberger R, Weaver CT, Bhagat G, Pernis AB. Regulation of Effector Treg Cells in Murine Lupus. Arthritis Rheumatol 2017; 68:1454-66. [PMID: 26816213 DOI: 10.1002/art.39599] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 08/10/2015] [Accepted: 01/14/2016] [Indexed: 01/05/2023]
Abstract
OBJECTIVE Treg cells need to acquire an effector phenotype to function in settings of inflammation. Whether effector Treg cells can limit disease severity in lupus is unknown. Interferon regulatory factor 4 (IRF-4) is an essential controller of effector Treg cells and regulates their ability to express interleukin-10 (IL-10). In non-Treg cells, IRF-4 activity is modulated by interactions with DEF-6 and its homolog switch-associated protein 70 (SWAP-70). Although mice lacking both DEF-6 and SWAP-70 (double-knockout [DKO] mice) develop lupus, they display normal survival, suggesting that in DKO mice, Treg cells can moderate disease development. The purpose of this study was to investigate whether Treg cells from DKO mice have an increased capacity to become effector Treg cells due to the ability of DEF-6 and SWAP-70 to restrain IRF-4 activity. METHODS Treg cells were evaluated by fluorescence-activated cell sorting. The B lymphocyte-induced maturation protein 1 (BLIMP-1)/IL-10 axis was assessed by crossing DKO mice with BLIMP-1-YFP-10BiT dual-reporter mice. Deletion of IRF-4 in Treg cells from DKO mice was achieved by generating FoxP3(Cre) IRF-4(fl/fl) DKO mice. RESULTS The concomitant absence of DEF-6 and SWAP-70 led to increased numbers of Treg cells, which acquired an effector phenotype in a cell-intrinsic manner. In addition, Treg cells from DKO mice exhibited enhanced expression of the BLIMP-1/IL-10 axis. Notably, DKO effector Treg cells survived and expanded as disease progressed. The accumulation of Treg cells from DKO mice was associated with the up-regulation of genes controlling autophagy. IRF-4 was required for the expansion and function of effector Treg cells from DKO mice. CONCLUSION This study revealed the existence of mechanisms that, by acting on IRF-4, can fine-tune the function and survival of effector Treg cells in lupus. 
These findings suggest that the existence of a powerful effector Treg cell compartment that successfully survives in an unfavorable inflammatory environment could limit disease development.
Affiliation(s)
| | - Woelsung Yi
- Hospital for Special Surgery, New York, New York
| | - Sanjay Gupta
- Hospital for Special Surgery, New York, New York
| | - Chien-Huan Weng
- Hospital for Special Surgery and Weill Cornell Graduate School of Medical Sciences, New York, New York
| | - Eugenia Giannopoulou
- Hospital for Special Surgery, New York, and New York City College of Technology, City University of New York, Brooklyn, New York
| | - Govind Bhagat
- Columbia University Medical Center and New York Presbyterian Hospital, New York, New York
| | - Alessandra B Pernis
- Hospital for Special Surgery, Weill Cornell Graduate School of Medical Sciences, and Weill Cornell Medicine, Cornell University, New York, New York
29
Zhang H, Zhu L, Huang DS. WSMD: weakly-supervised motif discovery in transcription factor ChIP-seq data. Sci Rep 2017; 7:3217. [PMID: 28607381 PMCID: PMC5468353 DOI: 10.1038/s41598-017-03554-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 12/15/2016] [Accepted: 05/02/2017] [Indexed: 01/24/2023] Open
Abstract
Although discriminative motif discovery (DMD) methods are promising for eliciting motifs from high-throughput experimental data, computational expense forces most existing DMD methods to adopt approximate schemes that greatly restrict the search space, leading to significant loss of predictive accuracy. In this paper, we propose Weakly-Supervised Motif Discovery (WSMD) to discover motifs from ChIP-seq datasets. In contrast to the learning strategies adopted by previous DMD methods, WSMD allows a "global" optimization scheme of the motif parameters in continuous space, thereby reducing the information loss of model representation and improving the quality of the resulting motifs. Meanwhile, by exploiting the connection between the DMD framework and existing weakly supervised learning (WSL) technologies, we also present highly scalable learning strategies for the proposed method. The experimental results on both real ChIP-seq datasets and synthetic datasets show that WSMD substantially outperforms former DMD methods (including DREME, HOMER, XXmotif, motifRG and DECOD) in terms of predictive accuracy, while also achieving a competitive computational speed.
Affiliation(s)
- Hongbo Zhang
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - Lin Zhu
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China.
30
Gallone G, Haerty W, Disanto G, Ramagopalan SV, Ponting CP, Berlanga-Taylor AJ. Identification of genetic variants affecting vitamin D receptor binding and associations with autoimmune disease. Hum Mol Genet 2017; 26:2164-2176. [PMID: 28335003 PMCID: PMC5886188 DOI: 10.1093/hmg/ddx092] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 10/16/2016] [Revised: 02/28/2017] [Accepted: 03/07/2017] [Indexed: 01/24/2023] Open
Abstract
Large numbers of statistically significant associations between sentinel SNPs and case-control status have been replicated by genome-wide association studies. Nevertheless, few underlying molecular mechanisms of complex disease are currently known. We investigated whether variation in binding of a transcription factor, the vitamin D receptor (VDR), whose activating ligand vitamin D has been proposed as a modifiable factor in multiple disorders, could explain any of these associations. VDR modifies gene expression by binding DNA as a heterodimer with the Retinoid X receptor (RXR). We identified 43,332 genetic variants significantly associated with altered VDR binding affinity (VDR-BVs) using a high-resolution (ChIP-exo) genome-wide analysis of 27 HapMap lymphoblastoid cell lines. VDR-BVs are enriched in consensus RXR::VDR binding motifs, yet most fell outside of these motifs, implying that genetic variation often affects the binding affinity only indirectly. Finally, we compared 341 VDR-BVs replicating by position in multiple individuals against background sets of variants lying within VDR-binding regions that had been matched in allele frequency and were independent with respect to linkage disequilibrium. In this stringent test, these replicated VDR-BVs were significantly (q < 0.1) and substantially (>2-fold) enriched in genomic intervals associated with autoimmune and other diseases, including inflammatory bowel disease, Crohn's disease and rheumatoid arthritis. The approach's validity is underscored by RXR::VDR motif sequence being predictive of binding strength and being evolutionarily constrained. Our findings are consistent with altered RXR::VDR binding contributing to immunity-related diseases. Replicated VDR-BVs associated with these disorders could represent causal disease risk alleles whose effect may be modifiable by vitamin D levels.
Affiliation(s)
- Giuseppe Gallone
- MRC Functional Genomics Unit
- Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3PT, UK
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
| | - Wilfried Haerty
- MRC Functional Genomics Unit
- Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3PT, UK
| | - Giulio Disanto
- Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3PT, UK
| | - Chris P. Ponting
- MRC Functional Genomics Unit
- Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3PT, UK
- MRC Human Genetics Unit, The Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
| | - Antonio J. Berlanga-Taylor
- Wellcome Trust Centre for Human Genetics, Nuffield Department of Clinical Medicine, University of Oxford, Oxford OX3 7BN, UK
- CGAT, MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford OX1 3PT, UK
- MRC-PHE Centre for Environment and Health, Department of Epidemiology & Biostatistics, School of Public Health, Faculty of Medicine, Imperial College London, St Mary’s Campus, Norfolk Place, London W2 1PG, UK
31
Yan Q, Xia X, Sun Z, Fang Y. Depletion of Arabidopsis SC35 and SC35-like serine/arginine-rich proteins affects the transcription and splicing of a subset of genes. PLoS Genet 2017; 13:e1006663. [PMID: 28273088 PMCID: PMC5362245 DOI: 10.1371/journal.pgen.1006663] [Citation(s) in RCA: 75] [Impact Index Per Article: 9.4] [Received: 10/16/2016] [Revised: 03/22/2017] [Accepted: 02/28/2017] [Indexed: 12/23/2022] Open
Abstract
Serine/arginine-rich (SR) proteins are important splicing factors which play significant roles in spliceosome assembly and splicing regulation. However, little is known regarding their biological functions in plants. Here, we analyzed the phenotypes of mutants upon depleting different subfamilies of Arabidopsis SR proteins. We found that loss of the functions of SC35 and SC35-like (SCL) proteins causes pleiotropic changes in plant morphology and development, including serrated leaves, late flowering, shorter roots and abnormal silique phyllotaxy. Using RNA-seq, we found that SC35 and SCL proteins play roles in pre-mRNA splicing. Motif analysis revealed that SC35 and SCL proteins preferentially bind to a specific RNA sequence containing the AGAAGA motif. In addition, the transcription of a subset of genes is affected by the deletion of SC35 and SCL proteins, which interact with NRPB4, a specific subunit of RNA polymerase II. The splicing of FLOWERING LOCUS C (FLC) intron1 and transcription of FLC were significantly regulated by SC35 and SCL proteins to control Arabidopsis flowering. Therefore, our findings provide mechanistic insight into the functions of plant SC35 and SCL proteins in the regulation of splicing and transcription in a direct or indirect manner to maintain the proper expression of genes and development. SR proteins were identified to be important splicing factors. This work generated mutants of different subfamilies of the classic Arabidopsis SR proteins. Genetic analysis revealed that loss of the function of SC35/SCL proteins influences plant development. This study revealed that SC35/SCL proteins regulate alternative splicing, preferentially bind a specific RNA motif, interact with NRPB4, and affect the transcription of a subset of genes. This study further revealed that SC35/SCL proteins control flowering by regulating the splicing and transcription of FLC. These results shed light on the functions of SR proteins in plants.
Affiliation(s)
- Qingqing Yan
- National key Laboratory of Plant Molecular Genetics, Chinese Academy of Sciences Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Shanghai, China
| | - Xi Xia
- National key Laboratory of Plant Molecular Genetics, Chinese Academy of Sciences Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Shanghai, China
| | - Zhenfei Sun
- National key Laboratory of Plant Molecular Genetics, Chinese Academy of Sciences Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Shanghai, China
| | - Yuda Fang
- National key Laboratory of Plant Molecular Genetics, Chinese Academy of Sciences Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Shanghai, China
32
Liu B, Yang J, Li Y, McDermaid A, Ma Q. An algorithmic perspective of de novo cis-regulatory motif finding based on ChIP-seq data. Brief Bioinform 2017; 19:1069-1081. [DOI: 10.1093/bib/bbx026] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 12/20/2016] [Indexed: 01/06/2023] Open
Affiliation(s)
- Bingqiang Liu
- School of Mathematics, Shandong University, Jinan Shandong, P. R. China
| | - Jinyu Yang
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
| | - Yang Li
- School of Mathematics, Shandong University, Jinan Shandong, P. R. China
| | - Adam McDermaid
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
| | - Qin Ma
- Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA
33
Siebert M, Söding J. Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences. Nucleic Acids Res 2016; 44:6055-69. [PMID: 27288444 PMCID: PMC5291271 DOI: 10.1093/nar/gkw521] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.8] [Received: 04/11/2016] [Accepted: 05/29/2016] [Indexed: 01/01/2023] Open
Abstract
Position weight matrices (PWMs) are the standard model for DNA and RNA regulatory motifs. In PWMs nucleotide probabilities are independent of nucleotides at other positions. Models that account for dependencies need many parameters and are prone to overfitting. We have developed a Bayesian approach for motif discovery using Markov models in which conditional probabilities of order k - 1 act as priors for those of order k. This Bayesian Markov model (BaMM) training automatically adapts model complexity to the amount of available data. We also derive an EM algorithm for de-novo discovery of enriched motifs. For transcription factor binding, BaMMs achieve significantly (P = 1/16) higher cross-validated partial AUC than PWMs in 97% of 446 ChIP-seq ENCODE datasets and improve performance by 36% on average. BaMMs also learn complex multipartite motifs, improving predictions of transcription start sites, polyadenylation sites, bacterial pause sites, and RNA binding sites by 26-101%. BaMMs never performed worse than PWMs. These robust improvements argue in favour of generally replacing PWMs by BaMMs.
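The idea that lower-order probabilities act as priors for higher-order ones can be illustrated with a small sketch. This is not the BaMM implementation: it is a position-independent toy model, and the pseudocount weight `alpha`, the function names, and the example sequences are all assumptions made for illustration. It shows how an order-1 conditional estimate falls back to the order-0 frequencies when a context has few or no observations.

```python
from collections import Counter

ALPHABET = "ACGT"

def order0_probs(seqs, pseudo=1.0):
    """Order-0 nucleotide frequencies with a flat pseudocount (assumed value)."""
    counts = Counter(ch for s in seqs for ch in s)
    total = sum(counts[a] for a in ALPHABET) + pseudo * len(ALPHABET)
    return {a: (counts[a] + pseudo) / total for a in ALPHABET}

def order1_probs(seqs, p0, alpha=2.0):
    """Order-1 conditionals P(x | prev), using the order-0 model as the prior.

    Rarely observed contexts stay close to p0; frequently observed
    contexts are dominated by their own counts.
    """
    pair, ctx = Counter(), Counter()
    for s in seqs:
        for prev, x in zip(s, s[1:]):
            pair[(prev, x)] += 1
            ctx[prev] += 1
    return {
        (prev, x): (pair[(prev, x)] + alpha * p0[x]) / (ctx[prev] + alpha)
        for prev in ALPHABET
        for x in ALPHABET
    }

# Hypothetical toy sequences, not data from the paper.
seqs = ["ACGT", "ACGA", "ACGG"]
p0 = order0_probs(seqs)
p1 = order1_probs(seqs, p0)
print(p1[("A", "C")])  # context "A" is always followed by "C" in the toy data
print(p1[("T", "A")])  # context "T" is never observed, so this equals p0["A"]
```

A real motif model would keep separate counts per motif position and extend the same interpolation to higher orders.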
Affiliation(s)
- Matthias Siebert
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany; Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, 81377 Munich, Germany
| | - Johannes Söding
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
34
Sloutskin A, Danino YM, Orenstein Y, Zehavi Y, Doniger T, Shamir R, Juven-Gershon T. ElemeNT: a computational tool for detecting core promoter elements. Transcription 2016. [PMID: 26226151 PMCID: PMC4581360 DOI: 10.1080/21541264.2015.1067286] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Indexed: 12/22/2022] Open
Abstract
Core promoter elements play a pivotal role in the transcriptional output, yet they are often detected manually within sequences of interest. Here, we present 2 contributions to the detection and curation of core promoter elements within given sequences. First, the Elements Navigation Tool (ElemeNT) is a user-friendly web-based, interactive tool for prediction and display of putative core promoter elements and their biologically-relevant combinations. Second, the CORE database summarizes ElemeNT-predicted core promoter elements near CAGE and RNA-seq-defined Drosophila melanogaster transcription start sites (TSSs). ElemeNT's predictions are based on biologically-functional core promoter elements, and can be used to infer core promoter compositions. ElemeNT does not assume prior knowledge of the actual TSS position, and can therefore assist in annotation of any given sequence. These resources, freely accessible at http://lifefaculty.biu.ac.il/gershon-tamar/index.php/resources, facilitate the identification of core promoter elements as active contributors to gene expression.
Affiliation(s)
- Anna Sloutskin
- The Mina and Everard Goodman Faculty of Life Sciences; Bar-Ilan University; Ramat Gan, Israel
35
Whitington T, Gao P, Song W, Ross-Adams H, Lamb AD, Yang Y, Svezia I, Klevebring D, Mills IG, Karlsson R, Halim S, Dunning MJ, Egevad L, Warren AY, Neal DE, Grönberg H, Lindberg J, Wei GH, Wiklund F. Gene regulatory mechanisms underpinning prostate cancer susceptibility. Nat Genet 2016; 48:387-97. [PMID: 26950096 DOI: 10.1038/ng.3523] [Citation(s) in RCA: 83] [Impact Index Per Article: 9.2] [Received: 11/22/2015] [Accepted: 02/08/2016] [Indexed: 12/29/2022]
Abstract
Molecular characterization of genome-wide association study (GWAS) loci can uncover key genes and biological mechanisms underpinning complex traits and diseases. Here we present deep, high-throughput characterization of gene regulatory mechanisms underlying prostate cancer risk loci. Our methodology integrates data from 295 prostate cancer chromatin immunoprecipitation and sequencing experiments with genotype and gene expression data from 602 prostate tumor samples. The analysis identifies new gene regulatory mechanisms affected by risk locus SNPs, including widespread disruption of ternary androgen receptor (AR)-FOXA1 and AR-HOXB13 complexes and competitive binding mechanisms. We identify 57 expression quantitative trait loci at 35 risk loci, which we validate through analysis of allele-specific expression. We further validate predicted regulatory SNPs and target genes in prostate cancer cell line models. Finally, our integrated analysis can be accessed through an interactive visualization tool. This analysis elucidates how genome sequence variation affects disease predisposition via gene regulatory mechanisms and identifies relevant genes for downstream biomarker and drug development.
Affiliation(s)
- Thomas Whitington
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Ping Gao
- Faculty of Biochemistry and Molecular Medicine, University of Oulu, Oulu, Finland.,Biocenter Oulu, University of Oulu, Oulu, Finland
| | - Wei Song
- Faculty of Biochemistry and Molecular Medicine, University of Oulu, Oulu, Finland.,Biocenter Oulu, University of Oulu, Oulu, Finland
| | - Helen Ross-Adams
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK
| | - Alastair D Lamb
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK.,Department of Urology, Addenbrooke's Hospital, Cambridge, UK
| | - Yuehong Yang
- Faculty of Biochemistry and Molecular Medicine, University of Oulu, Oulu, Finland.,Biocenter Oulu, University of Oulu, Oulu, Finland
| | - Ilaria Svezia
- Faculty of Biochemistry and Molecular Medicine, University of Oulu, Oulu, Finland.,Biocenter Oulu, University of Oulu, Oulu, Finland
| | - Daniel Klevebring
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Ian G Mills
- Prostate Cancer Research Group, Centre for Molecular Medicine Norway, Nordic EMBL Partnership, University of Oslo and Oslo University Hospital, Oslo, Norway.,Department of Molecular Oncology, Institute of Cancer Research, Oslo University Hospital, Oslo, Norway.,Prostate Cancer UK/Movember Centre of Excellence for Prostate Cancer Research, Centre for Cancer Research and Cell Biology, Queen's University Belfast, Belfast, UK
| | - Robert Karlsson
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Silvia Halim
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK.,Cancer Research UK Beatson Institute, Glasgow, UK
| | - Mark J Dunning
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK
| | - Lars Egevad
- Department of Pathology and Cytology, Karolinska University Hospital, Stockholm, Sweden.,Department of Oncology-Pathology, Karolinska Institutet, Stockholm, Sweden
| | - Anne Y Warren
- Department of Pathology, Addenbrooke's Hospital, Cambridge, UK
| | - David E Neal
- Nuffield Department of Surgical Sciences, University of Oxford, Oxford, UK
| | - Henrik Grönberg
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Johan Lindberg
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Gong-Hong Wei
- Faculty of Biochemistry and Molecular Medicine, University of Oulu, Oulu, Finland.,Biocenter Oulu, University of Oulu, Oulu, Finland
| | - Fredrik Wiklund
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
36
Boeva V. Analysis of Genomic Sequence Motifs for Deciphering Transcription Factor Binding and Transcriptional Regulation in Eukaryotic Cells. Front Genet 2016; 7:24. [PMID: 26941778 PMCID: PMC4763482 DOI: 10.3389/fgene.2016.00024] [Citation(s) in RCA: 97] [Impact Index Per Article: 10.8] [Received: 10/30/2015] [Accepted: 02/05/2016] [Indexed: 12/27/2022] Open
Abstract
Eukaryotic genomes contain a variety of structured patterns: repetitive elements, binding sites of DNA and RNA associated proteins, splice sites, and so on. Often, these structured patterns can be formalized as motifs and described using a proper mathematical model such as position weight matrix and IUPAC consensus. Two key tasks are typically carried out for motifs in the context of the analysis of genomic sequences. These are: identification in a set of DNA regions of over-represented motifs from a particular motif database, and de novo discovery of over-represented motifs. Here we describe existing methodology to perform these two tasks for motifs characterizing transcription factor binding. When applied to the output of ChIP-seq and ChIP-exo experiments, or to promoter regions of co-modulated genes, motif analysis techniques allow for the prediction of transcription factor binding events and enable identification of transcriptional regulators and co-regulators. The usefulness of motif analysis is further exemplified in this review by how motif discovery improves peak calling in ChIP-seq and ChIP-exo experiments and, when coupled with information on gene expression, allows insights into physical mechanisms of transcriptional modulation.
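As a small illustration of the IUPAC-consensus representation mentioned in this abstract, the sketch below expands a consensus string into a regular expression and reports where it occurs in a sequence. The function name `find_consensus` and the example sequence are hypothetical, and only the forward strand is scanned; a practical scanner would also check the reverse complement.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_consensus(consensus, sequence):
    """Return 0-based start positions of forward-strand consensus matches."""
    pattern = "".join(IUPAC[c] for c in consensus.upper())
    # A lookahead is used so that overlapping occurrences are all reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence.upper())]

# Hypothetical example: "S" stands for C or G, so both TGACTCA and TGAGTCA match.
hits = find_consensus("TGASTCA", "AATGACTCATTTGAGTCAGG")
print(hits)
```

A consensus string treats every allowed base as equally good, which is the main thing a position weight matrix relaxes by assigning each base a score at each position.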
Affiliation(s)
- Valentina Boeva
- Centre de Recherche, Institut Curie, Paris, France; INSERM, U900, Paris, France; Mines ParisTech, Fontainebleau, France; PSL Research University, Paris, France; Department of Development, Reproduction and Cancer, Institut Cochin, Paris, France; INSERM, U1016, Paris, France; Centre National de la Recherche Scientifique UMR 8104, Paris, France; Université Paris Descartes UMR-S1016, Paris, France
37
MOCCS: Clarifying DNA-binding motif ambiguity using ChIP-Seq data. Comput Biol Chem 2016; 63:62-72. [PMID: 26971251 DOI: 10.1016/j.compbiolchem.2016.01.014] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 01/14/2016] [Accepted: 01/25/2016] [Indexed: 11/21/2022]
Abstract
BACKGROUND As a key mechanism of gene regulation, transcription factors (TFs) bind to DNA by recognizing specific short sequence patterns that are called DNA-binding motifs. A single TF can accept ambiguity within its DNA-binding motifs, which comprise both canonical (typical) and non-canonical motifs. Clarification of such DNA-binding motif ambiguity is crucial for revealing gene regulatory networks and evaluating mutations in cis-regulatory elements. Although chromatin immunoprecipitation sequencing (ChIP-seq) now provides abundant data on the genomic sequences to which a given TF binds, existing motif discovery methods are unable to directly answer whether a given TF can bind to a specific DNA-binding motif. RESULTS Here, we report a method for clarifying the DNA-binding motif ambiguity, MOCCS. Given ChIP-Seq data of any TF, MOCCS comprehensively analyzes and describes every k-mer to which that TF binds. Analysis of simulated datasets revealed that MOCCS is applicable to various ChIP-Seq datasets, requiring only a few minutes per dataset. Application to the ENCODE ChIP-Seq datasets proved that MOCCS directly evaluates whether a given TF binds to each DNA-binding motif, even if known position weight matrix models do not provide sufficient information on DNA-binding motif ambiguity. Furthermore, users are not required to provide numerous parameters or background genomic sequence models that are typically unavailable. MOCCS is implemented in Perl and R and is freely available via https://github.com/yuifu/moccs. CONCLUSIONS By complementing existing motif-discovery software, MOCCS will contribute to the basic understanding of how the genome controls diverse cellular processes via DNA-protein interactions.
|
38
|
Ochoa A, Storey JD, Llinás M, Singh M. Beyond the E-Value: Stratified Statistics for Protein Domain Prediction. PLoS Comput Biol 2015; 11:e1004509. [PMID: 26575353 PMCID: PMC4648515 DOI: 10.1371/journal.pcbi.1004509] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2014] [Accepted: 08/03/2015] [Indexed: 01/25/2023] Open
Abstract
E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems (that is, those in which statistical tests can be partitioned naturally), controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. The use of lFDR thresholds outperforms q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.
Despite decades of research, it remains a challenge to distinguish homologous relationships between proteins from sequence similarities arising due to chance alone. This is an increasingly important problem as sequence database sizes continue to grow, and even today many computational analyses require that the statistics of billions of sequence comparisons be assessed automatically. Here we explore statistical significance evaluation on data that is stratified—that is, naturally partitioned into subsets that may differ in their amount of signal—and find a theoretically optimal criterion for automatically setting thresholds of significance for each stratum. For the task of domain prediction, an important component of efforts to annotate protein sequences and identify remote sequence homologs, we empirically show that our stratified analysis of statistical significance greatly improves upon a combined analysis. Further, we identify weaknesses in the prevailing random sequence model for assessing statistical significance for a small subset of domain families with repetitive sequence patterns and known biological, structural, and evolutionary properties. Our theoretical findings in statistics are relevant not only for identifying protein domains, but for arbitrary stratified problems in genomics and beyond.
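The idea of setting significance thresholds per stratum can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes Benjamini-Hochberg adjusted p-values as a conservative stand-in for q-values (no estimate of the null proportion), and the function names and the stratum labels in the usage note are hypothetical.

```python
from typing import Dict, List


def bh_adjusted(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (a conservative q-value, assuming pi0 = 1)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value
    # never exceeds the adjusted values of larger p-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def stratified_calls(
    pvalues_by_stratum: Dict[str, List[float]], threshold: float
) -> Dict[str, List[bool]]:
    """Apply the adjustment separately within each stratum, then threshold."""
    return {
        stratum: [q <= threshold for q in bh_adjusted(pvals)]
        for stratum, pvals in pvalues_by_stratum.items()
    }
```

For example, `stratified_calls({"famA": [0.01, 0.04, 0.03, 0.5], "famB": [0.2, 0.9]}, 0.05)` adjusts the hypothetical strata `famA` and `famB` independently, so a stratum with little signal does not dilute the adjustment in a stratum with strong signal.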
Affiliation(s)
- Alejandro Ochoa
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey, United States of America
- John D. Storey
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey, United States of America
- Manuel Llinás
- Department of Biochemistry and Molecular Biology, and the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Mona Singh
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
|
39
|
Zacher B, Lidschreiber M, Cramer P, Gagneur J, Tresch A. Annotation of genomics data using bidirectional hidden Markov models unveils variations in Pol II transcription cycle. Mol Syst Biol 2014; 10:768. [PMID: 25527639 PMCID: PMC4300491 DOI: 10.15252/msb.20145654] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/02/2023] Open
Abstract
DNA replication, transcription and repair involve the recruitment of protein complexes that change their composition as they progress along the genome in a directed or strand-specific manner. Chromatin immunoprecipitation in conjunction with hidden Markov models (HMMs) has been instrumental in understanding these processes, as they segment the genome into discrete states that can be related to DNA-associated protein complexes. However, current HMM-based approaches are not able to assign forward or reverse direction to states or properly integrate strand-specific (e.g., RNA expression) with non-strand-specific (e.g., ChIP) data, which is indispensable to accurately characterize directed processes. To overcome these limitations, we introduce bidirectional HMMs which infer directed genomic states from occupancy profiles de novo. Application to RNA polymerase II-associated factors in yeast and chromatin modifications in human T cells recovers the majority of transcribed loci, reveals gene-specific variations in the yeast transcription cycle and indicates the existence of directed chromatin state patterns at transcribed, but not at repressed, regions in the human genome. In yeast, we identify 32 new transcribed loci, a regulated initiation–elongation transition, the absence of elongation factors Ctk1 and Paf1 from a class of genes, a distinct transcription mechanism for highly expressed genes and novel DNA sequence motifs associated with transcription termination. We anticipate bidirectional HMMs to significantly improve the analyses of genome-associated directed processes.
Affiliation(s)
- Benedikt Zacher
- Gene Center and Department of Biochemistry, Center for Integrated Protein Science CIPSM, Ludwig-Maximilians-Universität München, Munich, Germany
- Institute for Genetics, University of Cologne, Cologne, Germany
- Michael Lidschreiber
- Gene Center and Department of Biochemistry, Center for Integrated Protein Science CIPSM, Ludwig-Maximilians-Universität München, Munich, Germany
- Department of Molecular Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
- Patrick Cramer
- Gene Center and Department of Biochemistry, Center for Integrated Protein Science CIPSM, Ludwig-Maximilians-Universität München, Munich, Germany
- Department of Molecular Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
- Julien Gagneur
- Gene Center and Department of Biochemistry, Center for Integrated Protein Science CIPSM, Ludwig-Maximilians-Universität München, Munich, Germany
- Achim Tresch
- Gene Center and Department of Biochemistry, Center for Integrated Protein Science CIPSM, Ludwig-Maximilians-Universität München, Munich, Germany
- Institute for Genetics, University of Cologne, Cologne, Germany
- Max Planck Institute for Plant Breeding Research, Cologne, Germany
|
40
|
Leung KK, Wong HTH, Naftalin CM, Lee SS. A new perspective on sexual mixing among men who have sex with men by body image. PLoS One 2014; 9:e113791. [PMID: 25412266 PMCID: PMC4239110 DOI: 10.1371/journal.pone.0113791] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2013] [Accepted: 09/21/2014] [Indexed: 11/24/2022] Open
Abstract
Background “Casual sex” is seldom as non-selective and random as it may sound. During each sexual encounter, people consciously and unconsciously seek their casual sex partners according to different attributes. Although physical appearance is influential in a sexual network, research quantifying its effects on sexual networks has been sparse. Methods We evaluated the application of the log odds score (LOD) to assess the mixing patterns of 326 men who have sex with men (MSM) in Hong Kong in their networking of casual sex partners by Body Image Type (BIT). This involved an analysis of 1,196 respondent-casual sex partner pairs. Seven BITs were used in the study: Bear, Chubby, Slender, Lean toned, Muscular, Average and Other. Results A hierarchical pattern was observed in the preference of MSM for casual sex partners by the latter's BIT. Overall, Muscular men were most preferred, followed by Lean toned, while the least preferred was Slender, as illustrated by LOD going down along the hierarchy in the same direction. Marked avoidance was found between men who self-identified as Chubby and men of Other body type (within-group-LOD: 1.25–2.89; between-group-LOD: <−1). None of the respondents reported having networked with a man who self-identified as Average for casual sex. Conclusions We have demonstrated the possibility of adopting a mathematical prototype to investigate the influence of BIT in a sexual network of MSM. Construction of a matrix based on culture-specific BIT and cross-cultural comparisons would generate new knowledge on the mixing behaviors of MSM.
Affiliation(s)
- Ka-Kit Leung
- Stanley Ho Centre for Emerging Infectious Diseases, The Chinese University of Hong Kong, Hong Kong, China
- Horas T. H. Wong
- Stanley Ho Centre for Emerging Infectious Diseases, The Chinese University of Hong Kong, Hong Kong, China
- Claire M. Naftalin
- Stanley Ho Centre for Emerging Infectious Diseases, The Chinese University of Hong Kong, Hong Kong, China
- Shui Shan Lee
- Stanley Ho Centre for Emerging Infectious Diseases, The Chinese University of Hong Kong, Hong Kong, China
|
41
|
Siebert M, Söding J. Universality of core promoter elements? Nature 2014; 511:E11-2. [PMID: 25056067 DOI: 10.1038/nature13587] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2013] [Accepted: 06/12/2014] [Indexed: 11/09/2022]
Affiliation(s)
- Matthias Siebert
- Gene Center Munich and Department of Biochemistry, Center for Integrated Protein Science Munich (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, 81377 Munich, Germany
- Johannes Söding
- Gene Center Munich and Department of Biochemistry, Center for Integrated Protein Science Munich (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, 81377 Munich, Germany
- Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
|
42
|
Peterson EJR, Reiss DJ, Turkarslan S, Minch KJ, Rustad T, Plaisier CL, Longabaugh WJR, Sherman DR, Baliga NS. A high-resolution network model for global gene regulation in Mycobacterium tuberculosis. Nucleic Acids Res 2014; 42:11291-303. [PMID: 25232098 PMCID: PMC4191388 DOI: 10.1093/nar/gku777] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2023] Open
Abstract
The resilience of Mycobacterium tuberculosis (MTB) is largely due to its ability to effectively counteract and even take advantage of the hostile environments of a host. In order to accelerate the discovery and characterization of these adaptive mechanisms, we have mined a compendium of 2325 publicly available transcriptome profiles of MTB to decipher a predictive, systems-scale gene regulatory network model. The resulting modular organization of 98% of all MTB genes within this regulatory network was rigorously tested using two independently generated datasets: a genome-wide map of 7248 DNA-binding locations for 143 transcription factors (TFs) and global transcriptional consequences of overexpressing 206 TFs. This analysis has discovered specific TFs that mediate conditional co-regulation of genes within 240 modules across 14 distinct environmental contexts. In addition to recapitulating previously characterized regulons, we discovered 454 novel mechanisms for gene regulation during stress, cholesterol utilization and dormancy. Significantly, 183 of these mechanisms act uniquely under conditions experienced during the infection cycle to regulate diverse functions including 23 genes that are essential to host-pathogen interactions. These and other insights underscore the power of a rational, model-driven approach to unearth novel MTB biology that operates under some but not all phases of infection.
Affiliation(s)
- David J Reiss
- Institute for Systems Biology, 401 Terry Ave N, Seattle, WA 98109, USA
- Serdar Turkarslan
- Seattle Biomed Research Institute, 307 Westlake Avenue North, Suite 500, Seattle, WA 98109, USA
- Kyle J Minch
- Seattle Biomed Research Institute, 307 Westlake Avenue North, Suite 500, Seattle, WA 98109, USA
- Tige Rustad
- Seattle Biomed Research Institute, 307 Westlake Avenue North, Suite 500, Seattle, WA 98109, USA
- David R Sherman
- Seattle Biomed Research Institute, 307 Westlake Avenue North, Suite 500, Seattle, WA 98109, USA
- Nitin S Baliga
- Institute for Systems Biology, 401 Terry Ave N, Seattle, WA 98109, USA
|
43
|
Baejen C, Torkler P, Gressel S, Essig K, Söding J, Cramer P. Transcriptome Maps of mRNP Biogenesis Factors Define Pre-mRNA Recognition. Mol Cell 2014; 55:745-57. [DOI: 10.1016/j.molcel.2014.08.005] [Citation(s) in RCA: 87] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2014] [Revised: 07/08/2014] [Accepted: 07/31/2014] [Indexed: 12/15/2022]
|
44
|
Saponaro M, Kantidakis T, Mitter R, Kelly GP, Heron M, Williams H, Söding J, Stewart A, Svejstrup JQ. RECQL5 controls transcript elongation and suppresses genome instability associated with transcription stress. Cell 2014; 157:1037-49. [PMID: 24836610 PMCID: PMC4032574 DOI: 10.1016/j.cell.2014.03.048] [Citation(s) in RCA: 159] [Impact Index Per Article: 14.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2013] [Revised: 01/21/2014] [Accepted: 03/13/2014] [Indexed: 01/03/2023]
Abstract
RECQL5 is the sole member of the RECQ family of helicases associated with RNA polymerase II (RNAPII). We now show that RECQL5 is a general elongation factor that is important for preserving genome stability during transcription. Depletion or overexpression of RECQL5 results in corresponding shifts in the genome-wide RNAPII density profile. Elongation is particularly affected, with RECQL5 depletion causing a striking increase in the average rate, concurrent with increased stalling, pausing, arrest, and/or backtracking (transcription stress). RECQL5 therefore controls the movement of RNAPII across genes. Loss of RECQL5 also results in the loss or gain of genomic regions, with the breakpoints of lost regions located in genes and common fragile sites. The chromosomal breakpoints overlap with areas of elevated transcription stress, suggesting that RECQL5 suppresses such stress and its detrimental effects, and thereby prevents genome instability in the transcribed region of genes.
Affiliation(s)
- Marco Saponaro
- Mechanisms of Transcription Laboratory, Clare Hall Laboratories, Cancer Research UK London Research Institute, South Mimms, EN6 3LD, UK
- Theodoros Kantidakis
- Mechanisms of Transcription Laboratory, Clare Hall Laboratories, Cancer Research UK London Research Institute, South Mimms, EN6 3LD, UK
- Richard Mitter
- Bioinformatics and Biostatistics Group, Cancer Research UK London Research Institute, 44 Lincoln's Inn Fields, London WC2A 3LY, UK
- Gavin P Kelly
- Bioinformatics and Biostatistics Group, Cancer Research UK London Research Institute, 44 Lincoln's Inn Fields, London WC2A 3LY, UK
- Mark Heron
- Gene Center and Center for Integrated Protein Science Munich (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, 81377 Munich, Germany
- Hannah Williams
- Mechanisms of Transcription Laboratory, Clare Hall Laboratories, Cancer Research UK London Research Institute, South Mimms, EN6 3LD, UK
- Johannes Söding
- Gene Center and Center for Integrated Protein Science Munich (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, 81377 Munich, Germany
- Aengus Stewart
- Bioinformatics and Biostatistics Group, Cancer Research UK London Research Institute, 44 Lincoln's Inn Fields, London WC2A 3LY, UK
- Jesper Q Svejstrup
- Mechanisms of Transcription Laboratory, Clare Hall Laboratories, Cancer Research UK London Research Institute, South Mimms, EN6 3LD, UK
|
45
|
Eser P, Demel C, Maier KC, Schwalb B, Pirkl N, Martin DE, Cramer P, Tresch A. Periodic mRNA synthesis and degradation co-operate during cell cycle gene expression. Mol Syst Biol 2014; 10:717. [PMID: 24489117 PMCID: PMC4023403 DOI: 10.1002/msb.134886] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
During the cell cycle, the levels of hundreds of mRNAs change in a periodic manner, but how this is achieved by alterations in the rates of mRNA synthesis and degradation has not been studied systematically. Here, we used metabolic RNA labeling and comparative dynamic transcriptome analysis (cDTA) to derive mRNA synthesis and degradation rates every 5 min during three cell cycle periods of the yeast Saccharomyces cerevisiae. A novel statistical model identified 479 genes that show periodic changes in mRNA synthesis and generally also periodic changes in their mRNA degradation rates. Peaks of mRNA degradation generally follow peaks of mRNA synthesis, resulting in sharp and high peaks of mRNA levels at defined times during the cell cycle. Whereas the timing of mRNA synthesis is set by upstream DNA motifs and their associated transcription factors (TFs), the synthesis rate of a periodically expressed gene is apparently set by its core promoter.
Affiliation(s)
- Philipp Eser
- Gene Center and Department of Biochemistry, Center for Integrated Protein Science CIPSM Ludwig-Maximilians-Universität München, Munich, Germany
|
46
|
Abstract
MOTIVATION Generating accurate transcription factor (TF) binding site motifs from data generated using the next-generation sequencing, especially ChIP-seq, is challenging. The challenge arises because a typical experiment reports a large number of sequences bound by a TF, and the length of each sequence is relatively long. Most traditional motif finders are slow in handling such enormous amount of data. To overcome this limitation, tools have been developed that compromise accuracy with speed by using heuristic discrete search strategies or limited optimization of identified seed motifs. However, such strategies may not fully use the information in input sequences to generate motifs. Such motifs often form good seeds and can be further improved with appropriate scoring functions and rapid optimization. RESULTS We report a tool named discriminative motif optimizer (DiMO). DiMO takes a seed motif along with a positive and a negative database and improves the motif based on a discriminative strategy. We use area under receiver-operating characteristic curve (AUC) as a measure of discriminating power of motifs and a strategy based on perceptron training that maximizes AUC rapidly in a discriminative manner. Using DiMO, on a large test set of 87 TFs from human, drosophila and yeast, we show that it is possible to significantly improve motifs identified by nine motif finders. The motifs are generated/optimized using training sets and evaluated on test sets. The AUC is improved for almost 90% of the TFs on test sets and the magnitude of increase is up to 39%. AVAILABILITY AND IMPLEMENTATION DiMO is available at http://stormo.wustl.edu/DiMO
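The use of AUC as a measure of a motif's discriminating power can be sketched as follows. This is only an illustration of the evaluation step, not DiMO's perceptron-based optimization: it assumes a motif is given as a list of per-position score dictionaries, that a sequence is scored by its best-scoring window on one strand, and that AUC is computed as the fraction of positive/negative pairs ranked correctly. The function names and the toy motif in the usage note are hypothetical.

```python
from typing import Dict, List, Sequence


def best_window_score(pwm: List[Dict[str, float]], seq: str) -> float:
    """Highest sum of per-position scores over all windows of motif width."""
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs in which the positive scores
    higher; ties count as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def motif_auc(pwm: List[Dict[str, float]], positives: List[str], negatives: List[str]) -> float:
    """AUC of a motif for separating positive from negative sequences."""
    pos = [best_window_score(pwm, s) for s in positives]
    neg = [best_window_score(pwm, s) for s in negatives]
    return auc(pos, neg)
```

With a toy two-position motif that rewards `A` followed by `C`, `motif_auc(pwm, ["GGACG", "ACTT", "TTTT"], ["GGGG", "TTAT"])` compares the positive scores against the negative scores pair by pair; a value near 1 means the motif separates the two sets well, and a value near 0.5 means it does no better than chance.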
Affiliation(s)
- Ronak Y Patel
- Department of Genetics, Washington University School of Medicine, St. Louis, MO 63108, USA
|
47
|
Li YY, Chang X, Yu WB, Li H, Ye ZQ, Yu H, Liu BH, Zhang Y, Zhang SL, Ye BC, Li YX. Systems perspectives on erythromycin biosynthesis by comparative genomic and transcriptomic analyses of S. erythraea E3 and NRRL23338 strains. BMC Genomics 2013; 14:523. [PMID: 23902230 PMCID: PMC3733707 DOI: 10.1186/1471-2164-14-523] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2013] [Accepted: 07/26/2013] [Indexed: 11/20/2022] Open
Abstract
Background S. erythraea is a Gram-positive filamentous bacterium used for the industrial-scale production of erythromycin A, which is of high clinical importance. In this work, we sequenced the whole genome of a high-producing strain (E3) obtained by random mutagenesis and screening from the wild-type strain NRRL23338, and examined time-series expression profiles of both E3 and NRRL23338. Based on the genomic and transcriptomic data of these two strains, we carried out a comparative analysis of the high-producing and wild-type strains at both the genomic level and the transcriptomic level. Results We observed a large number of genetic variants including 60 insertions, 46 deletions and 584 single nucleotide variations (SNV) in E3 in comparison with NRRL23338, and the analysis of time-series transcriptomic data indicated that the genes involved in erythromycin biosynthesis and feeder pathways were significantly up-regulated during the 60-hour time course. According to our data, BldD, a previously identified ery cluster regulator, did not show any positive correlation with the expression of the ery cluster, suggesting the existence of alternative regulation mechanisms of erythromycin synthesis in S. erythraea. Several potential regulators were then proposed by integrated analysis of genomic and transcriptomic data. Conclusion This is a demonstration of functional comparative genomics between an industrial S. erythraea strain and the wild-type strain. These findings help to understand the global regulation mechanisms of erythromycin biosynthesis in S. erythraea, providing useful clues for genetic and metabolic engineering in the future.
|