1
Raditsa V, Tsukanov A, Bogomolov A, Levitsky V. Genomic background sequences systematically outperform synthetic ones in de novo motif discovery for ChIP-seq data. NAR Genom Bioinform 2024; 6:lqae090. [PMID: 39071850 PMCID: PMC11282361 DOI: 10.1093/nargab/lqae090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2024] [Revised: 06/03/2024] [Accepted: 07/19/2024] [Indexed: 07/30/2024] Open
Abstract
Efficient de novo motif discovery from the results of genome-wide mapping of transcription factor binding sites (ChIP-seq) depends on the choice of background nucleotide sequences. The foreground sequences (ChIP-seq peaks) represent not only specific motifs of target transcription factors, but also the motifs overrepresented throughout the genome, such as simple sequence repeats. We performed a massive comparison of the 'synthetic' and 'genomic' approaches to generate background sequences for de novo motif discovery. The 'synthetic' approach shuffled nucleotides in peaks, while the 'genomic' approach selected sequences from the reference genome, either randomly or only from gene promoters, according to the fraction of A/T nucleotides in each sequence. We compiled the benchmark collections of ChIP-seq datasets for mouse, human and Arabidopsis, and performed de novo motif discovery. We showed that the genomic approach provides both more robust detection of the known motifs of target transcription factors and more stringent exclusion of the simple sequence repeats as possible non-specific motifs. The advantage of the genomic approach over the synthetic approach was greater in plants than in mammals. We developed the AntiNoise web service (https://denovosea.icgbio.ru/antinoise/) that implements a genomic approach to extract genomic background sequences for twelve eukaryotic genomes.
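The two background strategies contrasted in this abstract can be illustrated with a minimal sketch. It is not the AntiNoise implementation: the function names, the `tolerance` and `max_tries` parameters, and the random toy "genome" are all assumptions made for illustration.

```python
import random

def at_fraction(seq):
    """Fraction of A/T nucleotides in a sequence."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)

def synthetic_background(peak, rng):
    """'Synthetic' approach: shuffle the nucleotides of one peak.

    Mononucleotide content is preserved; the original order is destroyed.
    """
    bases = list(peak)
    rng.shuffle(bases)
    return "".join(bases)

def genomic_background(peak, genome, rng, tolerance=0.05, max_tries=10000):
    """'Genomic' approach: draw a random genome window of the peak's length
    whose A/T fraction is close to that of the peak.

    `tolerance` and `max_tries` are hypothetical parameters.
    Returns None if no suitable window is found.
    """
    target = at_fraction(peak)
    width = len(peak)
    for _ in range(max_tries):
        start = rng.randrange(len(genome) - width + 1)
        window = genome[start:start + width]
        if abs(at_fraction(window) - target) <= tolerance:
            return window
    return None

rng = random.Random(0)
# Toy stand-in for a reference genome (random sequence, assumption).
genome = "".join(rng.choice("ACGT") for _ in range(5000))
peak = "ATATTAGCGCATTTAAGCTA"

shuffled = synthetic_background(peak, rng)
matched = genomic_background(peak, genome, rng)
print(shuffled, matched)
```

A shuffled peak keeps only nucleotide composition, whereas a matched genomic window also carries whatever repeats and higher-order context the genome contains, which is the property the abstract credits for the stricter exclusion of simple sequence repeats.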
Affiliation(s)
- Vladimir V Raditsa
- Department of System Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Anton V Tsukanov
- Department of System Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Anton G Bogomolov
- Department of Cell Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Victor G Levitsky
- Department of System Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Department of Natural Science, Novosibirsk State University, Novosibirsk 630090, Russia
2
Vishnevsky OV, Bocharnikov AV, Ignatieva EV. Peak Scores Significantly Depend on the Relationships between Contextual Signals in ChIP-Seq Peaks. Int J Mol Sci 2024; 25:1011. [PMID: 38256085 PMCID: PMC10816497 DOI: 10.3390/ijms25021011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 12/13/2023] [Accepted: 01/09/2024] [Indexed: 01/24/2024] Open
Abstract
Chromatin immunoprecipitation followed by massively parallel DNA sequencing (ChIP-seq) is a central genome-wide method for in vivo analyses of DNA-protein interactions in various cellular conditions. Numerous studies have demonstrated the complex contextual organization of ChIP-seq peak sequences and the presence of binding sites for transcription factors in them. We assessed the dependence of the ChIP-seq peak score on the presence of different contextual signals in the peak sequences by analyzing these sequences from several ChIP-seq experiments using our fully enumerative GPU-based de novo motif discovery method, Argo_CUDA. Analysis revealed sets of significant IUPAC motifs corresponding to the binding sites of the target and partner transcription factors. For these ChIP-seq experiments, multiple regression models were constructed, demonstrating a significant dependence of the peak scores on the presence in the peak sequences of not only highly significant target motifs but also less significant motifs corresponding to the binding sites of the partner transcription factors. A significant correlation was shown between the presence of the target motifs FOXA2 and the partner motifs HNF4G, which found experimental confirmation in the scientific literature, demonstrating the important contribution of the partner transcription factors to the binding of the target transcription factor to DNA and, consequently, their important contribution to the peak score.
Affiliation(s)
- Oleg V. Vishnevsky
- Institute of Cytology and Genetics, 630090 Novosibirsk, Russia
- Department of Natural Science, Novosibirsk State University, 630090 Novosibirsk, Russia
- Andrey V. Bocharnikov
- Department of Natural Science, Novosibirsk State University, 630090 Novosibirsk, Russia
- Elena V. Ignatieva
- Institute of Cytology and Genetics, 630090 Novosibirsk, Russia
- Department of Natural Science, Novosibirsk State University, 630090 Novosibirsk, Russia
3
Rasoarahona R, Wattanadilokchatkun P, Panthum T, Jaisamut K, Lisachov A, Thong T, Singchat W, Ahmad SF, Han K, Kraichak E, Muangmai N, Koga A, Duengkae P, Antunes A, Srikulnath K. MicrosatNavigator: exploring nonrandom distribution and lineage-specificity of microsatellite repeat motifs on vertebrate sex chromosomes across 186 whole genomes. Chromosome Res 2023; 31:29. [PMID: 37775555 DOI: 10.1007/s10577-023-09738-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Revised: 08/11/2023] [Accepted: 09/05/2023] [Indexed: 10/01/2023]
Abstract
Microsatellites are short tandem DNA repeats, ubiquitous in genomes. They are believed to be under selection pressure, considering their high distribution and abundance beyond chance or random accumulation. However, limited analysis of microsatellites in single taxonomic groups makes it challenging to understand their evolutionary significance across taxonomic boundaries. Despite abundant genomic information, microsatellites have been studied in limited contexts and within a few species, warranting an unbiased examination of their genome-wide distribution in distinct versus closely related clades. Large-scale comparisons have revealed relevant trends, especially in vertebrates. Here, "MicrosatNavigator", a new tool that allows quick and reliable investigation of perfect microsatellites in DNA sequences, was developed. This tool can identify microsatellites across entire genome sequences. Using this tool, microsatellite repeat motifs were identified in the genome sequences of 186 vertebrates. A significant positive correlation was noted between the abundance, density, length, and GC bias of microsatellites and specific lineages. The (AC)n motif is the most prevalent in vertebrate genomes, showing distinct patterns in closely related species. Longer microsatellites were observed on sex chromosomes in birds and mammals but not on autosomes. Microsatellites on sex chromosomes of non-fish vertebrates have the lowest GC content, whereas high-GC microsatellites (≥ 50 M% GC) are preferred in bony and cartilaginous fishes. Thus, similar selective forces and mutational processes may constrain GC-rich microsatellites to different clades. These findings should facilitate investigations into the roles of microsatellites in sex chromosome differentiation and provide candidate microsatellites for functional analysis across the vertebrate evolutionary spectrum.
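What "perfect microsatellite" detection means can be shown with a small scan for uninterrupted tandem repeats. This is only a sketch of the general idea, not MicrosatNavigator's algorithm; the function name and the `max_unit` and `min_repeats` thresholds are assumptions.

```python
def find_perfect_microsatellites(seq, max_unit=6, min_repeats=3):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    A repeat is 'perfect' when the unit is copied back to back with no
    mismatches or gaps. Thresholds are hypothetical defaults.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_unit + 1):
        i = 0
        while i + k * min_repeats <= len(seq):
            unit = seq[i:i + k]
            # Skip units containing non-ACGT symbols, and units that are
            # themselves repeats of a shorter unit (e.g. "AA" or "ACAC").
            if set(unit) - set("ACGT") or unit in (unit + unit)[1:-1]:
                i += 1
                continue
            # Count consecutive copies of the unit starting at position i.
            n = 1
            while seq[i + n * k:i + (n + 1) * k] == unit:
                n += 1
            if n >= min_repeats:
                hits.append((i, unit, n))
                i += n * k  # continue after the repeat tract
            else:
                i += 1
    return hits

hits = find_perfect_microsatellites("GGACACACACTTTTTG")
print(sorted(hits))  # [(2, 'AC', 4), (10, 'T', 5)]
```

The toy sequence contains an (AC)4 tract and a (T)5 tract; both are reported with 0-based start positions.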
Affiliation(s)
- Ryan Rasoarahona
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Sciences for Industry, Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Pish Wattanadilokchatkun
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Thitipong Panthum
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Kitipong Jaisamut
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Artem Lisachov
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Thanyapat Thong
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Worapong Singchat
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Syed Farhan Ahmad
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Kyudong Han
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Department of Microbiology, College of Science & Technology, Dankook University, Cheonan, 31116, Republic of Korea
- Center for Bio-Medical Engineering Core Facility, Dankook University, Cheonan, 31116, Republic of Korea
- Ekaphan Kraichak
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Department of Botany, Faculty of Science, Kasetsart University, Bangkok, 10900, Thailand
- Narongrit Muangmai
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Department of Fishery Biology, Faculty of Fisheries, Kasetsart University, Chatuchak, Bangkok, 10900, Thailand
- Akihiko Koga
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Prateep Duengkae
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Agostinho Antunes
- CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, S/N, 4450-208, Porto, Portugal
- Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre, S/N, 4169-007, Porto, Portugal
- Kornsorn Srikulnath
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand.
- Sciences for Industry, Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand.
- Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand.
- Center for Advanced Studies in Tropical Natural Resources, National Research University-Kasetsart University, Kasetsart University, (CASTNAR, NRU-KU, Thailand), Bangkok, 10900, Thailand.
- Center of Excellence on Agricultural Biotechnology (AG-BIO/PERDO-CHE), Bangkok, 10900, Thailand.
4
Patra T, Cunningham DM, Meyer K, Toth K, Ray RB, Heczey A, Ray R. Targeting Lin28 axis enhances glypican-3-CAR T cell efficacy against hepatic tumor initiating cell population. Mol Ther 2023; 31:715-728. [PMID: 36609146 PMCID: PMC10014222 DOI: 10.1016/j.ymthe.2023.01.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/03/2022] [Revised: 08/01/2022] [Accepted: 01/04/2023] [Indexed: 01/08/2023] Open
Abstract
Overexpression of Lin28 is detected in various cancers with involvement in the self-renewal process and cancer stem cell generation. In the present study, we evaluated how the Lin28 axis plays an immune-protective role for tumor-initiating cancer cells in hepatocellular carcinoma (HCC). Our results using HCC patient samples showed a positive correlation between indoleamine 2,3-dioxygenase-1 (IDO1), a kynurenine-producing enzyme with effects on tumor immune escape, and Lin28B. Using in silico prediction, we identified a Sox2/Oct4 transcriptional motif acting as an enhancer for IDO1. Knockdown of Lin28B reduced Sox2/Oct4 and downregulated IDO1 in tumor-initiating hepatic cancer cells. We further observed that inhibition of Lin28 by a small-molecule inhibitor (C1632) suppressed IDO1 expression. Suppression of IDO1 resulted in a decline in kynurenine production from tumor-initiating cells. Inhibition of the Lin28 axis also impaired PD-L1 expression in HCC cells. Consequently, modulating Lin28B enhanced in vitro cytotoxicity of glypican-3 (GPC3)-chimeric antigen receptor (CAR) T and NK cells. Next, we observed that GPC3-CAR T cell treatment together with C1632 in an HCC xenograft mouse model led to enhanced anti-tumor activity. In conclusion, our results suggest that inhibition of Lin28B reduces IDO1 and PD-L1 expression and enhances the immunotherapeutic potential of GPC3-CAR T cells against HCC.
Affiliation(s)
- Tapas Patra
- Department of Internal Medicine, Saint Louis University, St. Louis, MO 63104, USA.
- David M Cunningham
- Center for Advanced Innate Cell Therapy, Texas Children's Cancer Center, Division of Pediatric Hematology and Oncology, Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA
- Keith Meyer
- Department of Internal Medicine, Saint Louis University, St. Louis, MO 63104, USA
- Karoly Toth
- Department of Molecular Microbiology & Immunology and Saint Louis University, St. Louis, MO 63104, USA
- Ratna B Ray
- Department of Pathology, Saint Louis University, St. Louis, MO 63104, USA
- Andras Heczey
- Center for Advanced Innate Cell Therapy, Texas Children's Cancer Center, Division of Pediatric Hematology and Oncology, Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA
- Ranjit Ray
- Department of Internal Medicine, Saint Louis University, St. Louis, MO 63104, USA; Department of Molecular Microbiology & Immunology and Saint Louis University, St. Louis, MO 63104, USA.
5
Yu Q, Zhang X, Hu Y, Chen S, Yang L. A Method for Predicting DNA Motif Length Based On Deep Learning. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:61-73. [PMID: 35275822 DOI: 10.1109/tcbb.2022.3158471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
A DNA motif is a sequence pattern shared by the DNA sequence segments that bind to a specific protein. Discovering motifs in a given DNA sequence dataset plays a vital role in studying gene expression regulation. As an important attribute of the DNA motif, the motif length directly affects the quality of the discovered motifs. How to determine the motif length more accurately remains a difficult challenge to be solved. We propose a new motif length prediction scheme named MotifLen by using supervised machine learning. First, a method of constructing sample data for predicting the motif length is proposed. Secondly, a deep learning model for motif length prediction is constructed based on the convolutional neural network. Then, the methods of applying the proposed prediction model based on a motif found by an existing motif discovery algorithm are given. The experimental results show that i) the prediction accuracy of MotifLen is more than 90% on the validation set and is significantly higher than that of the compared methods on real datasets, ii) MotifLen can successfully optimize the motifs found by the existing motif discovery algorithms, and iii) it can effectively improve the time performance of some existing motif discovery algorithms.
6
Theepalakshmi P, Reddy US. Freezing firefly algorithm for efficient planted (ℓ, d) motif search. Med Biol Eng Comput 2022; 60:511-530. [PMID: 35020123 DOI: 10.1007/s11517-021-02468-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2020] [Accepted: 11/06/2021] [Indexed: 10/19/2022]
Abstract
The detection of unique patterns (motifs) occurring in a set of biological sequences can lead to new biological discoveries. Its application in the recognition of transcription factors and their binding sites has demonstrated its importance for understanding gene function, human diseases, and drug design. The literature identifies (ℓ, d) motif search as a widely studied problem in PMS (Planted Motif Search). This paper proposes an efficient optimization algorithm named "Freezing FireFly (FFF)" to solve the (ℓ, d) motif search problem. A new freezing strategy, applied both locally and globally, was added to improve the performance of the basic Firefly algorithm. It freezes the best positions found, even for the less bright fireflies. The performance of the proposed algorithm is evaluated on simulated and real datasets. The experimental results show that the proposed algorithm solves the instance (50, 21) within 1.47 min on the simulated dataset. For real (such as ChIP-seq (Chromatin Immunoprecipitation)) and synthetic datasets, the proposed algorithm runs much faster than existing state-of-the-art optimization algorithms, including Samselect, TraverStringRef, PMS8, qPMS9, AlignACE, FMGA, and GSGA.
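The (ℓ, d) motif search problem named in this abstract can be stated concretely: find every string of length ℓ that occurs in each input sequence with at most d mismatches. The sketch below solves a toy instance by exhaustive enumeration. It is not the Freezing Firefly algorithm, it is only feasible for very small ℓ, and all function names are assumptions.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def min_distance(candidate, seq):
    """Smallest Hamming distance between the candidate and any window of seq."""
    l = len(candidate)
    return min(hamming(candidate, seq[i:i + l])
               for i in range(len(seq) - l + 1))

def is_ld_motif(candidate, sequences, d):
    """True if every sequence contains the candidate with at most d mismatches."""
    return all(min_distance(candidate, s) <= d for s in sequences)

def brute_force_pms(sequences, l, d):
    """Enumerate all 4**l candidates; exponential in l, for illustration only."""
    return ["".join(p) for p in product("ACGT", repeat=l)
            if is_ld_motif("".join(p), sequences, d)]

# Toy instance: "ACGT" is planted exactly once and with one mismatch twice.
sequences = ["TTACGTTT", "GGAGGTCC", "CCACTTAA"]
motifs = brute_force_pms(sequences, l=4, d=1)
print("ACGT" in motifs)  # True
```

The exponential cost of this enumeration is why instances such as (50, 21) require heuristic or specialised exact algorithms.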
Affiliation(s)
- P Theepalakshmi
- Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu, India.
- U Srinivasulu Reddy
- Machine Learning and Data Analytics Lab, Center of Excellence in Artificial Intelligence, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu, India
7
Li JY, Jin S, Tu XM, Ding Y, Gao G. Identifying complex motifs in massive omics data with a variable-convolutional layer in deep neural network. Brief Bioinform 2021; 22:6312656. [PMID: 34219140 DOI: 10.1093/bib/bbab233] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/25/2021] [Revised: 05/25/2021] [Accepted: 05/28/2021] [Indexed: 01/10/2023] Open
Abstract
Motif identification is among the most common and essential computational tasks for bioinformatics and genomics. Here we proposed a novel convolutional layer for deep neural network, named variable convolutional (vConv) layer, for effective motif identification in high-throughput omics data by learning kernel length from data adaptively. Empirical evaluations on DNA-protein binding and DNase footprinting cases well demonstrated that vConv-based networks have superior performance to their convolutional counterparts regardless of model complexity. Meanwhile, vConv could be readily integrated into multi-layer neural networks as an 'in-place replacement' of canonical convolutional layer. All source codes are freely available on GitHub for academic usage.
Affiliation(s)
- Jing-Yi Li
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Shen Jin
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Xin-Ming Tu
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Yang Ding
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Ge Gao
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
8
Soares MAF, Soares DS, Teixeira V, Heskol A, Bressan RB, Pollard SM, Oliveira RA, Castro DS. Hierarchical reactivation of transcription during mitosis-to-G1 transition by Brn2 and Ascl1 in neural stem cells. Genes Dev 2021; 35:1020-1034. [PMID: 34168041 PMCID: PMC8247608 DOI: 10.1101/gad.348174.120] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 12/15/2020] [Accepted: 05/19/2021] [Indexed: 12/19/2022]
Abstract
During mitosis, chromatin condensation is accompanied by a global arrest of transcription. Recent studies suggest transcriptional reactivation upon mitotic exit occurs in temporally coordinated waves, but the underlying regulatory principles have yet to be elucidated. In particular, the contribution of sequence-specific transcription factors (TFs) remains poorly understood. Here we report that Brn2, an important regulator of neural stem cell identity, associates with condensed chromatin throughout cell division, as assessed by live-cell imaging of proliferating neural stem cells. In contrast, the neuronal fate determinant Ascl1 dissociates from mitotic chromosomes. ChIP-seq analysis reveals that Brn2 mitotic chromosome binding does not result in sequence-specific interactions prior to mitotic exit, relying mostly on electrostatic forces. Nevertheless, surveying active transcription using single-molecule RNA-FISH against immature transcripts reveals differential reactivation kinetics for key targets of Brn2 and Ascl1, with transcription onset detected in early (anaphase) versus late (early G1) phases, respectively. Moreover, by using a mitotic-specific dominant-negative approach, we show that competing with Brn2 binding during mitotic exit reduces the transcription of its target gene Nestin. Our study shows an important role for differential binding of TFs to mitotic chromosomes, governed by their electrostatic properties, in defining the temporal order of transcriptional reactivation during mitosis-to-G1 transition.
Affiliation(s)
- Mário A F Soares
- Instituto Gulbenkian de Ciência, 2780-156 Oeiras, Portugal
- i3S Instituto de Investigação e Inovação em Saúde, IBMC Instituto de Biologia Molecular e Celular, Universidade do Porto, 4200-135 Porto, Portugal
- Diogo S Soares
- Instituto Gulbenkian de Ciência, 2780-156 Oeiras, Portugal
- i3S Instituto de Investigação e Inovação em Saúde, IBMC Instituto de Biologia Molecular e Celular, Universidade do Porto, 4200-135 Porto, Portugal
- Vera Teixeira
- Instituto Gulbenkian de Ciência, 2780-156 Oeiras, Portugal
- Abeer Heskol
- Instituto Gulbenkian de Ciência, 2780-156 Oeiras, Portugal
- i3S Instituto de Investigação e Inovação em Saúde, IBMC Instituto de Biologia Molecular e Celular, Universidade do Porto, 4200-135 Porto, Portugal
- Raul Bardini Bressan
- Centre for Regenerative Medicine, Institute for Regeneration and Repair, University of Edinburgh, Edinburgh EH16 4UU, United Kingdom
- Steven M Pollard
- Centre for Regenerative Medicine, Institute for Regeneration and Repair, University of Edinburgh, Edinburgh EH16 4UU, United Kingdom
- Diogo S Castro
- Instituto Gulbenkian de Ciência, 2780-156 Oeiras, Portugal
- i3S Instituto de Investigação e Inovação em Saúde, IBMC Instituto de Biologia Molecular e Celular, Universidade do Porto, 4200-135 Porto, Portugal
9
Ge W, Meier M, Roth C, Söding J. Bayesian Markov models improve the prediction of binding motifs beyond first order. NAR Genom Bioinform 2021; 3:lqab026. [PMID: 33928244 PMCID: PMC8057495 DOI: 10.1093/nargab/lqab026] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/02/2020] [Revised: 03/11/2021] [Accepted: 03/30/2021] [Indexed: 12/13/2022] Open
Abstract
Transcription factors (TFs) regulate gene expression by binding to specific DNA motifs. Accurate models for predicting binding affinities are crucial for a quantitative understanding of transcriptional regulation. Motifs are commonly described by position weight matrices, which assume that each position contributes independently to the binding energy. Models that can learn dependencies between positions, for instance, induced by DNA structure preferences, have yielded markedly improved predictions for most TFs on in vivo data. However, they are more prone to overfit the data and to learn patterns merely correlated with rather than directly involved in TF binding. We present an improved, faster version of our Bayesian Markov model software, BaMMmotif2. We tested it with state-of-the-art motif discovery tools on a large collection of ChIP-seq and HT-SELEX datasets. BaMMmotif2 models of fifth order achieved a median false-discovery-rate-averaged recall 13.6% and 12.2% higher than the next best tool on 427 ChIP-seq datasets and 164 HT-SELEX datasets, respectively, while being 8 to 1000 times faster. BaMMmotif2 models showed no signs of overtraining in cross-cell line and cross-platform tests, with similar improvements over the next-best tool. These results demonstrate that dependencies beyond first order clearly improve binding models for most TFs.
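The independence assumption of a position weight matrix, which the abstract contrasts with higher-order models, can be made explicit in a few lines: the score of a k-mer is a plain sum of per-position log-odds terms. This is a generic textbook sketch, not BaMMmotif2; the example sites, the pseudocount, and the uniform background of 0.25 are assumptions.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def score(pwm, kmer):
    """Sum of independent per-position contributions (no dependencies)."""
    return sum(column[base] for column, base in zip(pwm, kmer))

def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))

# Hypothetical aligned binding sites, used only to populate the matrix.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pwm = build_pwm(sites)
top_score, top_start = best_hit(pwm, "CCCTGACTCAGGG")
print(top_start)  # 3
```

A higher-order Markov model would replace each per-position term with one conditioned on the preceding bases, which is what allows dependencies between positions to be learned.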
Affiliation(s)
- Wanwan Ge
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Markus Meier
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Christian Roth
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Johannes Söding
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
10
Monteiro FA, Miranda RM, Samina MC, Dias AF, Raposo AASF, Oliveira P, Reguenga C, Castro DS, Lima D. Tlx3 Exerts Direct Control in Specifying Excitatory Over Inhibitory Neurons in the Dorsal Spinal Cord. Front Cell Dev Biol 2021; 9:642697. [PMID: 33996801 PMCID: PMC8117147 DOI: 10.3389/fcell.2021.642697] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/16/2020] [Accepted: 03/30/2021] [Indexed: 11/28/2022] Open
Abstract
The spinal cord dorsal horn is a major station for integration and relay of somatosensory information and comprises both excitatory and inhibitory neuronal populations. The homeobox gene Tlx3 acts as a selector gene to control the development of late-born excitatory (dILB) neurons by specifying glutamatergic transmitter fate in dorsal spinal cord. However, since Tlx3 direct transcriptional targets remain largely unknown, it remains to be uncovered how Tlx3 functions to promote excitatory cell fate. Here we combined a genomics approach based on chromatin immunoprecipitation followed by next generation sequencing (ChIP-seq) and expression profiling, with validation experiments in Tlx3 null embryos, to characterize the transcriptional program of Tlx3 in mouse embryonic dorsal spinal cord. We found most dILB neuron specific genes previously identified to be directly activated by Tlx3. Surprisingly, we found Tlx3 also directly represses many genes associated with the alternative inhibitory dILA neuronal fate. In both cases, direct targets include transcription factors and terminal differentiation genes, showing that Tlx3 directly controls cell identity at distinct levels. Our findings provide a molecular frame for the master regulatory role of Tlx3 in developing glutamatergic dILB neurons. In addition, they suggest a novel function for Tlx3 as a direct repressor of GABAergic dILA identity, pointing to how the generation of the two alternative cell fates is tightly coupled.
Affiliation(s)
- Filipe A Monteiro
- Unidade de Biologia Experimental, Departamento de Biomedicina, Faculdade de Medicina da Universidade do Porto, Porto, Portugal; Pain Research Group, Instituto de Biologia Molecular e Celular, Porto, Portugal; Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Porto, Portugal
- Rafael M Miranda
- Unidade de Biologia Experimental, Departamento de Biomedicina, Faculdade de Medicina da Universidade do Porto, Porto, Portugal; Pain Research Group, Instituto de Biologia Molecular e Celular, Porto, Portugal; Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Porto, Portugal
- Marta C Samina
- Unidade de Biologia Experimental, Departamento de Biomedicina, Faculdade de Medicina da Universidade do Porto, Porto, Portugal; Pain Research Group, Instituto de Biologia Molecular e Celular, Porto, Portugal; Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Porto, Portugal
- Ana F Dias
- Pain Research Group, Instituto de Biologia Molecular e Celular, Porto, Portugal; Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Porto, Portugal; Instituto de Ciências Biomédicas Abel Salazar, Universidade do Porto, Porto, Portugal
- Alexandre A S F Raposo
- Molecular Neurobiology Group, Instituto Gulbenkian de Ciência, Oeiras, Portugal; Instituto de Medicina Molecular João Lobo Antunes, Faculdade de Medicina da Universidade de Lisboa, Lisboa, Portugal
- Patrícia Oliveira
- Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Porto, Portugal; Diagnostics, Institute of Molecular Pathology and Immunology, University of Porto, Porto, Portugal
- Carlos Reguenga
- Unidade de Biologia Experimental, Departamento de Biomedicina, Faculdade de Medicina da Universidade do Porto, Porto, Portugal; Pain Research Group, Instituto de Biologia Molecular e Celular, Porto, Portugal; Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Porto, Portugal
- Diogo S Castro
- Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Porto, Portugal; Molecular Neurobiology Group, Instituto Gulbenkian de Ciência, Oeiras, Portugal; Stem Cells & Neurogenesis Group, Instituto de Biologia Molecular e Celular, Porto, Portugal
- Deolinda Lima
- Unidade de Biologia Experimental, Departamento de Biomedicina, Faculdade de Medicina da Universidade do Porto, Porto, Portugal; Pain Research Group, Instituto de Biologia Molecular e Celular, Porto, Portugal; Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Porto, Portugal
| |
|
11
|
Zou Y, Zhu Y, Li Y, Wu FX, Wang J. Parallel computing for genome sequence processing. Brief Bioinform 2021; 22:6210355. [PMID: 33822883 DOI: 10.1093/bib/bbab070] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/04/2020] [Revised: 01/26/2021] [Accepted: 02/10/2021] [Indexed: 01/08/2023] Open
Abstract
The rapid increase in genome data produced by gene sequencing technologies poses a massive challenge to data processing. To address the problems caused by enormous data volumes and complex computing requirements, researchers have proposed many methods and tools, which can be divided into three types: big data storage, efficient algorithm design and parallel computing. The purpose of this review is to investigate popular parallel programming technologies for genome sequence processing. Three common parallel computing models are introduced according to their hardware architectures; each is classified into two or three types and analyzed in terms of its features. Parallel computing for genome sequence processing is then discussed through four common applications: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing, and pattern detection and searching. For each application, its background is first introduced, and then a list of tools or algorithms is summarized in terms of principle, hardware platform and computing efficiency. The programming model of each hardware platform and application provides a reference for researchers choosing high-performance computing tools. Finally, we discuss the limitations and future trends of parallel computing technologies.
Affiliation(s)
- You Zou
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yuejie Zhu
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yaohang Li
- computer science at Old Dominion University, USA
| | - Fang-Xiang Wu
- College of Engineering and the Department of Computer Science at the University of Saskatchewan, Saskatoon, Canada
| | - Jianxin Wang
- School of Computer Science and Engineering at Central South University, Changsha, Hunan, China
| |
|
12
|
He Y, Shen Z, Zhang Q, Wang S, Huang DS. A survey on deep learning in DNA/RNA motif mining. Brief Bioinform 2020; 22:5916939. [PMID: 33005921 PMCID: PMC8293829 DOI: 10.1093/bib/bbaa229] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 07/18/2020] [Revised: 08/19/2020] [Accepted: 08/24/2020] [Indexed: 01/18/2023] Open
Abstract
DNA/RNA motif mining is the foundation of gene function research. It plays an extremely important role in identifying DNA- or RNA-protein binding sites, which helps to understand the mechanisms of gene regulation and management. For the past few decades, researchers have been working on designing new efficient and accurate algorithms for motif mining. These algorithms can be roughly divided into two categories: the enumeration approach and the probabilistic method. In recent years, machine learning methods have made great progress, and algorithms based on deep learning in particular have achieved good performance. Existing deep learning methods in motif mining can be roughly divided into three types of models: convolutional neural network (CNN) based models, recurrent neural network (RNN) based models, and hybrid CNN–RNN based models. We introduce the application of deep learning in the field of motif mining in terms of data preprocessing and the features of existing deep learning architectures, and we compare the differences between the basic deep learning models. Through the analysis and comparison of existing deep learning methods, we found that more complex models tend to perform better than simple ones when data are sufficient, and that the current methods are relatively simple compared with those in other fields such as computer vision, natural language processing (NLP) and computer games. Therefore, it is necessary to summarize motif mining by deep learning, which can help researchers understand this field.
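As an illustration of the data preprocessing and CNN-based models mentioned in this abstract, the following is a minimal sketch, not the authors' code. It assumes the common convention of one-hot encoding DNA over the alphabet A, C, G, T and shows what a single convolutional filter followed by ReLU and global max pooling computes. The function names, the handling of unknown bases and the toy "TATA" filter are hypothetical choices made for this example.

```python
from typing import List

ALPHABET = "ACGT"


def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA string as a len(seq) x 4 matrix.

    Assumption for this sketch: any base outside ACGT (e.g. N) is
    encoded as 0.25 in every channel.
    """
    rows = []
    for base in seq.upper():
        if base in ALPHABET:
            rows.append([1.0 if base == b else 0.0 for b in ALPHABET])
        else:
            rows.append([0.25] * 4)
    return rows


def conv_scan(encoded: List[List[float]], kernel: List[List[float]]) -> List[float]:
    """Slide a k x 4 kernel along the sequence (valid 1D convolution, no padding)."""
    k = len(kernel)
    scores = []
    for start in range(len(encoded) - k + 1):
        score = 0.0
        for i in range(k):
            for j in range(4):
                score += encoded[start + i][j] * kernel[i][j]
        scores.append(score)
    return scores


def relu_max_pool(scores: List[float]) -> float:
    """ReLU followed by global max pooling; returns 0.0 for an empty scan."""
    return max((max(0.0, s) for s in scores), default=0.0)


# Toy filter that rewards the hypothetical motif "TATA".
toy_kernel = one_hot("TATA")
scores = conv_scan(one_hot("GGTATACC"), toy_kernel)
print(scores)                 # [2.0, 0.0, 4.0, 0.0, 2.0]
print(relu_max_pool(scores))  # 4.0
```

In a trained CNN the kernel weights are learned rather than fixed, but the scan itself has the same form, which is why first-layer filters are often interpreted as motif detectors.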
Affiliation(s)
- Ying He
- computer science and technology at Tongji University, China
| | - Zhen Shen
- computer science and technology at Tongji University, China
| | - Qinhu Zhang
- computer science and technology at Tongji University, China
| | - Siguo Wang
- computer science and technology at Tongji University, China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, Tongji University
| |
|
13
|
Transcription factor expression defines subclasses of developing projection neurons highly similar to single-cell RNA-seq subtypes. Proc Natl Acad Sci U S A 2020; 117:25074-25084. [PMID: 32948690 DOI: 10.1073/pnas.2008013117] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 01/09/2023] Open
Abstract
We are only just beginning to catalog the vast diversity of cell types in the cerebral cortex. Such categorization is a first step toward understanding how diversification relates to function. All cortical projection neurons arise from a uniform pool of progenitor cells that lines the ventricles of the forebrain. It is still unclear how these progenitor cells generate the more than 50 unique types of mature cortical projection neurons defined by their distinct gene-expression profiles. Moreover, exactly how and when neurons diversify their function during development is unknown. Here we relate gene expression and chromatin accessibility of two subclasses of projection neurons with divergent morphological and functional features as they develop in the mouse brain between embryonic day 13 and postnatal day 5 in order to identify transcriptional networks that diversify neuron cell fate. We compare these gene-expression profiles with published profiles of single cells isolated from similar populations and establish that layer-defined cell classes encompass cell subtypes and developmental trajectories identified using single-cell sequencing. Given the depth of our sequencing, we identify groups of transcription factors with particularly dense subclass-specific regulation and subclass-enriched transcription factor binding motifs. We also describe transcription factor-adjacent long noncoding RNAs that define each subclass and validate the function of Myt1l in balancing the ratio of the two subclasses in vitro. Our multidimensional approach supports an evolving model of progressive restriction of cell fate competence through inherited transcriptional identities.
|
14
|
Ke L, Yang DC, Wang Y, Ding Y, Gao G. AnnoLnc2: the one-stop portal to systematically annotate novel lncRNAs for human and mouse. Nucleic Acids Res 2020; 48:W230-W238. [PMID: 32406920 PMCID: PMC7319567 DOI: 10.1093/nar/gkaa368] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 03/12/2020] [Revised: 04/21/2020] [Accepted: 04/29/2020] [Indexed: 12/15/2022] Open
Abstract
With the abundant mammalian lncRNAs identified recently, a comprehensive annotation resource for these novel lncRNAs is an urgent need. Since its first release in November 2016, AnnoLnc has been the only online server for comprehensively annotating novel human lncRNAs on-the-fly. Here, with significant updates to multiple annotation modules, backend datasets and the code base, AnnoLnc2 continues the effort to provide the scientific community with a one-stop online portal for systematically annotating novel human and mouse lncRNAs with a comprehensive functional spectrum covering sequences, structure, expression, regulation, genetic association and evolution. In response to numerous requests from multiple users, a standalone package is also provided for large-scale offline analysis. We believe that updated AnnoLnc2 (http://annolnc.gao-lab.org/) will help both computational and bench biologists identify lncRNA functions and investigate underlying mechanisms.
Affiliation(s)
- Lan Ke
- School of Life Sciences, Biomedical Pioneering Innovation Center (BIOPIC) & Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI) and State Key Laboratory of Protein and Plant Gene Research, Peking University, Beijing 100871, China
| | - De-Chang Yang
- School of Life Sciences, Biomedical Pioneering Innovation Center (BIOPIC) & Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI) and State Key Laboratory of Protein and Plant Gene Research, Peking University, Beijing 100871, China
| | - Yu Wang
- School of Life Sciences, Biomedical Pioneering Innovation Center (BIOPIC) & Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI) and State Key Laboratory of Protein and Plant Gene Research, Peking University, Beijing 100871, China
| | - Yang Ding
- Beijing Institute of Radiation Medicine, Beijing 100850, China
| | - Ge Gao
- School of Life Sciences, Biomedical Pioneering Innovation Center (BIOPIC) & Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI) and State Key Laboratory of Protein and Plant Gene Research, Peking University, Beijing 100871, China
| |
|
15
|
Singh R, Lanchantin J, Robins G, Qi Y. Transfer String Kernel for Cross-Context DNA-Protein Binding Prediction. IEEE/ACM Trans Comput Biol Bioinform 2019; 16:1524-1536. [PMID: 27654939 DOI: 10.1109/tcbb.2016.2609918] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/06/2023]
Abstract
Through sequence-based classification, this paper tries to accurately predict the DNA binding sites of transcription factors (TFs) in an unannotated cellular context. Related methods in the literature fail to perform such predictions accurately, since they do not consider sample distribution shift of sequence segments from an annotated (source) context to an unannotated (target) context. We, therefore, propose a method called "Transfer String Kernel" (TSK) that achieves improved prediction of transcription factor binding site (TFBS) using knowledge transfer via cross-context sample adaptation. TSK maps sequence segments to a high-dimensional feature space using a discriminative mismatch string kernel framework. In this high-dimensional space, labeled examples of the source context are re-weighted so that the revised sample distribution matches the target context more closely. We have experimentally verified TSK for TFBS identifications on 14 different TFs under a cross-organism setting. We find that TSK consistently outperforms the state-of-the-art TFBS tools, especially when working with TFs whose binding sequences are not conserved across contexts. We also demonstrate the generalizability of TSK by showing its cutting-edge performance on a different set of cross-context tasks for the MHC peptide binding predictions.
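The mismatch string kernel that this abstract builds on can be sketched as follows. This is a minimal, unoptimized illustration of the standard (k, m)-mismatch kernel, in which every k-mer of a sequence contributes to all k-mers within Hamming distance m. It is not the TSK implementation and does not include the cross-context sample re-weighting step. The function names and the default values of k and m are assumptions for the example.

```python
import math
from collections import Counter
from itertools import combinations, product

ALPHABET = "ACGT"


def neighbors(kmer: str, m: int) -> set:
    """All k-mers within Hamming distance <= m of `kmer`."""
    result = {kmer}
    for d in range(1, m + 1):
        for positions in combinations(range(len(kmer)), d):
            for letters in product(ALPHABET, repeat=d):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                # Substituting the same letter yields a closer neighbour,
                # which the set simply deduplicates.
                result.add("".join(chars))
    return result


def mismatch_features(seq: str, k: int, m: int) -> Counter:
    """Sparse feature map: counts of k-mers reachable with <= m mismatches."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        for nb in neighbors(seq[i:i + k], m):
            feats[nb] += 1
    return feats


def mismatch_kernel(x: str, y: str, k: int = 3, m: int = 1) -> int:
    """Inner product of the two mismatch feature maps."""
    fx, fy = mismatch_features(x, k, m), mismatch_features(y, k, m)
    return sum(count * fy[kmer] for kmer, count in fx.items())


def normalized_kernel(x: str, y: str, k: int = 3, m: int = 1) -> float:
    """Cosine-style normalization so that K(x, x) == 1 for non-trivial x."""
    denom = math.sqrt(mismatch_kernel(x, x, k, m) * mismatch_kernel(y, y, k, m))
    return mismatch_kernel(x, y, k, m) / denom if denom else 0.0


print(mismatch_kernel("ACG", "ACT", k=3, m=0))  # 0: no exact shared 3-mer
print(mismatch_kernel("ACG", "ACT", k=3, m=1))  # 4: ACA, ACC, ACG, ACT are shared
```

Explicit neighbourhood enumeration grows quickly with k and m, so practical implementations use trie-based traversal instead. The explicit form is shown here only because it makes the feature space easy to inspect.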
|
16
|
Sun CX, Yang Y, Wang H, Wang WH. A Clustering Approach for Motif Discovery in ChIP-Seq Dataset. Entropy (Basel) 2019; 21:E802. [PMID: 33267515 PMCID: PMC7515331 DOI: 10.3390/e21080802] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/06/2019] [Revised: 08/04/2019] [Accepted: 08/15/2019] [Indexed: 12/25/2022]
Abstract
Chromatin immunoprecipitation combined with next-generation sequencing (ChIP-Seq) technology has enabled the identification of transcription factor binding sites (TFBSs) on a genome-wide scale. To effectively and efficiently discover TFBSs in the thousand or more DNA sequences generated by a ChIP-Seq data set, we propose a new algorithm named AP-ChIP. First, we set two thresholds based on probabilistic analysis to construct and further filter the cluster subsets. Then, we use Affinity Propagation (AP) clustering on the candidate cluster subsets to find the potential motifs. Experimental results on simulated data show that the AP-ChIP algorithm predicts TFBSs almost accurately in a reasonable time. The validity of the AP-ChIP algorithm is also tested on a real ChIP-Seq data set.
Affiliation(s)
- Chun-xiao Sun
- College of Science, Northwest A&F University, Yangling 712100, China
| | - Yu Yang
- School of Computer Science, Pingdingshan University, Pingdingshan 467000, China
- School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Hua Wang
- College of Software, Nankai University, Tianjin 300071, China
- Department of Mathematical Sciences, Georgia Southern University, Statesboro, GA 30460, USA
| | - Wen-hu Wang
- School of Computer Science, Pingdingshan University, Pingdingshan 467000, China
| |
|
17
|
Rozenberg JM, Taylor JM, Mack CP. RBPJ binds to consensus and methylated cis elements within phased nucleosomes and controls gene expression in human aortic smooth muscle cells in cooperation with SRF. Nucleic Acids Res 2019; 46:8232-8244. [PMID: 29931229 PMCID: PMC6144787 DOI: 10.1093/nar/gky562] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 10/10/2017] [Accepted: 06/07/2018] [Indexed: 11/15/2022] Open
Abstract
Given our previous demonstration that RBPJ binds a methylated repressor element and regulates smooth muscle cell (SMC)-specific gene expression, we used genome-wide approaches to identify RBPJ binding regions in human aortic SMC and to assess RBPJ's effects on chromatin structure and gene expression. RBPJ bound to consensus cis elements, but also to TCmGGGA sequences within Alu repeats that were less transcriptionally active as assessed by DNase hypersensitivity, H3K9 acetylation, and Notch3 and RNA Pol II binding. Interestingly, RBPJ binding was frequently detected at the borders of open chromatin, and a large fraction of genes induced or repressed by RBPJ depletion were associated with this cluster of RBPJ binding sites. RBPJ binding dramatically co-localized with serum response factor (SRF), and RNA-seq experiments in RBPJ- and SRF-depleted SMC demonstrated that these factors interact functionally to regulate the contraction and inflammatory gene programs that help define SMC phenotype. Finally, we showed that RBPJ bound preferentially to phased nucleosomes independent of active chromatin marks and to cis elements positioned at the beginning and middle of the nucleosome dyad. These novel findings add important insight into RBPJ's role in chromatin structure and gene expression in SMC.
Affiliation(s)
- Julian M Rozenberg
- Department of Pathology, University of North Carolina, Chapel Hill, NC 27599, USA
| | - Joan M Taylor
- Department of Pathology, University of North Carolina, Chapel Hill, NC 27599, USA
| | - Christopher P Mack
- Department of Pathology, University of North Carolina, Chapel Hill, NC 27599, USA
| |
|
18
|
Hashim FA, Mabrouk MS, Al-Atabany W. Review of Different Sequence Motif Finding Algorithms. Avicenna J Med Biotechnol 2019; 11:130-148. [PMID: 31057715 PMCID: PMC6490410] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2018] [Accepted: 05/26/2018] [Indexed: 11/05/2022] Open
Abstract
DNA motif discovery is a primary step in many systems for studying gene function. Motif discovery plays a vital role in the identification of Transcription Factor Binding Sites (TFBSs), which help in learning the mechanisms that regulate gene expression. Over the past decades, different algorithms have been used to design fast and accurate motif discovery tools. These algorithms are generally classified into consensus or probabilistic approaches, many of which are time-consuming and easily trapped in a local optimum. Nature-inspired algorithms and many combinatorial algorithms have recently been proposed to overcome these problems. This paper presents a general classification of motif discovery algorithms with new sub-categories that facilitate building a successful motif discovery algorithm. It also presents a summary comparison between them.
Affiliation(s)
- Fatma A. Hashim
- Department of Biomedical Engineering, Helwan University, Egypt
| | - Mai S. Mabrouk
- Department of Biomedical Engineering, Misr University for Science and Technology (MUST), Egypt
| | | |
|
19
|
Hashim FA, Mabrouk MS, Atabany WA. Comparative Analysis of DNA Motif Discovery Algorithms: A Systemic Review. Curr Cancer Ther Rev 2019. [DOI: 10.2174/1573394714666180417161728] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
Abstract
Background: Bioinformatics is an interdisciplinary field that combines biology and information technology to study how to handle biological data. The DNA motif discovery problem is a main challenge of genome biology, and its importance grows with sequencing technologies that produce ever larger amounts of data. A DNA motif is a repeated portion of DNA sequence of major biological interest, with important structural and functional features. Motif discovery plays a vital role in antibody-biomarker identification, which is useful for the diagnosis of disease, and in identifying Transcription Factor Binding Sites (TFBSs), which help in learning the mechanisms that regulate gene expression. Recently, scientists discovered that the TFs have a mutation rate five times higher than the flanking sequences, so motif discovery also has a crucial role in cancer discovery.
Methods: Over the past decades, many attempts have used different algorithms to design fast and accurate motif discovery tools. These algorithms are generally classified into consensus or probabilistic approaches.
Results: Many DNA motif discovery algorithms are time-consuming and easily trapped in a local optimum.
Conclusion: Nature-inspired algorithms and many combinatorial algorithms have recently been proposed to overcome the problems of consensus and probabilistic approaches. This paper presents a general classification of motif discovery algorithms with new sub-categories. It also presents a summary comparison between them.
Affiliation(s)
- Fatma A. Hashim
- Department of Biomedical Engineering, Helwan University, Helwan, Egypt
| | - Mai S. Mabrouk
- Department of Biomedical Engineering, Misr University for Science and Technology (MUST), Cairo, Egypt
| | | |
|
20
|
Tran NTL, Huang CH. Performance evaluation for MOTIFSIM. Biol Proced Online 2018; 20:23. [PMID: 30574025 PMCID: PMC6299673 DOI: 10.1186/s12575-018-0088-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2018] [Accepted: 12/07/2018] [Indexed: 11/10/2022] Open
Abstract
Background: Previous studies show that different motif finders produce different results for an identical dataset. This is largely because these tools use different strategies and possess unique features for discovering motifs. Hence, using multiple tools and methods has been suggested, because the motifs they commonly report are more likely to be biologically significant. Results: The common significant motifs from multiple tools can be obtained with the MOTIFSIM tool. In this work, we evaluated the performance of MOTIFSIM in three aspects. First, we compared the pair-wise comparison technique of MOTIFSIM with the un-gapped Smith-Waterman algorithm and four common distance metrics: average Kullback-Leibler, average log-likelihood ratio, Chi-Square distance, and Pearson Correlation Coefficient. Second, we compared the performance of MOTIFSIM with the RSAT Matrix-clustering tool for motif clustering. Lastly, we evaluated the performance of nineteen motif finders and the reliability of MOTIFSIM for identifying the common significant motifs from multiple tools. Conclusions: The pair-wise comparison results reveal that MOTIFSIM attains better performance than the un-gapped Smith-Waterman algorithm and the four distance metrics. The clustering results also demonstrate that MOTIFSIM achieves similar or even better performance than RSAT Matrix-clustering. Furthermore, the findings indicate that, when motif detection does not require a special tool for a specific type of motif, using multiple motif finders and combining them with MOTIFSIM to obtain the common significant motifs improves the results of DNA motif detection. Electronic supplementary material: The online version of this article (10.1186/s12575-018-0088-3) contains supplementary material, which is available to authorized users.
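Three of the column-wise distance metrics named in this abstract can be sketched as below for two aligned position probability matrices of equal width. This is an illustrative sketch only and does not reproduce MOTIFSIM or the exact definitions used in the paper. In particular, the symmetrized form of the Kullback-Leibler divergence, the pseudocount `eps`, the simple chi-square distance formula and the averaging over columns are assumptions made for the example.

```python
import math
from typing import Callable, List

Column = List[float]  # probabilities for A, C, G, T at one motif position


def pearson(p: Column, q: Column) -> float:
    """Pearson correlation between two columns; 0.0 if either is constant."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    vp = sum((a - mp) ** 2 for a in p)
    vq = sum((b - mq) ** 2 for b in q)
    if vp == 0 or vq == 0:
        return 0.0
    return cov / math.sqrt(vp * vq)


def avg_kl(p: Column, q: Column, eps: float = 1e-9) -> float:
    """Symmetrized KL divergence, (KL(p||q) + KL(q||p)) / 2, with a pseudocount."""
    kl_pq = sum((a + eps) * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))
    kl_qp = sum((b + eps) * math.log((b + eps) / (a + eps)) for a, b in zip(p, q))
    return (kl_pq + kl_qp) / 2


def chi_square(p: Column, q: Column) -> float:
    """Chi-square distance: sum of (a - b)^2 / (a + b) over non-empty cells."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def compare_motifs(m1: List[Column], m2: List[Column],
                   metric: Callable[[Column, Column], float]) -> float:
    """Average a column metric over two already-aligned, equal-width motifs."""
    if len(m1) != len(m2) or not m1:
        raise ValueError("motifs must be non-empty and of equal width")
    return sum(metric(c1, c2) for c1, c2 in zip(m1, m2)) / len(m1)


# Two hypothetical 2-column motifs that agree on the second position only.
motif_a = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
motif_b = [[0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
for name, metric in [("pearson", pearson), ("avg_kl", avg_kl), ("chi2", chi_square)]:
    print(name, round(compare_motifs(motif_a, motif_b, metric), 4))
```

Note that Pearson correlation is a similarity (higher is closer) while the other two are distances (lower is closer), so scores from different metrics are not directly comparable without rescaling.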
Affiliation(s)
- Ngoc Tam L Tran
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269 USA
| | - Chun-Hsi Huang
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269 USA
| |
|
21
|
Atypical GATA transcription factor TRPS1 represses gene expression by recruiting CHD4/NuRD(MTA2) and suppresses cell migration and invasion by repressing TP63 expression. Oncogenesis 2018; 7:96. [PMID: 30563971 PMCID: PMC6299095 DOI: 10.1038/s41389-018-0108-9] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 08/25/2018] [Revised: 10/30/2018] [Accepted: 11/26/2018] [Indexed: 01/10/2023] Open
Abstract
Transcriptional repressor GATA binding 1 (TRPS1), an atypical GATA transcription factor, functions as a transcriptional repressor and is also implicated in human cancers. However, the underlying mechanism by which TRPS1 contributes to malignancy remains obscure. In the current study, we report that TRPS1 recognizes both gene proximal and distal transcription start site (TSS) sequences to repress gene expression. Co-IP mass spectrometry and biochemical studies showed that TRPS1 binds to CHD4/NuRD(MTA2). Genome-wide and molecular studies revealed that CHD4/NuRD(MTA2) is required for TRPS1 transcriptional repression. Mechanistically, TRPS1 and CHD4/NuRD(MTA2) form a precision-guided transcriptional repression machinery in which TRPS1 guides the machinery to specific target sites by recognizing GATA elements, and CHD4/NuRD(MTA2) represses the transcription of target genes. Furthermore, TP63 was identified and validated as a direct target of the TRPS1-CHD4/NuRD(MTA2) complex, which represses TP63 expression by decommissioning the TP63 enhancer in the described precision-guided manner, leading to a reduction of the ΔNp63 level and contributing to migration and invasion of cancer cells.
|
22
|
Malik V, Zimmer D, Jauch R. Diversity among POU transcription factors in chromatin recognition and cell fate reprogramming. Cell Mol Life Sci 2018; 75:1587-1612. [PMID: 29335749 PMCID: PMC11105716 DOI: 10.1007/s00018-018-2748-5] [Citation(s) in RCA: 55] [Impact Index Per Article: 7.9] [Received: 10/04/2017] [Revised: 12/23/2017] [Accepted: 01/08/2018] [Indexed: 12/28/2022]
Abstract
The POU (Pit-Oct-Unc) protein family is an evolutionarily ancient group of transcription factors (TFs) that bind specific DNA sequences to direct gene expression programs. The fundamental importance of POU TFs in orchestrating embryonic development and directing cellular fate decisions is well established, but the molecular basis for this activity is insufficiently understood. POU TFs possess a bipartite 'two-in-one' DNA binding domain consisting of two independently folding structural units connected by a poorly conserved and flexible linker. Therefore, they represent a paradigmatic example for studying the molecular basis of the functional versatility of TFs. Their modular architecture endows POU TFs with the capacity to accommodate alternative composite DNA sequences by adopting different quaternary structures. Moreover, associations with partner proteins crucially influence the selection of their DNA binding sites. The plenitude of DNA binding modes confers on POU TFs the ability to regulate distinct genes in the context of different cellular environments. Likewise, different binding modes of POU proteins to DNA could trigger alternative regulatory responses in the context of different genomic locations of the same cell. Prominent POU TFs such as Oct4, Brn2, Oct6 and Brn4 are not only essential regulators of development but have also been successfully employed to reprogram somatic cells to pluripotency and neural lineages. Here we review biochemical, structural, genomic and cellular reprogramming studies to examine how the ability of POU TFs to select regulatory DNA, alone or with partner factors, is tied to their capacity to epigenetically remodel chromatin and drive specific regulatory programs that give cells their identities.
Affiliation(s)
- Vikas Malik
- CAS Key Laboratory of Regenerative Biology, Joint School of Life Sciences, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou Medical University, Guangzhou, 511436, China
- Genome Regulation Laboratory, Guangdong Provincial Key Laboratory of Stem Cell and Regenerative Medicine, South China Institute for Stem Cell Biology and Regenerative Medicine, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, 510530, China
| | - Dennis Zimmer
- CAS Key Laboratory of Regenerative Biology, Joint School of Life Sciences, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou Medical University, Guangzhou, 511436, China
- Genome Regulation Laboratory, Guangdong Provincial Key Laboratory of Stem Cell and Regenerative Medicine, South China Institute for Stem Cell Biology and Regenerative Medicine, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, 510530, China
| | - Ralf Jauch
- CAS Key Laboratory of Regenerative Biology, Joint School of Life Sciences, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou Medical University, Guangzhou, 511436, China
- Genome Regulation Laboratory, Guangdong Provincial Key Laboratory of Stem Cell and Regenerative Medicine, South China Institute for Stem Cell Biology and Regenerative Medicine, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, 510530, China
| |
|
23
|
Dang LT, Tondl M, Chiu MHH, Revote J, Paten B, Tano V, Tokolyi A, Besse F, Quaife-Ryan G, Cumming H, Drvodelic MJ, Eichenlaub MP, Hallab JC, Stolper JS, Rossello FJ, Bogoyevitch MA, Jans DA, Nim HT, Porrello ER, Hudson JE, Ramialison M. TrawlerWeb: an online de novo motif discovery tool for next-generation sequencing datasets. BMC Genomics 2018; 19:238. [PMID: 29621972 PMCID: PMC5887194 DOI: 10.1186/s12864-018-4630-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 01/30/2018] [Accepted: 03/27/2018] [Indexed: 12/14/2022] Open
Abstract
Background: A strong focus of the post-genomic era is mining of the non-coding regulatory genome in order to unravel the function of regulatory elements that coordinate gene expression (Nat 489:57–74, 2012; Nat 507:462–70, 2014; Nat 507:455–61, 2014; Nat 518:317–30, 2015). Whole-genome approaches based on next-generation sequencing (NGS) have provided insight into the genomic location of regulatory elements throughout different cell types, organs and organisms. These technologies are now widespread and commonly used in laboratories from various fields of research. This highlights the need for fast and user-friendly software tools dedicated to extracting the cis-regulatory information contained in these regulatory regions, for instance transcription factor binding site (TFBS) composition. Ideally, such tools should not require prior programming knowledge, to ensure they are accessible to all users. Results: We present TrawlerWeb, a web-based version of the Trawler_standalone tool (Nat Methods 4:563–5, 2007; Nat Protoc 5:323–34, 2010), which allows the identification of enriched motifs in DNA sequences obtained from next-generation sequencing experiments in order to predict their TFBS composition. TrawlerWeb is designed for online queries with standard options common to web-based motif discovery tools. In addition, TrawlerWeb provides three unique new features: 1) it allows the input of BED files directly generated from NGS experiments, 2) it automatically generates an input-matched biologically relevant background, and 3) it displays conservation scores for each instance of the motif found in the input sequences, which assists the researcher in prioritising the motifs to validate experimentally. Finally, to date, this web-based version of Trawler_standalone remains the fastest online de novo motif discovery tool compared to other popular web-based software, while generating predictions with high accuracy.
Conclusions: TrawlerWeb provides users with a fast, simple and easy-to-use web interface for de novo motif discovery. This will assist in rapidly analysing the NGS datasets that are now being routinely generated. TrawlerWeb is freely available and accessible at: http://trawler.erc.monash.edu.au. Electronic supplementary material: The online version of this article (10.1186/s12864-018-4630-0) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Louis T Dang
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Markus Tondl
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Man Ho H Chiu
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Jerico Revote
- eResearch, Monash University, Clayton, VIC, Australia
- Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Vincent Tano
- Department of Biochemistry and Molecular Biology, Bio21 Institute and Cell Signalling Research Laboratories, The University of Melbourne, Melbourne, VIC, Australia
- Alex Tokolyi
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Florence Besse
- CNRS, Inserm, Institute of Biology Valrose, Université Côte d'Azur, Parc Valrose, Nice, France
- Greg Quaife-Ryan
- School of Biomedical Sciences, The University of Queensland, QLD, Brisbane, Australia
- Helen Cumming
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Monash University, Clayton, VIC, Australia
- Mark J Drvodelic
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Michael P Eichenlaub
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Jeannette C Hallab
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Julian S Stolper
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Fernando J Rossello
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Marie A Bogoyevitch
- Department of Biochemistry and Molecular Biology, Bio21 Institute and Cell Signalling Research Laboratories, The University of Melbourne, Melbourne, VIC, Australia
- David A Jans
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia
- Hieu T Nim
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia; Faculty of Information Technology, Monash University, Clayton, VIC, Australia
- Enzo R Porrello
- Murdoch Children's Research Institute, The Royal Children's Hospital, Parkville, VIC, Australia; Department of Physiology, School of Biomedical Sciences, The University of Melbourne, Parkville, VIC, Australia
- James E Hudson
- School of Biomedical Sciences, The University of Queensland, QLD, Brisbane, Australia
- Mirana Ramialison
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
24
Mistri TK, Arindrarto W, Ng WP, Wang C, Lim LH, Sun L, Chambers I, Wohland T, Robson P. Dynamic changes in Sox2 spatio-temporal expression promote the second cell fate decision through Fgf4/Fgfr2 signaling in preimplantation mouse embryos. Biochem J 2018; 475:1075-1089. [PMID: 29487166 PMCID: PMC5896025 DOI: 10.1042/bcj20170418] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 05/26/2017] [Revised: 02/26/2018] [Accepted: 02/27/2018] [Indexed: 12/22/2022]
Abstract
Oct4 and Sox2 regulate the expression of target genes such as Nanog, Fgf4, and Utf1 by binding to their respective regulatory motifs. Their functional cooperation is reflected in their ability to heterodimerize on adjacent cis-regulatory motifs, the composite Sox/Oct motif. Given that Oct4 and Sox2 regulate many developmental genes, a quantitative analysis of their synergistic action on different Sox/Oct motifs would yield valuable insights into the mechanisms of early embryonic development. In the present study, we measured binding affinities of Oct4 and Sox2 to different Sox/Oct motifs using fluorescence correlation spectroscopy. We found that the synergistic binding interaction is driven mainly by the level of Sox2 in the case of the Fgf4 Sox/Oct motif. Given that Sox2 expression levels fluctuate more than those of Oct4, our finding provides an explanation of how Sox2 controls the segregation of the epiblast and primitive endoderm populations within the inner cell mass of the developing rodent blastocyst.
Affiliation(s)
- Tapan Kumar Mistri
- School of Chemical Engineering and Physical Sciences, Lovely Professional University, Phagwara, Punjab 144411, India
- Department of Chemistry, National University of Singapore, Singapore
- Developmental Cellomics Laboratory, Genome Institute of Singapore, Singapore
- MRC Centre for Regenerative Medicine, Institute for Stem Cell Research, School of Biological Sciences, University of Edinburgh, Edinburgh EH16 4UU, U.K
- Wibowo Arindrarto
- Developmental Cellomics Laboratory, Genome Institute of Singapore, Singapore
- Wei Ping Ng
- Department of Chemistry, National University of Singapore, Singapore
- Developmental Cellomics Laboratory, Genome Institute of Singapore, Singapore
- Choayang Wang
- Developmental Cellomics Laboratory, Genome Institute of Singapore, Singapore
- Leng Hiong Lim
- Developmental Cellomics Laboratory, Genome Institute of Singapore, Singapore
- Lili Sun
- Developmental Cellomics Laboratory, Genome Institute of Singapore, Singapore
- Ian Chambers
- MRC Centre for Regenerative Medicine, Institute for Stem Cell Research, School of Biological Sciences, University of Edinburgh, Edinburgh EH16 4UU, U.K.
- Thorsten Wohland
- Department of Chemistry, National University of Singapore, Singapore
- Department of Biological Sciences, National University of Singapore, Singapore
- Centre for Bioimaging Sciences, National University of Singapore, Singapore
- Paul Robson
- Developmental Cellomics Laboratory, Genome Institute of Singapore, Singapore
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT 06032, U.S.A
25
Vishnevsky OV, Bocharnikov AV, Kolchanov NA. Argo_CUDA: Exhaustive GPU based approach for motif discovery in large DNA datasets. J Bioinform Comput Biol 2017; 16:1740012. [PMID: 29281953 DOI: 10.1142/s0219720017400121] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
The development of chromatin immunoprecipitation sequencing (ChIP-seq) technology has revolutionized the genetic analysis of the basic mechanisms underlying transcription regulation and led to the accumulation of a huge number of DNA sequences. Many web services are currently available for de novo motif discovery in datasets containing information about DNA/protein binding. The enormous diversity of motifs makes finding them challenging. To cope with this difficulty, researchers use various stochastic approaches. Unfortunately, the efficiency of motif discovery programs declines dramatically as the query set grows. As a result, either only a fraction of the top ChIP-seq peaks can be analyzed or the area of analysis has to be narrowed. Thus, motif discovery in massive datasets remains a challenging issue. The Argo_CUDA (Compute Unified Device Architecture) web service is designed to process massive DNA data. It is a program for the detection of degenerate oligonucleotide motifs of fixed length written in the 15-letter IUPAC code. Argo_CUDA is a fully exhaustive approach based on high-performance GPU technologies. Compared with existing motif discovery web services, Argo_CUDA shows good prediction quality on simulated sets. Analysis of ChIP-seq sequences revealed motifs that correspond to known transcription factor binding sites.
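To make the notion of a degenerate motif in the 15-letter IUPAC code concrete, the following is a minimal CPU-only sketch in Python. It is not Argo_CUDA's code and does not reproduce its exhaustive GPU search; the function names (`matches_at`, `find_motif`) are hypothetical and chosen only for this illustration.

```python
# Illustrative sketch only: matching a degenerate IUPAC motif against DNA.
# This is NOT Argo_CUDA's implementation; function names are assumptions.

# The 15 IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_at(motif, seq, pos):
    """Return True if the IUPAC motif matches seq starting at index pos."""
    return all(seq[pos + i] in IUPAC[code] for i, code in enumerate(motif))


def find_motif(motif, seq):
    """Return every start position at which the fixed-length motif matches."""
    last_start = len(seq) - len(motif)
    return [p for p in range(last_start + 1) if matches_at(motif, seq, p)]


# Example: TGANTCA matches both TGACTCA and TGAGTCA.
hits = find_motif("TGANTCA", "AATGACTCAGGTGAGTCAT")
print(hits)  # [2, 11]
```

An exhaustive search would score many such candidate motifs against every sequence, which is the part that benefits from GPU parallelism.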
Affiliation(s)
- Oleg V Vishnevsky
- Institute of Cytology and Genetics SB RAS, Lavrentieva Ave., 10, Novosibirsk 630090, Russia; Novosibirsk State University, Pirogova, 10, Novosibirsk 630090, Russia
- Nikolay A Kolchanov
- Institute of Cytology and Genetics SB RAS, Lavrentieva Ave., 10, Novosibirsk 630090, Russia; Novosibirsk State University, Pirogova, 10, Novosibirsk 630090, Russia
26
The role of Cdx2 as a lineage specific transcriptional repressor for pluripotent network during the first developmental cell lineage segregation. Sci Rep 2017; 7:17156. [PMID: 29214996 PMCID: PMC5719399 DOI: 10.1038/s41598-017-16009-w] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.8] [Received: 05/11/2017] [Accepted: 11/06/2017] [Indexed: 01/08/2023] Open
Abstract
The first cellular differentiation event in mouse development leads to the formation of the blastocyst consisting of the inner cell mass (ICM) and trophectoderm (TE). The transcription factor CDX2 is required for proper TE specification, where it promotes expression of TE genes and represses expression of Pou5f1 (OCT4). However, its downstream network in the developing embryo is not fully characterized. Here, we performed high-throughput single embryo qPCR analysis in Cdx2 null embryos to identify CDX2-regulated targets in vivo. To identify genes likely to be regulated by CDX2 directly, we performed CDX2 ChIP-Seq on trophoblast stem (TS) cells. In addition, we examined the dynamics of gene expression changes using inducible CDX2 embryonic stem (ES) cells, so that we could predict which CDX2-bound genes are activated or repressed by CDX2 binding. By integrating these data with observations of chromatin modifications, we identify putative novel regulatory elements that repress gene expression in a lineage-specific manner. Interestingly, we found CDX2 binding sites within regulatory elements of key pluripotent genes such as Pou5f1 and Nanog, pointing to the existence of a novel mechanism by which CDX2 maintains repression of OCT4 in trophoblast. Our study proposes a general mechanism in regulating lineage segregation during mammalian development.
27
Pei C, Wang SL, Fang J, Zhang W. GSMC: Combining Parallel Gibbs Sampling with Maximal Cliques for Hunting DNA Motif. J Comput Biol 2017; 24:1243-1253. [PMID: 29116820 DOI: 10.1089/cmb.2017.0100] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/01/2023] Open
Abstract
Regulatory elements are responsible for regulating gene transcription, and identifying them is a major challenge in the study of gene expression. Transcription factors (TFs) play a key role in gene regulation by binding to target promoter sequences. A set of conserved sequence patterns with a highly similar structure that is bound by a TF is called a motif. Motif discovery has been a difficult problem over the past decades, and it is a cornerstone of meeting this challenge. Recent advances in obtaining genomic sequences and high-throughput gene expression analysis techniques have enabled the rapid development of computational methods for motif discovery. As a result, a large number of motif-finding algorithms aimed at various motif models have appeared in the past few years. However, most of them are not suitable for analysis of the large data sets generated by next-generation sequencing. To better handle large-scale ChIP-Seq data and to improve both computation time and motif detection accuracy, we propose a motif-finding algorithm known as GSMC (Combining Parallel Gibbs Sampling with Maximal Cliques for hunting DNA Motif). The GSMC algorithm consists of two steps. First, we employ the commonly used Gibbs sampling to generate initial motifs. Second, we use maximal cliques to cluster motifs according to Similarity with Position Information Contents (SPIC). Consequently, we greatly raise the detection accuracy while maintaining comparable computational efficiency. In addition, we can find many more credible cofactor-interacting motifs.
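The second GSMC step, grouping candidate motifs by maximal cliques of a similarity graph, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not define SPIC, so a Hamming-distance threshold on consensus strings stands in for it, and the candidate motifs are made-up examples rather than Gibbs sampling output.

```python
# Illustrative sketch only: cluster candidate motifs via maximal cliques.
# ASSUMPTION: Hamming distance on consensus strings replaces the SPIC
# similarity used by GSMC, which the abstract does not define.

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def bron_kerbosch(r, p, x, adj, out):
    """Basic Bron-Kerbosch recursion; appends each maximal clique to out."""
    if not p and not x:
        out.append(sorted(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}
        x = x | {v}


def maximal_cliques(n, similar):
    """All maximal cliques of the graph on n nodes with edges where similar(u, v)."""
    adj = {v: {u for u in range(n) if u != v and similar(u, v)} for v in range(n)}
    out = []
    bron_kerbosch(set(), set(range(n)), set(), adj, out)
    return sorted(out)


# Hypothetical candidate motifs, e.g. consensus strings of initial motifs.
motifs = ["TGACTCA", "TGAGTCA", "TGACTCT", "GGGCGGG"]
clusters = maximal_cliques(
    len(motifs), lambda u, v: hamming(motifs[u], motifs[v]) <= 1
)
print(clusters)  # [[0, 1], [0, 2], [3]]
```

A motif may belong to more than one clique, so a real pipeline would still need a rule for merging or ranking overlapping clusters.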
Affiliation(s)
- Chao Pei
- College of Computer Science and Electronics Engineering, Hunan University, Changsha, China
- Shu-Lin Wang
- College of Computer Science and Electronics Engineering, Hunan University, Changsha, China
- Jianwen Fang
- Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Rockville, MD 20850
- Wei Zhang
- College of Computer Science and Electronics Engineering, Hunan University, Changsha, China
28
Tran NTL, Huang CH. MOTIFSIM 2.1: An Enhanced Software Platform for Detecting Similarity in Multiple DNA Motif Data Sets. J Comput Biol 2017. [PMID: 28632401 PMCID: PMC5610392 DOI: 10.1089/cmb.2017.0005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022] Open
Abstract
Finding binding site motifs plays an important role in bioinformatics, as it reveals the transcription factors that control gene expression. The development of motif finders has flourished in recent years, with many tools introduced to the research community. Although these tools possess exceptional features for detecting motifs, they report different results for an identical data set. Hence, using multiple tools is recommended, because motifs reported by several tools are likely to be biologically significant. However, the results from multiple tools need to be compared to obtain the common significant motifs. The MOTIFSIM web tool and command-line tool were developed for this purpose. In this work, we present several technical improvements as well as additional features to further support motif analysis in our new release, MOTIFSIM 2.1.
Affiliation(s)
- Ngoc Tam L Tran
- Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut
- Chun-Hsi Huang
- Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut
29
Spanier KI, Jansen M, Decaestecker E, Hulselmans G, Becker D, Colbourne JK, Orsini L, De Meester L, Aerts S. Conserved Transcription Factors Steer Growth-Related Genomic Programs in Daphnia. Genome Biol Evol 2017; 9:1821-1842. [PMID: 28854641 PMCID: PMC5569996 DOI: 10.1093/gbe/evx127] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Accepted: 07/11/2017] [Indexed: 02/06/2023] Open
Abstract
Ecological genomics aims to understand the functional association between environmental gradients and the genes underlying adaptive traits. Many genes that are identified by genome-wide screening in ecologically relevant species lack functional annotations. Although gene functions can be inferred from sequence homology, such approaches have limited power. Here, we introduce ecological regulatory genomics by presenting an ontology-free gene prioritization method. Specifically, our method combines transcriptome profiling with high-throughput cis-regulatory sequence analysis in the water fleas Daphnia pulex and Daphnia magna. It screens coexpressed genes for overrepresented DNA motifs that serve as transcription factor binding sites, thereby providing insight into conserved transcription factors and gene regulatory networks shaping the expression profile. We first validated our method, called Daphnia-cisTarget, on a D. pulex heat shock data set, which revealed a network driven by the heat shock factor. Next, we performed RNA-Seq in D. magna exposed to the cyanobacterium Microcystis aeruginosa. Daphnia-cisTarget identified coregulated gene networks that associate with the moulting cycle and potentially regulate life history changes in growth rate and age at maturity. These networks are predicted to be regulated by evolutionarily conserved transcription factors such as the homologues of Drosophila Shavenbaby and Grainyhead, nuclear receptors, and a GATA family member. In conclusion, our approach allows prioritising candidate genes in Daphnia without bias towards prior knowledge about functional gene annotation and represents an important step towards exploring the molecular mechanisms of ecological responses in organisms with poorly annotated genomes.
Affiliation(s)
- Katina I. Spanier
- Department of Biology, Laboratory of Aquatic Ecology, Evolution and Conservation, KU Leuven, Belgium
- Department of Human Genetics, Laboratory of Computational Biology, KU Leuven, Belgium
- VIB Center for Brain and Disease Research, KU Leuven, Belgium
- Mieke Jansen
- Department of Biology, Laboratory of Aquatic Ecology, Evolution and Conservation, KU Leuven, Belgium
- Ellen Decaestecker
- Department of Biology, Laboratory of Aquatic Biology, Science and Technology, KU Leuven Campus Kulak, Kortrijk, Belgium
- Gert Hulselmans
- Department of Human Genetics, Laboratory of Computational Biology, KU Leuven, Belgium
- VIB Center for Brain and Disease Research, KU Leuven, Belgium
- Dörthe Becker
- Environmental Genomics Group, School of Biosciences, College of Life and Environmental Sciences, University of Birmingham, United Kingdom
- Department of Animal and Plant Sciences, University of Sheffield, Western Bank, United Kingdom
- John K. Colbourne
- Environmental Genomics Group, School of Biosciences, College of Life and Environmental Sciences, University of Birmingham, United Kingdom
- Luisa Orsini
- Environmental Genomics Group, School of Biosciences, College of Life and Environmental Sciences, University of Birmingham, United Kingdom
- Luc De Meester
- Department of Biology, Laboratory of Aquatic Ecology, Evolution and Conservation, KU Leuven, Belgium
- Stein Aerts
- Department of Human Genetics, Laboratory of Computational Biology, KU Leuven, Belgium
- VIB Center for Brain and Disease Research, KU Leuven, Belgium
30
Tran NTL, Huang CH. Cloud-based MOTIFSIM: Detecting Similarity in Large DNA Motif Data Sets. J Comput Biol 2017; 24:450-459. [DOI: 10.1089/cmb.2016.0080] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022] Open
Affiliation(s)
- Ngoc Tam L. Tran
- Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut
- Chun-Hsi Huang
- Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut
31
Crebbp loss cooperates with Bcl2 overexpression to promote lymphoma in mice. Blood 2017; 129:2645-2656. [PMID: 28288979 DOI: 10.1182/blood-2016-08-733469] [Citation(s) in RCA: 72] [Impact Index Per Article: 9.0] [Received: 08/11/2016] [Accepted: 03/05/2017] [Indexed: 12/16/2022] Open
Abstract
CREBBP is targeted by inactivating mutations in follicular lymphoma (FL) and diffuse large B-cell lymphoma (DLBCL). Here, we provide evidence from transgenic mouse models that Crebbp deletion results in deficits in B-cell development and can cooperate with Bcl2 overexpression to promote B-cell lymphoma. Through transcriptional and epigenetic profiling of these B cells, we found that Crebbp inactivation was associated with broad transcriptional alterations, but no changes in the patterns of histone acetylation at the proximal regulatory regions of these genes. However, B cells with Crebbp inactivation showed high expression of Myc and patterns of altered histone acetylation that were localized to intragenic regions, enriched for Myc DNA binding motifs, and showed Myc binding. Through the analysis of CREBBP mutations from a large cohort of primary human FL and DLBCL, we show a significant difference in the spectrum of CREBBP mutations in these 2 diseases, with higher frequencies of nonsense/frameshift mutations in DLBCL compared with FL. Together, our data therefore provide important links between Crebbp inactivation and Bcl2 dependence and show a role for Crebbp inactivation in the induction of Myc expression. We suggest this may parallel the role of CREBBP frameshift/nonsense mutations in DLBCL that result in loss of the protein, but may contrast the role of missense mutations in the lysine acetyltransferase domain that are more frequently observed in FL and yield an inactive protein.
32
Liu B, Yang J, Li Y, McDermaid A, Ma Q. An algorithmic perspective of de novo cis-regulatory motif finding based on ChIP-seq data. Brief Bioinform 2017; 19:1069-1081. [DOI: 10.1093/bib/bbx026] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 12/20/2016] [Indexed: 01/06/2023] Open
Affiliation(s)
- Bingqiang Liu
- School of Mathematics, Shandong University, Jinan Shandong, P. R. China
- Jinyu Yang
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
- Yang Li
- School of Mathematics, Shandong University, Jinan Shandong, P. R. China
- Adam McDermaid
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
- Qin Ma
- Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA
33
Czeizler E, Hirvola T, Karhu K. A graph-theoretical approach for motif discovery in protein sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:121-130. [PMID: 28055896 DOI: 10.1109/tcbb.2015.2511750] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/06/2023]
Abstract
Motif recognition is a challenging problem in bioinformatics due to the diversity of protein motifs. Many existing algorithms identify motifs of a given length, and are therefore either not applicable or not efficient when searching simultaneously for motifs of various lengths. Searching for gapped motifs, although very important, is a highly time-consuming task due to the combinatorial explosion of possible combinations implied by the consideration of long gaps. We introduce a new graph-theoretical approach to identify motifs of various lengths, both with and without gaps. We compare our approach with two widely used methods, MEME and GLAM2, analyzing both the quality of the results and the required computational time. Our method provides results of a slightly higher level of quality than MEME but at a much faster rate, i.e., one eighth of MEME's query time. By using similarity indexing, we drop the query times down to an average of approximately one sixth of the ones required by GLAM2, while achieving a slightly higher level of quality of the results. More precisely, for sequence collections smaller than 50,000 bytes GLAM2 is 13 times slower, while being at least as fast as our method on larger ones. The source code of our C++ implementation is freely available on GitHub: https://github.com/hirvolt1/debruijn-motif.
34
Fuxman Bass JI, Pons C, Kozlowski L, Reece-Hoyes JS, Shrestha S, Holdorf AD, Mori A, Myers CL, Walhout AJ. A gene-centered C. elegans protein-DNA interaction network provides a framework for functional predictions. Mol Syst Biol 2016; 12:884. [PMID: 27777270 PMCID: PMC5081483 DOI: 10.15252/msb.20167131] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.7] [Indexed: 11/09/2022] Open
Abstract
Transcription factors (TFs) play a central role in controlling spatiotemporal gene expression and the response to environmental cues. A comprehensive understanding of gene regulation requires integrating physical protein–DNA interactions (PDIs) with TF regulatory activity, expression patterns, and phenotypic data. Although great progress has been made in mapping PDIs using chromatin immunoprecipitation, these studies have only characterized ~10% of TFs in any metazoan species. The nematode C. elegans has been widely used to study gene regulation due to its compact genome with short regulatory sequences. Here, we delineated the largest gene‐centered metazoan PDI network to date by examining interactions between 90% of C. elegans TFs and 15% of gene promoters. We used this network as a backbone to predict TF binding sites for 77 TFs, two‐thirds of which are novel, as well as integrate gene expression, protein–protein interaction, and phenotypic data to predict regulatory and biological functions for multiple genes and TFs.
Affiliation(s)
- Juan I Fuxman Bass
- Program in Systems Biology and Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, MA, USA
- Carles Pons
- Department of Computer Science and Engineering, University of Minnesota-Twin Cities, Minneapolis, MN, USA
- Lucie Kozlowski
- Program in Systems Biology and Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, MA, USA
- John S Reece-Hoyes
- Program in Systems Biology and Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, MA, USA
- Shaleen Shrestha
- Program in Systems Biology and Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, MA, USA
- Amy D Holdorf
- Program in Systems Biology and Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, MA, USA
- Akihiro Mori
- Program in Systems Biology and Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, MA, USA
- Chad L Myers
- Department of Computer Science and Engineering, University of Minnesota-Twin Cities, Minneapolis, MN, USA
- Albertha Jm Walhout
- Program in Systems Biology and Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, MA, USA
35
Yu Q, Huo H, Feng D. PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets. BioMed Research International 2016; 2016:4986707. [PMID: 27843946 PMCID: PMC5098105 DOI: 10.1155/2016/4986707] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 06/22/2016] [Revised: 09/04/2016] [Accepted: 09/27/2016] [Indexed: 11/18/2022]
Abstract
Identifying conserved patterns in DNA sequences, namely, motif discovery, is an important and challenging computational task. High-throughput sequencing data sets, which contain hundreds of sequences or more, help improve the accuracy of motif discovery but demand even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed, which extracts and combines pairs of l-mers in the input that have a relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP but also for other DNA data mining tasks with the same requirement. Experimental results on simulated data show that the proposed algorithm can find motifs successfully and runs faster than state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.
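The core idea of collecting pairs of l-mers with a small Hamming distance can be illustrated with a brute-force Python sketch. It is deliberately naive and quadratic, so it does not reproduce the rapid extraction method the abstract describes; the function names and the restriction to pairs drawn from two different sequences are assumptions made for this example.

```python
# Illustrative sketch only: brute-force extraction of close l-mer pairs.
# This is NOT PairMotifChIP's fast extraction method; names are assumptions.
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def lmers(seq, l):
    """All substrings of length l, in order of their start position."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


def close_lmer_pairs(seqs, l, max_dist):
    """Pairs of l-mers from two different sequences within max_dist mismatches.

    Each result is (index of first sequence, l-mer, index of second sequence, l-mer).
    """
    pairs = []
    for (i, s), (j, t) in combinations(enumerate(seqs), 2):
        for a in lmers(s, l):
            for b in lmers(t, l):
                if hamming(a, b) <= max_dist:
                    pairs.append((i, a, j, b))
    return pairs


print(close_lmer_pairs(["ACGTAC", "TTACGA"], l=4, max_dist=1))
# [(0, 'ACGT', 1, 'ACGA'), (0, 'GTAC', 1, 'TTAC')]
```

Pairs found this way could then be combined into candidate motifs; how PairMotifChIP combines them is not specified in the abstract.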
Affiliation(s)
- Qiang Yu
- School of Computer Science and Technology, Xidian University, Xi'an 710071, China
- Hongwei Huo
- School of Computer Science and Technology, Xidian University, Xi'an 710071, China
- Dazheng Feng
- School of Electronic Engineering, Xidian University, Xi'an 710071, China
36
The EMT regulator ZEB2 is a novel dependency of human and murine acute myeloid leukemia. Blood 2016; 129:497-508. [PMID: 27756750 DOI: 10.1182/blood-2016-05-714493] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.7] [Received: 05/02/2016] [Accepted: 10/07/2016] [Indexed: 01/01/2023] Open
Abstract
Acute myeloid leukemia (AML) is a heterogeneous disease with complex molecular pathophysiology. To systematically characterize AML's genetic dependencies, we conducted genome-scale short hairpin RNA screens in 17 AML cell lines and analyzed dependencies relative to parallel screens in 199 cell lines of other cancer types. We identified 353 genes specifically required for AML cell proliferation. To validate the in vivo relevance of genetic dependencies observed in human cell lines, we performed a secondary screen in a syngeneic murine AML model driven by the MLL-AF9 oncogenic fusion protein. Integrating the results of these interference RNA screens and additional gene expression data, we identified the transcription factor ZEB2 as a novel AML dependency. ZEB2 depletion impaired the proliferation of both human and mouse AML cells and resulted in aberrant differentiation of human AML cells. Mechanistically, we showed that ZEB2 transcriptionally represses genes that regulate myeloid differentiation, including genes involved in cell adhesion and migration. In addition, we found that epigenetic silencing of the miR-200 family microRNAs affects ZEB2 expression. Our results extend the role of ZEB2 beyond regulating epithelial-mesenchymal transition (EMT) and establish ZEB2 as a novel regulator of AML proliferation and differentiation.
37
Singh S, Howell D, Trivedi N, Kessler K, Ong T, Rosmaninho P, Raposo AA, Robinson G, Roussel MF, Castro DS, Solecki DJ. Zeb1 controls neuron differentiation and germinal zone exit by a mesenchymal-epithelial-like transition. eLife 2016; 5. [PMID: 27178982 PMCID: PMC4891180 DOI: 10.7554/elife.12717] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.9] [Received: 10/30/2015] [Accepted: 05/03/2016] [Indexed: 12/13/2022] Open
Abstract
In the developing mammalian brain, differentiating neurons mature morphologically via neuronal polarity programs. Despite discovery of polarity pathways acting concurrently with differentiation, it's unclear how neurons traverse complex polarity transitions or how neuronal progenitors delay polarization during development. We report that zinc finger and homeobox transcription factor-1 (Zeb1), a master regulator of epithelial polarity, controls neuronal differentiation by transcriptionally repressing polarity genes in neuronal progenitors. Necessity-sufficiency testing and functional target screening in cerebellar granule neuron progenitors (GNPs) reveal that Zeb1 inhibits polarization and retains progenitors in their germinal zone (GZ). Zeb1 expression is elevated in the Sonic Hedgehog (SHH) medulloblastoma subgroup originating from GNPs with persistent SHH activation. Restored polarity signaling promotes differentiation and rescues GZ exit, suggesting a model for future differentiative therapies. These results reveal unexpected parallels between neuronal differentiation and mesenchymal-to-epithelial transition and suggest that active polarity inhibition contributes to altered GZ exit in pediatric brain cancers. DOI:http://dx.doi.org/10.7554/eLife.12717.001 During the formation of the brain, developing neurons are faced with a logistical problem. After newborn neurons form they must change in shape and move to their final location in the brain. Despite much speculation, little is known about these processes. Neurons mature via the activity of several pathways that control the activity, or expression, of the neuron’s genes. One way of controlling such gene expression is through proteins called transcription factors. At the same time, the developing neurons go through a process called polarization, where different regions of the cell develop different characteristics. 
However, it was not known how the maturation and polarization processes are linked, or how the developing neurons actively regulate polarization. By studying the developing mouse brain, Singh et al. found that a transcription factor called Zeb1 keeps neurons in an immature state, stopping them from becoming polarized. Further investigation revealed that Zeb1 does this by preventing the production of a group of proteins that helps to polarize the cells. The most common type of malignant brain tumour in children is called a medulloblastoma. Singh et al. analyzed the genes expressed in mice that have a type of medulloblastoma that results from the constant activity of a gene called Sonic Hedgehog in developing neurons. This revealed that these tumour cells contain abnormally high levels of Zeb1, and so do not take on a polarized form. However, artificially restoring other factors that encourage the cells to polarize caused the neurons to mature normally. Further investigation is now needed to find out whether the activity of the Sonic Hedgehog gene regulates Zeb1 activity, and to discover whether inhibiting Zeb1 could prevent brain tumours from developing. DOI:http://dx.doi.org/10.7554/eLife.12717.002
Affiliation(s)
- Shalini Singh
- Department of Developmental Neurobiology, St. Jude Children's Research Hospital, Memphis, United States
| | - Danielle Howell
- Department of Developmental Neurobiology, St. Jude Children's Research Hospital, Memphis, United States
| | - Niraj Trivedi
- Department of Developmental Neurobiology, St. Jude Children's Research Hospital, Memphis, United States
| | | | - Taren Ong
- Department of Developmental Neurobiology, St. Jude Children's Research Hospital, Memphis, United States
| | - Pedro Rosmaninho
- Department of Molecular Neurobiology, Instituto Gulbenkian de Ciência Oeiras, Oeiras, Portugal
| | - Alexandre Asf Raposo
- Department of Molecular Neurobiology, Instituto Gulbenkian de Ciência Oeiras, Oeiras, Portugal
| | - Giles Robinson
- Department of Oncology, St. Jude Children's Research Hospital, Memphis, United States
| | - Martine F Roussel
- Department of Tumor Cell Biology, St. Jude Children's Research Hospital, Memphis, United States
| | - Diogo S Castro
- Department of Molecular Neurobiology, Instituto Gulbenkian de Ciência Oeiras, Oeiras, Portugal
| | - David J Solecki
- Department of Developmental Neurobiology, St. Jude Children's Research Hospital, Memphis, United States
| |
|
38
|
MOCCS: Clarifying DNA-binding motif ambiguity using ChIP-Seq data. Comput Biol Chem 2016; 63:62-72. [PMID: 26971251 DOI: 10.1016/j.compbiolchem.2016.01.014] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 01/14/2016] [Accepted: 01/25/2016] [Indexed: 11/21/2022]
Abstract
BACKGROUND As a key mechanism of gene regulation, transcription factors (TFs) bind to DNA by recognizing specific short sequence patterns that are called DNA-binding motifs. A single TF can accept ambiguity within its DNA-binding motifs, which comprise both canonical (typical) and non-canonical motifs. Clarification of such DNA-binding motif ambiguity is crucial for revealing gene regulatory networks and evaluating mutations in cis-regulatory elements. Although chromatin immunoprecipitation sequencing (ChIP-seq) now provides abundant data on the genomic sequences to which a given TF binds, existing motif discovery methods are unable to directly answer whether a given TF can bind to a specific DNA-binding motif. RESULTS Here, we report a method for clarifying the DNA-binding motif ambiguity, MOCCS. Given ChIP-Seq data of any TF, MOCCS comprehensively analyzes and describes every k-mer to which that TF binds. Analysis of simulated datasets revealed that MOCCS is applicable to various ChIP-Seq datasets, requiring only a few minutes per dataset. Application to the ENCODE ChIP-Seq datasets proved that MOCCS directly evaluates whether a given TF binds to each DNA-binding motif, even if known position weight matrix models do not provide sufficient information on DNA-binding motif ambiguity. Furthermore, users are not required to provide numerous parameters or background genomic sequence models that are typically unavailable. MOCCS is implemented in Perl and R and is freely available via https://github.com/yuifu/moccs. CONCLUSIONS By complementing existing motif-discovery software, MOCCS will contribute to the basic understanding of how the genome controls diverse cellular processes via DNA-protein interactions.
|
39
|
Liu Z, Han J, Lv H, Liu J, Liu R. Computational identification of circular RNAs based on conformational and thermodynamic properties in the flanking introns. Comput Biol Chem 2016; 61:221-5. [PMID: 26917277 DOI: 10.1016/j.compbiolchem.2016.02.003] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 08/04/2015] [Revised: 02/03/2016] [Accepted: 02/03/2016] [Indexed: 01/08/2023]
Abstract
Circular RNAs (circRNAs) were found more than 30 years ago, but have long been treated as molecular flukes. By combining deep sequencing studies with bioinformatics techniques, thousands of endogenous circRNAs have been found in mammalian cells, and some researchers have shown that several circRNAs act as competing endogenous RNAs (ceRNAs) to regulate gene expression. However, the mechanism by which a precursor mRNA is transformed into a circular RNA or a linear mRNA is largely unknown. In this paper, we attempted to bioinformatically identify shared genomic features that might further elucidate the mechanism of formation and proposed an SVM-based model to distinguish circRNAs from non-circularized, expressed exons. First, conformational and thermodynamic dinucleotide properties in the flanking introns were extracted as potential features. Second, two feature selection methods were applied to obtain the optimal feature subset. Our 10-fold cross-validation results showed that the model can distinguish circRNAs from non-circularized, expressed exons with a sensitivity (Sn) of 0.884, a specificity (Sp) of 0.900, an accuracy (ACC) of 0.892 and a Matthews correlation coefficient (MCC) of 0.784. The identification results suggest that conformational and thermodynamic properties in the flanking introns are closely related to the formation of circRNAs. Datasets and the tool involved in this paper are all available at https://sourceforge.net/projects/predicircrnatool/files/.
Affiliation(s)
- Ze Liu
- School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, PR China
| | - Jiuqiang Han
- School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, PR China
| | - Hongqiang Lv
- School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, PR China
| | - Jun Liu
- School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, PR China; School of Electrical Engineering, Xi'an Jiaotong University, Xi'an 710049, PR China
| | - Ruiling Liu
- School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, PR China
| |
|
40
|
Mistri TK, Devasia AG, Chu LT, Ng WP, Halbritter F, Colby D, Martynoga B, Tomlinson SR, Chambers I, Robson P, Wohland T. Selective influence of Sox2 on POU transcription factor binding in embryonic and neural stem cells. EMBO Rep 2015; 16:1177-91. [PMID: 26265007 PMCID: PMC4576985 DOI: 10.15252/embr.201540467] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.3] [Received: 03/29/2015] [Accepted: 07/06/2015] [Indexed: 12/19/2022] Open
Abstract
Embryonic stem cell (ESC) identity is orchestrated by co-operativity between the transcription factors (TFs) Sox2 and the class V POU-TF Oct4 at composite Sox/Oct motifs. Neural stem cells (NSCs) lack Oct4 but express Sox2 and class III POU-TFs Oct6, Brn1 and Brn2. This raises the question of how Sox2 interacts with POU-TFs to transcriptionally specify ESCs versus NSCs. Here, we show that Oct4 alone binds the Sox/Oct motif and the octamer-containing palindromic MORE equally well. Sox2 binding selectively increases the affinity of Oct4 for the Sox/Oct motif. In contrast, Oct6 binds preferentially to MORE and is unaffected by Sox2. ChIP-Seq in NSCs shows the MORE to be the most enriched motif for class III POU-TFs, including MORE subtypes, and that the Sox/Oct motif is not enriched. These results suggest that in NSCs, co-operativity between Sox2 and class III POU-TFs may not occur and that POU-TF-driven transcription uses predominantly the MORE cis architecture. Thus, distinct interactions between Sox2 and POU-TF subclasses distinguish pluripotent ESCs from multipotent NSCs, providing molecular insight into how Oct4 alone can convert NSCs to pluripotency.
Affiliation(s)
- Tapan Kumar Mistri
- Department of Chemistry, National University of Singapore, Singapore, Singapore; Developmental Cellomics Laboratory, Genome Institute of Singapore, Singapore, Singapore; MRC Centre for Regenerative Medicine, Institute for Stem Cell Research, School of Biological Sciences, University of Edinburgh, Edinburgh, UK
| | - Arun George Devasia
- Developmental Cellomics Laboratory, Genome Institute of Singapore, Singapore, Singapore
| | - Lee Thean Chu
- Developmental Cellomics Laboratory, Genome Institute of Singapore, Singapore, Singapore
| | - Wei Ping Ng
- Department of Chemistry, National University of Singapore, Singapore, Singapore
| | - Florian Halbritter
- MRC Centre for Regenerative Medicine, Institute for Stem Cell Research, School of Biological Sciences, University of Edinburgh, Edinburgh, UK
| | - Douglas Colby
- MRC Centre for Regenerative Medicine, Institute for Stem Cell Research, School of Biological Sciences, University of Edinburgh, Edinburgh, UK
| | - Ben Martynoga
- Division of Molecular Neurobiology, MRC-National Institute for Medical Research, Mill Hill, London, UK
| | - Simon R Tomlinson
- MRC Centre for Regenerative Medicine, Institute for Stem Cell Research, School of Biological Sciences, University of Edinburgh, Edinburgh, UK
| | - Ian Chambers
- MRC Centre for Regenerative Medicine, Institute for Stem Cell Research, School of Biological Sciences, University of Edinburgh, Edinburgh, UK
| | - Paul Robson
- Developmental Cellomics Laboratory, Genome Institute of Singapore, Singapore, Singapore; Department of Biological Sciences, National University of Singapore, Singapore, Singapore; The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Thorsten Wohland
- Department of Chemistry, National University of Singapore, Singapore, Singapore; Department of Biological Sciences, National University of Singapore, Singapore, Singapore; Centre for Bioimaging Sciences, National University of Singapore, Singapore, Singapore
| |
|
41
|
Zhang Y, Wang P. A Fast Cluster Motif Finding Algorithm for ChIP-Seq Data Sets. BIOMED RESEARCH INTERNATIONAL 2015; 2015:218068. [PMID: 26236718 PMCID: PMC4509496 DOI: 10.1155/2015/218068] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 04/08/2015] [Accepted: 06/04/2015] [Indexed: 11/17/2022]
Abstract
The high-throughput technique ChIP-seq, which couples chromatin immunoprecipitation with high-throughput sequencing, has extended the identification of transcription factor binding locations to the whole genome. However, most existing motif discovery algorithms are time-consuming and ill-suited to identifying binding motifs in ChIP-seq data, which are typically large in scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify (l, d) motifs in large-scale ChIP-seq data sets. It is inspired by the emerging-substring mining strategy: it finds the enriched substrings and then searches their neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif does not follow the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq data sets from mouse ES cells. Detection of the real binding motifs and processing of full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is well suited to (l, d) motif finding in ChIP-seq data; it also demonstrates better performance than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
Affiliation(s)
- Yipu Zhang
- Department of Automation, School of Electronics and Control Engineering, Chang'An University, Xi'an 710064, China
| | - Ping Wang
- Department of Automation, School of Electronics and Control Engineering, Chang'An University, Xi'an 710064, China
| |
|
42
|
Zhang Y, He Y, Zheng G, Wei C. MOST+: A de novo motif finding approach combining genomic sequence and heterogeneous genome-wide signatures. BMC Genomics 2015; 16 Suppl 7:S13. [PMID: 26099518 PMCID: PMC4474412 DOI: 10.1186/1471-2164-16-s7-s13] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 01/09/2023] Open
Abstract
Background Motifs are regulatory elements that activate or inhibit the expression of related genes when proteins (such as transcription factors, TFs) bind to them. Therefore, motif finding is important for understanding the mechanisms of gene regulation. De novo discovery of regulatory elements, such as transcription factor binding sites (TFBSs), has long been a major challenge in gaining insight into mechanisms of gene regulation. Recent advances in experimental profiling of genome-wide signals such as histone modifications and DNase I hypersensitivity sites allow scientists to develop better computational methods to enhance motif discovery. However, existing methods for motif finding suffer from high false positive rates and slow speed, and it is difficult to evaluate the performance of these methods systematically. Result Here we present MOST+, a motif finder that integrates genomic sequences and genome-wide signals, such as intensity and shape features from histone modification marks and DNase I hypersensitivity sites, to improve prediction accuracy. MOST+ can detect motifs from a large input sequence of about 100 Mbs within a few minutes. A systematic comparison method has been established, and MOST+ has been compared with existing methods. Conclusion MOST+ is a fast and accurate de novo method for motif finding that integrates genomic sequence and experimental signals as clues.
|
43
|
Systematic discovery of cofactor motifs from ChIP-seq data by SIOMICS. Methods 2015; 79-80:47-51. [DOI: 10.1016/j.ymeth.2014.08.006] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 05/02/2014] [Revised: 07/19/2014] [Accepted: 08/06/2014] [Indexed: 11/19/2022] Open
|
44
|
Yu Q, Huo H, Chen X, Guo H, Vitter JS, Huan J. An Efficient Algorithm for Discovering Motifs in Large DNA Data Sets. IEEE Trans Nanobioscience 2015; 14:535-44. [PMID: 25872217 DOI: 10.1109/tnb.2015.2421340] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/09/2022]
Abstract
The planted (l,d) motif discovery has been successfully used to locate transcription factor binding sites in dozens of promoter sequences over the past decade. However, not enough work has been done on identifying (l,d) motifs in next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and thereby pose a new challenge: achieving good identification in reasonable time. To meet this need, we propose a new planted (l,d) motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. Specifically, to handle larger data sets, we design a MapReduce-based strategy to mine emerging substrings in a distributed manner. Experimental results on the simulated data show that i) MCES is able to identify (l,d) motifs efficiently and effectively in thousands to millions of input sequences, and runs faster than the state-of-the-art (l,d) motif discovery algorithms, such as F-motif and TraverStringsR; ii) MCES is able to identify motifs without known lengths, and has better identification accuracy than the competing algorithm CisFinder. The validity of MCES is also tested on real data sets. MCES is freely available at http://sites.google.com/site/feqond/mces.
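For readers unfamiliar with the term, an (l,d) motif is a string of length l that occurs in the input sequences with at most d mismatches per occurrence. The sketch below is a brute-force check of that definition only; it is not MCES or any other published algorithm, and the function names, the `min_fraction` quorum parameter and the toy sequences are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def occurs_within(motif, seq, d):
    """True if some window of seq differs from motif in at most d positions."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))

def is_ld_motif(motif, sequences, d, min_fraction=1.0):
    """Check the (l,d) condition; min_fraction < 1 relaxes it to a quorum
    (hypothetical parameter, useful when not every peak carries the site)."""
    hits = sum(occurs_within(motif, s, d) for s in sequences)
    return hits >= min_fraction * len(sequences)

# Toy input (assumption): each sequence holds ACGT with at most one mismatch.
seqs = ["ACGTTGCA", "TTACGATG", "GGACCTAA"]
print(is_ld_motif("ACGT", seqs, d=1))  # True
print(is_ld_motif("ACGT", seqs, d=0))  # False
```

Checking one candidate costs time proportional to the total input length, so enumerating all 4^l candidates this way quickly becomes impractical, which is the scaling problem the abstract refers to.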
|
45
|
Gentsch GE, Patrushev I, Smith JC. Genome-wide snapshot of chromatin regulators and states in Xenopus embryos by ChIP-Seq. J Vis Exp 2015. [PMID: 25742027 PMCID: PMC4354678 DOI: 10.3791/52535] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 12/18/2022] Open
Abstract
The recruitment of chromatin regulators and the assignment of chromatin states to specific genomic loci are pivotal to cell fate decisions and tissue and organ formation during development. Determining the locations and levels of such chromatin features in vivo will provide valuable information about the spatio-temporal regulation of genomic elements, and will support aspirations to mimic embryonic tissue development in vitro. The most commonly used method for genome-wide and high-resolution profiling is chromatin immunoprecipitation followed by next-generation sequencing (ChIP-Seq). This protocol outlines how yolk-rich embryos such as those of the frog Xenopus can be processed for ChIP-Seq experiments, and it offers simple command lines for post-sequencing analysis. Because of the high efficiency with which the protocol extracts nuclei from formaldehyde-fixed tissue, the method allows easy upscaling to obtain enough ChIP material for genome-wide profiling. Our protocol has been used successfully to map various DNA-binding proteins such as transcription factors, signaling mediators, components of the transcription machinery, chromatin modifiers and post-translational histone modifications, and for this to be done at various stages of embryogenesis. Lastly, this protocol should be widely applicable to other model and non-model organisms as more and more genome assemblies become available.
Affiliation(s)
- George E Gentsch
- Division of Systems Biology, MRC National Institute for Medical Research
| | - Ilya Patrushev
- Division of Systems Biology, MRC National Institute for Medical Research
| | - James C Smith
- Division of Systems Biology, MRC National Institute for Medical Research
| |
|
46
|
Discovery of CTCF-sensitive Cis-spliced fusion RNAs between adjacent genes in human prostate cells. PLoS Genet 2015; 11:e1005001. [PMID: 25658338 PMCID: PMC4450057 DOI: 10.1371/journal.pgen.1005001] [Citation(s) in RCA: 65] [Impact Index Per Article: 6.5] [Received: 11/17/2014] [Accepted: 01/13/2015] [Indexed: 11/19/2022] Open
Abstract
Genes or their encoded products are not expected to mingle with each other except in some disease situations. In cancer, a frequent mechanism that can produce gene fusions is chromosomal rearrangement. However, recent discoveries of RNA trans-splicing and cis-splicing between adjacent genes (cis-SAGe) support other mechanisms for generating fusion RNAs. In our transcriptome analyses of 28 prostate normal and cancer samples, on average 30% of fusion RNAs are transcripts that contain exons belonging to same-strand neighboring genes. These fusion RNAs may be the products of cis-SAGe, which was previously thought to be rare. To validate this finding and to better understand the phenomenon, we used LNCaP, a prostate cell line, as a model, and identified 16 additional cis-SAGe events by silencing transcription factor CTCF and paired-end RNA sequencing. About half of the fusions are expressed at a significant level compared to their parental genes. Silencing one of the in-frame fusions resulted in reduced cell motility. Most out-of-frame fusions are likely to function as non-coding RNAs. The majority of the 16 fusions are also detected in other prostate cell lines, as well as in the 14 clinical prostate normal and cancer pairs. By studying the features associated with these fusions, we developed a set of rules: 1) the parental genes are same-strand-neighboring genes; 2) the distance between the genes is within 30kb; 3) the 5′ genes are actively transcribing; and 4) the chimeras tend to have the second-to-last exon in the 5′ genes joined to the second exon in the 3′ genes. We then randomly selected 20 neighboring genes in the genome, and detected four fusion events using these rules in prostate cancer and non-cancerous cells. These results suggest that splicing between neighboring gene transcripts is a rather frequent phenomenon, and it is not a feature unique to cancer cells.
Genes are considered the units of hereditary information; thus, neither genes nor their encoded products are expected to mingle with each other except in some disease situations. However, genes are not alone in the genome. Genes have neighbors, some close, some far. With RNA-seq, many fusion RNAs involving neighboring genes are being identified. However, little has been done to validate and characterize these fusion RNAs. Using one prostate cell line and a discovery pipeline for cis-splicing between adjacent genes (cis-SAGe), we found 16 new such events. We then developed a set of rules based on the characteristics of these fusion RNAs, and applied them to 20 random neighboring gene pairs. Four turned out to be true. The majority of the fusions are found in cancer cells, as well as in non-cancer cells. These results suggest that the genes are “leaky”, and the fusions are not limited to cancer cells.
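Rules 1 to 3 in this abstract (same strand, within 30 kb, actively transcribed 5′ gene) can be read as a simple filter over gene pairs. The sketch below is one possible reading under stated assumptions: the `Gene` record, its field names and the example coordinates are hypothetical, adjacency (no intervening gene) is assumed to have been established by the caller, and rule 4 is omitted because it describes a tendency in exon usage rather than a hard criterion.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    name: str
    chrom: str
    strand: str      # "+" or "-"
    start: int       # leftmost coordinate
    end: int         # rightmost coordinate
    expressed: bool  # whether the gene is actively transcribed

MAX_GAP_BP = 30_000  # rule 2: distance between the genes within 30 kb

def is_candidate_pair(five_prime, three_prime, max_gap=MAX_GAP_BP):
    """Apply rules 1-3 to a (5' gene, 3' gene) pair assumed to be neighbors."""
    if five_prime.chrom != three_prime.chrom:
        return False
    if five_prime.strand != three_prime.strand:   # rule 1: same strand
        return False
    if five_prime.strand == "+":
        gap = three_prime.start - five_prime.end
    else:                                          # minus strand runs right to left
        gap = five_prime.start - three_prime.end
    return 0 <= gap <= max_gap and five_prime.expressed  # rules 2 and 3

# Hypothetical genes for illustration only.
gene_a = Gene("GENE_A", "chr1", "+", 1_000, 5_000, True)
gene_b = Gene("GENE_B", "chr1", "+", 20_000, 30_000, True)
gene_c = Gene("GENE_C", "chr1", "+", 50_000, 60_000, True)
print(is_candidate_pair(gene_a, gene_b))  # True: 15 kb gap, same strand
print(is_candidate_pair(gene_a, gene_c))  # False: 45 kb gap exceeds 30 kb
```

Passing the filter only marks a pair as worth testing; the abstract reports that four of 20 randomly selected neighboring pairs yielded a detectable fusion.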
|
47
|
Ikebata H, Yoshida R. Repulsive parallel MCMC algorithm for discovering diverse motifs from large sequence sets. Bioinformatics 2015; 31:1561-8. [PMID: 25583120 PMCID: PMC4426842 DOI: 10.1093/bioinformatics/btv017] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 09/16/2014] [Accepted: 01/06/2015] [Indexed: 11/14/2022] Open
Abstract
Motivation: The motif discovery problem consists of finding recurring patterns of short strings in a set of nucleotide sequences. This classical problem is receiving renewed attention as most early motif discovery methods lack the ability to handle large data of recent genome-wide ChIP studies. New ChIP-tailored methods focus on reducing computation time and pay little regard to the accuracy of motif detection. Unlike such methods, our method focuses on increasing the detection accuracy while maintaining the computation efficiency at an acceptable level. The major advantage of our method is that it can mine diverse multiple motifs undetectable by current methods. Results: The repulsive parallel Markov chain Monte Carlo (RPMCMC) algorithm that we propose is a parallel version of the widely used Gibbs motif sampler. RPMCMC is run on parallel interacting motif samplers. A repulsive force is generated when different motifs produced by different samplers near each other. Thus, different samplers explore different motifs. In this way, we can detect much more diverse motifs than conventional methods can. Through application to 228 transcription factor ChIP-seq datasets of the ENCODE project, we show that the RPMCMC algorithm can find many reliable cofactor interacting motifs that existing methods are unable to discover. Availability and implementation: A C++ implementation of RPMCMC and discovered cofactor motifs for the 228 ENCODE ChIP-seq datasets are available from http://daweb.ism.ac.jp/yoshidalab/motif. Contact: ikebata.hisaki@ism.ac.jp, yoshidar@ism.ac.jp Supplementary information: Supplementary data are available from Bioinformatics online.
Affiliation(s)
- Hisaki Ikebata
- Department of Statistical Science, The Graduate University for Advanced Studies (Sokendai), 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan, Department of Statistical Modeling, The Institute of Statistical Mathematics, Research Organization of Information and Systems, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan, JST-CREST, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan, JST-ERATO Sato Live Bio-Forecasting Project, 2-2-2 Hikaridai Seika-cho, Soraku-gun, Khoto-fu 619-0288, Japan and The Thomas N. Sato BioMEC-X Laboratories, Advanced Telecommunications Research Institute International, 2-2-2 Hikaridai Seika-cho, Soraku-gun, Khoto-fu 619-0288, Japan
| | - Ryo Yoshida
- Department of Statistical Science, The Graduate University for Advanced Studies (Sokendai), 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan, Department of Statistical Modeling, The Institute of Statistical Mathematics, Research Organization of Information and Systems, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan, JST-CREST, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan, JST-ERATO Sato Live Bio-Forecasting Project, 2-2-2 Hikaridai Seika-cho, Soraku-gun, Khoto-fu 619-0288, Japan and The Thomas N. Sato BioMEC-X Laboratories, Advanced Telecommunications Research Institute International, 2-2-2 Hikaridai Seika-cho, Soraku-gun, Khoto-fu 619-0288, Japan
| |
|
48
|
|
49
|
Zheng Y, Li X, Hu H. Comprehensive discovery of DNA motifs in 349 human cells and tissues reveals new features of motifs. Nucleic Acids Res 2015; 43:74-83. [PMID: 25505144 PMCID: PMC4288161 DOI: 10.1093/nar/gku1261] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 07/17/2014] [Revised: 11/13/2014] [Accepted: 11/17/2014] [Indexed: 01/15/2023] Open
Abstract
Comprehensive motif discovery under experimental conditions is critical for the global understanding of gene regulation. To generate a nearly complete list of human DNA motifs under given conditions, we employed a novel approach to de novo discover significant co-occurring DNA motifs in 349 human DNase I hypersensitive site datasets. We predicted 845 to 1325 motifs in each dataset, for a total of 2684 non-redundant motifs. These 2684 motifs contained 54.02 to 75.95% of the known motifs in seven large collections including TRANSFAC. In each dataset, we also discovered 43 663 to 2 013 288 motif modules, groups of motifs with their binding sites co-occurring in a significant number of short DNA regions. Compared with known interacting transcription factors in eight resources, the predicted motif modules on average included 84.23% of known interacting motifs. We further showed new features of the predicted motifs, such as motifs enriched in proximal regions rarely overlapped with motifs enriched in distal regions, motifs enriched in 5' distal regions were often enriched in 3' distal regions, etc. Finally, we observed that the 2684 predicted motifs classified the cell or tissue types of the datasets with an accuracy of 81.29%. The resources generated in this study are available at http://server.cs.ucf.edu/predrem/.
Affiliation(s)
- Yiyu Zheng
- Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, USA
| | - Xiaoman Li
- Burnett School of Biomedical Science, College of Medicine, University of Central Florida, Orlando, FL 32816, USA
| | - Haiyan Hu
- Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, USA
| |
|
50
|
Yamagishi J, Wakaguri H, Yokoyama N, Yamashita R, Suzuki Y, Xuan X, Igarashi I. The Babesia bovis gene and promoter model: an update from full-length EST analysis. BMC Genomics 2014; 15:678. [PMID: 25124460 PMCID: PMC4148916 DOI: 10.1186/1471-2164-15-678] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 05/24/2013] [Accepted: 07/08/2014] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND Babesia bovis is an apicomplexan parasite that causes babesiosis in infected cattle. Genomes of pathogens contain promising information that can facilitate the development of methods for controlling infections. Although the genome of B. bovis is publicly available, annotated gene models are not highly reliable prior to experimental validation. Therefore, we validated a preproposed gene model of B. bovis and extended the associated annotations on the basis of experimentally obtained full-length expressed sequence tags (ESTs). RESULTS From in vitro cultured merozoites, 12,286 clones harboring full-length cDNAs were sequenced from both ends using the Sanger method, and 6,787 full-length cDNAs were assembled. These were then clustered, and a nonredundant referential data set of 2,115 full-length cDNA sequences was constructed. The comparison of the preproposed gene model with our data set identified 310 identical genes, 342 almost identical genes, 1,054 genes with potential structural inconsistencies, and 409 novel genes. The median length of 5' untranslated regions (UTRs) was 152 nt. Subsequently, we identified 4,086 transcription start sites (TSSs) and 2,023 transcriptionally active regions (TARs) by examining 5' ESTs. We identified ATGGGG and CCCCAT sites as consensus motifs in TARs that were distributed around -50 bp from TSSs. In addition, we found ACACA, TGTGT, and TATAT sites, which were distributed periodically around TSSs in cycles of approximately 150 bp. Moreover, related periodical distributions were not observed in mammalian promoter regions. CONCLUSIONS The observations in this study indicate the utility of integrated bioinformatics and experimental data for improving genome annotations. In particular, full-length cDNAs with one-base resolution for TSSs enabled the identification of consensus motifs in promoter sequences and demonstrated clear distributions of identified motifs. 
These observations allowed the illustration of a model promoter composition, which supports the differences in transcriptional regulation frameworks between apicomplexan parasites and mammals.
Affiliation(s)
- Junya Yamagishi
- Tohoku Medical Megabank Organization, Tohoku University, 6-3-09, aza Aoba, Sendai, Miyagi 980-8579 Japan
- National Research Center for Protozoan Diseases, Obihiro University of Agriculture and Veterinary Medicine, Inada-cho west 2-13, Obihiro, Hokkaido 080-8555 Japan
| | - Hiroyuki Wakaguri
- Department of Medical Genome Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8562 Japan
| | - Naoaki Yokoyama
- National Research Center for Protozoan Diseases, Obihiro University of Agriculture and Veterinary Medicine, Inada-cho west 2-13, Obihiro, Hokkaido 080-8555 Japan
| | - Riu Yamashita
- Tohoku Medical Megabank Organization, Tohoku University, 6-3-09, aza Aoba, Sendai, Miyagi 980-8579 Japan
| | - Yutaka Suzuki
- Department of Medical Genome Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8562 Japan
| | - Xuenan Xuan
- National Research Center for Protozoan Diseases, Obihiro University of Agriculture and Veterinary Medicine, Inada-cho west 2-13, Obihiro, Hokkaido 080-8555 Japan
| | - Ikuo Igarashi
- National Research Center for Protozoan Diseases, Obihiro University of Agriculture and Veterinary Medicine, Inada-cho west 2-13, Obihiro, Hokkaido 080-8555 Japan
| |
|