1
Jalili V, Cremona MA, Palluzzi F. Rescuing biologically relevant consensus regions across replicated samples. BMC Bioinformatics 2023; 24:240. [PMID: 37286963 PMCID: PMC10246347 DOI: 10.1186/s12859-023-05340-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Accepted: 05/16/2023] [Indexed: 06/09/2023]
Abstract
BACKGROUND Protein-DNA binding sites of ChIP-seq experiments are identified where the binding affinity is significant based on a given threshold. The choice of the threshold is a trade-off between conservative region identification and discarding weak, but true binding sites. RESULTS We rescue weak binding sites using MSPC, which efficiently exploits replicates to lower the threshold required to identify a site while keeping a low false-positive rate, and we compare it to IDR, a widely used post-processing method for identifying highly reproducible peaks across replicates. We observe several master transcription regulators (e.g., SP1 and GATA3) and HDAC2-GATA1 regulatory networks on rescued regions in K562 cell line. CONCLUSIONS We argue the biological relevance of weak binding sites and the information they add when rescued by MSPC. An implementation of the proposed extended MSPC methodology and the scripts to reproduce the performed analysis are freely available at https://genometric.github.io/MSPC/ ; MSPC is distributed as a command-line application and an R package available from Bioconductor ( https://doi.org/doi:10.18129/B9.bioc.rmspc ).
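The idea of using replicates to lower the per-sample threshold can be illustrated with a small sketch. It assumes that evidence from overlapping peaks is combined with Fisher's method; the abstract does not name the test, so this is an assumption, and the function name `fisher_combined_pvalue` is hypothetical rather than part of the MSPC interface.

```python
import math

def fisher_combined_pvalue(p_values):
    """Combine independent p-values with Fisher's method (assumed here)."""
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Fisher's statistic X = -2 * sum(ln p_i) is chi-square with 2k degrees
    # of freedom under the null; we work with x/2 directly.
    half_x = -sum(math.log(p) for p in p_values)
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total

# Two weak but concordant peaks yield a combined p-value smaller than either.
combined = fisher_combined_pvalue([1e-4, 1e-4])
print(combined < 1e-4)
```

A peak that misses a stringent threshold in each replicate alone can still pass once the replicates are combined, which is the sense in which weak sites are "rescued".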
Affiliation(s)
- Vahid Jalili
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Marzia A Cremona
  - Department of Operations and Decision Systems, Université Laval, Quebec, Canada.
  - CHU de Québec - Université Laval Research Center, Quebec, Canada.
- Fernando Palluzzi
  - Department of Brain and Behavioral Sciences, Università di Pavia, Pavia, Italy.
2
ChIP-GSM: Inferring active transcription factor modules to predict functional regulatory elements. PLoS Comput Biol 2021; 17:e1009203. [PMID: 34292930 PMCID: PMC8330942 DOI: 10.1371/journal.pcbi.1009203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2020] [Revised: 08/03/2021] [Accepted: 06/20/2021] [Indexed: 11/19/2022]
Abstract
Transcription factors (TFs) often function as a module including both master factors and mediators binding at cis-regulatory regions to modulate nearby gene transcription. ChIP-seq profiling of multiple TFs makes it feasible to infer functional TF modules. However, when inferring TF modules based on co-localization of ChIP-seq peaks, often many weak binding events are missed, especially for mediators, resulting in incomplete identification of modules. To address this problem, we develop a ChIP-seq data-driven Gibbs Sampler to infer Modules (ChIP-GSM) using a Bayesian framework that integrates ChIP-seq profiles of multiple TFs. ChIP-GSM samples read counts of module TFs iteratively to estimate the binding potential of a module to each region and, across all regions, estimates the module abundance. Using inferred module-region probabilistic bindings as feature units, ChIP-GSM then employs logistic regression to predict active regulatory elements. Validation of ChIP-GSM predicted regulatory regions on multiple independent datasets sharing the same context confirms the advantage of using TF modules for predicting regulatory activity. In a case study of K562 cells, we demonstrate that the ChIP-GSM inferred modules form as groups, activate gene expression at different time points, and mediate diverse functional cellular processes. Hence, ChIP-GSM infers biologically meaningful TF modules and improves the prediction accuracy of regulatory region activities.
3
Wong KC. Big data challenges in genome informatics. Biophys Rev 2019; 11:51-54. [PMID: 30684131 DOI: 10.1007/s12551-018-0493-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 09/17/2018] [Accepted: 12/13/2018] [Indexed: 12/19/2022]
Abstract
In recent years, we have witnessed a big data explosion in genomics, thanks to the improvement in high-throughput technologies at drastically decreasing costs. We are entering the era of millions of available genomes. Notably, each genome can be composed of billions of nucleotides stored as plain text files in gigabytes (GBs). It is undeniable that those genome data impose unprecedented data challenges for us. In this article, we briefly discuss the big data challenges associated with genomics in recent years.
Affiliation(s)
- Ka-Chun Wong
  - City University of Hong Kong, Kowloon, Hong Kong.
4
Banerjee S, Zhu H, Tang M, Feng WC, Wu X, Xie H. Identifying Transcriptional Regulatory Modules Among Different Chromatin States in Mouse Neural Stem Cells. Front Genet 2019; 9:731. [PMID: 30697231 PMCID: PMC6341026 DOI: 10.3389/fgene.2018.00731] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 08/27/2018] [Accepted: 12/22/2018] [Indexed: 12/19/2022]
Abstract
Gene expression regulation is a complex process involving the interplay between transcription factors and chromatin states. Significant progress has been made toward understanding the impact of chromatin states on gene expression. Nevertheless, the mechanism of transcription factors binding combinatorially in different chromatin states to enable selective regulation of gene expression remains an interesting research area. We introduce a nonparametric Bayesian clustering method for inhomogeneous Poisson processes to detect heterogeneous binding patterns of multiple proteins including transcription factors to form regulatory modules in different chromatin states. We applied this approach on ChIP-seq data for mouse neural stem cells containing 21 proteins and observed different groups or modules of proteins clustered within different chromatin states. These chromatin-state-specific regulatory modules were found to have significant influence on gene expression. We also observed different motif preferences for certain TFs between different chromatin states. Our results reveal a degree of interdependency between chromatin states and combinatorial binding of proteins in the complex transcriptional regulatory process. The software package is available on Github at - https://github.com/BSharmi/DPM-LGCP.
Affiliation(s)
- Sharmi Banerjee
  - Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, United States
  - Biocomplexity Institute of Virginia Tech, Blacksburg, VA, United States
- Hongxiao Zhu
  - Department of Statistics, Virginia Tech, Blacksburg, VA, United States
- Man Tang
  - Department of Statistics, Virginia Tech, Blacksburg, VA, United States
- Wu-Chun Feng
  - Department of Computer Science, Virginia Tech, Blacksburg, VA, United States
- Xiaowei Wu
  - Department of Statistics, Virginia Tech, Blacksburg, VA, United States
- Hehuang Xie
  - Biocomplexity Institute of Virginia Tech, Blacksburg, VA, United States
  - Department of Biomedical Sciences and Pathobiology, Virginia-Maryland College of Veterinary Medicine, Blacksburg, VA, United States
  - Department of Biological Sciences, Virginia Tech, Blacksburg, VA, United States
  - School of Neuroscience, Virginia Tech, Blacksburg, VA, United States
5
Nakato R, Shirahige K. Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation. Brief Bioinform 2017; 18:279-290. [PMID: 26979602 PMCID: PMC5444249 DOI: 10.1093/bib/bbw023] [Citation(s) in RCA: 78] [Impact Index Per Article: 9.8] [Received: 11/24/2015] [Indexed: 02/06/2023]
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) analysis can detect protein/DNA-binding and histone-modification sites across an entire genome. Recent advances in sequencing technologies and analyses enable us to compare hundreds of samples simultaneously; such large-scale analysis has potential to reveal the high-dimensional interrelationship level for regulatory elements and annotate novel functional genomic regions de novo. Because many experimental considerations are relevant to the choice of a method in a ChIP-seq analysis, the overall design and quality management of the experiment are of critical importance. This review offers guiding principles of computation and sample preparation for ChIP-seq analyses, highlighting the validity and limitations of the state-of-the-art procedures at each step. We also discuss the latest challenges of single-cell analysis that will encourage a new era in this field.
Affiliation(s)
- Ryuichiro Nakato
  - Research Center for Epigenetic Disease, Institute of Molecular and Cellular Biosciences, The University of Tokyo, Tokyo, Japan
- Katsuhiko Shirahige
  - Research Center for Epigenetic Disease, Institute of Molecular and Cellular Biosciences, The University of Tokyo, Tokyo, Japan
  - Core Research for Evolutional Science and Technology (CREST), Japan Science and Technology Agency, Kawaguchi, Japan
6
Hu J, Li Y, Zhang M, Yang X, Shen HB, Yu DJ. Predicting Protein-DNA Binding Residues by Weightedly Combining Sequence-Based Features and Boosting Multiple SVMs. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:1389-1398. [PMID: 27740495 DOI: 10.1109/tcbb.2016.2616469] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Indexed: 06/06/2023]
Abstract
Protein-DNA interactions are ubiquitous in a wide variety of biological processes. Correctly locating DNA-binding residues solely from protein sequences is an important but challenging task for protein function annotations and drug discovery, especially in the post-genomic era where large volumes of protein sequences have quickly accumulated. In this study, we report a new predictor, named TargetDNA, for targeting protein-DNA binding residues from primary sequences. TargetDNA uses a protein's evolutionary information and its predicted solvent accessibility as two base features and employs a centered linear kernel alignment algorithm to learn the weights for weightedly combining the two features. Based on the weightedly combined feature, multiple initial predictors with SVM as classifiers are trained by applying a random under-sampling technique to the original dataset, the purpose of which is to cope with the severe imbalance phenomenon that exists between the number of DNA-binding and non-binding residues. The final ensembled predictor is obtained by boosting the multiple initially trained predictors. Experimental simulation results demonstrate that the proposed TargetDNA achieves a high prediction performance and outperforms many existing sequence-based protein-DNA binding residue predictors. The TargetDNA web server and datasets are freely available at http://csbio.njust.edu.cn/bioinf/TargetDNA/ for academic use.
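The random under-sampling step described above can be sketched as follows. This is an illustration under stated assumptions, not TargetDNA's implementation: the 1:1 class ratio, the function names, and the use of a simple majority vote in place of the boosting step are all hypothetical choices made to keep the example short.

```python
import random

def balanced_subsets(labels, n_subsets, seed=0):
    """Return n_subsets index lists, each holding every minority-class index
    plus an equally sized random draw from the majority class."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, len(minority))  # without replacement
        subsets.append(sorted(minority + sampled))
    return subsets

def majority_vote(predictions):
    """Combine 0/1 predictions from several models for one residue.
    A plain vote stands in for boosting here; ties resolve to 0."""
    return int(sum(predictions) * 2 > len(predictions))

# 3 binding residues (label 1) against 20 non-binding residues (label 0).
labels = [1] * 3 + [0] * 20
for subset in balanced_subsets(labels, n_subsets=4):
    print(subset)
```

Each subset would train one base classifier, so every majority-class example has a chance of being seen by some model while each individual model trains on balanced data.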
7
Lee ESA, Sze-To HYA, Wong MH, Leung KS, Lau TCK, Wong AKC. Discovering Protein-DNA Binding Cores by Aligned Pattern Clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:254-263. [PMID: 26336137 DOI: 10.1109/tcbb.2015.2474376] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/05/2023]
Abstract
Understanding binding cores is of fundamental importance in deciphering Protein-DNA (TF-TFBS) binding and gene regulation. Limited by expensive experiments, it is promising to discover them with variations directly from sequence data. Although existing computational methods have produced satisfactory results, they are one-to-one mappings with no site-specific information on residue/nucleotide variations, where these variations in binding cores may impact binding specificity. This study presents a new representation for modeling binding cores by incorporating variations and an algorithm to discover them from only sequence data. Our algorithm takes protein and DNA sequences from TRANSFAC (a Protein-DNA Binding Database) as input; discovers from both sets of sequences conserved regions in Aligned Pattern Clusters (APCs); associates them as Protein-DNA Co-Occurring APCs; ranks the Protein-DNA Co-Occurring APCs according to their co-occurrence, and among the top ones, finds three-dimensional structures to support each binding core candidate. If successful, candidates are verified as binding cores. Otherwise, homology modeling is applied to their close matches in PDB to attain new chemically feasible binding cores. Our algorithm obtains binding cores with higher precision and much faster runtime (≥ 1,600x) than that of its contemporaries, discovering candidates that do not co-occur as one-to-one associated patterns in the raw data. AVAILABILITY http://www.pami.uwaterloo.ca/~ealee/files/tcbbPnDna2015/Release.zip.
8
Wong KC, Peng C, Li Y. Evolving Transcription Factor Binding Site Models From Protein Binding Microarray Data. IEEE Transactions on Cybernetics 2017; 47:415-424. [PMID: 26887021 DOI: 10.1109/tcyb.2016.2519380] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 06/05/2023]
Abstract
Protein binding microarray (PBM) is a high-throughput platform that can measure the DNA binding preference of a protein in a comprehensive and unbiased manner. In this paper, we describe the PBM motif model building problem. We apply several evolutionary computation methods and compare their performance with the interior point method, demonstrating their performance advantages. In addition, given the PBM domain knowledge, we propose and describe a novel method called kmerGA which makes domain-specific assumptions to exploit PBM data properties to build more accurate models than the other models built. The effectiveness and robustness of kmerGA is supported by comprehensive performance benchmarking on more than 200 datasets, time complexity analysis, convergence analysis, parameter analysis, and case studies. To demonstrate its utility further, kmerGA is applied to two real world applications: 1) PBM rotation testing and 2) ChIP-Seq peak sequence prediction. The results support the biological relevance of the models learned by kmerGA, and thus its real world applicability.
9
Wong KC, Peng C, Yan S, Liang C. Probabilistic Inference on Multiple Normalized Genome-Wide Signal Profiles With Model Regularization. IEEE Trans Nanobioscience 2016; 16:43-50. [PMID: 27893398 DOI: 10.1109/tnb.2016.2631406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
Understanding genome-wide protein-DNA interaction signals forms the basis for further focused studies in gene regulation. In particular, the chromatin immunoprecipitation with massively parallel DNA sequencing technology (ChIP-Seq) can enable us to measure the in vivo genome-wide occupancy of the DNA-binding protein of interest in a single run. Multiple ChIP-Seq runs thus offer the potential for us to decipher the combinatorial occupancies of multiple DNA-binding proteins. To handle the genome-wide signal profiles from those multiple runs, we propose to integrate regularized regression functions (i.e., LASSO, Elastic Net, and Ridge Regression) into the well-established SignalRanker and FullSignalRanker frameworks, resulting in six additional probabilistic models for inference on multiple normalized genome-wide signal profiles. The corresponding model training algorithms are devised with computational complexity analysis. Comprehensive benchmarking is conducted to demonstrate and compare the performance of nine related probabilistic models on the ENCODE ChIP-Seq datasets. The results indicate that the regularized SignalRanker models, in contrast to the original SignalRanker models, can demonstrate excellent inference performance comparable to the FullSignalRanker models with low model complexities and time complexities. Such a feature is especially valuable in the context of the rapidly growing genome-wide signal profile data in recent years.
10
Zhu L, Guo WL, Lu C, Huang DS. Collaborative Completion of Transcription Factor Binding Profiles via Local Sensitive Unified Embedding. IEEE Trans Nanobioscience 2016; 15:946-958. [PMID: 27845669 DOI: 10.1109/tnb.2016.2625823] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/09/2022]
Abstract
Although the newly available ChIP-seq data provides immense opportunities for comparative study of regulatory activities across different biological conditions, due to cost, time or sample material availability, it is not always possible for researchers to obtain binding profiles for every protein in every sample of interest, which considerably limits the power of integrative studies. Recently, by leveraging related information from measured data, Ernst et al. proposed ChromImpute for predicting additional ChIP-seq and other types of datasets and demonstrated that the imputed signal tracks accurately approximate the experimentally measured signals, and thereby could potentially enhance the power of integrative analysis. Despite the success of ChromImpute, in this paper, we reexamine its learning process, and show that its performance may degrade substantially and sometimes may even fail to output a prediction when the available data is scarce. This limitation could hurt its applicability to important predictive tasks, such as the imputation of TF binding data. To alleviate this problem, we propose a novel method called Local Sensitive Unified Embedding (LSUE) for imputing new ChIP-seq datasets. In LSUE, the ChIP-seq data compendium is fused together by mapping proteins, samples, and genomic positions simultaneously into the Euclidean space, thereby making their underlying associations directly evaluable using simple calculations. In contrast to ChromImpute, which mainly makes use of the local correlations between available datasets, LSUE can better estimate the overall data structure by formulating the representation learning of all involved entities as a single unified optimization problem. Meanwhile, a novel form of local sensitive low rank regularization is also proposed to further improve the performance of LSUE. Experimental evaluations on the ENCODE TF ChIP-seq data illustrate the performance of the proposed model.
The code of LSUE is available at https://github.com/ekffar/LSUE.
11
Abstract
Chromatin immunoprecipitation followed by sequencing is an invaluable assay for identifying the genomic binding sites of transcription factors. However, transcription factors rarely bind chromatin alone but often bind together with other cofactors, forming protein complexes. Here, we describe a computational method that integrates multiple ChIP-seq and RNA-seq datasets to discover protein complexes and determine their role as activators or repressors. This chapter outlines a detailed computational pipeline for discovering and predicting binding partners from ChIP-seq data and inferring their role in regulating gene expression. This work aims at developing hypotheses about gene regulation via binding partners and deciphering the combinatorial nature of DNA-binding proteins.
12
Abstract
Background Peak calling is a fundamental step in the analysis of data generated by ChIP-seq or similar techniques to acquire epigenetics information. Current peak callers are often hard to parameterise and may therefore be difficult to use for non-bioinformaticians. In this paper, we present the ChIP-seq analysis tool available in CLC Genomics Workbench and CLC Genomics Server (version 7.5 and up), a user-friendly peak-caller designed to be not specific to a particular *-seq protocol. Results We illustrate the advantages of a shape-based approach and describe the algorithmic principles underlying the implementation. Thanks to the generality of the idea and the fact the algorithm is able to learn the peak shape from the data, the implementation requires only minimal user input, while still being applicable to a range of *-seq protocols. Using independently validated benchmark datasets, we compare our implementation to other state-of-the-art algorithms explicitly designed to analyse ChIP-seq data and provide an evaluation in terms of receiver-operator characteristic (ROC) plots. In order to show the applicability of the method to similar *-seq protocols, we also investigate algorithmic performances on DNase-seq data. Conclusions The results show that CLC shape-based peak caller ranks well among popular state-of-the-art peak callers while providing flexibility and ease-of-use.
Affiliation(s)
- Michael Lappe
  - Qiagen Aarhus, Silkeborgvej 2, Aarhus, 8000, DK, Denmark.
13
Chen X, Jung JG, Shajahan-Haq AN, Clarke R, Shih IM, Wang Y, Magnani L, Wang TL, Xuan J. ChIP-BIT: Bayesian inference of target genes using a novel joint probabilistic model of ChIP-seq profiles. Nucleic Acids Res 2016; 44:e65. [PMID: 26704972 PMCID: PMC4838354 DOI: 10.1093/nar/gkv1491] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 04/21/2015] [Revised: 11/16/2015] [Accepted: 12/09/2015] [Indexed: 11/16/2022]
Abstract
Chromatin immunoprecipitation with massively parallel DNA sequencing (ChIP-seq) has greatly improved the reliability with which transcription factor binding sites (TFBSs) can be identified from genome-wide profiling studies. Many computational tools have been developed to detect binding events or peaks; however, the robust detection of weak binding events remains a challenge for current peak calling tools. We have developed a novel Bayesian approach (ChIP-BIT) to reliably detect TFBSs and their target genes by jointly modeling binding signal intensities and binding locations of TFBSs. Specifically, a Gaussian mixture model is used to capture both binding and background signals in sample data. As a unique feature of ChIP-BIT, background signals are modeled by a local Gaussian distribution that is accurately estimated from the input data. Extensive simulation studies showed a significantly improved performance of ChIP-BIT in target gene prediction, particularly for detecting weak binding signals at gene promoter regions. We applied ChIP-BIT to find target genes from NOTCH3 and PBX1 ChIP-seq data acquired from MCF-7 breast cancer cells. TF knockdown experiments have initially validated about 30% of co-regulated target genes identified by ChIP-BIT as being differentially expressed in MCF-7 cells. Functional analysis on these genes further revealed the existence of crosstalk between Notch and Wnt signaling pathways.
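How a two-component Gaussian mixture separates binding from background signal can be shown with a short sketch. It covers only the intensity part and is not ChIP-BIT's model, which also uses binding locations; all parameter names and the example values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and std. dev."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def binding_posterior(signal, prior_bound, bound_mean, bound_sd, bg_mean, bg_sd):
    """Posterior probability that a signal value comes from the binding
    component rather than the background component (Bayes' rule)."""
    bound = prior_bound * gaussian_pdf(signal, bound_mean, bound_sd)
    background = (1.0 - prior_bound) * gaussian_pdf(signal, bg_mean, bg_sd)
    return bound / (bound + background)

# A weak signal close to background still gets a non-zero binding probability.
for value in (0.5, 1.0, 3.0):
    print(value, round(binding_posterior(value, 0.5, 2.0, 1.0, 0.0, 1.0), 3))
```

In the abstract's setting the background mean and standard deviation would be estimated locally from the input (control) data rather than fixed as they are here.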
Affiliation(s)
- Xi Chen
  - Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 900 North Glebe Road, Arlington, VA 22203, USA
- Jin-Gyoung Jung
  - Department of Pathology, Johns Hopkins Medical Institutions, 1550 Orleans Street, CRB-II, Baltimore, MD 21231, USA
- Ayesha N Shajahan-Haq
  - Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, 3970 Reservoir Road NW, Washington, DC 20057, USA
- Robert Clarke
  - Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, 3970 Reservoir Road NW, Washington, DC 20057, USA
- Ie-Ming Shih
  - Department of Pathology, Johns Hopkins Medical Institutions, 1550 Orleans Street, CRB-II, Baltimore, MD 21231, USA
- Yue Wang
  - Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 900 North Glebe Road, Arlington, VA 22203, USA
- Luca Magnani
  - Department of Surgery and Cancer, Imperial College London, ICTEM building, Hammersmith Hospital, DuCane Road, London W12 0NN, UK
- Tian-Li Wang
  - Department of Pathology, Johns Hopkins Medical Institutions, 1550 Orleans Street, CRB-II, Baltimore, MD 21231, USA
- Jianhua Xuan
  - Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 900 North Glebe Road, Arlington, VA 22203, USA
14
Wong KC, Li Y, Peng C, Wong HS. A Comparison Study for DNA Motif Modeling on Protein Binding Microarray. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:261-271. [PMID: 27045826 DOI: 10.1109/tcbb.2015.2443782] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 06/05/2023]
Abstract
Transcription factor binding sites (TFBSs) are relatively short (5-15 bp) and degenerate. Identifying them is a computationally challenging task. In particular, protein binding microarray (PBM) is a high-throughput platform that can measure the DNA binding preference of a protein in a comprehensive and unbiased manner; for instance, a typical PBM experiment can measure binding signal intensities of a protein to all possible DNA k-mers (k = 8∼10). Since proteins can often bind to DNA with different binding intensities, one of the major challenges is to build TFBS (also known as DNA motif) models which can fully capture the quantitative binding affinity data. To learn DNA motif models from the non-convex objective function landscape, several optimization methods are compared and applied to the PBM motif model building problem. In particular, representative methods from different optimization paradigms have been chosen for modeling performance comparison on hundreds of PBM datasets. The results suggest that the multimodal optimization methods are very effective for capturing the binding preference information from PBM data. In particular, we observe a general performance improvement if choosing di-nucleotide modeling over mono-nucleotide modeling. In addition, the models learned by the best-performing method are applied to two independent applications: PBM probe rotation testing and ChIP-Seq peak sequence prediction, demonstrating its biological applicability.
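To give a sense of the scale implied by "all possible DNA k-mers (k = 8∼10)", the sketch below counts and enumerates k-mers over the four-letter DNA alphabet. The function names are illustrative and do not come from the paper.

```python
from itertools import product

DNA_ALPHABET = "ACGT"

def count_kmers(k):
    """Number of distinct DNA sequences of length k."""
    return len(DNA_ALPHABET) ** k

def all_kmers(k):
    """Lazily yield every DNA k-mer in lexicographic order."""
    for letters in product(DNA_ALPHABET, repeat=k):
        yield "".join(letters)

for k in (8, 9, 10):
    print(k, count_kmers(k))  # 65536, 262144, 1048576

print(list(all_kmers(2))[:5])  # ['AA', 'AC', 'AG', 'AT', 'CA']
```

The count grows fourfold with each extra base, so a motif model fitted to PBM data has to summarize up to roughly a million intensity measurements per protein.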
15
Wong KC, Li Y, Peng C, Moses AM, Zhang Z. Computational learning on specificity-determining residue-nucleotide interactions. Nucleic Acids Res 2015; 43:10180-9. [PMID: 26527718 PMCID: PMC4666365 DOI: 10.1093/nar/gkv1134] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 05/25/2015] [Accepted: 10/18/2015] [Indexed: 01/02/2023]
Abstract
The protein–DNA interactions between transcription factors and transcription factor binding sites are essential activities in gene regulation. To decipher the binding codes, it is a long-standing challenge to understand the binding mechanism across different transcription factor DNA binding families. Past computational learning studies usually focus on learning and predicting the DNA binding residues on the protein side. Taking into account both sides (protein and DNA), we propose and describe a computational study for learning the specificity-determining residue-nucleotide interactions of different known DNA-binding domain families. The proposed learning models are compared to state-of-the-art models comprehensively, demonstrating their competitive learning performance. In addition, we describe and propose two applications which demonstrate how the learnt models can provide meaningful insights into protein–DNA interactions across different DNA binding families.
Affiliation(s)
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
- Yue Li
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
  - CSAIL, Massachusetts Institute of Technology, Cambridge, MA 02139-4307, USA
- Chengbin Peng
  - CEMSE Division, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia
- Alan M Moses
  - Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, Canada
  - Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
- Zhaolei Zhang
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
  - Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
16
Wong KC, Li Y, Peng C. Identification of coupling DNA motif pairs on long-range chromatin interactions in human K562 cells. Bioinformatics 2015; 32:321-4. [PMID: 26411866 DOI: 10.1093/bioinformatics/btv555] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 06/25/2015] [Accepted: 09/15/2015] [Indexed: 11/13/2022]
Abstract
MOTIVATION The protein-DNA interactions between transcription factors (TFs) and transcription factor binding sites (TFBSs, also known as DNA motifs) are critical activities in gene transcription. The identification of the DNA motifs is a vital task for downstream analysis. Unfortunately, the long-range coupling information between different DNA motifs is still lacking. To fill the void, as the first-of-its-kind study, we have identified the coupling DNA motif pairs on long-range chromatin interactions in human. RESULTS The coupling DNA motif pairs exhibit substantially higher DNase accessibility than the background sequences. Half of the DNA motifs involved are matched to the existing motif databases, although nearly all of them are enriched with at least one gene ontology term. Their motif instances are also found statistically enriched on the promoter and enhancer regions. In particular, we introduce a novel measurement called motif pairing multiplicity, which is defined as the number of motifs that are paired with a given motif on chromatin interactions. Interestingly, we observe that motif pairing multiplicity is linked to several characteristics such as regulatory region type, motif sequence degeneracy, DNase accessibility and pairing genomic distance. Taken together, we believe the coupling DNA motif pairs identified in this study can shed light on the gene transcription mechanism under long-range chromatin interactions. AVAILABILITY AND IMPLEMENTATION The identified motif pair data is compressed and available in the supplementary materials associated with this manuscript. CONTACT kc.w@cityu.edu.hk SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
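Motif pairing multiplicity, as defined in the abstract, is the number of motifs paired with a given motif. A minimal sketch of that count follows, assuming the pairs are available as unordered tuples of motif identifiers; the input format and the motif names are hypothetical.

```python
from collections import defaultdict

def pairing_multiplicity(motif_pairs):
    """Map each motif to the number of distinct motifs it is paired with.
    Pairs are treated as unordered, and duplicate pairs count once."""
    partners = defaultdict(set)
    for first, second in motif_pairs:
        partners[first].add(second)
        partners[second].add(first)
    return {motif: len(paired) for motif, paired in partners.items()}

pairs = [("M1", "M2"), ("M1", "M3"), ("M2", "M1"), ("M3", "M4")]
print(pairing_multiplicity(pairs))
# {'M1': 2, 'M2': 1, 'M3': 2, 'M4': 1}
```

Using sets makes the result insensitive to pair order and repetition, which matters if the same coupling is observed on several chromatin interactions.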
Affiliation(s)
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
- Yue Li
- CSAIL, Massachusetts Institute of Technology, Cambridge, MA 02139-4307, USA
- Chengbin Peng
- CEMSE Division, King Abdullah University of Science and Technology, Thuwal, Jeddah, Kingdom of Saudi Arabia
|
17
|
Liao B, Ding S, Chen H, Li Z, Cai L. Identifying human microRNA–disease associations by a new diffusion-based method. J Bioinform Comput Biol 2015; 13:1550014. [DOI: 10.1142/s0219720015500146] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/18/2022]
Abstract
Identifying the microRNA–disease relationship is vital for investigating the pathogenesis of various diseases. However, experimental verification of disease-related microRNAs remains a considerable challenge for many researchers, particularly because numerous new microRNAs are discovered every year. As such, the development of computational methods for disease-related microRNA prediction has recently gained considerable attention. In this paper, we first construct a miRNA functional network and a disease similarity network by integrating different information sources. We then introduce a new diffusion-based method (NDBM) to explore global network similarity for miRNA–disease association inference. Even though known miRNA–disease associations in the database are rare, NDBM still achieves an area under the ROC curve (AUC) of 85.62% in leave-one-out cross-validation, significantly improving on the prediction accuracy of previous methods. Moreover, our method is applicable to diseases with no known related miRNAs as well as to new miRNAs with unknown target diseases. Some associations strongly predicted by our method are confirmed by public databases. This superior performance suggests that NDBM could be an effective and important tool for biomedical research.
Affiliation(s)
- Bo Liao
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Sumei Ding
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Haowen Chen
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Zejun Li
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Lijun Cai
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
|
18
|
Wu DY, Bittencourt D, Stallcup MR, Siegmund KD. Identifying differential transcription factor binding in ChIP-seq. Front Genet 2015; 6:169. [PMID: 25972895 PMCID: PMC4413818 DOI: 10.3389/fgene.2015.00169] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 01/22/2015] [Accepted: 04/14/2015] [Indexed: 12/19/2022] Open
Abstract
ChIP-seq is a widely used assay to measure genome-wide protein binding. The decrease in costs associated with sequencing has led to a rise in the number of studies that investigate protein binding across treatment conditions or cell lines. In addition to the identification of binding sites, new studies evaluate the variation in protein binding between conditions. A number of approaches to study differential transcription factor binding have recently been developed. Several of these methods build upon established methods from RNA-seq to quantify differences in read counts. We compare how these new approaches perform on different data sets from the ENCODE project to illustrate the impact of data processing pipelines under different study designs. The performance of normalization methods for differential ChIP-seq depends strongly on the variation in the total amount of protein bound between conditions, with total read count outperforming effective library size, or variants thereof, when a large variation in binding was studied. Use of input subtraction to correct for non-specific binding showed a relatively modest impact on the number of differential peaks found and on the accuracy of fold changes relative to biological validation; however, a larger impact might be expected for samples with more extreme copy number variations between them. Still, it did identify a small subset of novel differential regions while excluding some differential peaks in regions with high background signal. These results highlight proper scaling for between-sample data normalization as critical for differential transcription factor binding analysis and suggest that bioinformaticians need to know about the variation in the level of total protein binding between conditions to select the best analysis method. At the same time, validation using fold-change estimates from qRT-PCR suggests there is still room for further method improvement.
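As a rough illustration of the total-read-count scaling mentioned in this abstract, the sketch below rescales per-peak read counts by each sample's total sequenced reads before comparing conditions. It is only an assumed, simplified form of that normalization, not the pipeline evaluated in the paper: the function name, the reads-per-million scale, and all counts are hypothetical.

```python
def scale_by_total_reads(peak_counts, total_reads, per=1_000_000):
    """Scale raw per-peak read counts to reads per `per` sequenced reads.

    `peak_counts` holds raw read counts for a set of peaks in one sample and
    `total_reads` is that sample's total read count (both hypothetical inputs).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return [count * per / total_reads for count in peak_counts]


# Hypothetical counts for the same three peaks in two conditions.
control = scale_by_total_reads([50, 200, 10], total_reads=10_000_000)
treated = scale_by_total_reads([150, 300, 15], total_reads=20_000_000)

# Fold change per peak after scaling (treated relative to control).
fold_changes = [t / c for t, c in zip(treated, control)]
print(control, treated, fold_changes)
```

Without scaling, every peak in the treated sample would appear to gain reads; after dividing by total read count, only the first peak shows an increase, which is why the choice of scaling factor matters for differential binding calls.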
Affiliation(s)
- Dai-Ying Wu
- Department of Biochemistry and Molecular Biology, University of Southern California Norris Comprehensive Cancer Center, University of Southern California Los Angeles, CA, USA
- Danielle Bittencourt
- Department of Biochemistry and Molecular Biology, University of Southern California Norris Comprehensive Cancer Center, University of Southern California Los Angeles, CA, USA
- Michael R Stallcup
- Department of Biochemistry and Molecular Biology, University of Southern California Norris Comprehensive Cancer Center, University of Southern California Los Angeles, CA, USA
- Kimberly D Siegmund
- Department of Preventive Medicine, University of Southern California Norris Comprehensive Cancer Center, University of Southern California Los Angeles, CA, USA
|