1. Raditsa V, Tsukanov A, Bogomolov A, Levitsky V. Genomic background sequences systematically outperform synthetic ones in de novo motif discovery for ChIP-seq data. NAR Genom Bioinform 2024; 6:lqae090. [PMID: 39071850 PMCID: PMC11282361 DOI: 10.1093/nargab/lqae090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2024] [Revised: 06/03/2024] [Accepted: 07/19/2024] [Indexed: 07/30/2024] Open
Abstract
Efficient de novo motif discovery from the results of genome-wide mapping of transcription factor binding sites (ChIP-seq) depends on the choice of background nucleotide sequences. The foreground sequences (ChIP-seq peaks) contain not only specific motifs of target transcription factors, but also motifs overrepresented throughout the genome, such as simple sequence repeats. We performed a large-scale comparison of the 'synthetic' and 'genomic' approaches to generating background sequences for de novo motif discovery. The 'synthetic' approach shuffled nucleotides in peaks, while the 'genomic' approach selected sequences from the reference genome, either randomly or only from gene promoters, according to the fraction of A/T nucleotides in each sequence. We compiled benchmark collections of ChIP-seq datasets for mouse, human and Arabidopsis, and performed de novo motif discovery. We showed that the genomic approach provides both more robust detection of the known motifs of target transcription factors and more stringent exclusion of simple sequence repeats as possible non-specific motifs. The advantage of the genomic approach over the synthetic approach was greater in plants than in mammals. We developed the AntiNoise web service (https://denovosea.icgbio.ru/antinoise/), which implements the genomic approach to extract background sequences for twelve eukaryotic genomes.
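The two background strategies contrasted in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AntiNoise implementation: the function names, the `tolerance` on the A/T fraction, and the rejection-sampling loop are all hypothetical choices made here for clarity.

```python
import random


def at_fraction(seq):
    """Return the fraction of A/T nucleotides in a sequence."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)


def synthetic_background(peak, rng):
    """'Synthetic' approach: shuffle the nucleotides of the peak itself.

    Preserves the exact nucleotide composition but not the local context.
    """
    bases = list(peak)
    rng.shuffle(bases)
    return "".join(bases)


def genomic_background(peak, genome, rng, tolerance=0.05, max_tries=10000):
    """'Genomic' approach: sample a genome window matched on A/T fraction.

    Assumption: a window is accepted when its A/T fraction is within
    `tolerance` of the peak's; the paper's exact matching rule may differ.
    Returns None if no suitable window is found in `max_tries` draws.
    """
    target = at_fraction(peak)
    width = len(peak)
    for _ in range(max_tries):
        start = rng.randrange(len(genome) - width + 1)
        window = genome[start:start + width]
        if abs(at_fraction(window) - target) <= tolerance:
            return window
    return None
```

A seeded `random.Random` instance is passed in so that a run is reproducible. Restricting `genome` to promoter sequences would correspond to the promoter-only variant mentioned in the abstract.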
Affiliation(s)
- Vladimir V Raditsa
  - Department of System Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Anton V Tsukanov
  - Department of System Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Anton G Bogomolov
  - Department of Cell Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Victor G Levitsky
  - Department of System Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
  - Department of Natural Science, Novosibirsk State University, Novosibirsk 630090, Russia
2. Li Y, Wang Y, Wang C, Ma A, Ma Q, Liu B. A weighted two-stage sequence alignment framework to identify motifs from ChIP-exo data. Patterns (New York, N.Y.) 2024; 5:100927. [PMID: 38487805 PMCID: PMC10935504 DOI: 10.1016/j.patter.2024.100927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 08/18/2023] [Accepted: 01/10/2024] [Indexed: 03/17/2024]
Abstract
In this study, we introduce TESA (weighted two-stage alignment), an innovative motif prediction tool that refines the identification of DNA-binding protein motifs, essential for deciphering transcriptional regulatory mechanisms. Unlike traditional algorithms that rely solely on sequence data, TESA integrates the high-resolution chromatin immunoprecipitation (ChIP) signal, specifically from ChIP-exonuclease (ChIP-exo), by assigning weights to sequence positions, thereby enhancing motif discovery. TESA employs a nuanced approach combining a binomial distribution model with a graph model, further supported by a "bookend" model, to improve the accuracy of predicting motifs of varying lengths. Our evaluation, utilizing an extensive compilation of 90 prokaryotic ChIP-exo datasets from proChIPdb and 167 H. sapiens datasets, compared TESA's performance against seven established tools. The results indicate TESA's improved precision in motif identification, suggesting its valuable contribution to the field of genomic research.
Affiliation(s)
- Yang Li
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Yizhong Wang
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Cankun Wang
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Anjun Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Qin Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, USA
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
3. Noorani MS, Baig MS, Khan JA, Pravej A. Whole genome characterization and diagnostics of prunus necrotic ringspot virus (PNRSV) infecting apricot in India. Sci Rep 2023; 13:4393. [PMID: 36928763 PMCID: PMC10020458 DOI: 10.1038/s41598-023-31172-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2022] [Accepted: 03/07/2023] [Indexed: 03/18/2023] Open
Abstract
Prunus necrotic ringspot virus (PNRSV) is a pathogen that infects Prunus species worldwide, causing major economic losses. Using one- and two-step RT-PCR and multiplex RT-PCR, the whole genome of PNRSV infecting apricot was obtained and described in this study. Computational approaches were used to investigate the participation of several regulatory motifs and domains of Replicase1, Replicase2, MP, and CP. A single degenerate reverse primer and three forward oligo primers were used to amplify PNRSV's tripartite genome. According to the sequencing analysis, the size of RNA1 was 3.332 kb, RNA2 was 2.591 kb, and RNA3 was 1.952 kb. The Sequence Demarcation Tool analysis determined a pairwise identity ranging between 91 and 99% for RNA1 and RNA2, and between 87 and 98% for RNA3. Interestingly, the phylogenetic analysis revealed that closely related RNA1, RNA2, and RNA3 sequences of PNRSV strains from various geographical regions of the world are classified into distinct clades or groups. This is the first report on the characterization of the whole genome of PNRSV from India, which provides the cornerstone for further studies on the molecular evolution of this virus. This study will assist in molecular diagnostics and management of the diseases caused by PNRSV.
Affiliation(s)
- Md Salik Noorani
  - Department of Botany, School of Chemical and Life Sciences, Jamia Hamdard (A Deemed-to-Be University), New Delhi, India
  - Plant Virus Laboratory, Department of Biosciences, Jamia Millia Islamia (A Central University), New Delhi, India
- Mirza Sarwar Baig
  - Department of Molecular Medicine, School of Interdisciplinary Sciences, Jamia Hamdard (A Deemed-to-Be University), New Delhi, India
  - Plant Virus Laboratory, Department of Biosciences, Jamia Millia Islamia (A Central University), New Delhi, India
- Jawaid Ahmad Khan
  - Plant Virus Laboratory, Department of Biosciences, Jamia Millia Islamia (A Central University), New Delhi, India
- Alam Pravej
  - Biology Department, College of Science and Humanities, Prince Sattam Bin Abdulaziz University (PSAU), 11942, Alkharj, Kingdom of Saudi Arabia
4. Prosperi M, Marini S, Boucher C. Fast and exact quantification of motif occurrences in biological sequences. BMC Bioinformatics 2021; 22:445. [PMID: 34537012 PMCID: PMC8449872 DOI: 10.1186/s12859-021-04355-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/04/2021] [Accepted: 09/06/2021] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND: Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to use on large motif sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p-value calculation remains challenging. Here, we introduce 'motif_prob', a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, usually impractical, making it feasible and poised to substitute currently employed heuristics. RESULTS: We implement motif_prob in both Perl and C++, using an efficient error-bound iterative process for the exact formula, and provide a comparison with state-of-the-art tools (e.g. MoSDi) in terms of precision and run-time benchmarks, along with a real-world use case on bacterial motif characterization. Our software is able to process a million motifs (13-31 bases) over genome lengths of 5 million bases within a minute on a regular laptop, and the run times for both the Perl and C++ code are several orders of magnitude smaller (50-1000× faster) than MoSDi, even when using its fast compound Poisson approximation (60-120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then how p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and the source code are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob . CONCLUSIONS: The motif_prob software is a multi-platform and efficient open source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p-values, without loss in data processing efficiency.
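To illustrate why GC content matters when judging motif enrichment, the following Python sketch computes an expected motif count and an upper-tail p-value. It is a deliberately simplified stand-in, not the exact formula implemented in motif_prob: it assumes independent, identically distributed bases (rather than a Markov model), a single strand, and a plain Poisson approximation that ignores self-overlapping occurrences. All function names are hypothetical.

```python
import math


def motif_probability(motif, gc_content):
    """Probability of the motif at one fixed position under an i.i.d. model.

    Assumes P(G) = P(C) = gc_content / 2 and P(A) = P(T) = (1 - gc_content) / 2.
    """
    p_gc = gc_content / 2.0
    p_at = (1.0 - gc_content) / 2.0
    prob = 1.0
    for base in motif.upper():
        if base in "GC":
            prob *= p_gc
        elif base in "AT":
            prob *= p_at
        else:
            raise ValueError(f"unsupported nucleotide: {base!r}")
    return prob


def expected_count(motif, genome_length, gc_content):
    """Expected number of occurrences on one strand of a linear genome."""
    positions = genome_length - len(motif) + 1
    return positions * motif_probability(motif, gc_content)


def poisson_upper_tail(observed, mean):
    """Approximate P(X >= observed) for X ~ Poisson(mean).

    Computed as 1 - CDF, so very small tail probabilities lose precision.
    """
    if observed <= 0:
        return 1.0
    term = math.exp(-mean)  # P(X = 0)
    cdf = 0.0
    for k in range(observed):
        cdf += term
        term *= mean / (k + 1)  # P(X = k + 1) from P(X = k)
    return max(0.0, 1.0 - cdf)
```

Under these assumptions, a GC-rich 13-mer is expected far more often in a genome with 70% GC than in one with 50% GC, so the same observed count can be unremarkable in one organism and strongly enriched in another.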
Affiliation(s)
- Mattia Prosperi
  - Data Intelligence Systems Lab, Department of Epidemiology, College of Public Health and Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA
- Simone Marini
  - Data Intelligence Systems Lab, Department of Epidemiology, College of Public Health and Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
5. Li JY, Jin S, Tu XM, Ding Y, Gao G. Identifying complex motifs in massive omics data with a variable-convolutional layer in deep neural network. Brief Bioinform 2021; 22:6312656. [PMID: 34219140 DOI: 10.1093/bib/bbab233] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/25/2021] [Revised: 05/25/2021] [Accepted: 05/28/2021] [Indexed: 01/10/2023] Open
Abstract
Motif identification is among the most common and essential computational tasks in bioinformatics and genomics. Here we propose a novel convolutional layer for deep neural networks, named the variable convolutional (vConv) layer, for effective motif identification in high-throughput omics data by learning kernel length adaptively from the data. Empirical evaluations on DNA-protein binding and DNase footprinting cases demonstrated that vConv-based networks outperform their convolutional counterparts regardless of model complexity. Moreover, vConv can be readily integrated into multi-layer neural networks as an 'in-place replacement' for the canonical convolutional layer. All source code is freely available on GitHub for academic usage.
Affiliation(s)
- Jing-Yi Li
  - Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Shen Jin
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Xin-Ming Tu
  - Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Yang Ding
  - Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Ge Gao
  - Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
6. Ashraf FB, Shafi MSR. MFEA: An evolutionary approach for motif finding in DNA sequences. Informatics in Medicine Unlocked 2020. [DOI: 10.1016/j.imu.2020.100466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022] Open
7.
Abstract
Our understanding of the expanded genetic alphabet has been growing rapidly over the last two decades, and many of these developments came more than 80 years after the original discovery of a modified guanine in tuberculosis DNA. These new understandings, leading to the field of epigenetics, have led to exciting new fundamental and applied knowledge and to the development of novel classes of drugs exploiting this new biology. The number of methyl modifications to RNA is about seven times greater than those found on DNA, and our ability to interrogate these enigmatic nucleobases has lagged significantly until recent years as an explosion in technologies and understanding has revealed the roles and regulation of RNA methylation in several fundamental and disease-associated biological processes. Here, we outline how the technology has evolved and which strategies are commonly used in the modern epitranscriptomics revolution and give a foundation in the understanding and application of the rich variety of these methods to novel biological questions.
Affiliation(s)
- Nigel P. Mongan
  - School of Veterinary Medicine and Sciences, University of Nottingham, Sutton Bonington Campus, Loughborough, UK
  - Department of Pharmacology, Weill Cornell Medical Center, New York, NY, USA
- Richard D. Emes
  - School of Veterinary Medicine and Sciences, University of Nottingham, Sutton Bonington Campus, Loughborough, UK
  - Advanced Data Analysis Centre, University of Nottingham, Sutton Bonington Campus, Loughborough, UK
- Nathan Archer
  - School of Veterinary Medicine and Sciences, University of Nottingham, Sutton Bonington Campus, Loughborough, UK
8. Choi K, Ratner N. iGEAK: an interactive gene expression analysis kit for seamless workflow using the R/shiny platform. BMC Genomics 2019; 20:177. [PMID: 30841853 PMCID: PMC6404331 DOI: 10.1186/s12864-019-5548-x] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 09/21/2018] [Accepted: 02/20/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The use of microarrays and RNA-seq technologies is ubiquitous for transcriptome analyses in modern biology. With proper analysis tools, the differential gene expression analysis process can be significantly accelerated. Many open-source programs provide cutting-edge techniques, but these often require programming skills and lack intuitive, interactive, or graphical user interfaces. To avoid bottlenecks impeding seamless analysis processing, we have developed an Interactive Gene Expression Analysis Kit, which we term iGEAK, focusing on usability and interactivity. iGEAK is designed to be a simple, intuitive, lightweight tool, in contrast to heavy-duty programs. RESULTS: iGEAK is an R/Shiny-based client-side desktop application, providing an interactive gene expression data analysis pipeline for microarray and RNA-seq data. Gene expression data can be intuitively explored using a seamless analysis pipeline consisting of sample selection, differentially expressed gene prediction, protein-protein interaction, and gene set enrichment analyses. For each analysis step, users can easily alter parameters to mine more relevant biological information. CONCLUSION: iGEAK is the outcome of close collaboration with wet-bench biologists who are eager to easily explore, mine, and analyze new or public microarray and RNA-seq data. We designed iGEAK as a gene expression analysis pipeline tool to provide essential analysis steps and a user-friendly interactive graphical user interface. iGEAK enables users without programming knowledge to comfortably perform differential gene expression predictions and downstream analyses. iGEAK packages, manuals, tutorials, and sample datasets are available at the iGEAK project homepage ( https://sites.google.com/view/iGEAK ).
Affiliation(s)
- Kwangmin Choi
  - Division of Experimental Hematology and Cancer Biology, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229 USA
- Nancy Ratner
  - Division of Experimental Hematology and Cancer Biology, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229 USA
9. Martins-Santana L, Nora LC, Sanches-Medeiros A, Lovate GL, Cassiano MHA, Silva-Rocha R. Systems and Synthetic Biology Approaches to Engineer Fungi for Fine Chemical Production. Front Bioeng Biotechnol 2018; 6:117. [PMID: 30338257 PMCID: PMC6178918 DOI: 10.3389/fbioe.2018.00117] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 04/09/2018] [Accepted: 08/02/2018] [Indexed: 01/16/2023] Open
Abstract
Since the advent of systems and synthetic biology, many studies have sought to harness microbes as cell factories through genetic and metabolic engineering approaches. Yeast and filamentous fungi have been successfully harnessed to produce fine and high value-added chemical products. In this review, we present some of the most promising advances from recent years in the use of fungi for this purpose, focusing on the manipulation of fungal strains using systems and synthetic biology tools to improve metabolic flow and the flow of secondary metabolites by pathway redesign. We also review the roles of bioinformatics analysis and predictions in synthetic circuits, highlighting in silico systemic approaches to improve the efficiency of synthetic modules.
Affiliation(s)
- Leonardo Martins-Santana
  - Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Luisa C Nora
  - Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Ananda Sanches-Medeiros
  - Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Gabriel L Lovate
  - Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Murilo H A Cassiano
  - Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Rafael Silva-Rocha
  - Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil