1
Yang W, Luyten Y, Reister E, Mangelson H, Sisson Z, Auch B, Liachko I, Roberts R, Ettwiller L. Proxi-RIMS-seq2 applied to native microbiomes uncovers hundreds of known and novel m5C methyltransferase specificities. Nucleic Acids Res 2025; 53:gkaf226. [PMID: 40156868 PMCID: PMC11954522 DOI: 10.1093/nar/gkaf226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2024] [Revised: 12/13/2024] [Accepted: 03/24/2025] [Indexed: 04/01/2025]
Abstract
Methylation patterns in bacteria can be used to study restriction-modification or other defense systems with novel properties. While m4C and m6A methylation are well characterized mainly through PacBio sequencing, the landscape of m5C methylation is under-characterized. To bridge this gap, we performed RIMS-seq2 (rapid identification of methyltransferase specificity sequencing) on microbiomes composed of resolved assemblies of distinct genomes through proximity ligation. This high-throughput approach enables the identification of m5C methylated motifs and links them to cognate methyltransferases directly on native microbiomes without the need to isolate bacterial strains. Methylation patterns can also be identified on bacteriophage DNA and compared with host DNA, strengthening evidence for phage-host interactions. Applied to three different microbiomes, the method unveiled over 1900 motifs that were deposited in REBASE. The motifs include a novel eight-base recognition site (CATm5CGATG) that was experimentally validated by characterizing its cognate methyltransferase. Our findings suggest that microbiomes harbor arrays of untapped m5C methyltransferase specificities, providing insights into bacterial biology and biotechnological applications.
Affiliation(s)
- Weiwei Yang
- New England Biolabs, Inc., 240 County Road, Ipswich, MA 01938, United States
- Yvette Luyten
- New England Biolabs, Inc., 240 County Road, Ipswich, MA 01938, United States
- Emily Reister
- Phase Genomics, Inc., 1617 8th Ave N, Seattle, WA 98109, United States
- Hayley Mangelson
- Phase Genomics, Inc., 1617 8th Ave N, Seattle, WA 98109, United States
- Zach Sisson
- Phase Genomics, Inc., 1617 8th Ave N, Seattle, WA 98109, United States
- Benjamin Auch
- Phase Genomics, Inc., 1617 8th Ave N, Seattle, WA 98109, United States
- Ivan Liachko
- Phase Genomics, Inc., 1617 8th Ave N, Seattle, WA 98109, United States
- Richard J Roberts
- New England Biolabs, Inc., 240 County Road, Ipswich, MA 01938, United States
- Laurence Ettwiller
- New England Biolabs, Inc., 240 County Road, Ipswich, MA 01938, United States
2
Yang W, Luyten Y, Reister E, Mangelson H, Sisson Z, Auch B, Liachko I, Roberts RJ, Ettwiller L. Proxi-RIMS-seq2 applied to native microbiomes uncovers hundreds of known and novel m5C methyltransferase specificities. bioRxiv 2024:2024.07.15.603628. [PMID: 39071437 PMCID: PMC11275837 DOI: 10.1101/2024.07.15.603628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/30/2024]
Abstract
Methylation patterns in bacteria can be used to study Restriction-Modification (RM) or other defense systems with novel properties. While m4C and m6A methylation is well characterized mainly through PacBio sequencing, the landscape of m5C methylation is under-characterized. To bridge this gap, we performed RIMS-seq2 on microbiomes composed of resolved assemblies of distinct genomes through proximity ligation. This high-throughput approach enables the identification of m5C methylated motifs and links them to cognate methyltransferases directly on native microbiomes without the need to isolate bacterial strains. Methylation patterns can also be identified on viral DNA and compared to host DNA, strengthening evidence for virus-host interaction. Applied to three different microbiomes, the method unveils over 1900 motifs that were deposited in REBASE. The motifs include a novel 8-base recognition site (CATm5CGATG) that was experimentally validated by characterizing its cognate methyltransferase. Our findings suggest that microbiomes harbor arrays of untapped m5C methyltransferase specificities, providing insights to bacterial biology and biotechnological applications.
Affiliation(s)
- Weiwei Yang
- New England Biolabs Inc., 240 County Road, Ipswich, MA 01938, United States
- Yvette Luyten
- New England Biolabs Inc., 240 County Road, Ipswich, MA 01938, United States
- Emily Reister
- Phase Genomics Inc., 1617 8th Ave N, Seattle, WA 98109, United States
- Hayley Mangelson
- Phase Genomics Inc., 1617 8th Ave N, Seattle, WA 98109, United States
- Zach Sisson
- Phase Genomics Inc., 1617 8th Ave N, Seattle, WA 98109, United States
- Benjamin Auch
- Phase Genomics Inc., 1617 8th Ave N, Seattle, WA 98109, United States
- Ivan Liachko
- Phase Genomics Inc., 1617 8th Ave N, Seattle, WA 98109, United States
- Richard J. Roberts
- New England Biolabs Inc., 240 County Road, Ipswich, MA 01938, United States
- Laurence Ettwiller
- New England Biolabs Inc., 240 County Road, Ipswich, MA 01938, United States
3
Anton BP, Roberts RJ. A Survey of Archaeal Restriction-Modification Systems. Microorganisms 2023; 11:2424. [PMID: 37894082 PMCID: PMC10609329 DOI: 10.3390/microorganisms11102424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Revised: 09/24/2023] [Accepted: 09/25/2023] [Indexed: 10/29/2023]
Abstract
When compared with bacteria, relatively little is known about the restriction-modification (RM) systems of archaea, particularly those in taxa outside of the haloarchaea. To improve our understanding of archaeal RM systems, we surveyed REBASE, the restriction enzyme database, to catalog what is known about the genes and activities present in the 519 completely sequenced archaeal genomes currently deposited there. For 49 (9.4%) of these genomes, we also have methylome data from Single-Molecule Real-Time (SMRT) sequencing that reveal the target recognition sites of the active m6A and m4C DNA methyltransferases (MTases). The gene-finding pipeline employed by REBASE is trained primarily on bacterial examples and so will look for similar genes in archaea. Nonetheless, the organizational structure and protein sequence of RM systems from archaea are highly similar to those of bacteria, with both groups acquiring systems from a shared genetic pool through horizontal gene transfer. As in bacteria, we observe numerous examples of "persistent" DNA MTases conserved within archaeal taxa at different levels. We experimentally validated two homologous members of one of the largest "persistent" MTase groups, revealing that methylation of C(m5C)WGG sites may play a key epigenetic role in Crenarchaea. Throughout the archaea, genes encoding m6A, m4C, and m5C DNA MTases, respectively, occur in approximately the ratio 4:2:1.
4
Baum C, Lin YC, Fomenkov A, Anton BP, Chen L, Yan B, Evans TC, Roberts RJ, Tolonen AC, Ettwiller L. Rapid identification of methylase specificity (RIMS-seq) jointly identifies methylated motifs and generates shotgun sequencing of bacterial genomes. Nucleic Acids Res 2021; 49:e113. [PMID: 34417598 PMCID: PMC8565308 DOI: 10.1093/nar/gkab705] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/08/2021] [Revised: 07/29/2021] [Accepted: 08/16/2021] [Indexed: 11/21/2022]
Abstract
DNA methylation is widespread amongst eukaryotes and prokaryotes to modulate gene expression and confer viral resistance. 5-Methylcytosine (m5C) methylation has been described in genomes of a large fraction of bacterial species as part of restriction-modification systems, each composed of a methyltransferase and cognate restriction enzyme. Methylases are site-specific and target sequences vary across organisms. High-throughput methods, such as bisulfite-sequencing can identify m5C at base resolution but require specialized library preparations and single molecule, real-time (SMRT) sequencing usually misses m5C. Here, we present a new method called RIMS-seq (rapid identification of methylase specificity) to simultaneously sequence bacterial genomes and determine m5C methylase specificities using a simple experimental protocol that closely resembles the DNA-seq protocol for Illumina. Importantly, the resulting sequencing quality is identical to DNA-seq, enabling RIMS-seq to substitute standard sequencing of bacterial genomes. Applied to bacteria and synthetic mixed communities, RIMS-seq reveals new methylase specificities, supporting routine study of m5C methylation while sequencing new genomes.
Affiliation(s)
- Chloé Baum
- New England Biolabs, Inc., 240 County Road, Ipswich, MA 01938, USA; Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91000 Évry, France
- Yu-Cheng Lin
- New England Biolabs, Inc., 240 County Road, Ipswich, MA 01938, USA
- Alexey Fomenkov
- New England Biolabs, Inc., 240 County Road, Ipswich, MA 01938, USA
- Brian P Anton
- New England Biolabs, Inc., 240 County Road, Ipswich, MA 01938, USA
- Lixin Chen
- New England Biolabs, Inc., 240 County Road, Ipswich, MA 01938, USA
- Bo Yan
- New England Biolabs, Inc., 240 County Road, Ipswich, MA 01938, USA
- Thomas C Evans
- New England Biolabs, Inc., 240 County Road, Ipswich, MA 01938, USA
- Andrew C Tolonen
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91000 Évry, France
5
Prosperi M, Marini S, Boucher C. Fast and exact quantification of motif occurrences in biological sequences. BMC Bioinformatics 2021; 22:445. [PMID: 34537012 PMCID: PMC8449872 DOI: 10.1186/s12859-021-04355-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/04/2021] [Accepted: 09/06/2021] [Indexed: 12/03/2022]
Abstract
BACKGROUND Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to use on large motif sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p-value calculation remains challenging. Here, we introduce 'motif_prob', a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, usually impractical, making it feasible and poised to substitute currently employed heuristics. RESULTS We implement motif_prob in both Perl and C++, using an efficient error-bound iterative process for the exact formula, and provide a comparison with state-of-the-art tools (e.g. MoSDi) in terms of precision and run time benchmarks, along with a real-world use case on bacterial motif characterization. Our software is able to process a million motifs (13-31 bases) over genome lengths of 5 million bases within a minute on a regular laptop, and the run times for both the Perl and C++ code are several orders of magnitude smaller (50-1000× faster) than MoSDi, even when using their fast compound Poisson approximation (60-120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then how the p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and the code sources are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob. CONCLUSIONS The motif_prob software is a multi-platform and efficient open source solution for calculating exact frequency distributions of motifs.
It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p values, without loss in data processing efficiency.
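To make the enrichment question in this abstract concrete, here is a minimal Python sketch. It is not the exact formula implemented by motif_prob, and it is not the compound Poisson approach of MoSDi. It assumes a much cruder model: independent bases whose frequencies are set only by GC content, a single strand, and a plain Poisson tail that ignores motif self-overlap. The function names `expected_motif_count` and `poisson_upper_tail` are hypothetical and chosen for illustration only.

```python
import math


def expected_motif_count(motif, genome_length, gc_content):
    """Expected single-strand occurrences of `motif` under an i.i.d. base model.

    Assumption: P(G) = P(C) = gc_content / 2 and P(A) = P(T) = (1 - gc_content) / 2.
    """
    p_gc = gc_content / 2.0
    p_at = (1.0 - gc_content) / 2.0
    base_prob = {"A": p_at, "T": p_at, "G": p_gc, "C": p_gc}
    p_motif = math.prod(base_prob[base] for base in motif.upper())
    start_positions = genome_length - len(motif) + 1
    return start_positions * p_motif


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected), with expected > 0.

    Each term is computed in log space so that large counts do not overflow.
    """
    cdf = sum(
        math.exp(-expected + i * math.log(expected) - math.lgamma(i + 1))
        for i in range(observed)
    )
    return max(0.0, 1.0 - cdf)


# The same GC-rich motif is expected far more often in a GC-rich genome,
# so a raw count alone says little about enrichment.
low_gc = expected_motif_count("GGCC", 5_000_000, 0.30)
high_gc = expected_motif_count("GGCC", 5_000_000, 0.70)
print(f"expected at 30% GC: {low_gc:.0f}, at 70% GC: {high_gc:.0f}")
```

Under this simplified model the expected count depends strongly on GC content, which is why the abstract stresses p-value quantification when comparing bacteria with different base compositions.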
Affiliation(s)
- Mattia Prosperi
- Data Intelligence Systems Lab, Department of Epidemiology, College of Public Health and Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA
- Simone Marini
- Data Intelligence Systems Lab, Department of Epidemiology, College of Public Health and Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA
- Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
6
Castellana S, Biagini T, Parca L, Petrizzelli F, Bianco SD, Vescovi AL, Carella M, Mazza T. A comparative benchmark of classic DNA motif discovery tools on synthetic data. Brief Bioinform 2021; 22:6341664. [PMID: 34351399 DOI: 10.1093/bib/bbab303] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/16/2021] [Revised: 07/08/2021] [Accepted: 07/15/2021] [Indexed: 01/01/2023]
Abstract
Hundreds of human proteins were found to establish transient interactions with rather degenerated consensus DNA sequences or motifs. Identifying these motifs and the genomic sites where interactions occur represent one of the most challenging research goals in modern molecular biology and bioinformatics. The last twenty years witnessed an explosion of computational tools designed to perform this task, whose performance has been last compared fifteen years ago. Here, we survey sixteen of them, benchmark their ability to identify known motifs nested in twenty-nine simulated sequence datasets, and finally report their strengths, weaknesses, and complementarity.
Affiliation(s)
- Stefano Castellana
- Bioinformatics Unit, IRCCS Casa Sollievo della Sofferenza, S. Giovanni Rotondo 71013, Italy
- Tommaso Biagini
- Bioinformatics Unit, IRCCS Casa Sollievo della Sofferenza, S. Giovanni Rotondo 71013, Italy
- Luca Parca
- Bioinformatics Unit, IRCCS Casa Sollievo della Sofferenza, S. Giovanni Rotondo 71013, Italy
- Francesco Petrizzelli
- Bioinformatics Unit, IRCCS Casa Sollievo della Sofferenza, S. Giovanni Rotondo 71013, Italy; Department of Experimental Medicine, Sapienza University of Rome, Rome 00161, Italy
- Angelo Luigi Vescovi
- ISBReMIT Institute for Stem Cell Biology, Regenerative Medicine and Innovative Therapies, IRCSS Casa Sollievo della Sofferenza, San Giovanni Rotondo (FG), 71013, Italy
- Massimo Carella
- Medical Genetics Unit, IRCCS Casa Sollievo della Sofferenza, S. Giovanni Rotondo 71013, Italy
- Tommaso Mazza
- Bioinformatics Unit, IRCCS Casa Sollievo della Sofferenza, S. Giovanni Rotondo 71013, Italy
7
Saad C, Noé L, Richard H, Leclerc J, Buisine MP, Touzet H, Figeac M. DiNAMO: highly sensitive DNA motif discovery in high-throughput sequencing data. BMC Bioinformatics 2018; 19:223. [PMID: 29890948 PMCID: PMC5996464 DOI: 10.1186/s12859-018-2215-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/31/2017] [Accepted: 05/21/2018] [Indexed: 12/30/2022]
Abstract
Background Discovering over-represented approximate motifs in DNA sequences is an essential part of bioinformatics. This topic has been studied extensively because of the increasing number of potential applications. However, it remains a difficult challenge, especially with the huge quantity of data generated by high throughput sequencing technologies. To overcome this problem, existing tools use greedy algorithms and probabilistic approaches to find motifs in reasonable time. Nevertheless, these approaches lack sensitivity and have difficulties coping with rare and subtle motifs. Results We developed DiNAMO (for DNA MOtif), a new software based on an exhaustive and efficient algorithm for IUPAC motif discovery. We evaluated DiNAMO on synthetic and real datasets with two different applications, namely ChIP-seq peaks and Systematic Sequencing Error analysis. DiNAMO proves to compare favorably with other existing methods and is robust to noise. Conclusions We show that DiNAMO software can serve as a tool to search for degenerate motifs in an exact manner using IUPAC models. DiNAMO can be used in scanning mode with sliding windows or in fixed position mode, which makes it suitable for numerous potential applications. Availability https://github.com/bonsai-team/DiNAMO.
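For readers unfamiliar with IUPAC motif models, the following minimal Python sketch shows how a degenerate motif can be matched against a sequence in a sliding-window fashion. It only illustrates motif scanning and is not the DiNAMO discovery algorithm. The helper names `iupac_to_regex` and `find_motif` are hypothetical, and the code table is the standard IUPAC nucleotide alphabet.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as 'CCWGG' into a regex character-class string."""
    return "".join(f"[{IUPAC[symbol]}]" for symbol in motif.upper())


def find_motif(sequence, motif):
    """Return 0-based start positions of all matches, including overlapping ones.

    A zero-width lookahead is used so that overlapping occurrences are reported.
    """
    pattern = re.compile(f"(?=({iupac_to_regex(motif)}))")
    return [match.start() for match in pattern.finditer(sequence.upper())]


# W stands for A or T, so both CCAGG and CCTGG match CCWGG.
print(find_motif("CCAGGTTCCTGG", "CCWGG"))  # [0, 7]
```

The lookahead is a design choice for this sketch: a plain `finditer` on the motif itself would skip occurrences that overlap a previous match.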
Affiliation(s)
- Chadi Saad
- Univ. Lille, CNRS, Inria, UMR 9189 - CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, Lille, France; Univ. Lille, Inserm, Lille University Hospital, UMR-S 1172 - JPARC - Centre de Recherche Jean-Pierre AUBERT, Lille, F-59000, France
- Laurent Noé
- Univ. Lille, CNRS, Inria, UMR 9189 - CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, Lille, France
- Hugues Richard
- Sorbonne Université, UMR7238, Laboratory Computational and Quantitative Biology, LCQB, Paris, F-75005, France
- Julie Leclerc
- Univ. Lille, Inserm, Lille University Hospital, UMR-S 1172 - JPARC - Centre de Recherche Jean-Pierre AUBERT, Lille, F-59000, France
- Marie-Pierre Buisine
- Univ. Lille, Inserm, Lille University Hospital, UMR-S 1172 - JPARC - Centre de Recherche Jean-Pierre AUBERT, Lille, F-59000, France
- Hélène Touzet
- Univ. Lille, CNRS, Inria, UMR 9189 - CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, Lille, France
- Martin Figeac
- Univ. Lille. Plateau de génomique fonctionnelle et structurale, Lille, F-59000, France
8
Al-Ssulami AM, Azmi AM, Mathkour H. An efficient method for significant motifs discovery from multiple DNA sequences. J Bioinform Comput Biol 2017; 15:1750014. [PMID: 28571483 DOI: 10.1142/s0219720017500147] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]
Abstract
Identification of transcription factor binding sites or biological motifs is an important step in deciphering the mechanisms of gene regulation. It is a classic problem that has eluded a satisfactory and efficient solution. In this paper, we devise a three-phase algorithm to mine for biologically significant motifs. In the first phase, we generate all the possible string motifs, this phase is followed by a filtering process where we discard all motifs that do not meet the constraints. And in the final phase, motifs are scored and ranked using a combination of stochastic techniques and [Formula: see text]-value. We show that our method outperforms some very well-known motif discovery tools, e.g. MEME and Weeder on well-established benchmark data suites. We also apply the algorithm on the non-coding regions of M. tuberculosis and report significant motifs of size 10 with excellent [Formula: see text]-values in a fraction of the time MEME and MoSDi did. In fact, among the best 10 motifs ([Formula: see text]-value wise) in the non-coding regions of M. tuberculosis reported by the tools MEME, MoSDi and ours, five were discovered by our approach which included the third and the fourth best ones. All this in 1/17 and 1/6 the time which MEME and MoSDi (respectively) took.
Affiliation(s)
- Abdulrakeeb M Al-Ssulami
- Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
- Aqil M Azmi
- Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
- Hassan Mathkour
- Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
9
Ayad LAK, Pissis SPP, Retha A. libFLASM: a software library for fixed-length approximate string matching. BMC Bioinformatics 2016; 17:454. [PMID: 27832739 PMCID: PMC5103500 DOI: 10.1186/s12859-016-1320-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/08/2016] [Accepted: 11/03/2016] [Indexed: 01/06/2023]
Abstract
Background Approximate string matching is the problem of finding all factors of a given text that are at a distance at most k from a given pattern. Fixed-length approximate string matching is the problem of finding all factors of a text of length n that are at a distance at most k from any factor of length ℓ of a pattern of length m. There exist bit-vector techniques to solve the fixed-length approximate string matching problem in time $\mathcal{O}(m\lceil \ell/w \rceil n)$ and space $\mathcal{O}(m\lceil \ell/w \rceil)$ under the edit and Hamming distance models, where w is the size of the computer word; as such these techniques are independent of the distance threshold k or the alphabet size. Fixed-length approximate string matching is a generalisation of approximate string matching and, hence, has numerous direct applications in computational molecular biology and elsewhere. Results We present and make available libFLASM, a free open-source C++ software library for solving fixed-length approximate string matching under both the edit and the Hamming distance models. Moreover we describe how fixed-length approximate string matching is applied to solve real problems by incorporating libFLASM into established applications for multiple circular sequence alignment as well as single and structured motif extraction. Specifically, we describe how it can be used to improve the accuracy of multiple circular sequence alignment in terms of the inferred likelihood-based phylogenies; and we also describe how it is used to efficiently find motifs in molecular sequences representing regulatory or functional regions. The comparison of the performance of the library to other algorithms show how it is competitive, especially with increasing distance thresholds. Conclusions Fixed-length approximate string matching is a generalisation of the classic approximate string matching problem. We present libFLASM, a free open-source C++ software library for solving fixed-length approximate string matching. The extensive experimental results presented here suggest that other applications could benefit from using libFLASM, and thus further maintenance and development of libFLASM is desirable.
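The problem statement in this abstract can be illustrated with a brute-force Python sketch under the Hamming distance model. This is not the bit-vector technique used by libFLASM and does not reproduce the library's API; it simply compares every length-ℓ window of the text with every length-ℓ factor of the pattern, which costs roughly O(m·n·ℓ) time. The function names `hamming` and `fixed_length_matches` are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def fixed_length_matches(text, pattern, ell, k):
    """Start positions i such that text[i:i+ell] is within Hamming distance k
    of at least one factor of length ell of the pattern."""
    # Collect the distinct length-ell factors of the pattern once.
    factors = {pattern[j:j + ell] for j in range(len(pattern) - ell + 1)}
    hits = []
    for i in range(len(text) - ell + 1):
        window = text[i:i + ell]
        if any(hamming(window, factor) <= k for factor in factors):
            hits.append(i)
    return hits


# Factors of length 3 of the pattern "TTAG" are "TTA" and "TAG".
print(fixed_length_matches("GATTACA", "TTAG", 3, 0))  # [2]
print(fixed_length_matches("GATTACA", "TTAG", 3, 1))  # [2, 3]
```

With k = 0 only the exact factor TTA is found at position 2; allowing one mismatch also reports TAC at position 3, which differs from TAG in a single base.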
Affiliation(s)
- Lorraine A K Ayad
- Department of Informatics, King's College London, The Strand, London, WC2R 2LS, UK
- Solon P P Pissis
- Department of Informatics, King's College London, The Strand, London, WC2R 2LS, UK
- Ahmad Retha
- Department of Informatics, King's College London, The Strand, London, WC2R 2LS, UK
10
Lee S, Min H, Yoon S. Will solid-state drives accelerate your bioinformatics? In-depth profiling, performance analysis and beyond. Brief Bioinform 2015; 17:713-27. [PMID: 26330577 DOI: 10.1093/bib/bbv073] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 05/27/2015] [Indexed: 11/12/2022]
Abstract
A wide variety of large-scale data have been produced in bioinformatics. In response, the need for efficient handling of biomedical big data has been partly met by parallel computing. However, the time demand of many bioinformatics programs still remains high for large-scale practical uses because of factors that hinder acceleration by parallelization. Recently, new generations of storage devices have emerged, such as NAND flash-based solid-state drives (SSDs), and with the renewed interest in near-data processing, they are increasingly becoming acceleration methods that can accompany parallel processing. In certain cases, a simple drop-in replacement of hard disk drives by SSDs results in dramatic speedup. Despite the various advantages and continuous cost reduction of SSDs, there has been little review of SSD-based profiling and performance exploration of important but time-consuming bioinformatics programs. For an informative review, we perform in-depth profiling and analysis of 23 key bioinformatics programs using multiple types of devices. Based on the insight we obtain from this research, we further discuss issues related to design and optimize bioinformatics algorithms and pipelines to fully exploit SSDs. The programs we profile cover traditional and emerging areas of importance, such as alignment, assembly, mapping, expression analysis, variant calling and metagenomics. We explain how acceleration by parallelization can be combined with SSDs for improved performance and also how using SSDs can expedite important bioinformatics pipelines, such as variant calling by the Genome Analysis Toolkit and transcriptome analysis using RNA sequencing. We hope that this review can provide useful directions and tips to accompany future bioinformatics algorithm design procedures that properly consider new generations of powerful storage devices.
11
De Witte D, Van de Velde J, Decap D, Van Bel M, Audenaert P, Demeester P, Dhoedt B, Vandepoele K, Fostier J. BLSSpeller: exhaustive comparative discovery of conserved cis-regulatory elements. Bioinformatics 2015; 31:3758-66. [PMID: 26254488 PMCID: PMC4653392 DOI: 10.1093/bioinformatics/btv466] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 09/09/2014] [Accepted: 08/03/2015] [Indexed: 11/14/2022]
Abstract
MOTIVATION The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected. RESULTS We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O.sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z.mays. AVAILABILITY AND IMPLEMENTATION BLSSpeller was written in Java. Source code and manual are available at http://bioinformatics.intec.ugent.be/blsspeller CONTACT Klaas.Vandepoele@psb.vib-ugent.be or jan.fostier@intec.ugent.be. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dieter De Witte
- Department of Information Technology (INTEC), Ghent University-iMinds, Ghent, Belgium
- Jan Van de Velde
- Department of Plant Systems Biology, VIB and Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Dries Decap
- Department of Information Technology (INTEC), Ghent University-iMinds, Ghent, Belgium
- Michiel Van Bel
- Department of Plant Systems Biology, VIB and Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Pieter Audenaert
- Department of Information Technology (INTEC), Ghent University-iMinds, Ghent, Belgium
- Piet Demeester
- Department of Information Technology (INTEC), Ghent University-iMinds, Ghent, Belgium
- Bart Dhoedt
- Department of Information Technology (INTEC), Ghent University-iMinds, Ghent, Belgium
- Klaas Vandepoele
- Department of Plant Systems Biology, VIB and Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Jan Fostier
- Department of Information Technology (INTEC), Ghent University-iMinds, Ghent, Belgium
12
Hamed M, Spaniol C, Zapp A, Helms V. Integrative network-based approach identifies key genetic elements in breast invasive carcinoma. BMC Genomics 2015; 16 Suppl 5:S2. [PMID: 26040466 PMCID: PMC4460623 DOI: 10.1186/1471-2164-16-s5-s2] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Indexed: 12/22/2022]
Abstract
BACKGROUND: Breast cancer is a genetically heterogeneous cancer that is among the most prevalent types and has a high mortality rate. Treatment and prognosis of breast cancer would profit largely from a correct classification and identification of the genetic key drivers and major determinants of the tumorigenesis process. Given the availability of tumor genomic and epigenomic data from different sources and experiments, new integrative approaches are needed to boost the probability of identifying such genetic key drivers. We present here an integrative network-based approach that is able to associate regulatory network interactions with the development of breast carcinoma by integrating information from gene expression, DNA methylation, miRNA expression, and somatic mutation datasets. RESULTS: Our results showed a strong association between regulatory elements from different data sources in terms of mutual regulatory influence and genomic proximity. By analyzing different types of regulatory interactions (TF-gene, miRNA-mRNA, and proximity analysis of somatic variants), we identified 106 genes, 68 miRNAs, and 9 mutations that are candidate drivers of oncogenic processes in breast cancer. Moreover, we unraveled regulatory interactions among these key drivers and the other elements in the breast cancer network. Intriguingly, about one third of the identified driver genes are targeted by known anti-cancer drugs, and the majority of the identified key miRNAs are implicated in cancerogenesis of multiple organs. Also, the identified driver mutations likely cause damaging effects on protein functions. The constructed gene network and the identified key drivers were compared to well-established network-based methods. CONCLUSION: The integrated molecular analysis enabled by the presented network-based approach substantially expands our knowledge base of prospective genomic drivers, comprising genes, miRNAs, and mutations. For a good part of the identified key drivers, there is solid evidence for involvement in the development of breast carcinomas. Our approach also unraveled the complex regulatory interactions among the identified key drivers. These genomic drivers could be further investigated in the wet lab as potential candidates for new drug targets. This integrative approach can be applied in a similar fashion to other cancer types and complex diseases, or to the study of cellular differentiation processes.
Affiliation(s)
- Mohamed Hamed
- Center for Bioinformatics, Saarland University, 66041 Saarbrucken, Germany
- Christian Spaniol
- Center for Bioinformatics, Saarland University, 66041 Saarbrucken, Germany
- Alexander Zapp
- Center for Bioinformatics, Saarland University, 66041 Saarbrucken, Germany
- Volkhard Helms
- Center for Bioinformatics, Saarland University, 66041 Saarbrucken, Germany
13
Abstract
Methylation of N6-adenosine (m6A) has been observed in many different classes of RNA, but its prevalence in microRNAs (miRNAs) has not yet been studied. Here we show that a knockdown of the m6A demethylase FTO affects the steady-state levels of several miRNAs. Moreover, RNA immunoprecipitation with an anti-m6A antibody followed by RNA-seq revealed that a significant fraction of miRNAs contains m6A. By motif searches we have discovered consensus sequences discriminating between methylated and unmethylated miRNAs. The epigenetic modification of an epigenetic modifier as described here adds a new layer to the complexity of the posttranscriptional regulation of gene expression.
14
Feng Z, Li J, Zhang JR, Zhang X. qDNAmod: a statistical model-based tool to reveal intercellular heterogeneity of DNA modification from SMRT sequencing data. Nucleic Acids Res 2014; 42:13488-99. [PMID: 25404133 PMCID: PMC4267614 DOI: 10.1093/nar/gku1097] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Indexed: 02/05/2023]
Abstract
In an isogenic cell population, phenotypic heterogeneity among individual cells is common and critical for survival of the population under different environmental conditions. DNA modification is an important epigenetic factor that can regulate phenotypic heterogeneity. The single molecule real-time (SMRT) sequencing technology provides a unique platform for detecting a wide range of DNA modifications, including N6-methyladenine (6-mA), N4-methylcytosine (4-mC) and 5-methylcytosine (5-mC). Here we present qDNAmod, a novel bioinformatic tool for genome-wide quantitative profiling of intercellular heterogeneity of DNA modification from SMRT sequencing data. It is capable of estimating the proportion of isogenic haploid cells in which the same loci of the genome are differentially modified. We tested the reliability of qDNAmod with the SMRT sequencing data of Streptococcus pneumoniae strain ST556. qDNAmod detected extensive intercellular heterogeneity of DNA methylation (6-mA) in a clonal population of ST556. Subsequent biochemical analyses revealed that the recognition sequences of two type I restriction–modification (R-M) systems are responsible for the intercellular heterogeneity of DNA methylation initially identified by qDNAmod. qDNAmod thus represents a valuable tool for studying intercellular phenotypic heterogeneity from genome-wide DNA modification.
Affiliation(s)
- Zhixing Feng
- MOE Key Lab of Bioinformatics, Bioinformatics Division, TNLIST and Department of Automation, Tsinghua University, Beijing 100084, China; Center for Infectious Disease Research, School of Medicine, Tsinghua University, Beijing 100084, China
- Jing Li
- Center for Infectious Disease Research, School of Medicine, Tsinghua University, Beijing 100084, China
- Jing-Ren Zhang
- Center for Infectious Disease Research, School of Medicine, Tsinghua University, Beijing 100084, China; Collaborative Innovation Center for Biotherapy, Tsinghua University, Beijing 100084, China; Collaborative Innovation Center for Biotherapy, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, West China Medical School, Sichuan University, Chengdu, China
- Xuegong Zhang
- MOE Key Lab of Bioinformatics, Bioinformatics Division, TNLIST and Department of Automation, Tsinghua University, Beijing 100084, China
15
Badr G, Al-Turaiki I, Turcotte M, Mathkour H. IncMD: incremental trie-based structural motif discovery algorithm. J Bioinform Comput Biol 2014; 12:1450027. [PMID: 25362841 DOI: 10.1142/s0219720014500279] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
The discovery of common RNA secondary structure motifs is an important problem in bioinformatics. The presence of such motifs is usually associated with key biological functions. However, the identification of structural motifs is far from easy. Unlike motifs in sequences, which have conserved bases, structural motifs have common structure arrangements even if the underlying sequences are different. Over the past few years, hundreds of algorithms have been published for the discovery of sequential motifs, while less work has been done on structural motifs. Current structural motif discovery algorithms are limited in terms of accuracy and scalability. In this paper, we present an incremental and scalable algorithm for discovering RNA secondary structure motifs, namely IncMD. We treat structural motif discovery as a frequent pattern mining problem and tackle it using a modified Apriori algorithm. IncMD uses trie-based data structures, linked lists of prefixes (LLP), to accelerate the search and retrieval of patterns, support counting, and candidate generation. We modify the candidate generation step in order to adapt it to the RNA secondary structure representation. IncMD constructs the frequent patterns incrementally from RNA secondary structure basic elements, using nesting and joining operations. The notion of a motif group is introduced in order to simulate an alignment of motifs that differ only in the number of unpaired bases. In addition, we use a cluster beam approach to select motifs that will survive to the next iterations of the search. Results indicate that IncMD can perform better than some of the available structural motif discovery algorithms in terms of sensitivity (Sn), positive predictive value (PPV), and specificity (Sp). The empirical results also show that the algorithm is scalable and runs faster than all of the compared algorithms.
Affiliation(s)
- Ghada Badr
- College of Computer and Information Sciences, King Saud University, Riyadh, Kingdom of Saudi Arabia; IRI - The City of Scientific Research and Technological Applications, University and Research District, P. O. 21934, New Borg Alarab, Alexandria, Egypt
16
Azmi AM, Al-Ssulami A. Encoded expansion: an efficient algorithm to discover identical string motifs. PLoS One 2014; 9:e95148. [PMID: 24871320 PMCID: PMC4037181 DOI: 10.1371/journal.pone.0095148] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/14/2013] [Accepted: 03/24/2014] [Indexed: 11/19/2022]
Abstract
A major task in computational biology is the discovery of short recurring string patterns known as motifs. Most of the schemes to discover motifs are either stochastic or combinatorial in nature. Stochastic approaches do not guarantee finding the correct motifs, while the combinatorial schemes tend to have an exponential time complexity with respect to motif length. To alleviate the cost, the combinatorial approach exploits dynamic data structures such as trees or graphs. Recently, Karci (2009; Efficient automatic exact motif discovery algorithms for biological sequences, Expert Systems with Applications 36:7952-7963) devised a deterministic algorithm that finds all the identical copies of string motifs of all sizes [Formula: see text] in theoretical time complexity of [Formula: see text] and a space complexity of [Formula: see text], where [Formula: see text] is the length of the input sequence and [Formula: see text] is the length of the longest possible string motif. In this paper, we present a significant improvement on Karci's original algorithm. The algorithm that we propose reports all identical string motifs of sizes [Formula: see text] that occur at least [Formula: see text] times. Our algorithm starts with string motifs of size 2, and at each iteration it expands the candidate string motifs by one symbol, discarding those that occur fewer than [Formula: see text] times in the entire input sequence. We use a simple array and data encoding to achieve a theoretical worst-case time complexity of [Formula: see text] and a space complexity of [Formula: see text]. Encoding of the substrings can speed up the process of comparison between string motifs. Experimental results on random and real biological sequences confirm that our algorithm indeed has a linear time complexity and is more scalable in terms of sequence length than the existing algorithms.
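The expand-and-prune idea in this abstract can be illustrated with a minimal sketch. It is not the authors' encoded-array implementation: it uses a plain dictionary of start positions, and the function name `repeated_motifs` and its parameters `min_count` and `max_len` are hypothetical names chosen for illustration. It starts from substrings of size 2, extends each survivor by one symbol per round, and drops candidates that occur fewer than `min_count` times.

```python
from collections import defaultdict


def repeated_motifs(text, min_count, max_len=None):
    """Return {motif: start positions} for exact substrings of length >= 2
    that occur at least min_count times (overlapping occurrences count)."""
    n = len(text)
    if max_len is None:
        max_len = n
    # Round 1: group start positions by the 2-symbol substring found there.
    level = defaultdict(list)
    for i in range(n - 1):
        level[text[i:i + 2]].append(i)
    level = {m: pos for m, pos in level.items() if len(pos) >= min_count}

    found = {}
    size = 2
    while level and size <= max_len:
        found.update(level)
        # Expand every surviving motif by one symbol to the right.
        expanded = defaultdict(list)
        for motif, positions in level.items():
            for i in positions:
                if i + size < n:
                    expanded[motif + text[i + size]].append(i)
        # Prune candidates that no longer reach the occurrence threshold.
        level = {m: pos for m, pos in expanded.items() if len(pos) >= min_count}
        size += 1
    return found


if __name__ == "__main__":
    motifs = repeated_motifs("ACGTACGTAC", min_count=2)
    longest = max(motifs, key=len)
    print(longest, motifs[longest])  # ACGTAC [0, 4]
```

Pruning is safe because any substring that occurs at least `min_count` times has a prefix that occurs at least as often, so no qualifying motif is lost when an infrequent candidate is discarded.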
Affiliation(s)
- Aqil M. Azmi
- Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Abdulrakeeb Al-Ssulami
- Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh, Saudi Arabia
17
Finding peculiar compositions of two frequent strings with background texts. Knowl Inf Syst 2013. [DOI: 10.1007/s10115-013-0688-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
18
Wang D, Tapan S. MISCORE: a new scoring function for characterizing DNA regulatory motifs in promoter sequences. BMC Syst Biol 2012; 6 Suppl 2:S4. [PMID: 23282090 PMCID: PMC3521183 DOI: 10.1186/1752-0509-6-s2-s4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 12/31/2022]
Abstract
Background: Computational approaches for finding DNA regulatory motifs in promoter sequences are useful to biologists because they reduce experimental costs and speed up the discovery of de novo binding sites. It is important for rule-based or clustering-based motif searching schemes to evaluate the similarity between a k-mer (a k-length subsequence) and a motif model effectively and efficiently, without assuming the independence of nucleotides in motif models and without employing computationally expensive Markov chain models to estimate the background probabilities of k-mers. It is also beneficial to use a priori knowledge in developing advanced searching tools. Results: This paper presents a new scoring function, termed MISCORE, for functional motif characterization and evaluation. MISCORE is free from (i) any assumption on model dependency and (ii) the use of a Markov chain model for background modeling. It integrates the compositional complexity of motif instances into the function. Performance evaluations against the well-known Maximum a Posteriori (MAP) score and Information Content (IC) have shown that MISCORE has promising capabilities to separate and recognize functional DNA motifs and their instances from non-functional ones. Conclusions: MISCORE is a fast computational tool for candidate motif characterization, evaluation and selection. It allows a priori known motif models to be embedded for computing motif-to-motif similarity, which is an advantage over the IC and MAP scores. In addition to the merits mentioned above, MISCORE can automatically filter out some repetitive k-mers from a motif model owing to the introduction of compositional complexity in the function. Consequently, the merits of MISCORE in terms of both motif signal modeling power and computational efficiency will make it more applicable in the development of computational motif discovery tools.
Affiliation(s)
- Dianhui Wang
- Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Victoria 3086, Australia.
19
Abstract
A major challenge in molecular biology is reverse-engineering the cis-regulatory logic that plays a major role in the control of gene expression. This program includes searching through DNA sequences to identify “motifs” that serve as the binding sites for transcription factors or, more generally, are predictive of gene expression across cellular conditions. Several approaches have been proposed for de novo motif discovery, that is, searching sequences without prior knowledge of binding sites or nucleotide patterns. However, unbiased validation is not straightforward. We consider two approaches to unbiased validation of discovered motifs: testing the statistical significance of a motif using a DNA “background” sequence model to represent the null hypothesis and measuring performance in predicting membership in gene clusters. We demonstrate that the background models typically used are “too null,” resulting in overly optimistic assessments of significance, and argue that performance in predicting TF binding or expression patterns from DNA motifs should be assessed by held-out data, as in predictive learning. Applying this criterion to common motif discovery methods resulted in universally poor performance, although there is a marked improvement when motifs are statistically significant against real background sequences. Moreover, on synthetic data where “ground truth” is known, discriminative performance of all algorithms is far below the theoretical upper bound, with pronounced “over-fitting” in training. A key conclusion from this work is that the failure of de novo discovery approaches to accurately identify motifs is basically due to statistical intractability resulting from the fixed size of co-regulated gene clusters, and thus such failures do not necessarily provide evidence that unfound motifs are not active biologically. Consequently, the use of prior knowledge to enhance motif discovery is not just advantageous but necessary.
An implementation of the LR and ALR algorithms is available at http://code.google.com/p/likelihood-ratio-motifs/.
Affiliation(s)
- David Simcha
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America.
20
Marschall T, Herms I, Kaltenbach HM, Rahmann S. Probabilistic arithmetic automata and their applications. IEEE/ACM Trans Comput Biol Bioinform 2012; 9:1737-1750. [PMID: 22868683 DOI: 10.1109/tcbb.2012.109] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 06/01/2023]
Abstract
We present a comprehensive review on probabilistic arithmetic automata (PAAs), a general model to describe chains of operations whose operands depend on chance, along with two algorithms to numerically compute the distribution of the results of such probabilistic calculations. PAAs provide a unifying framework to approach many problems arising in computational biology and elsewhere. We present five different applications, namely 1) pattern matching statistics on random texts, including the computation of the distribution of occurrence counts, waiting times, and clump sizes under hidden Markov background models; 2) exact analysis of window-based pattern matching algorithms; 3) sensitivity of filtration seeds used to detect candidate sequence alignments; 4) length and mass statistics of peptide fragments resulting from enzymatic cleavage reactions; and 5) read length statistics of 454 and IonTorrent sequencing reads. The diversity of these applications indicates the flexibility and unifying character of the presented framework. While the construction of a PAA depends on the particular application, we single out a frequently applicable construction method: We introduce deterministic arithmetic automata (DAAs) to model deterministic calculations on sequences, and demonstrate how to construct a PAA from a given DAA and a finite-memory random text model. This procedure is used for all five discussed applications and greatly simplifies the construction of PAAs. Implementations are available as part of the MoSDi package. Its application programming interface facilitates the rapid development of new applications based on the PAA framework.
Affiliation(s)
- Tobias Marschall
- Life Sciences Group, Centrum Wiskunde & Informatica (CWI), Science Park 123, 1098 XG Amsterdam, The Netherlands.
21
Prosperi MCF, Prosperi L, Gray RR, Salemi M. On counting the frequency distribution of string motifs in molecular sequences. Int J Biomath 2012. [DOI: 10.1142/s1793524512500556] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
This work investigates frequency distributions of strings within a text. The mathematical derivation accounts for variable alphabet size, character probabilities, and string/text lengths, under both the Bernoullian and the Markovian model for string generation. The analysis is limited to the set of non-clumpable strings, which cannot overlap with themselves. Two formulae (exact and approximated) are derived, calculating the frequency distribution of a string of length m found inside a text of length n (with m < n). The approximated formula has constant complexity (in contrast to the exponential complexity of the exact formula), which makes it applicable to very long texts. The proposed formulae were applied to analyze string frequencies in a portion of the human genome, and to recalculate frequencies of known repeated motifs within genes that are associated with genetic diseases. A comparison with state-of-the-art methods was provided. The formulae presented here can be of use in the statistical evaluation of specific motif frequencies within very long texts (e.g. genes or genomes) and help in characterizing motifs in pathologic conditions.
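The paper's exact and approximated formulae are not reproduced in this abstract, so the sketch below covers only two standard ingredients of the setting it describes: a check that a string is non-clumpable (no proper prefix equals a suffix, so it cannot overlap with itself) and the textbook expected occurrence count under a Bernoulli model, (n - m + 1) times the product of the character probabilities. The function names and the `probs` mapping are hypothetical, chosen for illustration.

```python
from math import prod


def is_non_clumpable(s):
    """True if no proper prefix of s equals a suffix of s,
    i.e. two occurrences of s can never overlap."""
    return not any(s[:k] == s[-k:] for k in range(1, len(s)))


def expected_count(s, n, probs):
    """Expected number of occurrences of s in a random text of length n whose
    characters are drawn independently with probabilities probs[c]."""
    m = len(s)
    if m > n:
        return 0.0
    # There are n - m + 1 start positions, each matching with the same probability.
    return (n - m + 1) * prod(probs[c] for c in s)


if __name__ == "__main__":
    uniform = {c: 0.25 for c in "ACGT"}
    print(is_non_clumpable("ACG"))             # True
    print(is_non_clumpable("ACA"))             # False: the prefix "A" is also a suffix
    print(expected_count("ACG", 1002, uniform))  # 15.625
```

The expectation holds for any string, clumpable or not; self-overlap affects the shape of the full distribution (for example its variance), which is what the restriction to non-clumpable strings simplifies.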
Affiliation(s)
- Mattia C. F. Prosperi
- Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, Emerging Pathogens Institute, University of Florida, P. O. Box 103633, 2055 Mowry Road, Gainesville, FL 32610-3633, USA
- Marco Salemi
- Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, Emerging Pathogens Institute, University of Florida, P. O. Box 103633, 2055 Mowry Road, Gainesville, FL 32610-3633, USA
22
Zambelli F, Pesole G, Pavesi G. Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief Bioinform 2012; 14:225-37. [PMID: 22517426 PMCID: PMC3603212 DOI: 10.1093/bib/bbs016] [Citation(s) in RCA: 73] [Impact Index Per Article: 5.6] [Indexed: 02/07/2023]
Abstract
Motif discovery has been one of the most widely studied problems in bioinformatics ever since genomic and protein sequences became available. In particular, its application to the de novo prediction of putative over-represented transcription factor binding sites in nucleotide sequences has been, and still is, one of the most challenging flavors of the problem. Recently, novel experimental techniques like chromatin immunoprecipitation (ChIP) have been introduced, permitting the genome-wide identification of protein-DNA interactions. ChIP, applied to transcription factors and coupled with genome tiling arrays (ChIP on Chip) or next-generation sequencing technologies (ChIP-Seq), has opened new avenues in research and posed new challenges to bioinformaticians developing algorithms and methods for motif discovery.
23
24
Zhang S, Li S, Niu M, Pham PT, Su Z. MotifClick: prediction of cis-regulatory binding sites via merging cliques. BMC Bioinformatics 2011; 12:238. [PMID: 21679436 PMCID: PMC3225181 DOI: 10.1186/1471-2105-12-238] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 12/23/2010] [Accepted: 06/16/2011] [Indexed: 11/21/2022]
Abstract
Background: Although dozens of algorithms and tools have been developed to find a set of cis-regulatory binding sites called a motif in a set of intergenic sequences using various approaches, most of these tools focus on identifying binding sites that are significantly different from their background sequences. However, some motifs may have a similar nucleotide distribution to that of their background sequences. Therefore, such binding sites can be missed by these tools. Results: Here, we present a graph-based polynomial-time algorithm, MotifClick, for the prediction of cis-regulatory binding sites, in particular, those that have a similar nucleotide distribution to that of their background sequences. To find binding sites with length k, we construct a graph using some 2(k-1)-mers in the input sequences as the vertices, and connect two vertices by an edge if the maximum number of matches of the local gapless alignments between the two 2(k-1)-mers is greater than a cutoff value. We identify a motif as a set of similar k-mers from a merged group of maximum cliques associated with some vertices. Conclusions: When evaluated on both synthetic and real datasets of prokaryotes and eukaryotes, MotifClick outperforms existing leading motif-finding tools for prediction accuracy and balancing the prediction sensitivity and specificity in general. In particular, when the distribution of nucleotides of binding sites is similar to that of their background sequences, MotifClick is more likely to identify the binding sites than the other tools.
Affiliation(s)
- Shaoqiang Zhang
- Department of Bioinformatics and Genomics, Center for Bioinformatics Research, the University of North Carolina at Charlotte, 28223, USA
25
Levitsky VG, Oshchepkov DY, Ershov NI, Bryzgalov LO, Antontseva EV, Vasiliev GV, Merkulova TI, Kolchanov NA. Development of computational methods to search for FoxA transcription factor binding sites, their experimental verification and application to the analysis of ChIP-seq data. Dokl Biochem Biophys 2011; 436:12-5. [PMID: 21369894 DOI: 10.1134/s1607672911010054] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 07/20/2010] [Indexed: 11/22/2022]
Affiliation(s)
- V G Levitsky
- Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, pr. Akademika Lavrent'eva 10, Novosibirsk 630090, Russia
26
Pugalenthi G, Kandaswamy KK, Suganthan PN, Sowdhamini R, Martinetz T, Kolatkar PR. SMpred: a support vector machine approach to identify structural motifs in protein structure without using evolutionary information. J Biomol Struct Dyn 2011; 28:405-14. [PMID: 20919755 DOI: 10.1080/07391102.2010.10507369] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/28/2022]
Abstract
Knowledge of three-dimensional structure is essential to understand the function of a protein. Although the overall fold is determined by the full details of its sequence, a small group of residues, often called structural motifs, plays a crucial role in determining the protein fold and its stability. Identification of such structural motifs requires a sufficient number of sequence and structural homologs to define conservation and evolutionary information. Unfortunately, many structures in the protein structure databases have no homologous structures or sequences. In this work, we report an SVM method, SMpred, to identify structural motifs from a single protein structure without using sequence and structural homologs. SMpred was trained and tested using 132 protein domains containing 581 motifs. SMpred achieved 78.79% accuracy with 79.06% sensitivity and 78.53% specificity. The performance of SMpred was evaluated with MegaMotifBase using 188 proteins containing 1161 motifs. Out of 1161 motifs, SMpred correctly identified 1503 structural motifs reported in MegaMotifBase. Further, we showed that SMpred is a useful approach for length-deviant superfamilies and single-member superfamilies. This result suggests the usefulness of our approach for facilitating the identification of structural motifs in protein structure in the absence of sequence and structural homologs. The dataset and executable for the SMpred algorithm are available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/SMpred.htm.
Affiliation(s)
- Ganesan Pugalenthi
- Laboratory of Structural Biochemistry, Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672
27
Abstract
Expectation maximization and Gibbs' sampling are two statistical approaches used to identify transcription factor binding sites and the motif that represents them. Both take as input unaligned sequences and search for a statistically significant alignment of putative binding sites. Expectation maximization is deterministic so that starting with the same initial parameters will always converge to the same solution, making it wise to start it multiple times from different initial parameters. Gibbs' sampling is stochastic so that it may arrive at different solutions from the same initial parameters. In both cases multiple runs are advised because comparisons of the solutions after each run can indicate whether a global, optimum solution is likely to have been achieved.
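The stochastic side of this comparison can be sketched with a minimal Gibbs site sampler. This is an illustrative simplification, not the procedure of any particular tool: it assumes exactly one site of width `k` per sequence, uses no background model, and the function name `gibbs_motif_sampler` and its parameters are hypothetical. Each step holds out one sequence, builds a count profile from the sites in the others, and resamples the held-out site in proportion to its profile score.

```python
import random


def gibbs_motif_sampler(seqs, k, iterations=200, seed=0, pseudocount=1.0):
    """Return one sampled start position per sequence for a motif of width k."""
    rng = random.Random(seed)  # all randomness flows through this generator
    alphabet = sorted(set("".join(seqs)))
    # Random initial alignment: one start position per sequence.
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Count profile built from the current sites of all other sequences.
            profile = [{a: pseudocount for a in alphabet} for _ in range(k)]
            for j, other in enumerate(seqs):
                if j != i:
                    for p in range(k):
                        profile[p][other[starts[j] + p]] += 1
            # Score every window of the held-out sequence against the profile.
            candidates = range(len(seq) - k + 1)
            weights = []
            for start in candidates:
                w = 1.0
                for p in range(k):
                    w *= profile[p][seq[start + p]]
                weights.append(w)
            # Sample (rather than maximize) the new site: this is the stochastic step.
            starts[i] = rng.choices(candidates, weights=weights)[0]
    return starts


if __name__ == "__main__":
    seqs = ["ACGTTGCAAC", "TTGCAACGTA", "GGACGTTGCA", "CATTGCAGGT"]
    for seed in (1, 2, 3):
        print(seed, gibbs_motif_sampler(seqs, k=4, iterations=50, seed=seed))
```

Fixing `seed` makes a run reproducible, while different seeds may end at different alignments even from comparable starting points, which is why the abstract recommends comparing the solutions of several runs.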