1
LRRpredictor-A New LRR Motif Detection Method for Irregular Motifs of Plant NLR Proteins Using an Ensemble of Classifiers. Genes (Basel) 2020; 11:genes11030286. [PMID: 32182725 PMCID: PMC7140858 DOI: 10.3390/genes11030286] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 02/07/2020] [Revised: 02/28/2020] [Accepted: 03/04/2020] [Indexed: 12/17/2022]
Abstract
Leucine-rich repeats (LRRs) belong to an ancient prokaryotic protein architecture that is widely involved in protein-protein interactions. In eukaryotes, LRR domains developed into key recognition modules in many innate immune receptor classes. Due to the high sequence variability imposed by recognition specificity, precise repeat delineation is often difficult, especially in plant NOD-like receptors (NLRs), which are notorious for showing far larger irregularities. To address this problem, we introduce here LRRpredictor, a method based on an ensemble of estimators designed to better identify LRR motifs in general but particularly adapted for handling more irregular LRR environments, thus compensating for the scarcity of structural data on NLR proteins. The extrapolation capacity, tested on a set of annotated LRR domains from six immune receptor classes, shows the ability of LRRpredictor to recover all previously defined specific motif consensuses and to extend the LRR motif coverage over annotated LRR domains. This analysis confirms the increased variability of LRR motifs in plant and vertebrate NLRs when compared to extracellular receptors, consistent with previous studies. Hence, LRRpredictor is able to provide novel insights into the diversification of LRR domains and robust support for structure-informed analyses of LRRs in immune receptor functioning.
2
Ehsan Elahi F, Hasan A. A method for estimating Hill function-based dynamic models of gene regulatory networks. Royal Society Open Science 2018; 5:171226. [PMID: 29515843 PMCID: PMC5830732 DOI: 10.1098/rsos.171226] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/28/2017] [Accepted: 01/25/2018] [Indexed: 08/24/2023]
Abstract
Gene regulatory networks (GRNs) are quite large and complex. To better understand and analyse GRNs, mathematical models are being employed. Different types of models, such as logical, continuous and stochastic models, can be used to describe GRNs. In this paper, we present a new approach to identify continuous models, because they are more suitable for large numbers of genes and for quantitative analysis. One of the most promising techniques for identifying continuous models of GRNs is based on Hill functions and the generalized profiling method (GPM). The advantages of this approach are low computational cost and insensitivity to initial conditions. In the GPM, a constrained nonlinear optimization problem has to be solved that is usually underdetermined. In this paper, we propose a new optimization approach in which we reformulate the optimization problem such that constraints are embedded implicitly in the cost function. Moreover, we propose to split the unknown parameters into two sets based on the structure of Hill functions. These two sets are estimated separately to resolve the issue of the underdetermined problem. As a case study, we apply the proposed technique to the SOS response in Escherichia coli and compare the results with the existing literature.
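For readers unfamiliar with Hill functions, the following is a minimal sketch of the standard activating and repressing forms and of a one-gene rate equation built from them. The function names, the parameters `beta` and `gamma`, and the single-regulator model are illustrative assumptions and are not taken from the paper.

```python
def hill_activation(x, k, n):
    """Activating Hill function: 0 at x = 0, 0.5 at x = k, approaches 1 for x >> k."""
    return x**n / (k**n + x**n)


def hill_repression(x, k, n):
    """Repressing Hill function: 1 at x = 0, 0.5 at x = k, approaches 0 for x >> k."""
    return k**n / (k**n + x**n)


def dxdt(x_target, x_regulator, beta, gamma, k, n):
    """Hypothetical one-gene model: repressed production minus linear decay.

    beta (maximal production rate) and gamma (decay rate) are assumed names.
    """
    return beta * hill_repression(x_regulator, k, n) - gamma * x_target
```

In this assumed form, `beta` and `gamma` enter the rate equation linearly while `k` and `n` enter nonlinearly. The abstract does not state whether this is the split of parameters the authors use.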
Affiliation(s)
- Ammar Hasan
- National University of Sciences and Technology (NUST), H-12, 44000, Islamabad, Pakistan
3
Pellegrini M, Renda ME, Vecchio A. Ab initio detection of fuzzy amino acid tandem repeats in protein sequences. BMC Bioinformatics 2012; 13 Suppl 3:S8. [PMID: 22536906 PMCID: PMC3402919 DOI: 10.1186/1471-2105-13-s3-s8] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022]
Abstract
BACKGROUND Tandem repetitions within protein amino acid sequences often correspond to regular secondary structures and form multi-repeat 3D assemblies of varied size and function. Developing internal repetitions is one of the evolutionary mechanisms that proteins employ to adapt their structure and function under evolutionary pressure. While there is keen interest in understanding such phenomena, detection of repeating structures based only on sequence analysis is considered an arduous task, since structure and function are often preserved even under considerable sequence divergence (fuzzy tandem repeats). RESULTS In this paper we present PTRStalker, a new algorithm for ab initio detection of fuzzy tandem repeats in protein amino acid sequences. In the reported results we show that by feeding PTRStalker with amino acid sequences from the UniProtKB/Swiss-Prot database we detect novel tandemly repeated structures not captured by other state-of-the-art tools. Experiments with membrane proteins indicate that PTRStalker can detect global symmetries in the primary structure which are then reflected in the tertiary structure. CONCLUSIONS PTRStalker is able to detect fuzzy tandem repeating structures in protein sequences, with performance beyond the current state of the art. Such a tool may be a valuable support for investigating protein structural properties when tertiary X-ray data are not available.
Affiliation(s)
- Marco Pellegrini
- Istituto di Informatica e Telematica, CNR - Consiglio Nazionale delle Ricerche, Pisa I-56124, Italy.
4
Roy S, Werner-Washburne M, Lane T. A multiple network learning approach to capture system-wide condition-specific responses. Bioinformatics 2011; 27:1832-8. [PMID: 21551143 DOI: 10.1093/bioinformatics/btr270] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 01/20/2023]
Abstract
MOTIVATION Condition-specific networks capture system-wide behavior under varying conditions such as environmental stresses, cell types or tissues. These networks frequently comprise parts that are unique to each condition, and parts that are shared among related conditions. Existing approaches for learning condition-specific networks typically identify either only differences or only similarities across conditions. Most of these approaches first learn networks per condition independently, and then identify similarities and differences in a post-learning step. Such approaches do not exploit the shared information across conditions during network learning. RESULTS We describe an approach for learning condition-specific networks that identifies the shared and unique subgraphs simultaneously during network learning, rather than as a post-processing step. Our approach learns networks across condition sets, shares data from different conditions and produces high-quality networks that capture biologically meaningful information. On simulated data, our approach outperformed an existing approach that learns networks independently for each condition, especially for small training datasets. On microarray data of hundreds of deletion mutants in two yeast stationary-phase cell populations, the inferred network structure identified several common and population-specific effects of these deletion mutants and several high-confidence cases of double-deletion pairs, which can be experimentally tested. Our results are consistent with and extend the existing knowledge base of differentiated cell populations in yeast stationary phase. AVAILABILITY AND IMPLEMENTATION C++ code can be accessed from http://www.broadinstitute.org/~sroy/condspec/.
Affiliation(s)
- Sushmita Roy
- Department of Computer Science, University of New Mexico, Albuquerque, NM 87131, USA.
5
Abstract
MOTIVATION Over the past decade, the prospect of inferring networks of gene regulation from high-throughput experimental data has received a great deal of attention. In contrast to the massive effort that has gone into automated deconvolution of biological networks, relatively little effort has been invested in benchmarking the proposed algorithms. The rate at which new network inference methods are being proposed far outpaces our ability to objectively evaluate and compare them. This is largely due to a lack of fully understood biological networks to use as gold standards. RESULTS We have developed the most realistic system to date that generates synthetic regulatory networks for benchmarking reconstruction algorithms. The improved biological realism of our benchmark leads to conclusions about the relative accuracies of reconstruction algorithms that are significantly different from those obtained with A-BIOCHEM, an established in silico benchmark. AVAILABILITY The synthetic benchmark utility and the specific benchmark networks that were used in our analyses are available at http://mblab.wustl.edu/software/grendel/.
Affiliation(s)
- Brian C Haynes
- Center for Genome Sciences and Department of Computer Science, Washington University, St Louis, MO, USA
6
Zhang S, Su W, Yang J. ARCS-Motif: discovering correlated motifs from unaligned biological sequences. Bioinformatics 2008; 25:183-9. [PMID: 19073591 DOI: 10.1093/bioinformatics/btn609] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022]
Abstract
MOTIVATION The goal of motif discovery is to detect novel, unknown, and important signals from biological sequences. In most models, the importance of a motif is equal to the sum of the similarity of every single position. In 2006, Song et al. introduced the Aggregated Related Column Score (ARCS) measure, which adds correlation information to the evaluation of motif importance. The paper showed that the ARCS measure is superior to other measures. Due to the complicated nature of the ARCS motif model, we cannot directly apply existing sequential motif discovery methods to find motifs with high ARCS values. RESULTS This article presents a novel mining algorithm, ARCS-Motif, to discover related sequential motifs in biological sequences. ARCS-Motif is applied to 400 PROSITE datasets and compared with five alternative methods (CONSENSUS, Gibbs sampler, MEME, SPLASH and DIALIGN-TX). ARCS-Motif outperforms all the methods in accuracy, and most of the methods in efficiency. Although SPLASH has better efficiency than ARCS-Motif, ARCS-Motif has much better accuracy than SPLASH. On average, ARCS-Motif is able to produce motifs that are at least 10% better than the best of the alternative methods. Among the 400 PROSITE datasets, ARCS-Motif produces the best motifs for more than 200 families. Other than SPLASH, the execution time of ARCS-Motif is less than a third of that of the fastest alternative method, and its execution time grows at the slowest rate among all methods with respect to the number of sequences and the average sequence length.
Affiliation(s)
- Shijie Zhang
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA
7
Liu B, de la Fuente A, Hoeschele I. Gene network inference via structural equation modeling in genetical genomics experiments. Genetics 2008; 178:1763-76. [PMID: 18245846 PMCID: PMC2278111 DOI: 10.1534/genetics.107.080069] [Citation(s) in RCA: 87] [Impact Index Per Article: 5.4] [Received: 08/06/2007] [Accepted: 01/07/2008] [Indexed: 01/09/2023]
Abstract
Our goal is gene network inference in genetical genomics or systems genetics experiments. For species where sequence information is available, we first perform expression quantitative trait locus (eQTL) mapping by jointly utilizing cis-, cis-trans-, and trans-regulation. After using local structural models to identify regulator-target pairs for each eQTL, we construct an encompassing directed network (EDN) by assembling all retained regulator-target relationships. The EDN has nodes corresponding to expressed genes and eQTL and directed edges from eQTL to cis-regulated target genes, from cis-regulated genes to cis-trans-regulated target genes, from trans-regulator genes to target genes, and from trans-eQTL to target genes. For network inference within the strongly constrained search space defined by the EDN, we propose structural equation modeling (SEM), because it can model cyclic networks and the EDN indeed contains feedback relationships. On the basis of a factorization of the likelihood and the constrained search space, our SEM algorithm infers networks involving several hundred genes and eQTL. Structure inference is based on a penalized likelihood ratio and an adaptation of Occam's window model selection. The SEM algorithm was evaluated using data simulated with nonlinear ordinary differential equations and known cyclic network topologies and was applied to a real yeast data set.
Affiliation(s)
- Bing Liu
- Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, Virginia 24061-0477, USA
8
Abstract
MOTIVATION For testing and sensitivity analysis purposes, it is beneficial to have known transcription networks of sufficient size and variability during development of microarray data and network deconvolution algorithms. Description of such networks in a simple language translatable to Systems Biology Markup Language would allow generation of model data for the networks. RESULTS Described herein is software (RANGE: RAndom Network GEnerator) to generate large random transcription networks in the NEMO (NEtwork MOtif) language. NEMO is recognized by a grammar for transcription network motifs using lex and yacc to output Systems Biology Markup Language models for either specified or randomized gene input functions. These models of known networks may be input to a biochemical simulator, allowing the generation of synthetic microarray data. AVAILABILITY http://range.sourceforge.net
Affiliation(s)
- James Long
- Biotechnology Computing Research Group, University of Alaska Fairbanks, PO Box 757000, Fairbanks, AK, USA.
9
Abstract
MOTIVATION The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore been focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge; whether the clusters actually correspond to the real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics. RESULTS This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation. AVAILABILITY The software used in the experiments is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/. SUPPLEMENTARY INFORMATION Enlarged colour plots are provided in the Supplementary Material, which is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/.
Affiliation(s)
- Julia Handl
- School of Chemistry, University of Manchester, Faraday Building, Sackville Street, PO Box 88, Manchester M60 1QD, UK.
10
Bing N, Hoeschele I. Genetical genomics analysis of a yeast segregant population for transcription network inference. Genetics 2005; 170:533-42. [PMID: 15781693 PMCID: PMC1450429 DOI: 10.1534/genetics.105.041103] [Citation(s) in RCA: 80] [Impact Index Per Article: 4.2] [Indexed: 11/18/2022]
Abstract
Genetic analysis of gene expression in a segregating population, which is expression profiled and genotyped at DNA markers throughout the genome, can reveal regulatory networks of polymorphic genes. We propose an analysis strategy with several steps: (1) genome-wide QTL analysis of all expression profiles to identify eQTL confidence regions, followed by fine mapping of identified eQTL; (2) identification of regulatory candidate genes in each eQTL region; (3) correlation analysis of the expression profiles of the candidates in any eQTL region with the gene affected by the eQTL to reduce the number of candidates; (4) drawing directional links from retained regulatory candidate genes to genes affected by the eQTL and joining links to form networks; and (5) statistical validation and refinement of the inferred network structure. Here, we apply an initial implementation of this strategy to a segregating yeast population. In 65, 7, and 28% of the identified eQTL regions, a single candidate regulatory gene, no gene, or more than one gene was retained in step 3, respectively. Overall, 768 putative regulatory links were retained, 331 of which are the strongest candidate links, as they were retained in the expression correlation analysis and were located within or near an eQTL subregion identified by a multimarker analysis separating multiple linked QTL. One or several biological processes were statistically significantly overrepresented in independent network structures or in highly interconnected subnetworks. Most of the transcription factors found in the inferred network had a putative regulatory link to only one other gene or exhibited cis-regulation.
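Step (3) of the strategy above, pruning regulatory candidates by expression correlation with the affected gene, can be sketched as follows. This is an illustration only: the use of Pearson correlation, the absolute-value cutoff, and all names (`pearson`, `retain_candidates`, `threshold`) are assumptions, since the abstract does not specify the statistic or the cutoff.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def retain_candidates(candidates, target_profile, threshold):
    """Keep candidate genes whose profile correlates with the eQTL-affected gene.

    candidates: dict mapping gene name -> expression profile (hypothetical layout).
    threshold:  assumed cutoff on |r|; not a value from the paper.
    """
    return [
        gene
        for gene, profile in candidates.items()
        if abs(pearson(profile, target_profile)) >= threshold
    ]


# Usage with made-up profiles across four segregants.
candidates = {"geneA": [1, 2, 3, 4], "geneB": [1, -1, 1, -1]}
retained = retain_candidates(candidates, [2, 4, 6, 8], threshold=0.9)
```

Each retained candidate would then contribute a directed link to the affected gene in step (4).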
Affiliation(s)
- Nan Bing
- Virginia Bioinformatics Institute and Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, 24061-0477, USA
11
Raphael B, Liu LT, Varghese G. A uniform projection method for motif discovery in DNA sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2004; 1:91-4. [PMID: 17048384 DOI: 10.1109/tcbb.2004.14] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Indexed: 05/12/2023]
Abstract
Buhler and Tompa introduced the random projection algorithm for the motif discovery problem and demonstrated that this algorithm performs well on both simulated and biological samples. We describe a modification of the random projection algorithm, called the uniform projection algorithm, which utilizes a different choice of projections. We replace the random selection of projections by a greedy heuristic that approximately equalizes the coverage of the projections. We show that this change in selection of projections leads to improved performance on motif discovery problems. Furthermore, the uniform projection algorithm is directly applicable to other problems where the random projection algorithm has been used, including comparison of protein sequence databases.
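The projection idea can be sketched as below: each l-mer is hashed by the characters at a chosen subset of positions, so l-mers that agree at those positions share a bucket. The greedy selector shown here simply reuses the least-used positions first. It is a simplified stand-in for "approximately equalizing coverage", not the heuristic from the paper, and all function names are hypothetical.

```python
from collections import defaultdict


def project(lmer, positions):
    """Keep only the characters of an l-mer at the given positions."""
    return "".join(lmer[p] for p in positions)


def bucket_lmers(sequences, l, positions):
    """Hash every l-mer of every sequence into a bucket keyed by its projection."""
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            key = project(seq[start:start + l], positions)
            buckets[key].append((seq_index, start))
    return buckets


def balanced_projections(l, k, m):
    """Choose m projections of k positions each, always taking the k positions
    used least often so far (ties broken by position index).

    Assumption: a position-level balancing rule chosen for illustration only.
    """
    usage = [0] * l
    projections = []
    for _ in range(m):
        chosen = sorted(sorted(range(l), key=lambda p: (usage[p], p))[:k])
        for p in chosen:
            usage[p] += 1
        projections.append(tuple(chosen))
    return projections


# Usage: bucket 4-mers from two toy sequences under one projection.
projections = balanced_projections(6, 2, 3)
buckets = bucket_lmers(["ACGTAC", "TCGTAA"], 4, (1, 2))
```

Buckets holding l-mers from many sequences are the natural candidates for motif refinement. A random-projection variant would instead draw each set of positions at random.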
Affiliation(s)
- Benjamin Raphael
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093-0114, USA.