1
Guo Y, Zhou D, Cao J, Nie R, Ruan X, Liu Y. Gated residual neural networks with self-normalization for translation initiation site recognition. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107783] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 12/22/2022]
2
Xia X. Post-Alignment Adjustment and Its Automation. Genes (Basel) 2021; 12:genes12111809. [PMID: 34828415 PMCID: PMC8623120 DOI: 10.3390/genes12111809] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/28/2021] [Revised: 11/13/2021] [Accepted: 11/16/2021] [Indexed: 11/16/2022]
Abstract
Multiple sequence alignment (MSA) is the basis for almost all sequence comparison and molecular phylogenetic inferences. Large-scale genomic analyses are typically associated with automated progressive MSA without subsequent manual adjustment, which itself is often error-prone because of the lack of a consistent and explicit criterion. Here, I outlined several commonly encountered alignment errors that cannot be avoided by progressive MSA for nucleotide, amino acid, and codon sequences. Methods that could be automated to fix such alignment errors were then presented. I emphasized the utility of position weight matrix as a new tool for MSA refinement and illustrated its usage by refining the MSA of nucleotide and amino acid sequences. The main advantages of the position weight matrix approach include (1) its use of information from all sequences, in contrast to other commonly used methods based on pairwise alignment scores and inconsistency measures, and (2) its speedy computation, making it suitable for a large number of long viral genomic sequences.
Affiliation(s)
- Xuhua Xia
- Department of Biology, University of Ottawa, Marie-Curie Private, Ottawa, ON K1N 9A7, Canada; Tel.: +1-613-562-5718
- Ottawa Institute of Systems Biology, University of Ottawa, Ottawa, ON K1H 8M5, Canada
3
Wan H, Li JM, Ding H, Lin SX, Tu SQ, Tian XH, Hu JP, Chang S. An Overview of Computational Tools of Nucleic Acid Binding Site Prediction for Site-specific Proteins and Nucleases. Protein Pept Lett 2019; 27:370-384. [PMID: 31746287 DOI: 10.2174/0929866526666191028162302] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/31/2019] [Revised: 05/24/2019] [Accepted: 09/24/2019] [Indexed: 12/26/2022]
Abstract
Understanding the interaction mechanism of proteins and nucleic acids is one of the most fundamental problems for genome editing with engineered nucleases. Due to some limitations of experimental investigations, computational methods have played an important role in obtaining the knowledge of protein-nucleic acid interaction. Over the past few years, dozens of computational tools have been used for identification of nucleic acid binding site for site-specific proteins and design of site-specific nucleases because of their significant advantages in genome editing. Here, we review existing widely-used computational tools for target prediction of site-specific proteins as well as off-target prediction of site-specific nucleases. This article provides a list of on-line prediction tools according to their features followed by the description of computational methods used by these tools, which range from various sequence mapping algorithms (like Bowtie, FetchGWI and BLAST) to different machine learning methods (such as Support Vector Machine, hidden Markov models, Random Forest, elastic network and deep neural networks). We also make suggestions on the further development in improving the accuracy of prediction methods. This survey will provide a reference guide for computational biologists working in the field of genome editing.
Affiliation(s)
- Hua Wan
- College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
- Jian-Ming Li
- College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
- Huang Ding
- College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
- Shuo-Xin Lin
- Department of Electrical and Computer Engineering, James Clark School of Engineering, University of Maryland, College Park, MD 20742, United States
- Shu-Qin Tu
- College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
- Xu-Hong Tian
- College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
- Jian-Ping Hu
- College of Pharmacy and Biological Engineering, Sichuan Industrial Institute of Antibiotics, Key Laboratory of Medicinal and Edible Plants Resources Development of Sichuan Education Department, Antibiotics Research and Re-Evaluation Key Laboratory of Sichuan Province, Chengdu University, Chengdu 610106, China
- Shan Chang
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
4
Chen X, Hao Y, Cui Y, Fan Z, He S, Luo J, Chen R. LncVar: a database of genetic variation associated with long non-coding genes. Bioinformatics 2016; 33:112-118. [DOI: 10.1093/bioinformatics/btw581] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 03/19/2016] [Revised: 06/29/2016] [Accepted: 09/02/2016] [Indexed: 01/16/2023]
5
Yang J, Ramsey SA. A DNA shape-based regulatory score improves position-weight matrix-based recognition of transcription factor binding sites. Bioinformatics 2015; 31:3445-50. [PMID: 26130577 DOI: 10.1093/bioinformatics/btv391] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 12/18/2014] [Accepted: 06/24/2015] [Indexed: 12/13/2022]
Abstract
MOTIVATION The position-weight matrix (PWM) is a useful representation of a transcription factor binding site (TFBS) sequence pattern because the PWM can be estimated from a small number of representative TFBS sequences. However, because the PWM probability model assumes independence between individual nucleotide positions, the PWMs for some TFs poorly discriminate binding sites from non-binding-sites that have similar sequence content. Since the local three-dimensional DNA structure ('shape') is a determinant of TF binding specificity and since DNA shape has a significant sequence-dependence, we combined DNA shape-derived features into a TF-generalized regulatory score and tested whether the score could improve PWM-based discrimination of TFBS from non-binding-sites. RESULTS We compared a traditional PWM model to a model that combines the PWM with a DNA shape feature-based regulatory potential score, for accuracy in detecting binding sites for 75 vertebrate transcription factors. The PWM+shape model was more accurate than the PWM-only model for 45% of TFs tested, with no significant loss of accuracy for the remaining TFs. AVAILABILITY AND IMPLEMENTATION The shape-based model is available as an open-source R package that is archived on the GitHub software repository at https://github.com/ramseylab/regshape/. CONTACT stephen.ramsey@oregonstate.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Stephen A Ramsey
- Department of Biomedical Sciences and School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
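
The independence assumption described in the abstract above can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the regshape package: the function names `build_pwm`, `score`, and `scan`, the pseudocount of 1, and the uniform background are all assumptions chosen for the example. It shows that a PWM score is a plain sum of per-position log-odds weights, so no term can capture a dependency between two positions.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Return one dict of log2-odds weights per motif position.

    Assumes all sites have equal length and contain only A, C, G, T.
    """
    background = background or {b: 0.25 for b in BASES}
    length = len(sites[0])
    pwm = []
    for i in range(length):
        # Start every base at the pseudocount so unseen bases get a finite weight.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background[b]) for b in BASES})
    return pwm

def score(pwm, window):
    """Sum the per-position weights; each position contributes independently."""
    return sum(column[base] for column, base in zip(pwm, window))

def scan(pwm, sequence):
    """Score every window of motif width along one strand of the sequence."""
    width = len(pwm)
    return [(i, score(pwm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]
```

Calling `scan(build_pwm(sites), sequence)` returns (offset, score) pairs; a threshold on the score then separates candidate sites from background, which is the step a shape-based score would be meant to refine.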
6
Discrimination of Mammalian GPI-Anchored Proteins by Hydropathy and Amino Acid Propensities. Biosci Biotechnol Biochem 2014; 77:526-33. [DOI: 10.1271/bbb.120781] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/08/2022]
7
Abstract
Sequence alignment of proteins and nucleic acids is a routine task in bioinformatics. Although the comparison of complete peptides, genes or genomes can be undertaken with a great variety of tools, the alignment of short DNA sequences and motifs entails pitfalls that have not been fully addressed yet. Here we confront the structural superposition of transcription factors with the sequence alignment of their recognized cis elements. Our goals are (i) to test TFcompare (http://floresta.eead.csic.es/tfcompare), a structural alignment method for protein–DNA complexes; (ii) to benchmark the pairwise alignment of regulatory elements; (iii) to define the confidence limits and the twilight zone of such alignments and (iv) to evaluate the relevance of these thresholds with elements obtained experimentally. We find that the structure of cis elements and protein–DNA interfaces is significantly more conserved than their sequence and measure how this correlates with alignment errors when only sequence information is considered. Our results confirm that DNA motifs in the form of matrices produce better alignments than individual sequences. Finally, we report that empirical and theoretically derived twilight thresholds are useful for estimating the natural plasticity of regulatory sequences, and hence for filtering out unreliable alignments.
Affiliation(s)
- Alvaro Sebastian
- Laboratory of Computational Biology, Department of Genetics and Plant Breeding, Estación Experimental de Aula Dei/CSIC, Av. Montañana, Spain.
8
Nandi S, Ioshikhes I. Optimizing the GATA-3 position weight matrix to improve the identification of novel binding sites. BMC Genomics 2012; 13:416. [PMID: 22913572 PMCID: PMC3481455 DOI: 10.1186/1471-2164-13-416] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 11/08/2011] [Accepted: 08/02/2012] [Indexed: 11/21/2022]
Abstract
BACKGROUND The identification of binding sites for transcription factors is a key component of gene regulatory network analysis. This is often done using position-weight matrices (PWMs). Because of the importance of in silico mapping of tentative binding sites, we previously developed an approach for PWM optimization that substantially improves the accuracy of such mapping. RESULTS The present work implements the optimization algorithm applied to the existing PWM for the GATA-3 transcription factor and builds a new di-nucleotide PWM. The existing available PWM is based on experimental data adopted from JASPAR. The optimized PWM substantially improves the sensitivity and specificity of the TF mapping compared to the conventional applications. The refined PWM also facilitates in silico identification of novel binding sites that are supported by experimental data. We also describe uncommon positioning of binding motifs for several T-cell lineage specific factors in human promoters. CONCLUSION Our proposed di-nucleotide PWM approach outperforms the conventional mono-nucleotide PWM approach with respect to GATA-3. Therefore our new di-nucleotide PWM provides new insight into plausible transcriptional regulatory interactions in human promoters.
Affiliation(s)
- Soumyadeep Nandi
- Ottawa Institute of Systems Biology and Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
- Ilya Ioshikhes
- Ottawa Institute of Systems Biology and Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
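
To illustrate what a di-nucleotide PWM looks like in contrast to a mono-nucleotide one, here is a minimal Python sketch. It is not the optimization algorithm of the paper above, whose details are not given in the abstract: the names `build_dinuc_pwm` and `dinuc_score`, the pseudocount of 0.5, the uniform 1/16 background, and the overlapping-pair layout are all assumptions made for the example.

```python
import math
from itertools import product

# All 16 ordered pairs of bases: "AA", "AC", ..., "TT".
DINUCS = ["".join(pair) for pair in product("ACGT", repeat=2)]

def build_dinuc_pwm(sites, pseudocount=0.5):
    """Return one dict of log2-odds weights per adjacent position pair.

    A motif of length L yields L - 1 columns, one for each overlapping
    pair (i, i + 1). A uniform background of 1/16 per pair is assumed.
    """
    length = len(sites[0])
    pwm = []
    for i in range(length - 1):
        counts = {d: pseudocount for d in DINUCS}
        for site in sites:
            counts[site[i:i + 2]] += 1
        total = sum(counts.values())
        # Dividing the frequency by 1/16 is the same as multiplying by 16.
        pwm.append({d: math.log2((counts[d] / total) * 16) for d in DINUCS})
    return pwm

def dinuc_score(pwm, window):
    """Sum the weights of the overlapping dinucleotides in the window."""
    return sum(column[window[i:i + 2]] for i, column in enumerate(pwm))
```

Because each column is indexed by a pair of neighbouring bases, this layout can reward or penalize specific adjacent combinations, which a mono-nucleotide matrix cannot express.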
9
Abstract
Differential gene expression is the fundamental mechanism underlying animal development and cell differentiation. However, it is a challenge to identify comprehensively and accurately the DNA sequences that are required to regulate gene expression: namely, cis-regulatory modules (CRMs). Three major features, either singly or in combination, are used to predict CRMs: clusters of transcription factor binding site motifs, non-coding DNA that is under evolutionary constraint and biochemical marks associated with CRMs, such as histone modifications and protein occupancy. The validation rates for predictions indicate that identifying diagnostic biochemical marks is the most reliable method, and understanding is enhanced by the analysis of motifs and conservation patterns within those predicted CRMs.
10
Wang Y, Li X, Hu H. Transcriptional regulation of co-expressed microRNA target genes. Genomics 2011; 98:445-52. [PMID: 22002038 DOI: 10.1016/j.ygeno.2011.09.004] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Received: 05/12/2011] [Revised: 08/12/2011] [Accepted: 09/24/2011] [Indexed: 01/26/2023]
Abstract
MicroRNAs play pivotal roles in gene regulation. Despite various research efforts on microRNAs, how microRNA target genes are transcriptionally regulated and how the transcriptional regulation of microRNA target genes relates to that of the microRNA genes are not well studied. By investigating the transcriptional regulation of microRNA target genes, we found that different groups of target genes of the same microRNA are co-expressed under different conditions, and these groups rarely overlap with each other for the majority of microRNAs. We also discovered that co-expressed microRNA target genes are often co-regulated, and different groups of target genes of the same microRNA are often regulated differently. In addition, we observed that transcription factors regulating a microRNA gene often regulate its target genes. Our study sheds light on the regulation of microRNA target genes, which will facilitate the prediction of microRNA target genes and the understanding of the transcriptional regulation of microRNA genes.
Affiliation(s)
- Ying Wang
- Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, USA
11
Francke C, Groot Kormelink T, Hagemeijer Y, Overmars L, Sluijter V, Moezelaar R, Siezen RJ. Comparative analyses imply that the enigmatic Sigma factor 54 is a central controller of the bacterial exterior. BMC Genomics 2011; 12:385. [PMID: 21806785 PMCID: PMC3162934 DOI: 10.1186/1471-2164-12-385] [Citation(s) in RCA: 80] [Impact Index Per Article: 6.2] [Received: 02/09/2011] [Accepted: 08/01/2011] [Indexed: 02/06/2023]
Abstract
Background Sigma-54 is a central regulator in many pathogenic bacteria and has been linked to a multitude of cellular processes like nitrogen assimilation and important functional traits such as motility, virulence, and biofilm formation. Until now it has remained obscure whether these phenomena and the control by Sigma-54 share an underlying theme. Results We have uncovered the commonality by performing a range of comparative genome analyses. A) The presence of Sigma-54 and its associated activators was determined for all sequenced prokaryotes. We observed a phylum-dependent distribution that is suggestive of an evolutionary relationship between Sigma-54 and lipopolysaccharide and flagellar biosynthesis. B) All Sigma-54 activators were identified and annotated. The relation with phosphotransfer-mediated signaling (TCS and PTS) and the transport and assimilation of carboxylates and nitrogen containing metabolites was substantiated. C) The function annotations, that were represented within the genomic context of all genes encoding Sigma-54, its activators and its promoters, were analyzed for intra-phylum representation and inter-phylum conservation. Promoters were localized using a straightforward scoring strategy that was formulated to identify similar motifs. We found clear highly-represented and conserved genetic associations with genes that concern the transport and biosynthesis of the metabolic intermediates of exopolysaccharides, flagella, lipids, lipopolysaccharides, lipoproteins and peptidoglycan. Conclusion Our analyses directly implicate Sigma-54 as a central player in the control over the processes that involve the physical interaction of an organism with its environment like in the colonization of a host (virulence) or the formation of biofilm.
Affiliation(s)
- Christof Francke
- TI Food and Nutrition, P.O. Box 557, 6700AN Wageningen, The Netherlands.
12
Chen YC, Aguan K, Yang CW, Wang YT, Pal NR, Chung IF. Discovery of protein phosphorylation motifs through exploratory data analysis. PLoS One 2011; 6:e20025. [PMID: 21647451 PMCID: PMC3102080 DOI: 10.1371/journal.pone.0020025] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 02/16/2011] [Accepted: 04/09/2011] [Indexed: 11/19/2022]
Abstract
Background The need for efficient algorithms to uncover biologically relevant phosphorylation motifs has become very important with rapid expansion of the proteomic sequence database along with a plethora of new information on phosphorylation sites. Here we present a novel unsupervised method, called Motif Finder (in short, F-Motif) for identification of phosphorylation motifs. F-Motif uses clustering of sequence information represented by numerical features that exploit the statistical information hidden in some foreground data. Furthermore, these identified motifs are then filtered to find “actual” motifs with statistically significant motif scores. Results and Discussion We have applied F-Motif to several new and existing data sets and compared its performance with two well known state-of-the-art methods. In almost all cases F-Motif could identify all statistically significant motifs extracted by the state-of-the-art methods. More importantly, in addition to this, F-Motif uncovers several novel motifs. We have demonstrated using clues from the literature that most of these new motifs discovered by F-Motif are indeed novel. We have also found some interesting phenomena. For example, for CK2 kinase, the conserved sites appear only on the right side of S. However, for CDK kinase, the adjacent site on the right of S is conserved with residue P. In addition, three different encoding methods, including a novel position contrast matrix (PCM) and the simplest binary coding, are used and the ability of F-motif to discover motifs remains quite robust with respect to encoding schemes. Conclusions An iterative algorithm proposed here uses exploratory data analysis to discover motifs from phosphorylated data. The effectiveness of F-Motif has been demonstrated using several real data sets as well as using a synthetic data set. The method is quite general in nature and can be used to find other types of motifs also. 
We have also provided a server for F-Motif at http://f-motif.classcloud.org/, http://bio.classcloud.org/f-motif/ or http://ymu.classcloud.org/f-motif/.
Affiliation(s)
- Yi-Cheng Chen
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan
- Kripamoy Aguan
- Department of Biotechnology and Bioinformatics, North Eastern Hill University, Shillong, India
- Chu-Wen Yang
- Department of Microbiology, Soochow University, Taipei, Taiwan
- Yao-Tsung Wang
- Software Technology Division, National Center for High-Performance Computing, Taichung, Taiwan
- Nikhil R. Pal
- Electronics and Communication Sciences Unit, Indian Statistical Institute, Calcutta, India
- I-Fang Chung
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan
- Center for Systems and Synthetic Biology, National Yang-Ming University, Taipei, Taiwan
13
Macintyre G, Bailey J, Haviv I, Kowalczyk A. is-rSNP: a novel technique for in silico regulatory SNP detection. Bioinformatics 2010; 26:i524-30. [PMID: 20823317 PMCID: PMC2935445 DOI: 10.1093/bioinformatics/btq378] [Citation(s) in RCA: 102] [Impact Index Per Article: 7.3] [Indexed: 02/06/2023]
Abstract
Motivation: Determining the functional impact of non-coding disease-associated single nucleotide polymorphisms (SNPs) identified by genome-wide association studies (GWAS) is challenging. Many of these SNPs are likely to be regulatory SNPs (rSNPs): variations which affect the ability of a transcription factor (TF) to bind to DNA. However, experimental procedures for identifying rSNPs are expensive and labour intensive. Therefore, in silico methods are required for rSNP prediction. By scoring two alleles with a TF position weight matrix (PWM), it can be determined which SNPs are likely rSNPs. However, predictions in this manner are noisy and no method exists that determines the statistical significance of a nucleotide variation on a PWM score. Results: We have designed an algorithm for in silico rSNP detection called is-rSNP. We employ novel convolution methods to determine the complete distributions of PWM scores and ratios between allele scores, facilitating assignment of statistical significance to rSNP effects. We have tested our method on 41 experimentally verified rSNPs, correctly predicting the disrupted TF in 28 cases. We also analysed 146 disease-associated SNPs with no known functional impact in an attempt to identify candidate rSNPs. Of the 11 significantly predicted disrupted TFs, 9 had previous evidence of being associated with the disease in the literature. These results demonstrate that is-rSNP is suitable for high-throughput screening of SNPs for potential regulatory function. This is a useful and important tool in the interpretation of GWAS. Availability: is-rSNP software is available for use at: www.genomics.csse.unimelb.edu.au/is-rSNP Contact:gmaci@csse.unimelb.edu.au; adam.kowalczyk@nicta.com.au Supplementary information:Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Geoff Macintyre
- Department of Computer Science and Software Engineering, University of Melbourne, Victoria, Australia.
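
The idea of obtaining a complete PWM score distribution by convolution, mentioned in the abstract above, can be sketched in a few lines of Python. This is a generic textbook-style construction under stated assumptions, not the is-rSNP implementation: the names `score_distribution` and `p_value` are hypothetical, the PWM weights are assumed to be pre-rounded to integers, and background bases are assumed independent and identically distributed.

```python
from collections import defaultdict

def score_distribution(int_pwm, background):
    """Exact distribution of the total PWM score of a random window.

    int_pwm: list of dicts mapping base -> integer weight, one per position.
    background: dict mapping base -> probability (must sum to 1).
    The distribution is built by convolving in one position at a time.
    """
    dist = {0: 1.0}  # before any position is read, the score is 0 with certainty
    for column in int_pwm:
        updated = defaultdict(float)
        for partial_score, prob in dist.items():
            for base, weight in column.items():
                updated[partial_score + weight] += prob * background[base]
        dist = dict(updated)
    return dist

def p_value(dist, observed):
    """Probability that a random window scores at least `observed`."""
    return sum(prob for s, prob in dist.items() if s >= observed)
```

Given the distribution, the score of each allele can be converted to a p-value, which gives a principled basis for asking whether a single-nucleotide change shifts a site from significant to non-significant. The distribution of the ratio between allele scores described in the abstract is a further step that this sketch does not cover.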
14
Nuel G, Regad L, Martin J, Camproux AC. Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data. Algorithms Mol Biol 2010; 5:15. [PMID: 20205909 PMCID: PMC2828453 DOI: 10.1186/1748-7188-5-15] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 09/22/2009] [Accepted: 01/26/2010] [Indexed: 11/18/2022]
Abstract
BACKGROUND In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models. RESULTS The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also permits to avoid any product of convolution of the pattern distribution in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy-example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists in concatenating the sequences in the data set into a single sequence. 
CONCLUSIONS Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We end up with a discussion on our method and on its potential improvements.
Affiliation(s)
- Gregory Nuel
- LSG, Laboratoire Statistique et Génome, CNRS UMR-8071, INRA UMR-1152, University of Evry, Evry, France
- CNRS, Paris, France
- MAP5, Department of Applied Mathematics, CNRS UMR-8145, University Paris Descartes, Paris, France
- Leslie Regad
- EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
- MTi, Molecules Thérapeutique in silico, INSERM UMRS-973, University Paris Diderot, Paris, France
- Juliette Martin
- EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
- MIG, Mathématique Informatique et Genome, INRA UR-1077, Jouy-en-Josas, France
- IBCP, Institut de Biologie et Chimie des Protéines, IFR 128, CNRS UMR 5086, University of Lyon 1, Lyon, France
- Anne-Claude Camproux
- EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
- MTi, Molecules Thérapeutique in silico, INSERM UMRS-973, University Paris Diderot, Paris, France
15
Oh YM, Kim JK, Choi Y, Choi S, Yoo JY. Prediction and experimental validation of novel STAT3 target genes in human cancer cells. PLoS One 2009; 4:e6911. [PMID: 19730699 PMCID: PMC2731854 DOI: 10.1371/journal.pone.0006911] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Received: 04/02/2009] [Accepted: 08/03/2009] [Indexed: 11/23/2022]
Abstract
The comprehensive identification of functional transcription factor binding sites (TFBSs) is an important step in understanding complex transcriptional regulatory networks. This study presents a motif-based comparative approach, STAT-Finder, for identifying functional DNA binding sites of STAT3 transcription factor. STAT-Finder combines STAT-Scanner, which was designed to predict functional STAT TFBSs with improved sensitivity, and a motif-based alignment to minimize false positive prediction rates. Using two reference sets containing promoter sequences of known STAT3 target genes, STAT-Finder identified functional STAT3 TFBSs with enhanced prediction efficiency and sensitivity relative to other conventional TFBS prediction tools. In addition, STAT-Finder identified novel STAT3 target genes among a group of genes that are over-expressed in human cancer cells. The binding of STAT3 to the predicted TFBSs was also experimentally confirmed through chromatin immunoprecipitation. Our proposed method provides a systematic approach to the prediction of functional TFBSs that can be applied to other TFs.
Affiliation(s)
- Young Min Oh
- Department of Life Sciences, Pohang University of Science and Technology, Pohang, Republic of Korea
- Jong Kyoung Kim
- Department of Computer Science, Pohang University of Science and Technology, Pohang, Republic of Korea
- Yongwook Choi
- Department of Life Sciences, Pohang University of Science and Technology, Pohang, Republic of Korea
- Seungjin Choi
- Department of Computer Science, Pohang University of Science and Technology, Pohang, Republic of Korea
- Joo-Yeon Yoo
- Department of Life Sciences, Pohang University of Science and Technology, Pohang, Republic of Korea
16
Pape UJ, Rahmann S, Sun F, Vingron M. Compound Poisson approximation of the number of occurrences of a position frequency matrix (PFM) on both strands. J Comput Biol 2008; 15:547-64. [PMID: 18631020 DOI: 10.1089/cmb.2007.0084] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Abstract
Transcription factors play a key role in gene regulation by interacting with specific binding sites or motifs. Therefore, enrichment of binding motifs is important for genome annotation and efficient computation of the statistical significance, the p-value, of the enrichment of motifs is crucial. We propose an efficient approximation to compute the significance. Due to the incorporation of both strands of the DNA molecules and explicit modeling of dependencies between overlapping hits, we achieve accurate results for any DNA motif based on its Position Frequency Matrix (PFM) representation. The accuracy of the p-value approximation is shown by comparison with the simulated count distribution. Furthermore, we compare the approach with a binomial approximation, (compound) Poisson approximation, and a normal approximation. In general, our approach outperforms these approximations or is equally good but significantly faster. An implementation of our approach is available at http://mosta.molgen.mpg.de.
Affiliation(s)
- Utz J Pape
- Computational Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
17
Hu J, Hu H, Li X. MOPAT: a graph-based method to predict recurrent cis-regulatory modules from known motifs. Nucleic Acids Res 2008; 36:4488-97. [PMID: 18606616 PMCID: PMC2490743 DOI: 10.1093/nar/gkn407] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Indexed: 11/30/2022]
Abstract
The identification of cis-regulatory modules (CRMs) can greatly advance our understanding of eukaryotic regulatory mechanisms. Current methods to predict CRMs from known motifs either depend on multiple alignments or can only deal with a small number of known motifs provided by users. These methods are problematic when binding sites are not well aligned in multiple alignments or when the number of input known motifs is large. We thus developed a new CRM identification method MOPAT (motif pair tree), which identifies CRMs through the identification of motif modules, groups of motifs co-occurring in multiple CRMs. It can identify ‘orthologous’ CRMs without multiple alignments. It can also find CRMs given a large number of known motifs. We have applied this method to mouse developmental genes, and have evaluated the predicted CRMs and motif modules by microarray expression data and known interacting motif pairs. We show that the expression profiles of the genes containing CRMs of the same motif module correlate significantly better than those of a random set of genes do. We also show that the known interacting motif pairs are significantly included in our predictions. Compared with several current methods, our method shows better performance in identifying meaningful CRMs.
Affiliation(s)
- Jianfei Hu
- Division of Biostatistics, School of Informatics, Indiana University, 410 West 10th Street, Indianapolis, IN 46202, USA
|
18
|
Lähdesmäki H, Rust AG, Shmulevich I. Probabilistic inference of transcription factor binding from multiple data sources. PLoS One 2008; 3:e1820. [PMID: 18364997 PMCID: PMC2268002 DOI: 10.1371/journal.pone.0001820] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.4] [Received: 09/17/2007] [Accepted: 02/04/2008] [Indexed: 11/21/2022] Open
Abstract
An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources.
The test data set, a web tool, source code and supplementary data are available at: http://www.probtf.org.
Affiliation(s)
- Harri Lähdesmäki
- Institute for Systems Biology, Seattle, Washington, United States of America
- Alistair G. Rust
- Institute for Systems Biology, Seattle, Washington, United States of America
- Ilya Shmulevich
- Institute for Systems Biology, Seattle, Washington, United States of America
|
19
|
Touzet H, Varré JS. Efficient and accurate P-value computation for Position Weight Matrices. Algorithms Mol Biol 2007; 2:15. [PMID: 18072973 PMCID: PMC2238751 DOI: 10.1186/1748-7188-2-15] [Citation(s) in RCA: 86] [Impact Index Per Article: 5.1] [Received: 07/06/2007] [Accepted: 12/11/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. Using PWMs requires knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model can achieve a score larger than or equal to the observed value. This gives rise to the following problem: Given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many examples of PWMs, they fail to give accurate results in a reasonable amount of time. RESULTS The contribution of this paper is twofold. First, we study the theoretical complexity of the problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in software called TFM-PVALUE, which is freely available. CONCLUSION We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.
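The dynamic-programming baseline mentioned in this abstract can be sketched as below. This is an illustrative sketch of a single-granularity discretized score distribution, not the TFM-PVALUE algorithm (which, per the abstract, refines a series of such distributions until convergence). It assumes an i.i.d. background model; the names `score_distribution`, `pvalue`, and `granularity` are hypothetical.

```python
from collections import defaultdict


def score_distribution(pwm, background, granularity=0.01):
    """Distribution of discretized PWM scores under an i.i.d. background.

    pwm: list of dicts, one per motif position, mapping base -> score.
    background: dict mapping base -> probability (should sum to 1).
    Returns a dict mapping integer score bucket -> probability, where a
    bucket is the sum of per-column scores rounded to `granularity`.
    """
    dist = {0: 1.0}  # before any column is read, the score is 0
    for column in pwm:
        nxt = defaultdict(float)
        for bucket, prob in dist.items():
            for base, score in column.items():
                step = round(score / granularity)
                nxt[bucket + step] += prob * background[base]
        dist = dict(nxt)
    return dist


def pvalue(pwm, background, threshold, granularity=0.01):
    """P(score >= threshold) under the background, up to rounding error."""
    dist = score_distribution(pwm, background, granularity)
    cutoff = round(threshold / granularity)
    return sum(prob for bucket, prob in dist.items() if bucket >= cutoff)
```

With integer scores and `granularity=1.0` the result is exact; with real-valued scores the rounding of each column introduces an error that shrinks as the granularity gets finer, at the cost of more buckets.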
|
20
|
Affiliation(s)
- Dmitry A Rodionov
- Burnham Institute for Medical Research, La Jolla, California 92037, USA.
|
21
|
Chen G, Jensen ST, Stoeckert CJ. Clustering of genes into regulons using integrated modeling-COGRIM. Genome Biol 2007; 8:R4. [PMID: 17204163 PMCID: PMC1839128 DOI: 10.1186/gb-2007-8-1-r4] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.9] [Received: 08/08/2006] [Revised: 11/14/2006] [Accepted: 01/04/2007] [Indexed: 11/12/2022] Open
Abstract
COGRIM, an implementation that integrates gene expression, ChIP binding and transcription factor motif data, is described and applied to both unicellular and mammalian organisms. We present a Bayesian hierarchical model and Gibbs Sampling implementation that integrates gene expression, ChIP binding, and transcription factor motif data in a principled and robust fashion. COGRIM was applied to both unicellular and mammalian organisms under different scenarios of available data. In these applications, we demonstrate the ability to predict gene-transcription factor interactions with reduced numbers of false-positive findings and to make predictions beyond what is obtained when single types of data are considered.
Affiliation(s)
- Guang Chen
- Department of Bioengineering, University of Pennsylvania, 240 Skirkanich Hall, 3320 Smith Walk, Philadelphia, Pennsylvania 19104, USA
- Center for Bioinformatics, University of Pennsylvania, 1420 Blockley Hall, 423 Guardian Drive, Philadelphia, Pennsylvania 19104, USA
- Shane T Jensen
- Department of Statistics, The Wharton School, University of Pennsylvania, 463 Jon M. Huntsman Hall, 3730 Walnut Street, Philadelphia, Pennsylvania 19104, USA
- Christian J Stoeckert
- Center for Bioinformatics, University of Pennsylvania, 1420 Blockley Hall, 423 Guardian Drive, Philadelphia, Pennsylvania 19104, USA
- Department of Genetics, School of Medicine, University of Pennsylvania, 415 Curie Boulevard, Philadelphia, Pennsylvania 19104, USA
|
22
|
Schones DE, Smith AD, Zhang MQ. Statistical significance of cis-regulatory modules. BMC Bioinformatics 2007; 8:19. [PMID: 17241466 PMCID: PMC1796902 DOI: 10.1186/1471-2105-8-19] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.8] [Received: 10/06/2006] [Accepted: 01/22/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND It is becoming increasingly important for researchers to be able to scan through large genomic regions for transcription factor binding sites or clusters of binding sites forming cis-regulatory modules. Correspondingly, there has been a push to develop algorithms for the rapid detection and assessment of cis-regulatory modules. While various algorithms for this purpose have been introduced, most are not well suited for rapid, genome scale scanning. RESULTS We introduce methods designed for the detection and statistical evaluation of cis-regulatory modules, modeled as either clusters of individual binding sites or as combinations of sites with constrained organization. In order to determine the statistical significance of module sites, we first need a method to determine the statistical significance of single transcription factor binding site matches. We introduce a straightforward method of estimating the statistical significance of single site matches using a database of known promoters to produce data structures that can be used to estimate p-values for binding site matches. We next introduce a technique to calculate the statistical significance of the arrangement of binding sites within a module using a max-gap model. If the module scanned for has defined organizational parameters, the probability of the module is corrected to account for organizational constraints. The statistical significance of single site matches and the architecture of sites within the module can be combined to provide an overall estimation of statistical significance of cis-regulatory module sites. CONCLUSION The methods introduced in this paper allow for the detection and statistical evaluation of single transcription factor binding sites and cis-regulatory modules. The features described are implemented in the Search Tool for Occurrences of Regulatory Motifs (STORM) and MODSTORM software.
Affiliation(s)
- Dustin E Schones
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- Department of Physics and Astronomy, Stony Brook University, Stony Brook, NY 11790, USA
- Andrew D Smith
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- Michael Q Zhang
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
|
23
|
|
24
|
Defrance M, Touzet H. Predicting transcription factor binding sites using local over-representation and comparative genomics. BMC Bioinformatics 2006; 7:396. [PMID: 16945132 PMCID: PMC1570149 DOI: 10.1186/1471-2105-7-396] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.6] [Received: 03/31/2006] [Accepted: 08/31/2006] [Indexed: 12/02/2022] Open
Abstract
Background Identifying cis-regulatory elements is crucial to understanding gene expression, which highlights the importance of the computational detection of overrepresented transcription factor binding sites (TFBSs) in coexpressed or coregulated genes. However, this is a challenging problem, especially when considering higher eukaryotic organisms. Results We have developed a method, named TFM-Explorer, that searches for locally overrepresented TFBSs in a set of coregulated genes, which are modeled by profiles provided by a database of position weight matrices. The novelty of the method is that it takes advantage of spatial conservation in the sequence and supports multiple species. The efficiency of the underlying algorithm and its robustness to noise allow weak regulatory signals to be detected in large heterogeneous data sets. Conclusion TFM-Explorer provides an efficient way to predict TFBS overrepresentation in related sequences. Promising results were obtained in a variety of examples in human, mouse, and rat genomes. The software is publicly available at .
Affiliation(s)
- Matthieu Defrance
- LIFL, UMR CNRS 8022, Université des Sciences et Technologies de Lille, Villeneuve d'Ascq, France
- Hélène Touzet
- LIFL, UMR CNRS 8022, Université des Sciences et Technologies de Lille, Villeneuve d'Ascq, France
|
25
|
Wren JD, Johnson D, Gruenwald L. Automating genomic data mining via a sequence-based matrix format and associative rule set. BMC Bioinformatics 2005; 6 Suppl 2:S2. [PMID: 16026599 PMCID: PMC1637034 DOI: 10.1186/1471-2105-6-s2-s2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 02/03/2023] Open
Abstract
There is an enormous amount of information encoded in each genome – enough to create living, responsive and adaptive organisms. Raw sequence data alone is not enough to understand function, mechanisms or interactions. Changes in a single base pair can lead to disease, such as sickle-cell anemia, while some large megabase deletions have no apparent phenotypic effect. Genomic features are varied in their data types and annotation of these features is spread across multiple databases. Herein, we develop a method to automate exploration of genomes by iteratively exploring sequence data for correlations and building upon them. First, to integrate and compare different annotation sources, a sequence matrix (SM) is developed to contain position-dependent information. Second, a classification tree is developed for matrix row types, specifying how each data type is to be treated with respect to other data types for analysis purposes. Third, correlative analyses are developed to analyze features of each matrix row in terms of the other rows, guided by the classification tree as to which analyses are appropriate. A prototype was developed and was successful in detecting coinciding genomic features among genes, exons, repetitive elements and CpG islands.
Affiliation(s)
- Jonathan D Wren
- Advanced Center for Genome Technology, Department of Botany and Microbiology, 101 David L. Boren Blvd. Rm 2025
- David Johnson
- School of Computer Science, The University of Oklahoma, Norman, Oklahoma 73019
- Le Gruenwald
- School of Computer Science, The University of Oklahoma, Norman, Oklahoma 73019
|
26
|
Gershenzon NI, Stormo GD, Ioshikhes IP. Computational technique for improvement of the position-weight matrices for the DNA/protein binding sites. Nucleic Acids Res 2005; 33:2290-301. [PMID: 15849315 PMCID: PMC1084321 DOI: 10.1093/nar/gki519] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.4] [Indexed: 11/23/2022] Open
Abstract
Position-weight matrices (PWMs) are broadly used to locate transcription factor binding sites in DNA sequences. The majority of existing PWMs provide a low level of both sensitivity and specificity. We present a new computational algorithm, a modification of the Staden–Bucher approach, that improves the PWM. We applied the proposed technique on the PWM of the GC-box, binding site for Sp1. The comparison of old and new PWMs shows that the latter increase both sensitivity and specificity. The statistical parameters of GC-box distribution in promoter regions and in the human genome, as well as in each chromosome, are presented. The majority of commonly used PWMs are the 4-row mononucleotide matrices, although 16-row dinucleotide matrices are known to be more informative. The algorithm efficiently determines the 16-row matrices and preliminary results show that such matrices provide better results than 4-row matrices.
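For readers unfamiliar with the 4-row mononucleotide matrices discussed in this abstract, the sketch below builds one from aligned binding sites and uses it to locate the best match in a sequence. It shows a standard log-odds construction with pseudocounts, not the authors' improved algorithm or the Staden–Bucher procedure; the example sites, the uniform background, and the names `build_pwm` and `best_hit` are assumptions made for illustration.

```python
from math import log2

BASES = "ACGT"


def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a 4-row log-odds PWM from equal-length aligned sites.

    Returns a list with one dict per motif position mapping base -> score,
    where score = log2(observed frequency / background frequency).
    """
    background = background or {base: 0.25 for base in BASES}
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        # Start every count at the pseudocount so no frequency is zero.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: log2(counts[b] / total / background[b]) for b in BASES})
    return pwm


def best_hit(pwm, sequence):
    """Return (score, offset) of the highest-scoring window on one strand."""
    width = len(pwm)
    scored = [
        (sum(col[sequence[start + j]] for j, col in enumerate(pwm)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Hypothetical GC-box-like sites, chosen only to exercise the code.
example_pwm = build_pwm(["GGGCGG", "GGGCGG", "GGGAGG"])
example_score, example_offset = best_hit(example_pwm, "ATATGGGCGGATAT")
```

A 16-row dinucleotide matrix, which the abstract reports as more informative, would follow the same pattern with counts indexed by adjacent base pairs instead of single bases.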
Affiliation(s)
- Naum I Gershenzon
- Department of Biomedical Informatics, The Ohio State University 3184 Graves Hall, 333 W. 10th Avenue, Columbus, OH 43210, USA.
|
27
|
Yap YL, Lam DCL, Luc G, Zhang XW, Hernandez D, Gras R, Wang E, Chiu SW, Chung LP, Lam WK, Smith DK, Minna JD, Danchin A, Wong MP. Conserved transcription factor binding sites of cancer markers derived from primary lung adenocarcinoma microarrays. Nucleic Acids Res 2005; 33:409-21. [PMID: 15653641 PMCID: PMC546166 DOI: 10.1093/nar/gki188] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Indexed: 01/02/2023] Open
Abstract
Gene transcription in a set of 49 human primary lung adenocarcinomas and 9 normal lung tissue samples was examined using Affymetrix GeneChip technology. A total of 3442 genes, called the set MAD, were found to be either up- or down-regulated by at least 2-fold between the two phenotypes. Genes assigned to a particular gene ontology term were found, in many cases, to be significantly unevenly distributed between the genes in and outside MAD. Terms that were overrepresented in MAD included functions directly implicated in the cancer cell metabolism. Based on their functional roles and expression profiles, genes in MAD were grouped into likely co-regulated gene sets. Highly conserved sequences in the 5 kb region upstream of the genes in these sets were identified with the motif discovery tool, MoDEL. Potential oncogenic transcription factors and their corresponding binding sites were identified in these conserved regions using the TRANSFAC 8.3 database. Several of the transcription factors identified in this study have been shown elsewhere to be involved in oncogenic processes. This study searched beyond phenotypic gene expression profiles in cancer cells, in order to identify the more important regulatory transcription factors that caused these aberrations in gene expression.
Affiliation(s)
- Yee Leng Yap
- HKU-Pasteur Research Centre Dexter H.C. Man Building, 8 Sassoon Road Pokfulam, Hong Kong, China.
|
28
|
Frith MC, Spouge JL, Hansen U, Weng Z. Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res 2002; 30:3214-24. [PMID: 12136103 PMCID: PMC135758 DOI: 10.1093/nar/gkf438] [Citation(s) in RCA: 91] [Impact Index Per Article: 4.1] [Indexed: 11/13/2022] Open
Abstract
The human genome encodes the transcriptional control of its genes in clusters of cis-elements that constitute enhancers, silencers and promoter signals. The sequence motifs of individual cis-elements are usually too short and degenerate for confident detection. In most cases, the requirements for organization of cis-elements within these clusters are poorly understood. Therefore, we have developed a general method to detect local concentrations of cis-element motifs, using predetermined matrix representations of the cis-elements, and calculate the statistical significance of these motif clusters. The statistical significance calculation is highly accurate not only for idealized, pseudorandom DNA, but also for real human DNA. We use our method 'cluster of motifs E-value tool' (COMET) to make novel predictions concerning the regulation of genes by transcription factors associated with muscle. COMET performs comparably with two alternative state-of-the-art techniques, which are more complex and lack E-value calculations. Our statistical method enables us to clarify the major bottleneck in the hard problem of detecting cis-regulatory regions, which is that many known enhancers do not contain very significant clusters of the motif types that we search for. Thus, discovery of additional signals that belong to these regulatory regions will be the key to future progress.
Affiliation(s)
- Martin C Frith
- Bioinformatics Program, Boston University, 44 Cummington Street, Boston MA 02215, USA
|
29
|
Alimi JP, Poirot O, Lopez F, Claverie JM. Reverse transcriptase-polymerase chain reaction validation of 25 "orphan" genes from Escherichia coli K-12 MG1655. Genome Res 2000; 10:959-66. [PMID: 10899145 PMCID: PMC310931 DOI: 10.1101/gr.10.7.959] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.8] [Indexed: 11/24/2022]
Abstract
Despite the accumulation of sequence information sampling from a broad spectrum of phyla, newly sequenced genomes continue to reveal a high proportion (50%-30%) of "uncharacterized" genes, including a significant number of strictly "orphan" genes, i.e., putative open reading frames (ORFs) without any resemblance to previously determined protein-coding sequences. Most genes found in databases have only been predicted by computer methods and have never been experimentally validated. Although theoretical evolutionary arguments support the reality of genes when homologs are found in a variety of distant species, this is not the case for orphan genes. Here, we report the direct reverse transcriptase-polymerase chain reaction assay of 25 strictly orphan ORFs of Escherichia coli. Two growth conditions, exponential and stationary phases, were tested. Transcripts were identified for a total of 19 orphan genes, with 2 genes found to be expressed in only one of the two growth conditions. Our results suggest that a vast majority of E. coli ORFs presently annotated as "hypothetical" correspond to bona fide genes. By extension, this implies that randomly occurring "junk" ORFs have been actively counter selected during the evolution of the dense E. coli genome.
Affiliation(s)
- J P Alimi
- Structural and Genetic Information Laboratory, Marseille, France
|
30
|
Affiliation(s)
- J W Fickett
- Bioinformatics, SmithKline Beecham Pharmaceuticals, King of Prussia, Pennsylvania 19406, USA.
|