Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Abeel T, Saeys Y, Rouzé P, Van de Peer Y. ProSOM: core promoter prediction based on unsupervised clustering of DNA physical profiles. Bioinformatics 2008;24:i24-31. [PMID: 18586720 PMCID: PMC2718650 DOI: 10.1093/bioinformatics/btn172] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.1] [Indexed: 01/19/2023] Open

Cited by Other Article(s)

Paul S, Olymon K, Martinez GS, Sarkar S, Yella VR, Kumar A. MLDSPP: Bacterial Promoter Prediction Tool Using DNA Structural Properties with Machine Learning and Explainable AI. J Chem Inf Model 2024;64:2705-2719. [PMID: 38258978 DOI: 10.1021/acs.jcim.3c02017] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 01/24/2024]

Abstract

Bacterial promoters play a crucial role in gene expression by serving as docking sites for the transcription initiation machinery. However, accurately identifying promoter regions in bacterial genomes remains a challenge due to their diverse architecture and variations. In this study, we propose MLDSPP (Machine Learning and Duplex Stability based Promoter prediction in Prokaryotes), a machine learning-based promoter prediction tool, to comprehensively screen bacterial promoter regions in 12 diverse genomes. We leveraged biologically relevant and informative DNA structural properties, such as DNA duplex stability and base stacking, and state-of-the-art machine learning (ML) strategies to gain insights into promoter characteristics. We evaluated several machine learning models, including Support Vector Machines, Random Forests, and XGBoost, and assessed their performance using accuracy, precision, recall, specificity, F1 score, and MCC metrics. Our findings reveal that XGBoost outperformed other models and current state-of-the-art promoter prediction tools, namely Sigma70pred and iPromoter2L, achieving F1-scores >95% in most systems. Significantly, the use of one-hot encoding for representing nucleotide sequences complements these structural features, enhancing our XGBoost model's predictive capabilities. To address the challenge of model interpretability, we incorporated explainable AI techniques using Shapley values. This enhancement allows for a better understanding and interpretation of the predictions of our model. In conclusion, our study presents MLDSPP as a novel, generic tool for predicting promoter regions in bacteria, utilizing original downstream sequences as nonpromoter controls. This tool has the potential to significantly advance the field of bacterial genomics and contribute to our understanding of gene regulation in diverse bacterial systems.

Uemura K, Ohyama T. Physical Peculiarity of Two Sites in Human Promoters: Universality and Diverse Usage in Gene Function. Int J Mol Sci 2024;25:1487. [PMID: 38338773 PMCID: PMC10855393 DOI: 10.3390/ijms25031487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 01/15/2024] [Accepted: 01/18/2024] [Indexed: 02/12/2024] Open

Jankovic B, Gojobori T. From shallow to deep: some lessons learned from application of machine learning for recognition of functional genomic elements in human genome. Hum Genomics 2022;16:7. [PMID: 35180894 PMCID: PMC8855580 DOI: 10.1186/s40246-022-00376-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/26/2021] [Accepted: 01/02/2022] [Indexed: 11/25/2022] Open

Abstract

Identification of genomic signals as indicators for functional genomic elements is one of the areas that received early and widespread application of machine learning methods. With time, the methods applied grew in variety and generally exhibited a tendency to improve their ability to identify some major genomic and transcriptomics signals. The evolution of machine learning in genomics followed a similar path to applications of machine learning in other fields. These were impacted in a major way by three dominant developments, namely an enormous increase in availability and quality of data, a significant increase in computational power available to machine learning applications, and finally, new machine learning paradigms, of which deep learning is the most well-known example. It is not easy in general to distinguish factors leading to improvements in results of applications of machine learning. This is even more so in the field of genomics, where the advent of next-generation sequencing and the increased ability to perform functional analysis of raw data have had a major effect on the applicability of machine learning in OMICS fields. In this paper, we survey the results from a subset of published work in application of machine learning in the recognition of genomic signals and regions in human genome and summarize some lessons learnt from this endeavor. There is no doubt that a significant progress has been made both in terms of accuracy and reliability of models. Questions remain however whether the progress has been sufficient and what these developments bring to the field of genomics in general and human genomics in particular. Improving usability, interpretability and accuracy of models remains an important open challenge for current and future research in application of machine learning and more generally of artificial intelligence methods in genomics.

de Medeiros Oliveira M, Bonadio I, Lie de Melo A, Mendes Souza G, Durham AM. TSSFinder-fast and accurate ab initio prediction of the core promoter in eukaryotic genomes. Brief Bioinform 2021;22:bbab198. [PMID: 34050351 PMCID: PMC8574697 DOI: 10.1093/bib/bbab198] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Revised: 02/14/2021] [Accepted: 02/23/2021] [Indexed: 12/02/2022] Open

Aman Beshir J, Kebede M. In silico analysis of promoter regions and regulatory elements (motifs and CpG islands) of the genes encoding for alcohol production in Saccharomyces cerevisiaea S288C and Schizosaccharomyces pombe 972h. J Genet Eng Biotechnol 2021;19:8. [PMID: 33428031 PMCID: PMC7801573 DOI: 10.1186/s43141-020-00097-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 06/19/2020] [Accepted: 11/17/2020] [Indexed: 11/10/2022]

Abstract

BACKGROUND

The crucial factor in the production of bio-fuels is the choice of potent microorganisms used in fermentation processes. Despite the evolving trend of using bacteria, yeast is still the primary choice for fermentation. Molecular characterization of many genes from baker's yeast (Saccharomyces cerevisiaea), and fission yeast (Schizosaccharomyces pombe), have improved our understanding in gene structure and the regulation of its expression. This in silico study was done with the aim of analyzing the promoter regions, transcription start site (TSS), and CpG islands of genes encoding for alcohol production in S. cerevisiaea S288C and S. pombe 972h-.

RESULTS

The analysis revealed the highest promoter prediction scores (1.0) were obtained in five sequences (AAD4, SFA1, GRE3, YKL071W, and YPR127W) for S. cerevisiaea S288C TSS while the lowest (0.8) were found in three sequences (AAD6, ADH5, and BDH2). Similarly, in S. pombe 972h-, the highest (0.99) and lowest (0.88) prediction scores were obtained in five (Adh1, SPBC8E4.04, SPBC215.11c, SPAP32A8.02, and SPAC19G12.09) and one (erg27) sequences, respectively. Determination of common motifs revealed that S. cerevisiaea S288C had 100% coverage at MSc1 with an E value of 3.7e-007 while S. pombe 972h- had 95.23% at MSp1 with an E value of 2.6e+002. Furthermore, comparison of identified transcription factor proteins indicated that 88.88% of MSp1 were exactly similar to MSc1. It also revealed that only 21.73% in S. cerevisiaea S288C and 28% in S. pombe 972h- of the gene body regions had CpG islands. A combined phylogenetic analysis indicated that all sequences from both S. cerevisiaea S288C and S. pombe 972h- were divided into four subgroups (I, II, III, and IV). The four clades are respectively colored in blue, red, green, and violet.

CONCLUSION

This in silico analysis of gene promoter regions and transcription factors through the actions of regulatory structure such as motifs and CpG islands of genes encoding alcohol production could be used to predict gene expression profiles in yeast species.

Identification of Regulatory SNPs Associated with Vicine and Convicine Content of Vicia faba Based on Genotyping by Sequencing Data Using Deep Learning. Genes (Basel) 2020;11:genes11060614. [PMID: 32516876 PMCID: PMC7349281 DOI: 10.3390/genes11060614] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 04/29/2020] [Revised: 05/26/2020] [Accepted: 05/28/2020] [Indexed: 12/15/2022] Open

Liu B, Han L, Liu X, Wu J, Ma Q. Computational Prediction of Sigma-54 Promoters in Bacterial Genomes by Integrating Motif Finding and Machine Learning Strategies. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019;16:1211-1218. [PMID: 29993815 DOI: 10.1109/tcbb.2018.2816032] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 06/08/2023]

Karami K, Zerehdaran S, Javadmanesh A, Shariati MM, Fallahi H. Characterization of bovine (Bos taurus) imprinted genes from genomic to amino acid attributes by data mining approaches. PLoS One 2019;14:e0217813. [PMID: 31170205 PMCID: PMC6553745 DOI: 10.1371/journal.pone.0217813] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/17/2018] [Accepted: 05/21/2019] [Indexed: 01/05/2023] Open

Kalkatawi M, Magana-Mora A, Jankovic B, Bajic VB. DeepGSR: an optimized deep-learning structure for the recognition of genomic signals and regions. Bioinformatics 2019;35:1125-1132. [PMID: 30184052 PMCID: PMC6449759 DOI: 10.1093/bioinformatics/bty752] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Received: 12/21/2017] [Revised: 07/15/2018] [Accepted: 08/31/2018] [Indexed: 01/05/2023] Open

He W, Jia C, Duan Y, Zou Q. 70ProPred: a predictor for discovering sigma70 promoters based on combining multiple features. BMC Systems Biology 2018;12:44. [PMID: 29745856 PMCID: PMC5998878 DOI: 10.1186/s12918-018-0570-1] [Citation(s) in RCA: 60] [Impact Index Per Article: 8.6] [Indexed: 12/12/2022]

Ryasik A, Orlov M, Zykova E, Ermak T, Sorokin A. Bacterial promoter prediction: Selection of dynamic and static physical properties of DNA for reliable sequence classification. J Bioinform Comput Biol 2018;16:1840003. [DOI: 10.1142/s0219720018400036] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 11/18/2022]

Evolution of Brain Active Gene Promoters in Human Lineage Towards the Increased Plasticity of Gene Regulation. Mol Neurobiol 2017;55:1871-1904. [PMID: 28233272 DOI: 10.1007/s12035-017-0427-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/10/2016] [Accepted: 01/26/2017] [Indexed: 01/31/2023]

Sloutskin A, Danino YM, Orenstein Y, Zehavi Y, Doniger T, Shamir R, Juven-Gershon T. ElemeNT: a computational tool for detecting core promoter elements. Transcription 2016. [PMID: 26226151 PMCID: PMC4581360 DOI: 10.1080/21541264.2015.1067286] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Indexed: 12/22/2022] Open

Carvalho SG, Guerra-Sá R, de C Merschmann LH. The impact of sequence length and number of sequences on promoter prediction performance. BMC Bioinformatics 2015;16 Suppl 19:S5. [PMID: 26695879 PMCID: PMC4686783 DOI: 10.1186/1471-2105-16-s19-s5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 02/02/2023] Open

Abstract

BACKGROUND

The advent of rapid evolution on sequencing capacity of new genomes has evidenced the need for data analysis automation aiming at speeding up the genomic annotation process and reducing its cost. Given that one important step for functional genomic annotation is the promoter identification, several studies have been undertaken in order to propose computational approaches to predict promoters. Different classifiers and characteristics of the promoter sequences have been used to deal with this prediction problem. However, several works in literature have addressed the promoter prediction problem using datasets containing sequences of 250 nucleotides or more. As the sequence length defines the amount of dataset attributes, even considering a limited number of properties to characterize the sequences, datasets with a high number of attributes are generated for training classifiers. Since high-dimensional datasets can degrade classifiers' predictive performance or even require an infeasible processing time, training classifiers on datasets with a reduced number of attributes is essential to obtain good predictive performance with low computational cost. To the best of our knowledge, there is no work in literature that verified in a systematic way the relation between the sequences length and the predictive performance of classifiers. Thus, in this work, we have evaluated the impact of sequence length variation and training dataset size (number of sequences) on the predictive performance of classifiers.

RESULTS

We have built sixteen datasets composed of different sized sequences (ranging in length from 12 to 301 nucleotides) and evaluated them using the SVM, Random Forest and k-NN classifiers. The best predictive performances reached by SVM and Random Forest remained relatively stable for datasets composed of sequences varying in length from 301 to 41 nucleotides, while k-NN achieved its best performance for the dataset composed of 101 nucleotides. We have also analyzed, using sequences composed of only 41 nucleotides, the impact of increasing the number of sequences in a dataset on the predictive performance of the same three classifiers. Datasets containing 14,000, 80,000, 100,000 and 120,000 sequences were built and evaluated. All classifiers achieved better predictive performance for datasets containing 80,000 sequences or more.

CONCLUSION

The experimental results show that several datasets composed of shorter sequences achieved better predictive performance when compared with datasets composed of longer sequences, and also consumed a significantly shorter processing time. Furthermore, increasing the number of sequences in a dataset proved to be beneficial to the predictive power of classifiers.

Yella VR, Bansal M. In silico Identification of Eukaryotic Promoters. Systems and Synthetic Biology 2015. [DOI: 10.1007/978-94-017-9514-2_4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]

Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, Earl AM. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 2014;9:e112963. [PMID: 25409509 PMCID: PMC4237348 DOI: 10.1371/journal.pone.0112963] [Citation(s) in RCA: 6047] [Impact Index Per Article: 549.7] [Received: 08/25/2014] [Accepted: 10/16/2014] [Indexed: 02/06/2023] Open

Lin H, Deng EZ, Ding H, Chen W, Chou KC. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res 2014;42:12961-72. [PMID: 25361964 PMCID: PMC4245931 DOI: 10.1093/nar/gku1019] [Citation(s) in RCA: 413] [Impact Index Per Article: 37.5] [Indexed: 11/14/2022] Open

Carter JR, Keith JH, Fraser TS, Dawson JL, Kucharski CA, Horne KM, Higgs S, Fraser MJ. Effective suppression of dengue virus using a novel group-I intron that induces apoptotic cell death upon infection through conditional expression of the Bax C-terminal domain. Virol J 2014;11:111. [PMID: 24927852 PMCID: PMC4104402 DOI: 10.1186/1743-422x-11-111] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 01/14/2014] [Accepted: 05/20/2014] [Indexed: 11/10/2022] Open

Abstract

INTRODUCTION

Approximately 100 million confirmed infections and 20,000 deaths are caused by Dengue virus (DENV) outbreaks annually. Global warming and rapid dispersal have resulted in DENV epidemics in formerly non-endemic regions. Currently no consistently effective preventive measures for DENV exist, prompting development of transgenic and paratransgenic vector control approaches. Production of transgenic mosquitoes refractory for virus infection and/or transmission is contingent upon defining antiviral genes that have low probability for allowing escape mutations, and are equally effective against multiple serotypes. Previously we demonstrated the effectiveness of an anti-viral group I intron targeting U143 of the DENV genome in mediating trans-splicing and expression of a marker gene with the capsid coding domain. In this report we examine the effectiveness of coupling expression of ΔN Bax to trans-splicing U143 intron activity as a means of suppressing DENV infection of mosquito cells.

RESULTS

Targeting the conserved DENV circularization sequence (CS) by U143 intron trans-splicing activity appends a 3' exon RNA encoding ΔN Bax to the capsid coding region of the genomic RNA, resulting in a chimeric protein that induces premature cell death upon infection. TCID50-IFA analyses demonstrate an enhancement of DENV suppression for all DENV serotypes tested over the identical group I intron coupled with the non-apoptotic inducing firefly luciferase as the 3' exon. These cumulative results confirm the increased effectiveness of this αDENV-U143-ΔN Bax group I intron as a sequence specific antiviral that should be useful for suppression of DENV in transgenic mosquitoes. Annexin V staining, caspase 3 assays, and DNA ladder observations confirm DCA-ΔN Bax fusion protein expression induces apoptotic cell death.

CONCLUSION

This report confirms the relative effectiveness of an anti-DENV group I intron coupled to an apoptosis-inducing ΔN Bax 3' exon that trans-splices conserved sequences of the 5' CS region of all DENV serotypes and induces apoptotic cell death upon infection. Our results confirm coupling the targeted ribozyme capabilities of the group I intron with the generation of an apoptosis-inducing transcript increases the effectiveness of infection suppression, improving the prospects of this unique approach as a means of inducing transgenic refractoriness in mosquitoes for all serotypes of this important disease.

Xiong D, Liu R, Xiao F, Gao X. ProMT: effective human promoter prediction using Markov chain model based on DNA structural properties. IEEE Trans Nanobioscience 2014;13:374-83. [PMID: 24919203 DOI: 10.1109/tnb.2014.2327586] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/08/2022]

Grid topologies for the self-organizing map. Neural Netw 2014;56:35-48. [PMID: 24861385 DOI: 10.1016/j.neunet.2014.05.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 02/05/2014] [Revised: 04/28/2014] [Accepted: 05/01/2014] [Indexed: 11/20/2022]

Meysman P, Collado-Vides J, Morett E, Viola R, Engelen K, Laukens K. Structural properties of prokaryotic promoter regions correlate with functional features. PLoS One 2014;9:e88717. [PMID: 24516674 PMCID: PMC3918002 DOI: 10.1371/journal.pone.0088717] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 10/11/2013] [Accepted: 01/10/2014] [Indexed: 12/31/2022] Open

Huang WL, Tung CW, Liaw C, Huang HL, Ho SY. Rule-based knowledge acquisition method for promoter prediction in human and Drosophila species. ScientificWorldJournal 2014;2014:327306. [PMID: 24955394 PMCID: PMC3927563 DOI: 10.1155/2014/327306] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/31/2013] [Accepted: 10/10/2013] [Indexed: 01/08/2023] Open

Datta S, Mukhopadhyay S. A composite method based on formal grammar and DNA structural features in detecting human polymerase II promoter region. PLoS One 2013;8:e54843. [PMID: 23437045 PMCID: PMC3577817 DOI: 10.1371/journal.pone.0054843] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 09/06/2012] [Accepted: 12/17/2012] [Indexed: 11/25/2022] Open

Zhou X, Li Z, Dai Z, Zou X. Predicting promoters by pseudo-trinucleotide compositions based on discrete wavelets transform. J Theor Biol 2013;319:1-7. [DOI: 10.1016/j.jtbi.2012.11.024] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 07/18/2012] [Revised: 11/20/2012] [Accepted: 11/21/2012] [Indexed: 10/27/2022]

DNA-energetics-based analyses suggest additional genes in prokaryotes. J Biosci 2012;37:433-44. [PMID: 22750981 DOI: 10.1007/s12038-012-9221-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 10/28/2022]

A new avenue for classification and prediction of olive cultivars using supervised and unsupervised algorithms. PLoS One 2012;7:e44164. [PMID: 22957050 PMCID: PMC3434224 DOI: 10.1371/journal.pone.0044164] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Received: 06/10/2012] [Accepted: 07/30/2012] [Indexed: 11/19/2022] Open

Osypov AA, Krutinin GG, Krutinina EA, Kamzolova SG. DEPPDB - DNA electrostatic potential properties database. Electrostatic properties of genome DNA elements. J Bioinform Comput Biol 2012;10:1241004. [PMID: 22809340 DOI: 10.1142/s0219720012410041] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]

Hosseinzadeh F, Ebrahimi M, Goliaei B, Shamabadi N. Classification of lung cancer tumors based on structural and physicochemical properties of proteins by bioinformatics models. PLoS One 2012;7:e40017. [PMID: 22829872 PMCID: PMC3400626 DOI: 10.1371/journal.pone.0040017] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Received: 03/27/2012] [Accepted: 05/30/2012] [Indexed: 12/03/2022] Open

Abstract

Rapid distinction between small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC) tumors is very important in diagnosis of this disease. Furthermore, sequence-derived structural and physicochemical descriptors are very useful for machine learning prediction of protein structural and functional classes, classifying proteins and the prediction performance. In this study, the classification of lung tumors based on 1497 attributes derived from structural and physicochemical properties of protein sequences (based on genes defined by microarray analysis) was investigated through a combination of attribute weighting and supervised and unsupervised clustering algorithms. Eighty percent of the weighting methods selected features such as autocorrelation, dipeptide composition and distribution of hydrophobicity as the most important protein attributes in classification of SCLC, NSCLC and COMMON classes of lung tumors. The same results were observed by most tree induction algorithms while descriptors of hydrophobicity distribution were high in protein sequences COMMON in both groups and distribution of charge in these proteins was very low; showing COMMON proteins were very hydrophobic. Furthermore, compositions of polar dipeptide in SCLC proteins were higher than NSCLC proteins. Some clustering models (alone or in combination with attribute weighting algorithms) were able to nearly classify SCLC and NSCLC proteins. Random Forest tree induction algorithm (calculated on leave-one-out and 10-fold cross validation) shows more than 86% accuracy in clustering and predicting three different lung cancer tumors. Here for the first time the application of data mining tools to effectively classify three classes of lung cancer tumors regarding the importance of dipeptide composition, autocorrelation and distribution descriptor has been reported.

POWRS: position-sensitive motif discovery. PLoS One 2012;7:e40373. [PMID: 22792292 PMCID: PMC3390389 DOI: 10.1371/journal.pone.0040373] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 04/13/2012] [Accepted: 06/07/2012] [Indexed: 12/04/2022] Open

Meysman P, Marchal K, Engelen K. DNA structural properties in the classification of genomic transcription regulation elements. Bioinform Biol Insights 2012;6:155-68. [PMID: 22837642 PMCID: PMC3399529 DOI: 10.4137/bbi.s9426] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Indexed: 12/27/2022] Open

Gan Y, Guan J, Zhou S. A comparison study on feature selection of DNA structural properties for promoter prediction. BMC Bioinformatics 2012;13:4. [PMID: 22226192 PMCID: PMC3280155 DOI: 10.1186/1471-2105-13-4] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 07/27/2011] [Accepted: 01/07/2012] [Indexed: 01/27/2023] Open

Prediction of thermostability from amino acid attributes by combination of clustering with attribute weighting: a new vista in engineering enzymes. PLoS One 2011;6:e23146. [PMID: 21853079 PMCID: PMC3154288 DOI: 10.1371/journal.pone.0023146] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Received: 04/13/2011] [Accepted: 07/06/2011] [Indexed: 11/19/2022] Open

Abstract

The engineering of thermostable enzymes is receiving increased attention. The paper, detergent, and biofuel industries, in particular, seek to use environmentally friendly enzymes instead of toxic chlorine chemicals. Enzymes typically function at temperatures below 60°C and denature if exposed to higher temperatures. In contrast, a small portion of enzymes can withstand higher temperatures as a result of various structural adaptations. Understanding the protein attributes that are involved in this adaptation is the first step toward engineering thermostable enzymes. We employed various supervised and unsupervised machine learning algorithms as well as attribute weighting approaches to find amino acid composition attributes that contribute to enzyme thermostability. Specifically, we compared two groups of enzymes: mesostable and thermostable enzymes. Furthermore, a combination of attribute weighting with supervised and unsupervised clustering algorithms was used for prediction and modelling of protein thermostability from amino acid composition properties. Mining a large number of protein sequences (2090) through a variety of machine learning algorithms, which were based on the analysis of more than 800 amino acid attributes, increased the accuracy of this study. Moreover, these models were successful in predicting thermostability from the primary structure of proteins. The results showed that expectation maximization clustering in combination with uncertainty and correlation attribute weighting algorithms can effectively (100%) classify thermostable and mesostable proteins. Seventy per cent of the weighting methods selected Gln content and frequency of hydrophilic residues as the most important protein attributes. On the dipeptide level, the frequency of Asn-Glu was the key factor in distinguishing mesostable from thermostable enzymes. This study demonstrates the feasibility of predicting thermostability irrespective of sequence similarity and will serve as a basis for engineering thermostable enzymes in the laboratory.

Morey C, Mookherjee S, Rajasekaran G, Bansal M. DNA free energy-based promoter prediction and comparative analysis of Arabidopsis and rice genomes. Plant Physiology 2011;156:1300-15. [PMID: 21531900 PMCID: PMC3135951 DOI: 10.1104/pp.110.167809] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Received: 10/19/2010] [Accepted: 04/21/2011] [Indexed: 05/06/2023]

Bedo J, Kowalczyk A. Genome annotation test with validation on transcription start site and ChIP-Seq for Pol-II binding data. Bioinformatics 2011;27:1610-7. [DOI: 10.1093/bioinformatics/btr263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Irie T, Park SJ, Yamashita R, Seki M, Yada T, Sugano S, Nakai K, Suzuki Y. Predicting promoter activities of primary human DNA sequences. Nucleic Acids Res 2011;39:e75. [PMID: 21486745 PMCID: PMC3113590 DOI: 10.1093/nar/gkr173] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022] Open

Kantorovitz MR, Rapti Z, Gelev V, Usheva A. Computing DNA duplex instability profiles efficiently with a two-state model: trends of promoters and binding sites. BMC Bioinformatics 2010;11:604. [PMID: 21172036 PMCID: PMC3018474 DOI: 10.1186/1471-2105-11-604] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2010] [Accepted: 12/21/2010] [Indexed: 11/30/2022] Open

Dineen DG, Schröder M, Higgins DG, Cunningham P. Ensemble approach combining multiple methods improves human transcription start site prediction. BMC Genomics 2010;11:677. [PMID: 21118509 PMCID: PMC3053590 DOI: 10.1186/1471-2164-11-677] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2010] [Accepted: 11/30/2010] [Indexed: 11/20/2022] Open

Identification of TATA and TATA-less promoters in plant genomes by integrating diversity measure, GC-Skew and DNA geometric flexibility. Genomics 2010;97:112-20. [PMID: 21112384 DOI: 10.1016/j.ygeno.2010.11.002] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2010] [Revised: 11/05/2010] [Accepted: 11/12/2010] [Indexed: 11/20/2022]

Eukaryotic and prokaryotic promoter prediction using hybrid approach. Theory Biosci 2010;130:91-100. [DOI: 10.1007/s12064-010-0114-8] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2010] [Accepted: 10/23/2010] [Indexed: 12/27/2022]

Rangannan V, Bansal M. High-quality annotation of promoter regions for 913 bacterial genomes. Bioinformatics 2010;26:3043-50. [PMID: 20956245 DOI: 10.1093/bioinformatics/btq577] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]

Zeng J, Zhao XY, Cao XQ, Yan H. SCS: signal, context, and structure features for genome-wide human promoter recognition. IEEE/ACM Trans Comput Biol Bioinform 2010;7:550-562. [PMID: 20671324 DOI: 10.1109/tcbb.2008.95] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2023]

Osypov AA, Krutinin GG, Kamzolova SG. Deppdb--DNA electrostatic potential properties database: electrostatic properties of genome DNA. J Bioinform Comput Biol 2010;8:413-25. [PMID: 20556853 DOI: 10.1142/s0219720010004811] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2009] [Revised: 01/28/2010] [Accepted: 02/12/2010] [Indexed: 11/18/2022]

Kim TK, Hemberg M, Gray JM, Costa AM, Bear DM, Wu J, Harmin DA, Laptewicz M, Barbara-Haley K, Kuersten S, Markenscoff-Papadimitriou E, Kuhl D, Bito H, Worley PF, Kreiman G, Greenberg ME. Widespread transcription at neuronal activity-regulated enhancers. Nature 2010;465:182-7. [PMID: 20393465 PMCID: PMC3020079 DOI: 10.1038/nature09033] [Citation(s) in RCA: 1838] [Impact Index Per Article: 122.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2009] [Accepted: 03/25/2010] [Indexed: 01/12/2023]

Affiliation(s)

Tae-Kyung Kim Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA
Martin Hemberg Department of Ophthalmology, Children's Hospital Boston, Center for Brain Science and Swartz Center for Theoretical Neuroscience, Harvard University, 300 Longwood Avenue, Boston, MA 02115, USA
Jesse M. Gray Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA
Allen M. Costa Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA
Daniel M. Bear Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA
Jing Wu The Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, 725 North Wolfe St., Baltimore, MD 21205, USA
David A. Harmin Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA Children's Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology, 300 Longwood Avenue, Boston, MA 02115, USA
Mike Laptewicz Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA
Kellie Barbara-Haley Molecular Genetics Core facility, Children's Hospital Boston, 300 Longwood Ave, Boston, MA 02115, USA
Scott Kuersten Epicentre Biotechnologies, 726 Post Road, Madison, WI 53713, USA
Eirene Markenscoff-Papadimitriou Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA
Dietmar Kuhl Institute for Molecular and Cellular Cognition (IMCC), Center for Molecular Neurobiology (ZMNH), University Medical Center Hamburg-Eppendorf (UKE), Falkenried 94, 20251 Hamburg, Germany
Haruhiko Bito Department of Neurochemistry, Graduate School of Medicine, University of Tokyo, Bunkyo-ku, Tokyo 113-0033, Japan
Paul F. Worley The Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, 725 North Wolfe St., Baltimore, MD 21205, USA
Gabriel Kreiman Department of Ophthalmology, Children's Hospital Boston, Center for Brain Science and Swartz Center for Theoretical Neuroscience, Harvard University, 300 Longwood Avenue, Boston, MA 02115, USA
Michael E. Greenberg Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA

Dineen DG, Wilm A, Cunningham P, Higgins DG. High DNA melting temperature predicts transcription start site location in human and mouse. Nucleic Acids Res 2010;37:7360-7. [PMID: 19820114 PMCID: PMC2794178 DOI: 10.1093/nar/gkp821] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open

Gupta R, Wikramasinghe P, Bhattacharyya A, Perez FA, Pal S, Davuluri RV. Annotation of gene promoters by integrative data-mining of ChIP-seq Pol-II enrichment data. BMC Bioinformatics 2010;11 Suppl 1:S65. [PMID: 20122241 PMCID: PMC3009539 DOI: 10.1186/1471-2105-11-s1-s65] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Abstract

Background

Use of alternative gene promoters that drive widespread cell-type, tissue-type or developmental gene regulation in mammalian genomes is a common phenomenon. Chromatin immunoprecipitation methods coupled with DNA microarray (ChIP-chip) or massive parallel sequencing (ChIP-seq) are enabling genome-wide identification of active promoters in different cellular conditions using antibodies against Pol-II. However, these methods produce enrichment not only near the gene promoters but also inside the genes and other genomic regions due to the non-specificity of the antibodies used in ChIP. Further, the use of these methods is limited by their high cost and strong dependence on cellular type and context.

Methods

We trained and tested different state-of-the-art ensemble and meta classification methods for identification of Pol-II enriched promoter and Pol-II enriched non-promoter sequences, each of length 500 bp. The classification models were trained and tested on a benchmark dataset, using a set of 39 different feature variables that are based on chromatin modification signatures and various DNA sequence features. The best performing model was applied to seven published ChIP-seq Pol-II datasets to provide genome-wide annotation of mouse gene promoters.

Results

We present a novel algorithm based on supervised learning methods to discriminate promoter associated Pol-II enrichment from enrichment elsewhere in the genome in ChIP-chip/seq profiles. We accumulated a dataset of 11,773 promoter and 46,167 non-promoter sequences, each of length 500 bp, generated from RNA Pol-II ChIP-seq data of five tissues (Brain, Kidney, Liver, Lung and Spleen). We evaluated the classification models in building the best predictor and found that Bagging and Random Forest based approaches give the best accuracy. We implemented the algorithm on seven different published ChIP-seq datasets to provide a comprehensive set of promoter annotations for both protein-coding and non-coding genes in the mouse genome. The resulting annotations contain 13,413 (4,747) protein-coding (non-coding) genes with single promoters and 9,929 (1,858) protein-coding (non-coding) genes with two or more alternative promoters, and a significant number of unassigned novel promoters.

Conclusion

Our new algorithm can successfully predict promoters from the genome-wide profile of Pol-II bound regions. In addition, our algorithm performs significantly better than existing promoter prediction methods and can be applied for genome-wide predictions of Pol-II promoters.

Abeel T, Van de Peer Y, Saeys Y. Toward a gold standard for promoter prediction evaluation. Bioinformatics 2009;25:i313-20. [PMID: 19478005 PMCID: PMC2687945 DOI: 10.1093/bioinformatics/btp191] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023]

Zeng J, Zhu S, Yan H. Towards accurate human promoter recognition: a review of currently used sequence features and classification methods. Brief Bioinform 2009;10:498-508. [PMID: 19531545 DOI: 10.1093/bib/bbp027] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Narlikar L, Ovcharenko I. Identifying regulatory elements in eukaryotic genomes. Brief Funct Genomic Proteomic 2009;8:215-30. [PMID: 19498043 DOI: 10.1093/bfgp/elp014] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]

Auger H, Lamy C, Haeussler M, Khoueiry P, Lemaire P, Joly JS. Similar regulatory logic in Ciona intestinalis for two Wnt pathway modulators, ROR and SFRP-1/5. Dev Biol 2009;329:364-73. [PMID: 19248777 DOI: 10.1016/j.ydbio.2009.02.018] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2008] [Revised: 01/22/2009] [Accepted: 02/03/2009] [Indexed: 10/21/2022]

Megraw M, Pereira F, Jensen ST, Ohler U, Hatzigeorgiou AG. A transcription factor affinity-based code for mammalian transcription initiation. Genome Res 2009;19:644-56. [PMID: 19141595 DOI: 10.1101/gr.085449.108] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023]