1
Richardson MO, Eddy SR. ORFeus: a computational method to detect programmed ribosomal frameshifts and other non-canonical translation events. BMC Bioinformatics 2023; 24:471. [PMID: 38093195 PMCID: PMC10720069 DOI: 10.1186/s12859-023-05602-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Accepted: 12/05/2023] [Indexed: 12/17/2023]
Abstract
BACKGROUND In canonical protein translation, ribosomes initiate translation at a specific start codon, maintain a single reading frame throughout elongation, and terminate at the first in-frame stop codon. However, ribosomal behavior can deviate at each of these steps, sometimes in a programmed manner. Certain mRNAs contain sequence and structural elements that cause ribosomes to begin translation at alternative start codons, shift reading frame, read through stop codons, or reinitiate on the same mRNA. These processes represent important translational control mechanisms that can allow an mRNA to encode multiple functional protein products or regulate protein expression. The prevalence of these events remains uncertain, due to the difficulty of systematic detection. RESULTS We have developed a computational model to infer non-canonical translation events from ribosome profiling data. CONCLUSION ORFeus identifies known examples of alternative open reading frames and recoding events across different organisms and enables transcriptome-wide searches for novel events.
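The canonical rules stated in this abstract (initiate at a start codon, keep one reading frame, terminate at the first in-frame stop codon) can be sketched as a simple scanner. This is only an illustration of the canonical baseline, not the ORFeus model, which works from ribosome profiling data; the function name `canonical_orfs`, the `min_codons` parameter, and the restriction to ATG starts on the forward strand are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons (DNA alphabet)
START_CODON = "ATG"                  # assumption: only ATG is treated as a start


def canonical_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for canonical ORFs on the forward strand.

    An ORF begins at the first ATG seen in a frame and ends at the first
    in-frame stop codon after it. `end` is exclusive and includes the stop.
    `min_codons` counts sense codons, start codon included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # initiation
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # termination; look for the next start
    return orfs


orfs = canonical_orfs("CCATGAAATAGGG")
```

Non-canonical events such as frameshifts or stop-codon readthrough are exactly the cases a scanner like this cannot represent, since it never changes frame and always stops at the first in-frame stop.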
Affiliation(s)
- Mary O Richardson
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA
- Sean R Eddy
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA
- Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA
2
De Lise F, Strazzulli A, Iacono R, Curci N, Di Fenza M, Maurelli L, Moracci M, Cobucci-Ponzano B. Programmed Deviations of Ribosomes From Standard Decoding in Archaea. Front Microbiol 2021; 12:688061. [PMID: 34149676 PMCID: PMC8211752 DOI: 10.3389/fmicb.2021.688061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2021] [Accepted: 05/04/2021] [Indexed: 11/13/2022]
Abstract
Genetic code decoding, initially considered to be universal and immutable, is now known to be flexible. In fact, in specific genes, ribosomes deviate from the standard translational rules in a programmed way, a phenomenon globally termed recoding. Translational recoding, which has been found in all domains of life, includes a group of events occurring during gene translation, namely stop codon readthrough, programmed ± 1 frameshifting, and ribosome bypassing. These events regulate protein expression at translational level and their mechanisms are well known and characterized in viruses, bacteria and eukaryotes. In this review we summarize the current state-of-the-art of recoding in the third domain of life. In Archaea, it was demonstrated and extensively studied that translational recoding regulates the decoding of the 21st and the 22nd amino acids selenocysteine and pyrrolysine, respectively, and only one case of programmed -1 frameshifting has been reported so far in Saccharolobus solfataricus P2. However, further putative events of translational recoding have been hypothesized in other archaeal species, but not extensively studied and confirmed yet. Although this phenomenon could have some implication for the physiology and adaptation of life in extreme environments, this field is still underexplored and genes whose expression could be regulated by recoding are still poorly characterized. The study of these recoding episodes in Archaea is urgently needed.
Affiliation(s)
- Federica De Lise
- Institute of Biosciences and BioResources - National Research Council of Italy, Naples, Italy
- Andrea Strazzulli
- Department of Biology, University of Naples Federico II, Complesso Universitario di Monte S. Angelo, Naples, Italy; Task Force on Microbiome Studies, University of Naples Federico II, Naples, Italy
- Roberta Iacono
- Institute of Biosciences and BioResources - National Research Council of Italy, Naples, Italy; Department of Biology, University of Naples Federico II, Complesso Universitario di Monte S. Angelo, Naples, Italy
- Nicola Curci
- Institute of Biosciences and BioResources - National Research Council of Italy, Naples, Italy; Department of Biology, University of Naples Federico II, Complesso Universitario di Monte S. Angelo, Naples, Italy
- Mauro Di Fenza
- Institute of Biosciences and BioResources - National Research Council of Italy, Naples, Italy
- Luisa Maurelli
- Institute of Biosciences and BioResources - National Research Council of Italy, Naples, Italy
- Marco Moracci
- Institute of Biosciences and BioResources - National Research Council of Italy, Naples, Italy; Department of Biology, University of Naples Federico II, Complesso Universitario di Monte S. Angelo, Naples, Italy; Task Force on Microbiome Studies, University of Naples Federico II, Naples, Italy
3
Incipient genome erosion and metabolic streamlining for antibiotic production in a defensive symbiont. Proc Natl Acad Sci U S A 2021; 118:2023047118. [PMID: 33883280 PMCID: PMC8092579 DOI: 10.1073/pnas.2023047118] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 11/18/2022]
Abstract
Genome reduction is commonly observed in bacteria of several phyla engaging in obligate nutritional symbioses with insects. In Actinobacteria, however, little is known about the process of genome evolution, despite their importance as prolific producers of antibiotics and their increasingly recognized role as defensive partners of insects and other organisms. Here, we show that “Streptomyces philanthi,” a defensive symbiont of digger wasps, has a G+C-enriched genome in the early stages of erosion, with inactivating mutations in a large proportion of genes, causing dependency on its hosts for certain nutrients, which was validated in axenic symbiont cultures. Additionally, overexpressed catabolic and biosynthetic pathways of the bacteria inside the host indicate host–symbiont metabolic integration for streamlining and control of antibiotic production. Genome erosion is a frequently observed result of relaxed selection in insect nutritional symbionts, but it has rarely been studied in defensive mutualisms. Solitary beewolf wasps harbor an actinobacterial symbiont of the genus Streptomyces that provides protection to the developing offspring against pathogenic microorganisms. Here, we characterized the genomic architecture and functional gene content of this culturable symbiont using genomics, transcriptomics, and proteomics in combination with in vitro assays. Despite retaining a large linear chromosome (7.3 Mb), the wasp symbiont accumulated frameshift mutations in more than a third of its protein-coding genes, indicative of incipient genome erosion. Although many of the frameshifted genes were still expressed, the encoded proteins were not detected, indicating post-transcriptional regulation. 
Most pseudogenization events affected accessory genes, regulators, and transporters, but “Streptomyces philanthi” also experienced mutations in central metabolic pathways, resulting in auxotrophies for biotin, proline, and arginine that were confirmed experimentally in axenic culture. In contrast to the strong A+T bias in the genomes of most obligate symbionts, we observed a significant G+C enrichment in regions likely experiencing reduced selection. Differential expression analyses revealed that—compared to in vitro symbiont cultures—“S. philanthi” in beewolf antennae showed overexpression of genes for antibiotic biosynthesis, the uptake of host-provided nutrients and the metabolism of building blocks required for antibiotic production. Our results show unusual traits in the early stage of genome erosion in a defensive symbiont and suggest tight integration of host–symbiont metabolic pathways that effectively grants the host control over the antimicrobial activity of its bacterial partner.
4
Antonov IV. Two Cobalt Chelatase Subunits Can Be Generated from a Single chlD Gene via Programed Frameshifting. Mol Biol Evol 2020; 37:2268-2278. [PMID: 32211852 DOI: 10.1093/molbev/msaa081] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 02/06/2023]
Abstract
Magnesium chelatase chlIDH and cobalt chelatase cobNST enzymes are required for biosynthesis of (bacterio)chlorophyll and cobalamin (vitamin B12), respectively. Each enzyme consists of large, medium, and small subunits. Structural and primary sequence similarities indicate common evolutionary origin of the corresponding subunits. It has been reported earlier that some of vitamin B12 synthesizing organisms utilized unusual cobalt chelatase enzyme consisting of a large cobalt chelatase subunit (cobN) along with a medium (chlD) and a small (chlI) subunits of magnesium chelatase. In attempt to understand the nature of this phenomenon, we analyzed >1,200 diverse genomes of cobalamin and/or chlorophyll producing prokaryotes. We found that, surprisingly, genomes of many cobalamin producers contained cobN and chlD genes only; a small subunit gene was absent. Further on, we have discovered a diverse group of chlD genes with functional programed ribosomal frameshifting signals. Given a high similarity between the small subunit and the N-terminal part of the medium subunit, we proposed that programed translational frameshifting may allow chlD mRNA to produce both subunits. Indeed, in genomes where genes for small subunits were absent, we observed statistically significant enrichment of programed frameshifting signals in chlD genes. Interestingly, the details of the frameshifting mechanisms producing small and medium subunits from a single chlD gene could be prokaryotic taxa specific. All over, this programed frameshifting phenomenon was observed to be highly conserved and present in both bacteria and archaea.
Affiliation(s)
- Ivan V Antonov
- Institute of Bioengineering, Federal Research Centre Fundamentals of Biotechnology, Moscow, Russia
- Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia
5
Genome-Scale Transcription-Translation Mapping Reveals Features of Zymomonas mobilis Transcription Units and Promoters. mSystems 2020; 5:5/4/e00250-20. [PMID: 32694125 PMCID: PMC7566282 DOI: 10.1128/msystems.00250-20] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 12/29/2022]
Abstract
Efforts to rationally engineer synthetic pathways in Zymomonas mobilis are impeded by a lack of knowledge and tools for predictable and quantitative programming of gene regulation at the transcriptional, posttranscriptional, and posttranslational levels. With the detailed functional characterization of the Z. mobilis genome presented in this work, we provide crucial knowledge for the development of synthetic genetic parts tailored to Z. mobilis. This information is vital as researchers continue to develop Z. mobilis for synthetic biology applications. Our methods and statistical analyses also provide ways to rapidly advance the understanding of poorly characterized bacteria via empirical data that enable the experimental validation of sequence-based prediction for genome characterization and annotation. Zymomonas mobilis is an ethanologenic alphaproteobacterium with promise for the industrial conversion of renewable plant biomass into fuels and chemical bioproducts. Limited functional annotation of the Z. mobilis genome is a current barrier to both fundamental studies of Z. mobilis and its development as a synthetic biology chassis. To gain insight, we collected sample-matched multiomics data, including RNA sequencing (RNA-seq), transcription start site (TSS) sequencing (TSS-seq), termination sequencing (term-seq), ribosome profiling, and label-free shotgun proteomic mass spectrometry, across different growth conditions and used these data to improve annotation and assign functional sites in the Z. mobilis genome. Proteomics and ribosome profiling informed revisions of protein-coding genes, which included 44 start codon changes and 42 added proteins. We developed statistical methods for annotating transcript 5′ and 3′ ends, enabling the identification of 3,940 TSSs and their corresponding promoters and 2,091 transcription termination sites, which were distinguished from RNA processing sites by the lack of an adjacent RNA 5′ end. Our results revealed that Z. 
mobilis σA −35 and −10 promoter elements closely resemble canonical Escherichia coli −35 and −10 elements, with one notable exception: the Z. mobilis −10 element lacks the highly conserved −7 thymine observed in E. coli and other previously characterized σA promoters. The σA promoters of another alphaproteobacterium, Caulobacter crescentus, similarly lack the conservation of −7 thymine in their −10 elements. Our results anchor the development of Z. mobilis as a platform for synthetic biology and establish strategies for empirical genome annotation that can complement purely computational methods.
6
Suvorova YM, Korotkova MA, Skryabin KG, Korotkov EV. Search for potential reading frameshifts in cds from Arabidopsis thaliana and other genomes. DNA Res 2019; 26:157-170. [PMID: 30726896 PMCID: PMC6476729 DOI: 10.1093/dnares/dsy046] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 04/27/2018] [Accepted: 12/07/2018] [Indexed: 01/01/2023]
Abstract
A new mathematical method for potential reading frameshift detection in protein-coding sequences (cds) was developed. The algorithm is adjusted to the triplet periodicity of each analysed sequence using dynamic programming and a genetic algorithm. This does not require any preliminary training. Using the developed method, cds from the Arabidopsis thaliana genome were analysed. In total, the algorithm found 9,930 sequences containing one or more potential reading frameshift(s). This is ∼21% of all analysed sequences of the genome. The Type I and Type II error rates were estimated as 11% and 30%, respectively. Similar results were obtained for the genomes of Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Rattus norvegicus and Xenopus tropicalis. Also, the developed algorithm was tested on 17 bacterial genomes. We compared our results with the previously obtained data on the search for potential reading frameshifts in these genomes. This study discussed the possibility that the reading frameshift seems like a relatively frequently encountered mutation; and this mutation could participate in the creation of new genes and proteins.
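The idea of using triplet periodicity to locate a frame change can be illustrated with a small sketch. It assumes one simple representation, a 4×3 matrix counting each nucleotide at each codon position, and compares the region after a suspected shift with the region before it under the three possible phases. The function names (`periodicity_matrix`, `similarity`, `best_phase`) and the dot-product score are assumptions for this example; the paper's actual method combines dynamic programming with a genetic algorithm and is not reproduced here.

```python
def periodicity_matrix(seq, phase=0):
    """Count each base at each of the three codon positions.

    Position i of `seq` is assigned to column (i - phase) % 3.
    Characters other than A, C, G, T are ignored.
    """
    counts = {base: [0, 0, 0] for base in "ACGT"}
    for i, base in enumerate(seq.upper()):
        if base in counts:
            counts[base][(i - phase) % 3] += 1
    return counts


def similarity(m1, m2):
    """Dot product of two periodicity matrices (higher means more alike)."""
    return sum(a * b for base in "ACGT" for a, b in zip(m1[base], m2[base]))


def best_phase(upstream, downstream):
    """Phase (0, 1, or 2) at which `downstream` best matches `upstream`."""
    ref = periodicity_matrix(upstream)
    return max(range(3),
               key=lambda p: similarity(ref, periodicity_matrix(downstream, p)))


upstream = "ATG" * 5
downstream = ("ATG" * 5)[1:]  # same repeat with one base deleted at the junction
shift = best_phase(upstream, downstream)
```

A nonzero `shift` means the downstream region carries the same periodicity pattern in a different phase, which is the signature a potential reading frameshift would leave. Real coding sequences are far noisier than this toy repeat, so a practical detector needs a statistical threshold rather than a plain maximum.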
Affiliation(s)
- Y M Suvorova
- Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia
- M A Korotkova
- National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), Moscow, Russia
- K G Skryabin
- Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia
- E V Korotkov
- Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia; National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), Moscow, Russia
7
Suvorova YM, Pugacheva VM, Korotkov EV. A Database of Potential Reading Frame Shifts in Coding Sequences from Different Eukaryotic Genomes. Biophysics (Nagoya-shi) 2019. [DOI: 10.1134/s0006350919030217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
8
Du N, Sun Y. Improve homology search sensitivity of PacBio data by correcting frameshifts. Bioinformatics 2017; 32:i529-i537. [PMID: 27587671 DOI: 10.1093/bioinformatics/btw458] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 11/13/2022]
Abstract
MOTIVATION Single-molecule, real-time sequencing (SMRT) developed by Pacific BioSciences produces longer reads than secondary generation sequencing technologies such as Illumina. The long read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and identify gene isoforms with higher accuracy in transcriptomic sequencing. However, PacBio data has high sequencing error rate and most of the errors are insertion or deletion errors. During alignment-based homology search, insertion or deletion errors in genes will cause frameshifts and may only lead to marginal alignment scores and short alignments. As a result, it is hard to distinguish true alignments from random alignments and the ambiguity will incur errors in structural and functional annotation. Existing frameshift correction tools are designed for data with much lower error rate and are not optimized for PacBio data. As an increasing number of groups are using SMRT, there is an urgent need for dedicated homology search tools for PacBio data. RESULTS In this work, we introduce Frame-Pro, a profile homology search tool for PacBio reads. Our tool corrects sequencing errors and also outputs the profile alignments of the corrected sequences against characterized protein families. We applied our tool to both simulated and real PacBio data. The results showed that our method enables more sensitive homology search, especially for PacBio data sets of low sequencing coverage. In addition, we can correct more errors when comparing with a popular error correction tool that does not rely on hybrid sequencing. AVAILABILITY AND IMPLEMENTATION The source code is freely available at https://sourceforge.net/projects/frame-pro/ CONTACT yannisun@msu.edu.
Affiliation(s)
- Nan Du
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Yanni Sun
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
9
Andreevskaya M, Hultman J, Johansson P, Laine P, Paulin L, Auvinen P, Björkroth J. Complete genome sequence of Leuconostoc gelidum subsp. gasicomitatum KG16-1, isolated from vacuum-packaged vegetable sausages. Stand Genomic Sci 2016; 11:40. [PMID: 27274361 PMCID: PMC4895993 DOI: 10.1186/s40793-016-0164-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 01/19/2016] [Accepted: 05/31/2016] [Indexed: 11/10/2022]
Abstract
Leuconostoc gelidum subsp. gasicomitatum is a predominant lactic acid bacterium (LAB) in spoilage microbial communities of different kinds of modified-atmosphere packaged (MAP) food products. So far, only one genome sequence of a poultry-originating type strain of this bacterium (LMG 18811(T)) has been available. In the current study, we present the completely sequenced and functionally annotated genome of strain KG16-1 isolated from a vegetable-based product. In addition, six other vegetable-associated strains were sequenced to study possible "niche" specificity suggested by recent multilocus sequence typing. The genome of strain KG16-1 consisted of one circular chromosome and three plasmids, which together contained 2,035 CDSs. The chromosome carried at least three prophage regions and one of the plasmids encoded a galactan degradation cluster, which might provide a survival advantage in plant-related environments. The genome comparison with LMG 18811(T) and six other vegetable strains suggests no major differences between the meat- and vegetable-associated strains that would explain their "niche" specificity. Finally, the comparison with the genomes of other leuconostocs highlights the distribution of functionally interesting genes across the L. gelidum strains and the genus Leuconostoc.
Affiliation(s)
- Margarita Andreevskaya
- Institute of Biotechnology, University of Helsinki, Viikinkaari 5D, 00790 Helsinki, Finland
- Jenni Hultman
- Department of Food Hygiene and Environmental Health, University of Helsinki, Agnes Sjöbergin katu 2, 00790 Helsinki, Finland
- Per Johansson
- Department of Food Hygiene and Environmental Health, University of Helsinki, Agnes Sjöbergin katu 2, 00790 Helsinki, Finland
- Pia Laine
- Institute of Biotechnology, University of Helsinki, Viikinkaari 5D, 00790 Helsinki, Finland
- Lars Paulin
- Institute of Biotechnology, University of Helsinki, Viikinkaari 5D, 00790 Helsinki, Finland
- Petri Auvinen
- Institute of Biotechnology, University of Helsinki, Viikinkaari 5D, 00790 Helsinki, Finland
- Johanna Björkroth
- Department of Food Hygiene and Environmental Health, University of Helsinki, Agnes Sjöbergin katu 2, 00790 Helsinki, Finland
10
Tang S, Lomsadze A, Borodovsky M. Identification of protein coding regions in RNA transcripts. Nucleic Acids Res 2015; 43:e78. [PMID: 25870408 PMCID: PMC4499116 DOI: 10.1093/nar/gkv227] [Citation(s) in RCA: 290] [Impact Index Per Article: 29.0] [Received: 06/19/2014] [Accepted: 03/05/2015] [Indexed: 01/08/2023]
Abstract
Massive parallel sequencing of RNA transcripts by next-generation technology (RNA-Seq) generates critically important data for eukaryotic gene discovery. Gene finding in transcripts can be done by statistical (alignment-free) as well as by alignment-based methods. We describe a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions in RNA transcripts. The algorithm parameters are estimated by unsupervised training which makes unnecessary manually curated preparation of training sets. We demonstrate that (i) the unsupervised training is robust with respect to the presence of transcripts assembly errors and (ii) the accuracy of GeneMarkS-T in identifying protein-coding regions and, particularly, in predicting translation initiation sites in modelled as well as in assembled transcripts compares favourably to other existing methods.
Affiliation(s)
- Shiyuyun Tang
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Alexandre Lomsadze
- Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Mark Borodovsky
- Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA; School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA; Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA 30332, USA; Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, Moscow, Russia
11
Tataru P, Sand A, Hobolth A, Mailund T, Pedersen CNS. Algorithms for Hidden Markov Models restricted to occurrences of regular expressions. Biology 2013; 2:1282-95. [PMID: 24833225 PMCID: PMC4009796 DOI: 10.3390/biology2041282] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/28/2013] [Revised: 10/08/2013] [Accepted: 11/05/2013] [Indexed: 11/24/2022]
Abstract
Hidden Markov Models (HMMs) are widely used probabilistic models, particularly for annotating sequential data with an underlying hidden structure. Patterns in the annotation are often more relevant to study than the hidden structure itself. A typical HMM analysis consists of annotating the observed data using a decoding algorithm and analyzing the annotation to study patterns of interest. For example, given an HMM modeling genes in DNA sequences, the focus is on occurrences of genes in the annotation. In this paper, we define a pattern through a regular expression and present a restriction of three classical algorithms to take the number of occurrences of the pattern in the hidden sequence into account. We present a new algorithm to compute the distribution of the number of pattern occurrences, and we extend the two most widely used existing decoding algorithms to employ information from this distribution. We show experimentally that the expectation of the distribution of the number of pattern occurrences gives a highly accurate estimate, while the typical procedure can be biased in the sense that the identified number of pattern occurrences does not correspond to the true number. We furthermore show that using this distribution in the decoding algorithms improves the predictive power of the model.
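The "typical HMM analysis" this abstract refers to, decoding first and then counting pattern occurrences in the annotation, can be sketched as follows. This shows only that baseline procedure, not the restricted algorithms the paper introduces. The two-state model (N for non-coding, C for coding), its probabilities, the observation alphabet, and the pattern `C+` are all toy assumptions chosen for this example.

```python
import math
import re


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for `obs`, computed in log space."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
               for s in states}]
    back = []  # back[t - 1][s] is the best predecessor of state s at time t
    for symbol in obs[1:]:
        column, pointers = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda r: scores[-1][r] + math.log(trans_p[r][s]))
            column[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(column)
        back.append(pointers)
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):  # trace back from the final state
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


# Toy model (all values are assumptions for illustration).
states = ["N", "C"]
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.8, "C": 0.2}, "C": {"N": 0.2, "C": 0.8}}
emit_p = {"N": {"a": 0.9, "b": 0.1}, "C": {"a": 0.1, "b": 0.9}}

path = viterbi("aabbbaabb", states, start_p, trans_p, emit_p)
n_runs = len(re.findall(r"C+", path))  # count maximal runs of the C state
```

Counting on a single decoded path is exactly the step the abstract describes as potentially biased: the count reflects one path rather than the distribution over all paths, which is the gap the paper's algorithms address.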
Affiliation(s)
- Paula Tataru
- Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark.
- Andreas Sand
- Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark.
- Asger Hobolth
- Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark.
- Thomas Mailund
- Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark.
- Christian N S Pedersen
- Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark.
12
Antonov I, Coakley A, Atkins JF, Baranov PV, Borodovsky M. Identification of the nature of reading frame transitions observed in prokaryotic genomes. Nucleic Acids Res 2013; 41:6514-30. [PMID: 23649834 PMCID: PMC3711429 DOI: 10.1093/nar/gkt274] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 11/12/2012] [Revised: 02/22/2013] [Accepted: 03/22/2013] [Indexed: 12/11/2022]
Abstract
Our goal was to identify evolutionary conserved frame transitions in protein coding regions and to uncover an underlying functional role of these structural aberrations. We used the ab initio frameshift prediction program, GeneTack, to detect reading frame transitions in 206 991 genes (fs-genes) from 1106 complete prokaryotic genomes. We grouped 102 731 fs-genes into 19 430 clusters based on sequence similarity between protein products (fs-proteins) as well as conservation of predicted position of the frameshift and its direction. We identified 4010 pseudogene clusters and 146 clusters of fs-genes apparently using recoding (local deviation from using standard genetic code) due to possessing specific sequence motifs near frameshift positions. Particularly interesting was finding of a novel type of organization of the dnaX gene, where recoding is required for synthesis of the longer subunit, τ. We selected 20 clusters of predicted recoding candidates and designed a series of genetic constructs with a reporter gene or affinity tag whose expression would require a frameshift event. Expression of the constructs in Escherichia coli demonstrated enrichment of the set of candidates with sequences that trigger genuine programmed ribosomal frameshifting; we have experimentally confirmed four new families of programmed frameshifts.
Affiliation(s)
- Ivan Antonov
- School of Computational Science and Engineering at Georgia Tech, Atlanta, GA 30332, USA, Department of Biochemistry, University College Cork, Ireland, Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region 141700, Russia, Center for Bioinformatics and Computational Genomics at Georgia Tech and Joint Georgia Tech and Emory Wallace H Coulter Department of Biomedical Engineering, Atlanta, GA 30332, USA
- Arthur Coakley
- School of Computational Science and Engineering at Georgia Tech, Atlanta, GA 30332, USA, Department of Biochemistry, University College Cork, Ireland, Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region 141700, Russia, Center for Bioinformatics and Computational Genomics at Georgia Tech and Joint Georgia Tech and Emory Wallace H Coulter Department of Biomedical Engineering, Atlanta, GA 30332, USA
- John F. Atkins
- School of Computational Science and Engineering at Georgia Tech, Atlanta, GA 30332, USA, Department of Biochemistry, University College Cork, Ireland, Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region 141700, Russia, Center for Bioinformatics and Computational Genomics at Georgia Tech and Joint Georgia Tech and Emory Wallace H Coulter Department of Biomedical Engineering, Atlanta, GA 30332, USA
- Pavel V. Baranov
- School of Computational Science and Engineering at Georgia Tech, Atlanta, GA 30332, USA, Department of Biochemistry, University College Cork, Ireland, Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region 141700, Russia, Center for Bioinformatics and Computational Genomics at Georgia Tech and Joint Georgia Tech and Emory Wallace H Coulter Department of Biomedical Engineering, Atlanta, GA 30332, USA
- Mark Borodovsky
- School of Computational Science and Engineering at Georgia Tech, Atlanta, GA 30332, USA, Department of Biochemistry, University College Cork, Ireland, Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region 141700, Russia, Center for Bioinformatics and Computational Genomics at Georgia Tech and Joint Georgia Tech and Emory Wallace H Coulter Department of Biomedical Engineering, Atlanta, GA 30332, USA
13
Liu Y, Guo J, Hu G, Zhu H. Gene prediction in metagenomic fragments based on the SVM algorithm. BMC Bioinformatics 2013; 14 Suppl 5:S12. [PMID: 23735199 PMCID: PMC3622649 DOI: 10.1186/1471-2105-14-s5-s12] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.3] [Indexed: 11/25/2022]
Abstract
Background Metagenomic sequencing is becoming a powerful technology for exploring microorganisms from various environments, such as the human body, without isolation and cultivation. Accurately identifying genes from metagenomic fragments is one of the most fundamental issues. Results In this article, we present a novel gene prediction method named MetaGUN for metagenomic fragments based on a machine learning approach of SVM. It implements a three-stage strategy to predict genes. First, it classifies input fragments into phylogenetic groups by a k-mer based sequence binning method. Then, protein-coding sequences are identified for each group independently with SVM classifiers that integrate entropy density profiles (EDP) of codon usage, translation initiation site (TIS) scores and open reading frame (ORF) length as input patterns. Finally, the TISs are adjusted by employing a modified version of MetaTISA. To identify protein-coding sequences, MetaGUN builds the universal module and the novel module. The former is based on a set of representative species, while the latter is designed to find potential functional DNA sequences with conserved domains. Conclusions Comparisons on artificial shotgun fragments with multiple current metagenomic gene finders show that MetaGUN predicts better results on both 3' and 5' ends of genes with fragments of various lengths. In particular, it makes the most reliable predictions among these methods. As an application, MetaGUN was used to predict genes for two samples of human gut microbiome. It identifies thousands of additional genes with significant evidence. Further analysis indicates that MetaGUN tends to predict more potential novel genes than other current metagenomic gene finders.
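The first stage named in this abstract, binning fragments by k-mer composition, rests on turning each fragment into a normalized k-mer frequency vector. The Python sketch below shows only that feature step. The function name `kmer_profile`, the default k = 4, and the rule of skipping k-mers that contain ambiguous bases are illustrative assumptions, not details taken from MetaGUN.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies in lexicographic ACGT order.

    Hypothetical helper: k-mers containing characters other than
    A, C, G, T are ignored (an assumption made for this sketch).
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Normalize over valid k-mers only; avoid division by zero on empty input.
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGT", k=2)
    print(len(profile))  # one entry per possible 2-mer
```

A binning method would then compare such vectors between fragments or against reference profiles; how MetaGUN does this is not specified in the abstract.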
Affiliation(s)
- Yongchu Liu
- State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, China
|
14
|
Antonov I, Baranov P, Borodovsky M. GeneTack database: genes with frameshifts in prokaryotic genomes and eukaryotic mRNA sequences. Nucleic Acids Res 2012; 41:D152-6. [PMID: 23161689 PMCID: PMC3531167 DOI: 10.1093/nar/gks1062] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022]
Abstract
Database annotations of prokaryotic genomes and eukaryotic mRNA sequences pay relatively little attention to frame transitions that disrupt protein-coding genes. Frame transitions (frameshifts) could be caused by sequencing errors or indel mutations inside protein-coding regions. Other observed frameshifts are related to recoding events (that evolved to control expression of some genes). Earlier, we developed an algorithm and software program, GeneTack, for ab initio frameshift finding in intronless genes. Here, we describe a database (freely available at http://topaz.gatech.edu/GeneTack/db.html) containing genes with frameshifts (fs-genes) predicted by GeneTack. The database includes 206 991 fs-genes from 1106 complete prokaryotic genomes and 45 295 frameshifts predicted in mRNA sequences from 100 eukaryotic genomes. The whole set of fs-genes was grouped into clusters based on sequence similarity between fs-proteins (conceptually translated fs-genes), conservation of the frameshift position and frameshift direction (−1, +1). The fs-genes can be retrieved by similarity search to a given query sequence via a web interface, by fs-gene cluster browsing, etc. Clusters of fs-genes are characterized with respect to their likely origin, such as pseudogenization, phase variation, etc. The largest clusters contain fs-genes with programmed frameshifts (related to recoding events).
Affiliation(s)
- Ivan Antonov
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
|
15
|
Tang S, Antonov I, Borodovsky M. MetaGeneTack: ab initio detection of frameshifts in metagenomic sequences. Bioinformatics 2012; 29:114-6. [PMID: 23129300 PMCID: PMC3530910 DOI: 10.1093/bioinformatics/bts636] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 11/23/2022]
Abstract
Summary: Frameshift (FS) prediction is important for analysis and biological interpretation of metagenomic sequences. Since the genomic context of a short metagenomic sequence is rarely known, there is not enough data available to estimate parameters of species-specific statistical models of protein-coding and non-coding regions. The challenge of ab initio FS detection is, therefore, twofold: (i) to find a way to infer the necessary model parameters and (ii) to identify positions of frameshifts (if any). Here we describe a new tool, MetaGeneTack, which uses a heuristic method to estimate parameters of sequence models used in the FS detection algorithm. It is shown on multiple test sets that the FS detection performance of MetaGeneTack is comparable to or better than that of the earlier program FragGeneScan. Availability and implementation: MetaGeneTack is available as a web server at http://exon.gatech.edu/GeneTack/cgi/metagenetack.cgi. Academic users can download a standalone version of the program from http://exon.gatech.edu/license_download.cgi. Contact: borodovsky@gatech.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shiyuyun Tang
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332, USA
|
16
|
Teeling H, Glöckner FO. Current opportunities and challenges in microbial metagenome analysis--a bioinformatic perspective. Brief Bioinform 2012; 13:728-42. [PMID: 22966151 PMCID: PMC3504927 DOI: 10.1093/bib/bbs039] [Citation(s) in RCA: 123] [Impact Index Per Article: 9.5] [Received: 03/30/2012] [Accepted: 06/09/2012] [Indexed: 12/21/2022]
Abstract
Metagenomics has become an indispensable tool for studying the diversity and metabolic potential of environmental microbes, whose bulk is as yet non-cultivable. Continual progress in next-generation sequencing allows for generating increasingly large metagenomes and studying multiple metagenomes over time or space. Recently, a new type of holistic ecosystem study has emerged that seeks to combine metagenomics with biodiversity, meta-expression and contextual data. Such 'ecosystems biology' approaches bear the potential to not only advance our understanding of environmental microbes to a new level but also impose challenges due to increasing data complexities, in particular with respect to bioinformatic post-processing. This mini review aims to address selected opportunities and challenges of modern metagenomics from a bioinformatics perspective and hopefully will serve as a useful resource for microbial ecologists and bioinformaticians alike.
|
17
|
Trimble WL, Keegan KP, D'Souza M, Wilke A, Wilkening J, Gilbert J, Meyer F. Short-read reading-frame predictors are not created equal: sequence error causes loss of signal. BMC Bioinformatics 2012; 13:183. [PMID: 22839106 PMCID: PMC3526449 DOI: 10.1186/1471-2105-13-183] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 03/12/2012] [Accepted: 07/13/2012] [Indexed: 11/17/2022]
Abstract
Background Gene prediction algorithms (or gene callers) are an essential tool for analyzing shotgun nucleic acid sequence data. Gene prediction is a ubiquitous step in sequence analysis pipelines; it reduces the volume of data by identifying the most likely reading frame for a fragment, permitting the out-of-frame translations to be ignored. In this study we evaluate five widely used ab initio gene-calling algorithms (FragGeneScan, MetaGeneAnnotator, MetaGeneMark, Orphelia, and Prodigal) for accuracy on short (75–1000 bp) fragments containing sequence error from previously published artificial data and “real” metagenomic datasets. Results While gene prediction tools have similar accuracies predicting genes on error-free fragments, in the presence of sequencing errors considerable differences between tools become evident. For error-containing short reads, FragGeneScan finds more prokaryotic coding regions than do MetaGeneAnnotator, MetaGeneMark, Orphelia, or Prodigal. This improved detection of genes in error-containing fragments, however, comes at the cost of much lower (50%) specificity and overprediction of genes in noncoding regions. Conclusions Ab initio gene callers offer a significant reduction in the computational burden of annotating individual nucleic acid reads and are used in many metagenomic annotation systems. For predicting reading frames on raw reads, we find the hidden Markov model approach in FragGeneScan is more sensitive than other gene prediction tools, while Prodigal, MetaGeneAnnotator (MGA), and MetaGeneMark (MGM) are better suited for higher-quality sequences such as assembled contigs.
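To make concrete what "identifying the most likely reading frame for a fragment" involves, the Python sketch below scores each of the six reading frames by its longest stop-free run of codons and picks the best. This is a toy heuristic chosen for illustration. It is not how FragGeneScan, Prodigal, or the other evaluated tools work (they use statistical models of coding sequence), and the names `longest_open_run` and `best_frame` are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_open_run(seq, frame):
    """Length, in codons, of the longest stop-free run in one frame."""
    best = run = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best


def best_frame(seq):
    """Return (strand, frame) with the longest stop-free run of codons."""
    seq = seq.upper()
    scores = {}
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            scores[(strand, frame)] = longest_open_run(s, frame)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Synthetic fragment built so that only frame 0 on the forward strand is open.
    fragment = "CTAGCTAGTCAG" * 5
    print(best_frame(fragment))
```

A single inserted or deleted base moves the open frame partway through a read, which is why a frame caller that does not model sequencing errors loses signal on error-containing fragments.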
Affiliation(s)
- William L Trimble
- Computation Institute, University of Chicago, Chicago, IL 60637, USA.
|
18
|
Liu Z, Frigaard NU, Vogl K, Iino T, Ohkuma M, Overmann J, Bryant DA. Complete Genome of Ignavibacterium album, a Metabolically Versatile, Flagellated, Facultative Anaerobe from the Phylum Chlorobi. Front Microbiol 2012; 3:185. [PMID: 22661972 PMCID: PMC3362086 DOI: 10.3389/fmicb.2012.00185] [Citation(s) in RCA: 91] [Impact Index Per Article: 7.0] [Received: 03/17/2012] [Accepted: 05/04/2012] [Indexed: 11/13/2022]
Abstract
Prior to the recent discovery of Ignavibacterium album (I. album), anaerobic photoautotrophic green sulfur bacteria (GSB) were the only members of the bacterial phylum Chlorobi that had been grown axenically. In contrast to GSB, sequence analysis of the 3.7-Mbp genome of I. album shows that this recently described member of the phylum Chlorobi is a chemoheterotroph with a versatile metabolism. I. album lacks genes for photosynthesis and sulfur oxidation but has a full set of genes for flagella and chemotaxis. The occurrence of genes for multiple electron transfer complexes suggests that I. album is capable of organoheterotrophy under both oxic and anoxic conditions. The occurrence of genes encoding enzymes for CO2 fixation as well as other enzymes of the reductive TCA cycle suggests that mixotrophy may be possible under certain growth conditions. However, known biosynthetic pathways for several amino acids are incomplete; this suggests that I. album is dependent upon exogenous sources of these metabolites or employs novel biosynthetic pathways. Comparisons of I. album and other members of the phylum Chlorobi suggest that the physiology of the ancestors of this phylum might have been quite different from that of modern GSB.
Affiliation(s)
- Zhenfeng Liu
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
|
19
|
Korotkova MA, Kudryashov NA, Korotkov EV. An approach for searching insertions in bacterial genes leading to the phase shift of triplet periodicity. Genomics Proteomics Bioinformatics 2012; 9:158-70. [PMID: 22196359 PMCID: PMC5054449 DOI: 10.1016/s1672-0229(11)60019-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 03/14/2011] [Accepted: 08/02/2011] [Indexed: 11/28/2022]
Abstract
The concept of the phase shift of triplet periodicity (TP) was used for searching potential DNA insertions in genes from 17 bacterial genomes. A mathematical algorithm for detection of these insertions has been developed. This approach can detect potential insertions and deletions with lengths that are not multiples of three bases, especially insertions of relatively large DNA fragments (>100 bases). A new similarity measure between triplet matrices was employed to improve the sensitivity for detecting the TP phase shift. Sequences of 17,220 bacterial genes, each consisting of more than 1,200 bases, were analyzed, and the presence of a TP phase shift has been shown in ~16% of analyzed genes (2,809 genes), which is about 4 times more than that detected in our previous work. We propose that shifts of the TP phase may indicate shifts of the reading frame in genes after insertions of DNA fragments with lengths that are not multiples of three bases. A relationship between the phase shifts of TP and the frame shifts in genes is discussed.
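The idea of a triplet-periodicity phase shift can be illustrated with a small Python sketch: build a 3x4 matrix of base frequencies by codon position on each side of a candidate breakpoint, then find which cyclic rotation of the downstream matrix best matches the upstream one. The squared Euclidean distance used here is a stand-in chosen for simplicity, not the similarity measure of the paper, and the names `triplet_matrix` and `phase_shift` are hypothetical.

```python
BASES = "ACGT"


def triplet_matrix(seq, start):
    """3x4 base-frequency matrix; rows indexed by (global position mod 3)."""
    counts = [[0] * 4 for _ in range(3)]
    for i, base in enumerate(seq):
        col = BASES.find(base)
        if col >= 0:  # skip ambiguous characters
            counts[(start + i) % 3][col] += 1
    return [[c / (sum(row) or 1) for c in row] for row in counts]


def squared_distance(m1, m2):
    """Sum of squared differences between two 3x4 matrices."""
    return sum((a - b) ** 2 for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))


def phase_shift(seq, breakpoint, window):
    """Cyclic row shift (0, 1 or 2) that best aligns the two windows."""
    left_start = max(0, breakpoint - window)
    left = triplet_matrix(seq[left_start:breakpoint], left_start)
    right = triplet_matrix(seq[breakpoint:breakpoint + window], breakpoint)

    def score(shift):
        rotated = [right[(r + shift) % 3] for r in range(3)]
        return squared_distance(left, rotated)

    return min(range(3), key=score)


if __name__ == "__main__":
    gene = "ACG" * 20  # toy sequence with perfect triplet periodicity
    print(phase_shift(gene + "T" + gene, 60, 60))  # one inserted base
```

In this toy setting the returned shift equals the insertion length modulo 3, so insertions whose length is a multiple of three leave the phase unchanged, consistent with the abstract's focus on lengths that are not multiples of three.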
Affiliation(s)
- Maria A. Korotkova
- National University of Nuclear Investigations (MIFI), Moscow 115409, Russia
| | | | - Eugene V. Korotkov
- National University of Nuclear Investigations (MIFI), Moscow 115409, Russia
- Centre of Bioengineering, Russian Academy of Sciences, Moscow 117312, Russia
- Corresponding author.
|
20
|
Klimke W, O'Donovan C, White O, Brister JR, Clark K, Fedorov B, Mizrachi I, Pruitt KD, Tatusova T. Solving the Problem: Genome Annotation Standards before the Data Deluge. Stand Genomic Sci 2011; 5:168-93. [PMID: 22180819 PMCID: PMC3236044 DOI: 10.4056/sigs.2084864] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.2] [Indexed: 11/20/2022]
Abstract
The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.
|
21
|
Wóycicki R, Witkowicz J, Gawroński P, Dąbrowska J, Lomsadze A, Pawełkowicz M, Siedlecka E, Yagi K, Pląder W, Seroczyńska A, Śmiech M, Gutman W, Niemirowicz-Szczytt K, Bartoszewski G, Tagashira N, Hoshi Y, Borodovsky M, Karpiński S, Malepszy S, Przybecki Z. The genome sequence of the North-European cucumber (Cucumis sativus L.) unravels evolutionary adaptation mechanisms in plants. PLoS One 2011; 6:e22728. [PMID: 21829493 PMCID: PMC3145757 DOI: 10.1371/journal.pone.0022728] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.0] [Received: 11/18/2010] [Accepted: 07/05/2011] [Indexed: 01/01/2023]
Abstract
Cucumber (Cucumis sativus L.), a widely cultivated crop, originated in the Eastern Himalayas, and its secondary domestication regions include highly divergent climate conditions, e.g. temperate and subtropical. We wanted to uncover adaptive genome differences between the cucumber cultivars and the evolutionary molecular mechanisms that regulate genetic adaptation of plants to different ecosystems and organism biodiversity. Here we present the draft genome sequence of the North-European Borszczagowski cultivar (line B10) of Cucumis sativus and comparative genomics studies with the known genomes of C. sativus (Chinese cultivar Chinese Long, line 9930), Arabidopsis thaliana, Populus trichocarpa and Oryza sativa. Cucumber genomes show extensive chromosomal rearrangements, distinct differences in the quantity of particular genes (e.g. those involved in photosynthesis, respiration, sugar metabolism, chlorophyll degradation, regulation of gene expression, photooxidative stress tolerance, tolerance of higher non-optimal temperatures and ammonium ion assimilation) as well as in the distributions of abscisic acid-, dehydration- and ethylene-responsive cis-regulatory elements (CREs) in promoters of orthologous groups of genes, which lead to the specific adaptation features. Abscisic acid treatment of non-acclimated Arabidopsis and C. sativus seedlings induced moderate freezing tolerance in Arabidopsis but not in C. sativus. This experiment, together with analysis of abscisic acid-specific CRE distributions, gives a clue as to why C. sativus is much more susceptible to moderate freezing stresses than A. thaliana. Comparative analysis of all five genomes showed that each species and/or cultivar has a specific profile of CRE content in promoters of orthologous genes. Our results constitute a substantial and original resource for basic and applied research on environmental adaptations of plants, which could facilitate creation of new crops with improved growth and yield in divergent conditions.
MESH Headings
- Adaptation, Physiological
- Chromosome Mapping
- Chromosomes, Artificial, Bacterial
- Chromosomes, Plant/genetics
- Cucumis sativus/genetics
- DNA, Plant/genetics
- Evolution, Molecular
- Gene Expression Regulation, Plant
- Genes, Plant
- Genome, Plant
- Polymerase Chain Reaction
- Promoter Regions, Genetic/genetics
- Regulatory Sequences, Nucleic Acid
- Sequence Analysis, DNA
Affiliation(s)
- Rafał Wóycicki
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Justyna Witkowicz
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Piotr Gawroński
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Joanna Dąbrowska
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Alexandre Lomsadze
- Center for Bioinformatics and Computational Genomics, Joint Wallace H. Coulter Georgia Tech and Emory Department of Biomedical Engineering, School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America
| | - Magdalena Pawełkowicz
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Ewa Siedlecka
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Kohei Yagi
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Wojciech Pląder
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Anna Seroczyńska
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Mieczysław Śmiech
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Wojciech Gutman
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Katarzyna Niemirowicz-Szczytt
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Grzegorz Bartoszewski
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Norikazu Tagashira
- Department of Living Design and Information Science, Faculty of Human Development, Hiroshima Jogakuin University, Higashi-ku, Japan
| | - Yoshikazu Hoshi
- Department of Plant Science, Tokai University, Minamiaso-mura, Kumamoto, Japan
| | - Mark Borodovsky
- Center for Bioinformatics and Computational Genomics, Joint Wallace H. Coulter Georgia Tech and Emory Department of Biomedical Engineering, School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America
| | - Stanisław Karpiński
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Stefan Malepszy
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
| | - Zbigniew Przybecki
- Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture, Warsaw University of Life Sciences - SGGW, Nowoursynowska, Warsaw, Poland
|
22
|
Sharma V, Firth AE, Antonov I, Fayet O, Atkins JF, Borodovsky M, Baranov PV. A pilot study of bacterial genes with disrupted ORFs reveals a surprising profusion of protein sequence recoding mediated by ribosomal frameshifting and transcriptional realignment. Mol Biol Evol 2011; 28:3195-211. [PMID: 21673094 DOI: 10.1093/molbev/msr155] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.8] [Indexed: 11/14/2022]
Abstract
Bacterial genome annotations contain a number of coding sequences (CDSs) that, in spite of reading frame disruptions, encode a single continuous polypeptide. Such disruptions have different origins: sequencing errors, frameshift or stop codon mutations, as well as instances of utilization of nontriplet decoding. We have extracted over 1,000 CDSs with annotated disruptions and found that about 75% of them can be clustered into 64 groups based on sequence similarity. Analysis of the clusters revealed deep phylogenetic conservation of open reading frame organization as well as the presence of conserved sequence patterns that indicate likely utilization of the nonstandard decoding mechanisms: programmed ribosomal frameshifting (PRF) and programmed transcriptional realignment (PTR). Further enrichment of these clusters with additional homologous nucleotide sequences revealed over 6,000 candidate genes utilizing PRF or PTR. Analysis of the patterns of conservation apparently associated with nontriplet decoding revealed the presence of both previously characterized frameshift-prone sequences and a few novel ones. Since the starting point of our analysis was a set of genes with already annotated disruptions, it is highly plausible that in this study, we have identified only a fraction of all bacterial genes that utilize PRF or PTR. In addition to the identification of a large number of recoded genes, a surprising observation is that nearly half of them are expressed via PTR, a mechanism that, in contrast to PRF, has not yet received substantial attention.
Affiliation(s)
- Virag Sharma
- Department of Biochemistry, University College Cork, Cork, Ireland
|
23
|
Zhang Y, Sun Y. HMM-FRAME: accurate protein domain classification for metagenomic sequences containing frameshift errors. BMC Bioinformatics 2011; 12:198. [PMID: 21609463 PMCID: PMC3115854 DOI: 10.1186/1471-2105-12-198] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Received: 08/10/2010] [Accepted: 05/24/2011] [Indexed: 12/22/2022]
Abstract
BACKGROUND Protein domain classification is an important step in metagenomic annotation. The state-of-the-art method for protein domain classification is profile HMM-based alignment. However, the relatively high rates of insertions and deletions in homopolymer regions of pyrosequencing reads create frameshifts, causing conventional profile HMM alignment tools to generate alignments with marginal scores. This makes error-containing gene fragments unclassifiable with conventional tools. Thus, there is a need for an accurate domain classification tool that can detect and correct sequencing errors. RESULTS We introduce HMM-FRAME, a protein domain classification tool based on an augmented Viterbi algorithm that can incorporate error models from different sequencing platforms. HMM-FRAME corrects sequencing errors and classifies putative gene fragments into domain families. It achieved high error detection sensitivity and specificity in a data set with annotated errors. We applied HMM-FRAME in Targeted Metagenomics and a published metagenomic data set. The results showed that our tool can correct frameshifts in error-containing sequences, generate much longer alignments with significantly smaller E-values, and classify more sequences into their native families. CONCLUSIONS HMM-FRAME provides a complementary protein domain classification tool to conventional profile HMM-based methods for data sets containing frameshifts. Its current implementation is best used for small-scale metagenomic data sets. The source code of HMM-FRAME can be downloaded at http://www.cse.msu.edu/~zhangy72/hmmframe/ and at https://sourceforge.net/projects/hmm-frame/.
Affiliation(s)
- Yuan Zhang
- Computer Science and Engineering Department, Michigan State University, East Lansing, MI, USA
| | - Yanni Sun
- Computer Science and Engineering Department, Michigan State University, East Lansing, MI, USA
|
24
|
Xu J, Linning R, Fellers J, Dickinson M, Zhu W, Antonov I, Joly DL, Donaldson ME, Eilam T, Anikster Y, Banks T, Munro S, Mayo M, Wynhoven B, Ali J, Moore R, McCallum B, Borodovsky M, Saville B, Bakkeren G. Gene discovery in EST sequences from the wheat leaf rust fungus Puccinia triticina sexual spores, asexual spores and haustoria, compared to other rust and corn smut fungi. BMC Genomics 2011; 12:161. [PMID: 21435244 PMCID: PMC3074555 DOI: 10.1186/1471-2164-12-161] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Received: 10/06/2010] [Accepted: 03/24/2011] [Indexed: 12/30/2022]
Abstract
Background Rust fungi are biotrophic basidiomycete plant pathogens that cause major diseases on plants and trees world-wide, affecting agriculture and forestry. Their biotrophic nature precludes many established molecular genetic manipulations and lines of research. The generation of genomic resources for these microbes is leading to novel insights into biology such as interactions with the hosts and guiding directions for breakthrough research in plant pathology. Results To support gene discovery and gene model verification in the genome of the wheat leaf rust fungus, Puccinia triticina (Pt), we have generated Expressed Sequence Tags (ESTs) by sampling several life cycle stages. We focused on several spore stages and isolated haustorial structures from infected wheat, generating 17,684 ESTs. We produced sequences from both the sexual (pycniospores, aeciospores and teliospores) and asexual (germinated urediniospores) stages of the life cycle. From pycniospores and aeciospores, produced by infecting the alternate host, meadow rue (Thalictrum speciosissimum), 4,869 and 1,292 reads were generated, respectively. We generated 3,703 ESTs from teliospores produced on the senescent primary wheat host. Finally, we generated 6,817 reads from haustoria isolated from infected wheat as well as 1,003 sequences from germinated urediniospores. Along with 25,558 previously generated ESTs, we compiled a database of 13,328 non-redundant sequences (4,506 singlets and 8,822 contigs). Fungal genes were predicted using the EST version of the self-training GeneMarkS algorithm. To refine the EST database, we compared EST sequences by BLASTN to a set of 454 pyrosequencing-generated contigs and Sanger BAC-end sequences derived both from the Pt genome, and to ESTs and genome reads from wheat. A collection of 6,308 fungal genes was identified and compared to sequences of the cereal rusts, Puccinia graminis f. sp. tritici (Pgt) and stripe rust, P. striiformis f. sp. tritici (Pst), and poplar leaf rust Melampsora species, and the corn smut fungus, Ustilago maydis (Um). While extensive homologies were found, many genes appeared novel and species-specific; over 40% of genes did not match any known sequence in existing databases. Focusing on spore stages, direct comparison to Um identified potential functional homologs, possibly allowing heterologous functional analysis in that model fungus. Many potentially secreted protein genes were identified by similarity searches against genes and proteins of Pgt and Melampsora spp., revealing apparent orthologs. Conclusions The current set of Pt unigenes contributes to gene discovery in this major cereal pathogen and will be invaluable for gene model verification in the genome sequence.
Affiliation(s)
- Junhuan Xu
- Pacific Agri-Food Research Centre, Agriculture & Agri-Food Canada, Summerland, BC V0H 1Z0, Canada
|
25
|
Gelfand MS. Introduction: 4th International Moscow Conference on Computational Molecular Biology MCCMB'09. J Bioinform Comput Biol 2010; 8:v-vii. [PMID: 20564834 DOI: 10.1142/s0219720010004938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
|
26
|
Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res 2010; 38:e132. [PMID: 20403810 PMCID: PMC2896542 DOI: 10.1093/nar/gkq275] [Citation(s) in RCA: 1099] [Impact Index Per Article: 73.3] [Indexed: 11/13/2022]
Abstract
We describe an algorithm for gene identification in DNA sequences derived from shotgun sequencing of microbial communities. Accurate ab initio gene prediction in a short nucleotide sequence of anonymous origin is hampered by uncertainty in model parameters. While several machine learning approaches could be proposed to bypass this difficulty, one effective method is to estimate parameters from dependencies, formed in evolution, between frequencies of oligonucleotides in protein-coding regions and genome nucleotide composition. The original version of the method was proposed in 1999 and has been used since for (i) reconstructing the codon frequency vector needed for gene finding in viral genomes and (ii) initializing parameters of self-training gene finding algorithms. With the advent of new prokaryotic genomes en masse it became possible to enhance the original approach by using direct polynomial and logistic approximations of oligonucleotide frequencies, as well as by separating models for bacteria and archaea. These advances have increased the accuracy of model reconstruction and, subsequently, gene prediction. We describe the refined method and assess its accuracy on known prokaryotic genomes split into short sequences. Also, we show that as a result of application of the new method, several thousand new genes could be added to existing annotations of several human and mouse gut metagenomes.
Affiliation(s)
- Wenhan Zhu
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332, USA
|