1
Dimonaco NJ, Clare A, Kenobi K, Aubrey W, Creevey CJ. StORF-Reporter: finding genes between genes. Nucleic Acids Res 2023; 51:11504-11517. [PMID: 37897345 PMCID: PMC10682499 DOI: 10.1093/nar/gkad814] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2022] [Revised: 09/04/2023] [Accepted: 09/27/2023] [Indexed: 10/30/2023] Open
Abstract
Large regions of prokaryotic genomes are currently without any annotation, in part due to well-established limitations of annotation tools. For example, it is routine for genes using alternative start codons to be misreported or completely omitted. Therefore, we present StORF-Reporter, a tool that takes an annotated genome and returns regions that may contain missing CDS genes from unannotated regions. StORF-Reporter consists of two parts. The first begins with the extraction of unannotated regions from an annotated genome. Next, Stop-ORFs (StORFs) are identified in these unannotated regions. StORFs are open reading frames that are delimited by stop codons and thus can capture those genes most often missing in genome annotations. We show this methodology recovers genes missing from canonical genome annotations. We inspect the results of the genomes of model organisms, the pangenome of Escherichia coli, and a set of 5109 prokaryotic genomes of 247 genera from the Ensembl Bacteria database. StORF-Reporter extended the core, soft-core and accessory gene collections, identified novel gene families and extended families into additional genera. The high levels of sequence conservation observed between genera suggest that many of these StORFs are likely to be functional genes that should now be considered for inclusion in canonical annotations.
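The stop-to-stop idea described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not StORF-Reporter's implementation: the function name `find_storfs`, the `min_len` threshold, the forward-strand-only scan, and the coordinate convention (0-based, both delimiting stop codons included) are all hypothetical choices, and the stop codons are the three standard ones (TAA, TAG, TGA).

```python
# Illustrative sketch only; names, defaults and conventions are assumptions.
STOPS = {"TAA", "TAG", "TGA"}

def find_storfs(seq, min_len=30):
    """Return (start, end, frame) tuples for stop-to-stop ORFs on the forward strand.

    start is the index of the first base of the opening stop codon and
    end is the index just past the closing stop codon (0-based, end-exclusive).
    """
    seq = seq.upper()
    storfs = []
    for frame in range(3):
        prev_stop = None  # position of the most recent in-frame stop codon
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                # Two consecutive in-frame stops delimit one candidate region.
                if prev_stop is not None and (i + 3 - prev_stop) >= min_len:
                    storfs.append((prev_stop, i + 3, frame))
                prev_stop = i
    return storfs

# Toy unannotated region: a stop, 12 sense codons, then another in-frame stop.
region = "TAA" + "ATGGCTGCA" * 4 + "TAG"
print(find_storfs(region))
```

A real tool would also scan the reverse complement and apply further filters; those steps are left out here to keep the sketch short.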
Affiliation(s)
- Nicholas J Dimonaco
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth SY23 3PD, Wales, UK
- Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, Wales, UK
- Department of Medicine, McMaster University, Hamilton, ON, Canada
- Farncombe Family Digestive Health Research Institute, McMaster University, Hamilton, ON, Canada
- School of Biological Sciences, Queen’s University Belfast, Belfast BT7 1NN, Northern Ireland, UK
- Amanda Clare
- Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, Wales, UK
- Kim Kenobi
- Department of Mathematics, Aberystwyth University, Aberystwyth SY23 3BZ, Wales, UK
- Wayne Aubrey
- Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, Wales, UK
- Christopher J Creevey
- School of Biological Sciences, Queen’s University Belfast, Belfast BT7 1NN, Northern Ireland, UK
2
Syberg-Olsen MJ, Garber AI, Keeling PJ, McCutcheon JP, Husnik F. Pseudofinder: detection of pseudogenes in prokaryotic genomes. Mol Biol Evol 2022; 39:6633826. [PMID: 35801562 PMCID: PMC9336565 DOI: 10.1093/molbev/msac153] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Indexed: 12/02/2022] Open
Abstract
Prokaryotic genomes are usually densely packed with intact and functional genes. However, in certain contexts, such as after recent ecological shifts or extreme population bottlenecks, broken and nonfunctional gene fragments can quickly accumulate and form a substantial fraction of the genome. Identification of these broken genes, called pseudogenes, is a critical step for understanding the evolutionary forces acting upon, and the functional potential encoded within, prokaryotic genomes. Here, we present Pseudofinder, an open-source software package dedicated to pseudogene identification and analysis in bacterial and archaeal genomes. We demonstrate that Pseudofinder’s multi-pronged, reference-based approach can detect a wide variety of pseudogenes, including those that are highly degraded and typically missed by gene-calling pipelines, as well as newly formed pseudogenes containing only one or a few inactivating mutations. Additionally, Pseudofinder can detect genes that lack inactivating substitutions but are experiencing relaxed selection. Implementation of Pseudofinder in annotation pipelines will allow more precise estimations of the functional potential of sequenced microbes, while also generating new hypotheses related to the evolutionary dynamics of bacterial and archaeal genomes.
Affiliation(s)
- Arkadiy I Garber
- Division of Biological Sciences, University of Montana, Missoula, Montana, USA
- Patrick J Keeling
- Department of Botany, University of British Columbia, Vancouver, British Columbia, Canada
- John P McCutcheon
- Division of Biological Sciences, University of Montana, Missoula, Montana, USA; Howard Hughes Medical Institute, 4000 Jones Bridge Road, Chevy Chase, Maryland, USA
- Filip Husnik
- Department of Botany, University of British Columbia, Vancouver, British Columbia, Canada; Okinawa Institute of Science and Technology, Okinawa, Japan
3
Wang L, Wang M, Shi X, Yang J, Qian C, Liu Q, Zong L, Liu X, Zhu Z, Tang D, Zhang X. Investigation into archaeal extremophilic lifestyles through comparative proteogenomic analysis. J Biomol Struct Dyn 2020; 39:7080-7092. [PMID: 32820705 DOI: 10.1080/07391102.2020.1808531] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 02/03/2023]
Abstract
Archaea are a group of primary life forms on Earth and can thrive in many unique environments. Their successful colonization of extreme niches requires corresponding adaptations at the proteogenomic level in order to maintain stable cellular structures and active physiological functions. Although some studies have already investigated the extremophilic lifestyles of archaeal species based on genomic features and protein structures, there is a lack of comparative proteogenomic analysis on a large scale. In this study, we explored 686 high-quality archaeal genomes (proteomes) sourced from the Pathosystems Resource Integration Center (PATRIC) database. General patterns of genomic features such as genome size, coding capacity (coding genes and non-coding regions), and G + C content were re-confirmed. Protein domain distribution patterns were then identified across archaeal species. Domains of unknown function (DUFs) and mini proteins were investigated in terms of their distributions because of their importance in archaeal physiological functions. In addition, physicochemical properties of protein sequences, such as stability, hydrophobicity, isoelectric point, aromaticity and amino acid composition, were compared across the corresponding archaeal groups. Unique features associated with extremophilic lifestyles were observed, which suggested that evolutionary adaptations to different extreme environments had intrinsic impacts on archaeal protein features. Taken together, this systematic study facilitates a better understanding of the mechanisms behind the extremophilic lifestyles of archaeal species, which will further contribute to evolutionary explorations of archaeal adaptations, both experimental and theoretical, in future studies. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Liang Wang
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, Jiangsu, China; Department of Bioinformatics, School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, China; Jiangsu Key Lab of New Drug Research and Clinical Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Mengmeng Wang
- Jiangsu Key Lab of New Drug Research and Clinical Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China; Department of Pharmaceutical Analysis, School of Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Xinyi Shi
- School of Life Science, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Jianye Yang
- School of Life Science, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Chenlu Qian
- School of Life Science, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Qinghua Liu
- Jiangsu Key Lab of New Drug Research and Clinical Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China; Department of Pharmaceutical Analysis, School of Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Lixin Zong
- School of Life Science, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Xin Liu
- Department of Bioinformatics, School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Zuobin Zhu
- School of Life Science, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Daoquan Tang
- Jiangsu Key Lab of New Drug Research and Clinical Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China; Department of Pharmaceutical Analysis, School of Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Xiao Zhang
- Department of Bioinformatics, School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, China; Department of Computer Science, School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, China
4
Menendez-Gil P, Caballero CJ, Catalan-Moreno A, Irurzun N, Barrio-Hernandez I, Caldelari I, Toledo-Arana A. Differential evolution in 3'UTRs leads to specific gene expression in Staphylococcus. Nucleic Acids Res 2020; 48:2544-2563. [PMID: 32016395 PMCID: PMC7049690 DOI: 10.1093/nar/gkaa047] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 05/10/2019] [Revised: 12/05/2019] [Accepted: 01/16/2020] [Indexed: 12/16/2022] Open
Abstract
The evolution of gene expression regulation has contributed to species differentiation. The 3' untranslated regions (3'UTRs) of mRNAs include regulatory elements that modulate gene expression; however, our knowledge of their implications in the divergence of bacterial species is currently limited. In this study, we performed genome-wide comparative analyses of mRNAs encoding orthologous proteins from the genus Staphylococcus and found that mRNA conservation was lost mostly downstream of the coding sequence (CDS), indicating the presence of high sequence diversity in the 3'UTRs of orthologous genes. Transcriptomic mapping of different staphylococcal species confirmed that 3'UTRs were also variable in length. We constructed chimeric mRNAs carrying the 3'UTR of orthologous genes and demonstrated that 3'UTR sequence variations affect protein production. This suggested that species-specific functional 3'UTRs might be specifically selected during evolution. 3'UTR variations may occur through different processes, including gene rearrangements, local nucleotide changes, and the transposition of insertion sequences. By extending the conservation analyses to specific 3'UTRs, as well as the entire set of Escherichia coli and Bacillus subtilis mRNAs, we showed that 3'UTR variability is widespread in bacteria. In summary, our work unveils an evolutionary bias within 3'UTRs that results in species-specific non-coding sequences that may contribute to bacterial diversity.
Affiliation(s)
- Pilar Menendez-Gil
- Instituto de Agrobiotecnología (IdAB), CSIC-UPNA-Gobierno de Navarra, 31192-Mutilva, Navarra, Spain
- Carlos J Caballero
- Instituto de Agrobiotecnología (IdAB), CSIC-UPNA-Gobierno de Navarra, 31192-Mutilva, Navarra, Spain
- Arancha Catalan-Moreno
- Instituto de Agrobiotecnología (IdAB), CSIC-UPNA-Gobierno de Navarra, 31192-Mutilva, Navarra, Spain
- Naiara Irurzun
- Instituto de Agrobiotecnología (IdAB), CSIC-UPNA-Gobierno de Navarra, 31192-Mutilva, Navarra, Spain
- Inigo Barrio-Hernandez
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Isabelle Caldelari
- Université de Strasbourg, CNRS, Architecture et Réactivité de l’ARN, UPR9002, F-67000-Strasbourg, France
- Alejandro Toledo-Arana
- Instituto de Agrobiotecnología (IdAB), CSIC-UPNA-Gobierno de Navarra, 31192-Mutilva, Navarra, Spain
5
Wachter J, Hill SA. Small transcriptome analysis indicates that the enzyme RppH influences both the quality and quantity of sRNAs in Neisseria gonorrhoeae. FEMS Microbiol Lett 2014; 362:fnu059. [PMID: 25688066 DOI: 10.1093/femsle/fnu059] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 01/05/2023] Open
Abstract
Prokaryotic mRNA turnover can be initiated by the removal of pyrophosphate from the 5' end of a transcript by the RNA pyrophosphohydrolase enzyme RppH. Following this initial dephosphorylation step, RNase E degrades the message into small oligonucleotide segments. This study assessed the small RNA transcriptome of Neisseria gonorrhoeae strain MS11 in two genetic backgrounds: wild-type cells and cells carrying an rppH insertional mutation. It was found that the presence of the RppH enzyme affected both the quantity and length of small RNAs (sRNAs) in various chromosomal locations and involved sense transcripts (seRNAs), transcripts originating from the opposite strand (asRNAs) as well as inter-genic-derived RNAs (IGRs). In comparing the two transcriptomes, we found that not all small RNAs were expressed in both genetic backgrounds, suggesting that RppH targets only a subset of transcripts. Overall, this study shows that small RNAs can be detected from the majority of genes within the chromosome, as well as from inter-genic regions, and that more sRNA transcripts are detected in the absence of the RppH enzyme.
Affiliation(s)
- Jenny Wachter
- Department of Biological Sciences, Northern Illinois University, DeKalb, IL 60115-2828, USA
- Stuart A Hill
- Department of Biological Sciences, Northern Illinois University, DeKalb, IL 60115-2828, USA
6
Tu Q, He Z, Zhou J. Strain/species identification in metagenomes using genome-specific markers. Nucleic Acids Res 2014; 42:e67. [PMID: 24523352 PMCID: PMC4005670 DOI: 10.1093/nar/gku138] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.3] [Indexed: 11/14/2022] Open
Abstract
Shotgun metagenome sequencing has become a fast, cheap and high-throughput technology for characterizing microbial communities in complex environments and human body sites. However, accurate identification of microorganisms at the strain/species level remains extremely challenging. We present a novel k-mer-based approach, termed GSMer, that identifies genome-specific markers (GSMs) from currently sequenced microbial genomes, which were then used for strain/species-level identification in metagenomes. Using 5390 sequenced microbial genomes, 8,770,321 strain-specific and 11,736,360 species-specific 50-mer GSMs were identified for 4088 strains and 2005 species (4933 strains), respectively. The GSMs were first evaluated against mock community metagenomes, recently sequenced genomes and real metagenomes from different body sites, suggesting that the identified GSMs were specific to their targeting genomes. Sensitivity evaluation against synthetic metagenomes with different coverage suggested that 50 GSMs per strain were sufficient to identify most microbial strains with ≥0.25× coverage, and 10% of selected GSMs in a database should be detected for confident positive callings. Application of GSMs identified 45 and 74 microbial strains/species significantly associated with type 2 diabetes patients and obese/lean individuals from corresponding gastrointestinal tract metagenomes, respectively. Our result agreed with previous studies but provided strain-level information. The approach can be directly applied to identify microbial strains/species from raw metagenomes, without the effort of complex data pre-processing.
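The core idea of a genome-specific k-mer marker can be illustrated with a small Python sketch. It is not GSMer itself: the function name `genome_specific_kmers` is hypothetical, only exact forward-strand matches are considered, and a k-mer is simply treated as a marker when it occurs in exactly one genome of the input set. The abstract uses k = 50; the toy example uses k = 4 so the result can be checked by eye.

```python
# Illustrative sketch only; not the GSMer pipeline.
from collections import defaultdict

def genome_specific_kmers(genomes, k=50):
    """Map each genome name to the set of k-mers found in that genome only.

    genomes: dict of {name: sequence}. Only exact, forward-strand k-mers
    are compared (a simplifying assumption).
    """
    owners = defaultdict(set)  # k-mer -> names of genomes containing it
    for name, seq in genomes.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)

    markers = defaultdict(set)
    for kmer, names in owners.items():
        if len(names) == 1:  # present in a single genome: candidate marker
            markers[next(iter(names))].add(kmer)
    return dict(markers)

# Toy example with k = 4: the two sequences share CGTA and GTAC.
toy = {"strainA": "ACGTAC", "strainB": "CGTACC"}
print(genome_specific_kmers(toy, k=4))
```

Storing every k-mer in memory is workable only for small inputs; it is used here purely to make the uniqueness criterion explicit.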
Affiliation(s)
- Qichao Tu
- Department of Microbiology and Plant Biology, Institute for Environmental Genomics, University of Oklahoma, Norman, OK 73072, USA, Earth Science Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA and State Key Joint Laboratory of Environmental Simulation and Pollution Control, School of Environment, Tsinghua University, Beijing 100084, China
7
Abstract
In response to a lack of environmental combined nitrogen, the filamentous cyanobacterium Anabaena sp. strain PCC 7120 differentiates nitrogen-fixing heterocyst cells in a periodic pattern. HetR is a transcription factor that coordinates the regulation of this developmental program. An inverted repeat-containing sequence in the hepA promoter required for proheterocyst-specific transcription was identified based on sequence similarity to a previously characterized binding site for HetR in the promoter of hetP. The binding affinity of HetR for the hepA site is roughly an order of magnitude lower than that for the hetP binding site. A BLAST search of the Anabaena genome identified 166 hepA-like sites that occur as single or tandem sites (two binding sites separated by 13 bp). The vast majority of these sites are present in predicted intergenic regions. HetR bound five representative single binding sites in vitro, and binding was abrogated by transversions in the binding sites that conserved the inverted repeat nature of the sites. Binding to four representative tandem sites was not observed. Transcriptional fusions of the green fluorescent protein gene gfp with putative promoter regions associated with the representative binding sites indicated that HetR could function as either an activator or repressor and that activation was cell-type specific. Taken together, we have expanded the direct HetR regulon and propose a model in which three categories of HetR binding sites, based on binding affinity and nucleotide sequence, contribute to three of the four phases of differentiation.