1
Gunasekera RS, Raja KKB, Hewapathirana S, Tundrea E, Gunasekera V, Galbadage T, Nelson PA. ORFanID: A web-based search engine for the discovery and identification of orphan and taxonomically restricted genes. PLoS One 2023; 18:e0291260. [PMID: 37879070 PMCID: PMC10599687 DOI: 10.1371/journal.pone.0291260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2023] [Accepted: 08/24/2023] [Indexed: 10/27/2023]
Abstract
With the numerous genomes sequenced today, it has been revealed that a noteworthy percentage of genes in a given taxon of organisms in the phylogenetic tree of life do not have orthologous sequences in other taxa. These sequences are commonly referred to as "orphans" or "ORFans" if found as single occurrences in a single species or as "taxonomically restricted genes" (TRGs) when found at higher taxonomic levels. Quantitative and collective studies of these genes are necessary for understanding their biological origins. However, current software for identifying orphan genes is limited in functionality and database search range, and is algorithmically very complex. Thus, researchers studying orphan genes must harvest their data from many disparate sources. ORFanID is a graphical web-based search engine that facilitates the efficient identification of both orphan genes and TRGs at all taxonomic levels, from DNA or amino acid sequences in the NCBI database cluster and other large bioinformatics repositories. The software allows users to identify genes that are unique to any taxonomic rank, from species to domain, using NCBI systematic classifiers. It provides control over NCBI database search parameters, and the results are presented in a spreadsheet as well as a graphical display. The tables in the software are sortable, and results can be filtered using the fuzzy search functionality. The visual presentation can be expanded and collapsed by the taxonomic tree to its various branches. Example results from searches on five species and gene expression data from specific orphan genes are provided in the Supplementary Information.
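The idea of a gene being "unique to a taxonomic rank" can be illustrated with a minimal sketch. It is not ORFanID's implementation: the function names, the reduced rank list, and the hand-written lineages are all assumptions made for illustration. Given the lineage of the query organism and the lineages of organisms with homologous hits, it reports the narrowest rank that still contains every hit.

```python
# Ranks ordered from narrowest to broadest (a simplified subset of NCBI ranks).
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom", "domain"]


def restriction_level(query_lineage, hit_lineages):
    """Return (rank, taxon) for the narrowest rank that contains every hit.

    query_lineage: dict mapping rank -> taxon name for the query organism.
    hit_lineages:  list of such dicts, one per organism with a homologous hit.
    Returns None when at least one hit lies outside the query's domain.
    """
    for rank in RANKS:
        taxon = query_lineage[rank]
        if all(hit.get(rank) == taxon for hit in hit_lineages):
            return rank, taxon
    return None


def describe(level):
    """Turn the result of restriction_level into a readable label."""
    if level is None:
        return "not taxonomically restricted"
    rank, taxon = level
    if rank == "species":
        return f"orphan gene (unique to {taxon})"
    return f"TRG restricted to {rank} {taxon}"


# Hypothetical, hand-written lineages used only for this example.
human = dict(zip(RANKS, ["Homo sapiens", "Homo", "Hominidae", "Primates",
                         "Mammalia", "Chordata", "Metazoa", "Eukaryota"]))
chimp = dict(zip(RANKS, ["Pan troglodytes", "Pan", "Hominidae", "Primates",
                         "Mammalia", "Chordata", "Metazoa", "Eukaryota"]))

print(describe(restriction_level(human, [])))       # no hits outside the species
print(describe(restriction_level(human, [chimp])))  # hits confined to one family
```

In practice the hit lineages would come from a sequence search against a database, and the outcome depends heavily on the search thresholds, which this sketch leaves out entirely.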
Affiliation(s)
- Richard S. Gunasekera
  - Department of Chemistry, Physics and Engineering, School of Science, Technology & Health, Biola University, La Mirada, CA, United States of America
- Komal K. B. Raja
  - Department of Pathology & Immunology, Baylor College of Medicine, Houston, TX, United States of America
- Suresh Hewapathirana
  - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, United Kingdom
- Emanuel Tundrea
  - Griffiths School of Management and IT, Emanuel University of Oradea, Oradea, Romania
- Vinodh Gunasekera
  - Bioinformatics, Chesalon USA, Inc., Houston, TX, United States of America
- Thushara Galbadage
  - Department of Kinesiology and Public Health, School of Science, Technology & Health, Biola University, La Mirada, CA, United States of America
- Paul A. Nelson
  - Biola University, La Mirada, CA, United States of America
2
Fakhar AZ, Liu J, Pajerowska-Mukhtar KM, Mukhtar MS. The Lost and Found: Unraveling the Functions of Orphan Genes. J Dev Biol 2023; 11:27. [PMID: 37367481 PMCID: PMC10299390 DOI: 10.3390/jdb11020027] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/08/2023] [Revised: 05/19/2023] [Accepted: 05/26/2023] [Indexed: 06/28/2023]
Abstract
Orphan Genes (OGs) are a mysterious class of genes that have recently gained significant attention. Despite lacking a clear evolutionary history, they are found in nearly all living organisms, from bacteria to humans, and they play important roles in diverse biological processes. The discovery of OGs was first made through comparative genomics followed by the identification of unique genes across different species. OGs tend to be more prevalent in species with larger genomes, such as plants and animals, and their evolutionary origins remain unclear but potentially arise from gene duplication, horizontal gene transfer (HGT), or de novo origination. Although their precise function is not well understood, OGs have been implicated in crucial biological processes such as development, metabolism, and stress responses. To better understand their significance, researchers are using a variety of approaches, including transcriptomics, functional genomics, and molecular biology. This review offers a comprehensive overview of the current knowledge of OGs in all domains of life, highlighting the possible role of dark transcriptomics in their evolution. More research is needed to fully comprehend the role of OGs in biology and their impact on various biological processes.
Affiliation(s)
- M. Shahid Mukhtar
  - Department of Biology, University of Alabama at Birmingham, 1300 University Blvd., Birmingham, AL 35294, USA
3
Jiang M, Li X, Dong X, Zu Y, Zhan Z, Piao Z, Lang H. Research Advances and Prospects of Orphan Genes in Plants. FRONTIERS IN PLANT SCIENCE 2022; 13:947129. [PMID: 35874010 PMCID: PMC9305701 DOI: 10.3389/fpls.2022.947129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2022] [Accepted: 06/23/2022] [Indexed: 06/15/2023]
Abstract
Orphan genes (OGs) are defined as genes having no sequence similarity with genes present in other lineages. OGs are thought to play a key role in the development of lineage-specific adaptations and can also serve as a constant source of evolutionary novelty. These genes have often been linked to various stress responses, species-specific traits, and special expression regulation, and they also participate in primary substance metabolism. The advancement in sequencing tools and genome analysis methods has made the identification and characterization of OGs comparatively easier. In the study of OG functions in plants, significant progress has been made. We review recent advances in the fast-evolving characteristics, expression modulation, and functional analysis of OGs with a focus on their role in plant biology. We also emphasize current challenges and adoptable strategies, and discuss possible future directions for the functional study of OGs.
Affiliation(s)
- Mingliang Jiang
  - School of Agriculture, Jilin Agricultural Science and Technology College, Jilin, China
- Xiaonan Li
  - College of Horticulture, Shenyang Agricultural University, Shenyang, China
- Xiangshu Dong
  - School of Agriculture, Yunnan University, Kunming, China
- Ye Zu
  - College of Horticulture, Shenyang Agricultural University, Shenyang, China
- Zongxiang Zhan
  - College of Horticulture, Shenyang Agricultural University, Shenyang, China
- Zhongyun Piao
  - College of Horticulture, Shenyang Agricultural University, Shenyang, China
- Hong Lang
  - School of Agriculture, Jilin Agricultural Science and Technology College, Jilin, China
4
Sakamoto T, Ortega JM. Taxallnomy: an extension of NCBI Taxonomy that produces a hierarchically complete taxonomic tree. BMC Bioinformatics 2021; 22:388. [PMID: 34325658 PMCID: PMC8323199 DOI: 10.1186/s12859-021-04304-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/21/2021] [Accepted: 07/12/2021] [Indexed: 01/02/2023]
Abstract
BACKGROUND NCBI Taxonomy is the main taxonomic source for several bioinformatics tools and databases since all organisms with sequence accessions deposited on INSDC are organized in its hierarchical structure. Despite the extensive use and application of this data source, an alternative representation of data as a table would facilitate the use of information for processing bioinformatics data. To do so, since some taxonomic-ranks are missing in some lineages, an algorithm might propose provisional names for all taxonomic-ranks. RESULTS To address this issue, we developed an algorithm that takes the tree structure from NCBI Taxonomy and generates a hierarchically complete taxonomic table, maintaining its compatibility with the original tree. The procedures performed by the algorithm consist of attempting to assign a taxonomic-rank to an existing clade or "no rank" node when possible, using its name as part of the created taxonomic-rank name (e.g. Ord_Ornithischia) or interpolating parent nodes when needed (e.g. Cla_of_Ornithischia), both examples given for the dinosaur Brachylophosaurus lineage. The new hierarchical structure was named Taxallnomy because it contains names for all taxonomic-ranks, and it contains 41 hierarchical levels corresponding to the 41 taxonomic-ranks currently found in the NCBI Taxonomy database. From Taxallnomy, users can obtain the complete taxonomic lineage with 41 nodes of all taxa available in the NCBI Taxonomy database, without any hazard to the original tree information. In this work, we demonstrate its applicability by embedding taxonomic information of a specified rank into a phylogenetic tree and by producing metagenomics profiles. CONCLUSION Taxallnomy applies to any bioinformatics analyses that depend on the information from NCBI Taxonomy. Taxallnomy is updated periodically but with a distributed PERL script users can generate it locally using NCBI Taxonomy as input. 
All Taxallnomy resources are available at http://bioinfo.icb.ufmg.br/taxallnomy.
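The interpolation step described in the abstract (for example, Cla_of_Ornithischia) can be illustrated with a small sketch. It is a simplified illustration, not the Taxallnomy algorithm: it uses a hypothetical five-rank hierarchy instead of 41 ranks, it skips the step that promotes "no rank" clades, and it assumes that the three-letter prefixes seen in the abstract's examples generalise to other ranks. The input lineage is also hand-written for the example.

```python
# Hypothetical reduced hierarchy, ordered from broadest to narrowest.
RANKS = ["phylum", "class", "order", "family", "genus"]


def complete_lineage(lineage, ranks=RANKS):
    """Fill missing ranks with provisional names taken from the nearest named
    narrower rank, e.g. a missing class above order Ornithischia becomes
    'Cla_of_Ornithischia'. Ranks with no named narrower rank stay None.
    """
    completed = []
    for i, rank in enumerate(ranks):
        name = lineage.get(rank)
        if name is None:
            # Look downwards for the closest rank that does have a name.
            child = next((lineage[r] for r in ranks[i + 1:] if lineage.get(r)), None)
            if child is not None:
                prefix = rank[:3].capitalize()  # assumed abbreviation scheme
                name = f"{prefix}_of_{child}"
        completed.append((rank, name))
    return completed


# Simplified, hand-written lineage with class and family missing.
partial = {"phylum": "Chordata", "order": "Ornithischia", "genus": "Brachylophosaurus"}
for rank, name in complete_lineage(partial):
    print(f"{rank:8s} {name}")
```

Because every rank now has a value, the lineage can be stored as one fixed-width table row per taxon, which is the tabular representation the abstract argues for.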
Affiliation(s)
- Tetsu Sakamoto
  - BioME - Bioinformatics Multidisciplinary Environment, Instituto Metrópole Digital (IMD), Universidade Federal Do Rio Grande Do Norte (UFRN), Natal, RN, Brazil
  - Laboratório de Biodados, Departamento de Bioquímica E Imunologia, Instituto de Ciências Biológicas (ICB), Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, MG, Brazil
- J Miguel Ortega
  - Laboratório de Biodados, Departamento de Bioquímica E Imunologia, Instituto de Ciências Biológicas (ICB), Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, MG, Brazil
5
O’Conner S, Li L. Mitochondrial Fostering: The Mitochondrial Genome May Play a Role in Plant Orphan Gene Evolution. FRONTIERS IN PLANT SCIENCE 2020; 11:600117. [PMID: 33424897 PMCID: PMC7793901 DOI: 10.3389/fpls.2020.600117] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 08/28/2020] [Accepted: 11/02/2020] [Indexed: 05/12/2023]
Abstract
Plant mitochondrial genomes exhibit unique evolutionary patterns. They have a high rearrangement but low mutation rate, and a large size. Based on massive mitochondrial DNA transfers to the nucleus as well as the mitochondrial unique evolutionary traits, we propose a "Mitochondrial Fostering" theory where the organelle genome plays an integral role in the arrival and development of orphan genes (genes with no homologs in other lineages). Two approaches were used to test this theory: (1) bioinformatic analysis of nuclear mitochondrial DNA (Numts: mitochondrial originating DNA that migrated to the nucleus) at the genome level, and (2) bioinformatic analysis of particular orphan sequences present in both the mitochondrial genome and the nuclear genome of Arabidopsis thaliana. One study example is given about one orphan sequence that codes for two unique orphan genes: one in the mitochondrial genome and another one in the nuclear genome. DNA alignments show regions of this A. thaliana orphan sequence exist scattered throughout other land plant mitochondrial genomes. This is consistent with the high recombination rates of mitochondrial genomes in land plants. This may also enable the creation of novel coding sequences within the orphan loci, which can then be transferred to the nuclear genome and become exposed to new evolutionary pressures. Our study also reveals a high correlation between the amount of mitochondrial DNA transferred to the nuclear genome and the number of orphan genes in land plants. All the data suggests the mitochondrial genome may play a role in nuclear orphan gene evolution in land plants.
Affiliation(s)
- Ling Li
  - Department of Biological Sciences, Mississippi State University, Mississippi State, MS, United States
6
Arendsee Z, Li J, Singh U, Seetharam A, Dorman K, Wurtele ES. phylostratr: a framework for phylostratigraphy. Bioinformatics 2020; 35:3617-3627. [PMID: 30873536 DOI: 10.1093/bioinformatics/btz171] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 11/01/2018] [Revised: 02/27/2019] [Accepted: 03/13/2019] [Indexed: 12/20/2022]
Abstract
MOTIVATION The goal of phylostratigraphy is to infer the evolutionary origin of each gene in an organism. This is done by searching for homologs within increasingly broad clades. The deepest clade that contains a homolog of the protein(s) encoded by a gene is that gene's phylostratum. RESULTS We have created a general R-based framework, phylostratr, to estimate the phylostratum of every gene in a species. The program fully automates analysis: selecting species for balanced representation, retrieving sequences, building databases, inferring phylostrata and returning diagnostics. Key diagnostics include: detection of genes with inferred homologs in old clades, but not intermediate ones; proteome quality assessments; false-positive diagnostics, and checks for missing organellar genomes. phylostratr allows extensive customization and systematic comparisons of the influence of analysis parameters or genomes on phylostrata inference. A user may: modify the automatically generated clade tree or use their own tree; provide custom sequences in place of those automatically retrieved from UniProt; replace BLAST with an alternative algorithm; or tailor the method and sensitivity of the homology inference classifier. We show the utility of phylostratr through case studies in Arabidopsis thaliana and Saccharomyces cerevisiae. AVAILABILITY AND IMPLEMENTATION Source code available at https://github.com/arendsee/phylostratr. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
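The phylostratum definition in the abstract can be made concrete with a minimal sketch. It is written in Python for illustration and does not use or reproduce the phylostratr R API. The function name, the clade list, and the representative species are assumptions. Each stratum lists only the species that branch off at that level, ordered from the focal species outward, and the gene is assigned to the broadest stratum in which a homolog is found.

```python
def infer_phylostratum(strata, species_with_homolog):
    """Return the name of the deepest clade that contains a homolog.

    strata: list of (clade_name, species_set) ordered from the focal species
            outward; each set holds only the species added at that level.
    species_with_homolog: set of species in which a homolog was detected.
    """
    # The gene is always present in the focal species itself.
    result = strata[0][0]
    for clade, added_species in strata[1:]:
        if added_species & species_with_homolog:
            result = clade  # a broader clade overrides any narrower one
    return result


# Hypothetical strata for Arabidopsis thaliana with one representative each.
strata = [
    ("Arabidopsis thaliana", {"Arabidopsis thaliana"}),
    ("Brassicaceae", {"Brassica rapa"}),
    ("Viridiplantae", {"Oryza sativa"}),
    ("Eukaryota", {"Saccharomyces cerevisiae"}),
    ("cellular organisms", {"Escherichia coli"}),
]

print(infer_phylostratum(strata, set()))                      # species-specific gene
print(infer_phylostratum(strata, {"Brassica rapa"}))          # family-level gene
print(infer_phylostratum(strata, {"Saccharomyces cerevisiae"}))
```

The last call shows a gene with a homolog in an old clade but none in the intermediate ones, which is the situation the abstract lists among the key diagnostics. Whether such a hit is a true homolog or a false positive is a separate question that the sketch does not address.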
Affiliation(s)
- Zebulun Arendsee
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, USA
  - Genetics, Development, and Cell Biology, Iowa State University, Ames, IA, USA
  - Center for Metabolic Biology, Iowa State University, Ames, IA, USA
- Jing Li
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, USA
  - Genetics, Development, and Cell Biology, Iowa State University, Ames, IA, USA
- Urminder Singh
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, USA
  - Genetics, Development, and Cell Biology, Iowa State University, Ames, IA, USA
- Arun Seetharam
  - Genetics, Development, and Cell Biology, Iowa State University, Ames, IA, USA
  - Genome Informatics Facility, Iowa State University, Ames, IA, USA
- Karin Dorman
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, USA
  - Genetics, Development, and Cell Biology, Iowa State University, Ames, IA, USA
  - Department of Statistics, Iowa State University, Ames, IA, USA
- Eve Syrkin Wurtele
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, USA
  - Genetics, Development, and Cell Biology, Iowa State University, Ames, IA, USA
  - Center for Metabolic Biology, Iowa State University, Ames, IA, USA
7
Reanalysis of Lactobacillus paracasei Lbs2 Strain and Large-Scale Comparative Genomics Places Many Strains into Their Correct Taxonomic Position. Microorganisms 2019; 7:microorganisms7110487. [PMID: 31731444 PMCID: PMC6920896 DOI: 10.3390/microorganisms7110487] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 08/10/2019] [Revised: 10/08/2019] [Accepted: 10/14/2019] [Indexed: 12/11/2022]
Abstract
Lactobacillus paracasei are diverse Gram-positive bacteria that are very closely related to Lactobacillus casei, belonging to the Lactobacillus casei group. Due to extreme genome similarities between L. casei and L. paracasei, many strains have been cross-placed in the other group. We had earlier sequenced and analyzed the genome of Lactobacillus paracasei Lbs2, but mistakenly identified it as L. casei. We re-analyzed Lbs2 reads into a 2.5 MB genome that is 91.28% complete with 0.8% contamination, which is now suitably placed under L. paracasei based on Average Nucleotide Identity and Average Amino Acid Identity. We took 74 sequenced genomes of L. paracasei from GenBank with assembly sizes ranging from 2.3 to 3.3 MB and genome completeness between 88% and 100% for comparison. The pan-genome of 75 L. paracasei strains holds 15,945 gene families (215,232 genes), while the core genome contained about 8.4% of the total genes (243 gene families with 18,225 genes) of the pan-genome. Phylogenomic analysis based on core gene families revealed that the Lbs2 strain has a closer relationship with L. paracasei subsp. tolerans DSM20258. Finally, the in-silico analysis of the L. paracasei Lbs2 genome revealed an important pathway that could underpin the production of thiamin, which may contribute to the host energy metabolism.
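The split between core genome and pan-genome can be illustrated with a short sketch on toy data. It is not the pipeline used in the study: the function name and the toy gene families are hypothetical, and "core" is taken here to mean present in every strain. The last lines check the reported figures, reading the total gene count as 215,232.

```python
def split_pan_genome(family_to_strains, n_strains):
    """Split gene families into core (present in every strain) and accessory."""
    core = {fam for fam, strains in family_to_strains.items()
            if len(strains) == n_strains}
    accessory = set(family_to_strains) - core
    return core, accessory


# Toy presence data: gene family -> set of strains carrying it.
toy = {
    "famA": {"s1", "s2", "s3"},
    "famB": {"s1", "s3"},
    "famC": {"s2"},
    "famD": {"s1", "s2", "s3"},
}
core, accessory = split_pan_genome(toy, n_strains=3)
print(sorted(core), sorted(accessory))

# Consistency check on the figures quoted in the abstract.
core_genes = 243 * 75            # 243 core families, one copy in each of 75 strains
core_share = core_genes / 215_232
print(core_genes, round(core_share, 4))
```

The product 243 × 75 equals the 18,225 core genes quoted in the abstract, which is what one would expect if each core family contributes a single gene per strain. The resulting share of the pan-genome is close to the "about 8.4%" stated there.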
8
Bioinformatics Identification of Anti-CRISPR Loci by Using Homology, Guilt-by-Association, and CRISPR Self-Targeting Spacer Approaches. mSystems 2019; 4:4/5/e00455-19. [PMID: 31506266 PMCID: PMC6739104 DOI: 10.1128/msystems.00455-19] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Indexed: 01/21/2023]
Abstract
Anti-CRISPR (Acr) loci/operons encode Acr proteins and Acr-associated (Aca) proteins. Forty-five Acr families have been experimentally characterized inhibiting seven subtypes of CRISPR-Cas systems. We have developed a bioinformatics pipeline to identify genomic loci containing Acr homologs and/or Aca homologs by combining three computational approaches: homology, guilt-by-association, and self-targeting spacers. Homology search found thousands of Acr homologs in bacterial and viral genomes, but most are homologous to AcrIIA7 and AcrIIA9. Investigating the gene neighborhood of these Acr homologs revealed that only a small percentage (23.0% in bacteria and 8.2% in viruses) of them have neighboring Aca homologs and thus form Acr-Aca operons.
Surprisingly, although a self-targeting spacer is a strong indicator of the presence of Acr genes in a genome, a large percentage of Acr-Aca loci are found in bacterial genomes without self-targeting spacers or even without complete CRISPR-Cas systems. Additionally, for Acr homologs from genomes with self-targeting spacers, homology-based Acr family assignments do not always agree with the self-targeting CRISPR-Cas subtypes. Last, by investigating Acr genomic loci coexisting with self-targeting spacers in the same genomes, five known subtypes (I-C, I-E, I-F, II-A, and II-C) and five new subtypes (I-B, III-A, III-B, IV-A, and V-U4) of Acrs were inferred. Based on these findings, we conclude that the discovery of new anti-CRISPRs should not be restricted to genomes with self-targeting spacers and loci with Acr homologs. The evolutionary arms race of CRISPR-Cas systems and anti-CRISPR systems may have driven the adaptive and rapid gain and loss of these elements in closely related genomes. IMPORTANCE As a naturally occurring adaptive immune system, CRISPR-Cas (clustered regularly interspaced short palindromic repeats–CRISPR-associated genes) systems are widely found in bacteria and archaea to defend against viruses. Since 2013, the application of various bacterial CRISPR-Cas systems has become very popular due to their development into targeted and programmable genome engineering tools with the ability to edit almost any genome. As the natural off-switch of CRISPR-Cas systems, anti-CRISPRs have a great potential to serve as regulators of CRISPR-Cas tools and enable safer and more controllable genome editing. This study will help understand the relative usefulness of the three bioinformatics approaches for new Acr discovery, as well as guide the future development of new bioinformatics tools to facilitate anti-CRISPR research. The thousands of Acr homologs and hundreds of new anti-CRISPR loci identified in this study will be a valuable data resource for genome engineers to search for new CRISPR-Cas regulators.
9
Orphan Genes Shared by Pathogenic Genomes Are More Associated with Bacterial Pathogenicity. mSystems 2019; 4:mSystems00290-18. [PMID: 30801025 PMCID: PMC6372840 DOI: 10.1128/msystems.00290-18] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 11/15/2018] [Accepted: 01/08/2019] [Indexed: 11/20/2022]
Abstract
Orphan genes (also known as ORFans [i.e., orphan open reading frames]) are new genes that enable an organism to adapt to its specific living environment. Our focus in this study is to compare ORFans between pathogens (P) and nonpathogens (NP) of the same genus. Using the pangenome idea, we have identified 130,169 ORFans in nine bacterial genera (505 genomes) and classified these ORFans into four groups: (i) SS-ORFans (P), which are only found in a single pathogenic genome; (ii) SS-ORFans (NP), which are only found in a single nonpathogenic genome; (iii) PS-ORFans (P), which are found in multiple pathogenic genomes; and (iv) NS-ORFans (NP), which are found in multiple nonpathogenic genomes. Within the same genus, pathogens do not always have more genes, more ORFans, or more pathogenicity-related genes (PRGs)—including prophages, pathogenicity islands (PAIs), virulence factors (VFs), and horizontal gene transfers (HGTs)—than nonpathogens.
Interestingly, in pathogens of the nine genera, the percentages of PS-ORFans are consistently higher than those of SS-ORFans, which is not true in nonpathogens. Similarly, in pathogens of the nine genera, the percentages of PS-ORFans matching the four types of PRGs are also always higher than those of SS-ORFans, but this is not true in nonpathogens. All of these findings suggest the greater importance of PS-ORFans for bacterial pathogenicity. IMPORTANCE Recent pangenome analyses of numerous bacterial species have suggested that each genome of a single species may have a significant fraction of its gene content unique or shared by a very few genomes (i.e., ORFans). We selected nine bacterial genera, each containing at least five pathogenic and five nonpathogenic genomes, to compare their ORFans in relation to pathogenicity-related genes. Pathogens in these genera are known to cause a number of common and devastating human diseases such as pneumonia, diphtheria, melioidosis, and tuberculosis. Thus, they are worthy of in-depth systems microbiology investigations, including the comparative study of ORFans between pathogens and nonpathogens. We provide direct evidence to suggest that ORFans shared by more pathogens are more associated with pathogenicity-related genes and thus are more important targets for development of new diagnostic markers or therapeutic drugs for bacterial infectious diseases.
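The four ORFan groups named in the abstract can be expressed as a small classification rule. The sketch below is an illustration rather than the authors' pipeline. The function name and genome identifiers are hypothetical, and it assumes that PS-ORFans and NS-ORFans occur exclusively in pathogenic or nonpathogenic genomes respectively, which the abstract does not state explicitly. It also does not enforce the "very few genomes" limit that defines an ORFan in the first place.

```python
def classify_orfan(genomes_with_gene, is_pathogenic):
    """Assign an ORFan to one of the four groups, or None if it does not fit.

    genomes_with_gene: set of genome ids (within one genus) carrying the gene.
    is_pathogenic:     dict mapping genome id -> True (P) or False (NP).
    """
    labels = {is_pathogenic[g] for g in genomes_with_gene}
    if len(labels) != 1:
        return None  # empty input, or shared between P and NP genomes (assumed excluded)
    pathogenic = labels.pop()
    if len(genomes_with_gene) == 1:
        return "SS-ORFan (P)" if pathogenic else "SS-ORFan (NP)"
    return "PS-ORFan (P)" if pathogenic else "NS-ORFan (NP)"


# Hypothetical genus with two pathogenic and two nonpathogenic genomes.
is_pathogenic = {"g1": True, "g2": True, "g3": False, "g4": False}

print(classify_orfan({"g1"}, is_pathogenic))
print(classify_orfan({"g1", "g2"}, is_pathogenic))
print(classify_orfan({"g3", "g4"}, is_pathogenic))
print(classify_orfan({"g2", "g3"}, is_pathogenic))
```

With each ORFan labelled this way, the percentages compared in the abstract reduce to counting labels per genus and dividing by the number of ORFans in the pathogenic or nonpathogenic genomes.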