1
Meng W, Kong L, Abulizi A, Cong J, Sun Z, Chang Y. Sex determination factor, a novel male-linked gene in the sea cucumber Apostichopus japonicus: Molecular characterization, expression patterns and effects of gene knockdown. Comp Biochem Physiol B Biochem Mol Biol 2025; 277:111071. [PMID: 39778676 DOI: 10.1016/j.cbpb.2025.111071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2024] [Revised: 01/05/2025] [Accepted: 01/05/2025] [Indexed: 01/11/2025]
Abstract
Apostichopus japonicus is a highly significant marine aquaculture species. Research findings have indicated that male sea cucumbers demonstrate a more rapid growth rate compared to females, underscoring the potential advantages of establishing an all-male population. In this study, we identified a specific protein-coding gene (ORFan) within a 4565 bp male fragment and named it sex determination factor (sdf). The sdf transcript exhibited ubiquitous expression in various adult male tissues, along with dynamic expression patterns in the testis across different developmental stages. Notably, knockdown of the sdf gene through immersion of embryos in its specific vivo-morpholino oligomers (vivo-MO) resulted in significant changes in the expression levels of several sex-related genes including piwi1, vasa, foxl2, and DNMT3. Additionally, a transcriptomic analysis showed that sdf knockdown resulted in significant alterations in multiple biological processes encompassing various sex-related gene ontology terms such as male gonad development, ovarian follicle development, and steroidogenesis. These results provide a molecular foundation for comprehending ORFans in sea cucumbers while offering a valuable method for gene knockdown studies in echinoderms.
Affiliation(s)
- Weihan Meng
- Key Laboratory of Mariculture & Stock Enhancement in North China's Sea, Ministry of Agriculture and Rural Affairs, Dalian Ocean University, Dalian 116023, China
- Lingnan Kong
- Key Laboratory of Mariculture & Stock Enhancement in North China's Sea, Ministry of Agriculture and Rural Affairs, Dalian Ocean University, Dalian 116023, China
- Abudula Abulizi
- Key Laboratory of Mariculture & Stock Enhancement in North China's Sea, Ministry of Agriculture and Rural Affairs, Dalian Ocean University, Dalian 116023, China
- Jingjing Cong
- Key Laboratory of Mariculture & Stock Enhancement in North China's Sea, Ministry of Agriculture and Rural Affairs, Dalian Ocean University, Dalian 116023, China; School of Life Science, Liaoning Normal University, Dalian 116029, China
- Zhihui Sun
- Key Laboratory of Mariculture & Stock Enhancement in North China's Sea, Ministry of Agriculture and Rural Affairs, Dalian Ocean University, Dalian 116023, China
- Yaqing Chang
- Key Laboratory of Mariculture & Stock Enhancement in North China's Sea, Ministry of Agriculture and Rural Affairs, Dalian Ocean University, Dalian 116023, China
2
Fakhar AZ, Liu J, Pajerowska-Mukhtar KM, Mukhtar MS. The Lost and Found: Unraveling the Functions of Orphan Genes. J Dev Biol 2023; 11:27. [PMID: 37367481 PMCID: PMC10299390 DOI: 10.3390/jdb11020027] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/08/2023] [Revised: 05/19/2023] [Accepted: 05/26/2023] [Indexed: 06/28/2023] Open
Abstract
Orphan Genes (OGs) are a mysterious class of genes that have recently gained significant attention. Despite lacking a clear evolutionary history, they are found in nearly all living organisms, from bacteria to humans, and they play important roles in diverse biological processes. The discovery of OGs was first made through comparative genomics followed by the identification of unique genes across different species. OGs tend to be more prevalent in species with larger genomes, such as plants and animals, and their evolutionary origins remain unclear but potentially arise from gene duplication, horizontal gene transfer (HGT), or de novo origination. Although their precise function is not well understood, OGs have been implicated in crucial biological processes such as development, metabolism, and stress responses. To better understand their significance, researchers are using a variety of approaches, including transcriptomics, functional genomics, and molecular biology. This review offers a comprehensive overview of the current knowledge of OGs in all domains of life, highlighting the possible role of dark transcriptomics in their evolution. More research is needed to fully comprehend the role of OGs in biology and their impact on various biological processes.
Affiliation(s)
- M. Shahid Mukhtar
- Department of Biology, University of Alabama at Birmingham, 1300 University Blvd., Birmingham, AL 35294, USA
3
Abstract
Here we report the discovery of Yaravirus, a lineage of amoebal virus with a puzzling origin and evolution. Yaravirus presents 80-nm-sized particles and a 44,924-bp dsDNA genome encoding 74 predicted proteins. Yaravirus genome annotation showed that none of its genes matched sequences of known organisms at the nucleotide level; at the amino acid level, six predicted proteins had distant matches in the nr database. Complementary prediction of three-dimensional structures indicated a possible function for 17 proteins in total. Furthermore, we were not able to retrieve viral genomes closely related to Yaravirus in 8,535 publicly available metagenomes spanning diverse habitats around the globe. The Yaravirus genome also contained six types of tRNAs that did not match commonly used codons. Proteomics revealed that Yaravirus particles contain 26 viral proteins, one of which potentially represents a divergent major capsid protein (MCP) with a predicted double jelly-roll domain. Structure-guided phylogeny of the MCP suggests that Yaravirus groups together with the MCPs of Pleurochrysis endemic viruses. Yaravirus expands our knowledge of the diversity of DNA viruses. The phylogenetic distance between Yaravirus and all other viruses highlights our still preliminary assessment of the genomic diversity of eukaryotic viruses, reinforcing the need for the isolation of new viruses of protists.
4
Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification. J Virol 2017; 91:JVI.02275-16. [PMID: 28122979 PMCID: PMC5375668 DOI: 10.1128/jvi.02275-16] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 11/22/2016] [Accepted: 01/13/2017] [Indexed: 11/20/2022] Open
Abstract
Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allows them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds.
IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids.
5
Daubin V, Szöllősi GJ. Horizontal Gene Transfer and the History of Life. Cold Spring Harb Perspect Biol 2016; 8:a018036. [PMID: 26801681 DOI: 10.1101/cshperspect.a018036] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.9] [Indexed: 12/20/2022]
Abstract
Microbes acquire DNA from a variety of sources. The last decades, which have seen the development of genome sequencing, have revealed that horizontal gene transfer has been a major evolutionary force that has constantly reshaped genomes throughout evolution. However, because the history of life must ultimately be deduced from gene phylogenies, the lack of methods to account for horizontal gene transfer has thrown into confusion the very concept of the tree of life. As a result, many questions remain open, but emerging methodological developments promise to use information conveyed by horizontal gene transfer that remains unexploited today.
Affiliation(s)
- Vincent Daubin
- Laboratoire de Biométrie et Biologie Evolutive, Université de Lyon, 69000 Lyon, France; Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5558, Université Lyon 1, 69622 Villeurbanne, France
6
González-Casanova A, Aguirre-von-Wobeser E, Espín G, Servín-González L, Kurt N, Spanò D, Blath J, Soberón-Chávez G. Strong seed-bank effects in bacterial evolution. J Theor Biol 2014; 356:62-70. [PMID: 24768952 DOI: 10.1016/j.jtbi.2014.04.009] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 07/18/2013] [Revised: 01/26/2014] [Accepted: 04/04/2014] [Indexed: 11/15/2022]
Abstract
Bacterial genomes are mosaics with fragments showing distinct phylogenetic origins or even being unrelated to any other genetic information (ORFan genes). Thus the analysis of bacterial population genetics is in large part a collection of explanations for anomalies in relation to classical population genetic models such as the Wright-Fisher model and the Kingman coalescent that do not adequately describe bacterial population genetics, genomics or evolution. The concept of "species" as an evolutionary coherent biological group that is genetically isolated and shares genetic information through recombination among its members cannot be applied to any bacterial group. Recently, a simple probabilistic model considering the role of strong seed-bank effects in population genetics has been proposed by Blath et al. This model suggests the existence of a genetic pool with high diversity that is not subject to classical selection and extinction. We reason that certain bacterial population genetics anomalies could be explained by the prevalence of strong seed-bank effects among bacteria. To address this possibility we analyzed the genome of the bacterium Azotobacter vinelandii and show that genes that code for functions that are essential for the bacterium biology do not have a relation of ancestry with closely related bacteria, or are ORFan genes. The existence of essential genes that are not inherited from the most recent ancestor cannot be explained by classical population genetics models and is irreconcilable with the current view of genes acquired by horizontal transfer as being accessory or adaptive.
Affiliation(s)
- Adrián González-Casanova
- Technische Universität Berlin, TU Berlin, Fakultät II, Institut für Mathematik, MA 7-3, Strasse des 17. Juni 136, 10623 Berlin, Germany; Berlin Mathematical School, Departamento de Biología Molecular y Biotecnología, Instituto de Investigaciones Biomédicas, Universidad Nacional Autónoma de México, Ciudad Universitaria, Apartado Postal 70228, 04510 DF, México
- Eneas Aguirre-von-Wobeser
- Instituto de Ecología, A. C., Red de Estudios Moleculares Avanzados, Apartado Postal 63, 91000, Xalapa, Veracruz, México
- Guadalupe Espín
- Departamento de Microbiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Apartado, México
- Luis Servín-González
- Departamento de Biología Molecular y Biotecnología, Instituto de Investigaciones Biomédicas, Universidad Nacional Autónoma de México, Distrito Federal, México
- Noemi Kurt
- Technische Universität Berlin, TU Berlin, Fakultät II, Institut für Mathematik, MA 7-3, Strasse des 17. Juni 136, 10623 Berlin, Germany
- Jochen Blath
- Technische Universität Berlin, TU Berlin, Fakultät II, Institut für Mathematik, MA 7-3, Strasse des 17. Juni 136, 10623 Berlin, Germany
- Gloria Soberón-Chávez
- Departamento de Biología Molecular y Biotecnología, Instituto de Investigaciones Biomédicas, Universidad Nacional Autónoma de México, Distrito Federal, México
7
Anton BP, Chang YC, Brown P, Choi HP, Faller LL, Guleria J, Hu Z, Klitgord N, Levy-Moonshine A, Maksad A, Mazumdar V, McGettrick M, Osmani L, Pokrzywa R, Rachlin J, Swaminathan R, Allen B, Housman G, Monahan C, Rochussen K, Tao K, Bhagwat AS, Brenner SE, Columbus L, de Crécy-Lagard V, Ferguson D, Fomenkov A, Gadda G, Morgan RD, Osterman AL, Rodionov DA, Rodionova IA, Rudd KE, Söll D, Spain J, Xu SY, Bateman A, Blumenthal RM, Bollinger JM, Chang WS, Ferrer M, Friedberg I, Galperin MY, Gobeill J, Haft D, Hunt J, Karp P, Klimke W, Krebs C, Macelis D, Madupu R, Martin MJ, Miller JH, O'Donovan C, Palsson B, Ruch P, Setterdahl A, Sutton G, Tate J, Yakunin A, Tchigvintsev D, Plata G, Hu J, Greiner R, Horn D, Sjölander K, Salzberg SL, Vitkup D, Letovsky S, Segrè D, DeLisi C, Roberts RJ, Steffen M, Kasif S. The COMBREX project: design, methodology, and initial results. PLoS Biol 2013; 11:e1001638. [PMID: 24013487 PMCID: PMC3754883 DOI: 10.1371/journal.pbio.1001638] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Indexed: 11/18/2022] Open
Affiliation(s)
- Brian P. Anton
- New England Biolabs, Ipswich, Massachusetts, United States of America
- * E-mail: (BPA); (SK)
- Yi-Chien Chang
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Peter Brown
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Han-Pil Choi
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Lina L. Faller
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Jyotsna Guleria
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Zhenjun Hu
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Niels Klitgord
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Ami Levy-Moonshine
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Almaz Maksad
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Varun Mazumdar
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Mark McGettrick
- Diatom Software LLC, Holliston, Massachusetts, United States of America
- Lais Osmani
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Revonda Pokrzywa
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- John Rachlin
- Diatom Software LLC, Holliston, Massachusetts, United States of America
- Rajeswari Swaminathan
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Benjamin Allen
- Program for Evolutionary Dynamics, Harvard University, Cambridge, Massachusetts, United States of America
- Department of Mathematics, Emmanuel College, Boston, Massachusetts, United States of America
- Genevieve Housman
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Caitlin Monahan
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Krista Rochussen
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Kevin Tao
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Ashok S. Bhagwat
- Department of Chemistry, Wayne State University, Detroit, Michigan, United States of America
- Steven E. Brenner
- Department of Plant and Microbial Biology, University of California, Berkeley, California, United States of America
- Linda Columbus
- Department of Chemistry, University of Virginia, Charlottesville, Virginia, United States of America
- Valérie de Crécy-Lagard
- Department of Microbiology and Cell Science, University of Florida, Gainesville, Florida, United States of America
- Donald Ferguson
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
- Alexey Fomenkov
- New England Biolabs, Ipswich, Massachusetts, United States of America
- Giovanni Gadda
- Department of Chemistry, Georgia State University, Atlanta, Georgia, United States of America
- Richard D. Morgan
- New England Biolabs, Ipswich, Massachusetts, United States of America
- Andrei L. Osterman
- Bioinformatics and Systems Biology, Sanford Burnham Medical Research Institute, La Jolla, California, United States of America
- Dmitry A. Rodionov
- Bioinformatics and Systems Biology, Sanford Burnham Medical Research Institute, La Jolla, California, United States of America
- Irina A. Rodionova
- Bioinformatics and Systems Biology, Sanford Burnham Medical Research Institute, La Jolla, California, United States of America
- Kenneth E. Rudd
- Department of Biochemistry and Molecular Biology, University of Miami, Miami, Florida, United States of America
- Dieter Söll
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- James Spain
- School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America
- Shuang-yong Xu
- New England Biolabs, Ipswich, Massachusetts, United States of America
- Alex Bateman
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom
- Robert M. Blumenthal
- Department of Medical Microbiology and Immunology, and Program in Bioinformatics, University of Toledo, Toledo, Ohio, United States of America
- J. Martin Bollinger
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Woo-Suk Chang
- Department of Biology, University of Texas-Arlington, Arlington, Texas, United States of America
- Manuel Ferrer
- Spanish National Research Council (CSIC), Institute of Catalysis, Madrid, Spain
- Iddo Friedberg
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
- Michael Y. Galperin
- National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Julien Gobeill
- Department of Library and Information Sciences, University of Applied Sciences Western Switzerland, Geneva, Switzerland
- Bibliomics and Text Mining Group, Swiss Institute of Bioinformatics, Geneva, Switzerland
- Daniel Haft
- J. Craig Venter Institute, Rockville, Maryland, United States of America
- John Hunt
- Biological Sciences, Columbia University, New York, New York, United States of America
- Peter Karp
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Menlo Park, California, United States of America
- William Klimke
- National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Carsten Krebs
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Dana Macelis
- New England Biolabs, Ipswich, Massachusetts, United States of America
- Ramana Madupu
- J. Craig Venter Institute, Rockville, Maryland, United States of America
- Maria J. Martin
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom
- Jeffrey H. Miller
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, Los Angeles, California, United States of America
- Claire O'Donovan
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom
- Bernhard Palsson
- Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Patrick Ruch
- Department of Library and Information Sciences, University of Applied Sciences Western Switzerland, Geneva, Switzerland
- Bibliomics and Text Mining Group, Swiss Institute of Bioinformatics, Geneva, Switzerland
- Aaron Setterdahl
- Department of Chemistry, Indiana University Southeast, New Albany, Indiana, United States of America
- Granger Sutton
- J. Craig Venter Institute, Rockville, Maryland, United States of America
- John Tate
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom
- Alexander Yakunin
- Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario, Canada
- Dmitri Tchigvintsev
- Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario, Canada
- Germán Plata
- Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, United States of America
- Integrated Program in Cellular, Molecular, Structural, and Genetic Studies, Columbia University, New York, New York, United States of America
- Jie Hu
- Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, United States of America
- Russell Greiner
- Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
- David Horn
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Kimmen Sjölander
- Berkeley Phylogenomics Group, University of California, Berkeley, California, United States of America
- Steven L. Salzberg
- Departments of Medicine and Biostatistics, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, United States of America
- Dennis Vitkup
- Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, United States of America
- Stanley Letovsky
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Daniel Segrè
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Charles DeLisi
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Richard J. Roberts
- New England Biolabs, Ipswich, Massachusetts, United States of America
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Martin Steffen
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Simon Kasif
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- * E-mail: (BPA); (SK)
8
Gao B, Gupta RS. Phylogenetic framework and molecular signatures for the main clades of the phylum Actinobacteria. Microbiol Mol Biol Rev 2012; 76:66-112. [PMID: 22390973 PMCID: PMC3294427 DOI: 10.1128/mmbr.05011-11] [Citation(s) in RCA: 168] [Impact Index Per Article: 12.9] [Indexed: 01/15/2023] Open
Abstract
The phylum Actinobacteria harbors many important human pathogens and also provides one of the richest sources of natural products, including numerous antibiotics and other compounds of biotechnological interest. Thus, a reliable phylogeny of this large phylum and the means to accurately identify its different constituent groups are of much interest. Detailed phylogenetic and comparative analyses of >150 actinobacterial genomes reported here form the basis for achieving these objectives. In phylogenetic trees based upon 35 conserved proteins, most of the main groups of Actinobacteria as well as a number of their suprageneric clades are resolved. We also describe large numbers of molecular markers consisting of conserved signature indels in protein sequences and whole proteins that are specific for either all Actinobacteria or their different clades (viz., orders, families, genera, and subgenera) at various taxonomic levels. These signatures independently support the existence of different phylogenetic clades, and based upon them, it is now possible to delimit the phylum Actinobacteria (excluding Coriobacteriia) and most of its major groups in clear molecular terms. The species distribution patterns of these markers also provide important information regarding the interrelationships among different main orders of Actinobacteria. The identified molecular markers, in addition to enabling the development of a stable and reliable phylogenetic framework for this phylum, also provide novel and powerful means for the identification of different groups of Actinobacteria in diverse environments. Genetic and biochemical studies on these Actinobacteria-specific markers should lead to the discovery of novel biochemical and/or other properties that are unique to different groups of Actinobacteria.
Affiliation(s)
- Beile Gao
- Department of Biochemistry and Biomedical Science, McMaster University, Hamilton, Ontario, Canada
9
Halachev MR, Loman NJ, Pallen MJ. Calculating orthologs in bacteria and Archaea: a divide and conquer approach. PLoS One 2011; 6:e28388. [PMID: 22174796 PMCID: PMC3236195 DOI: 10.1371/journal.pone.0028388] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 09/07/2011] [Accepted: 11/07/2011] [Indexed: 12/27/2022] Open
Abstract
Among proteins, orthologs are defined as those that are derived by vertical descent from a single progenitor in the last common ancestor of their host organisms. Our goal is to compute a complete set of protein orthologs derived from all currently available complete bacterial and archaeal genomes. Traditional approaches typically rely on all-against-all BLAST searching which is prohibitively expensive in terms of hardware requirements or computational time (requiring an estimated 18 months or more on a typical server). Here, we present xBASE-Orth, a system for ongoing ortholog annotation, which applies a "divide and conquer" approach and adopts a pragmatic scheme that trades accuracy for speed. Starting at species level, xBASE-Orth carefully constructs and uses pan-genomes as proxies for the full collections of coding sequences at each level as it progressively climbs the taxonomic tree using the previously computed data. This leads to a significant decrease in the number of alignments that need to be performed, which translates into faster computation, making ortholog computation possible on a global scale. Using xBASE-Orth, we analyzed an NCBI collection of 1,288 bacterial and 94 archaeal complete genomes with more than 4 million coding sequences in 5 weeks and predicted more than 700 million ortholog pairs, clustered in 175,531 orthologous groups. We have also identified sets of highly conserved bacterial and archaeal orthologs and in so doing have highlighted anomalies in genome annotation and in the proposed composition of the minimal bacterial genome. In summary, our approach allows for scalable and efficient computation of the bacterial and archaeal ortholog annotations. In addition, due to its hierarchical nature, it is suitable for incorporating novel complete genomes and alternative genome annotations. 
The computed ortholog data and a continuously evolving set of applications based on it are integrated in the xBASE database, available at http://www.xbase.ac.uk/.
Affiliation(s)
- Mihail R. Halachev
- School of Biosciences, University of Birmingham, Birmingham, United Kingdom
- Nicholas J. Loman
- School of Biosciences, University of Birmingham, Birmingham, United Kingdom
- Mark J. Pallen
- School of Biosciences, University of Birmingham, Birmingham, United Kingdom
10
Similarity of genes horizontally acquired by Escherichia coli and Salmonella enterica is evidence of a supraspecies pangenome. Proc Natl Acad Sci U S A 2011; 108:20154-9. [PMID: 22128332 DOI: 10.1073/pnas.1109451108] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Indexed: 11/18/2022] Open
Abstract
Most bacterial and archaeal genomes contain many genes with little or no similarity to other genes, a property that impedes identification of gene origins. By comparing the codon usage of genes shared among strains (primarily vertically inherited genes) and genes unique to one strain (primarily recently horizontally acquired genes), we found that the plurality of unique genes in Escherichia coli and Salmonella enterica are much more similar to each other than are their vertically inherited genes. We conclude that E. coli and S. enterica derive these unique genes from a common source, a supraspecies phylogenetic group that includes the organisms themselves. The phylogenetic range of the sharing appears to include other (but not all) members of the Enterobacteriaceae. We found evidence of similar gene sharing in other bacterial and archaeal taxa. Thus, we conclude that frequent gene exchange, particularly that of genetic novelties, extends well beyond accepted species boundaries.
|
11
|
Vishnepolsky B, Pirtskhalava M. CONTSOR--a new knowledge-based fold recognition potential, based on side chain orientation and contacts between residue terminal groups. Protein Sci 2011; 21:134-41. [PMID: 22057923 DOI: 10.1002/pro.763] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2011] [Revised: 10/18/2011] [Accepted: 10/31/2011] [Indexed: 11/09/2022]
Abstract
Recognizing structural similarity without significant sequence identity (fold recognition) is an effective method for protein structure prediction. Previously, we developed a fold recognition potential called SORDIS, which incorporated side chain orientation in relation to hydrophobic core centers, distance of the residues from the protein globule center and secondary structure terms. However, this potential does not include terms based on close contacts between residues. In this paper, a new fold recognition potential, CONTSOR, is presented, which is based on the SORDIS terms plus a term based on contacts between amino acid terminal groups. The performance of this potential was evaluated on the SABmark benchmark for alignment accuracy and on the SABmark and Lindahl benchmarks for fold recognition. The results show that CONTSOR performs best among the compared potentials on the SABmark benchmark, both for alignment accuracy and fold recognition, and is one of the best performers on the Lindahl benchmark. The CONTSOR software package is available for download at http://www.lifescience.org.ge/downloads/contsor.zip.
Affiliation(s)
- Boris Vishnepolsky
- Life Science Research Centre, Laboratory of Bioinformatics, 14 Gotua Street, Tbilisi, Georgia.
|
12
|
Abstract
In the canonical version of evolution by gene duplication, one copy is kept unaltered while the other is free to evolve. This process of evolutionary experimentation can persist for millions of years. Since it is so short lived in comparison to the lifetime of the core genes that make up the majority of most genomes, a substantial fraction of the genome and the transcriptome may, in principle, be attributable to what we will refer to as “evolutionary transients”, referring here to both the process and the genes that have gone through or are undergoing this process. Using the rice gene set as a test case, we argue that this phenomenon goes a long way towards explaining why there are so many more rice genes than Arabidopsis genes, and why most excess rice genes show low similarity to eudicots.
|
13
|
Genomic and functional analyses of Rhodococcus equi phages ReqiPepy6, ReqiPoco6, ReqiPine5, and ReqiDocB7. Appl Environ Microbiol 2010; 77:669-83. [PMID: 21097585 DOI: 10.1128/aem.01952-10] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023] Open
Abstract
The isolation and results of genomic and functional analyses of Rhodococcus equi phages ReqiPepy6, ReqiDocB7, ReqiPine5, and ReqiPoco6 (hereafter referred to as Pepy6, DocB7, Pine5, and Poco6, respectively) are reported. Two phages, Pepy6 and Poco6, more than 75% identical, exhibited genome organization and protein sequence likeness to Lactococcus lactis phage 1706 and clostridial prophage elements. An unusually high fraction, 27%, of Pepy6 and Poco6 proteins were predicted to possess at least one transmembrane domain, a value much higher than the average of 8.5% transmembrane domain-containing proteins determined from a data set of 36,324 phage protein entries. Genome organization and protein sequence comparisons place phage Pine5 as the first nonmycobacteriophage member of the large Rosebush cluster. DocB7, which had the broadest host range among the four isolates, was not closely related to any phage or prophage in the database, and only 23 of 105 predicted encoded proteins could be assigned a functional annotation. Because of the relationship of Rhodococcus to Mycobacterium, it was anticipated that these phages should exhibit some of the features characteristic of mycobacteriophages. Traits that were identified as shared by the Rhodococcus phages and mycobacteriophages include the prevalent long-tailed morphology and the presence of genes encoding LysB-like mycolate-hydrolyzing lysis proteins. Application of DocB7 lysates to soils amended with a host strain of R. equi reduced recoverable bacterial CFU, suggesting that phage may be useful in limiting R. equi load in the environment while foals are susceptible to infection.
|
14
|
Siew N, Fischer D. Unravelling the ORFan Puzzle. Comp Funct Genomics 2010; 4:432-41. [PMID: 18629076 PMCID: PMC2447361 DOI: 10.1002/cfg.311] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2003] [Revised: 06/05/2003] [Accepted: 06/05/2003] [Indexed: 12/27/2022] Open
Abstract
ORFans are open reading frames (ORFs) with no detectable sequence similarity
to any other sequence in the databases. Each newly sequenced genome contains a
significant number of ORFans. Therefore, ORFans entail interesting evolutionary
puzzles. However, little can be learned about them using bioinformatics tools, and
their study seems to have been underemphasized. Here we present some of the
questions that the existence of so many ORFans has raised and review some of
the studies aimed at understanding ORFans, their functions and their origins. These
works have demonstrated that ORFans are an untapped source of research, requiring
further computational and experimental studies.
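The definition of an ORFan given above (an ORF with no detectable similarity to any other sequence) can be expressed as a simple filter over similarity-search results. The sketch below assumes the hits have already been collected into a hypothetical mapping from ORF identifier to the E-values of its database matches; the `1e-3` cutoff is an assumed threshold, not one taken from the article.

```python
def find_orfans(hits: dict[str, list[float]], max_evalue: float = 1e-3) -> list[str]:
    """Return ORFs that have no database hit at or below the E-value cutoff.

    `hits` maps each ORF identifier to the E-values of its matches
    (self-hits excluded); an empty list means nothing was found.
    """
    return sorted(
        orf
        for orf, evalues in hits.items()
        if not any(e <= max_evalue for e in evalues)
    )


# Hypothetical search results for four ORFs.
example_hits = {
    "orf1": [1e-50, 2e-10],  # clear homologs
    "orf2": [],              # no hits at all
    "orf3": [0.5, 3.0],      # only insignificant hits
    "orf4": [1e-4],          # one significant hit
}
print(find_orfans(example_hits))
```

Because the result depends on both the cutoff and the database searched, an ORF classified as an ORFan today may lose that status as more genomes are sequenced.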
Affiliation(s)
- Naomi Siew
- Department of Chemistry, Ben Gurion University, Beer-Sheva 84105, Israel
|
15
|
Abstract
ORFan genes can constitute a large fraction of a bacterial genome, but due to their lack of homologs, their functions have remained largely unexplored. To determine if particular features of ORFan-encoded proteins promote their presence in a genome, we analyzed properties of ORFans that originated over a broad evolutionary timescale. We also compared ORFan genes to another class of acquired genes, heterogeneous occurrence in prokaryotes (HOPs), which have homologs in other bacteria. A total of 54 ORFan and HOP genes selected from different phylogenetic depths in the Escherichia coli lineage were cloned, expressed, purified, and subjected to circular dichroism (CD) spectroscopy. A majority of genes could be expressed, but only 18 yielded sufficient soluble protein for spectral analysis. Of these, half were significantly alpha-helical, three were predominantly beta-sheet, and six were of intermediate/indeterminate structure. Although a higher proportion of HOPs yielded soluble proteins with resolvable secondary structures, ORFans resembled HOPs with regard to most of the other features tested. Overall, we found that those ORFan and HOP genes that have persisted in the E. coli lineage were more likely to encode soluble and folded proteins, more likely to display environmental modulation of their gene expression, and by extrapolation, are more likely to be functional.
Affiliation(s)
- Hema Prasad Narra
- Department of Biochemistry & Molecular Biophysics, University of Arizona, Tucson, AZ 85721, USA
|
16
|
Abstract
Both supervised and unsupervised neural networks have been applied to the prediction of protein structure and function. Here, we focus on feedforward neural networks and describe how these learning machines can be applied to protein prediction. We discuss how to select an appropriate data set, how to choose and encode protein features into the neural network input, and how to assess the predictor's performance.
Affiliation(s)
- Marco Punta
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
|
17
|
Type II restriction endonuclease R.Hpy188I belongs to the GIY-YIG nuclease superfamily, but exhibits an unusual active site. BMC Struct Biol 2008; 8:48. [PMID: 19014591 PMCID: PMC2630997 DOI: 10.1186/1472-6807-8-48] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/06/2008] [Accepted: 11/14/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND Catalytic domains of Type II restriction endonucleases (REases) belong to a few unrelated three-dimensional folds. While the PD-(D/E)XK fold is most common among these enzymes, crystal structures have been also determined for single representatives of two other folds: PLD (R.BfiI) and half-pipe (R.PabI). Bioinformatics analyses supported by mutagenesis experiments suggested that some REases belong to the HNH fold (e.g. R.KpnI), and that a small group represented by R.Eco29kI belongs to the GIY-YIG fold. However, for a large fraction of REases with known sequences, the three-dimensional fold and the architecture of the active site remain unknown, mostly due to extreme sequence divergence that hampers detection of homology to enzymes with known folds. RESULTS R.Hpy188I is a Type II REase with unknown structure. PSI-BLAST searches of the non-redundant protein sequence database reveal only 1 homolog (R.HpyF17I, with nearly identical amino acid sequence and the same DNA sequence specificity). Standard application of state-of-the-art protein fold-recognition methods failed to predict the relationship of R.Hpy188I to proteins with known structure or to other protein families. In order to increase the amount of evolutionary information in the multiple sequence alignment, we have expanded our sequence database searches to include sequences from metagenomics projects. This search resulted in identification of 23 further members of R.Hpy188I family, both from metagenomics and the non-redundant database. Moreover, fold-recognition analysis of the extended R.Hpy188I family revealed its relationship to the GIY-YIG domain and allowed for computational modeling of the R.Hpy188I structure. Analysis of the R.Hpy188I model in the light of sequence conservation among its homologs revealed an unusual variant of the active site, in which the typical Tyr residue of the YIG half-motif had been substituted by a Lys residue. 
Moreover, some of its homologs have the otherwise invariant Arg residue in a non-homologous position in sequence that nonetheless allows for spatial conservation of the guanidino group potentially involved in phosphate binding. CONCLUSION The present study eliminates a significant "white spot" on the structural map of REases. It also provides important insight into sequence-structure-function relationships in the GIY-YIG nuclease superfamily. Our results reveal that in the case of proteins with no or few detectable homologs in the standard "non-redundant" database, it is useful to expand this database by adding the metagenomic sequences, which may provide evolutionary linkage to detect more remote homologs.
|
18
|
Koonin EV, Wolf YI. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res 2008; 36:6688-719. [PMID: 18948295 PMCID: PMC2588523 DOI: 10.1093/nar/gkn668] [Citation(s) in RCA: 480] [Impact Index Per Article: 28.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open
Abstract
The first bacterial genome was sequenced in 1995, and the first archaeal genome in 1996. Soon after these breakthroughs, an exponential rate of genome sequencing was established, with a doubling time of approximately 20 months for bacteria and approximately 34 months for archaea. Comparative analysis of the hundreds of sequenced bacterial and dozens of archaeal genomes leads to several generalizations on the principles of genome organization and evolution. A crucial finding that enables functional characterization of the sequenced genomes and evolutionary reconstruction is that the majority of archaeal and bacterial genes have conserved orthologs in other, often distant, organisms. However, comparative genomics also shows that horizontal gene transfer (HGT) is a dominant force of prokaryotic evolution, along with the loss of genetic material resulting in genome contraction. A crucial component of the prokaryotic world is the mobilome, the enormous collection of viruses, plasmids and other selfish elements, which are in constant exchange with more stable chromosomes and serve as HGT vehicles. Thus, the prokaryotic genome space is a tightly connected, although compartmentalized, network, a novel notion that undermines the ‘Tree of Life’ model of evolution and requires a new conceptual framework and tools for the study of prokaryotic evolution.
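To make the quoted doubling times concrete, the sketch below applies the standard exponential growth relation, growth factor = 2^(t / T), where T is the doubling time. Only the 20-month and 34-month figures come from the abstract; the 60-month horizon and the function name are assumptions chosen for illustration.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Multiplicative increase after `elapsed_months` of exponential growth."""
    return 2 ** (elapsed_months / doubling_months)


# Doubling times quoted in the abstract.
BACTERIA_DOUBLING = 20
ARCHAEA_DOUBLING = 34

# Hypothetical horizon: five years.
bacteria_5y = growth_factor(60, BACTERIA_DOUBLING)
archaea_5y = growth_factor(60, ARCHAEA_DOUBLING)
print(f"bacteria: x{bacteria_5y:.1f}, archaea: x{archaea_5y:.1f}")
```

Under these assumptions the number of bacterial genomes grows eightfold over five years, while the archaeal count grows a little more than threefold, so the gap between the two collections widens over time.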
Affiliation(s)
- Eugene V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
|
19
|
Davids W, Fuxelius HH, Andersson SGE. The Journey to smORFland. Comp Funct Genomics 2008; 4:537-41. [PMID: 18629011 PMCID: PMC2447293 DOI: 10.1002/cfg.325] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2003] [Revised: 08/06/2003] [Accepted: 08/06/2003] [Indexed: 11/12/2022] Open
Abstract
The genome sequences completed so far contain more than 20 000 genes with unknown function and no similarity to genes in other genomes. The origin and evolution of the
orphan genes is an enigma. Here, we discuss the suggestion that some orphan genes
may represent pseudogenes or short fragments of genes that were functional in the
genome of a common ancestor. These may be the remains of unsuccessful duplication
or horizontal gene transfer events, in which the acquired sequences have entered the
fragmentation process and thereby lost their similarity to genes in other species. This
scenario is supported by a recent case study of orphan genes in several closely related
species of Rickettsia, where full-length ancestral genes were reconstructed from sets
of short, overlapping orphan genes. One of these was found to display similarity to
genes encoding proteins with ankyrin-repeat domains.
Affiliation(s)
- Wagied Davids
- Department of Molecular Evolution, Evolutionary Biology Center, Uppsala University, Norbyvägen 18C, Uppsala 752 36, Sweden
|
20
|
Wasmuth J, Schmid R, Hedley A, Blaxter M. On the extent and origins of genic novelty in the phylum Nematoda. PLoS Negl Trop Dis 2008; 2:e258. [PMID: 18596977 PMCID: PMC2432500 DOI: 10.1371/journal.pntd.0000258] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2008] [Accepted: 06/09/2008] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND The phylum Nematoda is biologically diverse, including parasites of plants and animals as well as free-living taxa. Underpinning this diversity will be commensurate diversity in expressed genes, including gene sets associated specifically with evolution of parasitism. METHODS AND FINDINGS Here we have analyzed the extensive expressed sequence tag data (available for 37 nematode species, most of which are parasites) and define over 120,000 distinct putative genes from which we have derived robust protein translations. Combined with the complete proteomes of Caenorhabditis elegans and Caenorhabditis briggsae, these proteins have been grouped into 65,000 protein families that in turn contain 40,000 distinct protein domains. We have mapped the occurrence of domains and families across the Nematoda and compared the nematode data to that available for other phyla. Gene loss is common, and in particular we identify nearly 5,000 genes that may have been lost from the lineage leading to the model nematode C. elegans. We find a preponderance of novelty, including 56,000 nematode-restricted protein families and 26,000 nematode-restricted domains. Mapping of the latest time-of-origin of these new families and domains across the nematode phylogeny revealed ongoing evolution of novelty. A number of genes from parasitic species had signatures of horizontal transfer from their host organisms, and parasitic species had a greater proportion of novel, secreted proteins than did free-living ones. CONCLUSIONS These classes of genes may underpin parasitic phenotypes, and thus may be targets for development of effective control measures.
Affiliation(s)
- James Wasmuth
- Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
- Program for Molecular Structure and Function, Hospital for Sick Children, Toronto, Ontario, Canada
- Ralf Schmid
- Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
- Department of Biochemistry, University of Leicester, Leicester, United Kingdom
- Ann Hedley
- Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
- Mark Blaxter
- Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
|
21
|
Fuxelius HH, Darby AC, Cho NH, Andersson SGE. Visualization of pseudogenes in intracellular bacteria reveals the different tracks to gene destruction. Genome Biol 2008; 9:R42. [PMID: 18302730 PMCID: PMC2374718 DOI: 10.1186/gb-2008-9-2-r42] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2007] [Revised: 02/13/2008] [Accepted: 02/26/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Pseudogenes reveal ancestral gene functions. Some obligate intracellular bacteria, such as Mycobacterium leprae and Rickettsia spp., carry substantial fractions of pseudogenes. Until recently, horizontal gene transfers were considered to be rare events in obligate host-associated bacteria. RESULTS We present a visualization tool that displays the relationships and positions of degraded and partially overlapping gene sequences in multiple genomes. With this tool we explore the origin and deterioration patterns of the Rickettsia pseudogenes and find that variably present genes and pseudogenes tend to have been acquired more recently, are more divergent in sequence, and exhibit a different functional profile compared with genes conserved across all species. Overall, the origin of only one-quarter of the variable genes and pseudogenes can be traced back to the common ancestor of Rickettsia and the outgroup genera Orientia and Wolbachia. These sequences contain only a few disruptive mutations and show a broad functional distribution profile, much like the core genes. The remaining genes and pseudogenes are extensively degraded or solely present in a single species. Their functional profile was heavily biased toward the mobile gene pool and genes for components of the cell wall and the lipopolysaccharide. CONCLUSION Reductive evolution of the vertically inherited genomic core accounts for 25% of the predicted genes in the variable segments of the Rickettsia genomes, whereas 75% stems from the flux of the mobile gene pool along with genes for cell surface structures. Thus, most of the variably present genes and pseudogenes in Rickettsia have arisen from recent acquisitions.
Affiliation(s)
- Hans-Henrik Fuxelius
- Department of Molecular Evolution, Evolutionary Biology Center, Uppsala University, Norbyvägen 18C, S-752 36 Uppsala, Sweden.
|
22
|
Yin Y, Fischer D. Identification and investigation of ORFans in the viral world. BMC Genomics 2008; 9:24. [PMID: 18205946 PMCID: PMC2245933 DOI: 10.1186/1471-2164-9-24] [Citation(s) in RCA: 79] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2007] [Accepted: 01/19/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Genome-wide studies have already shed light on the evolution and enormous diversity of the viral world. Nevertheless, one of the unresolved mysteries in comparative genomics today is the abundance of ORFans - ORFs with no detectable sequence similarity to any other ORF in the databases. Recently, studies attempting to understand the origin and functions of bacterial ORFans have been reported. Here we present a first genome-wide identification and analysis of ORFans in the viral world, with focus on bacteriophages. RESULTS Almost one-third of all ORFs in 1,456 complete virus genomes correspond to ORFans, a figure significantly larger than that observed in prokaryotes. Like prokaryotic ORFans, viral ORFans are shorter and have a lower GC content than non-ORFans. Nevertheless, a statistically significant lower GC content is found in only a minority of viruses. By focusing on phages, we find that 38.4% of phage ORFs have no homologs in other phages, and 30.1% have no homologs in either the viral or the prokaryotic world. Phages with different host ranges have different percentages of ORFans, reflecting different sampling status and suggesting various diversities. Similarity searches of the phage ORFeome (ORFans and non-ORFans) against prokaryotic genomes show that almost half of the phage ORFs have prokaryotic homologs, suggesting the major role that horizontal transfer plays in bacterial evolution. Surprisingly, the percentage of phage ORFans with prokaryotic homologs is only 18.7%. This suggests that phage ORFans play a lesser role in horizontal transfer to prokaryotes, but may be among the major players contributing to the vast phage diversity. CONCLUSION Although the current sampling of viral genomes is extremely low, ORFans and near-ORFans are likely to continue to grow in number as more genomes are sequenced. 
The abundance of phage ORFans may be partially due to the expected vast viral diversity, and may be instrumental in understanding viral evolution. The functions, origins and fates of the majority of viral ORFans remain a mystery. Further computational and experimental studies are likely to shed light on the mechanisms that have given rise to so many bacterial and viral ORFans.
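The comparison of length and GC content between ORFans and non-ORFans mentioned in the results can be sketched as follows. The sequences are made-up placeholders, the helper names are hypothetical, and the abstract does not describe the statistical test the authors applied, so none is included here.

```python
from statistics import mean


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def summarize(seqs: list[str]) -> tuple[float, float]:
    """Mean length and mean GC content of a non-empty group of sequences."""
    return mean(len(s) for s in seqs), mean(gc_content(s) for s in seqs)


# Hypothetical toy sequences, not real ORFs.
orfans = ["ATGAAATTTTAA", "ATGTATAAATGA"]
non_orfans = ["ATGGCCGGCGCCGGGTGA", "ATGCGCGGCCTGGCGTAA"]

print("ORFans:", summarize(orfans))
print("non-ORFans:", summarize(non_orfans))
```

On real data, the same per-group summaries would be computed genome by genome before asking whether the difference is statistically significant.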
Affiliation(s)
- Yanbin Yin
- Computer Science and Engineering Dept, 201 Bell Hall, University at Buffalo, Buffalo, NY 14260-2000, USA.
|
23
|
Vishnepolsky B, Managadze G, Pirtskhalava M. Comparison of the efficiency of evolutionary change-based and side chain orientation-based fold recognition potentials. Proteins 2008; 71:1863-78. [PMID: 18175309 DOI: 10.1002/prot.21871] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
The present article describes SORDIS, a residue-level knowledge-based potential. SORDIS incorporates information on side-chain orientation in relation to hydrophobic core centres, distance of the residue from the globule centre and secondary structure. SORDIS has been tested and compared with widely used evolutionary change-based substitution matrices (BLOSUM, PAM, GONNET, Johnson-Overington, BLAJ, HSDM, and STROMA) in fold recognition experiments within the zone of weak sequence similarity (<16%). The results show that the lower the amino acid similarity between homologous pairs, the better SORDIS performs in comparison with potentials based on information about evolutionary changes. Therefore, we propose that the use of SORDIS in fold recognition can be useful.
Affiliation(s)
- Boris Vishnepolsky
- Institute of Molecular Biology and Biological Physics, Tbilisi 0160, Georgia
|
24
|
Affiliation(s)
- Dmitrij Frishman
- Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany
|
25
|
Jain M, Khurana P, Tyagi AK, Khurana JP. Genome-wide analysis of intronless genes in rice and Arabidopsis. Funct Integr Genomics 2007; 8:69-78. [PMID: 17578610 DOI: 10.1007/s10142-007-0052-9] [Citation(s) in RCA: 83] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2007] [Revised: 04/07/2007] [Accepted: 05/06/2007] [Indexed: 10/23/2022]
Abstract
Intronless genes, a characteristic feature of prokaryotes, constitute a significant portion of the eukaryotic genomes. Our analysis revealed the presence of 11,109 (19.9%) and 5,846 (21.7%) intronless genes in rice and Arabidopsis genomes, respectively, belonging to different cellular role and gene ontology categories. The distribution and conservation of rice and Arabidopsis intronless genes among different taxonomic groups have been analyzed. A total of 301 and 296 intronless genes from rice and Arabidopsis, respectively, are conserved among organisms representing the three major domains of life, i.e., archaea, bacteria, and eukaryotes. These evolutionarily conserved proteins are predicted to be involved in housekeeping cellular functions. Interestingly, among the 68% of rice and 77% of Arabidopsis intronless genes present only in eukaryotic genomes, approximately 51% and 57% genes have orthologs only in plants, and thus may represent the plant-specific genes. Furthermore, 831 and 144 intronless genes of rice and Arabidopsis, respectively, referred to as ORFans, do not exhibit homology to any of the genes in the database and may perform species-specific functions. These data can serve as a resource for further comparative, evolutionary, and functional analysis of intronless genes in plants and other organisms.
Affiliation(s)
- Mukesh Jain
- Interdisciplinary Centre for Plant Genomics and Department of Plant Molecular Biology, University of Delhi South Campus, Benito Juarez Road, New Delhi 110 021, India
|
26
|
Saini HK, Fischer D. FRalanyzer: a tool for functional analysis of fold-recognition sequence-structure alignments. Nucleic Acids Res 2007; 35:W499-502. [PMID: 17537819 PMCID: PMC1933221 DOI: 10.1093/nar/gkm367] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
We describe FRalanyzer (Fold Recognition alignment analyzer), a new web tool to visually inspect sequence–structure alignments in order to predict functionally important residues in a query sequence of unknown function. This tool is aimed at helping to infer functional relationships between a query sequence and a template structure, and is particularly useful in analyzing fold recognition (FR) results. Because similar folds do not necessarily share the same function, it is not always straightforward to infer a function from an FR result alone. Manual inspection of the FR sequence–structure alignment is often required in order to search for conservation of functionally important residues. FRalanyzer automates parts of this time-consuming process. FRalanyzer takes as input a sequence–structure alignment, automatically searches annotated databases, displays functionally significant residues and highlights the functionally important positions that are identical in the alignment. FRalanyzer can also be used with sequence–structure alignments obtained by other methods, and with structure–structure alignments obtained from structural comparison of newly determined 3D-structures of unknown function. FRalanyzer is available at http://fralanyzer.cse.buffalo.edu/.
Affiliation(s)
- Harpreet Kaur Saini
- Computer Science and Engineering Department, 201 Bell Hall University at Buffalo, Buffalo, NY 14260, USA.
|
27
|
Lima-Mendez G, Toussaint A, Leplae R. Analysis of the phage sequence space: the benefit of structured information. Virology 2007; 365:241-9. [PMID: 17482656 DOI: 10.1016/j.virol.2007.03.047] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2007] [Revised: 03/07/2007] [Accepted: 03/28/2007] [Indexed: 11/26/2022]
Abstract
Phages are the most abundant biological entities on Earth and are central players in the evolution of their bacterial hosts and the emergence of new pathogens. In addition, they bear an enormous potential for the development of new drugs, therapies or nanotechnologies. As a result, interest in phages is reviving. In the genomic era, our perspective on the phage sequence space remains incredibly sparse. The modular and combinatorial structure of phage genomes is largely documented. It is confirmed by new sequence information and it fuels a recurrent debate on the need to revise phage taxonomy. The absence of structured, computer readable information on phages is a major bottleneck for an extensive global analysis of phage genomes and their relationships, but such information is essential to reassess phage classification. Based on the ACLAME database, which is dedicated to the organization and analysis of prokaryotic mobile genetic elements, we discuss here how structured information on phage-encoded proteins helps global in silico analysis and allows the prediction of prophages in bacterial genome sequences, providing access to additional phage sequence information.
Affiliation(s)
- Gipsi Lima-Mendez
- Service de Conformation de Macromolécules Biologiques et de Bioinformatique, Université Libre de Bruxelles, CP 263, Boulevard du Triomphe, 1050, Bruxelles, Belgium.
|
28
|
Wilson GA, Feil EJ, Lilley AK, Field D. Large-scale comparative genomic ranking of taxonomically restricted genes (TRGs) in bacterial and archaeal genomes. PLoS One 2007; 2:e324. [PMID: 17389915 PMCID: PMC1824705 DOI: 10.1371/journal.pone.0000324] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2007] [Accepted: 02/18/2007] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Lineage-specific, or taxonomically restricted genes (TRGs), especially those that are species and strain-specific, are of special interest because they are expected to play a role in defining exclusive ecological adaptations to particular niches. Despite this, they are relatively poorly studied and little understood, in large part because many are still orphans or only have homologues in very closely related isolates. This lack of homology confounds attempts to establish the likelihood that a hypothetical gene is expressed and, if so, to determine the putative function of the protein. METHODOLOGY/PRINCIPAL FINDINGS We have developed "QIPP" ("Quality Index for Predicted Proteins"), an index that scores the "quality" of a protein based on non-homology-based criteria. QIPP can be used to assign a value between zero and one to any protein based on comparing its features to other proteins in a given genome. We have used QIPP to rank the predicted proteins in the proteomes of Bacteria and Archaea. This ranking reveals that there is a large amount of variation in QIPP scores, and identifies many high-scoring orphans as potentially "authentic" (expressed) orphans. There are significant differences in the distributions of QIPP scores between orphan and non-orphan genes for many genomes and a trend for less well-conserved genes to have lower QIPP scores. CONCLUSIONS The implication of this work is that QIPP scores can be used to further annotate predicted proteins with information that is independent of homology. Such information can be used to prioritize candidates for further analysis. Data generated for this study can be found in the OrphanMine at http://www.genomics.ceh.ac.uk/orphan_mine.
Affiliation(s)
- Gareth A Wilson
- Centre for Ecology and Hydrology (CEH) Oxford, Oxford, United Kingdom.
|
29
|
Leplae R, Lima-Mendez G, Toussaint A. A first global analysis of plasmid encoded proteins in the ACLAME database. FEMS Microbiol Rev 2006; 30:980-94. [PMID: 17064288 DOI: 10.1111/j.1574-6976.2006.00044.x] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Indexed: 11/30/2022] Open
Abstract
Many plasmids are mobile genetic elements (MGEs) and, as other members of that group of DNA entities, their genomes display a mosaic and combinatorial structure, making their classification extremely difficult. As other MGEs, plasmids play a major role in horizontal transfer of genetic materials and genome reorganization. Yet, the full impact of such phenomenon on major properties of the host cell, such as pathogenicity, the ability to use new carbon sources or resistance to antibiotics, remains to be fully assessed. More and more complete plasmid genome sequences are available. However, in the absence of standards for storing plasmid sequence data and annotating genes and gene products on sequenced plasmid genomes, the resulting information remains rather limited. Using 503 sequenced plasmids organized in the ACLAME database, we discuss how, by structuring information on the genomes, their host and the proteins they code for, one can gain access to either global or more detailed analysis of the plasmid sequence information, as illustrated by a network representation of the relationships between plasmids.
Affiliation(s)
- Raphaël Leplae
- SCMBB, Université Libre de Bruxelles, Bvd du Triomphe, Bruxelles, Belgium.
|
30
|
Yin Y, Fischer D. On the origin of microbial ORFans: quantifying the strength of the evidence for viral lateral transfer. BMC Evol Biol 2006; 6:63. [PMID: 16914045 PMCID: PMC1559721 DOI: 10.1186/1471-2148-6-63] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.5] [Received: 04/10/2006] [Accepted: 08/16/2006] [Indexed: 11/10/2022] Open
Abstract
Background: The origin of microbial ORFans, ORFs having no detectable homology to other ORFs in the databases, is one of the unexplained puzzles of the post-genomic era. Several hypotheses on the origin of ORFans have been suggested in the last few years, most of which are based on selected, relatively small, subsets of ORFans. One of the hypotheses for the origin of ORFans is that they have been acquired through lateral transfer from viruses. Here we carry out a comprehensive, genome-wide study on the origins of ORFans to quantify the strength of current evidence supporting this hypothesis. Results: We performed similarity searches by querying all current ORFans against the public virus protein database. Surprisingly, we found that only 2.8% of all microbial ORFans have detectable homologs in viruses, while the percentage of non-ORFans with detectable homologs in viruses is 7.9%, a significantly higher figure. This suggests that the current evidence for the origin of ORFans from lateral transfer from viruses is at best weak. However, an analysis of individual genomes revealed a number of organisms with much higher percentages, many of them belonging to the Firmicutes and Gamma-proteobacteria. We provide evidence suggesting that the current virus database may be biased towards those viruses attacking Firmicutes and Gamma-proteobacteria. Conclusion: We conclude that as more viral genomes are sequenced, more microbial ORFans will find homologs in viruses, but this trend may vary much for individual genomes. Thus, lateral transfer from viruses alone is unlikely to explain the origin of the majority of ORFans in the majority of prokaryotes and consequently, other, not necessarily exclusive, mechanisms are likely to better explain the origin of the increasing number of ORFans.
Affiliation(s)
- Yanbin Yin
- Computer Science and Engineering Dept. 201 Bell Hall, University at Buffalo, Buffalo, NY 14260-2000, US
- Daniel Fischer
- Computer Science and Engineering Dept. 201 Bell Hall, University at Buffalo, Buffalo, NY 14260-2000, US
- Bioinformatics/Dept. of Computer Science, Ben Gurion University, Beer-Sheva 84015, Israel
|
31
|
Abulencia CB, Wyborski DL, Garcia JA, Podar M, Chen W, Chang SH, Chang HW, Watson D, Brodie EL, Hazen TC, Keller M. Environmental whole-genome amplification to access microbial populations in contaminated sediments. Appl Environ Microbiol 2006; 72:3291-301. [PMID: 16672469 PMCID: PMC1472342 DOI: 10.1128/aem.72.5.3291-3301.2006] [Citation(s) in RCA: 140] [Impact Index Per Article: 7.4] [Indexed: 11/20/2022] Open
Abstract
Low-biomass samples from nitrate and heavy metal contaminated soils yield DNA amounts that have limited use for direct, native analysis and screening. Multiple displacement amplification (MDA) using phi29 DNA polymerase was used to amplify whole genomes from environmental, contaminated, subsurface sediments. By first amplifying the genomic DNA (gDNA), biodiversity analysis and gDNA library construction of microbes found in contaminated soils were made possible. The MDA method was validated by analyzing amplified genome coverage from approximately five Escherichia coli cells, resulting in 99.2% genome coverage. The method was further validated by confirming overall representative species coverage and also an amplification bias when amplifying from a mix of eight known bacterial strains. We extracted DNA from samples with extremely low cell densities from a U.S. Department of Energy contaminated site. After amplification, small-subunit rRNA analysis revealed relatively even distribution of species across several major phyla. Clone libraries were constructed from the amplified gDNA, and a small subset of clones was used for shotgun sequencing. BLAST analysis of the library clone sequences showed that 64.9% of the sequences had significant similarities to known proteins, and "clusters of orthologous groups" (COG) analysis revealed that more than half of the sequences from each library contained sequence similarity to known proteins. The libraries can be readily screened for native genes or any target of interest. Whole-genome amplification of metagenomic DNA from very minute microbial sources, while introducing an amplification bias, will allow access to genomic information that was not previously accessible. The reported SSU rRNA sequences and library clone end sequences are listed with their respective GenBank accession numbers, DQ 404590 to DQ 404652, DQ 404654 to DQ 404938, and DX 385314 to DX 389173.
|
32
|
Fischer D. Servers for protein structure prediction. Curr Opin Struct Biol 2006; 16:178-82. [PMID: 16546376 DOI: 10.1016/j.sbi.2006.03.004] [Citation(s) in RCA: 67] [Impact Index Per Article: 3.5] [Received: 01/23/2006] [Revised: 02/14/2006] [Accepted: 03/07/2006] [Indexed: 11/18/2022]
Abstract
The 1990s cultivated a generation of protein structure human predictors. As a result of structural genomics and genome sequencing projects, and significant improvements in the performance of protein structure prediction methods, a generation of automated servers has evolved in the past few years. Servers for close and distant homology modeling are now routinely used by many biologists, and have already been applied to the experimental structure determination process itself, and to the interpretation and annotation of genome sequences. Because dozens of servers are currently available, it is hard for a biologist to know which server(s) to use; however, the state of the art of these methods is now assessed through the LiveBench and CAFASP experiments. Meta-servers--servers that use the results of other autonomous servers to produce a consensus prediction--have proven to be the best performers, and are already challenging all but a handful of expert human predictors. The difference in performance of the top ten autonomous (non-meta) servers is small and hard to assess using relatively small test sets. Recent experiments suggest that servers will soon free humans from most of the burden of protein structure prediction.
Affiliation(s)
- Daniel Fischer
- Buffalo Center of Excellence in Bioinformatics, and Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY 14260, USA.
|
33
|
Griffiths E, Ventresca MS, Gupta RS. BLAST screening of chlamydial genomes to identify signature proteins that are unique for the Chlamydiales, Chlamydiaceae, Chlamydophila and Chlamydia groups of species. BMC Genomics 2006; 7:14. [PMID: 16436211 PMCID: PMC1403754 DOI: 10.1186/1471-2164-7-14] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Received: 07/04/2005] [Accepted: 01/25/2006] [Indexed: 11/24/2022] Open
Abstract
Background Chlamydiae species are of much importance from a clinical viewpoint. Their diversity both in terms of their numbers as well as clinical involvement are presently believed to be significantly underestimated. The obligate intracellular nature of chlamydiae has also limited their genetic and biochemical studies. Thus, it is of importance to develop additional means for their identification and characterization. Results We have carried out analyses of available chlamydiae genomes to identify sets of unique proteins that are either specific for all Chlamydiales genomes, or different Chlamydiaceae family members, or members of the Chlamydia and Chlamydophila genera, or those unique to Protochlamydia amoebophila, but which are not found in any other bacteria. In total, 59 Chlamydiales-specific proteins, 79 Chlamydiaceae-specific proteins, 20 proteins each that are specific for both Chlamydia and Chlamydophila and 445 ORFs that are Protochlamydia-specific were identified. Additionally, 33 cases of possible gene loss or lateral gene transfer were also detected. Conclusion The identified chlamydiae-lineage specific proteins, many of which are highly conserved, provide novel biomarkers that should prove of much value in the diagnosis of these bacteria and in exploration of their prevalence and diversity. These conserved protein sequences (CPSs) also provide novel therapeutic targets for drugs that are specific for these bacteria. Lastly, functional studies on these chlamydiae or chlamydiae subgroup-specific proteins should lead to important insights into lineage-specific adaptations with regards to development, infectivity and pathogenicity.
Affiliation(s)
- Emma Griffiths
- Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, L8N 3Z5, Canada
- Michael S Ventresca
- Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, L8N 3Z5, Canada
- Radhey S Gupta
- Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, L8N 3Z5, Canada
|
34
|
Saqi MAS, Wild DL. Expectations from structural genomics revisited: an analysis of structural genomics targets. Am J Pharmacogenomics 2005; 5:339-42. [PMID: 16196503 DOI: 10.2165/00129785-200505050-00006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/02/2022]
Abstract
BACKGROUND Current structural genomics projects are being driven by two main goals; to produce a representative set of protein folds that could be used as templates for comparative modeling purposes, and to provide insight into the function of the currently unannotated protein sequences. Such projects may reveal that a newly determined protein structure shares structural similarity with a previously observed structure or that it is a novel fold. The manner in which structure can be used to suggest the function of a protein will depend on the number and diversity of homologous sequences and the extent to which these sequences are functionally characterized. METHOD AND RESULTS Using sequence searching methods, we analyzed structural genomics target sequences to ascertain if they were members of functionally characterized protein families, protein families of unknown function, or orphan sequences. This analysis provided an indication of what could be expected to emerge from structural genomics projects. Matches were found to approximately 25% of the current functionally unannotated protein families in the PFAM database (protein families database of alignments and hidden Markov models). The 16% of strict orphan sequences will be the most problematic if their structures reveal novel folds. However, out of the remaining target sequences that match families whose members are largely of unknown function, 28% are particularly interesting in that they are part of protein families with considerable sequence diversity. CONCLUSION The determination of a new structure of a member of these families is likely to offer considerable insight into possible functional roles of these proteins even if it is a new fold. Mapping the sequence conservation onto the structure may reveal functionally important residues for further study by experimental methods.
Affiliation(s)
- Mansoor A S Saqi
- Queen Mary's School of Medicine and Dentistry, Institute of Cell and Molecular Sciences, Barts and The London, Queen Mary, University of London, London, England
|
35
|
Lubec G, Afjehi-Sadat L, Yang JW, John JPP. Searching for hypothetical proteins: theory and practice based upon original data and literature. Prog Neurobiol 2005; 77:90-127. [PMID: 16271823 DOI: 10.1016/j.pneurobio.2005.10.001] [Citation(s) in RCA: 120] [Impact Index Per Article: 6.0] [Received: 05/23/2005] [Revised: 09/18/2005] [Accepted: 10/02/2005] [Indexed: 12/29/2022]
Abstract
A large part of mammalian proteomes is represented by hypothetical proteins (HP), i.e. proteins predicted from nucleic acid sequences only and protein sequences with unknown function. Databases are far from being complete and errors are expected. The legion of HP is awaiting experiments to show their existence at the protein level and subsequent bioinformatic handling in order to assign proteins a tentative function is mandatory. Two-dimensional gel-electrophoresis with subsequent mass spectrometrical identification of protein spots is an appropriate tool to search for HP in the high-throughput mode. Spots are identified by MS or by MS/MS measurements (MALDI-TOF, MALDI-TOF-TOF) and subsequent software as e.g. Mascot or ProFound. In many cases proteins can thus be unambiguously identified and characterised; if this is not the case, de novo sequencing or Q-TOF analysis is warranted. If the protein is not identified, the sequence is being sent to databases for BLAST searches to determine identities/similarities or homologies to known proteins. If no significant identity to known structures is observed, the protein sequence is examined for the presence of functional domains (databases PROSITE, PRINTS, InterPro, ProDom, Pfam and SMART), subjected to searches for motifs (ELM) and finally protein-protein interaction databases (InterWeaver, STRING) are consulted or predictions from conformations are performed. We here provide information about hypothetical proteins in terms of protein chemical analysis, independent of antibody availability and specificity and bioinformatic handling to contribute to the extension/completion of protein databases and include original work on HP in the brain to illustrate the processes of HP identification and functional assignment.
Affiliation(s)
- Gert Lubec
- Department of Pediatrics, Division of Basic Sciences, Medical University of Vienna, Waehringer Guertel 18-20, A-1090, Vienna, Austria.
|
36
|
Renesto P, Azza S, Dolla A, Fourquet P, Vestris G, Gorvel JP, Raoult D. Proteome analysis of Rickettsia conorii by two-dimensional gel electrophoresis coupled with mass spectrometry. FEMS Microbiol Lett 2005; 245:231-8. [PMID: 15837377 DOI: 10.1016/j.femsle.2005.03.004] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.0] [Received: 01/28/2005] [Revised: 03/04/2005] [Accepted: 03/04/2005] [Indexed: 10/25/2022] Open
Abstract
The availability of genome sequence offers the opportunity to further expand our knowledge about proteins expressed by Rickettsia conorii, strictly intracellular bacterium responsible for Mediterranean spotted fever. Using two-dimensional polyacrylamide gel electrophoresis combined with MALDI-TOF mass spectrometry, we established the first reference map of R. conorii proteome. This approach also allowed identification of GroEL as the major antigen recognized by rabbit serum and sera of infected patients. Altogether, this work opens the way to characterize the proteome of R. conorii, to compare protein profiles of different isolates or of bacteria maintained under different experimental conditions and to identify immunogenic proteins as potential vaccine targets.
Affiliation(s)
- Patricia Renesto
- Unité des Rickettsies, CNRS UMR 6020, IFR-48, Faculté de Médecine, 27 Boulevard Jean Moulin, 13385 Marseille, France.
|
37
|
Doolittle RF. Evolutionary aspects of whole-genome biology. Curr Opin Struct Biol 2005; 15:248-53. [PMID: 15963888 DOI: 10.1016/j.sbi.2005.04.001] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.5] [Received: 02/08/2005] [Revised: 02/08/2005] [Accepted: 04/12/2005] [Indexed: 11/28/2022]
Abstract
A decade of access to whole-genome sequences has been increasingly revealing about the informational network relating all living organisms. Although at one point there was concern that extensive horizontal gene transfer might hopelessly muddle phylogenies, it has not proved a severe hindrance. The melding of sequence and structural information is being used to great advantage, and the prospect exists that some of the earliest aspects of life on Earth can be reconstructed, including the invention of biosynthetic and metabolic pathways. Still, some fundamental phylogenetic problems remain, including determining the root--if there is one--of the historical relationship between Archaea, Bacteria and Eukarya.
Affiliation(s)
- Russell F Doolittle
- Department of Chemistry & Biochemistry, University of California San Diego, La Jolla, CA 92093-0314, USA.
|
38
|
Siew N, Saini HK, Fischer D. A putative novel alpha/beta hydrolase ORFan family in Bacillus. FEBS Lett 2005; 579:3175-82. [PMID: 15922334 DOI: 10.1016/j.febslet.2005.04.030] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 12/30/2004] [Revised: 03/25/2005] [Accepted: 04/11/2005] [Indexed: 10/25/2022]
Abstract
A large number of sequences in each newly sequenced genome correspond to lineage and species-specific proteins, also known as ORFans. Amongst these ORFans, a large number are sequences with unknown structures and functions. We have identified a family of sequences, annotated as hypothetical proteins, which are specific to Bacillus and have carried out a computational study aimed at characterizing this family. Fold-recognition methods predict that these sequences belong to the alpha/beta hydrolase fold. We suggest possible catalytic triads for the ORFans and propose a hypothesis regarding the possible families within the alpha/beta hydrolase superfamily to which they may belong.
Affiliation(s)
- Naomi Siew
- Department of Chemistry, Ben Gurion University, Beer-Sheva 84105, Israel
|
39
|
Todd AE, Marsden RL, Thornton JM, Orengo CA. Progress of Structural Genomics Initiatives: An Analysis of Solved Target Structures. J Mol Biol 2005; 348:1235-60. [PMID: 15854658 DOI: 10.1016/j.jmb.2005.03.037] [Citation(s) in RCA: 103] [Impact Index Per Article: 5.2] [Received: 12/23/2004] [Revised: 02/28/2005] [Accepted: 03/15/2005] [Indexed: 11/27/2022]
Abstract
The explosion in gene sequence data and technological breakthroughs in protein structure determination inspired the launch of structural genomics (SG) initiatives. An often stated goal of structural genomics is the high-throughput structural characterisation of all protein sequence families, with the long-term hope of significantly impacting on the life sciences, biotechnology and drug discovery. Here, we present a comprehensive analysis of solved SG targets to assess progress of these initiatives. Eleven consortia have contributed 316 non-redundant entries and 323 protein chains to the Protein Data Bank (PDB), and 459 and 393 domains to the CATH and SCOP structure classifications, respectively. The quality and size of these proteins are comparable to those solved in traditional structural biology and, despite huge scope for duplicated efforts, only 14% of targets have a close homologue (≥30% sequence identity) solved by another consortium. Analysis of CATH and SCOP revealed the significant contribution that structural genomics is making to the coverage of superfamilies and folds. A total of 67% of SG domains in CATH are unique, lacking an already characterised close homologue in the PDB, whereas only 21% of non-SG domains are unique. For 29% of domains, structure determination revealed a remote evolutionary relationship not apparent from sequence, and 19% and 11% contributed new superfamilies and folds. The secondary structure class, fold and superfamily distributions of this dataset reflect those of the genomes. The domains fall into 172 different folds and 259 superfamilies in CATH but the distribution is highly skewed. The most populous of these are those that recur most frequently in the genomes. Whilst 11% of superfamilies are bacteria-specific, most are common to all three superkingdoms of life and together the 316 PDB entries have provided new and reliable homology models for 9287 non-redundant gene sequences in 206 completely sequenced genomes.
From the perspective of this analysis, it appears that structural genomics is on track to be a success, and it is hoped that this work will inform future directions of the field.
Affiliation(s)
- Annabel E Todd
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London, WC1E 6BT, UK.
|
40
|
Baranov PV, Hammer AW, Zhou J, Gesteland RF, Atkins JF. Transcriptional slippage in bacteria: distribution in sequenced genomes and utilization in IS element gene expression. Genome Biol 2005; 6:R25. [PMID: 15774026 PMCID: PMC1088944 DOI: 10.1186/gb-2005-6-3-r25] [Citation(s) in RCA: 57] [Impact Index Per Article: 2.9] [Received: 08/27/2004] [Revised: 12/16/2004] [Accepted: 01/25/2005] [Indexed: 11/13/2022] Open
Abstract
To find a length of slippage-prone sequences at which selection against transcriptional slippage is evident, the transcription of repetitive runs of A and T of different lengths in 108 bacterial genomes was analyzed. IS element genes were found to exploit transcriptional slippage for regulation of gene expression. Background Transcription slippage occurs on certain patterns of repeat mononucleotides, resulting in synthesis of a heterogeneous population of mRNAs. Individual mRNA molecules within this population differ in the number of nucleotides they contain that are not specified by the template. When transcriptional slippage occurs in a coding sequence, translation of the resulting mRNAs yields more than one protein product. Except where the products of the resulting mRNAs have distinct functions, transcription slippage occurring in a coding region is expected to be disadvantageous. This probably leads to selection against most slippage-prone sequences in coding regions. Results To find a length at which such selection is evident, we analyzed the distribution of repetitive runs of A and T of different lengths in 108 bacterial genomes. This length varies significantly among different bacteria, but in a large proportion of available genomes corresponds to nine nucleotides. Comparative sequence analysis of these genomes was used to identify occurrences of 9A and 9T transcriptional slippage-prone sequences used for gene expression. Conclusions IS element genes are the largest group found to exploit this phenomenon. A number of genes with disrupted open reading frames (ORFs) have slippage-prone sequences at which transcriptional slippage would result in uninterrupted ORF restoration at the mRNA level. The ability of such genes to encode functional full-length protein products brings into question their annotation as pseudogenes and in these cases is pertinent to the significance of the term 'authentic frameshift' frequently assigned to such genes.
Affiliation(s)
- Pavel V Baranov
- Department of Human Genetics, University of Utah, Salt Lake City, UT 84112-5330, USA
- Bioscience Institute, University College Cork, Cork, Ireland
- Andrew W Hammer
- Department of Human Genetics, University of Utah, Salt Lake City, UT 84112-5330, USA
- Jiadong Zhou
- Department of Human Genetics, University of Utah, Salt Lake City, UT 84112-5330, USA
- Current address: Gene Technology Division, Nitto Denko Technical Corporation, 401 Jones Road, Oceanside, CA 92054, USA
- Raymond F Gesteland
- Department of Human Genetics, University of Utah, Salt Lake City, UT 84112-5330, USA
- John F Atkins
- Department of Human Genetics, University of Utah, Salt Lake City, UT 84112-5330, USA
- Bioscience Institute, University College Cork, Cork, Ireland
|
41
|
Siew N, Fischer D. Structural Biology Sheds Light on the Puzzle of Genomic ORFans. J Mol Biol 2004; 342:369-73. [PMID: 15327940 DOI: 10.1016/j.jmb.2004.06.073] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.6] [Received: 03/02/2004] [Revised: 06/09/2004] [Accepted: 06/19/2004] [Indexed: 10/26/2022]
Abstract
Genomic ORFans are orphan open reading frames (ORFs) with no significant sequence similarity to other ORFs. ORFans comprise 20-30% of the ORFs of most completely sequenced genomes. Because nothing can be learnt about ORFans via sequence homology, the functions and evolutionary origins of ORFans remain a mystery. Furthermore, because relatively few ORFans have been experimentally characterized, it has been suggested that most ORFans are not likely to correspond to functional, expressed proteins, but rather to spurious ORFs, pseudo-genes or to rapidly evolving proteins with non-essential roles. As a snapshot view of current ORFan structural studies, we searched for ORFans among proteins whose three-dimensional structures have been recently determined. We find that functional and structural studies of ORFans are not as underemphasized as previously suggested. These recently determined structures correspond to ORFans from all Kingdoms of life, and include proteins that have previously been functionally characterized, as well as structural genomics targets of unknown function labeled as "hypothetical proteins". This suggests that many of the ORFans in the databases are likely to correspond to expressed, functional (and even essential) proteins. Furthermore, the recently determined structures include examples of the various types of ORFans, suggesting that the functions and evolutionary origins of ORFans are diverse. Although this survey sheds some light on the ORFan mystery, further experimental studies are required to gain a better understanding of the role and origins of the tens of thousands of ORFans awaiting characterization.
Affiliation(s)
- Naomi Siew
- Department of Chemistry, Ben Gurion University, Beer-Sheva 84105, Israel
|
42
|
Abstract
Differences in gene repertoire among bacterial genomes are usually ascribed to gene loss or to lateral gene transfer from unrelated cellular organisms. However, most bacteria contain large numbers of ORFans, that is, annotated genes that are restricted to a particular genome and that possess no known homologs. The uniqueness of ORFans within a genome has precluded the use of a comparative approach to examine their function and evolution. However, by identifying sequences unique to monophyletic groups at increasing phylogenetic depths, we can make direct comparisons of the characteristics of ORFans of different ages in the Escherichia coli genome, and establish their functional status and evolutionary rates. Relative to the genes ancestral to gamma-Proteobacteria and to those genes distributed sporadically in other prokaryotic species, ORFans in the E. coli lineage are short, A+T rich, and evolve quickly. Moreover, most encode functional proteins. Based on these features, ORFans are not attributable to errors in gene annotation, limitations of current databases, or to failure of methods for detecting homology. Rather, ORFans in the genomes of free-living microorganisms apparently derive from bacteriophage and occasionally become established by assuming roles in key cellular functions.
Affiliation(s)
- Vincent Daubin
- Department of Biochemistry & Molecular Biophysics, University of Arizona, Tucson, Arizona 85721, USA.
|
43
|
Reinhardt A, Eisenberg D. DPANN: Improved sequence to structure alignments following fold recognition. Proteins 2004; 56:528-38. [PMID: 15229885 DOI: 10.1002/prot.20144] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/05/2022]
Abstract
In fold recognition (FR) a protein sequence of unknown structure is assigned to the closest known three-dimensional (3D) fold. Although FR programs can often identify among all possible folds the one a sequence adopts, they frequently fail to align the sequence to the equivalent residue positions in that fold. Such failures frustrate the next step in structure prediction, protein model building. Hence it is desirable to improve the quality of the alignments between the sequence and the identified structure. We have used artificial neural networks (ANN) to derive a substitution matrix to create alignments between a protein sequence and a protein structure through dynamic programming (DPANN: Dynamic Programming meets Artificial Neural Networks). The matrix is based on the amino acid type and the secondary structure state of each residue. In a database of protein pairs that have the same fold but lack sequence similarity, DPANN aligns over 30% of all sequences to the paired structure, resembling closely the structural superposition of the pair. In over half of these cases the DPANN alignment is close to the structural superposition, although the initial alignment from the step of fold recognition is not close. Conversely, the alignment created during fold recognition outperforms DPANN in only 10% of all cases. Thus application of DPANN after fold recognition leads to substantial improvements in alignment accuracy, which in turn provides more useful templates for the modeling of protein structures. In the artificial case of using actual instead of predicted secondary structures for the probe protein, over 50% of the alignments are successful.
|
44
|
Abstract
As each newly sequenced genome contains a significant number of protein-coding ORFs that are species-, family- or lineage-specific, many interesting questions arise about the evolution and role of these ORFs and of the genomes they are part of. We refer to these poorly conserved ORFs as singleton or paralogous ORFans if they are unique to one genome, or as orthologous ORFans if they appear only in a family of closely related organisms and have no homolog in other genomes. In order to study and classify ORFans we have constructed the ORFanage, an ORFan database. This database consists of the predicted ORFs in fully sequenced microbial genomes, and enables searching for the three types of ORFans in any subset of the genomes chosen by the user. The ORFanage could help in choosing interesting targets for further genomic and evolutionary studies. The ORFanage is accessible via http://www.bioinformatics.buffalo.edu/ORFanage.
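The three ORFan types defined in this abstract can be expressed as a small decision rule. The sketch below is only an illustration of those definitions, assuming homolog detection has already been done elsewhere; the function name `classify_orf`, its inputs, and the genome labels are hypothetical and do not reflect how the ORFanage itself is implemented.

```python
from typing import Iterable, Set


def classify_orf(home_genome: str,
                 homolog_genomes: Iterable[str],
                 family: Set[str]) -> str:
    """Classify one ORF from the genomes in which its homologs were found.

    homolog_genomes lists the genome of every detected homolog other than the
    ORF itself (hypothetical input; how homologs are detected is out of scope).
    family is the set of closely related genomes chosen by the user.
    """
    distinct = set(homolog_genomes)
    if not distinct:
        return "singleton ORFan"      # no homolog anywhere
    if distinct == {home_genome}:
        return "paralogous ORFan"     # homologs only within its own genome
    if distinct <= family | {home_genome}:
        return "orthologous ORFan"    # homologs only within the related family
    return "not an ORFan"             # a homolog exists outside the family


# Hypothetical genome labels, for illustration only.
family = {"genome_A", "genome_B"}
labels = [
    classify_orf("genome_A", [], family),
    classify_orf("genome_A", ["genome_A"], family),
    classify_orf("genome_A", ["genome_B"], family),
    classify_orf("genome_A", ["genome_B", "genome_Z"], family),
]
```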
Affiliation(s)
- Naomi Siew
- Department of Chemistry, Ben Gurion University, Beer-Sheva 84105, Israel.
|
45
|
Krause DO, Denman SE, Mackie RI, Morrison M, Rae AL, Attwood GT, McSweeney CS. Opportunities to improve fiber degradation in the rumen: microbiology, ecology, and genomics. FEMS Microbiol Rev 2003; 27:663-93. [PMID: 14638418 DOI: 10.1016/s0168-6445(03)00072-x] [Citation(s) in RCA: 287] [Impact Index Per Article: 13.0] [Indexed: 11/23/2022] Open
Abstract
The degradation of plant cell walls by ruminants is of major economic importance in the developed as well as developing world. Rumen fermentation is unique in that efficient plant cell wall degradation relies on the cooperation between microorganisms that produce fibrolytic enzymes and the host animal that provides an anaerobic fermentation chamber. Increasing the efficiency with which the rumen microbiota degrades fiber has been the subject of extensive research for at least the last 100 years. Fiber digestion in the rumen is not optimal, as is supported by the fact that fiber recovered from feces is fermentable. This view is confirmed by the knowledge that mechanical and chemical pretreatments improve fiber degradation, as well as more recent research, which has demonstrated increased fiber digestion by rumen microorganisms when plant lignin composition is modified by genetic manipulation. Rumen microbiologists have sought to improve fiber digestion by genetic and ecological manipulation of rumen fermentation. This has been difficult and a number of constraints have limited progress, including: (a) a lack of reliable transformation systems for major fibrolytic rumen bacteria, (b) a poor understanding of ecological factors that govern persistence of fibrolytic bacteria and fungi in the rumen, (c) a poor understanding of which glycosyl hydrolases need to be manipulated, and (d) a lack of knowledge of the functional genomic framework within which fiber degradation operates. In this review the major fibrolytic organisms are briefly discussed. A more extensive discussion of the enzymes involved in fiber degradation is included. We also discuss the use of plant genetic manipulation, application of free-living lignolytic fungi and the use of exogenous enzymes. Lastly, we will discuss how newer technologies such as genomic and metagenomic approaches can be used to improve our knowledge of the functional genomic framework of plant cell wall degradation in the rumen.
Affiliation(s)
- Denis O Krause
- CSIRO Australia, Queensland Bioscience Precinct, St. Lucia, Qld 4067, Australia.
|
46
|
Abstract
Singleton sequence ORFans are orphan ORFs (open reading frames) that have no detectable sequence similarity to any other sequence in the databases. ORFans are of particular interest not only as evolutionary puzzles but also because we can learn little about them using bioinformatics tools. Here, we present a first systematic analysis of singleton ORFans in the first 60 fully sequenced microbial genomes. We show that although ORFans have been underemphasized, the number of ORFans is steadily growing, currently accounting for 23,634 sequences. At the same time, the percentage of ORFans as a fraction of all sequences is slowly diminishing, and is currently about 14%. Short ORFans comprise about 61% of all ORFans. The abundance of short ORFans may be due to a yet unexplained artifact. The data also suggest that the number of longer ORFans may soon diminish as more genomes of closely related organisms become available. To better address the questions about the functions and origins of ORFans, we propose to focus further studies on the longer ORFans, with emphasis on three new types of ORFans: ORFan modules, paralogous ORFans, and orthologous ORFans. We conclude that the large number of ORFans reflects an intrinsic property of the genetic material not yet fully understood. Further computational and experimental studies aimed at understanding Nature's protein diversity should also include ORFans.
Affiliation(s)
- Naomi Siew
- Department of Chemistry, Ben Gurion University, Beer-Sheva, Israel
|
47
|
Charlebois RL, Clarke GDP, Beiko RG, St Jean A. Characterization of species-specific genes using a flexible, web-based querying system. FEMS Microbiol Lett 2003; 225:213-20. [PMID: 12951244 DOI: 10.1016/s0378-1097(03)00512-3] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.1] [Indexed: 11/26/2022] Open
Abstract
We describe a query-based web-accessible system (www.neurogadgets.com/bws.php) for facilitating comparative microbial genomics. A variety of query pages are available, each with numerous options, that allow a biologist to pose relevant questions of genomic data. We illustrate with a characterization of species-specific protein-coding genes (so-called "ORFans"), finding that they are on average smaller, faster evolving, and less G+C-rich, and that they encode proteins more basic in their predicted isoelectric point, compared with non-species-specific genes. Using a dual-threshold approach, we conclude that these are characteristics of true species-specific genes, rather than artifacts of mis-annotation.
|
48
|
Watson JD, Todd AE, Bray J, Laskowski RA, Edwards A, Joachimiak A, Orengo CA, Thornton JM. Target selection and determination of function in structural genomics. IUBMB Life 2003; 55:249-55. [PMID: 12880206 PMCID: PMC3366504 DOI: 10.1080/1521654031000123385] [Citation(s) in RCA: 20] [Impact Index Per Article: 0.9] [Indexed: 10/26/2022]
Abstract
The first crucial step in any structural genomics project is the selection and prioritization of target proteins for structure determination. There may be a number of selection criteria to be satisfied, including that the proteins have novel folds, that they be representatives of large families for which no structure is known, and so on. The better the selection at this stage, the greater is the value of the structures obtained at the end of the experimental process. This value can be further enhanced once the protein structures have been solved if the functions of the given proteins can also be determined. Here we describe the methods used at either end of the experimental process: firstly, sensitive sequence comparison techniques for selecting a high-quality list of target proteins, and secondly the various computational methods that can be applied to the eventual 3D structures to determine the most likely biochemical function of the proteins in question.
Affiliation(s)
- James D Watson
- EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|