1
Baxevanis AD. An overview of gene identification: approaches, strategies, and considerations. Curr Protoc Bioinformatics 2008; Chapter 4:Unit 4.1. [PMID: 18428724 DOI: 10.1002/0471250953.bi0401s6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Abstract
Modern biology is on the verge of officially ushering in a new era in science with the completion of the sequencing of the human genome in April 2003. While often erroneously called the "post-genome era", this will actually truly mark the beginning of the "genome era," a time in which the availability of sequence data for many genomes will have a significant effect on how science is performed in the 21st century. This unit offers an overview of many of the gene prediction methods that are currently available and offers a general assessment of how well the methods work for various problems.
Affiliation(s)
- Andreas D Baxevanis
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
2
Frost LS, Leplae R, Summers AO, Toussaint A. Mobile genetic elements: the agents of open source evolution. Nat Rev Microbiol 2005; 3:722-32. [PMID: 16138100 DOI: 10.1038/nrmicro1235] [Citation(s) in RCA: 1092] [Impact Index Per Article: 54.6] [Indexed: 12/18/2022]
Abstract
Horizontal genomics is a new field in prokaryotic biology that is focused on the analysis of DNA sequences in prokaryotic chromosomes that seem to have originated from other prokaryotes or eukaryotes. However, it is equally important to understand the agents that effect DNA movement: plasmids, bacteriophages and transposons. Although these agents occur in all prokaryotes, comprehensive genomics of the prokaryotic mobile gene pool or 'mobilome' lags behind other genomics initiatives owing to challenges that are distinct from cellular chromosomal analysis. Recent work shows promise of improved mobile genetic element (MGE) genomics and consequent opportunities to take advantage - and avoid the dangers - of these 'natural genetic engineers'. This review describes MGEs, their properties that are important in horizontal gene transfer, and current opportunities to advance MGE genomics.
Affiliation(s)
- Laura S Frost
- Department of Biological Sciences, Biological Sciences Centre, University of Alberta, Edmonton, Alberta T6G 2E9, Canada
10
Zhang L, Pavlovic V, Cantor CR, Kasif S. Human-mouse gene identification by comparative evidence integration and evolutionary analysis. Genome Res 2003; 13:1190-202. [PMID: 12743024 PMCID: PMC403647 DOI: 10.1101/gr.703903] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.0] [Received: 08/09/2002] [Accepted: 02/03/2003] [Indexed: 11/24/2022]
Abstract
The identification of genes in the human genome remains a challenge, as the actual predictions appear to disagree tremendously and vary dramatically on the basis of the specific gene-finding methodology used. Because the pattern of conservation in coding regions is expected to be different from intronic or intergenic regions, a comparative computational analysis can lead, in principle, to an improved computational identification of genes in the human genome by using a reference, such as mouse genome. However, this comparative methodology critically depends on three important factors: (1) the selection of the most appropriate reference genome. In particular, it is not clear whether the mouse is at the correct evolutionary distance from the human to provide sufficiently distinctive conservation levels in different genomic regions, (2) the selection of comparative features that provide the most benefit to gene recognition, and (3) the selection of evidence integration architecture that effectively interprets the comparative features. We address the first question by a novel evolutionary analysis that allows us to explicitly correlate the performance of the gene recognition system with the evolutionary distance (time) between the two genomes. Our simulation results indicate that there is a wide range of reference genomes at different evolutionary time points that appear to deliver reasonable comparative prediction of human genes. In particular, the evolutionary time between human and mouse generally falls in the region of good performance; however, better accuracy might be achieved with a reference genome further than mouse. To address the second question, we propose several natural comparative measures of conservation for identifying exons and exon boundaries. Finally, we experiment with Bayesian networks for the integration of comparative and compositional evidence.
Affiliation(s)
- Lingang Zhang
- Center for Advanced Biotechnology, Boston University, Boston, Massachusetts 02215, USA
11
Zhang N, Osborn M, Gitsham P, Yen K, Miller JR, Oliver SG. Using yeast to place human genes in functional categories. Gene 2003; 303:121-9. [PMID: 12559573 DOI: 10.1016/s0378-1119(02)01142-3] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.0] [Indexed: 11/26/2022]
Abstract
The availability of the draft sequence of the human genome has created a pressing need to assign functions to each of the 35,000 or so genes that it defines. One useful approach for this purpose is to use model organisms for both bioinformatic and functional comparisons. We have developed a complementation system, based on the model eukaryote Saccharomyces cerevisiae, to clone human cDNAs that can functionally complement yeast essential genes. The system employs two regulatable promoters. One promoter, tetO (determining doxycycline-repressible expression), is used to control essential S. cerevisiae genes. The other, pMET3 (which is switched off in the presence of methionine), is employed to regulate the expression of mammalian cDNAs in yeast. We have demonstrated that this system is effective for both individual cDNA clones and for cDNA libraries, permitting the direct selection of functionally complementing clones. Three human cDNA libraries have been constructed and screened for clones that can complement specific essential yeast genes whose expression is switched off by the addition of doxycycline to the culture medium. The validity of each complementation was checked by showing that the yeast cells stop their growth in the presence of doxycycline and methionine, which repress the expression of the yeast and mammalian coding sequences, respectively. Using this system, we have screened 25 tetO replacement strains and succeeded in isolating human cDNAs complementing six essential yeast genes. In this way, we have uncovered a novel human ubiquitin-conjugating enzyme, have isolated a human cDNA clone that may function as a signal peptidase and have demonstrated that the functional segment of the human Psmd12 proteasome subunit contains a PINT domain.
Affiliation(s)
- Nianshu Zhang
- School of Biological Sciences, University of Manchester, 2.205 Stopford Building, Oxford Road, UK
12
Abstract
Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large scale sequencing projects. Previous strategies for solving this problem can be categorized into essentially two schools of thought: one school employs sequence composition statistics, whereas the other relies on database similarity searches. In this paper, we propose a new gene identification scheme that combines the best characteristics from each of these two schools. In particular, our method determines gene candidates among the ORFs that can be identified in a given DNA strand through the use of the Bio-Dictionary, a database of patterns that covers essentially all of the currently available sample of the natural protein sequence space. Our approach relies entirely on the use of redundant patterns as the agents on which the presence or absence of genes is predicated and does not employ any additional evidence, e.g. ribosome-binding site signals. The Bio-Dictionary Gene Finder (BDGF), the algorithm's implementation, is a single computational engine able to handle the gene identification task across distinct archaeal and bacterial genomes. The engine exhibits performance that is characterized by simultaneous very high values of sensitivity and specificity, and a high percentage of correctly predicted start sites. Using a collection of patterns derived from an old (June 2000) release of the Swiss-Prot/TrEMBL database that contained 451 602 proteins and fragments, we demonstrate our method's generality and capabilities through an extensive analysis of 17 complete archaeal and bacterial genomes. Examples of previously unreported genes are also shown and discussed in detail.
Affiliation(s)
- Tetsuo Shibuya
- Exploratory Technology, IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan
13
Wiehe T, Gebauer-Jung S, Mitchell-Olds T, Guigó R. SGP-1: prediction and validation of homologous genes based on sequence alignments. Genome Res 2001; 11:1574-83. [PMID: 11544202 PMCID: PMC311140 DOI: 10.1101/gr.177401] [Citation(s) in RCA: 69] [Impact Index Per Article: 2.9] [Received: 01/01/2001] [Accepted: 06/05/2001] [Indexed: 11/24/2022]
Abstract
Conventional methods of gene prediction rely on the recognition of DNA-sequence signals, the coding potential or the comparison of a genomic sequence with a cDNA, EST, or protein database. Reasons for limited accuracy in many circumstances are species-specific training and the incompleteness of reference databases. Lately, comparative genome analysis has attracted increasing attention. Several analysis tools that are based on human/mouse comparisons are already available. Here, we present a program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences. In contrast to most existing tools, the accuracy of SGP-1 depends little on species-specific properties such as codon usage or the nucleotide distribution. SGP-1 may therefore be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training. In addition to predicting genes in large-scale genomic sequences, the program may be useful to validate gene structure annotations from databases. To this end, SGP-1 output also contains comparisons between predicted and annotated gene structures in HTML format. The program can be accessed via a Web server at http://soft.ice.mpg.de/sgp-1. The source code, written in ANSI C, is available on request from the authors.
Affiliation(s)
- T Wiehe
- Max Planck Institute for Chemical Ecology, Jena, Germany.
14
Wang J, Zhang CT. Identification of protein-coding genes in the genome of Vibrio cholerae with more than 98% accuracy using occurrence frequencies of single nucleotides. Eur J Biochem 2001; 268:4261-8. [PMID: 11488920 DOI: 10.1046/j.1432-1327.2001.02341.x] [Citation(s) in RCA: 20] [Impact Index Per Article: 0.8] [Indexed: 11/20/2022]
Abstract
The published sequence of the Vibrio cholerae genome indicates that, in addition to the genes that encode proteins of known and unknown function, there are 1577 ORFs identified as conserved hypothetical or hypothetical gene candidates. Because the annotation is not 100% accurate, it is not known which of the 1577 ORFs are true protein-coding genes. In this paper, an algorithm based on the Z curve method, with sensitivity, specificity and accuracy greater than 98%, is used to solve this problem. Twenty-fold cross-validation tests show that the accuracy of the algorithm is 98.8%. A detailed discussion of the mechanism of the algorithm is also presented. It was found that 172 of the 1577 ORFs are unlikely to be protein-coding genes. The number of protein-coding genes in the V. cholerae genome was re-estimated and found to be approximately 3716. This result should be of use in microarray analysis of gene expression in the genome, because the cost of preparing chips may be somewhat decreased. A computer program was written to calculate a coding score called VCZ for gene identification in the genome. Coding/noncoding is simply determined by VCZ > 0/VCZ < 0. The program is freely available on request for academic use.
Affiliation(s)
- J Wang
- Department of Physics, Tianjin University, Tianjin 300072, China
15
Baxevanis AD. Gene identification: methods and considerations. Curr Protoc Hum Genet 2001; Chapter 6:Unit 6.6. [PMID: 18428301 DOI: 10.1002/0471142905.hg0606s29] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2023]
Abstract
This unit introduces readers to some of the more commonly used techniques for gene identification. The author discusses the general problem behind accurately predicting genes in both prefinished and finished sequence data, provides a hands-on description of programs available in the public domain, and suggests strategies for how to best tackle the prediction problem at various stages of data generation and assembly.
Affiliation(s)
- A D Baxevanis
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
16
Abstract
The year 2000 stands as a landmark in modern biology: the first draft of the human genome sequence has been completed. For the pharmaceutical industry, this achievement provides tremendous opportunities because the genomic sequence exposes all human drug targets for therapeutic intervention. The challenge for the pharmaceutical companies is to exploit this definitive resource for the identification of potential molecular targets, rapid characterization of their function and validation of their involvement in disease pathology. Bioinformatics approaches provide increasingly crucial tools to systematically support this exploratory target drug discovery activity.
Affiliation(s)
- P Sanseau
- Target Bioinformatics, Glaxo SmithKline, Gunnels Wood Road, SG1 2NY, Stevenage, UK
17
Casrouge A, Beaudoing E, Dalle S, Pannetier C, Kanellopoulos J, Kourilsky P. Size estimate of the alpha beta TCR repertoire of naive mouse splenocytes. J Immunol 2000; 164:5782-7. [PMID: 10820256 DOI: 10.4049/jimmunol.164.11.5782] [Citation(s) in RCA: 229] [Impact Index Per Article: 9.2] [Indexed: 11/19/2022]
Abstract
The diversity of the T cell repertoire of mature T splenocytes is generated, in the thymus, by pairing of alpha and beta variable domains of the alpha beta TCR and by the rearrangements of various gene segments encoding these domains. In the periphery, it results from competition between various T cell subpopulations including recent thymic migrants and long-lived T cells. Quantitative data on the actual size of the T cell repertoire are lacking. Using PCR methods and extensive sequencing, we have measured for the first time the size of the TCR-alpha beta repertoire of naive mouse T splenocytes. There are 5-8 × 10^5 different nucleotide sequences of BV chains in the whole spleen of young adult mice. We have also determined the size of the BV repertoire in a subpopulation of AV2+ T splenocytes, which allows us to provide a minimum estimate of the alpha beta repertoire. We find that the mouse spleen harbors about 2 × 10^6 clones of about 10 cells each. This figure, although orders of magnitude smaller than the maximum theoretical diversity (estimated up to 10^15), is still large enough to maintain a high functional diversity.
MESH Headings
- Animals
- Cell Division/genetics
- Cell Division/immunology
- Cloning, Molecular
- Gene Rearrangement, beta-Chain T-Cell Antigen Receptor
- Interphase/genetics
- Interphase/immunology
- Mice
- Mice, Inbred C57BL
- Mice, Inbred DBA
- Polymerase Chain Reaction
- Receptors, Antigen, T-Cell, alpha-beta/chemistry
- Receptors, Antigen, T-Cell, alpha-beta/genetics
- Receptors, Antigen, T-Cell, alpha-beta/isolation & purification
- Sequence Analysis, DNA
- Species Specificity
- Spleen/cytology
- Spleen/immunology
- Spleen/metabolism
- T-Lymphocyte Subsets/cytology
- T-Lymphocyte Subsets/immunology
- T-Lymphocyte Subsets/metabolism
Affiliation(s)
- A Casrouge
- Unité de Biologie Moléculaire du Gène, Institut National de la Santé et de la Recherche Médicale, Unité 277, Institut Pasteur, Paris, France