1. Posada-Reyes AB, Balderas-Martínez YI, Ávila-Ríos S, Vinuesa P, Fonseca-Coronado S. An Epistatic Network Describes oppA and glgB as Relevant Genes for Mycobacterium tuberculosis. Front Mol Biosci 2022; 9:856212. [PMID: 35712352 PMCID: PMC9194097 DOI: 10.3389/fmolb.2022.856212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2022] [Accepted: 03/11/2022] [Indexed: 11/18/2022]
Abstract
Mycobacterium tuberculosis is an acid-fast bacterium that causes tuberculosis worldwide. The role of epistatic interactions among different loci of the M. tuberculosis genome under selective pressure may be crucial for understanding the disease and the molecular basis of antibiotic resistance acquisition. Here, we analyzed polymorphic loci interactions by applying a model-free method for epistasis detection, SpydrPick, on a pan–genome-wide alignment created from a set of 254 complete reference genomes. By means of the analysis of an epistatic network created with the detected epistatic interactions, we found that glgB (α-1,4-glucan branching enzyme) and oppA (oligopeptide-binding protein) are putative targets of co-selection in M. tuberculosis as they were associated in the network with M. tuberculosis genes related to virulence, pathogenesis, transport system modulators of the immune response, and antibiotic resistance. In addition, our work unveiled potential pharmacological applications for genotypic antibiotic resistance inherent to the mutations of glgB and oppA as they epistatically interact with fprA and embC, two genes recently included as antibiotic-resistant genes in the catalog of the World Health Organization. Our findings showed that this approach allows the identification of relevant epistatic interactions that may lead to a better understanding of M. tuberculosis by deciphering the complex interactions of molecules involved in its metabolism, virulence, and pathogenesis and that may be applied to different bacterial populations.
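Model-free epistasis detection of this kind is typically built on mutual information (MI) between pairs of alignment columns. Assuming that basis, the following is a minimal sketch of plain MI between two columns of a toy alignment. The function name and the toy data are hypothetical, and the sketch leaves out everything else a real tool such as SpydrPick does (for example, deciding which pairs are significant).

```python
from collections import Counter
from math import log


def mutual_information(col_a: str, col_b: str) -> float:
    """Plain mutual information (in nats) between two alignment columns.

    Each string holds one character per genome, in the same genome order.
    Hypothetical helper for illustration only.
    """
    if len(col_a) != len(col_b) or not col_a:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # Ratio of the joint frequency to the product of the marginals.
        mi += p_ab * log(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


# Two perfectly co-varying columns versus two independent ones.
linked = mutual_information("AACC", "GGTT")
unlinked = mutual_information("AACC", "ACAC")
```

High MI marks column pairs whose alleles co-vary across genomes. In a network view, such pairs become edges between the loci that contain them.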
Affiliation(s)
- Ali-Berenice Posada-Reyes
  - Posgrado en Ciencias Biológicas, UNAM, Mexico, Mexico
  - Facultad de Estudios Superiores Cuautitlán, UNAM, Estado de Mexico, Mexico
  - *Correspondence: Ali-Berenice Posada-Reyes; Salvador Fonseca-Coronado
- Santiago Ávila-Ríos
  - Instituto Nacional de Enfermedades Respiratorias “Ismael Cosio Villegas”, Ciudad de Mexico, Mexico
- Pablo Vinuesa
  - Centro de Ciencias Genómicas, UNAM, Cuernavaca, Mexico
- Salvador Fonseca-Coronado
  - Facultad de Estudios Superiores Cuautitlán, UNAM, Estado de Mexico, Mexico
  - *Correspondence: Ali-Berenice Posada-Reyes; Salvador Fonseca-Coronado
2. Borstein SR, O’Meara BC. AnnotationBustR: an R package to extract subsequences from GenBank annotations. PeerJ 2018; 6:e5179. [PMID: 30002984 PMCID: PMC6034590 DOI: 10.7717/peerj.5179] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 04/12/2017] [Accepted: 06/18/2018] [Indexed: 02/01/2023]
Abstract
BACKGROUND: DNA sequences are pivotal for a wide array of research in biology. Large sequence databases, like GenBank, provide an amazing resource to utilize DNA sequences for large scale analyses. However, many sequence records on GenBank contain more than one gene or are portions of genomes. Inconsistencies in the way genes are annotated and the numerous synonyms a single gene may be listed under provide major challenges for extracting large numbers of subsequences for comparative analysis across taxa. At present, there is no easy way to extract portions from many GenBank accessions based on annotations where gene names may vary extensively. RESULTS: The R package AnnotationBustR allows users to extract sequences based on GenBank annotations through the ACNUC retrieval system given search terms of gene synonyms and accession numbers. AnnotationBustR extracts subsequences of interest and then writes them to a FASTA file for users to employ in their research endeavors. CONCLUSION: FASTA files of extracted subsequences and accession tables generated by AnnotationBustR allow users to quickly find and extract subsequences from GenBank accessions. These sequences can then be incorporated in various analyses, like the construction of phylogenies to test a wide range of ecological and evolutionary hypotheses.
Affiliation(s)
- Samuel R. Borstein
  - Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville, TN, USA
- Brian C. O’Meara
  - Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville, TN, USA
3. Francois CM, Duret L, Simon L, Mermillod-Blondin F, Malard F, Konecny-Dupré L, Planel R, Penel S, Douady CJ, Lefébure T. No Evidence That Nitrogen Limitation Influences the Elemental Composition of Isopod Transcriptomes and Proteomes. Mol Biol Evol 2016; 33:2605-20. [PMID: 27401232 DOI: 10.1093/molbev/msw131] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 01/30/2023]
Abstract
The field of stoichiogenomics aims at understanding the influence of nutrient limitations on the elemental composition of the genome, transcriptome, and proteome. The 20 amino acids and the 4 nt differ in the number of nutrients they contain, such as nitrogen (N). Thus, N limitation shall theoretically select for changes in the composition of proteins or RNAs through preferential use of N-poor amino acids or nucleotides, which will decrease the N-budget of an organism. While these N-saving mechanisms have been evidenced in microorganisms, they remain controversial in multicellular eukaryotes. In this study, we used 13 surface and subterranean isopod species pairs that face strongly contrasted N limitations, either in terms of quantity or quality. We combined in situ nutrient quantification and transcriptome sequencing to test if N limitation selected for N-savings through changes in the expression and composition of the transcriptome and proteome. No evidence of N-savings was found in the total N-budget of transcriptomes or proteomes or in the average protein N-cost. Nevertheless, subterranean species evolving in N-depleted habitats displayed lower N-usage at their third codon positions. To test if this convergent compositional change was driven by natural selection, we developed a method to detect the strand-asymmetric signature that stoichiogenomic selection should leave in the substitution pattern. No such signature was evidenced, indicating that the observed stoichiogenomic-like patterns were attributable to nonadaptive processes. The absence of stoichiogenomic signal despite strong N limitation within a powerful phylogenetic framework casts doubt on the existence of stoichiogenomic mechanisms in metazoans.
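The N-cost quantities discussed above can be illustrated with a short sketch. It counts nitrogen atoms per amino acid residue (one backbone N plus any side-chain N) and per base at third codon positions of a DNA coding strand. The function names are hypothetical and this is not the authors' pipeline, which also accounts for expression levels.

```python
# Side-chain nitrogen atoms; every residue also carries one backbone N.
SIDE_CHAIN_N = {"R": 3, "H": 2, "K": 1, "N": 1, "Q": 1, "W": 1}

# Nitrogen atoms per base (adenine 5, guanine 5, cytosine 3, thymine 2).
BASE_N = {"A": 5, "G": 5, "C": 3, "T": 2}


def mean_n_per_residue(protein: str) -> float:
    """Average number of N atoms per residue of a protein sequence."""
    residues = [aa for aa in protein.upper() if aa.isalpha()]
    if not residues:
        raise ValueError("empty protein sequence")
    return sum(1 + SIDE_CHAIN_N.get(aa, 0) for aa in residues) / len(residues)


def mean_n_third_position(cds: str) -> float:
    """Average N atoms per base at third codon positions of an in-frame CDS."""
    cds = cds.upper()
    thirds = [cds[i + 2] for i in range(0, len(cds) - 2, 3)]
    if not thirds:
        raise ValueError("sequence shorter than one codon")
    return sum(BASE_N[base] for base in thirds) / len(thirds)


protein_cost = mean_n_per_residue("AR")         # alanine (1 N) and arginine (4 N)
third_cost = mean_n_third_position("ATGGCC")    # third positions are G and C
```

Comparing such averages between species pairs is the kind of contrast the abstract describes. A lower value at third positions would correspond to the lower N-usage reported for subterranean species.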
Affiliation(s)
- Clémentine M Francois
  - Univ Lyon, Université Claude Bernard Lyon 1, CNRS, ENTPE, Laboratoire d'Ecologie des Hydrosystèmes Naturels et Anthropisés UMR5023, Villeurbanne, France
- Laurent Duret
  - Univ Lyon, Université Claude Bernard Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR5558, Villeurbanne, France
- Laurent Simon
  - Univ Lyon, Université Claude Bernard Lyon 1, CNRS, ENTPE, Laboratoire d'Ecologie des Hydrosystèmes Naturels et Anthropisés UMR5023, Villeurbanne, France
- Florian Mermillod-Blondin
  - Univ Lyon, Université Claude Bernard Lyon 1, CNRS, ENTPE, Laboratoire d'Ecologie des Hydrosystèmes Naturels et Anthropisés UMR5023, Villeurbanne, France
- Florian Malard
  - Univ Lyon, Université Claude Bernard Lyon 1, CNRS, ENTPE, Laboratoire d'Ecologie des Hydrosystèmes Naturels et Anthropisés UMR5023, Villeurbanne, France
- Lara Konecny-Dupré
  - Univ Lyon, Université Claude Bernard Lyon 1, CNRS, ENTPE, Laboratoire d'Ecologie des Hydrosystèmes Naturels et Anthropisés UMR5023, Villeurbanne, France
- Rémi Planel
  - Univ Lyon, Université Claude Bernard Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR5558, Villeurbanne, France
- Simon Penel
  - Univ Lyon, Université Claude Bernard Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR5558, Villeurbanne, France
- Christophe J Douady
  - Univ Lyon, Université Claude Bernard Lyon 1, CNRS, ENTPE, Laboratoire d'Ecologie des Hydrosystèmes Naturels et Anthropisés UMR5023, Villeurbanne, France
  - Institut Universitaire de France, Paris, France
- Tristan Lefébure
  - Univ Lyon, Université Claude Bernard Lyon 1, CNRS, ENTPE, Laboratoire d'Ecologie des Hydrosystèmes Naturels et Anthropisés UMR5023, Villeurbanne, France
4. Flandrois JP, Perrière G, Gouy M. leBIBI(QBPP): a set of databases and a webtool for automatic phylogenetic analysis of prokaryotic sequences. BMC Bioinformatics 2015; 16:251. [PMID: 26264559 PMCID: PMC4531848 DOI: 10.1186/s12859-015-0692-z] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Received: 01/13/2015] [Accepted: 07/31/2015] [Indexed: 01/08/2023]
Abstract
BACKGROUND: Estimating the phylogenetic position of bacterial and archaeal organisms by genetic sequence comparisons is considered the gold standard in taxonomy. This is also a way to identify the species of origin of the sequence. The quality of the reference database used in such analyses is crucial: the database must reflect the up-to-date bacterial nomenclature and accurately indicate the species of origin of its sequences. DESCRIPTION: leBIBI(QBPP) is a web tool taking as input a series of nucleotide sequences belonging to one of a set of reference markers (e.g., SSU rRNA, rpoB, groEL2) and automatically retrieving closely related sequences, aligning them, and performing phylogenetic reconstruction using an approximate maximum likelihood approach. The system returns a set of quality parameters and, if possible, a suggested taxonomic assignment for the input sequences. The reference databases are extracted from GenBank and present four degrees of stringency, from the "superstringent" degree (one type strain per species) to the loosely parsed degree ("lax" database). A set of one hundred to more than a thousand sequences may be analyzed at a time. The speed of the process has been optimized through careful hardware selection and database design. CONCLUSION: leBIBI(QBPP) is a powerful tool helping biologists position commonly used bacterial or archaeal marker sequences in a phylogeny. It is a diagnostic tool for clinical, industrial and environmental microbiology laboratories, as well as an exploratory tool for more specialized laboratories. Its main advantages, relative to comparable systems, are: i) the use of a broad set of databases covering diverse markers with various degrees of stringency; ii) the use of an approximate Maximum Likelihood approach for phylogenetic reconstruction; iii) a speed compatible with on-line usage; and iv) providing fully documented results to help the user in decision making.
Affiliation(s)
- Jean-Pierre Flandrois
  - Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Claude Bernard - Lyon 1, 43 bd. du 11 Novembre 1918, Villeurbanne, 69622, France.
- Guy Perrière
  - Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Claude Bernard - Lyon 1, 43 bd. du 11 Novembre 1918, Villeurbanne, 69622, France.
- Manolo Gouy
  - Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Claude Bernard - Lyon 1, 43 bd. du 11 Novembre 1918, Villeurbanne, 69622, France.
5. Lassalle F, Périan S, Bataillon T, Nesme X, Duret L, Daubin V. GC-Content evolution in bacterial genomes: the biased gene conversion hypothesis expands. PLoS Genet 2015; 11:e1004941. [PMID: 25659072 PMCID: PMC4450053 DOI: 10.1371/journal.pgen.1004941] [Citation(s) in RCA: 149] [Impact Index Per Article: 14.9] [Received: 05/16/2014] [Accepted: 12/08/2014] [Indexed: 11/29/2022]
Abstract
The characterization of functional elements in genomes relies on the identification of the footprints of natural selection. In this quest, taking into account neutral evolutionary processes such as mutation and genetic drift is crucial because these forces can generate patterns that may obscure or mimic signatures of selection. In mammals, and probably in many eukaryotes, another such confounding factor called GC-Biased Gene Conversion (gBGC) has been documented. This mechanism generates patterns identical to what is expected under selection for higher GC-content, specifically in highly recombining genomic regions. Recent results have suggested that a mysterious selective force favouring higher GC-content exists in Bacteria but the possibility that it could be gBGC has been excluded. Here, we show that gBGC is probably at work in most if not all bacterial species. First we find a consistent positive relationship between the GC-content of a gene and evidence of intra-genic recombination throughout a broad spectrum of bacterial clades. Second, we show that the evolutionary force responsible for this pattern is acting independently from selection on codon usage, and could potentially interfere with selection in favor of optimal AU-ending codons. A comparison with data from human populations shows that the intensity of gBGC in Bacteria is comparable to what has been reported in mammals. We propose that gBGC is not restricted to sexual Eukaryotes but also widespread among Bacteria and could therefore be an ancestral feature of cellular organisms. We argue that if gBGC occurs in bacteria, it can account for previously unexplained observations, such as the apparent non-equilibrium of base substitution patterns and the heterogeneity of gene composition within bacterial genomes. Because gBGC produces patterns similar to positive selection, it is essential to take this process into account when studying the evolutionary forces at work in bacterial genomes. 
Classical population genetics models indicate that the efficiency of selection, and hence adaptation, depends on a number of non-selective factors, such as the size of a population or the intensity of recombination. In the last 10 years, evidence has accumulated that another mechanism called GC-Biased Gene Conversion (gBGC) can interfere with selection and even mimic its effects. This phenomenon, which arises from a particularity of the recombination machinery, was first thought to be restricted to sexual eukaryotic organisms. Here, we show that this mechanism probably exists in Bacteria and has a strong impact on their genome evolution. This discovery not only explains many previously unconnected features of bacterial genome evolution, but also highlights the importance of non-adaptive evolutionary processes in Bacteria.
Affiliation(s)
- Florent Lassalle
  - Université de Lyon, Lyon, France
  - Université Lyon 1, Villeurbanne, France
  - CNRS, UMR 5558, Laboratoire de Biométrie et Biologie Evolutive, Villeurbanne, France
  - CNRS, UMR 5557, Ecologie Microbienne, Villeurbanne, France
  - INRA, USC 1364, Ecologie Microbienne, Villeurbanne, France
  - Ecole Normale Supérieure de Lyon, Lyon, France
- Séverine Périan
  - Université de Lyon, Lyon, France
  - Université Lyon 1, Villeurbanne, France
  - CNRS, UMR 5558, Laboratoire de Biométrie et Biologie Evolutive, Villeurbanne, France
- Thomas Bataillon
  - Aarhus University, Bioinformatics Research Center, Århus, Denmark
  - Université de Lyon, Lyon, France
- Xavier Nesme
  - Université de Lyon, Lyon, France
  - Université Lyon 1, Villeurbanne, France
  - CNRS, UMR 5557, Ecologie Microbienne, Villeurbanne, France
  - INRA, USC 1364, Ecologie Microbienne, Villeurbanne, France
- Laurent Duret
  - Université de Lyon, Lyon, France
  - Université Lyon 1, Villeurbanne, France
  - CNRS, UMR 5558, Laboratoire de Biométrie et Biologie Evolutive, Villeurbanne, France
- Vincent Daubin
  - Université de Lyon, Lyon, France
  - Université Lyon 1, Villeurbanne, France
  - CNRS, UMR 5558, Laboratoire de Biométrie et Biologie Evolutive, Villeurbanne, France
6. Palmeira L, Penel S, Lotteau V, Rabourdin-Combe C, Gautier C. PhEVER: a database for the global exploration of virus-host evolutionary relationships. Nucleic Acids Res 2010; 39:D569-75. [PMID: 21081560 PMCID: PMC3013642 DOI: 10.1093/nar/gkq1013] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 12/14/2022]
Abstract
Fast viral adaptation and the implication of this rapid evolution in the emergence of several new infectious diseases have turned this issue into a major challenge for various research domains. Indeed, viruses are involved in the development of a wide range of pathologies, and understanding how viruses and host cells interact in the context of adaptation remains an open question. In order to provide insights into the complex interactions between viruses and their host organisms, namely in the acquisition of novel functions through exchanges of genetic material, we developed the PhEVER database. This database aims at providing accurate evolutionary and phylogenetic information to analyse the nature of virus-virus and virus-host lateral gene transfers. PhEVER (http://pbil.univ-lyon1.fr/databases/phever) is a unique database of homologous families both (i) between sequences from different viruses and (ii) between viral sequences and sequences from cellular organisms. PhEVER integrates extensive data from up-to-date completely sequenced genomes (2426 non-redundant viral genomes, 1007 non-redundant prokaryotic genomes, 43 eukaryotic genomes ranging from plants to vertebrates) and offers a clustering of proteins into homologous families containing at least one viral sequence, as well as alignments and phylogenies for each of these families. Public access to PhEVER is available through its webpage and through all dedicated ACNUC retrieval systems.
Affiliation(s)
- Leonor Palmeira
  - CNRS, UMR5558, Laboratoire de Biométrie et Biologie Évolutive, PRABI, Pôle Rhône-Alpes de Bioinformatique, F-69622, Villeurbanne, France.
7. Gomes KA, Almeida TC, Gesteira AS, Lôbo IP, Guimarães ACR, de Miranda AB, Van Sluys MA, da Cruz RS, Cascardo JC, Carels N. ESTs from Seeds to Assist the Selective Breeding of Jatropha curcas L. for Oil and Active Compounds. Genomics Insights 2010. [PMID: 26217103 PMCID: PMC4510598 DOI: 10.4137/gei.s4340] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 11/29/2022]
Abstract
We report here on the characterization of a cDNA library from seeds of Jatropha curcas L. at three stages of fruit maturation before yellowing. We sequenced a total of 2200 clones and obtained a set of 931 non-redundant sequences (unigenes) after trimming and quality control, i.e., 140 contigs and 791 singlets with PHRED quality ≥10. We found low levels of sequence redundancy and extensive metabolic coverage by homology comparison to GO. After comparison of 5841 non-redundant ESTs from a total of 13193 reads from GenBank with KEGG, we identified tags with nucleotide variations among J. curcas accessions for genes of fatty acid, terpene, alkaloid, quinone and hormone pathways of biosynthesis. More specifically, the expression level of four genes (palmitoyl-acyl carrier protein thioesterase, 3-ketoacyl-CoA thiolase B, lysophosphatidic acid acyltransferase and geranyl pyrophosphate synthase) measured by real-time PCR proved to be significantly different between leaves and fruits. Since the nucleotide polymorphism of these tags is associated with a higher level of gene expression in fruits compared to leaves, we propose this approach to speed up the search for quantitative traits in selective breeding of J. curcas. We also discuss its potential utility for the selective breeding of economically important traits in J. curcas.
Affiliation(s)
- Kleber A Gomes
  - Universidade Estadual de Santa Cruz (UESC), Centro de Biotecnologia e Genética. Laboratório de Genômica e Proteômica, Ilhéus, Bahia, Brazil
  - Universidade de São Paulo, Instituto de Biociências, Departamento de Botânica, São Paulo, SP, Brazil
- Tiago C Almeida
  - Universidade Estadual de Santa Cruz (UESC), Centro de Biotecnologia e Genética. Laboratório de Genômica e Proteômica, Ilhéus, Bahia, Brazil
- Abelmon S Gesteira
  - Universidade Estadual de Santa Cruz (UESC), Centro de Biotecnologia e Genética. Laboratório de Genômica e Proteômica, Ilhéus, Bahia, Brazil
  - Empresa Brasileira de Pesquisa Agropecuária (EMBRAPA) Mandioca e Fruticultura Tropical, Cruz das Almas, Bahia, Brazil
- Ivon P Lôbo
  - Universidade Estadual de Santa Cruz (UESC), Grupo Bioenergia e Meio Ambiente, Ilhéus, Bahia, Brazil
- Ana Carolina R Guimarães
  - Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil
- Antonio B de Miranda
  - Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil
- Marie-Anne Van Sluys
  - Universidade de São Paulo, Instituto de Biociências, Departamento de Botânica, São Paulo, SP, Brazil
- Rosenira S da Cruz
  - Universidade Estadual de Santa Cruz (UESC), Grupo Bioenergia e Meio Ambiente, Ilhéus, Bahia, Brazil
- Júlio CM Cascardo
  - Universidade Estadual de Santa Cruz (UESC), Centro de Biotecnologia e Genética. Laboratório de Genômica e Proteômica, Ilhéus, Bahia, Brazil
- Nicolas Carels
  - Universidade Estadual de Santa Cruz (UESC), Centro de Biotecnologia e Genética. Laboratório de Genômica e Proteômica, Ilhéus, Bahia, Brazil
  - Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil
8. Emery LR, Sharp PM. Impact of translational selection on codon usage bias in the archaeon Methanococcus maripaludis. Biol Lett 2010; 7:131-5. [PMID: 20810428 DOI: 10.1098/rsbl.2010.0620] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 01/07/2023]
Abstract
Patterns of codon usage have been extensively studied among Bacteria and Eukaryotes, but there has been little investigation of species from the third domain of life, the Archaea. Here, we examine the nature of codon usage bias in a methanogenic archaeon, Methanococcus maripaludis. Genome-wide patterns of codon usage are dominated by a strong A + T bias, presumably largely reflecting mutation patterns. Nevertheless, there is variation among genes in the use of a subset of putatively translationally optimal codons, which is strongly correlated with gene expression level. In comparison with Bacteria such as Escherichia coli, the strength of selected codon usage bias in highly expressed genes in M. maripaludis seems surprisingly high given its moderate growth rate. However, the pattern of selected codon usage differs between M. maripaludis and E. coli: in the archaeon, strongly selected codon usage bias is largely restricted to twofold degenerate amino acids (AAs). Weaker bias among the codons for fourfold degenerate AAs is consistent with the small number of tRNA genes in the M. maripaludis genome.
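Codon usage bias within a degenerate codon family can be summarized as the share of each synonymous codon among all occurrences of that family. The following is a minimal sketch, assuming an in-frame coding sequence. The function name and the example family (the twofold degenerate lysine codons AAA and AAG) are illustrative and not taken from the study.

```python
from collections import Counter


def synonymous_fractions(cds: str, family: tuple) -> dict:
    """Fraction of each codon of `family` among all family codons in `cds`.

    `cds` is read in frame 0; trailing bases that do not fill a codon are
    ignored. Hypothetical helper for illustration only.
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codon for codon in codons if codon in family)
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in family}
    return {codon: counts[codon] / total for codon in family}


# Codons read: AAA, AAG, AAA, GGG (the glycine codon is outside the family).
lys_usage = synonymous_fractions("AAAAAGAAAGGG", ("AAA", "AAG"))
```

Computing these fractions separately for highly and lowly expressed genes, and comparing them, is one simple way to look for the expression-correlated bias the abstract describes.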
Affiliation(s)
- Laura R Emery
  - Institute of Evolutionary Biology, University of Edinburgh, King's Buildings, Edinburgh EH9 3JT, UK
9.
Abstract
In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by Codon Structure Factor (CSF) and by a method that we called Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification by UFM is higher than by CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine (C), Guanine (G), and Adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in 1st and 2nd position of triplets and (v) the distance of their GC3 vs. GC2 levels to the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (>250 bp), Drosophila melanogaster (>250 bp) and Arabidopsis thaliana (>200 bp) are successfully classified with a false positive rate lower or equal to 5%. The method releases coding sequences in their coding strand and coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.
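Two of the features listed above, the stop codon frequency in a reading frame and the product of purine probabilities over the three codon positions, can be sketched as follows. This is only an illustration of those two quantities: the function names are hypothetical, and the way UFM weights and combines its features into a score is not given in the abstract and is not reproduced here.

```python
from math import prod

STOP_CODONS = {"TAA", "TAG", "TGA"}
PURINES = {"A", "G"}


def split_codons(seq: str, frame: int = 0) -> list:
    """Complete codons of `seq` read from offset `frame` (0, 1 or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def stop_frequency(seq: str, frame: int = 0) -> float:
    """Fraction of codons in the given frame that are stop codons."""
    codons = split_codons(seq, frame)
    if not codons:
        return 0.0
    return sum(codon in STOP_CODONS for codon in codons) / len(codons)


def purine_product(seq: str, frame: int = 0) -> float:
    """Product of purine frequencies at codon positions 1, 2 and 3."""
    codons = split_codons(seq, frame)
    if not codons:
        return 0.0
    return prod(
        sum(codon[pos] in PURINES for codon in codons) / len(codons)
        for pos in range(3)
    )


orf = "ATGGAGTAA"                         # codons in frame 0: ATG, GAG, TAA
stops_frame0 = stop_frequency(orf, 0)     # one stop among three codons
stops_frame1 = stop_frequency(orf, 1)     # codons TGG, AGT: no stop
purines_frame0 = purine_product(orf, 0)   # (2/3) * (2/3) * (3/3)
```

Evaluating such features in each frame and on both strands is what would let a classifier of this kind report the coding strand and frame along with the coding/non-coding call.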
Affiliation(s)
- Nicolas Carels
  - Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil
10. Penel S, Arigon AM, Dufayard JF, Sertier AS, Daubin V, Duret L, Gouy M, Perrière G. Databases of homologous gene families for comparative genomics. BMC Bioinformatics 2009; 10 Suppl 6:S3. [PMID: 19534752 PMCID: PMC2697650 DOI: 10.1186/1471-2105-10-s6-s3] [Citation(s) in RCA: 104] [Impact Index Per Article: 6.5] [Indexed: 12/22/2022]
Abstract
Background: Comparative genomics is a central step in many sequence analysis studies, from gene annotation and the identification of new functional regions in genomes, to the study of evolutionary processes at the molecular level (speciation, single gene or whole genome duplications, etc.) and phylogenetics. In that context, databases providing users high quality homologous families and sequence alignments as well as phylogenetic trees based on state of the art algorithms are becoming indispensable. Methods: We developed an automated procedure allowing massive all-against-all similarity searches, gene clustering, multiple alignments computation, and phylogenetic trees construction and reconciliation. The application of this procedure to a very large set of sequences is possible through parallel computing on a large computer cluster. Results: Three databases were developed using this procedure: HOVERGEN, HOGENOM and HOMOLENS. These databases share the same architecture but differ in their content. HOVERGEN contains sequences from vertebrates, HOGENOM is mainly devoted to completely sequenced microbial organisms, and HOMOLENS is devoted to metazoan genomes from Ensembl. Access to the databases is provided through Web query forms, a general retrieval system and a client-server graphical interface. The latter can be used to perform tree-pattern based searches allowing, among other uses, the retrieval of sets of orthologous genes. The three databases, as well as the software required to build and query them, can be used or downloaded from the PBIL (Pôle Bioinformatique Lyonnais) site.
Affiliation(s)
- Simon Penel
  - Laboratoire de Biométrie et Biologie Evolutive, CNRS, Université Claude Bernard - Lyon 1, 43 bd. du 11 Novembre 1918, 69622 Villeurbanne Cedex, France.
11. Gout JF, Duret L, Kahn D. Differential retention of metabolic genes following whole-genome duplication. Mol Biol Evol 2009; 26:1067-72. [PMID: 19218282 DOI: 10.1093/molbev/msp026] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Indexed: 11/13/2022]
Abstract
Classical studies in Metabolic Control Theory have shown that metabolic fluxes usually exhibit little sensitivity to changes in individual enzyme activity, yet remain sensitive to global changes of all enzymes in a pathway. Therefore, little selective pressure is expected on the dosage or expression of individual metabolic genes, yet entire pathways should still be constrained. However, a direct estimate of this selective pressure had not been evaluated. Whole-genome duplications (WGDs) offer a good opportunity to address this question by analyzing the fates of metabolic genes during the massive gene losses that follow. Here, we take advantage of the successive rounds of WGD that occurred in the Paramecium lineage. We show that metabolic genes exhibit different gene retention patterns than nonmetabolic genes. Contrary to what was expected for individual genes, metabolic genes appeared more retained than other genes after the recent WGD, which was best explained by selection for gene expression operating on entire pathways. Metabolic genes also tend to be less retained when present at high copy number before WGD, contrary to other genes that show a positive correlation between gene retention and preduplication copy number. This is rationalized on the basis of the classical concave relationship relating metabolic fluxes with enzyme expression.
Affiliation(s)
- Jean-François Gout
  - Université de Lyon, Université Lyon 1, CNRS, INRIA, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, Villeurbanne, France
12. Picoeukaryotic sequences in the Sargasso sea metagenome. Genome Biol 2008; 9:R5. [PMID: 18179699 PMCID: PMC2395239 DOI: 10.1186/gb-2008-9-1-r5] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.9] [Received: 10/16/2007] [Revised: 12/06/2007] [Accepted: 01/07/2008] [Indexed: 11/21/2022]
Abstract
Many sequences from picoeukaryotes were found in DNA sequence data assembled from Sargasso seawater. Background With genome sequencing becoming more and more affordable, environmental shotgun sequencing of the microorganisms present in an environment generates a challenging amount of sequence data for the scientific community. These sequence data enable the diversity of the microbial world and the metabolic pathways within an environment to be investigated, a previously unthinkable achievement when using traditional approaches. DNA sequence data assembled from extracts of 0.8 μm filtered Sargasso seawater unveiled an unprecedented glimpse of marine prokaryotic diversity and gene content. Serendipitously, many sequences representing picoeukaryotes (cell size <2 μm) were also present within this dataset. We investigated the picoeukaryotic diversity of this database by searching sequences containing homologs of eight nuclear anchor genes that are well conserved throughout the eukaryotic lineage, as well as one chloroplastic and one mitochondrial gene. Results We found up to 41 distinct eukaryotic scaffolds, with a broad phylogenetic spread on the eukaryotic tree of life. The average eukaryotic scaffold size is 2,909 bp, with one gap every 1,253 bp. Strikingly, the AT frequency of the eukaryotic sequences (51.4%) is significantly lower than the average AT frequency of the metagenome (61.4%). This represents 4% to 18% of the estimated prokaryotic diversity, depending on the average prokaryotic versus eukaryotic genome size ratio. Conclusion Despite similar cell size, eukaryotic sequences of the Sargasso Sea metagenome have higher GC content, suggesting that different environmental pressures affect the evolution of their base composition.
|
13
|
In silico whole-genome screening for cancer-related single-nucleotide polymorphisms located in human mRNA untranslated regions. BMC Genomics 2007; 8:2. [PMID: 17201911 PMCID: PMC1774567 DOI: 10.1186/1471-2164-8-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Received: 07/14/2006] [Accepted: 01/03/2007] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND A promising application of the huge amounts of genetic data currently available lies in developing a better understanding of complex diseases, such as cancer. Analysis of publicly available databases can help identify potential candidates for genes or mutations specifically related to the cancer phenotype. In spite of their huge potential to affect gene function, no systematic attention has been paid so far to the changes that occur in untranslated regions of mRNA. RESULTS In this study, we used Expressed Sequence Tag (EST) databases as a source for cancer-related sequence polymorphism discovery at the whole-genome level. Using a novel computational procedure, we focused on the identification of untranslated region (UTR)-localized non-coding Single Nucleotide Polymorphisms (UTR-SNPs) significantly associated with the tumoral state. To explore possible relationships between genetic mutation and phenotypic variation, bioinformatic tools were used to predict the potential impact of cancer-associated UTR-SNPs on mRNA secondary structure and UTR regulatory elements. We provide a comprehensive and unbiased description of cancer-associated UTR-SNPs that may be useful to define genotypic markers or to propose polymorphisms that can act to alter gene expression levels. Our results suggest that a fraction of cancer-associated UTR-SNPs may have functional consequences on mRNA stability and/or expression. CONCLUSION We have undertaken a comprehensive effort to identify cancer-associated polymorphisms in untranslated regions of mRNA and to characterize putative functional UTR-SNPs. Alteration of translational control can change the expression of genes in tumor cells, causing an increase or decrease in the concentration of specific proteins. Through the description of testable candidates and the experimental validation of a number of UTR-SNPs discovered on the secreted protein acidic and rich in cysteine (SPARC) gene, this report illustrates the utility of a cross-talk between in silico transcriptomics and cancer genetics.
|
14
|
Aouacheria A, Navratil V, Barthelaix A, Mouchiroud D, Gautier C. Bioinformatic screening of human ESTs for differentially expressed genes in normal and tumor tissues. BMC Genomics 2006; 7:94. [PMID: 16640784 PMCID: PMC1459866 DOI: 10.1186/1471-2164-7-94] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 10/14/2005] [Accepted: 04/26/2006] [Indexed: 11/24/2022] Open
Abstract
Background Owing to the explosion of information generated by human genomics, analysis of publicly available databases can help identify potential candidate genes relevant to the cancerous phenotype. The aim of this study was to scan for such genes by whole-genome in silico subtraction using Expressed Sequence Tag (EST) data. Methods Genes differentially expressed in normal versus tumor tissues were identified using a computer-based differential display strategy. Bcl-xL, an anti-apoptotic member of the Bcl-2 family, was selected for confirmation by western blot analysis. Results Our genome-wide expression analysis identified a set of genes whose differential expression may be attributed to the genetic alterations associated with tumor formation and malignant growth. We propose complete lists of genes that may serve as targets for projects seeking novel candidates for cancer diagnosis and therapy. Our validation result showed increased protein levels of Bcl-xL in two different liver cancer specimens compared to normal liver. Notably, our EST-based data mining procedure indicated that most of the changes in gene expression observed in cancer cells corresponded to gene inactivation patterns. Chromosomes and chromosomal regions most frequently associated with aberrant expression changes in cancer libraries were also determined. Conclusion Through the description of several candidates (including genes encoding extracellular matrix and ribosomal components, cytoskeletal proteins, apoptotic regulators, and novel tissue-specific biomarkers), our study illustrates the utility of in silico transcriptomics to identify tumor cell signatures, tumor-related genes and chromosomal regions frequently associated with aberrant expression in cancer.
Affiliation(s)
- Abdel Aouacheria
- Laboratoire de Biométrie et Biologie Evolutive, CNRS UMR 5558, Université Claude Bernard Lyon 1, 69622 Villeurbanne Cedex, France
- Current address: Apoptosis and Oncogenesis Laboratory, IBCP, UMR 5086 CNRS-UCBL, IFR 128, Lyon, France
- Vincent Navratil
- Laboratoire de Biométrie et Biologie Evolutive, CNRS UMR 5558, Université Claude Bernard Lyon 1, 69622 Villeurbanne Cedex, France
- Dominique Mouchiroud
- Laboratoire de Biométrie et Biologie Evolutive, CNRS UMR 5558, Université Claude Bernard Lyon 1, 69622 Villeurbanne Cedex, France
- Christian Gautier
- Laboratoire de Biométrie et Biologie Evolutive, CNRS UMR 5558, Université Claude Bernard Lyon 1, 69622 Villeurbanne Cedex, France
|
15
|
Abstract
Classification of proteins into families of homologous sequences constitutes the basis of functional analysis or of evolutionary studies. Here we present INVertebrate HOmologous GENes (INVHOGEN), a database combining the available invertebrate protein genes from UniProt (consisting of Swiss-Prot and TrEMBL) into gene families. For each family INVHOGEN provides a multiple protein alignment, a maximum likelihood based phylogenetic tree and taxonomic information about the sequences. It is possible to download the corresponding GenBank flatfiles, the alignment and the tree in Newick format. Sequences and related information have been structured in an ACNUC database under a client/server architecture. Thus, complex selections can be performed. An external graphical tool (FamFetch) allows access to the data to evaluate homology relationships between genes and distinguish orthologous from paralogous sequences. Thus, INVHOGEN complements the well-known HOVERGEN database. The databank is available at .
Affiliation(s)
- Ingo Paulsen
- Department of Bioinformatics, Institute for Computer Sciences, Heinrich-Heine-University Duesseldorf, Universitaetsstrasse 1, 40225 Duesseldorf, Germany.
|
16
|
Croce O, Lamarre M, Christen R. Querying the public databases for sequences using complex keywords contained in the feature lines. BMC Bioinformatics 2006; 7:45. [PMID: 16441875 PMCID: PMC1403806 DOI: 10.1186/1471-2105-7-45] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 10/03/2005] [Accepted: 01/27/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND High throughput technologies often require the retrieval of large data sets of sequences. Retrieval of EMBL or GenBank entries using keywords is easy using tools such as ACNUC, Entrez or SRS, but has some limitations, in particular when querying with complex keywords. RESULTS We show that Entrez has severe limitations with respect to retrieving subsequences. SRS works well with simple keywords but not with keywords composed of several terms, and has problems with complex queries. ACNUC works well, but does not allow precise queries in the Feature qualifiers. We developed specific Perl scripts to precisely retrieve subsequences as defined by complex descriptors in the Features qualifiers of the EMBL entries. We improved parts of the bioPerl library to allow parsing of large data files, and we embedded these scripts in a user friendly interface (OS independent) for easy use. CONCLUSION Although not as fast as the public tools that use prebuilt indexes, parsing the complete entries using a script is often necessary in order to retrieve the exact data searched for. Embedding in a user friendly interface allows biologists to use the scripts, which can easily be modified, if necessary, by bioinformaticians for unforeseen needs.
Affiliation(s)
- Olivier Croce
- Laboratoire de Biologie Virtuelle, UMR 6543, CNRS & University of Nice Sophia-Antipolis, Centre de Biochimie, Parc Valrose, Nice, F06108, France
- Michaël Lamarre
- Laboratoire de Biologie Virtuelle, UMR 6543, CNRS & University of Nice Sophia-Antipolis, Centre de Biochimie, Parc Valrose, Nice, F06108, France
- Richard Christen
- Laboratoire de Biologie Virtuelle, UMR 6543, CNRS & University of Nice Sophia-Antipolis, Centre de Biochimie, Parc Valrose, Nice, F06108, France
|
17
|
Aouacheria A, Navratil V, Wen W, Jiang M, Mouchiroud D, Gautier C, Gouy M, Zhang M. In silico whole-genome scanning of cancer-associated nonsynonymous SNPs and molecular characterization of a dynein light chain tumour variant. Oncogene 2005; 24:6133-42. [PMID: 15897869 DOI: 10.1038/sj.onc.1208745] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Indexed: 12/31/2022]
Abstract
The last decade has led to the accumulation of large amounts of data on cancer genetics, opening unprecedented access to the mapping of cancer genes in the human genome. Single-nucleotide polymorphisms (SNPs), the most common form of DNA variation in humans, emerge as an invaluable tool for cancer association studies. These genotypic markers can be used to assay how alleles of candidate genes correlate with the malignant phenotype, and may provide new clues into the genetic modifications that characterize cancer onset. In this cancer-oriented study, we detail an SNP mining strategy based on the analysis of expressed sequence tags among publicly available databases. Our whole-genome approach provides a comprehensive and unbiased description of nonsynonymous SNPs (nsSNPs) in tumoral versus normal tissues. To gain further insights into the possible relationships between genetic variation and altered phenotype, locations of a subset of nsSNPs were mapped onto protein domains known to be critical for protein function. Computational methods were also used to predict the potential impact of these cancer-associated nsSNPs on protein structure and function. We illustrate our approach through the detailed biochemical and structural characterization of a previously unknown cancer-associated mutation (G79C) affecting the 8 kDa dynein light chain (DNCL1).
Affiliation(s)
- Abdel Aouacheria
- Laboratoire de Biométrie et Biologie Evolutive, CNRS UMR 5558, Université Claude Bernard Lyon 1, F-69622 Villeurbanne Cedex, France.
|
18
|
Stenøien HK, Stephan W. Global mRNA stability is not associated with levels of gene expression in Drosophila melanogaster but shows a negative correlation with codon bias. J Mol Evol 2005; 61:306-14. [PMID: 16044249 DOI: 10.1007/s00239-004-0271-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Received: 09/02/2004] [Accepted: 03/16/2005] [Indexed: 11/26/2022]
Abstract
A multitude of factors contribute to the regulation of gene expression in living cells. The relationship between codon usage bias and gene expression has been extensively studied, and it has been shown that codon bias may have adaptive significance in many unicellular and multicellular organisms. Given the central role of mRNA in post-transcriptional regulation, we hypothesize that mRNA stability is another important factor associated either with positive or negative regulation of gene expression. We have conducted genome-wide studies of the association between gene expression (measured as transcript abundance in public EST databases), mRNA stability, codon bias, GC content, and gene length in Drosophila melanogaster. To remove potential bias of gene length inherently present in EST libraries, gene expression is measured as normalized transcript abundance. It is demonstrated that codon bias and GC content in second codon position are positively associated with transcript abundance. Gene length is negatively associated with transcript abundance. The stability of thermodynamically predicted mRNA secondary structures is not associated with transcript abundance, but there is a negative correlation between mRNA stability and codon bias. This finding does not support the hypothesis that codon bias has evolved as an indirect consequence of selection favoring thermodynamically stable mRNA molecules.
Affiliation(s)
- Hans K Stenøien
- Plant Ecology/Department of Ecology and Evolution, Evolutionary Biology Centre, Uppsala University, SE-752 36, Uppsala, Sweden
|
19
|
Khelifi A, Duret L, Mouchiroud D. HOPPSIGEN: a database of human and mouse processed pseudogenes. Nucleic Acids Res 2005; 33:D59-66. [PMID: 15608268 PMCID: PMC540038 DOI: 10.1093/nar/gki084] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.7] [Indexed: 11/30/2022] Open
Abstract
Processed pseudogenes result from reverse transcribed mRNAs. In general, because processed pseudogenes lack promoters, they are no longer functional from the moment they are inserted into the genome. Subsequently, they freely accumulate substitutions, insertions and deletions. Moreover, the ancestral structure of processed pseudogenes can be easily inferred using the sequence of their functional homologous genes. Owing to these characteristics, processed pseudogenes represent good neutral markers for studying genome evolution. Recently, there has been increasing interest in these markers, particularly to help gene prediction in the field of genome annotation, functional genomics and genome evolution analysis (patterns of substitution). For these reasons, we have developed a method to annotate processed pseudogenes in complete genomes. To make them useful to different fields of research, we stored them in a nucleic acid database after having annotated them. In this work, we screened both mouse and human complete genomes from ENSEMBL to find processed pseudogenes generated from functional genes with introns. We used a conservative method to detect processed pseudogenes in order to minimize the rate of false positive sequences. Within processed pseudogenes, some still have a conserved open reading frame and some have overlapping gene locations. We designated as retroelements all reverse transcribed sequences and, more strictly, we designated as processed pseudogenes all retroelements not falling in the two former categories (having a conserved open reading frame or overlapping gene locations). We annotated 5823 retroelements (5206 processed pseudogenes) in the human genome and 3934 (3428 processed pseudogenes) in the mouse genome. Compared to previous estimations, the total number of processed pseudogenes was underestimated but the aim of this procedure was to generate a high-quality dataset. To facilitate the use of processed pseudogenes in studying genome structure and evolution, DNA sequences from processed pseudogenes, and their functional reverse transcribed homologs, are now stored in a nucleic acid database, HOPPSIGEN. HOPPSIGEN can be browsed on the PBIL (Pôle Bioinformatique Lyonnais) World Wide Web server (http://pbil.univ-lyon1.fr/) or fully downloaded for local installation.
Affiliation(s)
- Adel Khelifi
- Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Claude Bernard-Lyon 1, 43 bd. du 11 Novembre 1918, 69622 Villeurbanne Cedex, France.
|
20
|
Sharp PM, Bailes E, Grocock RJ, Peden JF, Sockett RE. Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Res 2005; 33:1141-53. [PMID: 15728743 PMCID: PMC549432 DOI: 10.1093/nar/gki242] [Citation(s) in RCA: 305] [Impact Index Per Article: 15.3] [Received: 12/15/2004] [Revised: 01/10/2005] [Accepted: 01/23/2005] [Indexed: 12/21/2022] Open
Abstract
Among bacteria, many species have synonymous codon usage patterns that have been influenced by natural selection for those codons that are translated more accurately and/or efficiently. However, in other species selection appears to have been ineffective. Here, we introduce a population genetics-based model for quantifying the extent to which selection has been effective. The approach is applied to 80 phylogenetically diverse bacterial species for which whole genome sequences are available. The strength of selected codon usage bias, S, is found to vary substantially among species; in 30% of the genomes examined, there was no significant evidence that selection had been effective. Values of S are highly positively correlated with both the number of rRNA operons and the number of tRNA genes. These results are consistent with the hypothesis that species exposed to selection for rapid growth have more rRNA operons, more tRNA genes and more strongly selected codon usage bias. For example, Clostridium perfringens, the species with the highest value of S, can have a generation time as short as 7 min.
Affiliation(s)
- Paul M Sharp
- Institute of Genetics, University of Nottingham, Queens Medical Centre, Nottingham NG7 2UH, UK.
|
21
|
Charif D, Thioulouse J, Lobry JR, Perrière G. Online synonymous codon usage analyses with the ade4 and seqinR packages. Bioinformatics 2004; 21:545-7. [PMID: 15374859 DOI: 10.1093/bioinformatics/bti037] [Citation(s) in RCA: 84] [Impact Index Per Article: 4.0] [Indexed: 11/14/2022] Open
Abstract
Correspondence analysis of codon usage data is a widely used method in sequence analysis, but the variability in amino acid composition between proteins is a confounding factor when one wants to analyse synonymous codon usage variability. A simple and natural way to cope with this problem is to use within-group correspondence analysis. There is, however, no user-friendly implementation of this method available for genomic studies. Our motivation was to provide to the community a Web facility to easily study synonymous codon usage on a subset of data available in public genomic databases. AVAILABILITY Available through the Pôle Bioinformatique Lyonnais (PBIL) Web server at http://pbil.univ-lyon1.fr/datasets/charif04/ with a demo allowing us to reproduce the figure in the present application note. All underlying software is distributed under a GPL licence. CONTACT http://pbil.univ-lyon1.fr/members/lobry.
Affiliation(s)
- D Charif
- Laboratoire de Biométrie et Biologie Evolutive-CNRS UMR 5558, and INRIA Helix project, Université Claude Bernard-Lyon I 43 bd. du 11 Novembre 1918, F-69622 Villeurbanne cedex, France
|
22
|
Abstract
The existence of a well conserved linear relationship between GC levels of genes' second and third codon positions (GC2, GC3) prompted us to focus on the landscape, or joint distribution, spanned by these two variables. In human, well curated coding sequences now cover at least 15%-30% of the estimated total gene set. Our analysis of the landscape defined by this gene set revealed not only the well documented linear crest, but also the presence of several peaks and valleys along that crest, a property that was also indicated in two other warm-blooded vertebrates represented by large gene databases, that is, mouse and chicken. GC2 is the sum of eight amino acid frequencies, whereas GC3 is linearly related to the GC level of the chromosomal region containing the gene. The landscapes therefore portray relations between proteins and the DNA environments of the genes that encode them.
Affiliation(s)
- Stéphane Cruveiller
- Laboratorio di Evoluzione Molecolare, Stazione Zoologica Anton Dohrn, 80121 Napoli, Italy
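The statement in the abstract above that GC2 is the sum of eight amino acid frequencies follows from the standard genetic code, and a short sketch can check it. The Python below is illustrative only and is not taken from the cited work: the function names `gc_at_position` and `gc2_from_amino_acids` and the toy sequence are hypothetical, and the sketch assumes an in-frame coding sequence with no stop codon.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in TCAG order; '*' = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Amino acids whose codons carry G or C at the second codon position.
GC2_AMINO_ACIDS = {
    aa for codon, aa in CODON_TABLE.items() if codon[1] in "GC" and aa != "*"
}


def gc_at_position(cds: str, pos: int) -> float:
    """Fraction of G or C at codon position pos (1, 2 or 3) of an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing incomplete codon
    bases = cds[pos - 1:usable:3]
    if not bases:
        raise ValueError("sequence contains no complete codon")
    return sum(b in "GC" for b in bases) / len(bases)


def gc2_from_amino_acids(cds: str) -> float:
    """GC2 computed as the summed frequency of the GC2 amino acids.

    Assumes the sequence contains no stop codon; a stop such as TGA has G at
    position 2 but encodes no amino acid, so the two routes would differ.
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons:
        raise ValueError("sequence contains no complete codon")
    return sum(CODON_TABLE[c] in GC2_AMINO_ACIDS for c in codons) / len(codons)


if __name__ == "__main__":
    toy_cds = "ATGGCTTGGAAA"  # hypothetical sequence: Met-Ala-Trp-Lys
    print(sorted(GC2_AMINO_ACIDS))
    print(gc_at_position(toy_cds, 2), gc2_from_amino_acids(toy_cds))
    print(gc_at_position(toy_cds, 3))
```

For the toy sequence both routes give a GC2 of 0.5, and the eight residues recovered from the code table are Ala, Cys, Gly, Pro, Arg, Ser, Thr and Trp.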
|
23
|
Marais G, Charlesworth B, Wright SI. Recombination and base composition: the case of the highly self-fertilizing plant Arabidopsis thaliana. Genome Biol 2004; 5:R45. [PMID: 15239830 PMCID: PMC463295 DOI: 10.1186/gb-2004-5-7-r45] [Citation(s) in RCA: 61] [Impact Index Per Article: 2.9] [Received: 03/26/2004] [Revised: 04/26/2004] [Accepted: 04/30/2004] [Indexed: 11/24/2022] Open
Abstract
The effects of recombination and self-fertilization on base composition were investigated both theoretically and experimentally in the Arabidopsis genome. Levels of inbreeding modulate the effect of recombination on base composition. Background Rates of recombination can vary among genomic regions in eukaryotes, and this is believed to have major effects on their genome organization in terms of base composition, DNA repeat density, intron size, evolutionary rates and gene order. In highly self-fertilizing species such as Arabidopsis thaliana, however, heterozygosity is expected to be strongly reduced and recombination will be much less effective, so that its influence on genome organization should be greatly reduced. Results Here we investigated theoretically the joint effects of recombination and self-fertilization on base composition, and tested the predictions with genomic data from the complete A. thaliana genome. We show that, in this species, both codon-usage bias and GC content do not correlate with the local rates of crossing over, in agreement with our theoretical results. Conclusions We conclude that levels of inbreeding modulate the effect of recombination on base composition, and possibly other genomic features (for example, transposable element dynamics). We argue that inbreeding should be considered when interpreting patterns of molecular evolution.
Affiliation(s)
- G Marais
- Institute of Cell, Animal and Population Biology, University of Edinburgh, EH9 3JT Edinburgh, UK
- B Charlesworth
- Institute of Cell, Animal and Population Biology, University of Edinburgh, EH9 3JT Edinburgh, UK
- S I Wright
- Institute of Cell, Animal and Population Biology, University of Edinburgh, EH9 3JT Edinburgh, UK
- Current address: Department of Biology, York University, 4700 Keele St, Toronto, Ontario M3J 1P3, Canada
|
24
|
Perrière G, Combet C, Penel S, Blanchet C, Thioulouse J, Geourjon C, Grassot J, Charavay C, Gouy M, Duret L, Deléage G. Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Res 2003; 31:3393-9. [PMID: 12824334 PMCID: PMC168937 DOI: 10.1093/nar/gkg530] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.1] [Indexed: 11/12/2022] Open
Abstract
The World Wide Web server of the PBIL (Pôle Bioinformatique Lyonnais) provides on-line access to sequence databanks and to many tools for nucleic acid and protein sequence analysis. This server allows users to query nucleotide sequence banks in the EMBL and GenBank formats and protein sequence banks in the SWISS-PROT and PIR formats. The query engine on which our data bank access is based is the ACNUC system. It makes it possible to build complex queries to access functional zones of biological interest and to retrieve large sequence sets. Of special interest are the unique features provided by this system to query the data banks of gene families developed at the PBIL. The server also provides access to a wide range of sequence analysis methods: similarity search programs, multiple alignments, protein structure prediction and multivariate statistics. A distinctive feature of this server is the integration of these two aspects: sequence retrieval and sequence analysis. Indeed, thanks to the introduction of re-usable lists, it is possible to perform treatments on large sets of data. The PBIL server can be reached at: http://pbil.univ-lyon1.fr.
Affiliation(s)
- Guy Perrière
- Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS no. 5558, Université Claude Bernard, Lyon 1, 43 bd du 11 Novembre 1918, 69622 Villeurbanne Cedex, France.
|
25
|
Abstract
Receptor Tyrosine Kinases (RTK) are transmembrane receptors specifically found in metazoans. They represent an excellent model for studying evolution of cellular processes in metazoans because they encompass large families of modular proteins and belong to a major family of contingency generating molecules in eukaryotic cells: the protein kinases. Because tyrosine kinases have been under close scrutiny for many years in various species, they are associated with a wealth of information, mainly in mammals. At present, most categories of RTK have been identified in mammals, but in the near future other model species will be sequenced and will provide RTKs from other metazoan clades. Thus, collecting RTK sequences would provide a good starting point as a new model for comparative and evolutionary studies applying to multigene families. In this context, we are developing the Receptor Tyrosine Kinase database (RTKdb), which is the only database on tyrosine kinase receptors presently available. In this database, protein sequences from eight model metazoan species are organized under the format previously used for the HOVERGEN, HOBACGEN and NUREBASE systems. RTKdb can be accessed through the PBIL (Pôle Bioinformatique Lyonnais) World Wide Web server at http://pbil.univ-lyon1.fr/RTKdb/, or through the FamFetch graphical user interface available at the same address.
Affiliation(s)
- Julien Grassot
- Centre de Génétique Moléculaire et Cellulaire, UMR CNRS 5534, Université Claude Bernard-Lyon 1, 43 bd. du 11 Novembre 1918, 69622 Villeurbanne Cedex, France.
|
26
|
Daubin V, Lerat E, Perrière G. The source of laterally transferred genes in bacterial genomes. Genome Biol 2003; 4:R57. [PMID: 12952536 PMCID: PMC193657 DOI: 10.1186/gb-2003-4-9-r57] [Citation(s) in RCA: 148] [Impact Index Per Article: 6.7] [Received: 04/16/2003] [Revised: 06/11/2003] [Accepted: 07/04/2003] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Laterally transferred genes have often been identified on the basis of compositional features that distinguish them from ancestral genes in the genome. These genes are usually A+T-rich, arguing either that there is a bias towards acquiring genes from donor organisms having low G+C contents or that genes acquired from organisms of similar genomic base compositions go undetected in these analyses. RESULTS By examining the genome contents of closely related, fully sequenced bacteria, we uncovered genes confined to a single genome and examined the sequence features of these acquired genes. The analysis shows that few transfer events are overlooked by compositional analyses. Most observed lateral gene transfers do not correspond to free exchange of regular genes among bacterial genomes, but more probably represent the constituents of phages or other selfish elements. CONCLUSIONS Although bacteria tend to acquire large amounts of DNA, the origin of these genes remains obscure. We have shown that contrary to what is often supposed, their composition cannot be explained by a previous genomic context. In contrast, these genes fit the description of recently described genes in lambdoid phages, named 'morons'. Therefore, results from genome content and compositional approaches to detect lateral transfers should not be cited as evidence for genetic exchange between distantly related bacteria.
Affiliation(s)
- Vincent Daubin
- Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Claude Bernard - Lyon 1, Cedex, France.
|
27
|
Abstract
Amino acids interact with each other, especially with neighboring amino acids, to generate protein structures. We studied the pattern of association and repulsion of amino acids based on 24,748 protein-coding genes from human, 11,321 from mouse, and 15,028 from Escherichia coli, and documented the pattern of neighbor preference of amino acids. All amino acids have different preferences for neighbors. We have also analyzed 7,342 proteins with known secondary structure and estimated the propensity of the 20 amino acids occurring in three of the major secondary structures, i.e., helices, sheets, and turns. Much of the neighbor preference can be explained by the propensity of the amino acids in forming different secondary structures, but there are also a number of intriguing association and repulsion patterns. The similarity in neighbor preference among amino acids is significantly correlated with the number of amino acid substitutions in both mitochondrial and nuclear genes, with amino acids having similar sets of neighbors replacing each other more frequently than those having very different sets of neighbors. This similarity in neighbor preference is incorporated into a new index of amino acid dissimilarities that can predict nonsynonymous codon substitutions better than the two existing indices of amino acid dissimilarities, i.e., Grantham's and Miyata's distances.
Affiliation(s)
- Xuhua Xia
- Bioinformatics Laboratory, HKU-Pasteur Research Center, Dexter H.C. Man Building, 8 Sassoon Road, Pokfulam, Hong Kong.
|
28
|
Duarte J, Perrière G, Laudet V, Robinson-Rechavi M. NUREBASE: database of nuclear hormone receptors. Nucleic Acids Res 2002; 30:364-8. [PMID: 11752338 PMCID: PMC99117 DOI: 10.1093/nar/30.1.364] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022] Open
Abstract
Nuclear hormone receptors are an abundant class of ligand activated transcriptional regulators, found in varying numbers in all animals. Based on our experience of managing the official nomenclature of nuclear receptors, we have developed NUREBASE, a database containing protein and DNA sequences, reviewed protein alignments and phylogenies, taxonomy and annotations for all nuclear receptors. The reviewed NUREBASE is completed by NUREBASE_DAILY, automatically updated every 24 h. Both databases are organized under a client/server architecture, with a client written in Java which runs on any platform. This client, named FamFetch, integrates a graphical interface allowing selection of families, and manipulation of phylogenies and alignments. NUREBASE sequence data is also accessible through a World Wide Web server, allowing complex queries. All information on accessing and installing NUREBASE may be found at http://www.ens-lyon.fr/LBMC/laudet/nurebase.html.
Affiliation(s)
- Jorge Duarte
- Laboratoire de Biologie Moléculaire et Cellulaire, CNRS UMR 5665, Ecole Normale Supérieure de Lyon, 46 allée d'Italie, 69364 Lyon Cedex 07, France
|
29
|
Ponger L, Duret L, Mouchiroud D. Determinants of CpG islands: expression in early embryo and isochore structure. Genome Res 2001; 11:1854-60. [PMID: 11691850 PMCID: PMC311164 DOI: 10.1101/gr.174501] [Citation(s) in RCA: 87] [Impact Index Per Article: 3.6] [Indexed: 01/14/2023]
Abstract
In an attempt to understand the origin of CpG islands (CGIs) in mammalian genomes, we have studied their location and structure according to the expression pattern of genes and to the G + C content of isochores in which they are embedded. We show that CGIs located over the transcription start site (named start CGIs) are very different structurally from the others (named no-start CGIs): (1) 61.6% of the no-start CGIs are due to repeated sequences (79% are due to Alus), whereas only 5.6% of the start CGIs are due to such repeats; (2) start CGIs are longer and display a higher CpGo/e ratio and G + C level than no-start CGIs. The frequency of tissue-specific genes associated to a start CGI varies according to the genomic G + C content, from 25% in G + C-poor isochores to 64% in G + C-rich isochores. Conversely, the frequency of housekeeping genes associated to a start CGI (90%) is independent of the isochore context. Interestingly, the structure of start CGIs is very similar for tissue-specific and housekeeping genes. Moreover, 93% of genes expressed in early embryo are found to exhibit a CpG island over their transcription start point. These observations are consistent with the hypothesis that the occurrence of these CGIs is the consequence of gene expression at this stage, when the methylation pattern is installed.
Affiliation(s)
- L Ponger
- Laboratoire de Biométrie et Biologie Evolutive, Unité Mixte de Recherche Centre National de la Recherche Scientifique 5558-Université Claude Bernard, 69622 Villeurbanne Cedex, France.
|
30
|
Musto H, Cruveiller S, D'Onofrio G, Romero H, Bernardi G. Translational selection on codon usage in Xenopus laevis. Mol Biol Evol 2001; 18:1703-7. [PMID: 11504850 DOI: 10.1093/oxfordjournals.molbev.a003958] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.1] [Indexed: 11/12/2022] Open
Abstract
A correspondence analysis of codon usage in Xenopus laevis revealed that the first axis is strongly correlated with the base composition at third codon positions. The second axis discriminates between putatively highly expressed genes and the other coding sequences, with expression levels being confirmed by the analysis of expressed sequence tag frequencies. The comparison of codon usage of the sequences displaying the extreme values on the second axis indicates that several codons are statistically more frequent among the highly expressed (mainly housekeeping) genes. Translational selection appears, therefore, to influence synonymous codon usage in Xenopus.
Affiliation(s)
- H Musto
- Laboratorio di Evoluzione Molecolare, Stazione Zoologica Anton Dohrn, Naples, Italy
|
31
|
Duret L, Galtier N. The covariation between TpA deficiency, CpG deficiency, and G+C content of human isochores is due to a mathematical artifact. Mol Biol Evol 2000; 17:1620-5. [PMID: 11070050 DOI: 10.1093/oxfordjournals.molbev.a026261] [Citation(s) in RCA: 61] [Impact Index Per Article: 2.4] [Indexed: 11/13/2022] Open
Abstract
CpG and TpA dinucleotides are underrepresented in the human genome. The CpG deficiency is due to the high mutation rate from C to T in methylated CpG's. The TpA suppression was thought to reflect a counterselection against TpA's destabilizing effect in RNA. Unexpectedly, the TpA and CpG deficiencies vary according to the G+C contents of sequences. It has been proposed that the variation in CpG suppression was correlated with a particular chromatin organization in G+C-rich isochores. Here, we present an improved model of dinucleotide evolution accounting for the overlap between successive dinucleotides. We show that an increased mutation rate from CpG to TpG or CpA induces both an apparent TpA deficiency and a correlation between CpG and TpA deficiencies and G+C content. Moreover, this model shows that the ratio of observed over expected CpG frequency underestimates the real CpG deficiency in G+C-rich sequences. The predictions of our model fit well with observed frequencies in human genomic data. This study suggests that previously published selectionist interpretations of patterns of dinucleotide frequencies should be taken with caution. Moreover, we propose new criteria to identify unmethylated CpG islands taking into account this bias in the measure of CpG depletion.
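The "ratio of observed over expected" frequency that this abstract critiques can be sketched as follows. This is the conventional single-sequence ratio, not the improved model the authors present; the function name `dinucleotide_obs_exp` and the choice of expectation (product of the two base frequencies) are assumptions made for illustration.

```python
import math


def dinucleotide_obs_exp(seq, dinuc):
    """Conventional observed/expected ratio for one dinucleotide.

    Assumption: expected frequency = freq(first base) * freq(second base).
    Returns NaN when either base is absent from the sequence.
    """
    seq = seq.upper()
    dinuc = dinuc.upper()
    n = len(seq)
    if n < 2 or len(dinuc) != 2:
        raise ValueError("need a sequence of length >= 2 and a 2-letter dinucleotide")

    # Count overlapping occurrences over the n - 1 dinucleotide positions.
    hits = sum(1 for i in range(n - 1) if seq[i:i + 2] == dinuc)
    observed = hits / (n - 1)
    expected = (seq.count(dinuc[0]) / n) * (seq.count(dinuc[1]) / n)
    return observed / expected if expected > 0 else math.nan
```

According to the abstract, this plain ratio underestimates the real CpG deficiency in G+C-rich sequences, so values computed this way should be compared across regions of different G+C content with caution.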
Affiliation(s)
- L Duret
- Laboratoire de Biométrie, Génétique et Biologie des Populations, Université Claude Bernard, Villeurbanne, France.
|
32
|
Gonçalves I, Duret L, Mouchiroud D. Nature and structure of human genes that generate retropseudogenes. Genome Res 2000; 10:672-8. [PMID: 10810090 PMCID: PMC310883 DOI: 10.1101/gr.10.5.672] [Citation(s) in RCA: 129] [Impact Index Per Article: 5.2] [Indexed: 11/24/2022]
Abstract
The human genome is estimated to contain 23,000 to 33,000 retropseudogenes. To study the properties of genes giving rise to these retroelements, we compared the structure and expression of genes with or without known retropseudogenes. Four main features have emerged from the analysis of 181 genes associated to retropseudogenes: Reverse-transcribed genes are (1) widely expressed, (2) highly conserved, (3) short, and (4) GC-poor. The first two properties probably reflect the fact that genes giving rise to retropseudogenes have to be expressed in the germ-line. The two latter points suggest that reverse-transcription and transposition is more efficient for short GC-poor mRNAs. In addition, this analysis allowed us to reject previous hypotheses that widely expressed genes are GC rich. Rather, globally, genes with a wide tissue distribution are GC poor.
Affiliation(s)
- I Gonçalves
- Laboratoire de Biométrie et Biologie Evolutive Unité Mixte de Recherche-Centre National de la Recherche Scientifique 5558, Université Claude Bernard-Lyon 1 69622 Villeurbanne Cedex, France.
|
33
|
Abstract
We present here HOBACGEN, a database system devoted to comparative genomics in bacteria. HOBACGEN contains all available protein genes from bacteria, archaea, and yeast, taken from SWISS-PROT/TrEMBL and classified into families. It also includes multiple alignments and phylogenetic trees built from these families. The database is organized under a client/server architecture with a client written in Java, which may run on any platform. This client integrates a graphical interface allowing users to select families according to various criteria and notably to select homologs common to a given set of taxa. This interface also allows users to visualize multiple alignments and trees associated to families. In tree displays, protein gene names are colored according to the taxonomy of the corresponding organisms. Users may access all information associated to sequences and multiple alignments by clicking on genes. This graphic tool thus gives a rapid and simple access to all data required to interpret homology relationships between genes and distinguish orthologs from paralogs. Instructions for installation of the client or the server are available at http://pbil.univ-lyon1.fr/databases/hobacgen.html.
Affiliation(s)
- G Perrière
- Laboratoire de Biométrie et Biologie Evolutive, Unité Mixte de Recherche Centre National de la Recherche Scientifique (UMR CNRS) n° 5558, Université Claude Bernard-Lyon 1, 69622 Villeurbanne Cedex, France.
|
34
|
Volpetti V, Gallerani R, De Benedetto C, Liuni S, Licciulli F, Ceci LR. PLMItRNA, a database for tRNAs and tRNA genes in plant mitochondria: enlargement and updating. Nucleic Acids Res 2000; 28:159-62. [PMID: 10592210 PMCID: PMC102413 DOI: 10.1093/nar/28.1.159] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022] Open
Abstract
The current version of PLMItRNA is a database of tRNA molecules and genes identified in the mitochondria of all green plants (Viridiplantae). It enlarges a previous database originally restricted to seed plants [Ceci,L.R., Volpicella,M., Liuni,S., Volpetti,V., Licciulli,F. and Gallerani,R. (1999) Nucleic Acids Res., 27, 156-157]. PLMItRNA reports information and multialignments on 254 genes and 16 tRNA molecules detected in 25 higher plants (one bryophyte and 24 vascular plants) and seven green algae. PLMItRNA is accessible via the WWW at http://bio-WWW.ba.cnr.it:8000/srs6/
Affiliation(s)
- V Volpetti
- Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, Via Amendola 165/A, 70126 Bari, Italy
|
35
|
Pesole G, Gissi C, Catalano D, Grillo G, Licciulli F, Liuni S, Attimonelli M, Saccone C. MitoNuc and MitoAln: two related databases of nuclear genes coding for mitochondrial proteins. Nucleic Acids Res 2000; 28:163-5. [PMID: 10592211 PMCID: PMC102385 DOI: 10.1093/nar/28.1.163] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Received: 08/11/1999] [Revised: 09/03/1999] [Accepted: 09/16/1999] [Indexed: 11/14/2022] Open
Abstract
Mitochondria, besides their central role in energy metabolism, have recently been found to be involved in a number of basic processes of cell life and to contribute to the pathogenesis of many degenerative diseases. All functions of mitochondria depend on the interaction of nuclear and organellar genomes. Mitochondrial genomes have been extensively sequenced and analysed and the data collected in several specialised databases. In order to collect information on nuclear coded mitochondrial proteins we developed MitoNuc and MitoAln, two related databases containing, respectively, detailed information on sequenced nuclear genes coding for mitochondrial proteins in Metazoa and yeast, and the multiple alignments of the relevant homologous protein coding regions. MitoNuc and MitoAln retrieval through SRS at http://bio-www.ba.cnr.it:8000/srs6/ can easily allow the extraction of sequence data, subsequences defined by specific features and nucleotide or amino acid multiple alignments.
Affiliation(s)
- G Pesole
- Dipartimento di Fisiologia e Biochimica Generali, Università di Milano, via Celoria 26, 20122 Milano, Italy.
|
36
|
Lanave C, Liuni S, Licciulli F, Attimonelli M. Update of AMmtDB: a database of multi-aligned metazoa mitochondrial DNA sequences. Nucleic Acids Res 2000; 28:153-4. [PMID: 10592208 PMCID: PMC102422 DOI: 10.1093/nar/28.1.153] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.4] [Received: 10/04/1999] [Accepted: 10/06/1999] [Indexed: 11/13/2022] Open
Abstract
The AMmtDB database (http://bio-www.ba.cnr.it:8000/srs6/) has been updated by collecting the multi-aligned sequences of Chordata mitochondrial genes coding for proteins and tRNAs. The genes coding for proteins are multi-aligned based on the translated sequences and both the nucleotide and amino acid multi-alignments are provided. AMmtDB data selected through SRS can be viewed and managed using GeneDoc or other programs for the management of multi-aligned data, depending on the user's operating system. The multiple alignments have been produced with CLUSTALW and PILEUP programs and then carefully optimized manually.
Affiliation(s)
- C Lanave
- Centro di Studio sui Mitocondri e Metabolismo Energetico CNR, Via Amendola 165/A, 70126 Bari, Italy.
|
37
|
Duret L, Mouchiroud D. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol Biol Evol 2000; 17:68-74. [PMID: 10666707 DOI: 10.1093/oxfordjournals.molbev.a026239] [Citation(s) in RCA: 402] [Impact Index Per Article: 16.1] [Indexed: 11/12/2022] Open
Abstract
To determine whether gene expression patterns affect mutation rates and/or selection intensity in mammalian genes, we studied the relationships between substitution rates and tissue distribution of gene expression. For this purpose, we analyzed 2,400 human/rodent and 834 mouse/rat orthologous genes, and we measured (using expressed sequence tag data) their expression patterns in 19 tissues from three development states. We show that substitution rates at nonsynonymous sites are strongly negatively correlated with tissue distribution breadth: almost threefold lower in ubiquitous than in tissue-specific genes. Nonsynonymous substitution rates also vary considerably according to the tissues: the average rate is twofold lower in brain-, muscle-, retina- and neuron-specific genes than in lymphocyte-, lung-, and liver-specific genes. Interestingly, 5' and 3' untranslated regions (UTRs) show exactly the same trend. These results demonstrate that the expression pattern is an essential factor in determining the selective pressure on functional sites in both coding and noncoding regions. Conversely, silent substitution rates do not vary with expression pattern, even in ubiquitously expressed genes. This latter result thus suggests that synonymous codon usage is not constrained by selection in mammals. Furthermore, this result also indicates that there is no reduction of mutation rates in genes expressed in the germ line, contrary to what had been hypothesized based on the fact that transcribed DNA is more efficiently repaired than nontranscribed DNA.
Affiliation(s)
- L Duret
- Laboratoire de Biométrie, Génétique et Biologie des Populations, Université Claude Bernard, Villeurbanne, France.
|
38
|
Perrière G, Bessières P, Labedan B. EMGLib: the enhanced microbial genomes library (update 2000). Nucleic Acids Res 2000; 28:68-71. [PMID: 10592183 PMCID: PMC102414 DOI: 10.1093/nar/28.1.68] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Received: 09/29/1999] [Accepted: 10/04/1999] [Indexed: 11/13/2022] Open
Abstract
As the number of complete microbial genomes publicly available is still growing, the problem of annotation quality in these very large sequences remains unsolved. Indeed, the number of annotations associated with complete genomes is usually lower than those of the shorter entries encountered in the repository collections. Moreover, classical sequence database management systems have difficulties in handling entries of such size. In this context, the Enhanced Microbial Genomes Library (EMGLib) was developed to try to alleviate these problems. This library contains all the complete genomes from prokaryotes (bacteria and archaea) already sequenced and the yeast genome in GenBank format. The annotations are improved by the introduction of data on codon usage, gene orientation on the chromosome and gene families. It is possible to access EMGLib through two database systems set up on WWW servers: the PBIL server at http://pbil.univ-lyon1.fr/emglib.html and the MICADO server at http://locus.jouy.inra.fr/micado
Affiliation(s)
- G Perrière
- Laboratoire de Biométrie et Biologie Evolutive, Université Claude Bernard, Lyon 1, 43 boulevard du 11 Novembre 1918, 69622 Villeurbanne Cedex, France.
|
39
|
Duret L, Mouchiroud D. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc Natl Acad Sci U S A 1999; 96:4482-7. [PMID: 10200288 PMCID: PMC16358 DOI: 10.1073/pnas.96.8.4482] [Citation(s) in RCA: 607] [Impact Index Per Article: 23.3] [Indexed: 11/18/2022] Open
Abstract
We measured the expression pattern and analyzed codon usage in 8,133, 1,550, and 2,917 genes, respectively, from Caenorhabditis elegans, Drosophila melanogaster, and Arabidopsis thaliana. In those three species, we observed a clear correlation between codon usage and gene expression levels and showed that this correlation is not due to a mutational bias. This provides direct evidence for selection on silent sites in those three distantly related multicellular eukaryotes. Surprisingly, there is a strong negative correlation between codon usage and protein length. This effect is not due to a smaller size of highly expressed proteins. Thus, for the same expression pattern, the selective pressure on codon usage appears to be lower in genes encoding long rather than short proteins. This puzzling observation is not predicted by any of the current models of selection on codon usage and thus raises the question of how translation efficiency affects fitness in multicellular organisms.
Affiliation(s)
- L Duret
- Laboratoire de Biométrie, Génétique et Biologie des Populations, Unité Mixte de Recherche Centre National de la Recherche Scientifique 5558, Villeurbanne Cedex, France.
|
40
|
Ceci LR, Volpicella M, Liuni S, Volpetti V, Licciulli F, Gallerani R. PLMItRNA, a database for higher plant mitochondrial tRNAs and tRNA genes. Nucleic Acids Res 1999; 27:156-7. [PMID: 9847164 PMCID: PMC148119 DOI: 10.1093/nar/27.1.156] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022] Open
Abstract
The PLMItRNA database contains information and multialignments of tRNA genes and molecules detected in higher plant mitochondria. It has been developed from a previous compilation of higher plant mitochondrial tRNA genes [Sagliano,A., Volpicella,M., Gallerani,R. and Ceci,L.R. (1998) Nucleic Acids Res., 26, 154-155] and implemented with data and sequences of tRNA molecules retrieved from the literature. The current version of the database reports information on 171 genes and 16 tRNA molecules from 24 plants. PLMItRNA is accessible via WWW at http://bio-www.ba.cnr.it:8000/srs/
Affiliation(s)
- L R Ceci
- Centro di Studio sui Mitocondri e Metabolismo Energetico, CNR, Via Amendola 165/A, 70126 Bari, Italy.
|
41
|
Lanave C, Attimonelli M, De Robertis M, Licciulli F, Liuni S, Sbisá E, Saccone C. Update of AMmtDB: a database of multi-aligned metazoa mitochondrial DNA sequences. Nucleic Acids Res 1999; 27:134-7. [PMID: 9847158 PMCID: PMC148113 DOI: 10.1093/nar/27.1.134] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022] Open
Abstract
The present paper describes AMmtDB, a database collecting the multi-aligned sequences of vertebrate mitochondrial genes coding for proteins and tRNAs, as well as the multiple alignment of the mammalian mtDNA main regulatory region (D-loop) sequences. The genes coding for proteins are multi-aligned based on the translated sequences and both the nucleotide and amino acid multi-alignments are provided. For the genes coding for tRNAs, the multi-alignments based on the primary and the secondary structures are both provided; for the mammalian D-loop multi-alignments we report the conserved regions of the entire D-loop (CSB1, CSB2, CSB3, the central region, ETAS1 and ETAS2) as defined by Sbisà et al. [Gene (1997), 205, 125-140]. A flatfile format for AMmtDB has been designed allowing its implementation in SRS (http://bio-www.ba.cnr.it:8000/BioWWW/#AMMTDB). Data selected through SRS can be managed using GeneDoc or other programs for the management of multi-aligned data, depending on the user's operating system. The multiple alignments have been produced with CLUSTALV and PILEUP programs and then carefully optimized manually.
Affiliation(s)
- C Lanave
- Centro di Studio sui Mitocondri e Metabolismo Energetico, C.N.R., Via Amendola 165/A, 70126 Bari, Italy
|
42
|
Abstract
Since the complete sequence of Haemophilus influenzae Rd was obtained in 1995, the number of entirely sequenced bacterial genomes has steadily increased. A problem is that the quality of the annotations of these very large sequences is usually lower than that of the shorter entries encountered in the repository collections. Moreover, classical sequence database management systems have difficulties in handling entries of that size. In this context, we have decided to build the Enhanced Microbial Genomes Library (EMGLib), in which these two problems are alleviated. This library contains all the complete genomes from bacteria already sequenced and the yeast genome in GenBank format. The annotations are improved by the introduction of data on codon usage, gene orientation on the chromosome and gene families. It is possible to access EMGLib through two database systems set up on World Wide Web servers: the PBIL server at http://pbil.univ-lyon1.fr/emglib/emglib.html and the MICADO server at http://locus.jouy.inra.fr/micado
Affiliation(s)
- G Perrière
- Laboratoire de Biométrie, Génétique et Biologie des Populations, Université Claude Bernard - Lyon 1, 43, boulevard du 11 Novembre 1918, 69622 Villeurbanne Cedex, France.
|
43
|
Barakat A, Matassi G, Bernardi G. Distribution of genes in the genome of Arabidopsis thaliana and its implications for the genome organization of plants. Proc Natl Acad Sci U S A 1998; 95:10044-9. [PMID: 9707597 PMCID: PMC21458 DOI: 10.1073/pnas.95.17.10044] [Citation(s) in RCA: 61] [Impact Index Per Article: 2.3] [Received: 04/08/1998] [Accepted: 06/22/1998] [Indexed: 11/18/2022] Open
Abstract
Previous work has shown that, in the large genomes of three Gramineae [rice, maize, and barley: 415, 2,500, and 5,300 megabases (Mb), respectively] most genes are clustered in long DNA segments (collectively called the "gene space") that represent a small fraction (12-24%) of nuclear DNA, cover a very narrow (0.8-1.6%) GC range, and are separated by vast expanses of gene-empty sequences. In the present work, we have analyzed the small (ca. 120 Mb) nuclear genome of Arabidopsis thaliana and shown that its organization is drastically different from that of the genomes of Gramineae. Indeed, (i) genes are distributed over about 85% of the main band of DNA in CsCl and cover an 8% GC range; (ii) ORFs are fairly evenly distributed in long (>50 kb) sequences from GenBank that amount to about 10 Mb; and (iii) the GC levels of protein-coding sequences (and of their third codon positions) are correlated with the GC levels of their flanking sequences. The different pattern of gene distribution of Arabidopsis compared with Gramineae appears to be because the genomes of the latter comprise (i) many large gene-empty regions separating gene clusters and (ii) abundant transposons in the intergenic sequences of gene clusters. Both sequences are absent or very scarce in the Arabidopsis genome. These observations provide a comparative view of angiosperm genome organization.
Affiliation(s)
- A Barakat
- Laboratoire de Génétique Moléculaire, Institut Jacques Monod, 2, Place Jussieu, 75005 Paris, France
|
44
|
Völker U, Andersen KK, Antelmann H, Devine KM, Hecker M. One of two osmC homologs in Bacillus subtilis is part of the sigmaB-dependent general stress regulon. J Bacteriol 1998; 180:4212-8. [PMID: 9696771 PMCID: PMC107419 DOI: 10.1128/jb.180.16.4212-4218.1998] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.4] [Indexed: 11/20/2022] Open
Abstract
In this report we present the identification and analysis of two Bacillus subtilis genes, yklA and ykzA, which are homologous to the partially RpoS-controlled osmC gene from Escherichia coli. The yklA gene is expressed at higher levels in minimal medium than in rich medium and is driven by a putative vegetative promoter. Expression of ykzA is not medium dependent but increases dramatically when cells are exposed to stress and starvation. This stress-induced increase in ykzA expression is absolutely dependent on the alternative sigma factor sigmaB, which controls a large stationary-phase and stress regulon. ykzA is therefore another example of a gene common to the RpoS and sigmaB stress regulons of E. coli and B. subtilis, respectively. The composite complex expression pattern of the two B. subtilis genes is very similar to the expression profile of osmC in E. coli.
Affiliation(s)
- U Völker
- Institut für Mikrobiologie und Molekularbiologie, Ernst-Moritz-Arndt-Universität Greifswald, 17487 Greifswald, Germany.
|
45
|
Attimonelli M, Calò D, De Montalvo A, Lanave C, Sasanelli D, Tommaseo Ponzetta M, Saccone C. Update of MmtDB: a Metazoa mitochondrial DNA variants database. Nucleic Acids Res 1998; 26:120-5. [PMID: 9399815 PMCID: PMC147228 DOI: 10.1093/nar/26.1.120] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.1] [Indexed: 02/05/2023] Open
Abstract
The present paper describes the improvements in MmtDB, a specialised database designed to collect Metazoa mitochondrial DNA variants. Priority in the data collection has been given to Metazoa for which a large number of variants is available, e.g., humans. Starting from the sequences available in the Nucleotide Sequence Databases, the redundant sequences have been removed and new sequences from other sources have been added. Value-added information is associated with each variant sequence, e.g., analysed region, experimental method, tissue and cell lines, population data, sex, age, family code and information about the variation events (nucleotide position, involved gene, restriction site gain or loss). Cross-references are introduced to the EMBL Data Library, as well as an internal cross-referencing among MmtDB entries according to tissue, heteroplasmy, family and haplotype correlation. Furthermore, MmtDB has a new section, AMmtDB: Aligned Metazoan mitochondrial biosequences. MmtDB can be accessed through the World Wide Web at URL http://WWW.ba.cnr.it/[symbol: see text]areamt08/MmtDBWWW.htm
Affiliation(s)
- M Attimonelli
- Dipartimento di Biochimica e Biologia Molecolare and Dipartimento di Zoologia e Anatomia Comparata, Università di Bari, 70126 Bari, Italy.
|
46
|
Perrière G, Gouy M, Gojobori T. The non-redundant Bacillus subtilis (NRSub) database: update 1998. Nucleic Acids Res 1998; 26:60-2. [PMID: 9399801 PMCID: PMC147236 DOI: 10.1093/nar/26.1.60] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 02/05/2023] Open
Abstract
The non-redundant Bacillus subtilis database (NRSub) has been developed in the context of the sequencing project devoted to this bacterium. As this project has reached completion, the whole genome is now available as a single contig. Thanks to the ACNUC database management system and its associated retrieval system Query_win, each functional region of the genome can be accessed individually. Extra annotations have been added such as accession numbers for the genes, locations on the genetic map, codon adaptation index values, as well as cross-references with other collections. NRSub is distributed through anonymous FTP as a text file in EMBL format and as an ACNUC database. It is also possible to access NRSub through two dedicated World Wide Web servers located in France (http://acnuc.univ-lyon1.fr/nrsub/nrsub.html) and in Japan (http://ddbjs4h.genes.nig.ac.jp/).
Affiliation(s)
- G Perrière
- Laboratoire de Biométrie, Génétique et Biologie des Populations, Université Claude Bernard, Lyon 1, 43 boulevard du 11 Novembre 1918, 69622 Villeurbanne Cedex, France.
|
47
|
De Giorgi C, Martiradonna A, Pesole G, Saccone C. Lineage-specific evolution of echinoderm mitochondrial ATP synthase subunit 8. J Bioenerg Biomembr 1997; 29:233-9. [PMID: 9298708 DOI: 10.1023/a:1022406026196] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 02/05/2023]
Abstract
Peculiar evolutionary properties of the subunit 8 of mitochondrial ATP synthase (ATPase8) are revealed by comparative analyses carried out between both closely and distantly related species of echinoderms. The analysis of nucleotide substitution in the three echinoids demonstrated a relaxation of amino acid functional constraints. The deduced protein sequences display a well conserved domain at the N-terminus, while the central part is very variable. At the C-terminus, the broad distribution of positively charged amino acids, which is typical of other organisms, is not conserved in the two different echinoderm classes of the sea urchins and of the sea stars. Instead, a motif of three amino acids, so far not described elsewhere, is conserved in sea urchins and is found to be very similar to the motif present in the sea stars. Our results indicate that the N-terminal region seems to follow the same evolutionary pattern in different organisms, while the maintenance of the C-terminal part in a phylum-specific manner may reflect the co-evolution of mitochondrial and nuclear genes.
Affiliation(s)
- C De Giorgi
- Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, Italy
|
48
|
Abstract
In the context of the international project aiming at sequencing the whole genome of Bacillus subtilis we have developed NRSub, a non-redundant database of sequences from this organism. Starting from the B. subtilis sequences available in the repository collections we have removed all encountered duplications, then we have added extra annotations to the sequences (e.g. accession numbers for the genes, locations on the genetic map, codon usage index). We have also added cross-references with EMBL/GenBank/DDBJ, MEDLINE, SWISS-PROT and ENZYME databases. NRSub is distributed through anonymous FTP as a text file in EMBL format and as an ACNUC database. It is also possible to access the database through two dedicated World Wide Web servers located in France (http://acnuc.univ-lyon1.fr/nrsub/nrsub.html) and in Japan (http://ddbjs4h.genes.nig.ac.jp/).
Affiliation(s)
- G Perrière
- Laboratoire de Biométrie, Génétique et Biologie des Populations, Université Claude Bernard, Lyon 1, 43, bd. du 11 Novembre 1918, 69622 Villeurbanne Cedex, France.
|
49
|
Bernardi G, Hughes S, Mouchiroud D. The major compositional transitions in the vertebrate genome. J Mol Evol 1997; 44 Suppl 1:S44-51. [PMID: 9071011 DOI: 10.1007/pl00000051] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.1] [Indexed: 02/04/2023]
Abstract
The vertebrate genome underwent two major compositional transitions, between therapsids and mammals and between dinosaurs and birds. These transitions concerned a sizable part (roughly one-third) of the genome, the gene-richest part of it, and consisted of an increase in GC levels (GC is the molar fraction of guanine + cytosine in DNA) which affected both coding sequences (especially third codon positions) and noncoding sequences. These major transitions were studied here by comparing GC3 levels (GC3 is the GC of third codon positions) of orthologous genes from Xenopus, chicken, calf, and man.
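The two quantities defined in this abstract, GC and GC3, can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `gc_fraction` and `gc3_fraction` and the toy coding sequence are hypothetical. The sketch assumes an in-frame coding sequence and ignores any character other than A, C, G, or T.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G + C among the unambiguous bases (A, C, G, T) of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "GC" for b in bases) / len(bases)


def gc3_fraction(cds: str) -> float:
    """GC of third codon positions; assumes cds starts in frame."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # drop a trailing partial codon, if any
    third_positions = cds[2:usable:3]  # every third base, starting at index 2
    return gc_fraction(third_positions)


# Toy example (hypothetical sequence): codons ATG GCC AAA TAA
toy_cds = "ATGGCCAAATAA"
overall_gc = gc_fraction(toy_cds)      # 4 of 12 bases are G or C
third_gc = gc3_fraction(toy_cds)       # third positions are G, C, A, A
```

Comparing `third_gc` for orthologous genes across species is the kind of comparison the abstract describes, although the real analysis would work on curated alignments rather than raw strings.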
Affiliation(s)
- G Bernardi
- Laboratoire de Genetique Moleculaire, Institut Jacques Monod, Paris, France
|
50
|
Calò D, De Pascali A, Sasanelli D, Tanzariello F, Tommaseo Ponzetta M, Saccone C, Attimonelli M. MmtDB: a Metazoa mitochondrial DNA variants database. Nucleic Acids Res 1997; 25:200-5. [PMID: 9016536 PMCID: PMC146362 DOI: 10.1093/nar/25.1.200] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 02/03/2023] Open
Abstract
The present paper describes the structure of MmtDB, a specialized database designed to collect Metazoa mitochondrial DNA variants. Priority in the data collection is given to the Metazoa species for which a large number of variants is available, as is the case for human variants. Starting from the sequences available in the Nucleotide Sequence Databases, the redundant sequences are removed and new sequences from other sources are added. Value-added information is associated with each variant sequence, e.g. analysed region, experimental method, tissue and cell lines, population data, sex, age, family code and information about the variation events (nucleotide position, involved gene, restriction site gain or loss). Cross-references are introduced to the EMBL Data Library, as well as internal cross-references among MmtDB entries according to their tissue, heteroplasmic, familial and haplotypic correlation. MmtDB can be accessed through the World Wide Web at URL [see text].
Affiliation(s)
- D Calò
- Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, 70125 Bari, Italy.
|