1
Bui LC, Evsikov AV, Khan DR, Archilla C, Peynot N, Hénaut A, Le Bourhis D, Vignon X, Renard JP, Duranthon V. Retrotransposon expression as a defining event of genome reprogramming in fertilized and cloned bovine embryos. Reproduction 2009; 138:289-99. [PMID: 19465487 DOI: 10.1530/rep-09-0042]
Abstract
Genome reprogramming is the ability of a nucleus to modify its epigenetic characteristics and gene expression pattern when placed in a new environment. Low efficiency of mammalian cloning is attributed to the incomplete and aberrant nature of genome reprogramming after somatic cell nuclear transfer (SCNT) in oocytes. To date, the aspects of genome reprogramming critical for full-term development after SCNT remain poorly understood. To identify the key elements of this process, changes in gene expression during maternal-to-embryonic transition in normal bovine embryos and changes in gene expression between donor cells and SCNT embryos were compared using a new cDNA array dedicated to embryonic genome transcriptional activation in the bovine. Three groups of transcripts were mostly affected during somatic reprogramming: endogenous long terminal repeat (LTR) retrotransposons and mitochondrial transcripts were up-regulated, while genes encoding ribosomal proteins were down-regulated. These unexpected data demonstrate specific categories of transcripts most sensitive to somatic reprogramming and likely affecting viability of SCNT embryos. Importantly, massive transcriptional activation of LTR retrotransposons resulted in similar levels of their transcripts in SCNT and fertilized embryos. Taken together, these results open a new avenue in the quest to understand nuclear reprogramming driven by oocyte cytoplasm.
Affiliation(s)
- L C Bui
- INRA, UMR 1198 Biologie du Développement et Reproduction, Jouy en Josas, France
2
3
Louis A, Chiapello H, Fabry C, Ollivier E, Hénaut A. Deciphering Arabidopsis thaliana gene neighborhoods through bibliographic co-citations. Comput Chem 2002; 26:511-9. [PMID: 12144179 DOI: 10.1016/s0097-8485(02)00011-6]
Abstract
In the framework of genome annotation, scientific literature is obviously the major source of biological knowledge. The aim of the work described in this paper is to exploit this source of data for the model plant Arabidopsis thaliana. The first step consisted in constituting a relevant bibliographic reference dataset for plant genomic research. Gene co-citations were then systematically annotated in this reference dataset, starting from the simple idea that if genes are cited in the same publication, they probably share some related functional properties. In order to deal with the synonymous gene name problem, a gene name reference list was constituted starting from A. thaliana SwissProt entries. This list was used to build clusters of co-cited genes by a single linkage procedure such that any gene in a given cluster possesses at least one co-cited partner in the same cluster. Analysis of the clusters demonstrates the biological consistency of this approach, with only very few fortuitous links. As an example, a cluster including genes related to flowering time is described in more detail in the paper. Finally, a graphical representation of each cluster was produced, which provides a convenient way to retrieve the genes (the nodes of the graphs) and the references in which they were co-cited (the edges of the graphs). All the results can be accessed at the URL http://chlora.Igi.infobiogen.fr:1234/bib_arath/.
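The single linkage procedure described in this abstract amounts to finding connected components in a graph whose nodes are genes and whose edges are shared publications. The sketch below is a minimal illustration under that reading, not the authors' implementation; the function name `cocitation_clusters`, the input layout (a mapping from reference identifier to a set of gene names), and the example gene names are all assumptions made here.

```python
from collections import defaultdict


def cocitation_clusters(papers):
    """Group genes so that every gene in a cluster shares at least one
    publication with another gene of the same cluster (single linkage).

    `papers` is a hypothetical layout: {reference_id: set_of_gene_names}.
    """
    parent = {}

    def find(gene):
        # Union-find lookup with path halving.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for genes in papers.values():
        genes = sorted(genes)
        for gene in genes:
            find(gene)  # register every gene, even if cited alone
        for other in genes[1:]:
            union(genes[0], other)  # one co-citation links the whole set

    clusters = defaultdict(set)
    for gene in parent:
        clusters[find(gene)].add(gene)
    # A gene never co-cited with another gene forms no cluster.
    return sorted(sorted(members) for members in clusters.values() if len(members) > 1)


# Illustrative input only; the references and gene names are made up for the example.
example = {
    "ref1": {"FT", "CO"},
    "ref2": {"CO", "FLC"},
    "ref3": {"TUB1", "TUB2"},
    "ref4": {"PHYA"},
}
print(cocitation_clusters(example))
# [['CO', 'FLC', 'FT'], ['TUB1', 'TUB2']]
```

Note that FT and FLC end up in the same cluster although they are never cited together: single linkage only requires a chain of co-citations, which is also why a few fortuitous links can merge otherwise unrelated groups.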
Affiliation(s)
- A Louis
- Laboratoire Génome et Informatique, Tour Evry 2, France.
4
Abstract
SUMMARY: GeneANOVA is ANOVA-based software devoted to the analysis of gene expression data. AVAILABILITY: GeneANOVA is freely available on request for non-commercial use.
Affiliation(s)
- G Didier
- Laboratoire Génome et Informatique, Tour Evry 2, 523 place des terrasses de l'Agora, 91034 Evry, France.
5
Devauchelle C, Grossmann A, Hénaut A, Holschneider M, Monnerot M, Risler JL, Torrésani B. Rate matrices for analyzing large families of protein sequences. J Comput Biol 2002; 8:381-99. [PMID: 11571074 DOI: 10.1089/106652701752236205]
Abstract
We propose and study a new approach for the analysis of families of protein sequences. This method is related to the LogDet distances used in phylogenetic reconstructions; it can be viewed as an attempt to embed these distances into a multidimensional framework. The proposed method starts by associating a Markov matrix to each pairwise alignment deduced from a given multiple alignment. The central objects under consideration here are matrix-valued logarithms L of these Markov matrices, which exist under conditions that are compatible with fairly large divergence between the sequences. These logarithms allow us to compare data from a family of aligned proteins with simple models (in particular, continuous reversible Markov models) and to test the adequacy of such models. If one neglects fluctuations arising from the finite length of sequences, any continuous reversible Markov model with a single rate matrix Q over an arbitrary tree predicts that all the observed matrices L are multiples of Q. Our method exploits this fact, without relying on any tree estimation. We test this prediction on a family of proteins encoded by the mitochondrial genome of 26 multicellular animals, which include vertebrates, arthropods, echinoderms, molluscs, and nematodes. A principal component analysis of the observed matrices L shows that a single rate model can be used as a rough approximation to the data, but that systematic deviations from any such model are unmistakable and related to the evolutionary history of the species under consideration.
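The prediction stated in this abstract can be written out in two lines. This is a sketch in assumed notation: $P_{ab}$ for the Markov matrix attached to the aligned pair of sequences $(a, b)$ and $t_{ab}$ for the amount of divergence separating them are labels introduced here and are not taken from the paper; $Q$ and $L$ are the symbols the abstract itself uses.

```latex
\begin{align}
P_{ab} &= \exp\!\left(t_{ab}\, Q\right), \\
L_{ab} &= \log P_{ab} = t_{ab}\, Q .
\end{align}
```

Under a single reversible rate matrix $Q$, and ignoring finite-length fluctuations, every observed $L_{ab}$ is therefore a scalar multiple of the same matrix, whatever the tree. The observed matrices should then lie along one direction in matrix space, which is the kind of structure a principal component analysis can check.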
Affiliation(s)
- C Devauchelle
- Laboratoire Génome et Informatique, Tour Evry2, Evry, France.
6
Laprevotte I, Pupin M, Coward E, Didier G, Terzian C, Devauchelle C, Hénaut A. HIV-1 and HIV-2 LTR nucleotide sequences: assessment of the alignment by N-block presentation, "retroviral signatures" of overrepeated oligonucleotides, and a probable important role of scrambled stepwise duplications/deletions in molecular evolution. Mol Biol Evol 2001; 18:1231-45. [PMID: 11420363 DOI: 10.1093/oxfordjournals.molbev.a003909]
Abstract
Previous analyses of retroviral nucleotide sequences suggest a so-called "scrambled duplicative stepwise molecular evolution" (many sectors with successive duplications/deletions of short and longer motifs) that could have stemmed from one or several starter tandemly repeated short sequence(s). In the present report, we tested this hypothesis by focusing on the long terminal repeats (LTRs) (and flanking sequences) of 24 human and 3 simian immunodeficiency viruses. By using a calculation strategy applicable to short sequences, we found consensus overrepresented motifs (often containing CTG or CAG) that were congruent with the previously defined "retroviral signature." We also show many local repetition patterns that are significant when compared with simply shuffled sequences. First- and second-order Markov chain analyses demonstrate that a major portion of the overrepresented oligonucleotides can be predicted from the dinucleotide compositions of the sequences, but by no means can biological mechanisms be deduced from these results: some of the listed local repetitions remain significant against dinucleotide-conserving shuffled sequences; together with previous results, this suggests that interspersed and/or local mononucleotide and oligonucleotide repetitions could have biased the dinucleotide compositions of the sequences. We searched for suggestive evolutionary patterns by scrutinizing a reliable multiple alignment of the 27 sequences. A manually constructed alignment based on homology blocks was in good agreement with the polypeptide alignment in the coding sectors and has been exhaustively assessed by using a multiplied alphabet obtained by the promising mathematical strategy called the N-block presentation (taking into account the environment of each nucleotide in a sequence). Sector by sector, we hypothesize many successive duplication/deletion scenarios that fit our previous evolutionary hypotheses. This suggests an important duplication/deletion role for the reverse transcriptase, particularly in inducing stuttering cryptic simplicity patterns.
Affiliation(s)
- I Laprevotte
- Laboratoire Génome et Informatique, Université de Versailles Saint Quentin-en-Yvelines, Versailles, France.
7
Comet JP, Aude JC, Glémet E, Risler JL, Hénaut A, Slonimski PP, Codani JJ. Significance of Z-value statistics of Smith-Waterman scores for protein alignments. Comput Chem 1999; 23:317-31. [PMID: 10627144 DOI: 10.1016/s0097-8485(99)00008-x]
Abstract
The Z-value is an attempt to estimate the statistical significance of a Smith-Waterman dynamic alignment score (SW-score) through the use of a Monte-Carlo process. It partly reduces the bias induced by the composition and length of the sequences. This paper is not a theoretical study on the distribution of SW-scores and Z-values. Rather, it presents a statistical analysis of Z-values on large datasets of protein sequences, leading to a law of probability that the experimental Z-values follow. First, we determine the relationships between the computed Z-value, an estimation of its variance and the number of randomizations in the Monte-Carlo process. Then, we illustrate that Z-values are less correlated to sequence lengths than SW-scores. Then we show that pairwise alignments, performed on 'quasi-real' sequences (i.e., randomly shuffled sequences of the same length and amino acid composition as the real ones) lead to Z-value distributions that statistically fit the extreme value distribution, more precisely the Gumbel distribution (global EVD, Extreme Value Distribution). However, for real protein sequences, we observe an over-representation of high Z-values. We determine first a cutoff value which separates these overestimated Z-values from those which follow the global EVD. We then show that the interesting part of the tail of distribution of Z-values can be approximated by another EVD (i.e., an EVD which differs from the global EVD) or by a Pareto law. This has been confirmed for all proteins analysed so far, whether extracted from individual genomes, or from the ensemble of five complete microbial genomes comprising altogether 16956 protein sequences.
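The Monte-Carlo Z-value described in this abstract can be illustrated with a small sketch: score the real pair, score the first sequence against shuffled copies of the second, and express the real score in standard deviations above the shuffled mean. Everything specific below is an assumption made for illustration: the function names, the toy scoring scheme (match +2, mismatch -1, linear gap -2) in place of a protein substitution matrix, the choice to shuffle only one sequence, and the default of 100 randomizations.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman) with a toy scoring scheme."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_value(a, b, n_shuffles=100, seed=0):
    """Z = (observed score - mean of shuffled scores) / std of shuffled scores."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = sw_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same length and composition as b
        shuffled_scores.append(sw_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance; Z-value undefined")
    return (observed - mean) / sd
```

Because shuffling preserves length and composition, the shuffled scores act as a per-pair baseline, which is how the Z-value reduces the composition and length bias mentioned in the abstract. The estimate itself is noisy, and its variance depends on the number of randomizations.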
Affiliation(s)
- J P Comet
- INRIA Rocquencourt, Le-Chesnay Cedex, France.
8
Nitschké P, Guerdoux-Jamet P, Chiapello H, Faroux G, Hénaut C, Hénaut A, Danchin A. Indigo: a World-Wide-Web review of genomes and gene functions. FEMS Microbiol Rev 1998; 22:207-27. [PMID: 9862121 DOI: 10.1111/j.1574-6976.1998.tb00368.x]
Abstract
The present article describes a genome database reviewing gene-related knowledge of two model bacteria, Bacillus subtilis and Escherichia coli. The database, Indigo, is open through the World-Wide Web (http://indigo.genetique.uvsq.fr). The concept used for organising the data, the concept of neighbourhood, allows one to explore the database content in an efficient although somewhat unusual way. Here, genes are related to each other by a variety of neighbourhoods, including proximity in the chromosome, phylogenetic kinship, participation in a common metabolic pathway, common presence in an article of the literature, or similar use of the genetic code. Several examples illustrate how this concept of neighbourhood permits one to review the available knowledge about a given gene or gene family, and elaborate unexpected, but revealing, analyses about gene functions.
Affiliation(s)
- P Nitschké
- Université de Versailles Saint Quentin, Laboratoire Génome et Informatique, France.
9
Hénaut A, Lisacek F, Nitschké P, Moszer I, Danchin A. Global analysis of genomic texts: the distribution of AGCT tetranucleotides in the Escherichia coli and Bacillus subtilis genomes predicts translational frameshifting and ribosomal hopping in several genes. Electrophoresis 1998; 19:515-27. [PMID: 9588797 DOI: 10.1002/elps.1150190411]
Abstract
Present availability of the genomic text of bacteria allows assignment of known biological functions to many genes (typically, half of the genome's gene content). It is now time to try and predict new unexpected functions, using inductive procedures that allow correlating the content of the genomic text to possible biological functions. We show here that analysis of the genomes of Escherichia coli and Bacillus subtilis for the distribution of AGCT motifs predicts that genes exist for which the mRNA molecule can be translated as several different proteins synthesized after ribosomal frameshifting or hopping. Among these genes we found that several coded for the same function in E. coli and B. subtilis. We analyzed in depth the situation of the infB gene (experimentally known to specify synthesis of several proteins differing in their translation starts), the aceF/pdhC gene, the eno gene, and the rplI gene. In addition, genes specific to E. coli were also studied: ompA, ompF and tolA (predicting epigenetic variation that could help escape infection by phages or colicins).
Affiliation(s)
- A Hénaut
- Université de Versailles Saint Quentin, France
10
Abstract
In this paper, the relationship between codon usage and the physiological pattern of expression of a gene is investigated using a dataset of 815 nuclear genes of Arabidopsis thaliana. Factorial Correspondence Analysis, a commonly used multivariate statistical approach in codon usage analysis, was used in order to analyse codon usage bias gene by gene. The analysis reveals a single major trend in codon usage among genes in Arabidopsis. At one end of the trend lie genes with a highly G/C-biased codon usage. This group contains mainly photosynthetic and housekeeping genes which are known to encode the most abundant proteins of the plant cell. At the other extreme lie genes with a weaker A/T-biased codon usage. This group contains genes with various functions, which most of the time exhibit a strong tissue-specific pattern of expression in relation, for example, to stress conditions. These observations were confirmed by the detailed analysis of codon usage in the multigene family of tubulins and appear to be general in plant species, even as distant from Arabidopsis thaliana as a monocotyledonous plant such as maize.
Affiliation(s)
- H Chiapello
- Laboratoire de Biologie Cellulaire, INRA, Cedex, France
11
Terzian C, Laprevotte I, Brouillet S, Hénaut A. Genomic signatures: tracing the origin of retroelements at the nucleotide level. Genetica 1998; 100:271-9. [PMID: 9440280]
Abstract
We investigate the nucleotide sequences of 23 retroelements (4 mammalian retroviruses, 1 human, 3 yeast, 2 plant, and 13 invertebrate retrotransposons) in terms of their oligonucleotide composition in order to address the problem of the relationship between retrotransposons and retroviruses, and the coadaptation of these retroelements to their host genomes. We have identified by computer analysis over-represented 3- through 6-mers in each sequence. Our results indicate that retrotransposons are heterogeneous in contrast to retroviruses, suggesting different modes of evolution by slippage-like mechanisms. Moreover, we have calculated the Observed/Expected number ratio for each of the 256 tetramers and analysed the data using a multivariate approach. The tetramer composition of retroelement sequences appears to be influenced by host genomic factors like methylase activity.
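The Observed/Expected ratio for the 256 tetramers can be sketched as follows. The abstract does not say which null model defines the expected count, so this example assumes the simplest one: independent bases drawn with the mononucleotide frequencies of the sequence itself. The function name `tetramer_obs_exp` and that choice of model are assumptions made here, not the authors' method.

```python
from collections import Counter
from itertools import product


def tetramer_obs_exp(seq):
    """Return {tetramer: observed / expected} for all 256 tetramers over ACGT.

    Expected counts assume independent bases with the sequence's own
    mononucleotide frequencies (an assumption of this sketch).
    """
    seq = seq.upper()
    n_windows = len(seq) - 3
    if n_windows < 1:
        raise ValueError("sequence must contain at least 4 bases")
    base_count = Counter(seq)
    total = len(seq)
    # Overlapping windows of length 4.
    observed = Counter(seq[i:i + 4] for i in range(n_windows))
    ratios = {}
    for letters in product("ACGT", repeat=4):
        word = "".join(letters)
        probability = 1.0
        for base in word:
            probability *= base_count[base] / total
        expected = probability * n_windows
        # A tetramer containing a base absent from the sequence has no defined ratio.
        ratios[word] = observed[word] / expected if expected > 0 else float("nan")
    return ratios


ratios = tetramer_obs_exp("ACGT" * 50)
print(len(ratios))          # 256
print(ratios["AAAA"])       # 0.0
print(ratios["ACGT"] > 1)   # True
```

Each sequence then becomes a vector of 256 ratios, a table that a multivariate method can take as input. A higher-order expectation, for example one based on dinucleotide frequencies, would give different ratios.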
Affiliation(s)
- C Terzian
- Centre de Génétique Moléculaire, CNRS, Gif-sur-Yvette, France
12
Affiliation(s)
- A Danchin
- Régulation de l'Expression Génétique, Institut Pasteur, Paris, France.
13
Guerdoux-Jamet P, Hénaut A, Nitschké P, Risler JL, Danchin A. Using codon usage to predict genes origin: is the Escherichia coli outer membrane a patchwork of products from different genomes? DNA Res 1997; 4:257-65. [PMID: 9405933 DOI: 10.1093/dnares/4.4.257]
Abstract
Analysis of the codon usage of genes coding for the structural components of the outer membrane in Escherichia coli is consistent with the requirement for high expression of these genes. Because porins (which constitute the major protein component of the outer membrane) and LPS (which constitutes the major outermost constituent of the outer membrane) are synthesized from genes displaying widely different codon usage, it is possible to investigate the origin of the outer membrane. The analysis predicts that the outer membrane might originate from a genome other than the genome coding for the major part of the cell. Such a special origin would explain, in structural terms, the likely lethality of porins if they were inadvertently inserted within the inner membrane, giving rise to the Gram-negative bacterial type, having an envelope comprising two membranes, instead of a single cytoplasmic membrane and a murein sacculus.
14
Rivals E, Delgrange O, Delahaye JP, Dauchet M, Delorme MO, Hénaut A, Ollivier E. Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences. Comput Appl Biosci 1997; 13:131-6. [PMID: 9146959 DOI: 10.1093/bioinformatics/13.2.131]
Abstract
MOTIVATION: Compression algorithms can be used to analyse genetic sequences. A compression algorithm tests a given property on the sequence and uses it to encode the sequence: if the property is true, it reveals some structure of the sequence which can be described briefly; this yields a description of the sequence which is shorter than the sequence of nucleotides given in extenso. The more a sequence is compressed by the algorithm, the more significant is the property for that sequence. RESULTS: We present a compression algorithm that tests the presence of a particular type of dosDNA (defined ordered sequence-DNA): approximate tandem repeats of small motifs (i.e. of lengths < 4). This algorithm has been tested on four yeast chromosomes. The presence of approximate tandem repeats seems to be a uniform structural property of yeast chromosomes.
Affiliation(s)
- E Rivals
- Laboratoire d'Informatique Fondamentale de Lille, CNRS URA 369, Villeneuve d'Ascq, France.
15
Laprevotte I, Brouillet S, Terzian C, Hénaut A. Retroviral oligonucleotide distributions correlate with biased nucleotide compositions of retrovirus sequences, suggesting a duplicative stepwise molecular evolution. J Mol Evol 1997; 44:214-25. [PMID: 9069182 DOI: 10.1007/pl00006138]
Abstract
A computer-assisted analysis was made of 24 complete nucleotide sequences selected from the vertebrate retroviruses to represent the ten viral groups. The conclusions of this analysis extend and strengthen the previously made hypothesis on the Moloney murine leukemia virus: The evolution of the nucleotide sequence appears to have occurred mainly through at least three overlapping levels of duplication: (1) The distributions of overrepresented (3-6)-mers are consistent with the universal rule of a trend toward TG/CT excess and with the persistence of a certain degree of symmetry between the two strands of DNA. This suggests one or several original tandemly repeated sequences and some inverted duplications. (2) The existence of two general core consensuses at the level of these (3-6)-mers supports the hypothesis of a common evolutionary origin of vertebrate retroviruses. Consensuses more specific to certain sequences are compatible with phylogenetic trees established independently. The consensuses could correspond to intermediary evolutionary stages. (3) Most of the (3-6)-mers with a significantly higher than average frequency appear to be internally repeated (with monomeric or oligomeric internal iterations) and seem to be at least partly the cause of the bias observed by other researchers at the level of retroviral nucleotide composition. They suggest a third evolutionary stage by slippage-like stepwise local duplications.
Affiliation(s)
- I Laprevotte
- Laboratoire Rétrovirus et Rétrotransposons des Vertébrés, UPR 43 CNRS, Université Paris 7, Hôpital Saint Louis, France
16
Hénaut A, Rouxel T, Gleizes A, Moszer I, Danchin A. Uneven distribution of GATC motifs in the Escherichia coli chromosome, its plasmids and its phages. J Mol Biol 1996; 257:574-85. [PMID: 8648625 DOI: 10.1006/jmbi.1996.0186]
Abstract
This work reconsiders the GATC motif distribution in a 1.6 Mb segment of the Escherichia coli genome, compared to its distribution in phages and plasmids. At first sight the distribution of GATC words looks random. But when a realistic model of the chromosome (made of average genes having the same codon usage as in the real chromosome) is used as a theoretical reference, strong biases are observed. GATC pairs such as GATCNNGATC are under-represented while there is a strong positive selection for motifs separated by 10, 19, 70 and 1100 bp. The last class is the only one present in E. coli parasites. It can be ascribed to the triggering sequences of the long-patch mismatch repair system. The 6 bp class overlaps with the consensus of CAP (catabolite activator protein) and FNR (fumarate/nitrate regulator) binding sites, thus accounting for counter-selection. The other classes, which could be targets for a nucleic acid-binding protein, are almost always present inside protein coding sequences, and are members of clusters of GATC motifs. Analysis of the genes containing these motifs suggests that they correspond to a regulatory process monitoring the shift from anaerobic to aerobic growth conditions. In particular this regulation, closing down transcription of a large number of genes involved in intermediary metabolism, would be well suited for the cold and oxygen shift from the mammal's gut to the standard environmental conditions. In this process the methylation status of GATC clusters would be very important for tuning transcription, and a DNA binding protein, probably a member of the cold-shock protein family, would be needed for alleviating the effects mediated by slackening of the pace of methylation during the shift.
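The observed side of such an analysis, a histogram of distances between neighbouring GATC motifs, can be sketched in a few lines. The function name `motif_spacings` and the start-to-start distance convention are assumptions made here; the convention is chosen so that GATCNNGATC counts as a spacing of 6, matching the "6 bp class" in the abstract. The reference model of the chromosome that the abstract compares against is not reproduced.

```python
from collections import Counter


def motif_spacings(seq, motif="GATC"):
    """Count start-to-start distances between consecutive motif occurrences."""
    seq = seq.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    # Pair each occurrence with the next one along the sequence.
    return Counter(b - a for a, b in zip(positions, positions[1:]))


# GATC at positions 0, 6 and 16, so the spacings are 6 and 10.
print(motif_spacings("GATCAAGATCTTTTTTGATC"))
# Counter({6: 1, 10: 1})
```

On its own this histogram says nothing about selection. The biases reported in the abstract appear only when the counts are compared with those expected from a realistic reference, such as the model of average genes with matching codon usage that the authors describe.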
Affiliation(s)
- A Hénaut
- Centre de Génétique Moléculaire, CNRS, Gif sur Yvette, France
17
Diaz-Lazcoz Y, Hénaut A, Vigier P, Risler JL. Differential codon usage for conserved amino acids: evidence that the serine codons TCN were primordial. J Mol Biol 1995; 250:123-7. [PMID: 7608964 DOI: 10.1006/jmbi.1995.0363]
Abstract
The availability of specialized sequence databanks for Escherichia coli, Saccharomyces cerevisiae and Bacillus subtilis made it possible to build a set of 105 protein-coding genes that are homologous in these three species. An analysis of the triplets at both the nucleotide and amino acid level revealed that the codon bias of some amino acids is significantly higher at conserved than at non-conserved positions. Comparisons of homologous genes in E. coli and Salmonella typhimurium, and in S. cerevisiae and Drosophila melanogaster, led to the same conclusion. A special case was made for serine in E. coli, whose major codon is AGC for non-conserved and TCC for conserved residues. We interpret this observation as evidence that the primordial codons for serine were TCN, while codons AGY appeared later. This conclusion is substantiated by an analysis of the codon usage of catalytic serine residues in ancient, ubiquitous and essential proteins (ATP synthases and topoisomerases). It is shown that in these proteins the proportion of the catalytic serine residues coded by TCN is significantly higher than the one expected from the overall codon usage of serine residues.
Affiliation(s)
- Y Diaz-Lazcoz
- Centre de Génétique Moléculaire du CNRS, Laboratoire associé à l'Université P. et M. Curie, Gif sur Yvette, France
18
Ollivier E, Delorme MO, Hénaut A. DosDNA occurs along yeast chromosomes, regardless of functional significance of the sequence. C R Acad Sci III 1995; 318:599-608. [PMID: 7671006]
Abstract
Complex genomes contain numerous simple sequence repeats, the biological significance of which remains obscure. Recently it has been shown that several human diseases are the result of changes in such sequences. Thus it has become urgent to undertake a systematic study of their properties. We have set the task of describing as completely as possible the set of sequences which contain bases organized according to symmetrical elements, the dosDNA: defined ordered sequence. Examination of local anomalies in dinucleotide composition serves to identify dosDNA zones in the genome. The study of chromosomes II, III, VIII and XI of Saccharomyces cerevisiae reveals these dosDNA zones comprise about 2% of the genome. They are regularly distributed along the chromosomes, regardless of the functional significance of the sequence. A more detailed analysis of dosDNA segments seems to indicate that simple repeats are the consequence of local properties of the chromosome, and not due to any motif in particular.
Affiliation(s)
- E Ollivier
- Centre de génétique moléculaire du CNRS, Laboratoire associé à l'Université Pierre-et-Marie-Curie et à l'Université Versailles-Saint-Quentin, Gif-sur-Yvette, France
19
Abstract
A program for assembling sequences by using a global approach has been developed. By successive steps, a more and more precise classification of DNA fragments permits the positioning of the sequences on the contig; after having detected the pairs of overlapping sequences, groups are formed such that all sequences in a group overlap. Sequences common to several groups enable the groups to be ordered in a series. Ambiguities in the order of groups can arise at this stage, due to the presence of repeated fragments; different solutions are then proposed. Putting the groups into order leads to a preclassification of sequences. The fragments are then aligned by group, by searching for words common to all sequences in the group, and using an algorithm of dynamic programming. A detailed example on a set of nine sequences accompanies the description of the method.
Affiliation(s)
- A Gleizes
- Centre de Génétique Moléculaire du CNRS, Gif sur Yvette, France
20
Abstract
Several data libraries have been created to organize all the data obtained worldwide about the Escherichia coli genome. Because the known data now amount to more than 40% of the whole genome sequence, it has become necessary to organize the data in such a way that appropriate procedures can associate knowledge produced by experiments about each gene to its position on the chromosome and its relation to other relevant genes, for example. In addition, global properties of genes, affected by the introduction of new entries, should be present as appropriate description fields. A data base, implemented on Macintosh by using the data base management system 4th Dimension, is described. It is constructed around a core constituted by known contigs of E. coli sequences and links data collected in general libraries (unmodified) to data associated with evolving knowledge (with modifiable fields). Biologically significant results obtained through the coupling of appropriate procedures (learning or statistical data analysis) are presented. The data base is available through a 4th Dimension runtime and through FTP on Internet. It has been regularly updated and will be systematically linked to other E. coli data bases (M. Kroger, R. Wahl, G. Schachtel, and P. Rice, Nucleic Acids Res. 20(Suppl.):2119-2144, 1992; K. E. Rudd, W. Miller, C. Werner, J. Ostell, C. Tolstoshev, and S. G. Satterfield, Nucleic Acids Res. 19:637-647, 1991) in the near future.
21
Abstract
A method aimed at classifying protein sequences without resorting to pairwise alignment is presented. Called DOCMA (DOt-plot Comparisons by Multivariate Analysis), it is based on a multivariate analysis of the pairwise dot-plots between all the sequences in the set. The dot-plots are first simplified by considering only the projections of the 'diagonal' segments of similarity onto the axes. From these projections a data matrix is built, in which each column is representative of the comparisons of one given sequence with all the other ones. This data matrix is then transformed into a distance matrix by a chi-squared analysis, from which the coordinates of the sequences in an orthonormal Euclidean space are obtained. The sequences are finally classified by a dynamic clustering procedure followed by a search for strong clusters. Application of this method to protein families such as the globins, the cytochromes c and the aminoacyl-tRNA synthetases shows that it is quite effective in delineating subgroups that contain even distantly related sequences.
Affiliation(s)
- C Landès
- Centre de Génétique Moléculaire du CNRS, Laboratoire Associé à l'Université Gif-sur-Yvette, France
|
22
|
Landès C, Hénaut A, Risler JL. A comparison of several similarity indices used in the classification of protein sequences: a multivariate analysis. Nucleic Acids Res 1992; 20:3631-7. [PMID: 1641329 PMCID: PMC334011 DOI: 10.1093/nar/20.14.3631] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.4] [Indexed: 12/28/2022]
Abstract
The present work describes an attempt to identify reliable criteria which could be used as distance indices between protein sequences. Seven different criteria have been tested: i and ii) the scores of the alignments as given by the BESTFIT and the FASTA programs; iii) the ratio parameter, i.e. the BESTFIT score divided by the length of the aligned peptides; iv and v) the statistical significance (Z-scores) of the scores calculated by BESTFIT and FASTA, as obtained by comparison with shuffled sequences; vi) the Z-scores provided by the program RELATE which performs a segment-by-segment comparison of 2 sequences, and vii) an original distance index calculated by the program DOCMA from all the pairwise dotplots between the sequences. These 7 criteria have been tested against the amino acid sequences of 39 globins and those of the 20 aminoacyl-tRNA synthetases from E. coli. The distances between the sequences were analyzed by multivariate analysis techniques. The results show that the distances calculated from the scores of the pairwise alignments are not adequately sensitive. The Z-score from RELATE is not selective enough and too demanding in computer time. Three criteria gave a classification consistent with the known similarities between the sequences in the sets, namely the Z-scores from BESTFIT and FASTA and the multiple dotplot comparison distance index from DOCMA.
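A Z-score obtained by comparison with shuffled sequences can be sketched as follows. This is a minimal illustration under stated assumptions: the scoring function `identity_score` is a toy ungapped identity count standing in for an alignment score (it is not BESTFIT or FASTA), and the names, the number of shuffles and the seed are hypothetical choices.

```python
import random
import statistics


def identity_score(a: str, b: str) -> int:
    """Toy similarity score: identical residues at matching positions (no gaps)."""
    return sum(x == y for x, y in zip(a, b))


def shuffle_z_score(a, b, score=identity_score, n_shuffles=200, seed=0):
    """Z-score of score(a, b) against scores obtained after shuffling b.

    Shuffling keeps the composition of b but destroys the residue order,
    which gives a null distribution for the score.
    """
    rng = random.Random(seed)
    observed = score(a, b)
    null_scores = []
    for _ in range(n_shuffles):
        residues = list(b)
        rng.shuffle(residues)
        null_scores.append(score(a, "".join(residues)))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    if sd == 0:
        return 0.0  # Z-score undefined without spread; reported as 0.0 here
    return (observed - mean) / sd
```

Unlike the raw score, the Z-score is expressed in units of the spread expected for sequences of the same length and composition, which is what makes scores from different pairs comparable.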
Affiliation(s)
- C Landès
- Centre de Génétique Moléculaire du CNRS, Gif sur Yvette, France
|
23
|
Abstract
A simple and efficient method is described for analyzing quantitatively multiple protein sequence alignments and finding the most conserved blocks as well as the maxima of divergence within the set of aligned sequences. It consists of calculating the mean distance and the root-mean-square distance in each column of the multiple alignment, averaging the values in a window of defined length and plotting the results as a function of the position of the window. Due attention is paid to the presence of gaps in the columns. Several examples are provided, using the sequences of several cytochromes c, serine proteases, lysozymes and globins. Two distance matrices are compared, namely the matrix derived by Gribskov and Burgess from the Dayhoff matrix, and the Risler Structural Superposition Matrix. In each case, the divergence plots effectively point to the specific residues which are known to be essential for the catalytic activity of the proteins. In addition, the regions of maximum divergence are clearly delineated. Interestingly, they are generally observed in positions immediately flanking the most conserved blocks. The method should therefore be useful for delineating the peptide segments which will be good candidates for site-directed mutagenesis and for visualizing the evolutionary constraints along homologous polypeptide chains.
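The calculation described above (mean and root-mean-square distance per column, then averaging over a window) can be sketched as below. Several points are assumptions rather than the published method: `toy_distance` is a 0/1 identity distance used in place of the Gribskov-Burgess or Risler matrices, gaps are simply skipped (the abstract does not say how gaps are handled), and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def toy_distance(a: str, b: str) -> float:
    """Placeholder residue distance: 0 if identical, 1 otherwise."""
    return 0.0 if a == b else 1.0


def column_divergence(alignment, distance=toy_distance, gap="-"):
    """Return (means, rms): per-column mean and RMS of all pairwise distances.

    alignment is a list of equal-length aligned sequences.
    Pairs involving a gap are ignored (an assumption of this sketch).
    """
    means, rms = [], []
    for j in range(len(alignment[0])):
        residues = [seq[j] for seq in alignment if seq[j] != gap]
        d = [distance(a, b) for a, b in combinations(residues, 2)]
        if d:
            means.append(sum(d) / len(d))
            rms.append(sqrt(sum(x * x for x in d) / len(d)))
        else:
            # Fewer than two residues in the column: no pair to compare.
            means.append(0.0)
            rms.append(0.0)
    return means, rms


def window_average(values, window):
    """Average values over a sliding window; one result per window position."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
```

Plotting `window_average(means, w)` against the window position gives a divergence profile in which low values mark conserved blocks and peaks mark the most divergent regions.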
Affiliation(s)
- S Brouillet
- Centre de Génétique Moléculaire du CNRS, Laboratoire Associé à l'Université P et M Curie, Gif-sur-Yvette, France
|
24
|
Abstract
After extracting more than 780 identified Escherichia coli genes from available data libraries, we investigated the codon usage of the corresponding coding sequences and extended the study of gene classes, thus obtained, to the nature and intensity of short nucleotide sequence selection, related to constraints operating at the nucleotide level. Using Factorial Correspondence Analysis we found that three classes ought to be included in order to match all data now available. The first two classes, as known, encompass genes expressed either continuously at a high level, or at a low level and/or rarely; the third class consists of genes corresponding to surface elements of the cell, genes coming from mobile elements as well as genes resulting in a high fidelity of DNA replication. This suggests that bacterial strains cultivated in the laboratory have been fixed by specific use of antimutator genes that are horizontally exchanged.
Affiliation(s)
- C Médigue
- Atelier de BioInformatique, Section Physique-Chimie, Institut Curie, Paris, France
|
25
|
Abstract
A characteristic profile of the fluctuations of codon usage is observed in bacteriophages and mitochondria. By following the DNA in the direction of transcription, one moves slowly from a region where selective pressure favours codons ending with C to a region where the bias is in favour of codons ending with T; then, abruptly, one again enters a region of codons ending in C. The transcription end point takes place in the area of abrupt change in codon usage. By comparing Drosophila yakuba and mouse mitochondrial genomes, it is possible to show that the strategy of codon usage for a given gene depends on its location along the transcription unit and not on the encoded protein. The choice of codons ending in T or C allows large scale variations of DNA stability which could regulate the speed of propagation of the RNA polymerase.
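A profile of this kind can be approximated by following the third codon position along a coding sequence and measuring, in a sliding window, how often it is C rather than T. The sketch below is illustrative only: the function name, the window size and the choice of statistic are assumptions, not the authors' procedure.

```python
def third_base_c_fraction(cds: str, window_codons: int = 3):
    """Sliding-window fraction of C among codons ending in C or T.

    cds is read in frame from its first base; a trailing partial codon is
    ignored. A window with no codon ending in C or T yields None.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    thirds = [cds[i + 2] for i in range(0, usable, 3)]
    profile = []
    for start in range(len(thirds) - window_codons + 1):
        w = thirds[start:start + window_codons]
        c, t = w.count("C"), w.count("T")
        profile.append(c / (c + t) if c + t else None)
    return profile
```

Values near 1 indicate a region favouring codons ending in C, values near 0 a region favouring codons ending in T; an abrupt jump from low to high would correspond to the sharp change described above.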
Affiliation(s)
- M O Delorme
- Centre de Génétique Moléculaire, Laboratoire propre du CNRS associé à l'Université Pierre et Marie Curie, Paris VI, Gif-sur-Yvette, France
|
26
|
Abstract
The DNA sequence data for Escherichia coli deposited in the EMBL library (release 27), together with miscellaneous data obtained from several laboratories, have been localized on an updated and corrected version of the restriction map of the chromosome generated by Kohara et al. (1987) and modified by others. This second update adds a further 500 kbp, increasing the amount of the E. coli chromosome sequenced to about one third of the total: 1510 kbp of sequenced DNA is included in the present data base. The accuracy of the map is assessed, and allows us to propose a precise genetic map position for every sequenced gene. The locations of rare-cutting sites such as AvrII, NotI and SfiI have also been included in the update in order to combine the data obtained from different sources into one single file. The distribution of palindromic sequences (to which most restriction sites belong) has been studied in coding sequences. There appears to be a significant counter-selection against several such sequences in E. coli coding sequences (but not in other organisms such as Saccharomyces cerevisiae), suggesting the existence of constraints on DNA structure in E. coli, perhaps indicative of a functional role for horizontal gene transfer, preserving coding sequences, in this type of bacteria.
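One simple way to look for counter-selection against a palindromic sequence is to compare the number of times it occurs with the number expected from base composition alone. The sketch below uses that comparison as an assumption; the abstract does not state which statistic was used, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_palindrome(site: str) -> bool:
    """True if the site reads the same on both strands, e.g. CCTAGG."""
    return site == reverse_complement(site)


def observed_expected(seq: str, site: str):
    """Return (observed, expected) counts of site in seq.

    Occurrences may overlap. The expected count assumes independent bases
    drawn with the frequencies observed in seq (a zero-order model).
    """
    seq = seq.upper()
    n, k = len(seq), len(site)
    observed = sum(seq.startswith(site, i) for i in range(n - k + 1))
    freq = {base: seq.count(base) / n for base in "ACGT"}
    p = 1.0
    for base in site:
        p *= freq[base]
    return observed, p * (n - k + 1)
```

An observed count well below the expected one, across many coding sequences, would be consistent with counter-selection; a more careful analysis would use a higher-order background model, since a zero-order model ignores codon and dinucleotide biases.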
Affiliation(s)
- C Médigue
- Section Physique-Chimie, Institut Curie, Paris, France
|
27
|
Abstract
The information collected in national and international libraries on nucleotide and protein sequences cannot be directly treated for proper handling by existing software. Therefore we evaluated the feasibility of constructing a data base for Escherichia coli using the data present in the banks. The knowhow thus acquired was applied to Bacillus subtilis. Specific examples of the general procedure are given.
Affiliation(s)
- A Danchin
- Unité Régulation de l'Expression Génétique, Institut Pasteur, Paris, France
|
28
|
Médigue C, Hénaut A, Danchin A. Escherichia coli molecular genetic map (1000 kbp): update I. Mol Microbiol 1990; 4:1443-54. [PMID: 2287271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/31/2022]
Abstract
The sequenced genes from Escherichia coli that are available in the EMBL library (release 21) have been localized on an updated and corrected version of the restriction map of the chromosome generated by Kohara et al. (1987). One thousand kbp of sequenced DNA are incorporated in this update; this is equivalent to 23% of the total genome. The accuracy of the map is assessed, and it is corrected and updated where appropriate. A significant number of sites were missing from the original map, mainly involving two of the eight enzymes used by Kohara et al. (1987), i.e. PvuII and EcoRV. The nucleotide environment of such missing sites was examined and, using an Artificial Intelligence approach, it appears that the site for these enzymes is sensitive to context effects. Several genes of known position on the E. coli chromosome could not be placed on the restriction map; this suggests that additional gaps are likely to exist on the restriction map, in addition to the original seven identified by Kohara et al. We have also obtained information about the probable direction of transcription of chromosomal genes with respect to the map. Most genes are transcribed in the same direction as the replication forks, particularly around oriC at 84 min.
Affiliation(s)
- C Médigue
- Atelier de Bioinformatique, Section Physique-Chimie, Institut Curie, Paris, France
|
29
|
Abstract
This paper describes software (written in Pascal and running on Macintosh computers) allowing localization of unknown DNA fragments from the Escherichia coli chromosome on the restriction map established by Kohara et al. (1987). The program identifies the segment's map position using a restriction pattern analysis obtained with all, or some, of the eight enzymes used by Kohara et al. (1987). Therefore, the sequenced genes available in the EMBL library may be localized on the E. coli chromosome restriction map. This allowed correction of the map (mainly by introducing missing sites in the published maps) at the corresponding positions. Analysis of the data indicates that there is only a very low level of polymorphism, at the nucleotide level, between the E. coli K12 strains used by the various laboratories involved in DNA sequencing. The program is versatile enough to be used with other genomes.
Affiliation(s)
- C Médigue
- Atelier de BioInformatique, Section Physique-Chimie, Institut Curie, Paris, France
|
30
|
Abstract
The graphical representation of distance matrices in a Euclidean space allows the merging of two distance matrices since the two matrices have shared elements. The graphical representation of the merging of the two distance matrices is associated with a robust method of classification that allows one to distinguish species for which membership to a cluster cannot be established with certainty. These possibilities are exploited to test the consistency of phylogenetic trees, and to establish exact relations between species for which one possesses different independent distance measurements (distance matrices established from several types of sequences for instance). The whole set of programs is written in BASIC and runs on microcomputers.
Affiliation(s)
- M O Delorme
- Centre de Génétique Moléculaire du CNRS, Université Pierre et Marie Curie, Paris VI, Gif sur Yvette, France
|
31
|
Abstract
We analyzed DNA sequences using, as the elementary pattern, segments of increasing length, from the codon up to the gene. This allowed us to identify some of the constraints that give rise to the use of code degeneracy in each gene. Our results show that the strategy of codon use is not solely related to the characteristics of the translation apparatus.
|
32
|
|