1
Characterization of circRNA‑associated ceRNA networks in patients with nonvalvular persistent atrial fibrillation. Mol Med Rep 2018; 19:638-650. [PMID: 30483740 DOI: 10.3892/mmr.2018.9695] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 03/28/2018] [Accepted: 09/13/2018] [Indexed: 11/05/2022]
Abstract
Circular RNAs (circRNAs) are non-coding RNAs forming closed-loop structures, and their aberrant expression may lead to disease. However, the potential network of circRNA‑associated competing endogenous RNA (ceRNA) involved in nonvalvular persistent atrial fibrillation (NPAF) has not been previously reported. In the present study, four left atrial appendages (LAA) of patients with NPAF and four normal LAAs were examined via RNA sequencing, and their potential functions were investigated via bioinformatics analysis. The circRNA‑enriched genes were analyzed using Gene Ontology (GO) categories, while the enrichment of circRNAs was detected via the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. A total of 296 significantly dysregulated circRNA transcripts were obtained, with 238 upregulated and 58 downregulated. A number of circRNAs were further confirmed using reverse transcription‑quantitative polymerase chain reaction analysis. Furthermore, the more comprehensive circRNA‑associated ceRNA networks were examined in patients with NPAF. GO categories and KEGG annotation analysis of circRNAs revealed that the circRNA‑associated ceRNA networks were likely to influence AF through alterations in calcium and cardiac muscle contraction. The circRNA‑associated ceRNA networks revealed that dysregulated circRNAs in NPAF may be involved in regulating hsa‑microRNA (miR)‑208b and hsa‑miR‑21. To the best of our knowledge, this study presents the circRNA‑associated ceRNA networks in NPAF for the first time, which may have potential implications for the pathogenesis of AF. This study reveals a potential perspective from which to investigate circRNAs in circRNA‑associated ceRNA networks (hsa_circRNA002085, hsa_circRNA001321) in NPAF, and provides a potential biomarker for AF.

2
Wang Y, Hu H, Li X. MBBC: an efficient approach for metagenomic binning based on clustering. BMC Bioinformatics 2015; 16:36. [PMID: 25652152 PMCID: PMC4339733 DOI: 10.1186/s12859-015-0473-8] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 07/30/2014] [Accepted: 01/22/2015] [Indexed: 12/02/2022]
Abstract
BACKGROUND: Binning environmental shotgun reads is one of the most fundamental tasks in metagenomic studies, in which mixed reads from different species or operational taxonomical units (OTUs) are separated into different groups. While dozens of binning methods are available, there is still room for improvement.
RESULTS: We developed a novel taxonomy-independent approach called MBBC (Metagenomic Binning Based on Clustering) to cluster environmental shotgun reads, by considering k-mer frequency in reads and Markov properties of the inferred OTUs. Tested on twelve simulated datasets, MBBC reliably estimated the species number, the genome size, and the relative abundance of each species, independent of whether there are errors in reads. Tested on multiple experimental datasets, MBBC outperformed two state-of-the-art taxonomy-independent methods, in terms of the accuracy of the estimated species number, genome sizes, and percentages of correctly assigned reads, among other metrics.
CONCLUSIONS: We have developed a novel method for binning metagenomic reads based on clustering. This method is demonstrated to reliably predict species numbers, genome sizes, relative species abundances, and k-mer coverage in simple datasets. Our method also has a high accuracy in read binning. The MBBC software is freely available at http://eecs.ucf.edu/~xiaoman/MBBC/MBBC.html.
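The abstract above names k-mer frequency in reads as one input to the clustering. As a rough illustration only, the sketch below computes a normalized k-mer frequency vector for a single read. The function name `kmer_profile` and the default `k = 4` are assumptions made here; this is not the MBBC implementation, and it does not model the Markov properties of the inferred OTUs.

```python
from collections import Counter
from itertools import product


def kmer_profile(read: str, k: int = 4) -> list[float]:
    """Return normalized k-mer frequencies for one read.

    The vector is ordered lexicographically over the alphabet ACGT.
    k-mers containing other symbols (for example N) are ignored.
    Hypothetical helper for illustration, not part of MBBC.
    """
    read = read.upper()
    # Count every overlapping window of length k.
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]
```

Vectors of this kind could be fed to any generic clustering routine; which distance measure and clustering model suit real metagenomic reads is a separate question that this sketch does not address.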
Affiliation(s)
- Ying Wang
  - Department of Electric Engineering and Computer Science, University of Central Florida, Orlando, FL, 32816, USA.
- Haiyan Hu
  - Department of Electric Engineering and Computer Science, University of Central Florida, Orlando, FL, 32816, USA.
- Xiaoman Li
  - Department of Electric Engineering and Computer Science, University of Central Florida, Orlando, FL, 32816, USA.
  - Burnett School of Biomedical Science, University of Central Florida, Orlando, FL, 32816, USA.

3
Abstract
We extend the self-organizing approach for annotation of a bacterial genome to analyze the raw sequencing data of the human gut metagenome without sequence assembly. The original approach divides the genomic sequence of a bacterium into non-overlapping segments of equal length and assigns to each segment one of seven 'phases', among which one is for the noncoding regions, three for the direct coding regions to indicate the three possible codon positions of the segment starting site, and three for the reverse coding regions. The noncoding phase and the six coding phases are described by two frequency tables of the 64 triplet types or 'codon usages'. A set of codon usages can be used to update the phase assignment and vice versa. An iteration after an initialization leads to a convergent phase assignment to give an annotation of the genome. In the extension of the approach to a metagenome, we consider a mixture model of a number of categories described by different codon usages. The Illumina Genome Analyzer sequencing data of the total DNA from faecal samples are then examined to understand the diversity of the human gut microbiome.
Affiliation(s)
- Jianfeng Zhu
  - Beijing Genomics Institute, Tianjin (BGI-TJ), Tianjin 300308, China
- Wei-Mou Zheng
  - Beijing Genomics Institute, Tianjin (BGI-TJ), Tianjin 300308, China; Institute of Theoretical Physics, Academia Sinica, Beijing 100190, China.

4
Abstract
We present here a novel methodology for predicting new genes in prokaryotic genomes on the basis of inherent energetics of DNA. Regions of higher thermodynamic stability were identified, which were filtered based on already known annotations to yield a set of potentially new genes. These were then processed for their compatibility with the stereo-chemical properties of proteins and tripeptide frequencies of proteins in Swissprot data, which results in a reliable set of new genes in a genome. Quite surprisingly, the methodology identifies new genes even in well-annotated genomes. Also, the methodology can handle genomes of any GC-content, size and number of annotated genes.

5
Abstract
In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by Codon Structure Factor (CSF) and by a method that we called Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification by UFM is higher than by CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine (C), Guanine (G), and Adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in 1st and 2nd position of triplets and (v) the distance of their GC3 vs. GC2 levels to the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (>250 bp), Drosophila melanogaster (>250 bp) and Arabidopsis thaliana (>200 bp) are successfully classified with a false positive rate lower than or equal to 5%. The method outputs coding sequences in their coding strand and coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.
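Two of the raw ingredients listed in the abstract above, the purine probability at each codon position and the in-frame stop codon frequency, are simple to compute. The sketch below shows one way to do so. The function names are hypothetical, and the code does not reproduce the UFM score itself, whose weighting and thresholds are not given in the abstract.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
PURINES = {"A", "G"}


def codons(seq: str, frame: int = 0) -> list[str]:
    """Split a DNA sequence into complete triplets starting at `frame`."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def purine_by_position(seq: str, frame: int = 0) -> tuple[float, ...]:
    """Fraction of purines (A or G) at codon positions 1, 2 and 3."""
    triplets = codons(seq, frame)
    if not triplets:
        return (0.0, 0.0, 0.0)
    return tuple(
        sum(t[pos] in PURINES for t in triplets) / len(triplets)
        for pos in range(3)
    )


def stop_codon_frequency(seq: str, frame: int = 0) -> float:
    """Fraction of in-frame triplets that are stop codons."""
    triplets = codons(seq, frame)
    if not triplets:
        return 0.0
    return sum(t in STOP_CODONS for t in triplets) / len(triplets)
```

Evaluating both functions over all three frames of each strand gives six feature sets per sequence, which is the kind of input a frame-aware classifier would need.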
Affiliation(s)
- Nicolas Carels
  - Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil

6
Noguchi H, Taniguchi T, Itoh T. MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res 2008; 15:387-96. [PMID: 18940874 PMCID: PMC2608843 DOI: 10.1093/dnares/dsn027] [Citation(s) in RCA: 471] [Impact Index Per Article: 27.7] [Indexed: 11/13/2022]
Abstract
Recent advances in DNA sequencers are accelerating genome sequencing, especially in microbes, and complete and draft genomes from various species have been sequenced in rapid succession. Here, we present a comprehensive gene prediction tool, the MetaGeneAnnotator (MGA), which precisely predicts all kinds of prokaryotic genes from a single anonymous genomic sequence or a set of such sequences of various lengths. The MGA integrates statistical models of prophage genes, in addition to those of bacterial and archaeal genes, and also uses a self-training model from input sequences for predictions. As a result, the MGA sensitively detects not only typical genes but also atypical genes, such as horizontally transferred and prophage genes in a prokaryotic genome. In this paper, we also propose a novel approach for analyzing the ribosomal binding site (RBS), which enables us to detect species-specific patterns of the RBSs. The MGA incorporates an RBS model based on this approach, and precisely predicts translation starts of genes. The MGA also succeeds in improving prediction accuracies for short sequences by using the adapted RBS models (96% sensitivity and 93% specificity for 700 bp fragments). These features of the MGA expedite wide ranges of microbial genome studies, such as genome annotations and metagenome analyses.
Affiliation(s)
- Hideki Noguchi
  - Advanced Science and Technology Research Group, Mitsubishi Research Institute, Inc., 2-3-6 Otemachi, Chiyoda-ku, Tokyo 100-8141, Japan.

7
Lescot M, Audic S, Robert C, Nguyen TT, Blanc G, Cutler SJ, Wincker P, Couloux A, Claverie JM, Raoult D, Drancourt M. The genome of Borrelia recurrentis, the agent of deadly louse-borne relapsing fever, is a degraded subset of tick-borne Borrelia duttonii. PLoS Genet 2008; 4:e1000185. [PMID: 18787695 PMCID: PMC2525819 DOI: 10.1371/journal.pgen.1000185] [Citation(s) in RCA: 123] [Impact Index Per Article: 7.2] [Received: 04/15/2008] [Accepted: 07/31/2008] [Indexed: 01/22/2023]
Abstract
In an effort to understand how a tick-borne pathogen adapts to the body louse, we sequenced and compared the genomes of the recurrent fever agents Borrelia recurrentis and B. duttonii. The 1,242,163–1,574,910-bp fragmented genomes of B. recurrentis and B. duttonii contain a unique 23-kb linear plasmid. This linear plasmid exhibits a large polyT track within the promoter region of an intact variable large protein gene and a telomere resolvase that is unique to Borrelia. The genome content is characterized by several repeat families, including antigenic lipoproteins. B. recurrentis exhibited a 20.4% genome size reduction and appeared to be a strain of B. duttonii, with a decaying genome, possibly due to the accumulation of genomic errors induced by the loss of recA and mutS. Accompanying this were increases in the number of impaired genes and a reduction in coding capacity, including surface-exposed lipoproteins and putative virulence factors. Analysis of the reconstructed ancestral sequence compared to B. duttonii and B. recurrentis was consistent with the accelerated evolution observed in B. recurrentis. Vector specialization of louse-borne pathogens responsible for major epidemics was associated with rapid genome reduction. The correlation between gene loss and increased virulence of B. recurrentis parallels that of Rickettsia prowazekii, with both species being genomic subsets of less-virulent strains.

Borreliae are vector-borne spirochetes that are responsible for Lyme disease and recurrent fevers. We completed the genome sequences of the tick-borne Borrelia duttonii and the louse-borne B. recurrentis. The former of these is responsible for emerging infections that mimic malaria in Africa and in travellers, and the latter is responsible for severe recurrent fever in poor African populations. Diagnostic tools for these pathogens remain poor with regard to sensitivity and specificity due, in part, to the lack of genomic sequences. In this study, we show that the genomic content of B. recurrentis is a subset of that of B. duttonii, the genes of which are undergoing a decay process. These phenomena are common to all louse-borne pathogens compared to their tick-borne counterparts. In B. recurrentis, this process may be due to the inactivation of genes encoding DNA repair mechanisms, implying the accumulation of errors in the genome. The increased virulence of B. recurrentis could not be traced back to specific virulence factors, illustrating the lack of correlation between the virulence of a pathogen and so-called virulence genes. Knowledge of these genomes will allow for the development of new molecular tools that provide a more-accurate, sensitive, and specific diagnosis of these emerging infections.
Affiliation(s)
- Magali Lescot
  - Structural and Genomic Information Laboratory, CNRS UPR2589, IFR88, Parc Scientifique de Luminy, Marseille, France
- Stéphane Audic
  - Structural and Genomic Information Laboratory, CNRS UPR2589, IFR88, Parc Scientifique de Luminy, Marseille, France
- Catherine Robert
  - Unité des Rickettsies, UMR CNRS-IRD 6236, IFR48, Faculté de Médecine, Université de la Méditerranée, Marseille, France
- Thi Tien Nguyen
  - Unité des Rickettsies, UMR CNRS-IRD 6236, IFR48, Faculté de Médecine, Université de la Méditerranée, Marseille, France
- Guillaume Blanc
  - Structural and Genomic Information Laboratory, CNRS UPR2589, IFR88, Parc Scientifique de Luminy, Marseille, France
- Sally J. Cutler
  - School of Health and Bioscience, University of East London, Stratford, London, United Kingdom
- Jean-Michel Claverie
  - Structural and Genomic Information Laboratory, CNRS UPR2589, IFR88, Parc Scientifique de Luminy, Marseille, France
- Didier Raoult
  - Unité des Rickettsies, UMR CNRS-IRD 6236, IFR48, Faculté de Médecine, Université de la Méditerranée, Marseille, France
- Michel Drancourt
  - Unité des Rickettsies, UMR CNRS-IRD 6236, IFR48, Faculté de Médecine, Université de la Méditerranée, Marseille, France

8
Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 2008; 18:1979-90. [PMID: 18757608 DOI: 10.1101/gr.081612.108] [Citation(s) in RCA: 694] [Impact Index Per Article: 40.8] [Indexed: 11/25/2022]
Abstract
We describe a new ab initio algorithm, GeneMark-ES version 2, that identifies protein-coding genes in fungal genomes. The algorithm does not require a predetermined training set to estimate parameters of the underlying hidden Markov model (HMM). Instead, the anonymous genomic sequence in question is used as an input for iterative unsupervised training. The algorithm extends our previously developed method tested on genomes of Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster. To better reflect features of fungal gene organization, we enhanced the intron submodel to accommodate sequences with and without branch point sites. This design enables the algorithm to work equally well for species with the kinds of variations in splicing mechanisms seen in the fungal phyla Ascomycota, Basidiomycota, and Zygomycota. Upon self-training, the intron submodel switches on in several steps to reach its full complexity. We demonstrate that the algorithm's accuracy, at both the exon and the whole-gene level, compares favorably with the accuracy of gene finders that employ supervised training. Application of the new method to known fungal genomes indicates substantial improvement over existing annotations. By eliminating the effort necessary to build comprehensive training sets, the new algorithm can streamline and accelerate the process of annotation in a large number of fungal genome sequencing projects.

9
Singhal P, Jayaram B, Dixit SB, Beveridge DL. Prokaryotic gene finding based on physicochemical characteristics of codons calculated from molecular dynamics simulations. Biophys J 2008; 94:4173-83. [PMID: 18326660 PMCID: PMC2480686 DOI: 10.1529/biophysj.107.116392] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Received: 06/29/2007] [Accepted: 11/29/2007] [Indexed: 01/27/2023]
Abstract
An ab initio model for gene prediction in prokaryotic genomes is proposed based on physicochemical characteristics of codons calculated from molecular dynamics (MD) simulations. The model requires a specification of three calculated quantities for each codon: the double-helical trinucleotide base pairing energy, the base pair stacking energy, and an index of the propensity of a codon for protein-nucleic acid interactions. The base pairing and stacking energies for each codon are obtained from recently reported MD simulations on all unique tetranucleotide steps, and the third parameter is assigned based on the conjugate rule previously proposed to account for the wobble hypothesis with respect to degeneracies in the genetic code. The third interaction propensity parameter values correlate well with ab initio MD calculated solvation energies and flexibility of codon sequences as well as codon usage in genes and amino acid composition frequencies in approximately 175,000 protein sequences in the Swissprot database. Assignment of these three parameters for each codon enables the calculation of the magnitude and orientation of a cumulative three-dimensional vector for a DNA sequence of any length in each of the six genomic reading frames. Analysis of 372 genomes comprising approximately 350,000 genes shows that the orientations of the gene and nongene vectors are well differentiated and make a clear distinction feasible between genic and nongenic sequences at a level equivalent to or better than currently available knowledge-based models trained on the basis of empirical data, presenting a strong support for the possibility of a unique and useful physicochemical characterization of DNA sequences from codons to genomes.
Affiliation(s)
- Poonam Singhal
  - Department of Chemistry and Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India

10
Tamaki S, Arakawa K, Kono N, Tomita M. Restauro-G: a rapid genome re-annotation system for comparative genomics. Genomics Proteomics & Bioinformatics 2007; 5:53-8. [PMID: 17572364 PMCID: PMC5054091 DOI: 10.1016/s1672-0229(07)60014-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 10/31/2022]
Abstract
Annotations of complete genome sequences submitted directly from sequencing projects are diverse in terms of annotation strategies and update frequencies. These inconsistencies make comparative studies difficult. To allow rapid data preparation of a large number of complete genomes, automation and speed are important for genome re-annotation. Here we introduce an open-source rapid genome re-annotation software system, Restauro-G, specialized for bacterial genomes. Restauro-G re-annotates a genome by similarity searches utilizing the BLAST-Like Alignment Tool, referring to protein databases such as UniProt KB, NCBI nr, NCBI COGs, Pfam, and PSORTb. Re-annotation by Restauro-G achieved over 98% accuracy for most bacterial chromosomes in comparison with the original manually curated annotation of EMBL releases. Restauro-G was developed in the generic bioinformatics workbench G-language Genome Analysis Environment and is distributed at http://restauro-g.iab.keio.ac.jp/ under the GNU General Public License.
Affiliation(s)
- Satoshi Tamaki
  - Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
  - Department of Environmental Information, Keio University, Fujisawa 252-8520, Japan
- Kazuharu Arakawa
  - Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
  - Corresponding author.
- Nobuaki Kono
  - Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
  - Department of Environmental Information, Keio University, Fujisawa 252-8520, Japan
- Masaru Tomita
  - Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
  - Department of Environmental Information, Keio University, Fujisawa 252-8520, Japan

11
Audic S, Robert C, Campagna B, Parinello H, Claverie JM, Raoult D, Drancourt M. Genome analysis of Minibacterium massiliensis highlights the convergent evolution of water-living bacteria. PLoS Genet 2007; 3:e138. [PMID: 17722982 PMCID: PMC1950954 DOI: 10.1371/journal.pgen.0030138] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.9] [Received: 05/21/2007] [Accepted: 07/03/2007] [Indexed: 12/25/2022]
Abstract
Filtration usually eliminates water-living bacteria. Here, we report on the complete genome sequence of Minibacterium massiliensis, a beta-proteobacterium that was recovered from 0.22-μm filtered water used for patients in the hospital. The unexpectedly large 4,110,251-nucleotide genome sequence of M. massiliensis was determined using the traditional shotgun sequencing approach. Bioinformatic analyses show that the M. massiliensis genome sequence illustrates characteristic features of water-living bacteria, including overrepresentation of genes encoding transporters and transcription regulators. Phylogenomic analysis based on the gene content of available bacterial genome sequences displays a congruent evolution of water-living bacteria from various taxonomic origins, principally for genes involved in energy production and conversion, cell division, chromosome partitioning, and lipid metabolism. This phylogenomic clustering partially results from lateral gene transfer, which appears to be more frequent in water than in other environments. The M. massiliensis genome analyses strongly suggest that water-living bacteria are a common source for genes involved in heavy-metal resistance, antibiotic resistance, and virulence factors.
Affiliation(s)
- Stéphane Audic
  - Structural and Genomic Information Laboratory, Institute for Structural Biology and Microbiology, Marseille, France
  - CNRS, UPR2589, Marseille, France
  - * To whom correspondence should be addressed. E-mail: (SA); (MD)
- Catherine Robert
  - Unité des Rickettsies, Faculté de Médecine, Université de la Méditerranée, Marseille, France
  - CNRS, UMR6020, Marseille, France
- Bernard Campagna
  - Unité des Rickettsies, Faculté de Médecine, Université de la Méditerranée, Marseille, France
  - CNRS, UMR6020, Marseille, France
- Hugues Parinello
  - Unité des Rickettsies, Faculté de Médecine, Université de la Méditerranée, Marseille, France
  - CNRS, UMR6020, Marseille, France
- Jean-Michel Claverie
  - Structural and Genomic Information Laboratory, Institute for Structural Biology and Microbiology, Marseille, France
  - CNRS, UPR2589, Marseille, France
- Didier Raoult
  - Unité des Rickettsies, Faculté de Médecine, Université de la Méditerranée, Marseille, France
  - CNRS, UMR6020, Marseille, France
- Michel Drancourt
  - Unité des Rickettsies, Faculté de Médecine, Université de la Méditerranée, Marseille, France
  - CNRS, UMR6020, Marseille, France
  - * To whom correspondence should be addressed. E-mail: (SA); (MD)

12
Blanc G, Ogata H, Robert C, Audic S, Suhre K, Vestris G, Claverie JM, Raoult D. Reductive genome evolution from the mother of Rickettsia. PLoS Genet 2007; 3:e14. [PMID: 17238289 PMCID: PMC1779305 DOI: 10.1371/journal.pgen.0030014] [Citation(s) in RCA: 130] [Impact Index Per Article: 7.2] [Received: 05/18/2006] [Accepted: 12/08/2006] [Indexed: 11/30/2022]
Abstract
The Rickettsia genus is a group of obligate intracellular α-proteobacteria representing a paradigm of reductive evolution. Here, we investigate the evolutionary processes that shaped the genomes of the genus. The reconstruction of ancestral genomes indicates that their last common ancestor contained more genes, but already possessed most traits associated with cellular parasitism. The differences in gene repertoires across modern Rickettsia are mainly the result of differential gene losses from the ancestor. We demonstrate using computer simulation that the propensity of loss was variable across genes during this process. We also analyzed the ratio of nonsynonymous to synonymous changes (Ka/Ks) calculated as an average over large sets of genes to assay the strength of selection acting on the genomes of Rickettsia, Anaplasmataceae, and free-living γ-proteobacteria. As a general trend, Ka/Ks were found to decrease with increasing divergence between genomes. The high Ka/Ks for closely related genomes are probably due to a lag in the removal of slightly deleterious nonsynonymous mutations by natural selection. Interestingly, we also observed a decrease of the rate of gene loss with increasing divergence, suggesting a similar lag in the removal of slightly deleterious pseudogene alleles. For larger divergence (Ks > 0.2), Ka/Ks converge toward similar values indicating that the levels of selection are roughly equivalent between intracellular α-proteobacteria and their free-living relatives. This contrasts with the view that obligate endocellular microorganisms tend to evolve faster as a consequence of reduced effectiveness of selection, and suggests a major role of enhanced background mutation rates on the fast protein divergence in the obligate intracellular α-proteobacteria.

Genome downsizing and fast sequence divergence are frequently observed in bacteria living exclusively within the cells of higher eukaryotes. However, the driving forces and contributions of these processes to the genome diversity of the microorganisms remain poorly understood. The genus Rickettsia, a group of small obligate intracellular pathogens of humans, provides a fascinating model to study the genome downsizing process. In this article, we used seven Rickettsia genomes to reconstruct the genome of their ancestor and inferred the origin and fate of the genes found in today's species. We identify the process of gene loss as the main cause of genome diversification within the genus and show that the rate of gene loss, sequence divergence, and genome rearrangements are highly variable across the various Rickettsia lineages. This heterogeneity likely reflects the intricate effects of specialization to distinct arthropod hosts and critical alterations of the gene repertoire, such as the losses of DNA repair genes and the amplification of mobile genes. In contrast, we did not find evidence for the role of reduced population sizes on the long-term acceleration of sequence evolution. Overall, the data presented in this article shed new light on the fundamental evolutionary processes that drive the evolution of obligate intracellular bacteria.
Affiliation(s)
- Guillaume Blanc
  - Structural and Genomic Information Laboratory, Institut de Biologie Structurale et Microbiologie, Parc Scientifique de Luminy, Marseille, France
  - * To whom correspondence should be addressed. E-mail: (GB), (DR)
- Hiroyuki Ogata
  - Structural and Genomic Information Laboratory, Institut de Biologie Structurale et Microbiologie, Parc Scientifique de Luminy, Marseille, France
- Stéphane Audic
  - Structural and Genomic Information Laboratory, Institut de Biologie Structurale et Microbiologie, Parc Scientifique de Luminy, Marseille, France
- Karsten Suhre
  - Structural and Genomic Information Laboratory, Institut de Biologie Structurale et Microbiologie, Parc Scientifique de Luminy, Marseille, France
- Guy Vestris
  - Unité des Rickettsies, Faculté de Médecine, Marseille, France
- Jean-Michel Claverie
  - Structural and Genomic Information Laboratory, Institut de Biologie Structurale et Microbiologie, Parc Scientifique de Luminy, Marseille, France
- Didier Raoult
  - Unité des Rickettsies, Faculté de Médecine, Marseille, France
  - * To whom correspondence should be addressed. E-mail: (GB), (DR)

13
Raoult D, La Scola B, Birtles R. The discovery and characterization of Mimivirus, the largest known virus and putative pneumonia agent. Clin Infect Dis 2007; 45:95-102. [PMID: 17554709 DOI: 10.1086/518608] [Citation(s) in RCA: 93] [Impact Index Per Article: 5.2] [Received: 12/11/2006] [Accepted: 03/05/2007] [Indexed: 11/03/2022]
Abstract
During recent years, the usefulness of amoebal co-cultures as an alternative means of isolating and cultivating fastidious microorganisms has been increasingly recognized. While characterizing a collection of bacteria that had been isolated using this approach, we encountered an organism that, on preliminary analysis, appeared to be a gram-positive coccus. However, additional examination revealed that it was not a bacterium but rather, surprisingly, a virus. The dimensions of the virus particle (diameter, 0.8 μm) and its genome size (1.2 Mb) are far more akin to those of bacteria than to those of previously recognized viruses. These characteristics, together with such features as the breadth and complexity of its gene content, challenge the current definition of a "virus." Furthermore, the virus, now named "Mimivirus," has been implicated as an agent of pneumonia in humans and, thus, should be considered a putative emerging pathogen.
Affiliation(s)
- Didier Raoult
  - Unité des Rickettsies, Faculté de Médecine, Université de la Méditerranée, Marseille, France.

14
Noguchi H, Park J, Takagi T. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res 2006; 34:5623-30. [PMID: 17028096 PMCID: PMC1636498 DOI: 10.1093/nar/gkl723] [Citation(s) in RCA: 636] [Impact Index Per Article: 33.5] [Received: 03/18/2006] [Revised: 09/01/2006] [Accepted: 09/19/2006] [Indexed: 11/13/2022]
Abstract
Exhaustive gene identification is a fundamental goal in all metagenomics projects. However, most metagenomic sequences are unassembled anonymous fragments, and conventional gene-finding methods cannot be applied. We have developed a prokaryotic gene-finding program, MetaGene, which utilizes di-codon frequencies estimated by the GC content of a given sequence with other various measures. MetaGene can predict a whole range of prokaryotic genes based on the anonymous genomic sequences of a few hundred bases, with a sensitivity of 95% and a specificity of 90% for artificial shotgun sequences (700 bp fragments from 12 species). MetaGene has two sets of codon frequency interpolations, one for bacteria and one for archaea, and automatically selects the proper set for a given sequence using the domain classification method we propose. The domain classification works properly, correctly assigning domain information to more than 90% of the artificial shotgun sequences. Applied to the Sargasso Sea dataset, MetaGene predicted almost all of the annotated genes and a notable number of novel genes. MetaGene can be applied to a wide variety of metagenomic projects and expands the utility of metagenomics.
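The abstract above relies on two measurable quantities: the GC content of a fragment and its di-codon (in-frame hexamer) frequencies. The sketch below shows how each could be computed for a single sequence. The function names are hypothetical, and the code makes no attempt to reproduce how MetaGene estimates di-codon frequencies from GC content, which the abstract does not specify.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def dicodon_counts(seq: str, frame: int = 0) -> Counter:
    """Count in-frame di-codons: 6-mers sampled every 3 bases from `frame`."""
    seq = seq.upper()
    return Counter(seq[i:i + 6] for i in range(frame, len(seq) - 5, 3))
```

For example, `dicodon_counts("ATGGCATAA")` yields one count each for `ATGGCA` and `GCATAA`, since adjacent di-codons overlap by one codon.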
Affiliation(s)
- Hideki Noguchi
- Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Chiba 277-8562, Japan.
|
15
|
Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res 2005; 33:6494-506. [PMID: 16314312 PMCID: PMC1298918 DOI: 10.1093/nar/gki937] [Citation(s) in RCA: 625] [Impact Index Per Article: 31.3] [Received: 08/05/2005] [Revised: 10/12/2005] [Accepted: 10/12/2005] [Indexed: 11/25/2022] Open
Abstract
Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. However, genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools tuned up for previously studied species are rarely suitable for efficacious gene hunting in DNA sequences of a new genome. Gene identification methods based on cDNA and expressed sequence tag (EST) mapping to genomic DNA or those using alignments to closely related genomes rely on the existence of abundant cDNA and EST data and/or the availability of reference genomes. Conventional statistical ab initio methods require large training sets of validated genes for estimating gene model parameters. In practice, neither of these types of data may be available in sufficient amount until rather late stages of the novel genome sequencing. Nevertheless, we have shown that gene finding in eukaryotic genomes could be carried out in parallel with statistical model estimation directly from yet anonymous genomic DNA. The suggested method of parallelization of gene prediction with model parameter estimation follows the path of the iterative Viterbi training. Rounds of genomic sequence labeling into coding and non-coding regions are followed by rounds of model parameter estimation. Several dynamically changing restrictions on the possible range of model parameters are added to filter out fluctuations in the initial steps of the algorithm that could redirect the iteration process away from the biologically relevant point in parameter space. Tests on well-studied eukaryotic genomes have shown that the new method performs comparably to or better than conventional methods where the supervised model training precedes the gene prediction step. Several novel genomes have been analyzed and biologically interesting findings are discussed. Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification.
Affiliation(s)
- Alexandre Lomsadze
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA
- Yury O. Chernoff
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA
- Mark Borodovsky
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA
- Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0535, USA
|
16
|
Abstract
The application of whole-genome shotgun sequencing to microbial communities represents a major development in metagenomics, the study of uncultured microbes via the tools of modern genomic analysis. In the past year, whole-genome shotgun sequencing projects of prokaryotic communities from an acid mine biofilm, the Sargasso Sea, Minnesota farm soil, three deep-sea whale falls, and deep-sea sediments have been reported, adding to previously published work on viral communities from marine and fecal samples. The interpretation of this new kind of data poses a wide variety of exciting and difficult bioinformatics problems. The aim of this review is to introduce the bioinformatics community to this emerging field by surveying existing techniques and promising new approaches for several of the most interesting of these computational problems.
Affiliation(s)
- Kevin Chen
- *To whom correspondence should be addressed. E-mail: (KC), (LP)
- Lior Pachter
- *To whom correspondence should be addressed. E-mail: (KC), (LP)
|
17
|
Ogata H, Renesto P, Audic S, Robert C, Blanc G, Fournier PE, Parinello H, Claverie JM, Raoult D. The genome sequence of Rickettsia felis identifies the first putative conjugative plasmid in an obligate intracellular parasite. PLoS Biol 2005; 3:e248. [PMID: 15984913 PMCID: PMC1166351 DOI: 10.1371/journal.pbio.0030248] [Citation(s) in RCA: 202] [Impact Index Per Article: 10.1] [Received: 03/08/2005] [Accepted: 05/11/2005] [Indexed: 12/04/2022] Open
Abstract
We sequenced the genome of Rickettsia felis, a flea-associated obligate intracellular α-proteobacterium causing spotted fever in humans. Besides a circular chromosome of 1,485,148 bp, R. felis exhibits the first putative conjugative plasmid identified among obligate intracellular bacteria. This plasmid is found in a short (39,263 bp) and a long (62,829 bp) form. R. felis contrasts with previously sequenced Rickettsia in terms of many other features, including a number of transposases, several chromosomal toxin–antitoxin genes, many more spoT genes, and a very large number of ankyrin- and tetratricopeptide-motif-containing genes. Host-invasion-related genes for patatin and RickA were found. Several phenotypes predicted from genome analysis were experimentally tested: conjugative pili and mating were observed, as well as β-lactamase activity, actin-polymerization-driven mobility, and hemolytic properties. Our study demonstrates that complete genome sequencing is the fastest approach to reveal phenotypic characters of recently cultured obligate intracellular bacteria. Rickettsia felis is an obligate intracellular bacterium that lives in fleas and causes spotted fever in humans. Its genome sequence provides the first evidence that such bacteria can undergo conjugation.
Affiliation(s)
- Hiroyuki Ogata
- 1 Structural and Genomic Information Laboratory, UPR 2589, IBSM, CNRS, Marseille Cedex, France
- Patricia Renesto
- 2 Unité des Rickettsies, UMR 6020, IFR 48, CNRS, Faculté de Médecine, Marseille Cedex, France
- Stéphane Audic
- 1 Structural and Genomic Information Laboratory, UPR 2589, IBSM, CNRS, Marseille Cedex, France
- Catherine Robert
- 2 Unité des Rickettsies, UMR 6020, IFR 48, CNRS, Faculté de Médecine, Marseille Cedex, France
- Guillaume Blanc
- 1 Structural and Genomic Information Laboratory, UPR 2589, IBSM, CNRS, Marseille Cedex, France
- Pierre-Edouard Fournier
- 1 Structural and Genomic Information Laboratory, UPR 2589, IBSM, CNRS, Marseille Cedex, France
- 2 Unité des Rickettsies, UMR 6020, IFR 48, CNRS, Faculté de Médecine, Marseille Cedex, France
- Hugues Parinello
- 2 Unité des Rickettsies, UMR 6020, IFR 48, CNRS, Faculté de Médecine, Marseille Cedex, France
- Jean-Michel Claverie
- 1 Structural and Genomic Information Laboratory, UPR 2589, IBSM, CNRS, Marseille Cedex, France
- Didier Raoult
- 2 Unité des Rickettsies, UMR 6020, IFR 48, CNRS, Faculté de Médecine, Marseille Cedex, France
|
18
|
Raoult D, Ogata H, Audic S, Robert C, Suhre K, Drancourt M, Claverie JM. Tropheryma whipplei Twist: a human pathogenic Actinobacteria with a reduced genome. Genome Res 2003; 13:1800-9. [PMID: 12902375 PMCID: PMC403771 DOI: 10.1101/gr.1474603] [Citation(s) in RCA: 100] [Impact Index Per Article: 4.5] [Indexed: 12/17/2022]
Abstract
The human pathogen Tropheryma whipplei is the only known reduced genome species (<1 Mb) within the Actinobacteria [high G+C Gram-positive bacteria]. We present the sequence of the 927303-bp circular genome of T. whipplei Twist strain, encoding 808 predicted protein-coding genes. Specific genome features include deficiencies in amino acid metabolisms, the lack of clear thioredoxin and thioredoxin reductase homologs, and a mutation in DNA gyrase predicting a resistance to quinolone antibiotics. Moreover, the alignment of the two available T. whipplei genome sequences (Twist vs. TW08/27) revealed a large chromosomal inversion the extremities of which are located within two paralogous genes. These genes belong to a large cell-surface protein family defined by the presence of a common repeat highly conserved at the nucleotide level. The repeats appear to trigger frequent genome rearrangements in T. whipplei, potentially resulting in the expression of different subsets of cell surface proteins. This might represent a new mechanism for evading host defenses. The T. whipplei genome sequence was also compared to other reduced bacterial genomes to examine the generality of previously detected features. The analysis of the genome sequence of this previously largely unknown human pathogen is now guiding the development of molecular diagnostic tools and more convenient culture conditions.
Affiliation(s)
- Didier Raoult
- Unité des Rickettsies, Faculté de Médecine, CNRS UMR6020, Université de la Méditerranée, 13385 Marseille Cedex 05, France.
|
19
|
Zheng WM, Wu F. In-phase implies large likelihood for independent codon model: distinguishing coding from non-coding sequences. J Theor Biol 2003; 223:199-203. [PMID: 12814602 DOI: 10.1016/s0022-5193(03)00086-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/16/2022]
Abstract
It is proven that under the independent codon model, the likelihood of a DNA coding sequence read according to the correct frame is asymptotically larger than that read with an incorrect frame. Based on this proposition, a single set of probabilities of the codon usage is enough for discriminating the six frames of coding sequences under the independent codon model. The direct coding sequence of the Escherichia coli genome is taken as an example to examine the codon independency by using the mutual information and χ² analysis. The contrast between the coding frame and the two offset frames is evident. A self-learning approach for generating a training set is proposed to estimate probability parameters.
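The proposition in this abstract can be sketched in a few lines: under an independent codon model, the log-likelihood of a sequence is the sum of log codon probabilities, and the correct frame is expected to score highest. The function names, the add-one smoothing, and the restriction to the three frames of one strand are assumptions made for illustration, not details taken from the paper. The other three frames would be scored on the reverse complement.

```python
import math
from collections import Counter

BASES = "ACGT"

def estimate_codon_probs(coding_seqs, pseudocount=1.0):
    """Codon usage probabilities from in-frame training sequences (smoothed)."""
    counts = Counter({a + b + c: pseudocount for a in BASES for b in BASES for c in BASES})
    for seq in coding_seqs:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def codon_log_likelihood(seq, codon_probs, frame):
    """Sum of log P(codon) over the codons read from offset `frame` (0, 1, or 2)."""
    return sum(math.log(codon_probs[seq[i:i + 3]])
               for i in range(frame, len(seq) - 2, 3))

def best_frame(seq, codon_probs):
    """Frame with the largest average log-likelihood per codon."""
    def average(frame):
        n_codons = (len(seq) - frame) // 3
        return codon_log_likelihood(seq, codon_probs, frame) / n_codons
    return max(range(3), key=average)
```

Averaging per codon keeps the comparison fair when the frames contain different numbers of complete codons. Input is assumed to be uppercase A/C/G/T and long enough to contain at least one codon in every frame.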
Affiliation(s)
- Wei-Mou Zheng
- Institute of Theoretical Physics, Academia Sinica, PO Box 2735, Beijing 100080, People's Republic of China.
|
20
|
Abstract
Revisiting the problem of intron-exon identification, we use a principal component analysis (PCA) to classify DNA sequences and present first results that validate our approach. Sequences are translated into document vectors that represent their word content; a principal component analysis then defines Gaussian-distributed sequence classes. The classification uses word content and variation of word usage to distinguish sequences. We test our approach with several data sets of genomic DNA and are able to classify introns and exons with an accuracy of up to 96%. We compare the method with the best traditional coding measure, the non-overlapping hexamer frequency count, and find that the PCA method produces better results. We also investigate the degree of cross-validation between different data sets of introns and exons and find evidence that the quality of a data set can be detected.
Affiliation(s)
- H-M Müller
- Division of Biology and W. K. Kellogg Radiation Laboratory, California Institute of Technology, 1201 East California Boulevard, Pasadena, CA 91125, USA.
|
21
|
Mathé C, Sagot MF, Schiex T, Rouzé P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 2002; 30:4103-17. [PMID: 12364589 PMCID: PMC140543 DOI: 10.1093/nar/gkf543] [Citation(s) in RCA: 209] [Impact Index Per Article: 9.1] [Received: 05/28/2002] [Revised: 08/07/2002] [Accepted: 08/07/2002] [Indexed: 11/14/2022] Open
Abstract
While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed that try to address one part of this problem, which consists of locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described and the resulting software classified according to both the method and the type of evidence used. Finally, the several difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.
Affiliation(s)
- Catherine Mathé
- Institut de Pharmacologie et Biologie Structurale, UMR 5089, 205 route de Narbonne, F-31077 Toulouse Cedex, France.
|
22
|
Abstract
Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large scale sequencing projects. Previous strategies for solving this problem can be categorized into essentially two schools of thought: one school employs sequence composition statistics, whereas the other relies on database similarity searches. In this paper, we propose a new gene identification scheme that combines the best characteristics from each of these two schools. In particular, our method determines gene candidates among the ORFs that can be identified in a given DNA strand through the use of the Bio-Dictionary, a database of patterns that covers essentially all of the currently available sample of the natural protein sequence space. Our approach relies entirely on the use of redundant patterns as the agents on which the presence or absence of genes is predicated and does not employ any additional evidence, e.g. ribosome-binding site signals. The Bio-Dictionary Gene Finder (BDGF), the algorithm's implementation, is a single computational engine able to handle the gene identification task across distinct archaeal and bacterial genomes. The engine exhibits performance that is characterized by simultaneous very high values of sensitivity and specificity, and a high percentage of correctly predicted start sites. Using a collection of patterns derived from an old (June 2000) release of the Swiss-Prot/TrEMBL database that contained 451 602 proteins and fragments, we demonstrate our method's generality and capabilities through an extensive analysis of 17 complete archaeal and bacterial genomes. Examples of previously unreported genes are also shown and discussed in detail.
Affiliation(s)
- Tetsuo Shibuya
- Exploratory Technology, IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan
|
23
|
Aggarwal G, Ramaswamy R. Ab initio gene identification: prokaryote genome annotation with GeneScan and GLIMMER. J Biosci 2002; 27:7-14. [PMID: 11927773 DOI: 10.1007/bf02703679] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.2] [Indexed: 11/25/2022]
Abstract
We compare the annotation of three complete genomes using the ab initio methods of gene identification GeneScan and GLIMMER. The annotation given in GenBank, the standard against which these are compared, has been made using GeneMark. We find a number of novel genes which are predicted by both methods used here, as well as a number of genes that are predicted by GeneMark, but are not identified by either of the nonconsensus methods that we have used. The three organisms studied here are all prokaryotic species with fairly compact genomes. The Fourier measure forms the basis for an efficient non-consensus method for gene prediction, and the algorithm GeneScan exploits this measure. We have bench-marked this program as well as GLIMMER using 3 complete prokaryotic genomes. An effort has also been made to study the limitations of these techniques for complete genome analysis. GeneScan and GLIMMER are of comparable accuracy insofar as gene-identification is concerned, with sensitivities and specificities typically greater than 0.9. The number of false predictions (both positive and negative) is higher for GeneScan as compared to GLIMMER, but in a significant number of cases, similar results are provided by the two techniques. This suggests that there could be some as-yet unidentified additional genes in these three genomes, and also that some of the putative identifications made hitherto might require re-evaluation. All these cases are discussed in detail.
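The abstract names the Fourier measure without defining it. As a generic illustration only (not GeneScan's implementation), the period-3 signal typical of coding DNA can be quantified by summing, over the four bases, the squared magnitude of the Fourier component at frequency 1/3 of each base's indicator sequence. The function name and the division by the squared length are assumptions of this sketch.

```python
import cmath

def period3_power(seq):
    """Normalized spectral power at frequency 1/3, summed over the four bases."""
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        # Fourier component at f = 1/3 of the 0/1 indicator sequence for this base.
        component = sum(cmath.exp(-2j * cmath.pi * k / 3)
                        for k, x in enumerate(seq) if x == base)
        total += abs(component) ** 2
    return total / n ** 2
```

A perfectly periodic sequence such as `"ACG" * 30` scores 1/3 under this normalization, while a homopolymer whose length is a multiple of three scores essentially zero. Real coding regions fall in between, and any decision threshold would have to be calibrated on annotated data.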
Affiliation(s)
- Gautam Aggarwal
- School of Physical Sciences, Jawaharlal Nehru University, New Delhi 110 067, India
|
24
|
Lin TH, Wang GM, Wang YT. Prediction of beta-turns in proteins using the first-order Markov models. J Chem Inf Comput Sci 2002; 42:123-33. [PMID: 11855976 DOI: 10.1021/ci0103020] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/30/2022]
Abstract
We present a method based on the first-order Markov models for predicting simple beta-turns and loops containing multiple turns in proteins. Sequences of 338 proteins in a database are divided using the published turn criteria into the following three regions, namely, the turn, the boundary, and the nonturn ones. A transition probability matrix is constructed for either the turn or the nonturn region using the weighted transition probabilities computed for dipeptides identified from each region. There are two such matrices constructed for the boundary region since the transition probabilities for dipeptides immediately preceding or following a turn are different. The window used for scanning a protein sequence from amino (N-) to carboxyl (C-) terminal is a hexapeptide since the transition probability computed for a turn tetrapeptide is capped at both the N- and C- termini with a boundary transition probability indexed respectively from the two boundary transition matrices. A sum of the averaged product of the transition probabilities of all the hexapeptides involving each residue is computed. This is then weighted with a probability computed from assuming that all the hexapeptides are from the nonturn region to give the final prediction quantity. Both simple beta-turns and loops containing multiple turns in a protein are then identified by the rising of the prediction quantity computed. The performance of the prediction scheme or the percentage (%) of correct prediction is evaluated through computation of Matthews correlation coefficients for each protein predicted. It is found that the prediction method is capable of giving prediction results with better correlation between the percent of correct prediction and the Matthews correlation coefficients for a group of test proteins as compared with those predicted using some secondary structural prediction methods. 
The prediction accuracy for about 40% of proteins in the database or 50% of proteins in the test set is better than 70%. Such a percentage for the test set is reduced to 30 if the structures of all the proteins in the set are treated as unknown.
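The core ingredient described above, a first-order transition probability matrix estimated from dipeptides in a given region, can be sketched as follows. This is a simplified illustration: the function names, the add-one smoothing, and the plain log-odds window score are assumptions. It omits the paper's separate boundary matrices, weighting, and averaging over hexapeptides.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def transition_matrix(segments, pseudocount=1.0):
    """Row-normalized dipeptide transition probabilities P(next | current)."""
    counts = {a: {b: pseudocount for b in AMINO_ACIDS} for a in AMINO_ACIDS}
    for segment in segments:
        for current, following in zip(segment, segment[1:]):
            counts[current][following] += 1
    return {a: {b: n / sum(row.values()) for b, n in row.items()}
            for a, row in counts.items()}

def window_log_odds(window, turn_matrix, nonturn_matrix):
    """Log-odds that a window was generated by the turn model rather than the nonturn model."""
    return sum(math.log(turn_matrix[a][b] / nonturn_matrix[a][b])
               for a, b in zip(window, window[1:]))
```

A positive score means the window's dipeptide transitions are more typical of the turn training segments than of the nonturn ones. Sliding the window along a sequence and looking for rises in the score mirrors, loosely, how the prediction quantity is used in the abstract.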
Affiliation(s)
- Thy-Hou Lin
- Department of Life Science, National Tsing Hua University, Hsinchu, Taiwan 30043, ROC.
|
25
|
Abstract
In the post-genomic era, the new discipline of functional genomics is now facing the challenge of associating a function (as well as estimating its relevance to industrial applications) to about 100,000 microbial, plant or animal genes of known sequence but unknown function. Besides the design of databases, computational methods are increasingly becoming intimately linked with the various experimental approaches. Consequently, bioinformatics is rapidly evolving into independent fields addressing the specific problems of interpreting i) genomic sequences, ii) protein sequences and 3D-structures, as well as iii) transcriptome and macromolecular interaction data. It is thus increasingly difficult for the biologist to choose the computational approaches that perform best in these various areas. This paper attempts to review the most useful developments of the last 2 years.
Affiliation(s)
- J M Claverie
- Structural and Genetic Information Laboratory, UMR 1889 CNRS-AVENTIS, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France.
|
26
|
Ogata H, Audic S, Renesto-Audiffren P, Fournier PE, Barbe V, Samson D, Roux V, Cossart P, Weissenbach J, Claverie JM, Raoult D. Mechanisms of evolution in Rickettsia conorii and R. prowazekii. Science 2001; 293:2093-8. [PMID: 11557893 DOI: 10.1126/science.1061471] [Citation(s) in RCA: 303] [Impact Index Per Article: 12.6] [Indexed: 11/02/2022]
Abstract
Rickettsia conorii is an obligate intracellular bacterium that causes Mediterranean spotted fever in humans. We determined the 1,268,755-nucleotide complete genome sequence of R. conorii, containing 1374 open reading frames. This genome exhibits 804 of the 834 genes of the previously determined R. prowazekii genome plus 552 supplementary open reading frames and a 10-fold increase in the number of repetitive elements. Despite these differences, the two genomes exhibit a nearly perfect colinearity that allowed the clear identification of different stages of gene alterations with gene remnants and 37 genes split in 105 fragments, of which 59 are transcribed. A 38-kilobase sequence inversion was dated shortly after the divergence of the genus.
MESH Headings
- Adaptation, Physiological
- Chlamydia/genetics
- Computational Biology
- DNA, Bacterial/genetics
- DNA, Intergenic
- Evolution, Molecular
- Gene Dosage
- Gene Silencing
- Gene Transfer, Horizontal
- Genes, Bacterial
- Genome, Bacterial
- Open Reading Frames
- Phylogeny
- Polymerase Chain Reaction
- Repetitive Sequences, Nucleic Acid
- Rickettsia/genetics
- Rickettsia conorii/genetics
- Rickettsia conorii/physiology
- Rickettsia prowazekii/genetics
- Rickettsia prowazekii/physiology
- Sequence Analysis, DNA
- Transcription, Genetic
Affiliation(s)
- H Ogata
- Information Génétique & Structurale, CNRS-AVENTIS UMR 1889, 31 chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
|
27
|
Besemer J, Lomsadze A, Borodovsky M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 2001; 29:2607-18. [PMID: 11410670 PMCID: PMC55746 DOI: 10.1093/nar/29.12.2607] [Citation(s) in RCA: 1779] [Impact Index Per Article: 74.1] [Received: 01/23/2001] [Revised: 05/03/2001] [Accepted: 05/03/2001] [Indexed: 02/06/2023] Open
Abstract
Improving the accuracy of prediction of gene starts is one of a few remaining open problems in computer prediction of prokaryotic genes. Its difficulty is caused by the absence of relatively strong sequence patterns identifying true translation initiation sites. In the current paper we show that the accuracy of gene start prediction can be improved by combining models of protein-coding and non-coding regions and models of regulatory sites near gene start within an iterative Hidden Markov model based algorithm. The new gene prediction method, called GeneMarkS, utilizes a non-supervised training procedure and can be used for a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. The GeneMarkS implementation uses an improved version of the gene finding program GeneMark.hmm, heuristic Markov models of coding and non-coding regions and the Gibbs sampling multiple alignment program. GeneMarkS predicted precisely 83.2% of the translation starts of GenBank annotated Bacillus subtilis genes and 94.4% of translation starts in an experimentally validated set of Escherichia coli genes. We have also observed that GeneMarkS detects prokaryotic genes, in terms of identifying open reading frames containing real genes, with an accuracy matching the level of the best currently used gene detection methods. Accurate translation start prediction, in addition to the refinement of protein sequence N-terminal data, provides the benefit of precise positioning of the sequence region situated upstream to a gene start. Therefore, sequence motifs related to transcription and translation regulatory sites can be revealed and analyzed with higher precision. These motifs were shown to possess a significant variability, the functional and evolutionary connections of which are discussed.
Affiliation(s)
- J Besemer
- School of Biology and School of Mathematics, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA
|
28
|
Abstract
While most organisms grow at temperatures ranging between 20 and 50 °C, many archaea and a few bacteria, such as Pyrococcus or Aquifex, have been found capable of withstanding temperatures close to 100 °C or beyond. Here we report the results of two independent large-scale unbiased approaches to identify global protein properties correlating with an extreme thermophile lifestyle. First, we performed a comparative proteome analysis using 30 complete genome sequences from the three kingdoms. A large difference between the proportions of charged versus polar (noncharged) amino acids was found to be a signature of all hyperthermophilic organisms. Second, we analyzed the water accessible surfaces of 189 protein structures belonging to mesophiles or hyperthermophiles. We found that the surfaces of hyperthermophilic proteins exhibited the shift already observed at the genomic level, i.e. a proportion of solvent accessible charged residues strongly increased at the expense of polar residues. The biophysical requirement for the presence of charged residues at the protein surface, allowing protein stabilization through ion bonds, is therefore clearly imprinted and detectable in all genome sequences available to date.
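The proteome-level signature described here, the difference between the proportions of charged and polar (noncharged) residues, is straightforward to compute. In this sketch the residue groupings (D, E, K, R as charged; N, Q, S, T as polar) and the function name are assumptions, since the abstract does not list which amino acids fall in each class.

```python
CHARGED = set("DEKR")  # assumed charged class
POLAR = set("NQST")    # assumed polar, noncharged class

def charged_minus_polar(protein):
    """Percentage of charged residues minus percentage of polar residues."""
    charged = sum(residue in CHARGED for residue in protein)
    polar = sum(residue in POLAR for residue in protein)
    return 100.0 * (charged - polar) / len(protein)
```

To compare organisms, the same difference would be computed over the concatenated protein sequences of each proteome. According to the abstract, hyperthermophiles show a markedly larger value than mesophiles.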
Affiliation(s)
- C Cambillau
- Architecture et Fonction des Macromolécules Biologiques, CNRS UPR9039, France.
|
29
|
Ogata H, Audic S, Barbe V, Artiguenave F, Fournier PE, Raoult D, Claverie JM. Selfish DNA in protein-coding genes of Rickettsia. Science 2000; 290:347-50. [PMID: 11030655 DOI: 10.1126/science.290.5490.347] [Citation(s) in RCA: 99] [Impact Index Per Article: 4.0] [Indexed: 11/02/2022]
Abstract
Rickettsia conorii, the aetiological agent of Mediterranean spotted fever, is an intracellular bacterium transmitted by ticks. Preliminary analyses of the nearly complete genome sequence of R. conorii have revealed 44 occurrences of a previously undescribed palindromic repeat (150 base pairs long) throughout the genome. Unexpectedly, this repeat was found inserted in-frame within 19 different R. conorii open reading frames likely to encode functional proteins. We found the same repeat in proteins of other Rickettsia species. The finding of a mobile element inserted in many unrelated genes suggests the potential role of selfish DNA in the creation of new protein sequences.
MESH Headings
- Bacterial Proteins/chemistry
- Bacterial Proteins/genetics
- Base Sequence
- Conserved Sequence
- DNA, Bacterial/genetics
- Evolution, Molecular
- Genome, Bacterial
- Interspersed Repetitive Sequences
- Molecular Sequence Data
- Mutagenesis, Insertional
- Nucleic Acid Conformation
- Open Reading Frames/genetics
- Protein Biosynthesis
- Protein Conformation
- Protein Structure, Secondary
- RNA, Bacterial/chemistry
- RNA, Bacterial/genetics
- RNA, Bacterial/metabolism
- RNA, Messenger/chemistry
- RNA, Messenger/genetics
- RNA, Messenger/metabolism
- Rickettsia/genetics
- Rickettsia conorii/genetics
Affiliation(s)
- H Ogata
- Information Génétique & Structurale, CNRS-AVENTIS UMR 1889, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
|
30
|
Hirosawa M, Ishikawa K, Nagase T, Ohara O. Detection of spurious interruptions of protein-coding regions in cloned cDNA sequences by GeneMark analysis. Genome Res 2000; 10:1333-41. [PMID: 10984451 DOI: 10.1101/gr.129500] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Indexed: 11/24/2022]
Abstract
cDNA is an artificial copy of mRNA and, therefore, no cDNA can be completely free from suspicion of cloning errors. Because overlooking these cloning errors results in serious misinterpretation of cDNA sequences, development of an alerting system targeting spurious sequences in cloned cDNAs is an urgent requirement for massive cDNA sequence analysis. We describe here the application of a modified GeneMark program, originally designed for prokaryotic gene finding, for detection of artifacts in cDNA clones. This program serves to provide a warning when any spurious split of protein-coding regions is detected through statistical analysis of cDNA sequences based on Markov models. In this study, 817 cDNA sequences deposited in public databases by us were subjected to analysis using this alerting system to assess its sensitivity and specificity. The results indicated that any spurious split of protein-coding regions in cloned cDNAs could be sensitively detected and systematically revised by means of this system after the experimental validation of the alerts. Furthermore, this study offered us, for the first time, statistical data regarding the rates and types of errors causing protein-coding splits in cloned cDNAs obtained by conventional cloning methods.
Affiliation(s)
- M Hirosawa
- Kazusa DNA Research Institute, Kisarazu, Chiba 292-0812, Japan
|
31
|
Alimi JP, Poirot O, Lopez F, Claverie JM. Reverse transcriptase-polymerase chain reaction validation of 25 "orphan" genes from Escherichia coli K-12 MG1655. Genome Res 2000; 10:959-66. [PMID: 10899145 PMCID: PMC310931 DOI: 10.1101/gr.10.7.959] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.8] [Indexed: 11/24/2022]
Abstract
Despite the accumulation of sequence information sampling from a broad spectrum of phyla, newly sequenced genomes continue to reveal a high proportion (50%-30%) of "uncharacterized" genes, including a significant number of strictly "orphan" genes, i.e., putative open reading frames (ORFs) without any resemblance to previously determined protein-coding sequences. Most genes found in databases have only been predicted by computer methods and have never been experimentally validated. Although theoretical evolutionary arguments support the reality of genes when homologs are found in a variety of distant species, this is not the case for orphan genes. Here, we report the direct reverse transcriptase-polymerase chain reaction assay of 25 strictly orphan ORFs of Escherichia coli. Two growth conditions, exponential and stationary phases, were tested. Transcripts were identified for a total of 19 orphan genes, with 2 genes found to be expressed in only one of the two growth conditions. Our results suggest that a vast majority of E. coli ORFs presently annotated as "hypothetical" correspond to bona fide genes. By extension, this implies that randomly occurring "junk" ORFs have been actively counter selected during the evolution of the dense E. coli genome.
Affiliation(s)
- J P Alimi
- Structural and Genetic Information Laboratory, Marseille, France
|
32
|
Abstract
Complete genomic sequences of microbial pathogens and hosts offer sophisticated new strategies for studying host-pathogen interactions. DNA microarrays exploit primary sequence data to measure transcript levels and detect sequence polymorphisms, for every gene, simultaneously. The design and construction of a DNA microarray for any given microbial genome are straightforward. By monitoring microbial gene expression, one can predict the functions of uncharacterized genes, probe the physiologic adaptations made under various environmental conditions, identify virulence-associated genes, and test the effects of drugs. Similarly, by using host gene microarrays, one can explore host response at the level of gene expression and provide a molecular description of the events that follow infection. Host profiling might also identify gene expression signatures unique for each pathogen, thus providing a novel tool for diagnosis, prognosis, and clinical management of infectious disease.
|
33
|
Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan Y. Predicting function: from genes to genomes and back. J Mol Biol 1998; 283:707-25. [PMID: 9790834 DOI: 10.1006/jmbi.1998.2144] [Citation(s) in RCA: 262] [Impact Index Per Article: 9.7] [Indexed: 11/22/2022]
Abstract
Predicting function from sequence using computational tools is a highly complicated procedure that is generally done for each gene individually. This review focuses on the added value that is provided by completely sequenced genomes in function prediction. Various levels of sequence annotation and function prediction are discussed, ranging from genomic sequence to that of complex cellular processes. Protein function is currently best described in the context of molecular interactions. In the near future it will be possible to predict protein function in the context of higher order processes such as the regulation of gene expression, metabolic pathways and signalling cascades. The analysis of such higher levels of function description uses, besides the information from completely sequenced genomes, also the additional information from proteomics and expression data. The final goal will be to elucidate the mapping between genotype and phenotype.
Affiliation(s)
- P Bork
- European Molecular Biology Laboratory, Meyerhofstr. 1, Heidelberg, PF 10.2209, Germany.
|