51
Niu XN, Wei ZQ, Zou HF, Xie GG, Wu F, Li KJ, Jiang W, Tang JL, He YQ. Complete sequence and detailed analysis of the first indigenous plasmid from Xanthomonas oryzae pv. oryzicola. BMC Microbiol 2015; 15:233. [PMID: 26498126 PMCID: PMC4619425 DOI: 10.1186/s12866-015-0562-x] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 06/19/2015] [Accepted: 10/08/2015] [Indexed: 01/24/2023]
Abstract
BACKGROUND Bacterial plasmids have a major impact on metabolic function and adaptation of their hosts. An indigenous plasmid was identified in a Chinese isolate (GX01) of the invasive phytopathogen Xanthomonas oryzae pv. oryzicola (Xoc), the causal agent of rice bacterial leaf streak (BLS). To elucidate the biological functions of the plasmid, we have sequenced and comprehensively annotated the plasmid. METHODS The plasmid DNA was extracted from Xoc strain GX01 by alkaline lysis and digested with restriction enzymes. The cloned and subcloned DNA fragments in pUC19 were sequenced by Sanger sequencing. Sequences were assembled by using Sequencher software. Gaps were closed by primer walking and sequencing, and multi-PCRs were conducted through the whole plasmid sequence for verification. BLAST, phylogenetic analysis and dinucleotide calculation were performed for gene annotation and DNA structure analysis. Transformation, transconjugation and stress tolerance tests were carried out for plasmid function assays. RESULTS The indigenous plasmid from Xoc strain GX01, designated pXOCgx01, is 53,206-bp long and has been annotated to possess 64 open reading frames (ORFs), including genes encoding type IV secretion system, heavy metal exporter, plasmid stability factors, and DNA mobile factors, i.e., the Tn3-like transposon. Bioinformatics analysis showed that pXOCgx01 has a mosaic structure containing different genome contexts with distinct genomic heterogeneities. Phylogenetic analysis indicated that the closest relative of pXOCgx01 is pXAC64 from Xanthomonas axonopodis pv. citri str. 306. It was estimated that there are four copies of pXOCgx01 per cell of Xoc GX01 by PCR assay and the calculation of whole genome shotgun sequencing data. We demonstrate that pXOCgx01 is a self-transmissible plasmid and can replicate in some Xanthomonas spp. strains, but not in Escherichia coli DH5α. It could significantly enhance the tolerance of Xanthomonas oryzae pv. oryzae PXO99A to the stresses of heavy metal ions. The plasmid survey indicated that nine out of 257 Xoc Chinese isolates contain plasmids. CONCLUSIONS pXOCgx01 is the first report of indigenous plasmid from Xanthomonas oryzae pv. oryzicola, and the first completely sequenced plasmid from Xanthomonas oryzae species. It is a self-transmissible plasmid and has a mosaic structure, containing genes for macromolecule secretion, heavy metal exportation, and DNA mobile factors, especially the Tn3-like transposon which may provide transposition function for mobile insertion cassette and play a major role in the spread of pathogenicity determinants. The results will be helpful to elucidate the biological significance of this cryptic plasmid and the adaptive evolution of Xoc.
Affiliation(s)
- Xiang-Na Niu
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, The Key Laboratory of Ministry of Education for Microbial and Plant Genetic Engineering, and College of Life Science and Technology, Guangxi University, 100 Daxue Road, Nanning, 530004, China.
- Zhi-Qiong Wei
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, The Key Laboratory of Ministry of Education for Microbial and Plant Genetic Engineering, and College of Life Science and Technology, Guangxi University, 100 Daxue Road, Nanning, 530004, China.
- Hai-Fan Zou
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, The Key Laboratory of Ministry of Education for Microbial and Plant Genetic Engineering, and College of Life Science and Technology, Guangxi University, 100 Daxue Road, Nanning, 530004, China.
- Gui-Gang Xie
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, The Key Laboratory of Ministry of Education for Microbial and Plant Genetic Engineering, and College of Life Science and Technology, Guangxi University, 100 Daxue Road, Nanning, 530004, China.
- Feng Wu
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, The Key Laboratory of Ministry of Education for Microbial and Plant Genetic Engineering, and College of Life Science and Technology, Guangxi University, 100 Daxue Road, Nanning, 530004, China.
- Kang-Jia Li
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, The Key Laboratory of Ministry of Education for Microbial and Plant Genetic Engineering, and College of Life Science and Technology, Guangxi University, 100 Daxue Road, Nanning, 530004, China.
- Wei Jiang
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, The Key Laboratory of Ministry of Education for Microbial and Plant Genetic Engineering, and College of Life Science and Technology, Guangxi University, 100 Daxue Road, Nanning, 530004, China.
- Ji-Liang Tang
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, The Key Laboratory of Ministry of Education for Microbial and Plant Genetic Engineering, and College of Life Science and Technology, Guangxi University, 100 Daxue Road, Nanning, 530004, China.
- Yong-Qiang He
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, The Key Laboratory of Ministry of Education for Microbial and Plant Genetic Engineering, and College of Life Science and Technology, Guangxi University, 100 Daxue Road, Nanning, 530004, China.
52
Spring-Pearson SM, Stone JK, Doyle A, Allender CJ, Okinaka RT, Mayo M, Broomall SM, Hill JM, Karavis MA, Hubbard KS, Insalaco JM, McNew LA, Rosenzweig CN, Gibbons HS, Currie BJ, Wagner DM, Keim P, Tuanyok A. Pangenome Analysis of Burkholderia pseudomallei: Genome Evolution Preserves Gene Order despite High Recombination Rates. PLoS One 2015; 10:e0140274. [PMID: 26484663 PMCID: PMC4613141 DOI: 10.1371/journal.pone.0140274] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 04/08/2015] [Accepted: 09/23/2015] [Indexed: 11/19/2022]
Abstract
The pangenomic diversity in Burkholderia pseudomallei is high, with approximately 5.8% of the genome consisting of genomic islands. Genomic islands are known hotspots for recombination driven primarily by site-specific recombination associated with tRNAs. However, recombination rates in other portions of the genome are also high, a feature we expected to disrupt gene order. We analyzed the pangenome of 37 isolates of B. pseudomallei and demonstrate that the pangenome is ‘open’, with approximately 136 new genes identified with each new genome sequenced, and that the global core genome consists of 4568±16 homologs. Genes associated with metabolism were statistically overrepresented in the core genome, and genes associated with mobile elements, disease, and motility were primarily associated with accessory portions of the pangenome. The frequency distribution of genes present in between 1 and 37 of the genomes analyzed matches well with a model of genome evolution in which 96% of the genome has very low recombination rates but 4% of the genome recombines readily. Using homologous genes among pairs of genomes, we found that gene order was highly conserved among strains, despite the high recombination rates previously observed. High rates of gene transfer and recombination are incompatible with retaining gene order unless these processes are either highly localized to specific sites within the genome, or are characterized by symmetrical gene gain and loss. Our results demonstrate that both processes occur: localized recombination introduces many new genes at relatively few sites, and recombination throughout the genome generates the novel multi-locus sequence types previously observed while preserving gene order.
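The core/accessory split and the gene frequency distribution mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes genes have already been grouped into homolog families; the function name `partition_pangenome` and the toy genomes are hypothetical and are not taken from the paper, whose actual pipeline is not described here.

```python
from collections import Counter


def partition_pangenome(genomes):
    """Split gene families into core and accessory sets.

    genomes: dict mapping a genome name to the set of gene-family IDs it carries
    (hypothetical input format, assumed for illustration).
    Returns (core, accessory, spectrum), where spectrum[k] is the number of
    families present in exactly k genomes.
    """
    n_genomes = len(genomes)
    # Count in how many genomes each family occurs.
    presence = Counter(
        family for families in genomes.values() for family in set(families)
    )
    core = {family for family, k in presence.items() if k == n_genomes}
    accessory = set(presence) - core
    spectrum = Counter(presence.values())
    return core, accessory, spectrum


# Toy data, invented for illustration only.
toy = {
    "A": {"g1", "g2", "g3"},
    "B": {"g1", "g2", "g4"},
    "C": {"g1", "g5"},
}
core, accessory, spectrum = partition_pangenome(toy)
```

With real data, a spectrum dominated by families found in a single genome or in all genomes is the kind of U-shaped distribution that pangenome studies typically compare against evolutionary models.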
Affiliation(s)
- Senanu M. Spring-Pearson
- Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ 86011, United States of America
- Joshua K. Stone
- Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ 86011, United States of America
- Adina Doyle
- Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ 86011, United States of America
- Christopher J. Allender
- Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ 86011, United States of America
- Richard T. Okinaka
- Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ 86011, United States of America
- Mark Mayo
- Menzies School of Health Research and Infectious Disease Department, Royal Darwin Hospital. Darwin, Northern Territory, Australia
- Stacey M. Broomall
- BioSciences Division, Edgewood Chemical Biological Center, Aberdeen Proving Ground, MD, United States of America
- Jessica M. Hill
- BioSciences Division, Edgewood Chemical Biological Center, Aberdeen Proving Ground, MD, United States of America
- Mark A. Karavis
- BioSciences Division, Edgewood Chemical Biological Center, Aberdeen Proving Ground, MD, United States of America
- Kyle S. Hubbard
- BioSciences Division, Edgewood Chemical Biological Center, Aberdeen Proving Ground, MD, United States of America
- Joseph M. Insalaco
- BioSciences Division, Edgewood Chemical Biological Center, Aberdeen Proving Ground, MD, United States of America
- Lauren A. McNew
- BioSciences Division, Edgewood Chemical Biological Center, Aberdeen Proving Ground, MD, United States of America
- C. Nicole Rosenzweig
- BioSciences Division, Edgewood Chemical Biological Center, Aberdeen Proving Ground, MD, United States of America
- Henry S. Gibbons
- BioSciences Division, Edgewood Chemical Biological Center, Aberdeen Proving Ground, MD, United States of America
- Bart J. Currie
- Menzies School of Health Research and Infectious Disease Department, Royal Darwin Hospital. Darwin, Northern Territory, Australia
- David M. Wagner
- Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ 86011, United States of America
- Paul Keim
- Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ 86011, United States of America
- Apichai Tuanyok
- Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ 86011, United States of America
- Department of Infectious Diseases and Pathology, University of Florida, Gainesville, FL, United States of America
53
Abstract
Formation of heat-resistant endospores is a specific property of the members of the phylum Firmicutes (low-G+C Gram-positive bacteria). It is found in representatives of four different classes of Firmicutes, Bacilli, Clostridia, Erysipelotrichia, and Negativicutes, which all encode similar sets of core sporulation proteins. Each of these classes also includes non-spore-forming organisms that sometimes belong to the same genus or even species as their spore-forming relatives. This chapter reviews the diversity of the members of phylum Firmicutes, its current taxonomy, and the status of genome-sequencing projects for various subgroups within the phylum. It also discusses the evolution of the Firmicutes from their apparently spore-forming common ancestor and the independent loss of sporulation genes in several different lineages (staphylococci, streptococci, listeria, lactobacilli, ruminococci) in the course of their adaptation to the saprophytic lifestyle in a nutrient-rich environment. It argues that the systematics of Firmicutes is a rapidly developing area of research that benefits from the evolutionary approaches to the ever-increasing amount of genomic and phenotypic data and allows arranging these data into a common framework.
54
Natural selection causes adaptive genetic resistance in wild emmer wheat against powdery mildew at "Evolution Canyon" microsite, Mt. Carmel, Israel. PLoS One 2015; 10:e0122344. [PMID: 25856164 PMCID: PMC4391946 DOI: 10.1371/journal.pone.0122344] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 12/22/2014] [Accepted: 02/13/2015] [Indexed: 12/05/2022]
Abstract
Background “Evolution Canyon” (ECI) at Lower Nahal Oren, Mount Carmel, Israel, is an optimal natural microscale model for unraveling evolution in action highlighting the basic evolutionary processes of adaptation and speciation. A major model organism in ECI is wild emmer, Triticum dicoccoides, the progenitor of cultivated wheat, which displays dramatic interslope adaptive and speciational divergence on the tropical-xeric “African” slope (AS) and the temperate-mesic “European” slope (ES), separated on average by 250 m. Methods We examined 278 simple sequence repeats (SSRs) and the phenotype diversity of the resistance to powdery mildew between the opposite slopes. Furthermore, 18 phenotypes on the AS and 20 phenotypes on the ES were inoculated with both Bgt E09 and a mixture of powdery mildew races. Results In the experiment of genetic diversity, very little polymorphism was identified intra-slope in the accessions from either the AS or the ES. By contrast, 148 pairs of SSR primers (53.23%) amplified polymorphic products between the phenotypes of AS and ES. There are some differences between the two wild emmer wheat genomes and the inter-slope SSR polymorphic products between genome A and B. Interestingly, all wild emmer types growing on the south-facing slope (SFS=AS) were susceptible to a composite of Blumeria graminis, while the ones growing on the north-facing slope (NFS=ES) were highly resistant to Blumeria graminis at both seedling and adult stages. Conclusion/Significance Remarkable inter-slope evolutionary divergent processes occur in wild emmer wheat, T. dicoccoides, at ECI, despite the short average distance of 250 meters. The AS, a dry and hot slope, did not develop resistance to powdery mildew, whereas the ES, a cool and humid slope, did develop resistance since the disease stress was strong there. This is a remarkable demonstration in host-pathogen interaction on how resistance develops when stress causes an adaptive result at a micro-scale distance.
55
Labonté JM, Swan BK, Poulos B, Luo H, Koren S, Hallam SJ, Sullivan MB, Woyke T, Wommack KE, Stepanauskas R. Single-cell genomics-based analysis of virus-host interactions in marine surface bacterioplankton. ISME JOURNAL 2015; 9:2386-99. [PMID: 25848873 PMCID: PMC4611503 DOI: 10.1038/ismej.2015.48] [Citation(s) in RCA: 152] [Impact Index Per Article: 15.2] [Received: 11/10/2014] [Revised: 01/27/2015] [Accepted: 02/26/2015] [Indexed: 02/01/2023]
Abstract
Viral infections dynamically alter the composition and metabolic potential of marine microbial communities and the evolutionary trajectories of host populations with resulting feedback on biogeochemical cycles. It is quite possible that all microbial populations in the ocean are impacted by viral infections. Our knowledge of virus–host relationships, however, has been limited to a minute fraction of cultivated host groups. Here, we utilized single-cell sequencing to obtain genomic blueprints of viruses inside or attached to individual bacterial and archaeal cells captured in their native environment, circumventing the need for host and virus cultivation. A combination of comparative genomics, metagenomic fragment recruitment, sequence anomalies and irregularities in sequence coverage depth and genome recovery was utilized to detect viruses and to decipher modes of virus–host interactions. Members of all three tailed phage families were identified in 20 out of 58 phylogenetically and geographically diverse single amplified genomes (SAGs) of marine bacteria and archaea. At least four phage–host interactions had the characteristics of late lytic infections, all of which were found in metabolically active cells. One virus had genetic potential for lysogeny. Our findings include the first known viruses of Thaumarchaeota, Marinimicrobia, Verrucomicrobia and Gammaproteobacteria clusters SAR86 and SAR92. Viruses were also found in SAGs of Alphaproteobacteria and Bacteroidetes. A high fragment recruitment of viral metagenomic reads confirmed that most of the SAG-associated viruses are abundant in the ocean. Our study demonstrates that single-cell genomics, in conjunction with sequence-based computational tools, enables in situ, cultivation-independent insights into host–virus interactions in complex microbial communities.
Affiliation(s)
- Brandon K Swan
- Bigelow Laboratory for Ocean Sciences, East Boothbay, ME, USA
- Bonnie Poulos
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA
- Haiwei Luo
- School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
- Sergey Koren
- National Biodefense Analysis and Countermeasures Center, Frederick, MD, USA
- Steven J Hallam
- Department of Microbiology and Immunology, University of British Columbia, Vancouver, British Columbia, Canada
- Matthew B Sullivan
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA
- Tanja Woyke
- DOE Joint Genome Institute, Walnut Creek, CA, USA
- K Eric Wommack
- Department of Plant and Soil Sciences, University of Delaware, Newark, DE, USA
56
Abstract
Dinucleotide usage is known to vary in the genomes of organisms. The dinucleotide usage profiles or genome signatures are similar for sequence samples taken from the same genome, but are different for taxonomically distant species. This concept of genome signatures has been used to study several organisms including viruses, to elucidate the signatures of evolutionary processes at the genome level. Genome signatures assume greater importance in the case of host-pathogen interactions, where molecular interactions between the two species take place continuously, and can influence their genomic composition. In this study, analyses of whole genome sequences of HIV-1 subtype B, a retrovirus that caused the global AIDS pandemic, have been carried out to analyse the variation in genome signatures of the virus from 1983 to 2007. We show statistically significant temporal variations in some dinucleotide patterns, highlighting the selective evolution of the dinucleotide profiles of HIV-1 subtype B, possibly a consequence of host-specific selection.
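One common way to express a dinucleotide genome signature is the odds ratio of the observed dinucleotide frequency to the product of its two mononucleotide frequencies. The Python sketch below assumes that single-strand formulation; the abstract does not state which exact measure the study used, and the function names `dinucleotide_signature` and `signature_distance` are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_signature(seq):
    """Return the odds ratio f(XY) / (f(X) * f(Y)) for all 16 dinucleotides.

    Single-strand version, assumed here for simplicity; positions holding
    ambiguous bases (for example N) are skipped.
    """
    seq = seq.upper()
    mono = Counter(base for base in seq if base in BASES)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if set(seq[i:i + 2]) <= set(BASES)
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    if n_di == 0:
        raise ValueError("sequence has no unambiguous dinucleotides")
    signature = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[x + y] / n_di
        signature[x + y] = observed / expected if expected else 0.0
    return signature


def signature_distance(sig_a, sig_b):
    """Mean absolute difference between two signatures over all 16 dinucleotides."""
    return sum(abs(sig_a[k] - sig_b[k]) for k in sig_a) / len(sig_a)


# Toy sequence, invented for illustration only.
sig = dinucleotide_signature("ACGT" * 25)
```

Values near 1 mean a dinucleotide occurs about as often as base composition alone predicts; comparing signatures from sequences sampled in different years would then reduce to calls such as `signature_distance(sig_1983, sig_2007)` on hypothetical inputs.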
57
Iwasaki Y, Abe T, Okada N, Wada K, Wada Y, Ikemura T. Evolutionary changes in vertebrate genome signatures with special focus on coelacanth. DNA Res 2014; 21:459-67. [PMID: 24800745 PMCID: PMC4195492 DOI: 10.1093/dnares/dsu012] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022]
Abstract
With a remarkable increase in genomic sequence data of a wide range of species, novel tools are needed for comprehensive analyses of the big sequence data. Self-organizing map (SOM) is a powerful tool for clustering high-dimensional data on one plane. For oligonucleotide compositions handled as high-dimensional data, we have previously modified the conventional SOM for genome informatics: BLSOM. In the present study, we constructed BLSOMs for oligonucleotide compositions in fragment sequences (e.g. 100 kb) from a wide range of vertebrates, including coelacanth, and found that the sequences were clustered primarily according to species without species information. As one of the nearest living relatives of tetrapod ancestors, coelacanth is believed to provide access to the phenotypic and genomic transitions leading to the emergence of tetrapods. The characteristic oligonucleotide composition found for coelacanth was connected with the lowest dinucleotide CG occurrence (i.e. the highest CG suppression) among fishes, which was rather equivalent to that of tetrapods. This evident CG suppression in coelacanth should reflect molecular evolutionary processes of epigenetic systems including DNA methylation during vertebrate evolution. Sequence of a de novo DNA methylase (Dnmt3a) of coelacanth was found to be more closely related to that of tetrapods than that of other fishes.
Affiliation(s)
- Yuki Iwasaki
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama, Shiga 526-0829, Japan
- Takashi Abe
- Department of Information Engineering, Faculty of Engineering, Institute of Science and Technology, Niigata University, Niigata-ken 950-2181, Japan
- Norihiro Okada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama, Shiga 526-0829, Japan Faculty of Bioscience and Biotechnology, Tokyo Institute of Technology, Yokohama, Kanagawa 226, Japan Department of Life Sciences, National Cheng Kung University, Tainan 701, Taiwan
- Kennosuke Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama, Shiga 526-0829, Japan
- Yoshiko Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama, Shiga 526-0829, Japan
- Toshimichi Ikemura
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama, Shiga 526-0829, Japan
58
Furuta Y, Namba-Fukuyo H, Shibata TF, Nishiyama T, Shigenobu S, Suzuki Y, Sugano S, Hasebe M, Kobayashi I. Methylome diversification through changes in DNA methyltransferase sequence specificity. PLoS Genet 2014; 10:e1004272. [PMID: 24722038 PMCID: PMC3983042 DOI: 10.1371/journal.pgen.1004272] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.2] [Received: 07/06/2013] [Accepted: 02/13/2014] [Indexed: 12/20/2022]
Abstract
Epigenetic modifications such as DNA methylation have large effects on gene expression and genome maintenance. Helicobacter pylori, a human gastric pathogen, has a large number of DNA methyltransferase genes, with different strains having unique repertoires. Previous genome comparisons suggested that these methyltransferases often change DNA sequence specificity through domain movement, that is, the movement between and within genes of coding sequences of target recognition domains. Using single-molecule real-time sequencing technology, which detects N6-methyladenines and N4-methylcytosines with single-base resolution, we studied methylated DNA sites throughout the H. pylori genome for several closely related strains. Overall, the methylome was highly variable among closely related strains. Hypermethylated regions were found, for example, in the rpoB gene for RNA polymerase. We identified DNA sequence motifs for methylation and then assigned each of them to a specific homology group of the target recognition domains in the specificity-determining genes for Type I and other restriction-modification systems. These results supported proposed mechanisms for sequence-specificity changes in DNA methyltransferases. Knocking out one of the Type I specificity genes led to transcriptome changes, which suggested its role in gene expression. These results are consistent with the concept of evolution driven by DNA methylation, in which changes in the methylome lead to changes in the transcriptome and potentially to changes in phenotype, providing targets for natural or artificial selection.
Affiliation(s)
- Yoshikazu Furuta
- Department of Medical Genome Sciences, Graduate School of Frontier Sciences, University of Tokyo, Minato-ku, Tokyo, Japan
- Institute of Medical Science, University of Tokyo, Minato-ku, Tokyo, Japan
- Hiroe Namba-Fukuyo
- Department of Medical Genome Sciences, Graduate School of Frontier Sciences, University of Tokyo, Minato-ku, Tokyo, Japan
- Tomoaki Nishiyama
- Advanced Science Research Center, Kanazawa University, Kanazawa, Japan
- Shuji Shigenobu
- National Institute for Basic Biology, Okazaki, Japan
- Department of Basic Biology, School of Life Science, Graduate University for Advanced Studies, Okazaki, Japan
- Yutaka Suzuki
- Department of Medical Genome Sciences, Graduate School of Frontier Sciences, University of Tokyo, Minato-ku, Tokyo, Japan
- Sumio Sugano
- Department of Medical Genome Sciences, Graduate School of Frontier Sciences, University of Tokyo, Minato-ku, Tokyo, Japan
- Mitsuyasu Hasebe
- National Institute for Basic Biology, Okazaki, Japan
- Department of Basic Biology, School of Life Science, Graduate University for Advanced Studies, Okazaki, Japan
- Ichizo Kobayashi
- Department of Medical Genome Sciences, Graduate School of Frontier Sciences, University of Tokyo, Minato-ku, Tokyo, Japan
- Institute of Medical Science, University of Tokyo, Minato-ku, Tokyo, Japan
59
A novel bioinformatics method for efficient knowledge discovery by BLSOM from big genomic sequence data. BIOMED RESEARCH INTERNATIONAL 2014; 2014:765648. [PMID: 24804244 PMCID: PMC3996302 DOI: 10.1155/2014/765648] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/24/2013] [Accepted: 02/14/2014] [Indexed: 11/17/2022]
Abstract
With the remarkable increase in genomic sequence data from a wide range of species, novel tools are needed for comprehensive analyses of the big sequence data. Self-Organizing Map (SOM) is an effective tool for clustering and visualizing high-dimensional data such as oligonucleotide composition on one map. By modifying the conventional SOM, we have previously developed Batch-Learning SOM (BLSOM), which allows classification of sequence fragments according to species, solely depending on the oligonucleotide composition. In the present study, we introduce the oligonucleotide BLSOM used for characterization of vertebrate genome sequences. We first analyzed pentanucleotide compositions in 100 kb sequences derived from a wide range of vertebrate genomes and then the compositions in the human and mouse genomes in order to investigate an efficient method for detecting differences between the closely related genomes. BLSOM can recognize the species-specific key combination of oligonucleotide frequencies in each genome, which is called a "genome signature," and the regions specifically enriched in transcription-factor-binding sequences. Because the classification and visualization power is very high, BLSOM is an efficient and powerful tool for extracting a wide range of information from massive amounts of genomic sequences (i.e., big sequence data).
60
Visualization of genome signatures of eukaryote genomes by batch-learning self-organizing map with a special emphasis on Drosophila genomes. BIOMED RESEARCH INTERNATIONAL 2014; 2014:985706. [PMID: 24741568 PMCID: PMC3967822 DOI: 10.1155/2014/985706] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 11/01/2013] [Accepted: 02/04/2014] [Indexed: 11/24/2022]
Abstract
A strategy of evolutionary studies that can compare vast numbers of genome sequences is becoming increasingly important with the remarkable progress of high-throughput DNA sequencing methods. We previously established a sequence alignment-free clustering method “BLSOM” for di-, tri-, and tetranucleotide compositions in genome sequences, which can characterize sequence characteristics (genome signatures) of a wide range of species. In the present study, we generated BLSOMs for tetra- and pentanucleotide compositions in approximately one million sequence fragments derived from 101 eukaryotes, for which almost complete genome sequences were available. BLSOM recognized phylotype-specific characteristics (e.g., key combinations of oligonucleotide frequencies) in the genome sequences, permitting phylotype-specific clustering of the sequences without any information regarding the species. In our detailed examination of 12 Drosophila species, the correlation between their phylogenetic classification and the classification on the BLSOMs was observed to visualize oligonucleotides diagnostic for species-specific clustering.
61
Emerging evidence for functional peptides encoded by short open reading frames. Nat Rev Genet 2014; 15:193-204. [PMID: 24514441 DOI: 10.1038/nrg3520] [Citation(s) in RCA: 402] [Impact Index Per Article: 36.5] [Indexed: 12/19/2022]
Abstract
Short open reading frames (sORFs) are a common feature of all genomes, but their coding potential has mostly been disregarded, partly because of the difficulty in determining whether these sequences are translated. Recent innovations in computing, proteomics and high-throughput analyses of translation start sites have begun to address this challenge and have identified hundreds of putative coding sORFs. The translation of some of these has been confirmed, although the contribution of their peptide products to cellular functions remains largely unknown. This Review examines this hitherto overlooked component of the proteome and considers potential roles for sORF-encoded peptides.
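To make concrete what a short open reading frame is at the sequence level, the sketch below scans the three forward frames of a DNA string for ATG-to-stop ORFs under a length cutoff. The 100-codon ceiling and 10-codon floor are illustrative assumptions rather than values from the Review, the function name `find_sorfs` is hypothetical, and a real analysis would also scan the reverse strand and weigh translation evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def find_sorfs(seq, min_codons=10, max_codons=100):
    """Return (start, end, n_codons) for short ORFs on the forward strand.

    start/end are 0-based, end-exclusive coordinates that include the stop
    codon; n_codons counts coding codons (ATG included, stop excluded).
    The length thresholds are assumptions chosen for illustration.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if min_codons <= n_codons <= max_codons:
                    hits.append((start, i + 3, n_codons))
                start = None  # look for the next ORF in this frame
    return hits


# Toy sequence, invented for illustration: one 12-codon ORF in frame 2.
toy_seq = "CC" + "ATG" + "GCT" * 11 + "TAA" + "GG"
sorfs = find_sorfs(toy_seq)
```

A scan like this only proposes candidates; as the abstract notes, whether such sequences are actually translated has to be established separately.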
62
Srivastava SK, Huang X, Brar HK, Fakhoury AM, Bluhm BH, Bhattacharyya MK. The genome sequence of the fungal pathogen Fusarium virguliforme that causes sudden death syndrome in soybean. PLoS One 2014; 9:e81832. [PMID: 24454689 PMCID: PMC3891557 DOI: 10.1371/journal.pone.0081832] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Received: 10/23/2012] [Accepted: 10/28/2013] [Indexed: 02/02/2023]
Abstract
UNLABELLED Fusarium virguliforme causes sudden death syndrome (SDS) of soybean, a disease of serious concern throughout most of the soybean producing regions of the world. Despite the global importance, little is known about the pathogenesis mechanisms of F. virguliforme. Thus, we applied Next-Generation DNA Sequencing to reveal the draft F. virguliforme genome sequence and identified putative pathogenicity genes to facilitate discovering the mechanisms used by the pathogen to cause this disease. METHODOLOGY/PRINCIPAL FINDINGS We have generated the draft genome sequence of F. virguliforme by conducting whole-genome shotgun sequencing on a 454 GS-FLX Titanium sequencer. Initially, single-end reads of a 400-bp shotgun library were assembled using the PCAP program. Paired end sequences from 3 and 20 Kb DNA fragments and approximately 100 Kb inserts of 1,400 BAC clones were used to generate the assembled genome. The assembled genome sequence was 51 Mb. The N50 scaffold number was 11 with an N50 Scaffold length of 1,263 Kb. The AUGUSTUS gene prediction program predicted 14,845 putative genes, which were annotated with Pfam and GO databases. Gene distributions were uniform in all but one of the major scaffolds. Phylogenic analyses revealed that F. virguliforme was closely related to the pea pathogen, Nectria haematococca. Of the 14,845 F. virguliforme genes, 11,043 were conserved among five Fusarium species: F. virguliforme, F. graminearum, F. verticillioides, F. oxysporum and N. haematococca; and 1,332 F. virguliforme-specific genes, which may include pathogenicity genes. Additionally, searches for candidate F. virguliforme pathogenicity genes using gene sequences of the pathogen-host interaction database identified 358 genes. CONCLUSIONS The F. virguliforme genome sequence and putative pathogenicity genes presented here will facilitate identification of pathogenicity mechanisms involved in SDS development. 
Together, these resources will expedite our efforts towards discovering pathogenicity mechanisms in F. virguliforme. This will ultimately lead to improvement of SDS resistance in soybean.
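The abstract reports an N50 scaffold number and an N50 scaffold length without spelling out how they are obtained. The sketch below shows the usual convention (sort scaffolds from longest to shortest and stop once half of the total assembly length is covered); the function name `n50_stats` and the toy lengths are illustrative assumptions, not data or code from the paper.

```python
def n50_stats(lengths):
    """Return (N50 length, N50 count) for a list of scaffold lengths.

    N50 length: length of the scaffold at which the running total,
    taken from longest to shortest, first reaches half the assembly size.
    N50 count: how many scaffolds were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("no scaffold lengths given")


# Toy lengths in Kb (hypothetical, not the F. virguliforme scaffolds).
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
n50_length, n50_count = n50_stats(toy_scaffolds)
print(n50_length, n50_count)
```

Under this convention, the reported values would mean that the 11 longest scaffolds together cover at least half of the 51 Mb assembly and the shortest of those 11 is 1,263 Kb long.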
Affiliation(s)
- Subodh K. Srivastava
- Department of Agronomy, Iowa State University, Ames, Iowa, United States of America
- Xiaoqiu Huang
- Department of Computer Science, Iowa State University, Ames, Iowa, United States of America
- Hargeet K. Brar
- Department of Agronomy, Iowa State University, Ames, Iowa, United States of America
- Ahmad M. Fakhoury
- Department of Plant, Soil Science, and Agricultural Systems, Southern Illinois University, Carbondale, Illinois, United States of America
- Burton H. Bluhm
- Department of Plant Pathology, University of Arkansas, Fayetteville, Arkansas, United States of America
|
63
|
Satapathy SS, Powdel BR, Dutta M, Buragohain AK, Ray SK. Constraint on di-nucleotides by codon usage bias in bacterial genomes. Gene 2013; 536:18-28. [PMID: 24333347 DOI: 10.1016/j.gene.2013.11.098] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/04/2013] [Revised: 11/18/2013] [Accepted: 11/25/2013] [Indexed: 10/25/2022]
Abstract
It has been reported earlier that the relative di-nucleotide frequency (RDF) in different parts of a genome is similar while the frequency is variable among different genomes. So RDF is termed as genome signature in bacteria. It is not known if the constancy in RDF is governed by genome wide mutational bias or by selection. Here we did comparative analysis of RDF between the inter-genic and the coding sequences in seventeen bacterial genomes, whose gene expression data was available. The constraint on di-nucleotides was found to be higher in the coding sequences than that in the inter-genic regions and the constraint at the 2nd codon position was more than that in the 3rd position within a genome. Further analysis revealed that the constraint on di-nucleotides at the 2nd codon position is greater in the high expression genes (HEG) than that in the whole genomes as well as in the low expression genes (LEG). We analyzed RDF at the 2nd and the 3rd codon positions in simulated coding sequences that were computationally generated by keeping the codon usage bias (CUB) according to genome G+C composition and the sequence of amino acids unaltered. In the simulated coding sequences, the constraint observed was significantly low and no significant difference was observed between the HEG and the LEG in terms of di-nucleotide constraint. This indicated that the greater constraint on di-nucleotides in the HEG was due to the stronger selection on CUB in these genes in comparison to the LEG within a genome. Further, we did comparative analyses of the RDF in the HEG rpoB and rpoC of 199 bacteria, which revealed a common pattern of constraints on di-nucleotides at the 2nd codon position across these bacteria. To validate the role of CUB on di-nucleotide constraint, we analyzed RDF at the 2nd and the 3rd codon positions in simulated rpoB/rpoC sequences. 
The analysis revealed that selection on CUB is an important determinant of the constraint on di-nucleotides at these positions in bacterial genomes. We believe that this study provides major findings on the role of CUB in di-nucleotide constraint in bacterial genomes.
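The abstract does not give the formula behind the relative di-nucleotide frequency (RDF). A minimal sketch, assuming the commonly used odds-ratio form (observed di-nucleotide frequency divided by the product of the two mononucleotide frequencies), is shown here; the function name and the restriction to an A/C/G/T-only sequence are assumptions made for illustration and may differ from the authors' exact procedure.

```python
from collections import Counter
from itertools import product


def relative_dinucleotide_frequency(seq):
    """Odds-ratio style RDF: f(xy) / (f(x) * f(y)) for all 16 di-nucleotides.

    Assumes seq contains only A, C, G and T (no ambiguity codes).
    """
    seq = seq.upper()
    n = len(seq)
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    rdf = {}
    for x, y in product("ACGT", repeat=2):
        expected = (mono[x] / n) * (mono[y] / n)
        observed = di[x + y] / (n - 1)
        # A value near 1 means the pair occurs about as often as base
        # composition alone predicts; NaN marks a base absent from seq.
        rdf[x + y] = observed / expected if expected > 0 else float("nan")
    return rdf


rdf = relative_dinucleotide_frequency("ACGT" * 25)
print(round(rdf["AC"], 3), rdf["AA"])
```

Values well below or above 1 indicate under- or over-represented di-nucleotides; comparing such profiles between coding and inter-genic regions is the kind of contrast the abstract describes.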
Affiliation(s)
- Bhes Raj Powdel
- Department of Statistics, Darrang College, Tezpur, Assam 784001, India
- Malay Dutta
- Department of Computer Science and Engineering, Tezpur University, Tezpur, Assam 784 028, India
- Alak Kumar Buragohain
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur, Assam 784 028, India; Dibrugarh University, Dibrugarh, Assam 786004, India
- Suvendra Kumar Ray
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur, Assam 784 028, India.
|
64
|
Selection on GGU and CGU codons in the high expression genes in bacteria. J Mol Evol 2013; 78:13-23. [PMID: 24271854 DOI: 10.1007/s00239-013-9596-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 02/07/2013] [Accepted: 11/11/2013] [Indexed: 12/22/2022]
Abstract
The fourfold degenerate site (FDS) in coding sequences is important for studying the effect of any selection pressure on codon usage bias (CUB) because nucleotide substitution per se is not under any such pressure at the site due to the unaltered amino acid sequence in a protein. We estimated the frequency variation of nucleotides at the FDS across the eight family boxes (FBs), defined as Um(g), the unevenness measure of a gene g. The study covered 545 species of bacteria. In many bacteria, Um(g) correlated strongly with Nc', a measure of the CUB. Analysis of the strongly correlated bacteria revealed that the U-ending codons (GGU, CGU) were preferred to the G-ending codons (GGG, CGG) in the Gly and Arg FBs even in genomes with G+C % higher than 65.0. Further evidence suggested that these codons can be used as a good indicator of selection pressure on CUB in genomes with higher G+C %.
|
65
|
Iwasaki Y, Abe T, Wada K, Wada Y, Ikemura T. A Novel Bioinformatics Strategy to Analyze Microbial Big Sequence Data for Efficient Knowledge Discovery: Batch-Learning Self-Organizing Map (BLSOM). Microorganisms 2013; 1:137-157. [PMID: 27694768 PMCID: PMC5029494 DOI: 10.3390/microorganisms1010137] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 09/26/2013] [Revised: 11/05/2013] [Accepted: 11/08/2013] [Indexed: 11/24/2022] Open
Abstract
With the remarkable increase of genomic sequence data of microorganisms, novel tools are needed for comprehensive analyses of the big sequence data available. The self-organizing map (SOM) is an effective tool for clustering and visualizing high-dimensional data, such as oligonucleotide composition on one map. By modifying the conventional SOM, we developed batch-learning SOM (BLSOM), which allowed classification of sequence fragments (e.g., 1 kb) according to phylotypes, solely depending on oligonucleotide composition. Metagenomics studies of uncultivable microorganisms in clinical and environmental samples should allow extensive surveys of genes important in life sciences. BLSOM is most suitable for phylogenetic assignment of metagenomic sequences, because fragmental sequences can be clustered according to phylotypes, solely depending on oligonucleotide composition. We first constructed oligonucleotide BLSOMs for all available sequences from genomes of known species, and by mapping metagenomic sequences on these large-scale BLSOMs, we can predict phylotypes of individual metagenomic sequences, revealing a microbial community structure of uncultured microorganisms, including viruses. BLSOM has shown that influenza viruses isolated from humans and birds clearly differ in oligonucleotide composition. Based on this host-dependent oligonucleotide composition, we have proposed strategies for predicting directional changes of virus sequences and for surveilling potentially hazardous strains when introduced into humans from non-human sources.
Affiliation(s)
- Yuki Iwasaki
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
- Japan Society for the Promotion of Science, Chiyoda-ku, Tokyo 102-0083, Japan.
- Takashi Abe
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
- Department of Information Engineering, Faculty of Engineering, Niigata University, Niigata-ken 950-2181, Japan.
- Kennosuke Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
- Yoshiko Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
- Faculty of Medicine, Shiga University of Medical Science, Shiga-ken 520-2121, Japan.
- Toshimichi Ikemura
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
|
66
|
Sharma R, Ahlawat S, Maitra A, Roy M, Mandakmale S, Tantia MS. Polymorphism of BMP4 gene in Indian goat breeds differing in prolificacy. Gene 2013; 532:140-5. [PMID: 24013084 DOI: 10.1016/j.gene.2013.08.086] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 02/26/2013] [Revised: 07/22/2013] [Accepted: 08/26/2013] [Indexed: 10/26/2022]
Abstract
Bone morphogenetic proteins (BMPs) are members of the TGF-β (transforming growth factor-beta) superfamily, of which BMP4 is the most important due to its crucial role in follicular growth and differentiation, cumulus expansion and ovulation. Reproduction is a crucial trait in goat breeding and based on the important role of BMP4 gene in reproduction it was considered as a possible candidate gene for the prolificacy of goats. The objective of the present study was to detect polymorphism in intronic, exonic and 3' un-translated regions of BMP4 gene in Indian goats. Nine different goat breeds (Barbari, Beetal, Black Bengal, Malabari, Jakhrana (Twinning>40%), Osmanabadi, Sangamneri (Twinning 20-30%), Sirohi and Ganjam (Twinning<10%)) differing in prolificacy and geographic distribution were employed for polymorphism scanning. Cattle sequence (AC_000167.1) was used to design primers for the amplification of a targeted region followed by direct DNA sequencing to identify the genetic variations. Single nucleotide polymorphisms (SNPs) were not detected in exon 3, the intronic region and the 3' flanking region. A SNP (G1534A) was identified in exon 2. It was a non-synonymous mutation resulting in an arginine to lysine change in a corresponding protein sequence. G to A transition at the 1534 locus revealed two genotypes GG and GA in the nine investigated goat breeds. The GG genotype was predominant with a genotype frequency of 0.98. The GA genotype was present in the Black Bengal as well as Jakhrana breed with a genotype frequency of 0.02. A microsatellite was identified in the 3' flanking region, only 20 nucleotides downstream from the termination site of the coding region, as a short sequence with more than nineteen continuous and repeated CA dinucleotides. Since the gene is highly evolutionarily conserved, identification of a non-synonymous SNP (G1534A) in the coding region gains further importance. 
To our knowledge, this is the first report of a mutation in the coding region of the caprine BMP4 gene. Whether the reproduction trait of goats is associated with the BMP4 polymorphism, however, needs to be established by association studies in more populations.
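The abstract gives genotype frequencies for the G1534A site (GG 0.98, GA 0.02) but not the allele frequencies they imply. A minimal sketch of that arithmetic for a diploid, two-allele locus follows; the function name is hypothetical, and treating the reported values as pooled frequencies across all sampled animals is an assumption.

```python
def allele_frequencies(freq_gg, freq_ga, freq_aa=0.0):
    """Return (freq_G, freq_A) from genotype frequencies at a biallelic site.

    Each heterozygote carries one copy of each allele, so it contributes
    half of its genotype frequency to each allele frequency.
    """
    freq_a = freq_aa + freq_ga / 2
    freq_g = freq_gg + freq_ga / 2
    return freq_g, freq_a


# Genotype frequencies as reported in the abstract (no AA genotype observed).
freq_g, freq_a = allele_frequencies(0.98, 0.02)
print(round(freq_g, 2), round(freq_a, 2))
```

With these inputs the A allele frequency comes out at about 0.01, which illustrates how rare the variant is in the sampled breeds.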
Affiliation(s)
- Rekha Sharma
- National Bureau of Animal Genetic Resources, Karnal 132001, India.
|
67
|
Skewes AD, Welch RD. A Markovian analysis of bacterial genome sequence constraints. PeerJ 2013; 1:e127. [PMID: 24010012 PMCID: PMC3757466 DOI: 10.7717/peerj.127] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 04/27/2013] [Accepted: 07/18/2013] [Indexed: 11/20/2022] Open
Abstract
The arrangement of nucleotides within a bacterial chromosome is influenced by numerous factors. The degeneracy of the third codon within each reading frame allows some flexibility of nucleotide selection; however, the third nucleotide in the triplet of each codon is at least partly determined by the preceding two. This is most evident in organisms with a strong G + C bias, as the degenerate codon must contribute disproportionately to maintaining that bias. Therefore, a correlation exists between the first two nucleotides and the third in all open reading frames. If the arrangement of nucleotides in a bacterial chromosome is represented as a Markov process, we would expect that the correlation would be completely captured by a second-order Markov model and an increase in the order of the model (e.g., third-, fourth-…order) would not capture any additional uncertainty in the process. In this manuscript, we present the results of a comprehensive study of the Markov property that exists in the DNA sequences of 906 bacterial chromosomes. All of the 906 bacterial chromosomes studied exhibit a statistically significant Markov property that extends beyond second-order, and therefore cannot be fully explained by codon usage. An unrooted tree containing all 906 bacterial chromosomes based on their transition probability matrices of third-order shares ∼25% similarity to a tree based on sequence homologies of 16S rRNA sequences. This congruence to the 16S rRNA tree is greater than for trees based on lower-order models (e.g., second-order), and higher-order models result in diminishing improvements in congruence. A nucleotide correlation most likely exists within every bacterial chromosome that extends past three nucleotides. This correlation places significant limits on the number of nucleotide sequences that can represent probable bacterial chromosomes. 
Transition matrix usage is largely conserved by taxa, indicating that this property is likely inherited; however, some important exceptions exist that may indicate the convergent evolution of some bacteria.
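To make the idea of an order-k Markov model of a nucleotide sequence concrete, the sketch below estimates transition probabilities by counting which base follows each k-base context. It is an illustrative maximum-likelihood estimate under assumed names (`transition_probabilities`, `order`), not the authors' statistical test for the Markov property.

```python
from collections import Counter, defaultdict


def transition_probabilities(seq, order):
    """Estimate P(next base | previous `order` bases) by simple counting."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - order):
        context = seq[i:i + order]
        counts[context][seq[i + order]] += 1
    # Normalise each context's counts so its probabilities sum to 1.
    return {
        context: {base: c / sum(nexts.values()) for base, c in nexts.items()}
        for context, nexts in counts.items()
    }


# Toy sequence (hypothetical); a second-order model conditions on two bases.
probs = transition_probabilities("ACGACGACT", order=2)
print(probs["AC"])
```

A second-order model has at most 16 contexts and a third-order model at most 64, which is why comparing models of increasing order, as the abstract describes, requires long sequences such as whole chromosomes.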
Affiliation(s)
- Aaron D Skewes
- Department of Biology, Syracuse University, Syracuse, NY, United States; Department of Mathematics, Syracuse University, Syracuse, NY, United States
|
68
|
Iwasaki Y, Abe T, Wada Y, Wada K, Ikemura T. Novel bioinformatics strategies for prediction of directional sequence changes in influenza virus genomes and for surveillance of potentially hazardous strains. BMC Infect Dis 2013; 13:386. [PMID: 23964903 PMCID: PMC3765179 DOI: 10.1186/1471-2334-13-386] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Received: 04/10/2013] [Accepted: 08/05/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND With the remarkable increase of microbial and viral sequence data obtained from high-throughput DNA sequencers, novel tools are needed for comprehensive analysis of the big sequence data. We have developed the "Batch-Learning Self-Organizing Map (BLSOM)", which can characterize very many, even millions of, genomic sequences on one plane. Influenza virus is a zoonotic virus and shows clear host tropism. Important issues for bioinformatics studies of influenza viruses are prediction of genomic sequence changes in the near future and surveillance of potentially hazardous strains. METHODS To characterize sequence changes in influenza virus genomes after invasion into humans from other animal hosts, we applied BLSOMs to analyses of mono-, di-, tri-, and tetranucleotide compositions in all genome sequences of influenza A and B viruses and found clear host-dependent clustering (self-organization) of the sequences. RESULTS Viruses isolated from humans and birds differed from each other in mononucleotide composition. In addition, host-dependent oligonucleotide compositions that could not be explained by the host-dependent mononucleotide composition were revealed by oligonucleotide BLSOMs. Retrospective time-dependent directional changes of mono- and oligonucleotide compositions, which were visualized for human strains on BLSOMs, could provide predictive information about sequence changes in viruses newly introduced from other animal hosts (e.g. the swine-derived pandemic H1N1/09). CONCLUSIONS Based on the host-dependent oligonucleotide composition, we proposed a strategy for prediction of directional changes of virus sequences and for surveillance of potentially hazardous strains when introduced into human populations from non-human sources. Millions of genomic sequences from infectious microbes and viruses have become available because of their medical and social importance, and BLSOM can characterize the big data and support efficient knowledge discovery.
Affiliation(s)
- Yuki Iwasaki
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken, 526-0829, Japan
|
69
|
Iwasaki Y, Wada K, Wada Y, Abe T, Ikemura T. Notable clustering of transcription-factor-binding motifs in human pericentric regions and its biological significance. Chromosome Res 2013; 21:461-74. [PMID: 23896648 PMCID: PMC3761090 DOI: 10.1007/s10577-013-9371-y] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 05/10/2013] [Revised: 06/14/2013] [Accepted: 06/14/2013] [Indexed: 11/29/2022]
Abstract
Since oligonucleotide composition in the genome sequence varies significantly among species, even among those possessing the same genome G + C%, the composition has been used to distinguish a wide range of genomes and is called a “genome signature”. Oligonucleotides often represent motif sequences responsible for sequence-specific protein binding (e.g., transcription-factor binding). Occurrences of such motif oligonucleotides in the genome should be biased compared to those observed in random sequences and may differ among genomes and genomic portions. The Self-Organizing Map (SOM) is a powerful tool for clustering high-dimensional data such as oligonucleotide composition on one plane. We previously modified the conventional SOM for genome informatics to create the batch-learning SOM or “BLSOM”. When we constructed BLSOMs to analyze pentanucleotide composition in 20-, 50-, and 100-kb sequences derived from the human genome, BLSOMs did not classify human sequences according to chromosome but revealed several specific zones composed primarily of sequences derived from pericentric regions. Interestingly, various transcription-factor-binding motifs were characteristically overrepresented in pericentric regions but underrepresented in most genomic sequences. When we focused on much shorter sequences (e.g., 1 kb), the clustering of transcription-factor-binding motifs was evident in pericentric, subtelomeric and sex chromosome pseudoautosomal regions. The biological significance of the clustering in these regions is discussed in connection with cell-type- and stage-dependent chromocenter formation and nuclear organization.
Affiliation(s)
- Yuki Iwasaki
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken, 526-0829, Japan
|
70
|
Salmonella utilizes D-glucosaminate via a mannose family phosphotransferase system permease and associated enzymes. J Bacteriol 2013; 195:4057-66. [PMID: 23836865 DOI: 10.1128/jb.00290-13] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Indexed: 01/21/2023] Open
Abstract
Salmonella enterica is a globally significant bacterial food-borne pathogen that utilizes a variety of carbon sources. We report here that Salmonella enterica subsp. enterica serovar Typhimurium (S. Typhimurium) uses d-glucosaminate (2-amino-2-deoxy-d-gluconic acid) as a carbon and nitrogen source via a previously uncharacterized mannose family phosphotransferase system (PTS) permease, and we designate the genes encoding the permease dgaABCD (d-glucosaminate PTS permease components EIIA, EIIB, EIIC, and EIID). Two other genes in the dga operon (dgaE and dgaF) were required for wild-type growth of S. Typhimurium with d-glucosaminate. Transcription of dgaABCDEF was dependent on RpoN (σ(54)) and an RpoN-dependent activator gene we designate dgaR. Introduction of a plasmid bearing dgaABCDEF under the control of the lac promoter into Escherichia coli strains DH5α, BL21, and JM101 allowed these strains to grow on minimal medium containing d-glucosaminate as the sole carbon and nitrogen source. Biochemical and genetic data support a catabolic pathway in which d-glucosaminate, as it is transported across the cell membrane, is phosphorylated at the C-6 position by DgaABCD. DgaE converts the resulting d-glucosaminate-6-phosphate to 2-keto-3-deoxygluconate 6-phosphate (KDGP), which is subsequently cleaved by the aldolase DgaF to form glyceraldehyde-3-phosphate and pyruvate. DgaF catalyzes the same reaction as that catalyzed by Eda, a KDGP aldolase in the Entner-Doudoroff pathway, and the two enzymes can substitute for each other in their respective pathways. Examination of the Integrated Microbial Genomes database revealed that orthologs of the dga genes are largely restricted to certain enteric bacteria and a few species in the phylum Firmicutes.
|
71
|
Alsop EB, Raymond J. Resolving prokaryotic taxonomy without rRNA: longer oligonucleotide word lengths improve genome and metagenome taxonomic classification. PLoS One 2013; 8:e67337. [PMID: 23840870 PMCID: PMC3698125 DOI: 10.1371/journal.pone.0067337] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 01/28/2013] [Accepted: 05/16/2013] [Indexed: 11/19/2022] Open
Abstract
Oligonucleotide signatures, especially tetranucleotide signatures, have been used as a method for homology binning by exploiting an organism’s inherent biases towards the use of specific oligonucleotide words. Tetranucleotide signatures have been especially useful in environmental metagenomics samples, as many of these samples contain organisms from poorly classified phyla which cannot be easily identified using traditional homology methods, including NCBI BLAST. This study examines oligonucleotide signatures across 1,424 completed genomes from across the tree of life, substantially expanding upon previous work. A comprehensive analysis of mononucleotide through nonanucleotide word lengths suggests that longer word lengths substantially improve the classification of DNA fragments across a range of sizes of relevance to high throughput sequencing. We find that, at present, heptanucleotide signatures represent an optimal balance between prediction accuracy and computational time for resolving taxonomy using both genomic and metagenomic fragments. We directly compare the ability of tetranucleotide and heptanucleotide word lengths (tetranucleotide signatures are the current standard for oligonucleotide word usage analyses) for taxonomic binning of metagenome reads. We present evidence that heptanucleotide word lengths consistently provide more taxonomic resolving power, particularly in distinguishing between closely related organisms that are often present in metagenomic samples. This implies that longer oligonucleotide word lengths should replace tetranucleotide signatures for most analyses. Finally, we show that the application of longer word lengths to metagenomic datasets leads to more accurate taxonomic binning of DNA scaffolds and has the potential to substantially improve taxonomic assignment and assembly of metagenomic data.
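As a rough illustration of what an oligonucleotide signature is and why word length affects computational cost, the sketch below builds a normalised k-mer frequency vector and compares two vectors by Euclidean distance. The function names and the choice of distance are assumptions for illustration only; this is not the classification method used in the study.

```python
import math
from collections import Counter
from itertools import product


def kmer_signature(seq, k):
    """Return the frequency of every A/C/G/T word of length k, in fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    # Words containing other characters (e.g. N) are ignored in the total.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean_distance(sig_a, sig_b):
    """Distance between two signatures of the same word length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))


# The vector length grows as 4**k: 256 entries for k=4, 16,384 for k=7.
tetra = kmer_signature("ACGTACGTTTGACA", 4)
print(len(tetra), len(kmer_signature("ACGTACGTTTGACA", 7)))
```

The jump from 256 to 16,384 dimensions is one way to see the trade-off the abstract mentions between resolving power and computational time.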
Affiliation(s)
- Eric B Alsop
- School of Earth and Space Exploration, Arizona State University, Tempe, Arizona, United States of America.
|
72
|
Genome implosion elicits host-confinement in Alcaligenaceae: evidence from the comparative genomics of Tetrathiobacter kashmirensis, a pathogen in the making. PLoS One 2013; 8:e64856. [PMID: 23741407 PMCID: PMC3669393 DOI: 10.1371/journal.pone.0064856] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 01/28/2013] [Accepted: 04/19/2013] [Indexed: 11/24/2022] Open
Abstract
This study elucidates the genomic basis of the evolution of pathogens alongside free-living organisms within the family Alcaligenaceae of Betaproteobacteria. Towards that end, the complete genome sequence of the sulfur-chemolithoautotroph Tetrathiobacter kashmirensis WT001T was determined and compared with the soil isolate Achromobacter xylosoxidans A8 and the two pathogens Bordetella bronchiseptica RB50 and Taylorella equigenitalis MCE9. All analyses comprehensively indicated that the RB50 and MCE9 genomes were almost the subsets of A8 and WT001T, respectively. In the immediate evolutionary past Achromobacter and Bordetella shared a common ancestor, which was distinct from the other contemporary stock that gave rise to Tetrathiobacter and Taylorella. The Achromobacter-Bordetella precursor, after diverging from the family ancestor, evolved through extensive genome inflation, subsequent to which the two genera separated via differential gene losses and acquisitions. Tetrathiobacter, meanwhile, retained the core characteristics of the family ancestor, and Taylorella underwent massive genome degeneration to reach an evolutionary dead-end. Interestingly, the WT001T genome, despite its conserved architecture, had only 85% coding density, besides which 578 out of its 4452 protein-coding sequences were found to be pseudogenized. Translational impairment of several DNA repair-recombination genes in the first place seemed to have ushered the rampant and indiscriminate frame-shift mutations across the WT001T genome. Presumably, this strain has just come out of a recent evolutionary bottleneck, representing a unique transition state where genome self-degeneration has started comprehensively but selective host-confinement has not yet set in. In the light of this evolutionary link, host-adaptation of Taylorella clearly appears to be the aftereffect of genome implosion in another member of the same bottleneck. 
Remarkably again, potent virulence factors were found widespread in Alcaligenaceae, corroborating which hemolytic and mammalian cell-adhering abilities were discovered in WT001T. So, while WT001T relatives/derivatives in nature could be going the Taylorella way, the lineage as such was well-prepared for imminent host-confinement.
|
73
|
Transfer RNA gene numbers may not be completely responsible for the codon usage bias in asparagine, isoleucine, phenylalanine, and tyrosine in the high expression genes in bacteria. J Mol Evol 2012; 75:34-42. [PMID: 23053196 DOI: 10.1007/s00239-012-9524-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 08/19/2012] [Accepted: 09/24/2012] [Indexed: 10/27/2022]
Abstract
It is generally believed that the effect of translational selection on codon usage bias is related to the number of transfer RNA genes in bacteria, an effect that is stronger in the high expression genes than in the whole genome. With this in the background, we analyzed codon usage bias with respect to the amino acids asparagine, isoleucine, phenylalanine, and tyrosine. The analysis was done in seventeen bacteria for which gene expression data and information about the tRNA gene number were available. In most of the bacteria, codon usage bias and tRNA gene number were not in agreement, which was unexpected. We extended the study further to 199 bacteria, limiting it to the codon usage bias in the two highly expressed genes rpoB and rpoC, which encode the RNA polymerase subunits β and β', respectively. In concordance with the result in the high expression genes, codon usage bias in the rpoB and rpoC genes was also found not to be in agreement with tRNA gene number in many of these bacteria. Our study indicates that tRNA gene numbers may not be the sole determining factor for translational selection of codon usage bias in bacterial genomes.
|
74
|
Akhter S, Aziz RK, Edwards RA. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res 2012; 40:e126. [PMID: 22584627 PMCID: PMC3439882 DOI: 10.1093/nar/gks406] [Citation(s) in RCA: 350] [Impact Index Per Article: 26.9] [Indexed: 02/04/2023] Open
Abstract
Prophages are phages in lysogeny that are integrated into, and replicated as part of, the host bacterial genome. These mobile elements can have tremendous impact on their bacterial hosts’ genomes and phenotypes, which may lead to strain emergence and diversification, increased virulence or antibiotic resistance. However, finding prophages in microbial genomes remains a problem with no definitive solution. The majority of existing tools rely on detecting genomic regions enriched in protein-coding genes with known phage homologs, which hinders the de novo discovery of phage regions. In this study, a weighted phage detection algorithm, PhiSpy was developed based on seven distinctive characteristics of prophages, i.e. protein length, transcription strand directionality, customized AT and GC skew, the abundance of unique phage words, phage insertion points and the similarity of phage proteins. The first five characteristics are capable of identifying prophages without any sequence similarity with known phage genes. PhiSpy locates prophages by ranking genomic regions enriched in distinctive phage traits, which leads to the successful prediction of 94% of prophages in 50 complete bacterial genomes with a 6% false-negative rate and a 0.66% false-positive rate.
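One of the compositional characteristics listed above is GC skew. The abstract refers to a customized skew whose details are not given here, so the sketch below shows only the standard windowed definition, (G - C) / (G + C), as an assumed stand-in; the function name and the non-overlapping window scheme are illustrative choices, not PhiSpy's implementation.

```python
def gc_skew(seq, window):
    """Standard GC skew, (G - C) / (G + C), in non-overlapping windows.

    Windows with no G or C get a skew of 0.0; a trailing partial window
    is ignored.
    """
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


# Toy sequence (hypothetical) split into three 4-base windows.
print(gc_skew("GGGCAAAACCCG", window=4))
```

An abrupt change in skew along a genome is one compositional hint that a region, such as an integrated prophage, has a different origin or orientation from its surroundings.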
Affiliation(s)
- Sajia Akhter
- Computational Science Research Center, Department of Computer Science, San Diego State University, San Diego, CA 92182, USA.
|
75
|
Dass JFP, Sudandiradoss C. Insight into pattern of codon biasness and nucleotide base usage in serotonin receptor gene family from different mammalian species. Gene 2012; 503:92-100. [PMID: 22480817 DOI: 10.1016/j.gene.2012.03.057] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 06/08/2011] [Revised: 03/14/2012] [Accepted: 03/17/2012] [Indexed: 11/16/2022]
Abstract
5-HT (5-Hydroxy-tryptamine) or serotonin receptors are found in both the central and peripheral nervous systems as well as in non-neuronal tissues. In the animal and human nervous system, serotonin produces various functional effects through a variety of membrane-bound receptors. In this study, we focused on the 5-HT receptor family from different mammals and examined the factors that account for codon and nucleotide usage variation. A total of 110 homologous coding sequences from 11 different mammalian species were analyzed using relative synonymous codon usage (RSCU), correspondence analysis (COA) and hierarchical cluster analysis together with nucleotide base usage frequency of chemically similar amino acid codons. The mean effective number of codons (ENc) value of 37.06 for 5-HT(6) shows very high codon bias within the family and may be due to high selective translational efficiency. The COA and Spearman's rank correlation reveal that nucleotide compositional mutation bias is the major factor influencing codon usage in serotonin receptor genes. The hierarchical cluster analysis suggests that gene function is another dominant factor that affects the codon usage bias, while species is a minor factor. Nucleotide base usage, reported using the Goldman, Engelman, Steitz (GES) scale, reveals the presence of high uracil (>45%) content at functionally important hydrophobic regions. Our in silico approach should help further investigations into the evolution, structure, function and gene expression of the 5-HT receptor family, whose members are potential antipsychotic drug targets.
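Relative synonymous codon usage (RSCU) is named in the abstract but not defined. A minimal sketch of the usual definition follows: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The two-family codon table and the toy counts are deliberately partial and hypothetical; a real analysis would use the full genetic code.

```python
from collections import Counter

# Partial synonymous-codon table, for illustration only.
SYNONYMOUS = {
    "Gly": ["GGU", "GGC", "GGA", "GGG"],
    "Phe": ["UUU", "UUC"],
}


def rscu(codons):
    """RSCU = observed count * (number of synonyms) / (family total)."""
    counts = Counter(codons)
    values = {}
    for family in SYNONYMOUS.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from the input
        for codon in family:
            values[codon] = counts[codon] * len(family) / total
    return values


# Toy codon list (hypothetical counts, not data from the study).
toy = ["GGU"] * 6 + ["GGC"] * 2 + ["UUU"] * 3 + ["UUC"]
print(rscu(toy))
```

An RSCU of 1 means a codon is used exactly as often as expected under equal usage; values above 1 indicate a preferred codon and values below 1 an avoided one.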
Affiliation(s)
- J Febin Prabhu Dass
- School of Biosciences and Technology, VIT University, Vellore, Tamil Nadu State, India
76
Nyeo SL, Yu JP. Length distributions of simple tandem repeats in genomes. J Biol Syst 2011. [DOI: 10.1142/s0218339007002246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The length distributions of simple tandem repeats in the genomes of several organisms are evaluated and found to exhibit long-range correlations in repeats involving the A and T nucleotide bases for most eukaryotes. In particular, the length distributions of the mononucleotide A/T repeat units have longer tails than those of the C/G repeat units. Also, the length distributions of the dinucleotide repeat unit CG show a simple, rapid monotonic decrease, while those of repeat units AT, AG and AC have complicated structures at larger repeat lengths, especially for human, mouse and rat chromosomes. These distributional behaviors are due to the CpG deficiency in different genomes with different methylation activities. In particular, methyltransferases in vertebrates appear to methylate specifically the cytosine in CpG dinucleotides, and methylated cytosine is prone to mutate to thymine by spontaneous deamination, so the dinucleotide CpG gradually decays into TpG and CpA. In addition, there is a peak in the distributions of repeat unit A at repeat-repeat separation 153 nt for humans and chimpanzees. We show that the long-tail behavior of mononucleotide repeat unit A and the peak at repeat separation 153 nt are due to the interspersed repetitive DNA sequences in humans and chimpanzees.
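A length distribution of mononucleotide repeats, as discussed in this abstract, can be tabulated along the following lines. This is a minimal sketch: the function name `run_length_distribution` is hypothetical, only one strand is scanned, and a "repeat" is taken to be any maximal run of a single base.

```python
from collections import Counter
from itertools import groupby


def run_length_distribution(seq, base):
    """Count maximal runs of `base` in `seq`, keyed by run length."""
    lengths = (
        sum(1 for _ in group)                 # length of the current run
        for symbol, group in groupby(seq.upper())
        if symbol == base
    )
    return Counter(lengths)


dist = run_length_distribution("AAATAACGAAAA", "A")
print(sorted(dist.items()))  # [(2, 1), (3, 1), (4, 1)]
```

Comparing the tail of this histogram for A or T against the one for C or G is one simple way to look for the longer A/T tails the abstract describes.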
Affiliation(s)
- SU-LONG NYEO
- Department of Physics, National Cheng Kung University, Tainan, Taiwan 701, R.O.C
- JUI-PING YU
- Department of Physics, National Cheng Kung University, Tainan, Taiwan 701, R.O.C
77
Uehara H, Iwasaki Y, Wada C, Ikemura T, Abe T. A novel bioinformatics strategy for searching industrially useful genome resources from metagenomic sequence libraries. Genes Genet Syst 2011; 86:53-66. [PMID: 21498923 DOI: 10.1266/ggs.86.53] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/23/2022] Open
Abstract
Although remarkable progress in metagenomic sequencing of various environmental samples has been made, large numbers of fragment sequences have been registered in the international DNA databanks, primarily without information on gene function and phylotype, and thus with limited usefulness. Industrially useful biological activity is often carried out by a set of genes, such as those constituting an operon. In this connection, metagenomic approaches have a weakness because sets of genes are usually split up, since the sequences obtained by metagenome analyses are fragmented into 1-kb or much shorter segments. Therefore, even when a set of genes responsible for an industrially useful function is found in one metagenome library, it is usually difficult to know whether a single genome harbors the entire gene set or whether different genomes have individual genes. By modifying the Self-Organizing Map (SOM), we previously developed BLSOM for oligonucleotide composition, which allowed classification (self-organization) of sequence fragments according to genomes. Because BLSOM could reassociate genomic fragments according to genomes, BLSOM may ameliorate the abovementioned weakness of metagenome analyses. Here, we have developed a strategy for clustering of metagenomic sequences according to phylotypes and genomes, by testing a gene set contributing to environment preservation.
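The "oligonucleotide composition" that a SOM clusters is, in essence, a fixed-length frequency vector per sequence fragment. The sketch below builds such a vector; it is an illustration only, the function name `kmer_frequencies` and the default `k=4` are assumptions, and it does not reproduce BLSOM itself (in particular, it does not merge reverse-complement words).

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return the 4**k relative frequencies of k-mers, in ACGT lexical order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing N or other ambiguity codes are left out of the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


vector = kmer_frequencies("ACGT", k=2)
print(len(vector), vector[1])  # 16 entries; index 1 is the frequency of "AC"
```

Each fragment's vector would then serve as one input point for whichever clustering method is used.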
78
Porceddu A, Camiolo S. Spatial analyses of mono, di and trinucleotide trends in plant genes. PLoS One 2011; 6:e22855. [PMID: 21829660 PMCID: PMC3148226 DOI: 10.1371/journal.pone.0022855] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 04/13/2011] [Accepted: 06/30/2011] [Indexed: 11/24/2022] Open
Abstract
Genomic DNA sequences display compositional heterogeneity on many scales. In this paper we analyzed tendencies and anomalies in the occurrence of mono-, di- and trinucleotides in structural regions of plant genes. Representation of these trends as a function of position along genic sequences highlighted compositional features peculiar to either monocots or eudicots that were remarkably uniform within these two evolutionary clades. The most evident of these features appeared in the form of a gradient of base content along the direction of transcription. The robustness of such a representation was validated in sequence sub-datasets generated by considering structural and compositional features such as total length of CDS, overall GC content and genic orientation in the genome. Piecewise regression analyses indicated that the gradients could be conveniently approximated by a two-segment model in which a first region featuring a steep slope is followed by a second segment fitting a milder variation. In general, monocot species showed steeper segments than eudicots. The guanine gradient was the most distinctive feature between the two evolutionary clades, being moderately increasing in eudicots and firmly decreasing in monocots. Single gene investigation revealed that a high proportion of genes show compositional trends compatible with a segmented model, suggesting that these features are essential attributes of gene organization. Dinucleotide and trinucleotide biases were referred to expectation based on a random union of the component elements. The average bias at the dinucleotide level identified a significant underrepresentation of some dinucleotides and the overrepresentation of others. The bias at the trinucleotide level was on average low. Finally, the analysis of bryophyte coding sequences showed mononucleotide, dinucleotide and trinucleotide compositional trends resembling those of higher plants. This finding suggested that the emergence of compositional bias is an ancient event in evolution which was already present at the time of land conquest by green plants.
Affiliation(s)
- Andrea Porceddu
- Dipartimento di Scienze Agronomiche e Genetica Vegetale Agraria, Università degli Studi di Sassari, Sassari, Italy.
79
Epps J, Ying H, Huttley GA. Statistical methods for detecting periodic fragments in DNA sequence data. Biol Direct 2011; 6:21. [PMID: 21527008 PMCID: PMC3111405 DOI: 10.1186/1745-6150-6-21] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 11/03/2010] [Accepted: 04/28/2011] [Indexed: 11/10/2022] Open
Abstract
Background Period 10 dinucleotides are structurally and functionally validated factors that influence the ability of DNA to form nucleosomes, histone core octamers. Robust identification of periodic signals in DNA sequences is therefore required to understand nucleosome organisation in genomes. While various techniques for identifying periodic components in genomic sequences have been proposed or adopted, the requirements for such techniques have not been considered in detail and confirmatory testing for a priori specified periods has not been developed. Results We compared the estimation accuracy and suitability for confirmatory testing of autocorrelation, discrete Fourier transform (DFT), integer period discrete Fourier transform (IPDFT) and a previously proposed Hybrid measure. A number of different statistical significance procedures were evaluated but a blockwise bootstrap proved superior. When applied to synthetic data whose period-10 signal had been eroded, or for which the signal was approximately period-10, the Hybrid technique exhibited superior properties during exploratory period estimation. In contrast, confirmatory testing using the blockwise bootstrap procedure identified IPDFT as having the greatest statistical power. These properties were validated on yeast sequences defined from a ChIP-chip study where the Hybrid metric confirmed the expected dominance of period-10 in nucleosome associated DNA but IPDFT identified more significant occurrences of period-10. Application to the whole genomes of yeast and mouse identified ~ 21% and ~ 19% respectively of these genomes as spanned by period-10 nucleosome positioning sequences (NPS). Conclusions For estimating the dominant period, we find the Hybrid period estimation method empirically to be the most effective for both eroded and approximate periodicity. 
The blockwise bootstrap was found to be effective as a significance measure, performing particularly well in the problem of period detection in the presence of eroded periodicity. The autocorrelation method was identified as poorly suited for use with the blockwise bootstrap. Application of our methods to the genomes of two model organisms revealed a striking proportion of the yeast and mouse genomes are spanned by NPS. Despite their markedly different sizes, roughly equivalent proportions (19-21%) of the genomes lie within period-10 spans of the NPS dinucleotides {AA, TT, TA}. The biological significance of these regions remains to be demonstrated. To facilitate this, the genomic coordinates are available as Additional files 1, 2, and 3 in a format suitable for visualisation as tracks on popular genome browsers. Reviewers This article was reviewed by Prof Tomas Radivoyevitch, Dr Vsevolod Makeev (nominated by Dr Mikhail Gelfand), and Dr Rob D Knight.
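The idea of testing for an a priori specified period can be sketched as a single Fourier term evaluated at that period. The code below is a simplified stand-in, not the paper's IPDFT or Hybrid statistic and without the blockwise bootstrap; the names `indicator` and `period_power` are hypothetical, and the dinucleotide set is taken from the abstract.

```python
import cmath

NPS_DINUCS = {"AA", "TT", "TA"}  # dinucleotide set named in the abstract


def indicator(seq, dinucs=NPS_DINUCS):
    """1 where a dinucleotide from `dinucs` starts, else 0."""
    seq = seq.upper()
    return [1 if seq[i:i + 2] in dinucs else 0 for i in range(len(seq) - 1)]


def period_power(signal, period):
    """Power of the mean-centred signal at one integer period."""
    mean = sum(signal) / len(signal)
    total = sum(
        (x - mean) * cmath.exp(-2j * cmath.pi * n / period)
        for n, x in enumerate(signal)
    )
    return abs(total) ** 2 / len(signal)


# Synthetic sequence with an AA dinucleotide every 10 bases.
synthetic = "AAGCGCGCGC" * 20
signal = indicator(synthetic)
print(period_power(signal, 10) > period_power(signal, 7))  # True
```

A significance procedure such as the blockwise bootstrap described above would be needed to decide whether an observed power is larger than expected by chance.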
Affiliation(s)
- Julien Epps
- School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, NSW 2052, Australia.
80
Iwasaki Y, Abe T, Wada K, Itoh M, Ikemura T. Prediction of directional changes of influenza A virus genome sequences with emphasis on pandemic H1N1/09 as a model case. DNA Res 2011; 18:125-36. [PMID: 21444341 PMCID: PMC3077041 DOI: 10.1093/dnares/dsr005] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Indexed: 12/02/2022] Open
Abstract
Influenza virus poses a significant threat to public health, as exemplified by the recent introduction of the new pandemic strain H1N1/09 into human populations. Pandemics have been initiated by the occurrence of novel changes in animal sources that eventually adapt to humans. One important issue in studies of viral genomes, particularly those of influenza virus, is to predict possible changes in genomic sequence that will become hazardous. We previously established a clustering method termed ‘BLSOM’ (batch-learning self-organizing map) that does not depend on sequence alignment and can characterize and compare even 1 million genomic sequences in one run. Strategies for comparing a vast number of genomic sequences simultaneously have become increasingly important in genome studies because of remarkable progress in nucleotide sequencing. In this study, we have constructed BLSOMs based on the oligonucleotide and codon composition of all influenza A viral strains available. Without prior information with regard to their hosts, sequences derived from strains isolated from avian or human sources were successfully clustered according to the hosts. Notably, the pandemic H1N1/09 strains have oligonucleotide and codon compositions that are clearly different from those of human seasonal influenza A strains. This enables us to infer future directional changes in the influenza A viral genome.
Affiliation(s)
- Yuki Iwasaki
- Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken 526-0829, Japan
81
Fang X, Du Y, Zhang C, Shi X, Chen D, Sun J, Jin Q, Lan X, Chen H. Polymorphism in a microsatellite of the acrp30 gene and its association with growth traits in goats. Biochem Genet 2011; 49:533-9. [PMID: 21369822 DOI: 10.1007/s10528-011-9428-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 03/26/2010] [Accepted: 12/06/2010] [Indexed: 11/30/2022]
Abstract
Acrp30 plays a critical role in the regulation of glucose and lipid homeostasis. In this study, polymorphism of the Acrp30 gene was detected by PCR-SSCP and DNA sequencing methods in 321 individuals from three goat breeds, and the association of Acrp30 gene polymorphism with growth traits in the three goat breeds was analyzed. A novel insert/deletion (GT)(5) microsatellite sequence was detected in the 5' flanking region of the gene. Three genotypes (AA, AB, and BB) were found in three breeds. There was moderate genetic diversity in the locus in the analyzed populations. Significant associations were observed between the genotypes of the locus and growth traits in the Boer goat population. The chest circumference of individuals with genotype BB was significantly greater than that of individuals with genotype AA.
Affiliation(s)
- Xingtang Fang
- Institute of Cellular and Molecular Biology, College of Life Science, Xuzhou Normal University, China
82
Visualization of sequence and structural features of genomes and chromosome fragments. Application to CpG islands, Alu sequences and whole genomes. Gene X 2011; 473:76-81. [DOI: 10.1016/j.gene.2010.11.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2010] [Revised: 11/24/2010] [Accepted: 11/24/2010] [Indexed: 11/20/2022] Open
83
Garcia SP, Pinho AJ, Rodrigues JMOS, Bastos CAC, Ferreira PJSG. Minimal absent words in prokaryotic and eukaryotic genomes. PLoS One 2011; 6:e16065. [PMID: 21386877 PMCID: PMC3031530 DOI: 10.1371/journal.pone.0016065] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Received: 09/01/2010] [Accepted: 12/04/2010] [Indexed: 11/21/2022] Open
Abstract
Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we explore different sets of minimal absent words in the genomes of 22 organisms (one archaeon, thirteen bacteria and eight eukaryotes). We investigate whether the mutational biases that may explain the deficit of the shortest absent words in vertebrates are also pervasive in other absent words, namely in minimal absent words, and in other organisms. We find that the compositional biases observed for the shortest absent words in vertebrates are not uniform throughout different sets of minimal absent words. We further investigate the hypothesis of the inheritance of minimal absent words through common ancestry from the similarity in dinucleotide relative abundances of different sets of minimal absent words, and find that this inheritance may be exclusive to vertebrates.
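For readers unfamiliar with the term, a word is a minimal absent word when it does not occur in the sequence although both its longest proper prefix and its longest proper suffix do. The brute-force sketch below applies that definition to a single strand; the function name `minimal_absent_words` is hypothetical, and the enumeration is only practical for short word lengths, unlike the suffix-based algorithms used on whole genomes.

```python
from itertools import product


def minimal_absent_words(seq, max_len):
    """List minimal absent words of length 2..max_len over the letters of seq."""
    seq = seq.upper()
    present = {
        seq[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(seq) - k + 1)
    }
    alphabet = sorted(set(seq))
    words = []
    for k in range(2, max_len + 1):
        for letters in product(alphabet, repeat=k):
            w = "".join(letters)
            # Absent itself, but both maximal proper factors are present.
            if w not in present and w[:-1] in present and w[1:] in present:
                words.append(w)
    return words


print(minimal_absent_words("AAC", 3))  # ['CA', 'CC', 'AAA']
```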
Affiliation(s)
- Sara P Garcia
- Signal Processing Laboratory, IEETA, University of Aveiro, Aveiro, Portugal.
84
Delaye L, González-Domenech CM, Garcillán-Barcia MP, Peretó J, de la Cruz F, Moya A. Blueprint for a minimal photoautotrophic cell: conserved and variable genes in Synechococcus elongatus PCC 7942. BMC Genomics 2011; 12:25. [PMID: 21226929 PMCID: PMC3025956 DOI: 10.1186/1471-2164-12-25] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 05/26/2010] [Accepted: 01/12/2011] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Simpler biological systems should be easier to understand and to engineer towards pre-defined goals. One way to achieve biological simplicity is through genome minimization. Here we looked for genomic islands in the fresh water cyanobacteria Synechococcus elongatus PCC 7942 (genome size 2.7 Mb) that could be used as targets for deletion. We also looked for conserved genes that might be essential for cell survival. RESULTS By using a combination of methods we identified 170 xenologs, 136 ORFans and 1401 core genes in the genome of S. elongatus PCC 7942. These represent 6.5%, 5.2% and 53.6% of the annotated genes respectively. We considered that genes in genomic islands could be found if they showed a combination of: a) unusual G+C content; b) unusual phylogenetic similarity; and/or c) a small number of the highly iterated palindrome 1 (HIP1) motif plus an unusual codon usage. The origin of the largest genomic island by horizontal gene transfer (HGT) could be corroborated by lack of coverage among metagenomic sequences from a fresh water microbialite. Evidence is also presented that xenologous genes tend to cluster in operons. Interestingly, most genes coding for proteins with a diguanylate cyclase domain are predicted to be xenologs, suggesting a role for horizontal gene transfer in the evolution of Synechococcus sensory systems. CONCLUSIONS Our estimates of genomic islands in PCC 7942 are larger than those predicted by other published methods like SIGI-HMM. Our results set a guide to non-essential genes in S. elongatus PCC 7942 indicating a path towards the engineering of a model photoautotrophic bacterial cell.
Affiliation(s)
- Luis Delaye
- Institut Cavanilles de Biodiversitat i Biologia Evolutiva, Universitat de València, Valencia, Spain
- Departamento de Ingeniería Genética CINVESTAV-Irapuato, Km. 9.6 Libramiento Norte, Carretera Irapuato-León, 36821 Irapuato, Guanajuato, México
- Carmen M González-Domenech
- Institut Cavanilles de Biodiversitat i Biologia Evolutiva, Universitat de València, Valencia, Spain
- Facultad de Farmacia, Universidad de Granada, Granada, Spain
- María P Garcillán-Barcia
- Departamento de Biología Molecular e Instituto de Biomedicina y Biotecnología de Cantabria (IBBTEC), Universidad de Cantabria-CSIC-IDICAN, Santander, Spain
- Juli Peretó
- Institut Cavanilles de Biodiversitat i Biologia Evolutiva, Universitat de València, Valencia, Spain
- Departament de Bioquimica i Biologia Molecular, Universitat de València, València, Spain
- Fernando de la Cruz
- Departamento de Biología Molecular e Instituto de Biomedicina y Biotecnología de Cantabria (IBBTEC), Universidad de Cantabria-CSIC-IDICAN, Santander, Spain
- Andrés Moya
- Institut Cavanilles de Biodiversitat i Biologia Evolutiva, Universitat de València, Valencia, Spain
- Departament de Genètica, Universitat de València, València, Spain
85
Abstract
Despite their name, synonymous mutations have significant consequences for cellular processes in all taxa. As a result, an understanding of codon bias is central to fields as diverse as molecular evolution and biotechnology. Although recent advances in sequencing and synthetic biology have helped to resolve longstanding questions about codon bias, they have also uncovered striking patterns that suggest new hypotheses about protein synthesis. Ongoing work to quantify the dynamics of initiation and elongation is as important for understanding natural synonymous variation as it is for designing transgenes in applied contexts.
Affiliation(s)
- Joshua B Plotkin
- Department of Biology and Program in Applied Mathematics and Computational Science, University of Pennsylvania, 433 South University Avenue, Philadelphia, Pennsylvania 19104, USA.
86
Practical application of self-organizing maps to interrelate biodiversity and functional data in NGS-based metagenomics. ISME Journal 2010; 5:918-28. [PMID: 21160538 DOI: 10.1038/ismej.2010.180] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.3] [Indexed: 01/31/2023]
Abstract
Next-generation sequencing (NGS) technologies have enabled the application of broad-scale sequencing in microbial biodiversity and metagenome studies. Biodiversity is usually targeted by classifying 16S ribosomal RNA genes, while metagenomic approaches target metabolic genes. However, both approaches remain isolated as long as the taxonomic and functional information cannot be interrelated. Techniques like self-organizing maps (SOMs) have been applied to cluster metagenomes into taxon-specific bins in order to link biodiversity with functions, but have not yet been applied to broad-scale NGS-based metagenomics. Here, we provide a novel implementation, demonstrate its potential and practicability, and provide a web-based service for public usage. Evaluation with published data sets mimicking varyingly complex habitats resulted in classification specificities and sensitivities of close to 100% to above 90% from phylum to genus level for assemblies exceeding 8 kb for low and medium complexity data. When applied to five real-world metagenomes of medium complexity from direct pyrosequencing of marine subsurface waters, classifications of assemblies above 2.5 kb were in good agreement with fluorescence in situ hybridizations, indicating that biodiversity was mostly retained within the metagenomes, and confirming high classification specificities. This was validated by two protein-based classification (PBC) methods. SOMs were able to retrieve the relevant taxa down to the genus level, while surpassing PBCs in resolution. In order to make the approach accessible to a broad audience, we implemented a feature-rich web-based SOM application named TaxSOM, which is freely available at http://www.megx.net/toolbox/taxsom. TaxSOM can classify reads or assemblies exceeding 2.5 kb with high accuracy and thus assists in linking biodiversity and functions in metagenome studies, which is a precondition to study microbial ecology in a holistic fashion.
87
Zhang Z, Yu J. Modeling compositional dynamics based on GC and purine contents of protein-coding sequences. Biol Direct 2010; 5:63. [PMID: 21059261 PMCID: PMC2989939 DOI: 10.1186/1745-6150-5-63] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 07/16/2010] [Accepted: 11/08/2010] [Indexed: 12/03/2022] Open
Abstract
Background Understanding the compositional dynamics of genomes and their coding sequences is of great significance in gaining clues into molecular evolution, and a large number of publicly available genome sequences have allowed us to quantitatively predict deviations of empirical data from their theoretical counterparts. However, the quantification of theoretical compositional variations for a wide diversity of genomes remains a major challenge. Results To model the compositional dynamics of protein-coding sequences, we propose two simple models that take into account both mutation and selection effects, which act differently at the three codon positions, and use both GC and purine contents as compositional parameters. The two models concern the theoretical composition of nucleotides, codons, and amino acids, with no prerequisite of homologous sequences or their alignments. We evaluated the two models by quantifying theoretical compositions of a large collection of protein-coding sequences (including 46 of Archaea, 686 of Bacteria, and 826 of Eukarya), yielding consistent theoretical compositions across all the collected sequences. Conclusions We show that the compositions of nucleotides, codons, and amino acids are largely determined by both GC and purine contents and suggest that deviations of the observed from the expected compositions may reflect compositional signatures that arise from a complex interplay between mutation and selection via DNA replication and repair mechanisms. Reviewers This article was reviewed by Zhaolei Zhang (nominated by Mark Gerstein), Guruprasad Ananda (nominated by Kateryna Makova), and Daniel Haft.
Affiliation(s)
- Zhang Zhang
- Plant Stress Genomics Research Center, Division of Chemical and Life Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia
88
Liang H, Barakat A, Schlarbaum SE, Mandoli DF, Carlson JE. Comparison of gene order of GIGANTEA loci in yellow-poplar, monocots, and eudicots. Genome 2010; 53:533-44. [PMID: 20616875 DOI: 10.1139/g10-031] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 01/05/2023]
Abstract
GIGANTEA plays an important role in the control of circadian rhythms and photoperiodic flowering. The GIGANTEA gene has been studied in various species, but not in basal angiosperms. Moreover, to the best of our knowledge, no study of the genome organization of a basal angiosperm has yet been published. In this study, we sequenced a bacterial artificial chromosome (BAC) harboring GIGANTEA from yellow-poplar (Liriodendron tulipifera L.) and compared the genomic organization of this gene in yellow-poplar with that in other species from various angiosperm clades. This is the first report on the gene structure and organization of a large contig in any basal angiosperm species. The BAC clone, covering a region of approximately 122 kb from the yellow-poplar genome, was sequenced and assembled by coupling the 454 pyrosequencing technology with ABI capillary sequencing. In addition to GIGANTEA, the gene RPS18.A (encoding ribosomal protein S18.A) was found in this segment of the genome. We found that gene content and order in this region of the yellow-poplar genome were similar to those in the corresponding region in eudicots but not in Oryza sativa and Sorghum bicolor, implying that clustering of the GIGANTEA and RPS18.A genes is ancestral and separation of the genes occurred after the phylogenetic split of monocots from dicots. Phylogenetic analysis of GIGANTEA amino acid sequences placed yellow-poplar closer to eudicots than to monocots. In addition, evidence for transposition and large insertions and duplications was found, suggesting multiple and complex mechanisms of basal angiosperm genome evolution.
Affiliation(s)
- Haiying Liang
- School of Forest Resources and Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA.
89
Cuff WR, Duvvuri VRSK, Liang B, Duvvuri B, Wu GE, Wu J, Tsang RSW. A novel interpretation of structural dot plots of genomes derived from the analysis of two strains of Neisseria meningitidis. Genomics Proteomics & Bioinformatics 2010; 8:159-69. [PMID: 20970744 PMCID: PMC5054114 DOI: 10.1016/s1672-0229(10)60018-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
Neisseria meningitidis is the agent of invasive meningococcal disease, including cerebral meningitis and septicemia. Because the diseases caused by different clonal groups (sequence types) have their own epidemiological characteristics, it is important to understand the differences among the genomes of the N. meningitidis clonal groups. To this end, a novel interpretation of a structural dot plot of genomes was devised and applied; exact nucleotide matches between the genomes of N. meningitidis serogroup A strain Z2491 and serogroup B strain MC58 were identified, leading to the specification of various structural regions. Known and putative virulence genes for each N. meningitidis strain were then classified into these regions. We found that virulence genes of MC58 tend more to the translocated regions (chromosomal segments in new sequence contexts) than do those of Z2491, notably tending towards the interface between one of the translocated regions and the collinear region. Within the collinear region, virulence genes tend to occur within 16 kb of gaps in the exact matches. Verification of these tendencies using genes clustered in the cps locus was sufficiently supportive to suggest that these tendencies can be used to focus the search for and understanding of virulence genes and mechanisms of pathogenicity in these two organisms.
Affiliation(s)
- Wilfred R Cuff
- Public Health Agency of Canada, Canadian Science Centre for Human and Animal Health, Winnipeg, Canada.
90
Mitrofanov SI, Panchin AY, Spirin SA, Alexeevski AV, Panchin YV. Exclusive sequences of different genomes. J Bioinform Comput Biol 2010; 8:519-34. [PMID: 20556860 DOI: 10.1142/s0219720010004719] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 10/23/2009] [Revised: 12/31/2009] [Accepted: 01/16/2010] [Indexed: 11/18/2022]
Abstract
We studied the distribution of 1-7 bp words in a dataset that includes 139 complete eukaryotic genomes, 33 masked eukaryotic genomes and coding regions from 35 genomes. We tested different statistical models to determine over- and under-represented words. The method described by Karlin et al. has the strongest predictive power compared to other methods. Using this method we identified over- and under-represented words consistent within a large array of taxonomic groups. Some of those words have not yet been described as exclusive. For example, CGCG is over-represented in CG-deficient organisms. We also describe exceptions for widely known exclusive words, such as CG and TA.
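Over- and under-representation of short words is often scored with an odds ratio of observed to expected frequency. The sketch below shows the dinucleotide case, rho_xy = f_xy / (f_x f_y), which is the form commonly associated with Karlin and colleagues; it is an assumption that this matches the statistic used in the paper, the function name `dinucleotide_odds_ratios` is hypothetical, and the single-strand calculation omits the strand-symmetrized frequencies and the longer-word formulas.

```python
from collections import Counter


def dinucleotide_odds_ratios(seq):
    """Return {dinucleotide: f_xy / (f_x * f_y)} for dinucleotides that occur."""
    seq = seq.upper()
    n = len(seq)
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    ratios = {}
    for pair, count in di.items():
        f_xy = count / (n - 1)                # observed dinucleotide frequency
        f_x = mono[pair[0]] / n               # base frequencies give the expectation
        f_y = mono[pair[1]] / n
        ratios[pair] = f_xy / (f_x * f_y)
    return ratios


ratios = dinucleotide_odds_ratios("ATATATAT")
print(round(ratios["AT"], 3), round(ratios["TA"], 3))  # 2.286 1.714
```

Ratios well below 1 mark under-represented words (CG in many vertebrate genomes is the classic case) and ratios well above 1 mark over-represented ones.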
Affiliation(s)
- Sergey I Mitrofanov
- Faculty of Bioengineering and Bioinformatics, Moscow State University, Moscow, Russia.
91
Du H, Hu H, Meng Y, Zheng W, Ling F, Wang J, Zhang X, Nie Q, Wang X. The correlation coefficient of GC content of the genome-wide genes is positively correlated with animal evolutionary relationships. FEBS Lett 2010; 584:3990-3994. [PMID: 20691688 DOI: 10.1016/j.febslet.2010.08.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 06/27/2010] [Revised: 07/29/2010] [Accepted: 08/02/2010] [Indexed: 11/16/2022]
Abstract
In this study, we present a new method for evaluating animal evolutionary relationships. We used the GC% levels of genome-wide genes to determine the correlation between the GC% content and evolutionary relationship. The correlation coefficients of the GC% content of the orthologous genes of the paired animal species were calculated for a total of 21 species, and the evolutionary branching dates of these 21 species were derived from fossil records. The correlation coefficient of the GC% content of the orthologous genes of the species pair under study served as an indicator of their evolutionary relationship. Moreover, there was a decreasing linear relationship between the correlation coefficient and evolutionary branching date (R(2)=0.930).
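The quantity described here, a correlation coefficient between the GC% of orthologous genes in two species, can be sketched as follows. The abstract does not say which coefficient was used, so Pearson's r is an assumption; the function names and the toy gene dictionaries are hypothetical.

```python
from math import sqrt


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def gc_correlation(genes_a, genes_b):
    """Correlate GC content over the ortholog IDs shared by two species."""
    shared = sorted(set(genes_a) & set(genes_b))
    xs = [gc_content(genes_a[g]) for g in shared]
    ys = [gc_content(genes_b[g]) for g in shared]
    return pearson(xs, ys)


# Toy ortholog sets keyed by a shared gene identifier.
species_a = {"g1": "ATAT", "g2": "ATGC", "g3": "GGCC"}
species_b = {"g1": "ATATAT", "g2": "ATGCAT", "g3": "GCGCAT"}
print(round(gc_correlation(species_a, species_b), 6))  # 1.0
```

Repeating this for each species pair and regressing the coefficients on branching dates would give the kind of linear relationship the abstract reports.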
Affiliation(s)
- Hongli Du
- School of Bioscience and Bioengineering, South China University of Technology, Guangzhou, China
92
Tse H, Cai JJ, Tsoi HW, Lam EP, Yuen KY. Natural selection retains overrepresented out-of-frame stop codons against frameshift peptides in prokaryotes. BMC Genomics 2010; 11:491. [PMID: 20828396 PMCID: PMC2996987 DOI: 10.1186/1471-2164-11-491] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.2] [Received: 03/13/2010] [Accepted: 09/09/2010] [Indexed: 12/03/2022] Open
Abstract
Background Out-of-frame stop codons (OSCs) occur naturally in coding sequences of all organisms, providing a mechanism of early termination of translation in incorrect reading frame so that the metabolic cost associated with frameshift events can be reduced. Given such a functional significance, we expect statistically overrepresented OSCs in coding sequences as a result of a widespread selection. Accordingly, we examined available prokaryotic genomes to look for evidence of this selection. Results The complete genome sequences of 990 prokaryotes were obtained from NCBI GenBank. We found that low G+C content coding sequences contain significantly more OSCs and G+C content at specific codon positions were the principal determinants of OSC usage bias in the different reading frames. To investigate if there is overrepresentation of OSCs, we modeled the trinucleotide and hexanucleotide biases of the coding sequences using Markov models, and calculated the expected OSC frequencies for each organism using a Monte Carlo approach. More than 93% of 342 phylogenetically representative prokaryotic genomes contain excess OSCs. Interestingly the degree of OSC overrepresentation correlates positively with G+C content, which may represent a compensatory mechanism for the negative correlation of OSC frequency with G+C content. We extended the analysis using additional compositional bias models and showed that lower-order bias like codon usage and dipeptide bias could not explain the OSC overrepresentation. The degree of OSC overrepresentation was found to correlate negatively with the optimal growth temperature of the organism after correcting for the G+C% and AT skew of the coding sequence. Conclusions The present study uses approaches with statistical rigor to show that OSC overrepresentation is a widespread phenomenon among prokaryotes. 
Our results support the hypothesis that OSCs carry functional significance and have been selected in the course of genome evolution to act against unintended frameshift occurrences. Some results also hint that OSC overrepresentation is a compensatory mechanism that makes up for the decrease in OSCs in high G+C organisms, thus revealing the interplay between two different determinants of OSC frequency.
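The quantity studied in this abstract, the number of stop codons in the two shifted reading frames of a coding sequence, can be counted with a short sketch. It assumes the standard stop codons TAA, TAG, and TGA and a coding sequence given in frame 0; the function name `count_out_of_frame_stops` is hypothetical, and the Markov-model and Monte Carlo expectation step from the abstract is not reproduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)

def count_out_of_frame_stops(cds):
    """Count stop codons in the +1 and +2 frames of a coding sequence.

    `cds` is assumed to start in frame 0, so in-frame codons begin at
    positions 0, 3, 6, ... and are deliberately not counted.
    """
    cds = cds.upper()
    counts = {}
    for shift in (1, 2):
        hits = 0
        # Step through complete triplets of the shifted frame.
        for i in range(shift, len(cds) - 2, 3):
            if cds[i:i + 3] in STOP_CODONS:
                hits += 1
        counts[shift] = hits
    return counts

# ATG ATA GCT GAA has no in-frame stop, but its shifted frames do.
print(count_out_of_frame_stops("ATGATAGCTGAA"))
```

Comparing such observed counts against counts from sequences simulated under a compositional model is what would turn this tally into an overrepresentation test.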
Affiliation(s)
- Herman Tse
- Carol Yu Centre for Infection, Department of Microbiology, The University of Hong Kong, Hong Kong, China
|
93
|
Tyagi A, Bag SK, Shukla V, Roy S, Tuli R. Oligonucleotide frequencies of barcoding loci can discriminate species across kingdoms. PLoS One 2010; 5:e12330. [PMID: 20808837 PMCID: PMC2924895 DOI: 10.1371/journal.pone.0012330] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 03/25/2010] [Accepted: 07/28/2010] [Indexed: 12/04/2022]
Abstract
BACKGROUND DNA barcoding refers to the use of short DNA sequences for rapid identification of species. Genetic distance or character attributes of a particular barcode locus discriminate the species. We report an efficient approach to analyze short sequence data for discrimination between species. METHODOLOGY AND PRINCIPAL FINDINGS A new approach, Oligonucleotide Frequency Range (OFR) of barcode loci for species discrimination is proposed. OFR of the loci that discriminates between species was characteristic of a species, i.e., the maxima and minima within a species did not overlap with that of other species. We compared the species resolution ability of different barcode loci using p-distance, Euclidean distance of oligonucleotide frequencies, nucleotide-character based approach and OFR method. The species resolution by OFR was either higher or comparable to the other methods. A short fragment of 126 bp of internal transcribed spacer region in ribosomal RNA gene was sufficient to discriminate a majority of the species using OFR. CONCLUSIONS/SIGNIFICANCE Oligonucleotide frequency range of a barcode locus can discriminate between species. Ability to discriminate species using very short DNA fragments may have wider applications in forensic and conservation studies.
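The non-overlap criterion behind OFR can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: it assumes the frequency of an oligonucleotide is its count divided by the number of overlapping windows in the sequence, and that two species are separated by an oligonucleotide when their minimum-to-maximum frequency ranges do not intersect. The names `kmer_frequency`, `frequency_range`, and `ranges_overlap` are hypothetical.

```python
def kmer_frequency(seq, kmer):
    """Relative frequency of `kmer` among all overlapping windows of `seq`."""
    seq = seq.upper()
    k = len(kmer)
    windows = len(seq) - k + 1
    hits = sum(1 for i in range(windows) if seq[i:i + k] == kmer)
    return hits / windows

def frequency_range(seqs, kmer):
    """(min, max) frequency of `kmer` across the sequences of one species."""
    freqs = [kmer_frequency(s, kmer) for s in seqs]
    return min(freqs), max(freqs)

def ranges_overlap(r1, r2):
    """True if two closed intervals (lo, hi) share at least one point."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]

# Toy barcode sequences for two hypothetical species.
species_a = ["ACGACG", "ACGACGT"]
species_b = ["AAAAAA", "AAACGA"]
range_a = frequency_range(species_a, "CG")
range_b = frequency_range(species_b, "CG")
print(ranges_overlap(range_a, range_b))
```

In this toy case the dinucleotide CG separates the two species because the lowest frequency in one species still exceeds the highest frequency in the other.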
Affiliation(s)
- Antariksh Tyagi
- Center for Plant Molecular Biology, National Botanical Research Institute (Council of Scientific and Industrial Research), Lucknow, India
- Sumit K. Bag
- Center for Plant Molecular Biology, National Botanical Research Institute (Council of Scientific and Industrial Research), Lucknow, India
- Virendra Shukla
- Center for Plant Molecular Biology, National Botanical Research Institute (Council of Scientific and Industrial Research), Lucknow, India
- Sribash Roy
- Center for Plant Molecular Biology, National Botanical Research Institute (Council of Scientific and Industrial Research), Lucknow, India
- Rakesh Tuli
- Center for Plant Molecular Biology, National Botanical Research Institute (Council of Scientific and Industrial Research), Lucknow, India
|
94
|
Fox JM, Erill I. Relative codon adaptation: a generic codon bias index for prediction of gene expression. DNA Res 2010; 17:185-96. [PMID: 20453079 PMCID: PMC2885275 DOI: 10.1093/dnares/dsq012] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.6] [Indexed: 11/13/2022]
Abstract
The development of codon bias indices (CBIs) remains an active field of research due to their myriad applications in computational biology. Recently, the relative codon usage bias (RCBS) was introduced as a novel CBI able to estimate codon bias without using a reference set. The results of this new index when applied to Escherichia coli and Saccharomyces cerevisiae led the authors of the original publications to conclude that natural selection favours higher expression and enhanced codon usage optimization in short genes. Here, we show that this conclusion was flawed and based on the systematic oversight of an intrinsic bias for short sequences in the RCBS index and of biases in the small data sets used for validation in E. coli. Furthermore, we reveal how the RCBS can be corrected to produce useful results and how its underlying principle, which we here term relative codon adaptation (RCA), can be made into a powerful reference-set-based index that directly takes into account the genomic base composition. Finally, we show that RCA outperforms the codon adaptation index (CAI) as a predictor of gene expression when operating on the CAI reference set and that this improvement is significantly larger when analysing genomes with high mutational bias.
Affiliation(s)
- Jesse M Fox
- Department of Biological Sciences, University of Maryland Baltimore County (UMBC), 1000 Hilltop Road, Baltimore, MD 21228, USA
|
95
|
Xing-Tang F, Hai-Xia X, Hong C, Chun-Lei Z, Xiu-Cai H, Xue-Yuan G, Chuan-Wen G, Wang-Ping Y, Xian-Yong L. Polymorphisms of Bone Morphogenetic Protein 4 (BMP4) Gene in Goats. J Anim Vet Adv 2010; 9:907-912. [DOI: 10.3923/javaa.2010.907.912] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 01/03/2023]
|
96
|
Zhong X, Zan L, Wang H, Liu Y. Polymorphic CA microsatellites in the third exon of the bovine BMP4 gene. Genet Mol Res 2010; 9:868-74. [DOI: 10.4238/vol9-2gmr732] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/03/2022]
|
97
|
Kielak A, Rodrigues JL, Kuramae EE, Chain PS, Van Veen JA, Kowalchuk GA. Phylogenetic and metagenomic analysis of Verrucomicrobia in former agricultural grassland soil. FEMS Microbiol Ecol 2010; 71:23-33. [DOI: 10.1111/j.1574-6941.2009.00785.x] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 11/28/2022]
|
98
|
Gatherer D. Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences. Bioinform Biol Insights 2009; 1:101-26. [PMID: 20066129 PMCID: PMC2789693 DOI: 10.4137/bbi.s415] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/03/2022]
Abstract
A new algorithm is presented for vocabulary analysis (word detection) in texts of human origin. It performs at 60%-70% overall accuracy and greater than 80% accuracy for longer words, and approximately 85% sensitivity on Alice in Wonderland, a considerable improvement on previous methods. When applied to protein sequences, it detects short sequences analogous to words in human texts, i.e. intolerant to changes in spelling (mutation), and relatively context-independent in their meaning (function). Some of these are homonyms of up to 7 amino acids, which can assume different structures in different proteins. Others are ultra-conserved stretches of up to 18 amino acids within proteins of less than 40% overall identity, reflecting extreme constraint or convergent evolution. Different species are found to have qualitatively different major peptide vocabularies, e.g. some are dominated by large gene families, while others are rich in simple repeats or dominated by internally repetitive proteins. This suggests the possibility of a peptide vocabulary signature, analogous to genome signatures in DNA. Homonyms may be useful in detecting convergent evolution and positive selection in protein evolution. Ultra-conserved words may be useful in identifying structures intolerant to substitution over long periods of evolutionary time.
Affiliation(s)
- Derek Gatherer
- MRC Virology Unit, Institute of Virology, Church Street, Glasgow G11 5JR UK
|
99
|
Prakash A, Shepard SS, He J, Hart B, Chen M, Amarachintha SP, Mileyeva-Biebesheimer O, Bechtel J, Fedorov A. Evolution of genomic sequence inhomogeneity at mid-range scales. BMC Genomics 2009; 10:513. [PMID: 19891785 PMCID: PMC2779198 DOI: 10.1186/1471-2164-10-513] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 03/18/2009] [Accepted: 11/05/2009] [Indexed: 01/01/2023]
Abstract
BACKGROUND Mid-range inhomogeneity or MRI is the significant enrichment of particular nucleotides in genomic sequences extending from 30 up to several thousands of nucleotides. The best-known manifestation of MRI is CpG islands representing CG-rich regions. Recently it was demonstrated that MRI could be observed not only for G+C content but also for all other nucleotide pairings (e.g. A+G and G+T) as well as for individual bases. Various types of MRI regions are 4-20 times enriched in mammalian genomes compared to their occurrences in random models. RESULTS This paper explores how different types of mutations change MRI regions. Human, chimpanzee and Macaca mulatta genomes were aligned to study the projected effects of substitutions and indels on human sequence evolution within both MRI regions and control regions of average nucleotide composition. Over 18.8 million fixed point substitutions, 3.9 million SNPs, and indels spanning 6.9 Mb were procured and evaluated in human. They include 1.8 Mb substitutions and 1.9 Mb indels within MRI regions. Ancestral and mutant (derived) alleles for substitutions have been determined. Substitutions were grouped according to their fixation within human populations: fixed substitutions (from the human-chimp-macaca alignment), major SNPs (> 80% mutant allele frequency within humans), medium SNPs (20% - 80% mutant allele frequency), minor SNPs (3% - 20%), and rare SNPs (<3%). Data on short (< 3 bp) and medium-length (3 - 50 bp) insertions and deletions within MRI regions and appropriate control regions were analyzed for the effect of indels on the expansion or diminution of such regions as well as on changing nucleotide composition. CONCLUSION MRI regions have comparable levels of de novo mutations to the control genomic sequences with average base composition. De novo substitutions rapidly erode MRI regions, bringing their nucleotide composition toward genome-average levels. 
However, those substitutions that favor the maintenance of MRI properties have a higher chance to spread through the entire population. Indels have a clear tendency to maintain MRI features yet they have a smaller impact than substitutions. All in all, the observed fixation bias for mutations helps to preserve MRI regions during evolution.
Affiliation(s)
- Ashwin Prakash
- Program in Cardiovascular & Metabolic Diseases Track, Biomedical Sciences, Toledo, OH 43614, USA.
|
100
|
Freilich S, Goldovsky L, Gottlieb A, Blanc E, Tsoka S, Ouzounis CA. Stratification of co-evolving genomic groups using ranked phylogenetic profiles. BMC Bioinformatics 2009; 10:355. [PMID: 19860884 PMCID: PMC2775751 DOI: 10.1186/1471-2105-10-355] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 05/18/2009] [Accepted: 10/27/2009] [Indexed: 01/12/2023]
Abstract
BACKGROUND Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact on genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes. The evolutionary patterns of phylogenetic distribution of genes or proteins, represented by phylogenetic profiles, provide an alternative approach for the detection of taxonomic origins, but typically suffer from low accuracy. Herein, we present rank-BLAST, a novel approach for the assignment of protein sequences into genomic groups of the same taxonomic origin, based on the ranking order of phylogenetic profiles of target genes or proteins across the reference database. RESULTS The rank-BLAST approach is validated by computing the phylogenetic profiles of all sequences for five distinct microbial species of varying degrees of phylogenetic proximity, against a reference database of 243 fully sequenced genomes. The approach, a combination of sequence searches, statistical estimation and clustering, analyses the degree of sequence divergence between sets of protein sequences and allows the classification of protein sequences according to the species of origin with high accuracy, allowing taxonomic classification of 64% of the proteins studied. In most cases, a main cluster is detected, representing the corresponding species. Secondary, functionally distinct and species-specific clusters exhibit different patterns of phylogenetic distribution, thus flagging gene groups of interest. Detailed analyses of such cases are provided as examples. CONCLUSION Our results indicate that the rank-BLAST approach can capture the taxonomic origins of sequence collections in an accurate and efficient manner. The approach can be useful both for the analysis of genome evolution and the detection of species groups in metagenomics samples.
Affiliation(s)
- Shiri Freilich
- The Blavatnik School of Computer Sciences and School of Medicine, Tel-Aviv University, Tel-Aviv 69978, Israel.
|