1
Broni E, Miller WA. Computational Analysis Predicts Correlations among Amino Acids in SARS-CoV-2 Proteomes. Biomedicines 2023; 11:512. [PMID: 36831052 PMCID: PMC9953644 DOI: 10.3390/biomedicines11020512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Revised: 02/03/2023] [Accepted: 02/08/2023] [Indexed: 02/12/2023] Open
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a serious global challenge requiring urgent and permanent therapeutic solutions. These solutions can only be engineered if the patterns and rate of mutations of the virus can be elucidated. Predicting mutations and the structure of proteins based on these mutations has become necessary for early drug and vaccine design in anticipation of future viral mutations. The amino acid composition (AAC) of proteomes and individual viral proteins provides avenues for exploitation since AACs have previously been used to predict structure, shape and evolutionary rates. Herein, the frequencies of amino acid residues found in 1637 complete proteomes belonging to 11 SARS-CoV-2 variants/lineages were analyzed. Leucine is the most abundant amino acid residue in SARS-CoV-2, with an average AAC of 9.658%, while tryptophan is the least abundant at 1.11%. The AAC and ranking of lysine and glycine varied in the proteome. For some variants, glycine had a higher frequency and AAC than lysine, and vice versa in other variants. Tryptophan was also observed to be the most intolerant to mutation in the various proteomes for the variants used. A correlogram revealed a very strong correlation of 0.999992 between the B.1.525 (Eta) and B.1.526 (Iota) variants. Furthermore, isoleucine and threonine were observed to have a very strong negative correlation of -0.912, while cysteine and isoleucine had a very strong positive correlation of 0.835 at p < 0.001. The Shapiro-Wilk normality test revealed that AAC values for all the amino acid residues except methionine showed no evidence of non-normality at p < 0.05. Thus, AACs of SARS-CoV-2 variants can be predicted using probability and z-scores. AACs may be beneficial in classifying viral strains, predicting viral disease types, members of protein families and protein interactions, and for diagnostic purposes. They may also be used as a feature along with other crucial factors in machine-learning based algorithms to predict viral mutations. These mutation-predicting algorithms may help in developing effective therapeutics and vaccines for SARS-CoV-2.
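The two quantities this abstract relies on, amino acid composition (AAC) and the correlation between composition profiles, can be illustrated with a minimal sketch. The function names `aac` and `pearson` and the toy sequences are assumptions made for illustration; the abstract does not describe the authors' actual pipeline, and Pearson correlation is assumed here as the correlogram statistic.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def aac(proteins):
    """Return percent composition of each standard residue over a list of protein sequences."""
    counts = Counter()
    for seq in proteins:
        # Ignore ambiguity codes such as X or B so the percentages sum to 100.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acid residues found")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy "proteomes" (not real SARS-CoV-2 sequences).
variant_a = aac(["MLLKGWLA", "LLAGTI"])
variant_b = aac(["MLLKGWLA", "LLAGTV"])
r = pearson([variant_a[aa] for aa in AMINO_ACIDS],
            [variant_b[aa] for aa in AMINO_ACIDS])
print(f"correlation between toy variants: {r:.4f}")
```

The same `pearson` helper could be applied across proteomes for a pair of residues (for example, isoleucine versus threonine percentages) to obtain residue-residue correlations of the kind reported above.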
Affiliation(s)
- Emmanuel Broni
  - Department of Medicine, Loyola University Medical Center, Loyola University Chicago, Maywood, IL 60153, USA
- Whelton A. Miller
  - Department of Medicine, Loyola University Medical Center, Loyola University Chicago, Maywood, IL 60153, USA
  - Department of Molecular Pharmacology & Neuroscience, Loyola University Medical Center, Loyola University Chicago, Maywood, IL 60153, USA
2
Determination of the Amino Acid Recruitment Order in Early Life by Genome-Wide Analysis of Amino Acid Usage Bias. Biomolecules 2022; 12:171. [PMID: 35204672 PMCID: PMC8961565 DOI: 10.3390/biom12020171] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/21/2021] [Revised: 01/14/2022] [Accepted: 01/18/2022] [Indexed: 12/11/2022] Open
Abstract
The mechanisms that shaped the pattern of amino acid recruitment into proteins during early life history remain a huge mystery. In this study, we conducted genome-wide analyses of amino acid usage and codon structure in 7270 species across the three domains of life. The analyses revealed a ubiquitous amino acid usage bias that was likely independent of codon usage bias. Taking advantage of codon usage bias, we performed pseudotime analysis to re-determine the chronological order of species emergence, which suggested a new species relationship by tracing the imprint of codon usage evolution. Furthermore, multidimensional data integration showed that the amino acids A, D, E, G, L, P, R, S, T and V might have been the first recruited into the proteins of the last universal common ancestor (LUCA). The data analysis also indicated that the remaining amino acids were most probably incorporated gradually into proteogenesis along two long-timescale parallel evolutionary routes: I→F→Y→C→M→W and K→N→Q→H. This study provides new insight into the origin of life, particularly in terms of the basic protein composition of early life. Our work provides crucial information that will help in further understanding protein structure and function in relation to their evolutionary history.
3
The Mutational Robustness of the Genetic Code and Codon Usage in Environmental Context: A Non-Extremophilic Preference? Life (Basel) 2021; 11:773. [PMID: 34440517 PMCID: PMC8398314 DOI: 10.3390/life11080773] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/16/2021] [Revised: 07/23/2021] [Accepted: 07/28/2021] [Indexed: 12/12/2022] Open
Abstract
The genetic code evolved, to some extent, to minimize the effects of mutations. The effects of mutations depend on the amino acid repertoire, the structure of the genetic code and the frequencies of amino acids in proteomes. The amino acid compositions of proteins and the corresponding codon usages are still under selection, which allows us to ask what kind of environment the standard genetic code is adapted to. Using simple computational models and comprehensive datasets comprising genomic and environmental data from all three domains of Life, we estimate the expected severity of non-synonymous genomic mutations in proteins, measured by the change in amino acid physicochemical properties. We show that the fidelity in these physicochemical properties is expected to deteriorate with extremophilic codon usages, especially in thermophiles. These findings suggest that the genetic code performs better under non-extremophilic conditions, which explains the low substitution rates encountered in halophiles and thermophiles. In addition, the revealed relationship between the genetic code and habitat allows us to ponder earlier phases in the history of Life.
4
Akhter S, Aziz RK, Kashef MT, Ibrahim ES, Bailey B, Edwards RA. Kullback Leibler divergence in complete bacterial and phage genomes. PeerJ 2017; 5:e4026. [PMID: 29204318 PMCID: PMC5712468 DOI: 10.7717/peerj.4026] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/30/2017] [Accepted: 10/22/2017] [Indexed: 12/11/2022] Open
Abstract
The amino acid content of the proteins encoded by a genome may predict the coding potential of that genome and may reflect lifestyle restrictions of the organism. Here, we calculated the Kullback–Leibler divergence from the mean amino acid content as a metric to compare the amino acid composition for a large set of bacterial and phage genome sequences. Using these data, we demonstrate that (i) there is a significant difference between amino acid utilization in different phylogenetic groups of bacteria and phages; (ii) many of the bacteria with the most skewed amino acid utilization profiles, or the bacteria that host phages with the most skewed profiles, are endosymbionts or parasites; (iii) the skews in the distribution are not restricted to certain metabolic processes but are common across all bacterial genomic subsystems; (iv) amino acid utilization profiles strongly correlate with GC content in bacterial genomes but very weakly correlate with the G+C percent in phage genomes. These findings might be exploited to distinguish coding from non-coding sequences in large data sets, such as metagenomic sequence libraries, to help in prioritizing subsequent analyses.
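As a rough illustration of the metric named in this abstract, the sketch below computes the Kullback-Leibler divergence of each genome's amino acid distribution from the mean distribution. The direction D(genome || mean), the base-2 logarithm, the helper names, and the toy three-letter alphabet are all assumptions; the abstract does not specify these details.

```python
from math import log2


def mean_profile(profiles):
    """Element-wise mean of several frequency dicts that share the same keys."""
    n = len(profiles)
    return {k: sum(p[k] for p in profiles) / n for k in profiles[0]}


def kl_divergence(p, q):
    """D(p || q) in bits; terms with p[k] == 0 contribute nothing by convention."""
    return sum(p[k] * log2(p[k] / q[k]) for k in p if p[k] > 0)


# Hypothetical frequency profiles over a reduced alphabet, for illustration only.
genomes = {
    "typical": {"A": 0.40, "L": 0.35, "K": 0.25},
    "skewed": {"A": 0.10, "L": 0.20, "K": 0.70},
}
mean = mean_profile(list(genomes.values()))
divergence = {name: kl_divergence(p, mean) for name, p in genomes.items()}
for name, d in divergence.items():
    print(f"{name}: {d:.4f} bits")
```

Because the mean profile is nonzero wherever any genome's profile is nonzero, the ratio inside the logarithm is always defined in this setup.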
Affiliation(s)
- Sajia Akhter
  - Computational Science Research Center, San Diego State University, San Diego, CA, USA
- Ramy K Aziz
  - Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt
  - Department of Computer Science, San Diego State University, San Diego, CA, United States of America
- Mona T Kashef
  - Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt
- Eslam S Ibrahim
  - Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt
- Barbara Bailey
  - Department of Mathematics & Statistics, San Diego State University, San Diego, CA, USA
- Robert A Edwards
  - Computational Science Research Center, San Diego State University, San Diego, CA, USA
  - Department of Computer Science, San Diego State University, San Diego, CA, United States of America
  - Department of Mathematics & Statistics, San Diego State University, San Diego, CA, USA
  - Department of Biology, San Diego State University, San Diego, CA, USA
5
Higgs PG, Hao W, Golding GB. Identification of Conflicting Selective Effects on Highly Expressed Genes. Evol Bioinform Online 2017. [DOI: 10.1177/117693430700300015] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022] Open
Abstract
Many different selective effects on DNA and proteins influence the frequency of codons and amino acids in coding sequences. Selection is often stronger on highly expressed genes. Hence, by comparing high- and low-expression genes it is possible to distinguish the factors that are selected by evolution. It has been proposed that highly expressed genes should (i) preferentially use codons matching abundant tRNAs (translational efficiency), (ii) preferentially use amino acids with low cost of synthesis, (iii) be under stronger selection to maintain the required amino acid content, and (iv) be selected for translational robustness. These effects act simultaneously and can be contradictory. We develop a model that combines these factors, and use Akaike's Information Criterion for model selection. We consider pairs of paralogues that arose by whole-genome duplication in Saccharomyces cerevisiae. A codon-based model is used that includes asymmetric effects due to selection on highly expressed genes. The largest effect is translational efficiency, which is found to strongly influence synonymous, but not non-synonymous rates. Minimization of the cost of amino acid synthesis is implicated. However, when a more general measure of selection for amino acid usage is used, the cost minimization effect becomes redundant. Small effects that we attribute to selection for translational robustness can be identified as an improvement in the model fit on top of the effects of translational efficiency and amino acid usage.
Affiliation(s)
- Paul G. Higgs
  - Department of Physics and Astronomy, McMaster University, Hamilton, Ontario L8S 4M1
- Weilong Hao
  - Department of Biology, McMaster University, Hamilton, Ontario L8S 4K1
- G. Brian Golding
  - Department of Biology, McMaster University, Hamilton, Ontario L8S 4K1
6
Pathak J, Kannaujiya VK, Singh SP, Sinha RP. Codon usage analysis of photolyase encoding genes of cyanobacteria inhabiting diverse habitats. 3 Biotech 2017; 7:192. [PMID: 28664377 DOI: 10.1007/s13205-017-0826-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/06/2017] [Accepted: 05/31/2017] [Indexed: 12/17/2022] Open
Abstract
Nucleotide and amino acid compositions were studied to determine the genomic and structural relationship of photolyase gene in freshwater, marine and hot spring cyanobacteria. Among three habitats, photolyase encoding genes from hot spring cyanobacteria were found to have highest GC content. The genomic GC content was found to influence the codon usage and amino acid variability in photolyases. The third position of codon was found to have more effect on amino acid variability in photolyases than the first and second positions of codon. The variation of amino acids Ala, Asp, Glu, Gly, His, Leu, Pro, Gln, Arg and Val in photolyases of three different habitats was found to be controlled by first position of codon (G1C1). However, second position (G2C2) of codon regulates variation of Ala, Cys, Gly, Pro, Arg, Ser, Thr and Tyr contents in photolyases. Third position (G3C3) of codon controls incorporation of amino acids such as Ala, Phe, Gly, Leu, Gln, Pro, Arg, Ser, Thr and Tyr in photolyases from three habitats. Photolyase encoding genes of hot spring cyanobacteria have 85% codons with G or C at third position, whereas marine and freshwater cyanobacteria showed 82 and 60% codons, respectively, with G or C at third position. Principal component analysis (PCA) showed that GC content has a profound effect in separating the genes along the first major axis according to their RSCU (relative synonymous codon usage) values, and neutrality analysis indicated that mutational pressure has resulted in codon bias in photolyase genes of cyanobacteria.
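The position-specific GC measures used in this abstract (G1C1, G2C2, G3C3) can be sketched as the percentage of G or C at each of the three codon positions of a coding sequence. The function name and the toy sequence below are illustrative assumptions, and the sketch assumes the sequence is already in frame.

```python
def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3): percent G or C at codon positions 1, 2 and 3."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon, if any
    if usable == 0:
        raise ValueError("sequence must contain at least one complete codon")
    result = []
    for pos in range(3):
        bases = cds[pos:usable:3]  # every third base, starting at this codon position
        gc = sum(1 for b in bases if b in "GC")
        result.append(100.0 * gc / len(bases))
    return tuple(result)


# Hypothetical in-frame fragment: codons ATG, GCC, GGC.
gc1, gc2, gc3 = gc_by_codon_position("ATGGCCGGC")
print(f"GC1={gc1:.1f}% GC2={gc2:.1f}% GC3={gc3:.1f}%")
```

Applied gene by gene, the third value corresponds to the kind of figure quoted above for codons with G or C at the third position.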
Affiliation(s)
- Jainendra Pathak
  - Laboratory of Photobiology and Molecular Microbiology, Centre of Advanced Study in Botany, Institute of Science, Banaras Hindu University, Varanasi, 221005, India
- Vinod K Kannaujiya
  - Laboratory of Photobiology and Molecular Microbiology, Centre of Advanced Study in Botany, Institute of Science, Banaras Hindu University, Varanasi, 221005, India
- Shailendra P Singh
  - Laboratory of Photobiology and Molecular Microbiology, Centre of Advanced Study in Botany, Institute of Science, Banaras Hindu University, Varanasi, 221005, India
- Rajeshwar P Sinha
  - Laboratory of Photobiology and Molecular Microbiology, Centre of Advanced Study in Botany, Institute of Science, Banaras Hindu University, Varanasi, 221005, India
7
Goncearenco A, Berezovsky IN. The fundamental tradeoff in genomes and proteomes of prokaryotes established by the genetic code, codon entropy, and physics of nucleic acids and proteins. Biol Direct 2014; 9:29. [PMID: 25496919 PMCID: PMC4273451 DOI: 10.1186/s13062-014-0029-2] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 09/03/2014] [Accepted: 12/01/2014] [Indexed: 11/26/2022] Open
Abstract
Background: Mutations in nucleotide sequences provide a foundation for genetic variability, and selection is the driving force of the evolution and molecular adaptation. Despite considerable progress in the understanding of selective forces and their compositional determinants, the very nature of underlying mutational biases remains unclear. Results: We explore here a fundamental tradeoff, which analytically describes mutual adjustment of the nucleotide and amino acid compositions and its possible effect on the mutational biases. The tradeoff is determined by the interplay between the genetic code, optimization of the codon entropy, and demands on the structure and stability of nucleic acids and proteins. Conclusion: The tradeoff is the unifying property of all prokaryotes regardless of the differences in their phylogenies, life styles, and extreme environments. It underlies mutational biases characteristic for genomes with different nucleotide and amino acid compositions, providing foundation for evolution and adaptation. Reviewers: This article was reviewed by Eugene Koonin, Michael Gromiha, and Alexander Schleiffer. Electronic supplementary material: The online version of this article (doi:10.1186/s13062-014-0029-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Alexander Goncearenco
  - Computational Biology Unit and Department of Informatics, University of Bergen, N-5008, Bergen, Norway
  - Current address: Computational Biology Branch of the National Center for Biotechnology Information in Bethesda, Maryland, USA
- Igor N Berezovsky
  - Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore
  - Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, 117597, Singapore, Singapore
8
GC constituents and relative codon expressed amino acid composition in cyanobacterial phycobiliproteins. Gene 2014; 546:162-71. [PMID: 24933001 DOI: 10.1016/j.gene.2014.06.024] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 10/15/2013] [Revised: 04/17/2014] [Accepted: 06/12/2014] [Indexed: 02/01/2023]
Abstract
The genomic as well as structural relationship of phycobiliproteins (PBPs) in different cyanobacterial species are determined by nucleotides as well as amino acid composition. The genomic GC constituents influence the amino acid variability and codon usage of particular subunit of PBPs. We have analyzed 11 cyanobacterial species to explore the variation of amino acids and causal relationship between GC constituents and codon usage. The study at the first, second and third levels of GC content showed relatively more amino acid variability on the levels of G3+C3 position in comparison to the first and second positions. The amino acid encoded GC rich level including G rich and C rich or both correlate the codon variability and amino acid availability. The fluctuation in amino acids such as Arg, Ala, His, Asp, Gly, Leu and Glu in α and β subunits was observed at G1C1 position; however, fluctuation in other amino acids such as Ser, Thr, Cys and Trp was observed at G2C2 position. The coding selection pressure of amino acids such as Ala, Thr, Tyr, Asp, Gly, Ile, Leu, Asn, and Ser in α and β subunits of PBPs was more elaborated at G3C3 position. In this study, we observed that each subunit of PBPs is codon specific for particular amino acid. These results suggest that genomic constraint linked with GC constituents selects the codon for particular amino acids and furthermore, the codon level study may be a novel approach to explore many problems associated with genomics and proteomics of cyanobacteria.
9
Goncearenco A, Ma BG, Berezovsky IN. Molecular mechanisms of adaptation emerging from the physics and evolution of nucleic acids and proteins. Nucleic Acids Res 2013; 42:2879-92. [PMID: 24371267 PMCID: PMC3950714 DOI: 10.1093/nar/gkt1336] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Indexed: 11/15/2022] Open
Abstract
DNA, RNA and proteins are major biological macromolecules that coevolve and adapt to environments as components of one highly interconnected system. We explore here sequence/structure determinants of mechanisms of adaptation of these molecules, links between them, and results of their mutual evolution. We complemented statistical analysis of genomic and proteomic sequences with folding simulations of RNA molecules, unraveling causal relations between compositional and sequence biases reflecting molecular adaptation on DNA, RNA and protein levels. We found many compositional peculiarities related to environmental adaptation and the life style. Specifically, thermal adaptation of protein-coding sequences in Archaea is characterized by a stronger codon bias than in Bacteria. Guanine and cytosine load in the third codon position is important for supporting the aerobic life style, and it is highly pronounced in Bacteria. The third codon position also provides a tradeoff between arginine and lysine, which are favorable for thermal adaptation and aerobicity, respectively. Dinucleotide composition provides stability of nucleic acids via strong base-stacking in ApG dinucleotides. In relation to coevolution of nucleic acids and proteins, thermostability-related demands on the amino acid composition affect the nucleotide content in the second codon position in Archaea.
Affiliation(s)
- Alexander Goncearenco
  - CBU, University of Bergen, 5020 Bergen, Norway
  - Department of Informatics, University of Bergen, 5020 Bergen, Norway
  - Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671 Singapore
  - Department of Biological Chemistry, Weizmann Institute of Science, Rehovot, 76100, Israel
10
Chen W, Shao Y, Chen F. Evolution of complete proteomes: guanine-cytosine pressure, phylogeny and environmental influences blend the proteomic architecture. BMC Evol Biol 2013; 13:219. [PMID: 24088322 PMCID: PMC3850711 DOI: 10.1186/1471-2148-13-219] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 03/21/2013] [Accepted: 10/01/2013] [Indexed: 11/18/2022] Open
Abstract
Background: Guanine-cytosine (GC) composition is an important feature of genomes. Likewise, amino acid composition is a distinct, but less valued, feature of proteomes. A major concern is that it is not clear what valuable information can be acquired from amino acid composition data. To address this concern, in-depth analyses of the amino acid composition of the complete proteomes from 63 archaea, 270 bacteria, and 128 eukaryotes were performed. Results: Principal component analysis of the amino acid matrices showed that the main contributors to proteomic architecture were genomic GC variation, phylogeny, and environmental influences. GC pressure drove positive selection on Ala, Arg, Gly, Pro, Trp, and Val, and adverse selection on Asn, Lys, Ile, Phe, and Tyr. The physico-chemical framework of the complete proteomes withstood GC pressure by frequency complementation of GC-dependent amino acid pairs with similar physico-chemical properties. Gln, His, Ser, and Val were responsible for phylogeny and their constituted components could differentiate archaea, bacteria, and eukaryotes. Environmental niche was also a significant factor in determining proteomic architecture, especially for archaea for which the main amino acids were Cys, Leu, and Thr. In archaea, hyperthermophiles, acidophiles, mesophiles, psychrophiles, and halophiles gathered successively along the environment-based principal component. Concordance between proteomic architecture and the genetic code was also related closely to genomic GC content, phylogeny, and lifestyles. Conclusions: Large-scale analyses of the complete proteomes of a wide range of organisms suggested that amino acid composition retained the trace of GC variation, phylogeny, and environmental influences during evolution. The findings from this study will help in the development of a global understanding of proteome evolution, and even biological evolution.
Affiliation(s)
- Wanping Chen
  - Key Laboratory of Environment Correlative Dietology, Huazhong Agricultural University, Wuhan, Hubei Province 430070, China
11
Zhang Z, Yu J. Modeling compositional dynamics based on GC and purine contents of protein-coding sequences. Biol Direct 2010; 5:63. [PMID: 21059261 PMCID: PMC2989939 DOI: 10.1186/1745-6150-5-63] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 07/16/2010] [Accepted: 11/08/2010] [Indexed: 12/03/2022] Open
Abstract
Background: Understanding the compositional dynamics of genomes and their coding sequences is of great significance in gaining clues into molecular evolution, and a large number of publicly available genome sequences have allowed us to quantitatively predict deviations of empirical data from their theoretical counterparts. However, the quantification of theoretical compositional variations for a wide diversity of genomes remains a major challenge. Results: To model the compositional dynamics of protein-coding sequences, we propose two simple models that take into account both mutation and selection effects, which act differently at the three codon positions, and use both GC and purine contents as compositional parameters. The two models concern the theoretical composition of nucleotides, codons, and amino acids, with no prerequisite of homologous sequences or their alignments. We evaluated the two models by quantifying theoretical compositions of a large collection of protein-coding sequences (including 46 of Archaea, 686 of Bacteria, and 826 of Eukarya), yielding consistent theoretical compositions across all the collected sequences. Conclusions: We show that the compositions of nucleotides, codons, and amino acids are largely determined by both GC and purine contents and suggest that deviations of the observed from the expected compositions may reflect compositional signatures that arise from a complex interplay between mutation and selection via DNA replication and repair mechanisms. Reviewers: This article was reviewed by Zhaolei Zhang (nominated by Mark Gerstein), Guruprasad Ananda (nominated by Kateryna Makova), and Daniel Haft.
Affiliation(s)
- Zhang Zhang
  - Plant Stress Genomics Research Center, Division of Chemical and Life Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia
12
Gorban AN, Zinovyev AY. The mystery of two straight lines in bacterial genome statistics. Bull Math Biol 2007; 69:2429-42. [PMID: 17577600 DOI: 10.1007/s11538-007-9229-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/09/2006] [Accepted: 05/04/2007] [Indexed: 10/23/2022]
Abstract
In special coordinates (codon position-specific nucleotide frequencies), bacterial genomes form two straight lines in 9-dimensional space: one line for eubacterial genomes, another for archaeal genomes. All 348 distinct bacterial genomes available in GenBank in April 2007 belong to these lines with high accuracy. The main challenge now is to explain the observed high accuracy. A new phenomenon of complementary symmetry for codon position-specific nucleotide frequencies is observed. The results of an analysis of several codon usage models are presented. We demonstrate that the mean-field approximation, which is also known as the context-free model, the complete independence model, or the Segre variety, can serve as a reasonable approximation to the real codon usage. The first two principal components of codon usage correlate strongly with genomic G+C content and the optimal growth temperature, respectively. The variation of codon usage along the third component is related to the curvature of the mean-field approximation. The first three eigenvalues in the codon usage PCA explain 59.1%, 7.8% and 4.7% of the variation. The codon usage of the eubacterial and archaeal genomes is clearly distributed along two third-order curves with genomic G+C content as a parameter.
13
White HB, Dhurjati P. Evolution of protein lipograms: A bioinformatics problem. Biochem Mol Biol Educ 2006; 34:262-266. [PMID: 21638688 DOI: 10.1002/bmb.2006.494034042635] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/30/2023]
Abstract
A protein lacking one of the 20 common amino acids is a protein lipogram. This open-ended problem-based learning assignment deals with the evolution of proteins with biased amino acid composition. It has students query protein and metabolic databases to test the hypothesis that natural selection has reduced the frequency of each amino acid specifically in the enzymes required for its biosynthesis. Student groups work in parallel on different amino acids and share strategies. Aside from content objectives that integrate knowledge of protein structure, function, synthesis, and evolution, this problem incorporates oral and written presentations, statistical analysis, and substantial decision making. The point of the problem described here is that a deficiency or absence of a particular amino acid in a protein may be more than a chance occurrence and may be driven by natural selection. The challenge is to demonstrate the difference.
Affiliation(s)
- Harold B White
  - Departments of Chemistry and Biochemistry, University of Delaware, Newark, Delaware 19716
14
Foerstner KU, von Mering C, Hooper SD, Bork P. Environments shape the nucleotide composition of genomes. EMBO Rep 2006; 6:1208-13. [PMID: 16200051 PMCID: PMC1369203 DOI: 10.1038/sj.embor.7400538] [Citation(s) in RCA: 202] [Impact Index Per Article: 10.6] [Received: 07/07/2005] [Revised: 08/15/2005] [Accepted: 08/19/2005] [Indexed: 11/09/2022] Open
Abstract
To test the impact of environments on genome evolution, we analysed the relative abundance of the nucleotides guanine and cytosine ('GC content') of large numbers of sequences from four distinct environmental samples (ocean surface water, farm soil, an acidophilic mine drainage biofilm and deep-sea whale carcasses). We show that the GC content of complex microbial communities seems to be globally and actively influenced by the environment. The observed nucleotide compositions cannot be easily explained by distinct phylogenetic origins of the species in the environments; the genomic GC content may change faster than was previously thought, and is also reflected in the amino-acid composition of the proteins in these habitats.
Affiliation(s)
- Konrad U Foerstner
  - European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- Christian von Mering
  - European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- Sean D Hooper
  - European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- Peer Bork
  - European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
  - Max-Delbrück Centre for Molecular Medicine, Robert-Rössle-Strasse 10, 13092 Berlin, Germany
  - Tel: +49 6221 387 8526; Fax: +49 6221 387 517
15
Urbina D, Tang B, Higgs PG. The response of amino acid frequencies to directional mutation pressure in mitochondrial genome sequences is related to the physical properties of the amino acids and to the structure of the genetic code. J Mol Evol 2006; 62:340-61. [PMID: 16477524 DOI: 10.1007/s00239-005-0051-1] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 03/07/2005] [Accepted: 10/01/2005] [Indexed: 11/29/2022]
Abstract
The frequencies of A, C, G, and T in mitochondrial DNA vary among species due to unequal rates of mutation between the bases. The frequencies of bases at fourfold degenerate sites respond directly to mutation pressure. At first and second positions, selection reduces the degree of frequency variation. Using a simple evolutionary model, we show that first position sites are less constrained by selection than second position sites and, therefore, that the frequencies of bases at first position are more responsive to mutation pressure than those at second position. We define a measure of distance between amino acids that is dependent on eight measured physical properties and a similarity measure that is the inverse of this distance. Columns 1, 2, 3, and 4 of the genetic code correspond to codons with U, C, A, and G in their second position, respectively. The similarity of amino acids in the four columns decreases systematically from column 1 to column 2 to column 3 to column 4. We then show that the responsiveness of first position bases to mutation pressure is dependent on the second position base and follows the same decreasing trend through the four columns. Again, this shows the correlation between physical properties and responsiveness. We determine a proximity measure for each amino acid, which is the average similarity between an amino acid and all others that are accessible via single point mutations in the mitochondrial genetic code structure. We also define a responsiveness for each amino acid, which measures how rapidly an amino acid frequency changes as a result of mutation pressure acting on the base frequencies. We show that there is a strong correlation between responsiveness and proximity, and that both these quantities are also correlated with the mutability of amino acids estimated from the mtREV substitution rate matrix. We also consider the variation of base frequencies between strands and between genes on a strand. 
These trends are consistent with the patterns expected from analysis of the variation among genomes.
Affiliation(s)
- Daniel Urbina
- Department of Physics and Astronomy, McMaster University, Hamilton, Ontario, Canada
16
Yin C, Yau SST. A Fourier characteristic of coding sequences: origins and a non-Fourier approximation. J Comput Biol 2006; 12:1153-65. [PMID: 16305326 DOI: 10.1089/cmb.2005.12.1153] [Citation(s) in RCA: 71] [Impact Index Per Article: 3.7] [Indexed: 11/13/2022] Open
Abstract
The 3-base periodicity, identified as a pronounced peak at the frequency N/3 (N is the length of the DNA sequence) of the Fourier power spectrum of protein coding regions, is used as a marker in gene-finding algorithms to distinguish protein coding regions (exons) and noncoding regions (introns) of genomes. In this paper, we reveal the explanation of this phenomenon which results from a nonuniform distribution of nucleotides in the three coding positions. There is a linear correlation between the nucleotide distributions in the three codon positions and the power spectrum at the frequency N/3. Furthermore, this study indicates the relationship between the length of a DNA sequence and the variance of nucleotide distributions and the average Fourier power spectrum, which is the noise signal in gene-finding methods. The results presented in this paper provide an efficient way to compute the Fourier power spectrum at N/3 and the noise signal in gene-finding methods by calculating the nucleotide distributions in the three codon positions.
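The relationship described in this abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so the identity below is an assumption about what is meant: for binary indicator sequences of A, C, G and T, the Fourier power at k = N/3 depends only on how often each nucleotide occurs at each of the three codon positions. The function names are hypothetical, and the sequence length is assumed to be a multiple of 3.

```python
import cmath
from collections import Counter

NUCLEOTIDES = "ACGT"


def power_at_one_third_dft(seq):
    """Fourier power at k = N/3, summed over the four indicator sequences.

    Computed directly from the DFT definition (assumes len(seq) % 3 == 0).
    """
    n = len(seq)
    if n == 0 or n % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    k = n // 3
    total = 0.0
    for base in NUCLEOTIDES:
        # DFT coefficient of the 0/1 indicator sequence for this base.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * k * i / n)
            for i, b in enumerate(seq)
            if b == base
        )
        total += abs(coeff) ** 2
    return total


def power_at_one_third_counts(seq):
    """Same quantity, computed only from per-codon-position nucleotide counts."""
    n = len(seq)
    if n == 0 or n % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    counts = Counter((b, i % 3) for i, b in enumerate(seq))
    total = 0
    for base in NUCLEOTIDES:
        f1, f2, f3 = (counts[(base, pos)] for pos in range(3))
        # |f1 + f2*w + f3*w^2|^2 with w = exp(-2*pi*i/3) expands to this form.
        total += f1 * f1 + f2 * f2 + f3 * f3 - f1 * f2 - f1 * f3 - f2 * f3
    return total
```

Both functions should agree up to floating-point error on any sequence of A, C, G and T whose length is a multiple of 3. The count-based version avoids the transform entirely, which is the kind of shortcut the abstract alludes to.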
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, Chicago, IL 60607-7045, USA
17
Bharanidharan D, Gautham N. Amino acid variation in cellular processes in 108 bacterial proteomes. Arch Microbiol 2005; 184:168-74. [PMID: 16205912 DOI: 10.1007/s00203-005-0034-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 01/11/2005] [Revised: 08/23/2005] [Accepted: 08/29/2005] [Indexed: 11/28/2022]
Abstract
We have analysed 108 bacterial proteomes in the KEGG database to explore the variation of amino acid composition with respect to protein function. The ratio between the observed amino acid composition and that predicted based on mononucleotide composition was calculated for each functional category. This indicated whether the compositional variation arose from mutation or selection pressure. The results showed that charged amino acids (Lys, Arg and Glu) were found more frequently than expected in proteins involved in genetic information processing (i.e. transcription, translation, etc.). Similarly, in the proteins involved in processing environmental information (e.g. signal transduction), the hydrophobic amino acid Leu was found in excess of values expected from the base composition in the genes.
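The observed-to-predicted ratio mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the expected frequency of an amino acid is taken as the summed product of base frequencies over its codons in the standard genetic code, renormalised over the 61 sense codons. The function names are hypothetical, and the protein is assumed to contain only the 20 standard one-letter codes.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def expected_aac(base_freqs):
    """Expected amino acid frequencies given mononucleotide frequencies.

    Assumption: the three codon positions are independent and share the same
    base frequencies; stop codons are excluded before normalising.
    """
    weights = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        weight = 1.0
        for base in codon:
            weight *= base_freqs[base]
        weights[aa] = weights.get(aa, 0.0) + weight
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def observed_over_expected(protein, base_freqs):
    """Ratio of observed to expected frequency for each amino acid."""
    expected = expected_aac(base_freqs)
    counts = Counter(protein)
    length = len(protein)
    return {aa: (counts[aa] / length) / expected[aa] for aa in expected}
```

Under this sketch, a ratio above 1 means the amino acid is more frequent than base composition alone would predict, which the abstract interprets as a sign of selection rather than mutation pressure.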
Affiliation(s)
- Devarajan Bharanidharan
- Department of Crystallography and Biophysics, University of Madras, Guindy Campus, 600025 Chennai, India
18
Berezovsky IN, Chen WW, Choi PJ, Shakhnovich EI. Entropic stabilization of proteins and its proteomic consequences. PLoS Comput Biol 2005; 1:e47. [PMID: 16201009 PMCID: PMC1239905 DOI: 10.1371/journal.pcbi.0010047] [Citation(s) in RCA: 74] [Impact Index Per Article: 3.7] [Received: 06/22/2005] [Accepted: 09/01/2005] [Indexed: 11/18/2022] Open
Abstract
Evolutionary traces of thermophilic adaptation are manifest, on the whole-genome level, in compositional biases toward certain types of amino acids. However, it is sometimes difficult to discern their causes without a clear understanding of underlying physical mechanisms of thermal stabilization of proteins. For example, it is well-known that hyperthermophiles feature a greater proportion of charged residues, but, surprisingly, the excess of positively charged residues is almost entirely due to lysines but not arginines in the majority of hyperthermophilic genomes. All-atom simulations show that lysines have a much greater number of accessible rotamers than arginines of similar degree of burial in folded states of proteins. This finding suggests that lysines would preferentially entropically stabilize the native state. Indeed, we show in computational experiments that arginine-to-lysine amino acid substitutions result in noticeable stabilization of proteins. We then hypothesize that if evolution uses this physical mechanism as a complement to electrostatic stabilization in its strategies of thermophilic adaptation, then hyperthermostable organisms would have much greater content of lysines in their proteomes than comparably sized and similarly charged arginines. Consistent with that, high-throughput comparative analysis of complete proteomes shows extremely strong bias toward arginine-to-lysine replacement in hyperthermophilic organisms and overall much greater content of lysines than arginines in hyperthermophiles. This finding cannot be explained by genomic GC compositional biases or by the universal trend of amino acid gain and loss in protein evolution. We discovered here a novel entropic mechanism of protein thermostability due to residual dynamics of rotamer isomerization in native state and demonstrated its immediate proteomic implications. 
Our study provides an example of how analysis of a fundamental physical mechanism of thermostability helps to resolve a puzzle in comparative genomics as to why amino acid compositions of hyperthermophilic proteomes are significantly biased toward lysines but not similarly charged arginines. Comparative genomics sends us profound signals that are not easy to understand. For example, it is well known that proteins from hyperthermophiles are enriched with charged residues, but it has been a mystery why enrichment in positively charged amino acids is almost entirely due to lysines at the expense of very similar arginines. Here, the authors show that lysines (in contrast to arginines) exhibit significant residual dynamics in folded states of proteins, making the entropic cost to fold lysine-rich proteins less unfavorable compared with arginine-rich ones. Therefore, replacements of arginines by lysines provide additional thermal stabilization of proteins via entropic mechanism, making them positively charged residues of choice for evolutionary optimization of hyperthermostable proteins. Apparently, natural selection uses diverse physical mechanisms of thermal stability to achieve adaptation. This study provides an example of how better understanding of protein physics can help in solving genomic mysteries.
Affiliation(s)
- Igor N Berezovsky
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts, United States of America
- William W Chen
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts, United States of America
- Department of Biophysics, Harvard University, Cambridge, Massachusetts, United States of America
- Paul J Choi
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts, United States of America
- Eugene I Shakhnovich
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts, United States of America
- * To whom correspondence should be addressed. E-mail:
19
Najafabadi HS, Goodarzi H. Correspondence regarding Bharanidharan et al., "Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species". Biochem Biophys Res Commun 2005; 325:1-2. [PMID: 15522192 DOI: 10.1016/j.bbrc.2004.09.183] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/18/2004] [Indexed: 10/26/2022]
Abstract
Bharanidharan et al. [Biochem. Biophys. Res. Commun. 315 (2004) 1097-1103] claimed that the frequencies of most amino acids are determined by the dinucleotide composition of the genome. Here, addressing a methodological problem in their work, we suggest that the standard deviations of amino acid frequencies should be determined to indicate how significant a given deviation from the predicted frequency is. Furthermore, using a different method that is expected to be more reliable, we suggest that the dinucleotide composition cannot explain the observed frequencies of most amino acids, and that the deviations of amino acid frequencies from what dinucleotide composition predicts are larger than would be expected by chance.