1
He J, Huang Y, Li L, Lin S, Ma M, Wang Y, Lin S. Novel Plastid Genome Characteristics in Fugacium kawagutii and the Trend of Accelerated Evolution of Plastid Proteins in Dinoflagellates. Genome Biol Evol 2024; 16:evad237. [PMID: 38155596 PMCID: PMC10781511 DOI: 10.1093/gbe/evad237] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 10/31/2023] [Revised: 12/19/2023] [Accepted: 12/22/2023] [Indexed: 12/30/2023]
Abstract
Typical (peridinin-containing) dinoflagellates possess plastid genomes composed of small plasmids named "minicircles". Despite the ecological importance of dinoflagellate photosynthesis in corals and marine ecosystems, the structural characteristics, replication dynamics, and evolutionary forcing of dinoflagellate plastid genomes remain poorly understood. Here, we sequenced the plastid genome of the symbiodiniacean species Fugacium kawagutii and conducted comparative analyses. We identified psbT-coding minicircles, features previously not found in Symbiodiniaceae. The copy number of F. kawagutii minicircles showed a strong diel dynamics, changing between 3.89 and 34.3 copies/cell and peaking in mid-light period. We found that F. kawagutii minicircles are the shortest among all dinoflagellates examined to date. Besides, the core regions of the minicircles are highly conserved within genus in Symbiodiniaceae. Furthermore, the codon usage bias of the plastid genomes in Heterocapsaceae, Amphidiniaceae, and Prorocentraceae species are greatly influenced by selection pressure, and in Pyrocystaceae, Symbiodiniaceae, Peridiniaceae, and Ceratiaceae species are influenced by both natural selection pressure and mutation pressure, indicating a family-level distinction in codon usage evolution in dinoflagellates. Phylogenetic analysis using 12 plastid-encoded proteins and five nucleus-encoded plastid proteins revealed accelerated evolution trend of both plastid- and nucleus-encoded plastid proteins in peridinin- and fucoxanthin-dinoflagellate plastids compared to plastid proteins of nondinoflagellate algae. These findings shed new light on the structure and evolution of plastid genomes in dinoflagellates, which will facilitate further studies on the evolutionary forcing and function of the diverse dinoflagellate plastids. The accelerated evolution documented here suggests plastid-encoded sequences are potentially useful for resolving closely related dinoflagellates.
Affiliation(s)
- Jiamin He
  - State Key Laboratory of Marine Environmental Science, College of Ocean and Earth Sciences, Xiamen University, Xiamen 361102, China
- Yulin Huang
  - State Key Laboratory of Marine Environmental Science, College of Ocean and Earth Sciences, Xiamen University, Xiamen 361102, China
- Ling Li
  - State Key Laboratory of Marine Environmental Science, College of Ocean and Earth Sciences, Xiamen University, Xiamen 361102, China
- Sitong Lin
  - State Key Laboratory of Marine Environmental Science, College of Ocean and Earth Sciences, Xiamen University, Xiamen 361102, China
- Minglei Ma
  - State Key Laboratory of Marine Environmental Science, College of Ocean and Earth Sciences, Xiamen University, Xiamen 361102, China
- Yujie Wang
  - State Key Laboratory of Marine Environmental Science, College of Ocean and Earth Sciences, Xiamen University, Xiamen 361102, China
- Senjie Lin
  - State Key Laboratory of Marine Environmental Science, College of Ocean and Earth Sciences, Xiamen University, Xiamen 361102, China
  - Department of Marine Sciences, University of Connecticut, Groton, CT 06340, USA
2
Protein innovation through template switching in the Saccharomyces cerevisiae lineage. Sci Rep 2021; 11:22558. [PMID: 34799587 PMCID: PMC8604942 DOI: 10.1038/s41598-021-01736-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2021] [Accepted: 10/27/2021] [Indexed: 11/08/2022]
Abstract
DNA polymerase template switching between short, non-identical inverted repeats (IRs) is a genetic mechanism that leads to the homogenization of IR arms and to IR spacer inversion, which cause multinucleotide mutations (MNMs). It is unknown if and how template switching affects gene evolution. In this study, we performed a phylogenetic analysis to determine the effect of template switching between IR arms on coding DNA of Saccharomyces cerevisiae. To achieve this, perfect IRs that co-occurred with MNMs between a strain and its parental node were identified in S. cerevisiae strains. We determined that template switching introduced MNMs into 39 protein-coding genes through S. cerevisiae evolution, resulting in both arm homogenization and inversion of the IR spacer. These events in turn resulted in nonsynonymous substitutions and up to five neighboring amino acid replacements in a single gene. The study demonstrates that template switching is a powerful generator of multiple substitutions within codons. Additionally, some template switching events occurred more than once during S. cerevisiae evolution. Our findings suggest that template switching constitutes a general mutagenic mechanism that results in both nonsynonymous substitutions and parallel evolution, which are traditionally considered as evidence for positive selection, without the need for adaptive explanations.
3
Špoljarić D, Ugrina I. Limiting distribution of the number of clumps of palindromes in DNA. Commun Stat Theory Methods 2017. [DOI: 10.1080/03610926.2016.1189573] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
Affiliation(s)
- Drago Špoljarić
  - Faculty of Mining, Geology and Petroleum Engineering, University of Zagreb, Zagreb, Croatia
- Ivo Ugrina
  - Faculty of Science, Department of Mathematics, University of Zagreb, Zagreb, Croatia
6
Špoljarić D, Ugrina I. On Statistical Properties of Palindromes in DNA. Commun Stat Theory Methods 2013. [DOI: 10.1080/03610926.2012.739253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
7
Hatsuda H. Finding differentially expressed regions of arbitrary length in quantitative genomic data based on marked point process model. Bioinformatics 2012; 28:i633-i639. [PMID: 22962492 PMCID: PMC3436798 DOI: 10.1093/bioinformatics/bts371] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/18/2022]
Abstract
Motivation: High-throughput nucleotide sequencing technologies provide large amounts of quantitative genomic data at nucleotide resolution, which are important for the present and future biomedical researches; for example differential analysis of base-level RNA expression data will improve our understanding of transcriptome, including both coding and non-coding genes. However, most studies of these data have relied on existing genome annotations and thus are limited to the analysis of known transcripts. Results: In this article, we propose a novel method based on a marked point process model to find differentially expressed genomic regions of arbitrary length without using genome annotations. The presented method conducts a statistical test for differential analysis in regions of various lengths at each nucleotide and searches the optimal configuration of the regions by using a Monte Carlo simulation. We applied the proposed method to both synthetic and real genomic data, and their results demonstrate the effectiveness of our method. Availability: The program used in this study is available at https://sites.google.com/site/hiroshihatsuda/. Contact: H.Hatsuda@warwick.ac.uk
Affiliation(s)
- Hiroshi Hatsuda
  - Department of Statistics, the University of Warwick, Coventry CV4 7AL, UK.
8
Zubaer A, Thapa S. Palindromes drive the re-assortment in Influenza A. Bioinformation 2011; 7:115-9. [PMID: 22125380 PMCID: PMC3218312 DOI: 10.6026/97320630007115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2011] [Accepted: 09/11/2011] [Indexed: 11/23/2022]
Abstract
Different subtypes of Influenza A virus are associated with species specific, zoonotic or pandemic Influenza. The cause of its severity underlies in complicated evolution of its segmented RNA genome. Although genetic shift and genetic drift are well known in the evolution of this virus, we reported the significant role of unique RNA palindromes in its evolution. Our computational approach identified the existence of unique palindromes in each subtype of Influenza A virus with its absence in Influenza B relating the fact of virulence and vigorous genetic hitchhiking in Influenza A. The current study focused on the re-assortment event responsible for the emergence of pandemic-2009 H1N1 virus, which is associated with outgrow of new palindrome and in turn, changing its RNA structure. We hypothesize that the change in RNA structure due to the presence of palindrome facilitates the event of re-assortment in Influenza A. Thus the evolutionary process of Influenza A is much more complicated as previously known, and that has been demonstrated in this study.
9
Chan HP, Zhang NR, Chen LHY. Importance sampling of word patterns in DNA and protein sequences. J Comput Biol 2011; 17:1697-709. [PMID: 21128856 DOI: 10.1089/cmb.2008.0233] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/13/2022]
Abstract
Monte Carlo methods can provide accurate p-value estimates of word counting test statistics and are easy to implement. They are especially attractive when an asymptotic theory is absent or when either the search sequence or the word pattern is too short for the application of asymptotic formulae. Naive direct Monte Carlo is undesirable for the estimation of small probabilities because the associated rare events of interest are seldom generated. We propose instead efficient importance sampling algorithms that use controlled insertion of the desired word patterns on randomly generated sequences. The implementation is illustrated on word patterns of biological interest: palindromes and inverted repeats, patterns arising from position-specific weight matrices (PSWMs), and co-occurrences of pairs of motifs.
Affiliation(s)
- Hock Peng Chan
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore, Republic of Singapore
10
Lamprea-Burgunder E, Ludin P, Mäser P. Species-specific typing of DNA based on palindrome frequency patterns. DNA Res 2011; 18:117-24. [PMID: 21429991 PMCID: PMC3077040 DOI: 10.1093/dnares/dsr004] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 12/20/2022]
Abstract
DNA in its natural, double-stranded form may contain palindromes, sequences which read the same from either side because they are identical to their reverse complement on the sister strand. Short palindromes are underrepresented in all kinds of genomes. The frequency distribution of short palindromes exhibits more than twice the inter-species variance of non-palindromic sequences, which renders palindromes optimally suited for the typing of DNA. Here, we show that based on palindrome frequency, DNA sequences can be discriminated to the level of species of origin. By plotting the ratios of actual occurrence to expectancy, we generate palindrome frequency patterns that allow to cluster different sequences of the same genome and to assign plasmids, and in some cases even viruses to their respective host genomes. This finding will be of use in the growing field of metagenomics.
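The palindrome definition and the observed-to-expected ratio described in this abstract can be illustrated with a minimal Python sketch. The abstract does not state how the expectation is computed, so the version below assumes independent bases drawn at the sequence's own base frequencies; the function names are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindrome(word: str) -> bool:
    """A DNA palindrome equals its own reverse complement."""
    return word == reverse_complement(word)


def palindrome_ratios(seq: str, k: int = 4) -> dict:
    """Observed/expected count ratio for every length-k palindrome.

    Assumption of this sketch: expected counts come from an i.i.d.
    model using the base frequencies of `seq` itself.
    """
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    observed = Counter(seq[i:i + k] for i in range(n_windows))
    ratios = {}
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        if not is_palindrome(word):
            continue
        prob = 1.0
        for base in word:
            prob *= base_freq[base]
        expected = prob * n_windows
        if expected > 0:
            ratios[word] = observed[word] / expected
    return ratios
```

A ratio below 1 for a given palindrome means it occurs less often than this simple model predicts; the vector of ratios over all length-k palindromes is one way to form a frequency pattern of the kind the abstract mentions.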
11
Strawbridge EM, Benson G, Gelfand Y, Benham CJ. The distribution of inverted repeat sequences in the Saccharomyces cerevisiae genome. Curr Genet 2010; 56:321-40. [PMID: 20446088 PMCID: PMC2908449 DOI: 10.1007/s00294-010-0302-6] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Received: 02/09/2010] [Revised: 04/05/2010] [Accepted: 04/08/2010] [Indexed: 02/06/2023]
Abstract
Although a variety of possible functions have been proposed for inverted repeat sequences (IRs), it is not known which of them might occur in vivo. We investigate this question by assessing the distributions and properties of IRs in the Saccharomyces cerevisiae (SC) genome. Using the IRFinder algorithm we detect 100,514 IRs having copy length greater than 6 bp and spacer length less than 77 bp. To assess statistical significance we also determine the IR distributions in two types of randomization of the S. cerevisiae genome. We find that the S. cerevisiae genome is significantly enriched in IRs relative to random. The S. cerevisiae IRs are significantly longer and contain fewer imperfections than those from the randomized genomes, suggesting that processes to lengthen and/or correct errors in IRs may be operative in vivo. The S. cerevisiae IRs are highly clustered in intergenic regions, while their occurrence in coding sequences is consistent with random. Clustering is stronger in the 3' flanks of genes than in their 5' flanks. However, the S. cerevisiae genome is not enriched in those IRs that would extrude cruciforms, suggesting that this is not a common event. Various explanations for these results are considered.
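To make the arm ("copy") and spacer terminology concrete, here is a naive Python sketch that finds perfect inverted repeats with a fixed arm length and a bounded spacer. It is not the IRFinder algorithm used in the paper, which also handles imperfect repeats; the function name and default parameter values are assumptions chosen only to echo the thresholds quoted in the abstract.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_perfect_irs(seq: str, arm_len: int = 7, max_spacer: int = 76):
    """Return (start, spacer) pairs for perfect inverted repeats.

    An inverted repeat here is a left arm of `arm_len` bases, a spacer
    of 0..max_spacer bases, and a right arm equal to the reverse
    complement of the left arm. Brute force, for illustration only.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for i in range(n - 2 * arm_len + 1):
        left_rc = reverse_complement(seq[i:i + arm_len])
        for spacer in range(max_spacer + 1):
            j = i + arm_len + spacer  # start of the candidate right arm
            if j + arm_len > n:
                break
            if seq[j:j + arm_len] == left_rc:
                hits.append((i, spacer))
    return hits
```

For example, `find_perfect_irs("GATTACAGGGTGTAATC")` reports one repeat starting at position 0 with a 3-base spacer.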
Affiliation(s)
- Gary Benson
  - Laboratory for Biocomputing and Informatics, Boston University, Boston, MA USA
- Yevgeniy Gelfand
  - Laboratory for Biocomputing and Informatics, Boston University, Boston, MA USA
- Craig J. Benham
  - Department of Mathematics, University of California, Davis, CA 95616 USA
12
Cruz-Cano R, Chew DSH, Choi KP, Leung MY. Least-Squares Support Vector Machine Approach to Viral Replication Origin Prediction. INFORMS Journal on Computing 2010; 22:457-470. [PMID: 20729987 PMCID: PMC2923853 DOI: 10.1287/ijoc.1090.0360] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/29/2023]
Abstract
Replication of their DNA genomes is a central step in the reproduction of many viruses. Procedures to find replication origins, which are initiation sites of the DNA replication process, are therefore of great importance for controlling the growth and spread of such viruses. Existing computational methods for viral replication origin prediction have mostly been tested within the family of herpesviruses. This paper proposes a new approach by least-squares support vector machines (LS-SVMs) and tests its performance not only on the herpes family but also on a collection of caudoviruses coming from three viral families under the order of caudovirales. The LS-SVM approach provides sensitivities and positive predictive values superior or comparable to those given by the previous methods. When suitably combined with previous methods, the LS-SVM approach further improves the prediction accuracy for the herpesvirus replication origins. Furthermore, by recursive feature elimination, the LS-SVM has also helped find the most significant features of the data sets. The results suggest that the LS-SVMs will be a highly useful addition to the set of computational tools for viral replication origin prediction and illustrate the value of optimization-based computing techniques in biomedical applications.
Affiliation(s)
- Raul Cruz-Cano
  - Department of Computer and Information Sciences, Texas A&M University-Texarkana, Texarkana, TX, 75501, USA
13
Abstract
This article develops a latent model and likelihood-based inference to detect temporal clustering of events. The model mimics typical processes generating the observed data. We apply model selection techniques to determine the number of clusters, and develop likelihood inference and a Monte Carlo expectation-maximization algorithm to estimate model parameters, detect clusters, and identify cluster locations. Our method differs from the classical scan statistic in that we can simultaneously detect multiple clusters of varying sizes. We illustrate the methodology with two real data applications and evaluate its efficiency through simulation studies. For the typical data-generating process, our methodology is more efficient than a competing procedure that relies on least squares.
Affiliation(s)
- Minge Xie
  - Department of Statistics, Rutgers, the State University of New Jersey, Piscataway, New Jersey 08854, USA.
14
Taufer M, Leung MY, Solorio T, Licon A, Mireles D, Araiza R, Johnson KL. RNAVLab: A virtual laboratory for studying RNA secondary structures based on grid computing technology. Parallel Computing 2008; 34:661-680. [PMID: 19885376 PMCID: PMC2714649 DOI: 10.1016/j.parco.2008.08.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 06/05/2007] [Revised: 06/06/2008] [Accepted: 08/21/2008] [Indexed: 05/28/2023]
Abstract
As ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation, their secondary structures have been the focus of many recent studies. Despite the computing power of supercomputers, computationally predicting secondary structures with thermodynamic methods is still not feasible when the RNA molecules have long nucleotide sequences and include complex motifs such as pseudoknots. This paper presents RNAVLab (RNA Virtual Laboratory), a virtual laboratory for studying RNA secondary structures including pseudoknots that allows scientists to address this challenge. Two important case studies show the versatility and functionalities of RNAVLab. The first study quantifies its capability to rebuild longer secondary structures from motifs found in systematically sampled nucleotide segments. The extensive sampling and predictions are made feasible in a short turnaround time because of the grid technology used. The second study shows how RNAVLab allows scientists to study the viral RNA genome replication mechanisms used by members of the virus family Nodaviridae.
Affiliation(s)
- Michela Taufer
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, United States
- Ming-Ying Leung
  - Department of Mathematical Sciences, The University of Texas at El Paso, El Paso, TX 79968, United States
  - Bioinformatics Program, The University of Texas at El Paso, El Paso, TX 79968, United States
  - Border Biomedical Research Center, The University of Texas at El Paso, El Paso, TX 79968, United States
- Thamar Solorio
  - Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75080, United States
- Abel Licon
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, United States
- David Mireles
  - Department of Computer Science, The University of Texas at El Paso, El Paso, TX 79968, United States
- Roberto Araiza
  - Department of Computer Science, The University of Texas at El Paso, El Paso, TX 79968, United States
- Kyle L. Johnson
  - Border Biomedical Research Center, The University of Texas at El Paso, El Paso, TX 79968, United States
  - Department of Biological Sciences, The University of Texas at El Paso, El Paso, TX 79968, United States
15
Lillo F, Spanò M. Inverted and mirror repeats in model nucleotide sequences. Physical Review E, Statistical, Nonlinear, and Soft Matter Physics 2007; 76:041914. [PMID: 17995033 DOI: 10.1103/physreve.76.041914] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/16/2007] [Indexed: 05/25/2023]
Abstract
We analytically and numerically study the probabilistic properties of inverted and mirror repeats in model sequences of nucleic acids. We consider both perfect and nonperfect repeats, i.e., repeats with mismatches and gaps. The considered sequence models are independent identically distributed (i.i.d.) sequences, Markov processes and long-range sequences. We show that the number of repeats in correlated sequences is significantly larger than in i.i.d. sequences and that this discrepancy increases exponentially with the repeat length for long-range sequences.
Affiliation(s)
- Fabrizio Lillo
  - Dipartimento di Fisica e Tecnologie Relative, Università di Palermo, Viale delle Scienze, I-90128, Palermo, Italy
16
Chew DSH, Leung MY, Choi KP. AT excursion: a new approach to predict replication origins in viral genomes by locating AT-rich regions. BMC Bioinformatics 2007; 8:163. [PMID: 17517140 PMCID: PMC1904460 DOI: 10.1186/1471-2105-8-163] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 12/08/2006] [Accepted: 05/21/2007] [Indexed: 11/12/2022]
Abstract
Background: Replication origins are considered important sites for understanding the molecular mechanisms involved in DNA replication. Many computational methods have been developed for predicting their locations in archaeal, bacterial and eukaryotic genomes. However, a prediction method designed for a particular kind of genomes might not work well for another. In this paper, we propose the AT excursion method, which is a score-based approach, to quantify local AT abundance in genomic sequences and use the identified high scoring segments for predicting replication origins. This method has the advantages of requiring no preset window size and having rigorous criteria to evaluate statistical significance of high scoring segments. Results: We have evaluated the AT excursion method by checking its predictions against known replication origins in herpesviruses and comparing its performance with an existing base weighted score method (BWS1). Out of 43 known origins, 39 are predicted by either one or the other method and 26 origins are predicted by both. The excursion method identifies six origins not predicted by BWS1, showing that the AT excursion method is a valuable complement to BWS1. We have also applied the AT excursion method to two other families of double stranded DNA viruses, the poxviruses and iridoviruses, of which very few replication origins are documented in the public domain. The prediction results are made available as supplementary materials at [1]. Preliminary investigation shows that the proposed method works well on some larger genomes too. Conclusion: The AT excursion method will be a useful computational tool for identifying replication origins in a variety of genomic sequences.
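The idea of a score-based excursion that needs no preset window can be sketched in a few lines of Python. This is a generic excursion scheme (a running sum that resets at zero), offered as an illustration rather than the paper's exact method: the score values `at_score` and `gc_score` are hypothetical, and the statistical significance criteria mentioned in the abstract are not implemented.

```python
def at_excursion(seq: str, at_score: float = 1.0, gc_score: float = -1.5):
    """Return the excursion value at every position of `seq`.

    Each A or T adds `at_score`; any other symbol adds `gc_score`
    (negative). The running sum is reset to zero whenever it would
    go below zero. Score values are assumptions of this sketch.
    """
    values = []
    current = 0.0
    for base in seq.upper():
        step = at_score if base in "AT" else gc_score
        current = max(0.0, current + step)
        values.append(current)
    return values


def top_segment(seq: str, at_score: float = 1.0, gc_score: float = -1.5):
    """Return (start, end, score) of the highest-scoring segment.

    `start` is inclusive and `end` exclusive, so seq[start:end] is the
    segment. Returns (0, 0, 0.0) if no position scores above zero.
    """
    best = (0, 0, 0.0)
    current, start = 0.0, 0
    for i, base in enumerate(seq.upper()):
        if current == 0.0:
            start = i  # a new excursion can only begin from zero
        step = at_score if base in "AT" else gc_score
        current = max(0.0, current + step)
        if current > best[2]:
            best = (start, i + 1, current)
    return best
```

On the toy input `"GCGCATATATGCGC"`, `top_segment` returns `(4, 10, 6.0)`, which corresponds to the `ATATAT` run. Because the segment boundaries come from where the excursion leaves zero and where it peaks, no window size has to be chosen in advance.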
Affiliation(s)
- David SH Chew
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, Singapore
- Ming-Ying Leung
  - Department of Mathematical Sciences and Bioinformatics Program, The University of Texas at El Paso, TX 79968, USA
- Kwok Pui Choi
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, Singapore
  - Department of Mathematics, National University of Singapore, Singapore 117543, Singapore
17
Chew DSH, Choi KP, Leung MY. Scoring schemes of palindrome clusters for more sensitive prediction of replication origins in herpesviruses. Nucleic Acids Res 2005; 33:e134. [PMID: 16141192 PMCID: PMC1197138 DOI: 10.1093/nar/gni135] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]
Abstract
Many empirical studies show that there are unusual clusters of palindromes, closely spaced direct and inverted repeats around the replication origins of herpesviruses. In this paper, we introduce two new scoring schemes to quantify the spatial abundance of palindromes in a genomic sequence. Based on these scoring schemes, a computational method to predict the locations of replication origins is developed. When our predictions are compared with 39 known or annotated replication origins in 19 herpesviruses, close to 80% of the replication origins are located within 2% of the genome length. A list of predicted locations of replication origins in all the known herpesviruses with complete genome sequences is reported.
Affiliation(s)
- David S H Chew
  - Department of Mathematics, National University of Singapore, Singapore.
18
Chew DSH, Choi KP, Heidner H, Leung MY. Palindromes in SARS and Other Coronaviruses. INFORMS Journal on Computing 2004; 16:331-340. [PMID: 24966663 PMCID: PMC4066412 DOI: 10.1287/ijoc.1040.0087] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 05/11/2023]
Abstract
With the identification of a novel coronavirus associated with the severe acute respiratory syndrome (SARS), computational analysis of its RNA genome sequence is expected to give useful clues to help elucidate the origin, evolution, and pathogenicity of the virus. In this paper, we study the collective counts of palindromes in the SARS genome along with all the completely sequenced coronaviruses. Based on a Markov-chain model for the genome sequence, the mean and standard deviation for the number of palindromes at or above a given length are derived. These theoretical results are complemented by extensive simulations to provide empirical estimates. Using a z score obtained from these mathematical and empirical means and standard deviations, we have observed that palindromes of length four are significantly underrepresented in all the coronaviruses in our data set. In contrast, length-six palindromes are significantly underrepresented only in the SARS coronavirus. Two other features are unique to the SARS sequence. First, there is a length-22 palindrome TCTTTAACAAGCTTGTTAAAGA spanning positions 25962-25983. Second, there are two repeating length-12 palindromes TTATAATTATAA spanning positions 22712-22723 and 22796-22807. Some further investigations into possible biological implications of these palindrome features are proposed.
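The two sequences quoted in this abstract can be checked directly against the reverse-complement definition of a palindrome, and the z score is simply the observed count standardized by a model mean and standard deviation. The Python sketch below is illustrative only: the counting convention (one count per center whose flanking arms are complementary) is an assumption, and the Markov-chain mean and standard deviation derived in the paper are not reproduced, so `z_score` takes them as inputs.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_palindrome(word: str) -> bool:
    """True if `word` equals its own reverse complement."""
    return word == word.translate(COMPLEMENT)[::-1]


def count_palindromes_at_least(seq: str, length: int) -> int:
    """Count centers where a palindrome of at least `length` bases sits.

    Assumed convention: a center is counted when the `length // 2`
    bases on each side form a palindrome, so longer palindromes with
    the same center are counted once.
    """
    half = length // 2
    seq = seq.upper()
    return sum(
        is_palindrome(seq[c - half:c + half])
        for c in range(half, len(seq) - half + 1)
    )


def z_score(observed: float, mean: float, sd: float) -> float:
    """Standardize an observed count against a model mean and sd."""
    return (observed - mean) / sd


# The palindromes reported in the abstract satisfy the definition.
assert is_palindrome("TCTTTAACAAGCTTGTTAAAGA")  # length 22
assert is_palindrome("TTATAATTATAA")            # length 12
```

A strongly negative z score for a given length would correspond to the underrepresentation the abstract reports; the mean and standard deviation would have to come from a sequence model such as the one the authors describe.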
Affiliation(s)
- David S. H. Chew
  - Department of Mathematics, National University of Singapore, Singapore 117543, Singapore
- Kwok Pui Choi
  - Departments of Mathematics, and of Statistics and Applied Probability, National University of Singapore, Singapore 117543, Singapore
- Hans Heidner
  - Department of Biology, University of Texas at San Antonio, San Antonio, Texas 78249, USA
- Ming-Ying Leung
  - Department of Mathematical Sciences, University of Texas at El Paso, El Paso, Texas 79968, USA