1
Jayapurna I, Ruan Z, Eres M, Jalagam P, Jenkins S, Xu T. Sequence Design of Random Heteropolymers as Protein Mimics. Biomacromolecules 2023; 24:652-660. [PMID: 36638823 PMCID: PMC9930114 DOI: 10.1021/acs.biomac.2c01036] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 01/15/2023]
Abstract
Random heteropolymers (RHPs) have been computationally designed and experimentally shown to recapitulate protein-like phase behavior and function. However, unlike proteins, RHP sequences are only statistically defined and cannot be sequenced. Recent developments in reversible-deactivation radical polymerization allowed simulated polymer sequences based on the well-established Mayo-Lewis equation to more accurately reflect ground-truth sequences that are experimentally synthesized. This led to opportunities to perform bioinformatics-inspired analysis on simulated sequences to guide the design, synthesis, and interpretation of RHPs. We compared batches on the order of 10000 simulated RHP sequences that vary by synthetically controllable and measurable RHP characteristics such as chemical heterogeneity and average degree of polymerization. Our analysis spans across 3 levels: segments along a single chain, sequences within a batch, and batch-averaged statistics. We discuss simulator fidelity and highlight the importance of robust segment definition. Examples are presented that demonstrate the use of simulated sequence analysis for in-silico iterative design to mimic protein hydrophobic/hydrophilic segment distributions in RHPs and compare RHP and protein sequence segments to explain experimental results of RHPs that mimic protein function. To facilitate the community use of this workflow, the simulator and analysis modules have been made available through an open source toolkit, the RHPapp.
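As a rough illustration of the kind of Mayo-Lewis-based sequence simulation the abstract refers to, the sketch below covers only the textbook two-monomer terminal model at a fixed feed composition. It is not the RHPapp implementation: the function names, the "A"/"B" labels, and the constant-feed assumption (no compositional drift) are all hypothetical choices made for this example, and the feed fraction must satisfy 0 < f1 < 1.

```python
import random


def mayo_lewis_f1(f1, r1, r2):
    """Instantaneous copolymer mole fraction of monomer 1 (binary Mayo-Lewis equation).

    f1 is the feed mole fraction of monomer 1; r1 and r2 are reactivity ratios.
    """
    f2 = 1.0 - f1
    numerator = r1 * f1 * f1 + f1 * f2
    denominator = r1 * f1 * f1 + 2.0 * f1 * f2 + r2 * f2 * f2
    return numerator / denominator


def simulate_chain(length, f1, r1, r2, seed=None):
    """Grow one chain with the terminal model at constant feed (an assumption)."""
    rng = random.Random(seed)
    f2 = 1.0 - f1
    # Probability that a chain ending in A adds another A, and likewise for B.
    p_aa = r1 * f1 / (r1 * f1 + f2)
    p_bb = r2 * f2 / (r2 * f2 + f1)
    chain = ["A" if rng.random() < f1 else "B"]
    while len(chain) < length:
        if chain[-1] == "A":
            chain.append("A" if rng.random() < p_aa else "B")
        else:
            chain.append("B" if rng.random() < p_bb else "A")
    return "".join(chain)


if __name__ == "__main__":
    # A small batch of simulated sequences with illustrative parameters.
    batch = [simulate_chain(50, 0.5, 0.8, 1.2, seed=i) for i in range(5)]
    for sequence in batch:
        print(sequence)
```

With both reactivity ratios equal to 1 the copolymer composition equals the feed composition, and with both equal to 0 the chain strictly alternates, which gives two quick sanity checks for a simulator of this type.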
Affiliation(s)
- Ivan Jayapurna
- Department of Materials Science and Engineering, University of California, Berkeley, Berkeley, California 94720, United States
- Zhiyuan Ruan
- Department of Materials Science and Engineering, University of California, Berkeley, Berkeley, California 94720, United States
- Marco Eres
- Department of Chemistry, University of California, Berkeley, Berkeley, California 94720, United States
- Prajna Jalagam
- Department of Materials Science and Engineering, University of California, Berkeley, Berkeley, California 94720, United States
- Spencer Jenkins
- Department of Chemistry, University of California, Berkeley, Berkeley, California 94720, United States
- Ting Xu
- Department of Materials Science and Engineering, University of California, Berkeley, Berkeley, California 94720, United States; Department of Chemistry, University of California, Berkeley, Berkeley, California 94720, United States; Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
2
Standage DS, Lai T, Brendel VP. iLoci: robust evaluation of genome content and organization for provisional and mature genome assemblies. NAR Genom Bioinform 2022; 4:lqac013. [PMID: 35211671 PMCID: PMC8862717 DOI: 10.1093/nargab/lqac013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2021] [Revised: 12/23/2021] [Accepted: 02/10/2022] [Indexed: 11/23/2022] Open
Abstract
We introduce a new framework for genome analyses based on parsing an annotated genome assembly into distinct interval loci (iLoci), available as open-source software as part of the AEGeAn Toolkit (https://github.com/BrendelGroup/AEGeAn). We demonstrate that iLoci provide an alternative coordinate system that is robust to changes in assembly and annotation versions and facilitates granular quality control of genome data. We discuss how statistics computed on iLoci reflect various characteristics of genome content and organization and illustrate how these statistics can be used to establish a baseline for assessment of the completeness and accuracy of the data. We also introduce a well-defined measure of relative genome compactness and compute other iLocus statistics that reveal genome-wide characteristics of gene arrangements in the whole genome context. Given the fast pace of assembly/annotation updates, our AEGeAn Toolkit fills a niche in computational genomics based on deriving persistent and species-specific genome statistics. Gene structure model-centric iLoci provide a precisely defined coordinate system that can be used to store assembly/annotation updates that reflect either stable or changed assessments. Large-scale application of the approach revealed species- and clade-specific genome organization in precisely defined computational terms, promising intriguing forays into the forces of shaping genome structure as more and more genome assemblies are being deposited.
Affiliation(s)
- Daniel S Standage
- Department of Biology, Indiana University, Bloomington, IN 47405, USA
- Tim Lai
- Department of Mathematics, Indiana University, Bloomington, IN 47405, USA
- Volker P Brendel
- Department of Biology, Indiana University, Bloomington, IN 47405, USA
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
3
Morandin C, Brendel VP. Tools and applications for integrative analysis of DNA methylation in social insects. Mol Ecol Resour 2021; 22:1656-1674. [PMID: 34861105 DOI: 10.1111/1755-0998.13566] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/22/2021] [Revised: 11/18/2021] [Accepted: 11/23/2021] [Indexed: 12/15/2022]
Abstract
DNA methylation is a common epigenetic signalling tool and an important biological process which is widely studied in a large array of species. The presence, level and function of DNA methylation vary greatly across species. In some insects, DNA methylation systems are minimal, and overall methylation rates tend to be low in all studied insect species. Low methylation levels probed by whole-genome bisulphite sequencing require great care with respect to data quality control and interpretation. Here, we introduce BWASP/R, a complete workflow that allows efficient, scalable and entirely reproducible analyses of raw DNA methylation sequencing data. Consistent application of quality control filters and analysis parameters provides fair comparisons among different studies and an integrated view of all experiments on one species. We describe the capabilities of the BWASP/R workflow by re-analysing several publicly available social insect WGBS data sets, comprising 70 samples and cumulatively 147 replicates from four different species. We show that the CpG methylome comprises only about 1.5% of CpG sites in the honeybee genome and that the cumulative data are consistent with genetic signatures of site accessibility and physiological control of methylation levels.
Affiliation(s)
- Claire Morandin
- Department of Ecology and Evolution, Biophore, University of Lausanne, Lausanne, Switzerland
- Volker P Brendel
- Departments of Biology and Computer Science, Indiana University, Bloomington, Indiana, USA
4
Dick JM. Water as a reactant in the differential expression of proteins in cancer. Computational and Systems Oncology 2021. [DOI: 10.1002/cso2.1007] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/12/2022] Open
Affiliation(s)
- Jeffrey M. Dick
- Key Laboratory of Metallogenic Prediction of Nonferrous Metals and Geological Environment Monitoring, Ministry of Education, School of Geosciences and Info-Physics, Central South University, Changsha, China
5
6
Ding Y, Xue H, Ding X, Zhao Y, Zhao Z, Wang D, Wu J. On the complexity measures of mutation hotspots in human TP53 protein. Chaos (Woodbury, N.Y.) 2020; 30:073118. [PMID: 32752620 DOI: 10.1063/1.5143584] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/27/2019] [Accepted: 06/15/2020] [Indexed: 06/11/2023]
Abstract
The role of sequence complexity in 23 051 somatic missense mutations including 73 well-known mutation hotspots across 22 major cancers was studied in human TP53 proteins. A role for sequence complexity in TP53 protein mutations is suggested since (i) the mutation rate significantly increases in low amino acid pair bias complexity; (ii) probability distribution complexity increases following single point substitution mutations and strikingly increases after mutation at the mutation hotspots including six detectable hotspot mutations (R175, G245, R248, R249, R273, and R282); and (iii) the degree of increase in distribution complexity is significantly correlated with the frequency of missense mutations (r = -0.5758, P < 0.0001) across 20 major types of solid tumors. These results are consistent with the hypothesis that amino acid pair bias and distribution probability may be used as novel measures for protein sequence complexity, and the degree of complexity is related to its susceptibility to mutation, as such, it may be used as a predictor for modeling protein mutations in human cancers.
Affiliation(s)
- Yan Ding
- Department of Pediatrics, Boston Children's Hospital, Harvard Medical School, Boston, Massachusetts 02115, USA
- Hongsheng Xue
- Institute for Translational Medicine, The Affiliated Zhongshan Hospital of Dalian University, Dalian 116001, China
- Xinjia Ding
- Department of Urology, The Second Affiliated Hospital of Dalian Medical University, Dalian 116023, China
- Yuqing Zhao
- Department of Urology, The Second Affiliated Hospital of Dalian Medical University, Dalian 116023, China
- Zhilong Zhao
- Institute for Translational Medicine, The Affiliated Zhongshan Hospital of Dalian University, Dalian 116001, China
- Dazhi Wang
- Department of Pediatrics, Boston Children's Hospital, Harvard Medical School, Boston, Massachusetts 02115, USA
- Jianlin Wu
- Institute for Translational Medicine, The Affiliated Zhongshan Hospital of Dalian University, Dalian 116001, China
7
Margelevičius M. Estimating statistical significance of local protein profile-profile alignments. BMC Bioinformatics 2019; 20:419. [PMID: 31409275 PMCID: PMC6693267 DOI: 10.1186/s12859-019-2913-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/23/2019] [Accepted: 05/23/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Alignment of sequence families described by profiles provides a sensitive means for establishing homology between proteins and is important in protein evolutionary, structural, and functional studies. In the context of a steadily growing amount of sequence data, estimating the statistical significance of alignments, including profile-profile alignments, plays a key role in alignment-based homology search algorithms. Still, it is an open question whether one type of distribution governs profile-profile alignment scores, and if so which, especially when profile-profile substitution scores involve such terms as secondary structure predictions. RESULTS: This study presents a methodology for estimating the statistical significance of alignments of this type. The methodology rests on a new algorithm developed for generating random profiles such that their alignment scores are distributed similarly to those obtained for real unrelated profiles. We show that improvements in statistical accuracy and sensitivity and high-quality alignment rate result from statistically characterizing alignments by establishing the dependence of statistical parameters on various measures associated with both individual and pairwise profile characteristics. Implemented in the COMER software, the proposed methodology yielded an increase of up to 34.2% in the number of true positives and up to 61.8% in the number of high-quality alignments with respect to the previous version of the COMER method. CONCLUSIONS: The more accurate estimation of statistical significance is implemented in the COMER method, which is now more sensitive and provides an increased rate of high-quality profile-profile alignments. The results of the present study also suggest directions for future research.
Affiliation(s)
- Mindaugas Margelevičius
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio al. 7, Vilnius, 10257, Lithuania.
8
Xu M, Lawrence JG, Durand D. Selection, periodicity and potential function for Highly Iterative Palindrome-1 (HIP1) in cyanobacterial genomes. Nucleic Acids Res 2019; 46:2265-2278. [PMID: 29432573 PMCID: PMC5861425 DOI: 10.1093/nar/gky075] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/17/2017] [Accepted: 01/25/2018] [Indexed: 02/05/2023] Open
Abstract
Highly Iterated Palindrome 1 (HIP1, GCGATCGC) is hyper-abundant in most cyanobacterial genomes. In some cyanobacteria, average HIP1 abundance exceeds one motif per gene. Such high abundance suggests a significant role in cyanobacterial biology. However, 20 years of study have not revealed whether HIP1 has a function, much less what that function might be. We show that HIP1 is 15- to 300-fold over-represented in genomes analyzed. More importantly, HIP1 sites are conserved both within and between open reading frames, suggesting that their overabundance is maintained by selection rather than by continual replenishment by neutral processes, such as biased DNA repair. This evidence for selection suggests a functional role for HIP1. No evidence was found to support a functional role as a peptide or RNA motif or a role in the regulation of gene expression. Rather, we demonstrate that the distribution of HIP1 along cyanobacterial chromosomes is significantly periodic, with periods ranging from 10 to 90 kb, consistent in scale with periodicities reported for co-regulated, co-expressed and evolutionarily correlated genes. The periodicity we observe is also comparable in scale to chromosomal interaction domains previously described in other bacteria. In this context, our findings imply HIP1 functions associated with chromosome and nucleoid structure.
Affiliation(s)
- Minli Xu
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Jeffrey G Lawrence
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Dannie Durand
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
9
Affiliation(s)
- Suchismita Goswami
- Computational and Data Sciences, George Mason University, Fairfax, VA, USA
- Edward J. Wegman
- Computational and Data Sciences, George Mason University, Fairfax, VA, USA
10
Korir R, Anzala O, Jaoko W, Bii C, Keter L. Multidrug-Resistant Bacterial Isolates Recovered from Herbal Medicinal Products Sold in Nairobi, Kenya. East Afr Health Res J 2017; 1:40-46. [PMID: 34308157 PMCID: PMC8279310 DOI: 10.24248/eahrj-d-17-00027] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/09/2017] [Accepted: 02/10/2017] [Indexed: 11/20/2022] Open
Abstract
Background: Medicinal herbs have been reported to be contaminated with microorganisms indigenous to the environment. These microbes become a threat when they harbour drug-resistant traits. Objective: The aim of this study was to evaluate phenotypic and genotypic drug-resistant traits of bacteria isolated from herbal medicinal products in Nairobi, Kenya. Methods: We employed an exploratory as well as laboratory-based experimental design. Herbal products were purchased from markets and transported to Kenya Medical Research Institute laboratories for processing and analysis. Microbial contamination and antibiotic susceptibility were determined following standard methods. Antibiotic-resistant genes were determined using polymerase chain reaction. Data were coded and analysed accordingly. Results: We collected 138 samples of herbal products in the form of liquids, powders, capsules, creams/lotions, and syrups. In total, 117 samples (84.8%) were contaminated with bacteria and 61 (44.2%) were contaminated with fungi. Bacillus, Klebsiella, Proteus, Staphylococcus, Streptomyces, Escherichia, Enterobacter, Serratia, Yersinia, Morganella, Citrobacter, Erwinia, and Shigella were the bacterial genera identified. Most of the isolated bacteria were generally sensitive to the panel of antibiotics tested, although a few (35 [36.5%]) were resistant; more than half of these were resistant to more than 1 of the antibiotic agents we tested. Discussion: We found an association between phenotypic and genotypic drug resistance among the drug-resistant bacteria. This study makes it evident that herbal medicinal products sold in Nairobi are contaminated with drug-resistant bacteria. Conclusions: The results show that herbal medicinal products are a potential source of dissemination of multidrug-resistant bacteria. There is an urgent need for specific education programmes, policies, and regulations that address herbal products' safety to prevent the possibility of these pathogens being involved in deadly invasive infections.
Affiliation(s)
- Richard Korir
- Kenya Medical Research Institute, Nairobi, Kenya; University of Nairobi, School of Medicine, Nairobi, Kenya
- Omu Anzala
- University of Nairobi, School of Medicine, Nairobi, Kenya
- Walter Jaoko
- University of Nairobi, School of Medicine, Nairobi, Kenya
- Lucia Keter
- Kenya Medical Research Institute, Nairobi, Kenya
11
Reiner-Benaim A. Scan Statistic Tail Probability Assessment Based on Process Covariance and Window Size. Methodol Comput Appl Probab 2016. [DOI: 10.1007/s11009-015-9447-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
12
Ye Z, Chen Z, Sunkel B, Frietze S, Huang THM, Wang Q, Jin VX. Genome-wide analysis reveals positional-nucleosome-oriented binding pattern of pioneer factor FOXA1. Nucleic Acids Res 2016; 44:7540-54. [PMID: 27458208 PMCID: PMC5027512 DOI: 10.1093/nar/gkw659] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 03/25/2016] [Accepted: 07/12/2016] [Indexed: 11/24/2022] Open
Abstract
The compaction of nucleosomal structures creates a barrier for DNA-binding transcription factors (TFs) to access their cognate cis-regulatory elements. Pioneer factors (PFs) such as FOXA1 are able to directly access these cis-targets within compact chromatin. However, how these PFs interplay with nucleosomes remains to be elucidated, and is critical for understanding the underlying mechanism of gene regulation. Here, we have conducted a computational analysis of strand-specific paired-end ChIP-exo (termed ChIP-ePENS) data for FOXA1 in LNCaP cells using our novel algorithm ePEST. We find that FOXA1 chromatin binding occurs via four distinct border modes (or footprint boundary patterns), with preferential footprint boundary patterns relative to FOXA1 motif orientation. In addition, from this analysis three fundamental nucleotide positions (oG, oS and oH) emerged as major determinants for blocking exo-digestion and forming these four distinct border modes. By integrating histone MNase-seq data, we found that an astonishingly consistent, ‘well-positioned’ configuration occurs between FOXA1 motifs and dyads of nucleosomes genome-wide. We further performed ChIP-seq of eight chromatin remodelers and found an increased occupancy of these remodelers on FOXA1 motifs for all four border modes, indicating that full occupancy of the FOXA1 complex on the three blocking sites (oG, oS and oH) likely produces an active regulatory status with well-positioned phasing for protein binding events. Together, our results suggest a positional-nucleosome-oriented accessing model for PFs seeking target motifs, in which FOXA1 can examine each underlying DNA nucleotide and is able to sense all potential motifs regardless of whether they face inward or outward from histone octamers along the DNA helix axis.
Affiliation(s)
- Zhenqing Ye
- Department of Molecular Medicine, University of Texas Health Science Center at San Antonio, TX 78229, USA
- Zhong Chen
- Department of Molecular Virology, Immunology and Medical Genetics, The Ohio State University College of Medicine, OH 43210, USA; Comprehensive Cancer Center, The Ohio State University College of Medicine, OH 43210, USA
- Benjamin Sunkel
- Department of Molecular Virology, Immunology and Medical Genetics, The Ohio State University College of Medicine, OH 43210, USA; Comprehensive Cancer Center, The Ohio State University College of Medicine, OH 43210, USA
- Seth Frietze
- MLRS Department, University of Vermont, VT 05405, USA
- Tim H-M Huang
- Department of Molecular Medicine, University of Texas Health Science Center at San Antonio, TX 78229, USA
- Qianben Wang
- Department of Molecular Virology, Immunology and Medical Genetics, The Ohio State University College of Medicine, OH 43210, USA; Comprehensive Cancer Center, The Ohio State University College of Medicine, OH 43210, USA
- Victor X Jin
- Department of Molecular Medicine, University of Texas Health Science Center at San Antonio, TX 78229, USA
13
r-scan statistics of a Poisson process with events transformed by duplications, deletions, and displacements. Adv Appl Probab 2016. [DOI: 10.1017/s0001867800002056] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/06/2022]
Abstract
A stochastic model of a dynamic marker array in which markers could disappear, duplicate, and move relative to its original position is constructed to reflect on the nature of long DNA sequences. The sequence changes of deletions, duplications, and displacements follow the stochastic rules: (i) the original distribution of the marker array $\{\ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots\}$ is a Poisson process on the real line; (ii) each marker is replicated $l$ times; replication or loss of marker points occur independently; (iii) each replicated point is independently and randomly displaced by an amount $Y$ relative to its original position, with the $Y$ displacements sampled from a continuous density $g(y)$. Limiting distributions for the maximal and minimal statistics of the $r$-scan lengths (collection of distances between $r + 1$ successive markers) for the $l$-shift model are derived with the aid of the Chen-Stein method and properties of Poisson processes.
14
Chen C, Karlin S. r-scan statistics of a Poisson process with events transformed by duplications, deletions, and displacements. Adv Appl Probab 2016. [DOI: 10.1239/aap/1189518639] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
A stochastic model of a dynamic marker array in which markers could disappear, duplicate, and move relative to its original position is constructed to reflect on the nature of long DNA sequences. The sequence changes of deletions, duplications, and displacements follow the stochastic rules: (i) the original distribution of the marker array $\{\ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots\}$ is a Poisson process on the real line; (ii) each marker is replicated $l$ times; replication or loss of marker points occur independently; (iii) each replicated point is independently and randomly displaced by an amount $Y$ relative to its original position, with the $Y$ displacements sampled from a continuous density $g(y)$. Limiting distributions for the maximal and minimal statistics of the $r$-scan lengths (collection of distances between $r + 1$ successive markers) for the $l$-shift model are derived with the aid of the Chen-Stein method and properties of Poisson processes.
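To make the r-scan definition in this abstract concrete, here is a minimal sketch that computes the r-scan lengths of a marker array, that is, the distances spanned by each run of r + 1 successive markers, together with their minimum and maximum. The function name and the example positions are illustrative assumptions; the limiting distributions derived in the paper are not reproduced here.

```python
def r_scan_lengths(positions, r):
    """Return the distances spanned by every run of r + 1 successive markers."""
    if r < 1:
        raise ValueError("r must be a positive integer")
    pts = sorted(positions)
    # The i-th r-scan length is the gap between marker i and marker i + r.
    return [pts[i + r] - pts[i] for i in range(len(pts) - r)]


if __name__ == "__main__":
    markers = [0, 1, 3, 6, 10]  # made-up marker coordinates
    scans = r_scan_lengths(markers, r=2)
    print(scans, min(scans), max(scans))
```

Unusually small minimal r-scans point to clustering of markers and unusually large maximal r-scans point to sparse regions, which is why the extremes are the statistics of interest.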
15
Chen Q, Zhou XJ, Sun F. Finding genetic overlaps among diseases based on ranked gene lists. J Comput Biol 2015; 22:111-23. [PMID: 25684200 DOI: 10.1089/cmb.2014.0149] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 10/24/2022] Open
Abstract
To understand disease relationships in terms of their genetic mechanisms, it is important to study the common genetic basis among different diseases. Although discoveries on pleiotropic genes related to multiple diseases abound, methods flexibly applicable to various types of datasets generated from different studies or experiments are needed to gain big pictures on the genetic relationships among a large number of diseases. We develop a set of genetic similarity measures to gauge the genetic overlap between diseases, as well as several estimators of the number of overlapping disease genes between diseases. These methods are based on ranked gene lists so that they could be flexibly applied to different types of data. We first investigate the performance of the genetic similarity measure for evaluating the similarity between human diseases in simulation studies. Then we apply the method to diseases in the OMIM database. We show that our proposed genetic measure achieves superior performance in explaining phenotype similarities between diseases compared to simpler methods. Furthermore, we identified common genes underlying the genetic overlap between disease pairs. With an example of five vision-related diseases, we demonstrate how our methods can provide insights into the relationships among diseases based on their shared genetic mechanisms.
Affiliation(s)
- Quan Chen
- Molecular and Computational Biology Program, University of Southern California , Los Angeles, California
16
Hemme D, Veyel D, Mühlhaus T, Sommer F, Jüppner J, Unger AK, Sandmann M, Fehrle I, Schönfelder S, Steup M, Geimer S, Kopka J, Giavalisco P, Schroda M. Systems-wide analysis of acclimation responses to long-term heat stress and recovery in the photosynthetic model organism Chlamydomonas reinhardtii. The Plant Cell 2014; 26:4270-97. [PMID: 25415976 PMCID: PMC4277220 DOI: 10.1105/tpc.114.130997] [Citation(s) in RCA: 81] [Impact Index Per Article: 7.4] [Received: 08/12/2014] [Revised: 10/13/2014] [Accepted: 10/24/2014] [Indexed: 05/19/2023]
Abstract
We applied a top-down systems biology approach to understand how Chlamydomonas reinhardtii acclimates to long-term heat stress (HS) and recovers from it. For this, we shifted cells from 25 to 42°C for 24 h and back to 25°C for ≥8 h and monitored abundances of 1856 proteins/protein groups, 99 polar and 185 lipophilic metabolites, and cytological and photosynthesis parameters. Our data indicate that acclimation of Chlamydomonas to long-term HS consists of a temporally ordered, orchestrated implementation of response elements at various system levels. These comprise (1) cell cycle arrest; (2) catabolism of larger molecules to generate compounds with roles in stress protection; (3) accumulation of molecular chaperones to restore protein homeostasis together with compatible solutes; (4) redirection of photosynthetic energy and reducing power from the Calvin cycle to the de novo synthesis of saturated fatty acids to replace polyunsaturated ones in membrane lipids, which are deposited in lipid bodies; and (5) when sinks for photosynthetic energy and reducing power are depleted, resumption of Calvin cycle activity associated with increased photorespiration, accumulation of reactive oxygen species scavengers, and throttling of linear electron flow by antenna uncoupling. During recovery from HS, cells appear to focus on processes allowing rapid resumption of growth rather than restoring pre-HS conditions.
Affiliation(s)
- Dorothea Hemme
- Molekulare Biotechnologie and Systembiologie, TU Kaiserslautern, D-67663 Kaiserslautern, Germany; Max-Planck-Institut für Molekulare Pflanzenphysiologie, D-14476 Potsdam-Golm, Germany
- Daniel Veyel
- Molekulare Biotechnologie and Systembiologie, TU Kaiserslautern, D-67663 Kaiserslautern, Germany; Max-Planck-Institut für Molekulare Pflanzenphysiologie, D-14476 Potsdam-Golm, Germany
- Timo Mühlhaus
- Molekulare Biotechnologie and Systembiologie, TU Kaiserslautern, D-67663 Kaiserslautern, Germany; Max-Planck-Institut für Molekulare Pflanzenphysiologie, D-14476 Potsdam-Golm, Germany
- Frederik Sommer
- Molekulare Biotechnologie and Systembiologie, TU Kaiserslautern, D-67663 Kaiserslautern, Germany; Max-Planck-Institut für Molekulare Pflanzenphysiologie, D-14476 Potsdam-Golm, Germany
- Jessica Jüppner
- Max-Planck-Institut für Molekulare Pflanzenphysiologie, D-14476 Potsdam-Golm, Germany
- Ann-Katrin Unger
- Zellbiologie/Elektronenmikroskopie, Universität Bayreuth, D-95440 Bayreuth, Germany
- Michael Sandmann
- Institut für Biochemie und Biologie, Universität Potsdam, D-14476 Potsdam-Golm, Germany
- Ines Fehrle
- Max-Planck-Institut für Molekulare Pflanzenphysiologie, D-14476 Potsdam-Golm, Germany
- Stephanie Schönfelder
- Max-Planck-Institut für Molekulare Pflanzenphysiologie, D-14476 Potsdam-Golm, Germany
- Martin Steup
- Institut für Biochemie und Biologie, Universität Potsdam, D-14476 Potsdam-Golm, Germany
- Stefan Geimer
- Zellbiologie/Elektronenmikroskopie, Universität Bayreuth, D-95440 Bayreuth, Germany
- Joachim Kopka
- Max-Planck-Institut für Molekulare Pflanzenphysiologie, D-14476 Potsdam-Golm, Germany
- Patrick Giavalisco
- Max-Planck-Institut für Molekulare Pflanzenphysiologie, D-14476 Potsdam-Golm, Germany
- Michael Schroda
- Molekulare Biotechnologie and Systembiologie, TU Kaiserslautern, D-67663 Kaiserslautern, Germany; Max-Planck-Institut für Molekulare Pflanzenphysiologie, D-14476 Potsdam-Golm, Germany
17
Spouge JL, Mariño-Ramírez L, Sheetlin SL. Searching for repeats, as an example of using the generalised Ruzzo-Tompa algorithm to find optimal subsequences with gaps. International Journal of Bioinformatics Research and Applications 2014; 10:384-408. [PMID: 24989859 DOI: 10.1504/ijbra.2014.062991] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/21/2022]
Abstract
Some biological sequences contain subsequences of unusual composition; e.g. some proteins contain DNA binding domains, transmembrane regions and charged regions, and some DNA sequences contain repeats. The linear-time Ruzzo-Tompa (RT) algorithm finds subsequences of unusual composition, using a sequence of scores as input and the corresponding 'maximal segments' as output. In principle, permitting gaps in the output subsequences could improve sensitivity. Here, the input of the RT algorithm is generalised to a finite, totally ordered, weighted graph, so the algorithm locates paths of maximal weight through increasing but not necessarily adjacent vertices. By permitting the penalised deletion of unfavourable letters, the generalisation therefore includes gaps. The program RepWords, which finds inexact simple repeats in DNA, exemplifies the general concepts by out-performing a similar extant, ad hoc tool. With minimal programming effort, the generalised Ruzzo-Tompa algorithm could improve the performance of many programs for finding biological subsequences of unusual composition.
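For orientation, the following is a minimal sketch of the original, ungapped Ruzzo-Tompa procedure that this abstract takes as its starting point: it maps a sequence of scores to its maximal-scoring segments. It is not the generalised graph version with gaps and not the RepWords program. The function name and the half-open (start, end) index convention are assumptions of this example, and the plain right-to-left scan used here is simpler than the bookkeeping needed for a strict linear-time bound.

```python
def maximal_segments(scores):
    """Return maximal-scoring segments as half-open (start, end) index pairs."""
    segments = []  # each entry: [start, end, L, R]
    total = 0      # cumulative score of everything read so far
    for i, score in enumerate(scores):
        before = total
        total += score
        if score <= 0:
            continue  # non-positive scores never start a candidate segment
        # L is the cumulative total before the segment, R the total after it.
        current = [i, i + 1, before, total]
        while True:
            # Find the rightmost earlier candidate whose L is below current L.
            j = len(segments) - 1
            while j >= 0 and segments[j][2] >= current[2]:
                j -= 1
            if j < 0 or segments[j][3] >= current[3]:
                segments.append(current)
                break
            # Otherwise merge candidates j..end into the current segment, retry.
            current = [segments[j][0], current[1], segments[j][2], current[3]]
            del segments[j:]
    return [(start, end) for start, end, _, _ in segments]


if __name__ == "__main__":
    example = [4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]
    print(maximal_segments(example))
```

On the example list the three segments returned correspond to the single scores 4 and 3 and to the run 1, 2, -2, 2, -2, 1, 5, whose total of 7 exceeds that of any of its sub-runs.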
Affiliation(s)
- John L Spouge
- Computational Biology Branch, National Center for Biotechnology Information, Bethesda, MD 20894, USA
- Leonardo Mariño-Ramírez
- Computational Biology Branch, National Center for Biotechnology Information, Bethesda, MD 20894, USA
- Sergey L Sheetlin
- Computational Biology Branch, National Center for Biotechnology Information, Bethesda, MD 20894, USA
|
18
|
Scan statistics in human gene mapping. Am J Hum Genet 2012; 91:970; author reply 970-1. [PMID: 23122592 DOI: 10.1016/j.ajhg.2012.07.026] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/11/2012] [Revised: 07/11/2012] [Accepted: 07/11/2012] [Indexed: 11/23/2022] Open
|
19
|
Ionita-Laza I, Buxbaum J. Response to Ott and Hoh. Am J Hum Genet 2012. [DOI: 10.1016/j.ajhg.2012.09.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022] Open
|
20
|
Grzymski JJ, Dussaq AM. The significance of nitrogen cost minimization in proteomes of marine microorganisms. ISME Journal 2011; 6:71-80. [PMID: 21697958 PMCID: PMC3246230 DOI: 10.1038/ismej.2011.72] [Citation(s) in RCA: 92] [Impact Index Per Article: 6.6] [Indexed: 02/02/2023]
Abstract
Marine microorganisms thrive under low levels of nitrogen (N). N cost minimization is a major selective pressure imprinted on open-ocean microorganism genomes. Here we show that amino-acid sequences from the open ocean are reduced in N, but increased in average mass compared with coastal-ocean microorganisms. Nutrient limitation exerts significant pressure on organisms supporting the trade-off between N cost minimization and increased average mass of amino acids that is a function of increased A+T codon usage. N cost minimization, especially of highly expressed proteins, reduces the total cellular N budget by 2.7–10%; this minimization, in combination with reduction in genome size and cell size, is an evolutionary adaptation to nutrient limitation. The biogeochemical and evolutionary precedent for these findings suggests that N limitation is a stronger selective force in the ocean than biosynthetic costs and is an important evolutionary strategy in resource-limited ecosystems.
Affiliation(s)
- Joseph J Grzymski
- Division of Earth and Ecosystem Sciences, Desert Research Institute, Reno, NV, USA.
|
21
|
Provata A, Katsaloulis P. Hierarchical multifractal representation of symbolic sequences and application to human chromosomes. Physical Review E, Statistical, Nonlinear, and Soft Matter Physics 2010; 81:026102. [PMID: 20365626 DOI: 10.1103/physreve.81.026102] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/23/2009] [Indexed: 05/29/2023]
Abstract
The two-dimensional density correlation matrix is constructed for symbolic sequences using contiguous segments of arbitrary size. The multifractal spectrum obtained from this matrix motif is shown to characterize the correlations in the symbolic sequences. This method is applied to entire human chromosomes, shuffled human chromosomes, reconstructed human genomic sequences and to artificial random sequences. It is shown that all human chromosomes have common characteristics in their multifractal spectrum and deviate substantially from random and uncorrelated sequences of the same size. Small deviations are observed between the longer and the shorter chromosomes, especially for the higher (in absolute values) statistical moments. The correlations are crucial for the form of the multifractal spectrum; surrogate shuffled chromosomes present randomlike spectrum, distinctly different from the actual chromosomes. Analytical approaches based on hierarchical superposition of tensor products show that retaining pair correlations in the sequences leads to a closer representation of the genomic multifractal spectra, especially in the region of negative exponents, due to the underrepresentation of various functional units (such as the cytosine-guanine CG combination and its complementary GC complex). Retaining higher-order correlations in the construction of the tensor products is a way to approach closer the structure of the multifractal spectra of the actual genomic sequences. This hierarchical approach is generic and is applicable to other correlated symbolic sequences.
Affiliation(s)
- A Provata
- Institute of Physical Chemistry, National Center for Scientific Research Demokritos, 15310 Athens, Greece
|
22
|
Zhang Z, Townsend JP. Maximum-likelihood model averaging to profile clustering of site types across discrete linear sequences. PLoS Comput Biol 2009; 5:e1000421. [PMID: 19557160 PMCID: PMC2695770 DOI: 10.1371/journal.pcbi.1000421] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 01/12/2009] [Accepted: 05/21/2009] [Indexed: 11/19/2022] Open
Abstract
A major analytical challenge in computational biology is the detection and description of clusters of specified site types, such as polymorphic or substituted sites within DNA or protein sequences. Progress has been stymied by a lack of suitable methods to detect clusters and to estimate the extent of clustering in discrete linear sequences, particularly when there is no a priori specification of cluster size or cluster count. Here we derive and demonstrate a maximum likelihood method of hierarchical clustering. Our method incorporates a tripartite divide-and-conquer strategy that models sequence heterogeneity, delineates clusters, and yields a profile of the level of clustering associated with each site. The clustering model may be evaluated via model selection using the Akaike Information Criterion, the corrected Akaike Information Criterion, and the Bayesian Information Criterion. Furthermore, model averaging using weighted model likelihoods may be applied to incorporate model uncertainty into the profile of heterogeneity across sites. We evaluated our method by examining its performance on a number of simulated datasets as well as on empirical polymorphism data from diverse natural alleles of the Drosophila alcohol dehydrogenase gene. Our method yielded greater power for the detection of clustered sites across a breadth of parameter ranges, and achieved better accuracy and precision of estimation of clusters, than did the existing empirical cumulative distribution function statistics.
Affiliation(s)
- Zhang Zhang
- Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut, United States of America
- Jeffrey P. Townsend
- Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut, United States of America
|
23
|
Mrázek J. Finding sequence motifs in prokaryotic genomes--a brief practical guide for a microbiologist. Brief Bioinform 2009; 10:525-36. [PMID: 19553402 DOI: 10.1093/bib/bbp032] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/14/2022] Open
Abstract
Finding significant nucleotide sequence motifs in prokaryotic genomes can be divided into three types of tasks: (1) supervised motif finding, where a sample of motif sequences is used to find other similar sequences in genomes; (2) unsupervised motif finding, which typically relates to the task of finding regulatory motifs and protein binding sites and (3) exploratory motif finding, which aims to identify potential functionally significant sequence motifs as those that are unusual in some statistical sense. This article provides a conceptual overview for each type of task, a brief description of basic algorithms used in their solution, and a review of selected relevant software available online.
Affiliation(s)
- Jan Mrázek
- Department of Microbiology and Institute of Bioinformatics, University of Georgia, Athens, GA 30602-2605, USA.
|
24
|
Long range clustering of oligonucleotides containing the CG signal. J Theor Biol 2009; 258:18-26. [PMID: 19490875 DOI: 10.1016/j.jtbi.2009.01.014] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 02/15/2008] [Revised: 01/14/2009] [Accepted: 01/14/2009] [Indexed: 11/24/2022]
Abstract
The distance distributions between successive occurrences of the same oligonucleotides in chromosomal DNA are studied, in different classes of higher eucaryotic organisms. A two-parameter modeling is undertaken and applied on the distance distribution of quintuplets (sequences of size five bps) and hexaplets (sequences of size six bps); the first parameter k refers to the short range exponential decay of the distributions, whereas the second parameter m refers to the power law behavior. A two-dimensional scatter plot representing the model equation demonstrates that the points corresponding to the distance distribution of oligonucleotides containing the CG consensus sequence (promoter of the RNA polymerase II) cluster together (group alpha), apart from all other oligonucleotides (group beta). This is shown for the available chordata Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus, Gallus gallus and Danio rerio. This clustering is less evident in lower Animalia and plants, such as Drosophila melanogaster, Caenorhabditis elegans and Arabidopsis thaliana. Moreover, in all organisms the oligonucleotides which contain any consensus sequence are found to be described by long range distributions, whereas all others have a stronger influence of short range decay. Various measures are introduced and evaluated, to numerically characterize the clustering of the two groups. The one which most clearly discriminates the two classes is shown to be the proximity factor.
|
25
|
Larsson P, Hinas A, Ardell DH, Kirsebom LA, Virtanen A, Söderbom F. De novo search for non-coding RNA genes in the AT-rich genome of Dictyostelium discoideum: performance of Markov-dependent genome feature scoring. Genome Res 2008; 18:888-99. [PMID: 18347326 DOI: 10.1101/gr.069104.107] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 11/24/2022]
Abstract
Genome data are increasingly important in the computational identification of novel regulatory non-coding RNAs (ncRNAs). However, most ncRNA gene-finders are either specialized to well-characterized ncRNA gene families or require comparisons of closely related genomes. We developed a method for de novo screening for ncRNA genes with a nucleotide composition that stands out against the background genome based on a partial sum process. We compared the performance when assuming independent and first-order Markov-dependent nucleotides, respectively, and used Karlin-Altschul and Karlin-Dembo statistics to evaluate the significance of hits. We hypothesized that a first-order Markov-dependent process might have better power to detect ncRNA genes since nearest-neighbor models have been shown to be successful in predicting RNA structures. A model based on a first-order partial sum process (analyzing overlapping dinucleotides) had better sensitivity and specificity than a zeroth-order model when applied to the AT-rich genome of the amoeba Dictyostelium discoideum. In this genome, we detected 94% of previously known ncRNA genes (at this sensitivity, the false positive rate was estimated to be 25% in a simulated background). The predictions were further refined by clustering candidate genes according to sequence similarity and/or searching for an ncRNA-associated upstream element. We experimentally verified six out of 10 tested ncRNA gene predictions. We conclude that higher-order models, in combination with other information, are useful for identification of novel ncRNA gene families in single-genome analysis of D. discoideum. Our generalizable approach extends the range of genomic data that can be searched for novel ncRNA genes using well-grounded statistical methods.
Affiliation(s)
- Pontus Larsson
- Department of Cell and Molecular Biology, Biomedical Center, Uppsala University, SE-75124 Uppsala, Sweden
|
26
|
Mrázek J, Xie S, Guo X, Srivastava A. AIMIE: a web-based environment for detection and interpretation of significant sequence motifs in prokaryotic genomes. Bioinformatics 2008; 24:1041-8. [PMID: 18304933 DOI: 10.1093/bioinformatics/btn077] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Genomes contain biologically significant information that extends beyond that encoded in genes. Some of this information relates to various short dispersed repeats distributed throughout the genome. The goal of this work was to combine tools for detection of statistically significant dispersed repeats in DNA sequences with tools to aid development of hypotheses regarding their possible physiological functions in an easy-to-use web-based environment. RESULTS: Ab Initio Motif Identification Environment (AIMIE) was designed to facilitate investigations of dispersed sequence motifs in prokaryotic genomes. We used AIMIE to analyze the Escherichia coli and Haemophilus influenzae genomes in order to demonstrate the utility of the new environment. AIMIE detected repeated extragenic palindrome (REP) elements, CRISPR repeats, uptake signal sequences, intergenic dyad sequences and several other over-represented sequence motifs. Distributional patterns of these motifs were analyzed using the tools included in AIMIE. AVAILABILITY: AIMIE and the related software can be accessed at our web site http://www.cmbl.uga.edu/software.html.
Affiliation(s)
- Jan Mrázek
- Department of Microbiology, University of Georgia, Athens, GA 30602-2605, USA.
|
27
|
Mitrophanov AY, Borodovsky M. Statistical significance in biological sequence analysis. Brief Bioinform 2008; 7:2-24. [PMID: 16761361 DOI: 10.1093/bib/bbk001] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022] Open
Abstract
One of the major goals of computational sequence analysis is to find sequence similarities, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Since the degree of similarity is usually assessed by the sequence alignment score, it is necessary to know if a score is high enough to indicate a biologically interesting alignment. A powerful approach to defining score cutoffs is based on the evaluation of the statistical significance of alignments. The statistical significance of an alignment score is frequently assessed by its P-value, which is the probability that this score or a higher one can occur simply by chance, given the probabilistic models for the sequences. In this review we discuss the general role of P-value estimation in sequence analysis, and give a description of theoretical methods and computational approaches to the estimation of statistical significance for important classes of sequence analysis problems. In particular, we concentrate on the P-value estimation techniques for single sequence studies (both score-based and score-free), global and local pairwise sequence alignments, multiple alignments, sequence-to-profile alignments and alignments built with hidden Markov models. We anticipate that the review will be useful both to researchers professionally working in bioinformatics as well as to biomedical scientists interested in using contemporary methods of DNA and protein sequence analysis.
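The P-value definition given in this abstract (the probability of this score or a higher one arising by chance under a model) can be made concrete with a small Monte Carlo sketch for a single-sequence, score-based statistic. Everything specific here is an assumption for illustration rather than a method taken from the review: the statistic (best local segment score), the null model (random shuffles that preserve composition), and the names `max_segment_score` and `empirical_p_value`.

```python
import random


def max_segment_score(scores):
    """Best local (contiguous) segment score; 0 if no score is positive."""
    best = run = 0
    for s in scores:
        run = max(0, run + s)
        best = max(best, run)
    return best


def empirical_p_value(seq, score_of, n_shuffles=999, seed=0):
    """Estimate P(score >= observed) under a shuffle null model.

    score_of maps each letter to its score (hypothetical scoring scheme).
    The +1 terms count the observed sequence itself, so the estimate
    is never exactly zero.
    """
    rng = random.Random(seed)
    observed = max_segment_score([score_of[c] for c in seq])
    letters = list(seq)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if max_segment_score([score_of[c] for c in letters]) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    scheme = {"A": 1, "T": 1, "G": -1, "C": -1}
    print(empirical_p_value("GCGCGCATATATATGCGCGC", scheme))
```

Simulation of this kind is slow for very small P-values, which is one reason analytical approximations are of interest.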
|
28
|
Mattarucchi E, Guerini V, Rambaldi A, Campiotti L, Venco A, Pasquali F, Lo Curto F, Porta G. Microhomologies and interspersed repeat elements at genomic breakpoints in chronic myeloid leukemia. Genes Chromosomes Cancer 2008; 47:625-32. [DOI: 10.1002/gcc.20568] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.4] [Indexed: 12/19/2022] Open
|
29
|
Boni MF, Posada D, Feldman MW. An exact nonparametric method for inferring mosaic structure in sequence triplets. Genetics 2007; 176:1035-47. [PMID: 17409078 PMCID: PMC1894573 DOI: 10.1534/genetics.106.068874] [Citation(s) in RCA: 599] [Impact Index Per Article: 33.3] [Received: 11/27/2006] [Accepted: 03/18/2007] [Indexed: 11/18/2022] Open
Abstract
Statistical tests for detecting mosaic structure or recombination among nucleotide sequences usually rely on identifying a pattern or a signal that would be unlikely to appear under clonal reproduction. Dozens of such tests have been described, but many are hampered by long running times, confounding of selection and recombination, and/or inability to isolate the mosaic-producing event. We introduce a test that is exact, nonparametric, rapidly computable, free of the infinite-sites assumption, able to distinguish between recombination and variation in mutation/fixation rates, and able to identify the breakpoints and sequences involved in the mosaic-producing event. Our test considers three sequences at a time: two parent sequences that may have recombined, with one or two breakpoints, to form the third sequence (the child sequence). Excess similarity of the child sequence to a candidate recombinant of the parents is a sign of recombination; we take the maximum value of this excess similarity as our test statistic Delta(m,n,b). We present a method for rapidly calculating the distribution of Delta(m,n,b) and demonstrate that it has comparable power to and a much improved running time over previous methods, especially in detecting recombination in large data sets.
Affiliation(s)
- Maciej F Boni
- Stanford Genome Technology Center, Palo Alto, California 94304, USA.
|
30
|
|
31
|
Chew DSH, Leung MY, Choi KP. AT excursion: a new approach to predict replication origins in viral genomes by locating AT-rich regions. BMC Bioinformatics 2007; 8:163. [PMID: 17517140 PMCID: PMC1904460 DOI: 10.1186/1471-2105-8-163] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 12/08/2006] [Accepted: 05/21/2007] [Indexed: 11/12/2022] Open
Abstract
Background: Replication origins are considered important sites for understanding the molecular mechanisms involved in DNA replication. Many computational methods have been developed for predicting their locations in archaeal, bacterial and eukaryotic genomes. However, a prediction method designed for a particular kind of genome might not work well for another. In this paper, we propose the AT excursion method, which is a score-based approach, to quantify local AT abundance in genomic sequences and use the identified high scoring segments for predicting replication origins. This method has the advantages of requiring no preset window size and having rigorous criteria to evaluate statistical significance of high scoring segments. Results: We have evaluated the AT excursion method by checking its predictions against known replication origins in herpesviruses and comparing its performance with an existing base weighted score method (BWS1). Out of 43 known origins, 39 are predicted by either one or the other method and 26 origins are predicted by both. The excursion method identifies six origins not predicted by BWS1, showing that the AT excursion method is a valuable complement to BWS1. We have also applied the AT excursion method to two other families of double stranded DNA viruses, the poxviruses and iridoviruses, of which very few replication origins are documented in the public domain. The prediction results are made available as supplementary materials at [1]. Preliminary investigation shows that the proposed method works well on some larger genomes too. Conclusion: The AT excursion method will be a useful computational tool for identifying replication origins in a variety of genomic sequences.
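To make the idea of a window-free, score-based scan for AT-rich regions more tangible, here is a minimal Python sketch of a score excursion. It assumes the common recursion E_i = max(0, E_{i-1} + s_i) with a positive score for A/T and a negative score for everything else. The score values, the function name `at_excursion`, and the decision to report only the single highest peak are assumptions for illustration; the paper's actual scores and its significance criteria are not reproduced here.

```python
def at_excursion(seq, at_score=1.0, gc_score=-2.0):
    """Return (peak_value, start, end) of the highest AT excursion.

    start/end are half-open indices into seq. Any letter other than
    A or T (including ambiguity codes) receives gc_score in this sketch.
    """
    best = (0.0, 0, 0)
    e = 0.0
    start = 0
    for i, base in enumerate(seq.upper()):
        s = at_score if base in "AT" else gc_score
        if e == 0.0:
            start = i  # a new excursion can only begin from zero
        e = max(0.0, e + s)
        if e > best[0]:
            best = (e, start, i + 1)
    return best


if __name__ == "__main__":
    seq = "GCATATTAGCGC"
    peak, a, b = at_excursion(seq)
    print(peak, seq[a:b])  # → 6.0 ATATTA
```

Because the excursion resets to zero whenever the running score would go negative, no window length has to be chosen in advance, which matches the advantage the abstract highlights.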
Affiliation(s)
- David SH Chew
- Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, Singapore
- Ming-Ying Leung
- Department of Mathematical Sciences and Bioinformatics Program, The University of Texas at El Paso, TX 79968, USA
- Kwok Pui Choi
- Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, Singapore
- Department of Mathematics, National University of Singapore, Singapore 117543, Singapore
|
32
|
Papatsenko D. ClusterDraw web server: a tool to identify and visualize clusters of binding motifs for transcription factors. Bioinformatics 2007; 23:1032-4. [PMID: 17308342 DOI: 10.1093/bioinformatics/btm047] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022]
Abstract
ClusterDraw is a program for identifying binding sites and binding-site clusters. Its major difference from existing tools is its ability to scan a wide range of parameter values and weigh the statistical significance of all possible clusters smaller than a selected size. The program produces graphs along with decorated FASTA files. The ClusterDraw web server is available at the following URL: http://flydev.berkeley.edu/cgi-bin/cld/submit.cgi
Affiliation(s)
- Dmitri Papatsenko
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA 94720, USA.
|
33
|
Mau B, Glasner JD, Darling AE, Perna NT. Genome-wide detection and analysis of homologous recombination among sequenced strains of Escherichia coli. Genome Biol 2006; 7:R44. [PMID: 16737554 PMCID: PMC1779527 DOI: 10.1186/gb-2006-7-5-r44] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.8] [Received: 11/01/2005] [Revised: 02/08/2006] [Accepted: 05/08/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Comparisons of complete bacterial genomes reveal evidence of lateral transfer of DNA across otherwise clonally diverging lineages. Some lateral transfer events result in acquisition of novel genomic segments and are easily detected through genome comparison. Other more subtle lateral transfers involve homologous recombination events that result in substitution of alleles within conserved genomic regions. This type of event is observed infrequently among distantly related organisms. It is reported to be more common within species, but the frequency has been difficult to quantify since the sequences under comparison tend to have relatively few polymorphic sites. RESULTS: Here we report a genome-wide assessment of homologous recombination among a collection of six complete Escherichia coli and Shigella flexneri genome sequences. We construct a whole-genome multiple alignment and identify clusters of polymorphic sites that exhibit atypical patterns of nucleotide substitution using a random walk-based method. The analysis reveals one large segment (approximately 100 kb) and 186 smaller clusters of single base pair differences that suggest lateral exchange between lineages. These clusters include portions of 10% of the 3,100 genes conserved in six genomes. Statistical analysis of the functional roles of these genes reveals that several classes of genes are over-represented, including those involved in recombination, transport and motility. CONCLUSION: We demonstrate that intraspecific recombination in E. coli is much more common than previously appreciated and may show a bias for certain types of genes. The described method provides high-specificity, conservative inference of past recombination events.
Affiliation(s)
- Bob Mau
- Department of Mathematics, Lincoln Drive, University of Wisconsin, Madison WI 53706, USA
- Department of Oncology, University Ave, University of Wisconsin, Madison WI 53706, USA
- Genome Center of Wisconsin, Henry Mall, University of Wisconsin, Madison WI 53706, USA
- Jeremy D Glasner
- Genome Center of Wisconsin, Henry Mall, University of Wisconsin, Madison WI 53706, USA
- Aaron E Darling
- Department of Computer Science, W. Dayton St, University of Wisconsin, Madison WI 53706, USA
- Nicole T Perna
- Genome Center of Wisconsin, Henry Mall, University of Wisconsin, Madison WI 53706, USA
- Department of Animal Health and Biomedical Sciences, Linden Drive, University of Wisconsin, Madison WI 53706, USA
|
34
|
Bragg JG, Thomas D, Baudouin-Cornu P. Variation among species in proteomic sulphur content is related to environmental conditions. Proc Biol Sci 2006; 273:1293-300. [PMID: 16720405 PMCID: PMC1560280 DOI: 10.1098/rspb.2005.3441] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Received: 11/02/2005] [Accepted: 12/04/2005] [Indexed: 11/12/2022] Open
Abstract
The elemental composition of proteins influences the quantities of different elements required by organisms. Here, we considered variation in the sulphur content of whole proteomes among 19 Archaea, 122 Eubacteria and 10 eukaryotes whose genomes have been fully sequenced. We found that different species vary greatly in the sulphur content of their proteins, and that average sulphur content of proteomes and genome base composition are related. Forces contributing to variation in proteomic sulphur content appear to operate quite uniformly across the proteins of different species. In particular, the sulphur content of orthologous proteins was frequently correlated with mean proteomic sulphur contents. Among prokaryotes, proteomic sulphur content tended to be greater in anaerobes, relative to non-anaerobes. Thermophiles tended to have lower proteomic sulphur content than non-thermophiles, consistent with the thermolability of cysteine and methionine residues. This work suggests that persistent environmental growth conditions can influence the evolution of elemental composition of whole proteomes in a manner that may have important implications for the amount of sulphur used by living organisms to build proteins. It extends previous studies that demonstrated links between transient changes in environmental conditions and the elemental composition of subsets of proteins expressed under these conditions.
Affiliation(s)
- Jason G Bragg
- Department of Biology, University of New Mexico, MSC03 2020, Albuquerque, NM 87131-0001, USA
- Dominique Thomas
- Centre de Génétique Moléculaire, Centre National de la Recherche Scientifique, 91198 Gif-sur-Yvette, France
- Cytomics Systems SA, Bâtiment 5, 1 avenue de la Terrasse, 91190 Gif sur Yvette, France
- Peggy Baudouin-Cornu
- Samuel Lunenfeld Research Institute, Mount Sinai Hospital, 600 University Avenue, Toronto, ON M5G 1X5, Canada
- LPG, SBGM/DBJC, bât 144, CEA Saclay, F-91191 Gif-sur-Yvette Cedex, France
|
35
|
Abstract
The Arthur M. Sackler Colloquium of the National Academy of Sciences, "Frontiers in Bioinformatics: Unsolved Problems and Challenges," organized by David Eisenberg, Russ Altman, and myself, was held October 15-17, 2004, to provide a forum for discussing concepts and methods in bioinformatics serving the biological and medical sciences. The deluge of genomic and proteomic data in the last two decades has driven the creation of tools that search and analyze biomolecular sequences and structures. Bioinformatics is highly interdisciplinary, using knowledge from mathematics, statistics, computer science, biology, medicine, physics, chemistry, and engineering.
Affiliation(s)
- Samuel Karlin
- Department of Mathematics, Stanford University, Stanford, CA 94305-2125, USA.
|
36
|
|
37
|
Larrasa J, García-Sánchez A, Ambrose NC, Parra A, Alonso JM, Rey JM, Hermoso-de-Mendoza M, Hermoso-de-Mendoza J. Evaluation of randomly amplified polymorphic DNA and pulsed field gel electrophoresis techniques for molecular typing of Dermatophilus congolensis. FEMS Microbiol Lett 2004; 240:87-97. [PMID: 15500984 DOI: 10.1016/j.femsle.2004.09.016] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 05/31/2004] [Revised: 09/14/2004] [Indexed: 11/22/2022] Open
Abstract
This study aimed to evaluate molecular typing methods useful for standardization of strains in experimental work on dermatophilosis. Fifty Dermatophilus congolensis isolates, collected from sheep, cattle, horse and a deer, were analyzed by randomly amplified polymorphic DNA (RAPD) method using twenty-one different primers, and the results were compared with those obtained by typing with a pulsed field gel electrophoresis (PFGE) method using the restriction digest enzyme Sse8387I. The typeability, reproducibility and discriminatory power of RAPD and Sse8387I-PFGE typing were calculated. Both typing methods were highly reproducible. Of the two techniques, Sse8387I-PFGE was the least discriminating (Dice Index (DI), 0.663) and could not distinguish between epidemiologically related isolates, whereas RAPD showed an excellent discriminatory power (DI, 0.7694-0.9722). Overall, the degree of correlation between RAPD and PFGE typing was significantly high (r, 0.8822). We conclude that the DNA profiles generated by either RAPD or PFGE can be used to differentiate epidemiologically unrelated isolates. The results of this study strongly suggest that at least two independent primers are used for RAPD typing in order to improve its discriminatory power, and that PFGE is used for confirmation of RAPD results.
Affiliation(s)
- José Larrasa
- Departamento de Microbiología, Laboratorios Larrasa S.L., Corredera Hernando de Soto 13-A, Jerez de los Caballeros, 06380 Badajoz, Spain
|
38
|
Csurös M. Maximum-scoring segment sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2004; 1:139-50. [PMID: 17051696 DOI: 10.1109/tcbb.2004.43] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 05/12/2023]
Abstract
We examine the problem of finding maximum-scoring sets of disjoint segments in a sequence of scores. The problem arises in DNA and protein segmentation and in postprocessing of sequence alignments. Our key result states a simple recursive relationship between maximum-scoring segment sets. The statement leads to fast algorithms for finding such segment sets. We apply our methods to the identification of noncoding RNA genes in thermophiles.
Affiliation(s)
- Miklós Csurös
- Département d'informatique et de recherche opérationnelle, Université de Montréal, C.P. 6128, succ. Centre-Ville, Montréal, Qué. H3C 3J7, Canada.
|
39
|
Abstract
The replication of the chromosome is among the most essential functions of the bacterial cell and influences many other cellular mechanisms, from gene expression to cell division. Yet the way it impacts on the bacterial chromosome was not fully acknowledged until the availability of complete genomes allowed one to look upon genomes as more than bags of genes. Chromosomal replication includes a set of asymmetric mechanisms, among which are a division in a lagging and a leading strand and a gradient between early and late replicating regions. These differences are the causes of many of the organizational features observed in bacterial genomes, in terms of both gene distribution and sequence composition along the chromosome. When asymmetries or gradients increase in some genomes, e.g. due to a different composition of the DNA polymerase or to a higher growth rate, so do the corresponding biases. As some of the features of the chromosome structure seem to be under strong selection, understanding such biases is important for the understanding of chromosome organization and adaptation. Inversely, understanding chromosome organization may shed further light on questions relating to replication and cell division. Ultimately, the understanding of the interplay between these different elements will allow a better understanding of bacterial genetics and evolution.
Affiliation(s)
- Eduardo P C Rocha
- Atelier de Bioinformatique, Université Pierre et Marie Curie, 12, Rue Cuvier, 75005 Paris, and Unité Génétique des Génomes Bactériens, Institut Pasteur, 28 rue du Dr Roux, 75724 Paris Cedex 15, France
|
40
|
Abstract
Protein simple sequences, a subset of low-complexity sequences, are regions of sequence highly enriched in one or a few residue types. Simple sequences are exceedingly common, the average being more than one per protein sequence. Despite being so common, such sequences are not well-studied. The simple sequences that have been subjected to detailed study are often found to possess important functions. Here we present a survey of protein simple sequences, generally enriched in a single residue type, with the aim of studying their conservation. We find that the majority of such simple sequences are not conserved. However, conserved protein simple sequences are relatively common, with approximately 11% of the surveyed protein families possessing a conserved simple sequence. The data obtained in this study support the idea that simple sequences are conserved for functional reasons. Such functions can range from substrate binding, to mediating protein-protein interactions, to structural integrity. A perhaps surprising finding is that the residue enriching a conserved simple sequence is itself not necessarily conserved. Neither is the length of many of the highly conserved simple sequences. In the few cases where structural and functional data is available it is found that the conserved simple sequences are consistent with both local structure and function. The data presented support the idea that protein simple sequences can be conserved and have important roles in protein structure and function.
Affiliation(s)
- Kim Lan Sim
- Center for Structural Biology, Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, Kentucky 40536-0298, USA
|
41
|
Dunker AK, Brown CJ, Obradovic Z. Identification and functions of usefully disordered proteins. ADVANCES IN PROTEIN CHEMISTRY 2004; 62:25-49. [PMID: 12418100 DOI: 10.1016/s0065-3233(02)62004-2] [Citation(s) in RCA: 288] [Impact Index Per Article: 13.7] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Affiliation(s)
- A Keith Dunker
- School of Molecular Biosciences, Washington State University, Pullman, Washington 99164, USA
|
42
|
Abstract
In this article, the use of the finite Markov chain imbedding (FMCI) technique to study patterns in DNA under a hidden Markov model (HMM) is introduced. With a vision of studying multiple runs-related statistics simultaneously under an HMM through the FMCI technique, this work establishes an investigation of a bivariate runs statistic under a binary HMM for DNA pattern recognition. An FMCI-based recursive algorithm is derived and implemented for the determination of the exact distribution of this bivariate runs statistic under an independent identically distributed (IID) framework, a Markov chain (MC) framework, and a binary HMM framework. With this algorithm, we have studied the distributions of the bivariate runs statistic under different binary HMM parameter sets; probabilistic profiles of runs are created and shown to be useful for trapping HMM maximum likelihood estimates (MLEs). This MLE-trapping scheme offers good initial estimates to jump-start the expectation-maximization (EM) algorithm in HMM parameter estimation and helps prevent the EM estimates from landing on a local maximum or a saddle point. Applications of the bivariate runs statistic and the probabilistic profiles in conjunction with binary HMMs for pattern recognition in genomic DNA sequences are illustrated via case studies on DNA bendability signals using human DNA data.
Affiliation(s)
- Leo Wang-Kit Cheung
- Epidemiology Section, Cancer Etiology Program, Cancer Research Center of Hawaii, University of Hawaii, Honolulu, HI 96813-2479, USA.
|
43
|
Baudouin-Cornu P, Schuerer K, Marlière P, Thomas D. Intimate evolution of proteins. Proteome atomic content correlates with genome base composition. J Biol Chem 2003; 279:5421-8. [PMID: 14645368 DOI: 10.1074/jbc.m306415200] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022] Open
Abstract
Discerning the significant relations that exist within and among genome sequences is a major step toward the modeling of biopolymer evolution. Here we report the systematic analysis of the atomic composition of proteins encoded by organisms representative of each kingdom. Protein atomic contents are shown to vary largely among species, the larger variations being observed for the main architectural component of proteins, the carbon atom. These variations apply to the bulk proteins as well as to subsets of ortholog proteins. A pronounced correlation between proteome carbon content and genome base composition is further evidenced, with high G+C genome content being related to low protein carbon content. The generation of random proteomes and the examination of the canonical genetic code provide arguments for the hypothesis that natural selection might have driven genome base composition.
Affiliation(s)
- Peggy Baudouin-Cornu
- Centre de Génétique Moléculaire, Centre National de la Recherche Scientifique, 91 198 Gif sur Yvette, France
|
44
|
Nandi T, Dash D, Ghai R, B-Rao C, Kannan K, Brahmachari SK, Ramakrishnan C, Ramachandran S. A novel complexity measure for comparative analysis of protein sequences from complete genomes. J Biomol Struct Dyn 2003; 20:657-68. [PMID: 12643768 DOI: 10.1080/07391102.2003.10506882] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
Abstract
Analysis of sequence complexities of proteins is an important step in the characterization and classification of new genomes. A new measure has been proposed to compute sequence complexity in protein sequences based on linguistic complexity. The algorithm requires a single parameter, is computationally simple and provides a framework for comparative genomic analysis. Protein sequences were classified into groups of high or low complexity based on a quantitative measure termed F(c), which is proportional to the fraction of low complexity sequence present in the protein. The algorithm was tested on sequences of 196 non-homologous proteins whose crystal structures are available at ≤2.0 Å resolution. Protein sequences of high complexity had 'globular' structures (95% agreement), whereas those of low complexity had non-globular structures (80% agreement). Application of this measure to proteins of unknown structure/function from different genomes revealed that the sequences of high complexity constitute the majority in all genomes (about 90% in Archaea, about 93% in Eubacteria, 89% in Saccharomyces cerevisiae and 90% in Caenorhabditis elegans). Aeropyrum pernix among Archaea and Deinococcus radiodurans among Eubacteria have the lowest fraction of high complexity proteins (75% and 80% respectively). Further, it was observed that a few bacterial pathogens (Mycobacterium tuberculosis, Pseudomonas aeruginosa) have a high fraction of low complexity proteins. The program ScanCom is available from the authors as a PERL script (UNIX system).
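For readers unfamiliar with linguistic complexity, the sketch below shows one common generic formulation: the number of distinct words (k-mers) observed in a sequence divided by the maximum number that could occur, summed over word lengths 1 to `max_k`. This is an assumption made for illustration only. It is not the F(c) measure or the ScanCom implementation described in the abstract, and `linguistic_complexity`, `max_k`, and `alphabet_size` are hypothetical names.

```python
def linguistic_complexity(seq, max_k=3, alphabet_size=20):
    """Ratio of observed distinct k-mers to the maximum possible, k = 1..max_k.

    Generic vocabulary-usage measure; values near 1 mean a rich vocabulary,
    values near 0 mean a repetitive (low-complexity) sequence.
    """
    n = len(seq)
    observed = 0
    possible = 0
    for k in range(1, max_k + 1):
        if n < k:
            break
        # Distinct words of length k actually present in the sequence.
        words = {seq[i:i + k] for i in range(n - k + 1)}
        observed += len(words)
        # Upper bound: limited by the alphabet and by the number of positions.
        possible += min(alphabet_size ** k, n - k + 1)
    return observed / possible if possible else 0.0


print(linguistic_complexity("ACDEFGHIKL"))  # every word is distinct
print(linguistic_complexity("AAAAAAAAAA"))  # homopolymer, very low value
```

A classifier in the spirit of the abstract would apply such a score over windows and threshold it, but the window size and threshold would have to come from the original publication.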
Affiliation(s)
- Tannistha Nandi
- Institute of Genomics and Integrative Biology, Centre for Biochemical Technology, Mall Road, Delhi 110 007, India
|
45
|
Chen C, Gentles AJ, Jurka J, Karlin S. Genes, pseudogenes, and Alu sequence organization across human chromosomes 21 and 22. Proc Natl Acad Sci U S A 2002; 99:2930-5. [PMID: 11867739 PMCID: PMC122450 DOI: 10.1073/pnas.052692099] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/21/2001] [Indexed: 11/18/2022] Open
Abstract
Human chromosomes 21 and 22 (mainly the q-arms) were the first complete parts of the human genome released. Our analysis of genes, pseudogenes (Psig), and Alu repeats across these chromosomes includes the following findings: The number of gene structures containing untranslated exons exceeds 25%; the terminal exon tends to be the largest among exons, whereas the initial intron tends to be the largest among introns; single-exon gene length is approximately the mean gene exon number times the mean internal exon length; processed Psig lengths are on average approximately the same as single-exon gene length; and the G+C content and length of genes are uncorrelated. The counts and distribution of genes, Psig, and Alu sequences and G+C variation are evaluated with respect to clusters and overdispersions. Other assessments concern comparisons of intergenic lengths, properties of Psig sequences, and correlations between Alu and Psig sequences.
Affiliation(s)
- Chingfer Chen
- Department of Mathematics, Stanford University, Stanford, CA 94305-2125, USA
|
46
|
Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, Oldfield CJ, Campen AM, Ratliff CM, Hipps KW, Ausio J, Nissen MS, Reeves R, Kang C, Kissinger CR, Bailey RW, Griswold MD, Chiu W, Garner EC, Obradovic Z. Intrinsically disordered protein. J Mol Graph Model 2002; 19:26-59. [PMID: 11381529 DOI: 10.1016/s1093-3263(00)00138-8] [Citation(s) in RCA: 1797] [Impact Index Per Article: 78.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Abstract
Proteins can exist in a trinity of structures: the ordered state, the molten globule, and the random coil. The five following examples suggest that native protein structure can correspond to any of the three states (not just the ordered state) and that protein function can arise from any of the three states and their transitions. (1) In a process that likely mimics infection, fd phage converts from the ordered into the disordered molten globular state. (2) Nucleosome hyperacetylation is crucial to DNA replication and transcription; this chemical modification greatly increases the net negative charge of the nucleosome core particle. We propose that the increased charge imbalance promotes its conversion to a much less rigid form. (3) Clusterin contains an ordered domain and also a native molten globular region. The molten globular domain likely functions as a proteinaceous detergent for cell remodeling and removal of apoptotic debris. (4) In a critical signaling event, a helix in calcineurin becomes bound and surrounded by calmodulin, thereby turning on calcineurin's serine/threonine phosphatase activity. Locating the calcineurin helix within a region of disorder is essential for enabling calmodulin to surround its target upon binding. (5) Calsequestrin regulates calcium levels in the sarcoplasmic reticulum by binding approximately 50 ions/molecule. Disordered polyanion tails at the carboxy terminus bind many of these calcium ions, perhaps without adopting a unique structure. In addition to these examples, we will discuss 16 more proteins with native disorder. These disordered regions include molecular recognition domains, protein folding inhibitors, flexible linkers, entropic springs, entropic clocks, and entropic bristles. Motivated by such examples of intrinsic disorder, we are studying the relationships between amino acid sequence and order/disorder, and from this information we are predicting intrinsic order/disorder from amino acid sequence. 
The sequence-structure relationships indicate that disorder is an encoded property, and the predictions strongly suggest that proteins in nature are much richer in intrinsic disorder than are those in the Protein Data Bank. Recent predictions on 29 genomes indicate that proteins from eucaryotes apparently have more intrinsic disorder than those from either bacteria or archaea, with typically >30% of eucaryotic proteins having disordered regions of length ≥50 consecutive residues.
Affiliation(s)
- A K Dunker
- School of Molecular Biosciences, Washington State University, Pullman, WA 99164-4660, USA.
|
47
|
Sonnhammer EL, Wootton JC. Integrated graphical analysis of protein sequence features predicted from sequence composition. Proteins 2001; 45:262-73. [PMID: 11599029 DOI: 10.1002/prot.1146] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Several protein sequence analysis algorithms are based on properties of amino acid composition and repetitiveness. These include methods for prediction of secondary structure elements, coiled-coils, transmembrane segments or signal peptides, and for assignment of low-complexity, nonglobular, or intrinsically unstructured regions. The quality of such analyses can be greatly enhanced by graphical software tools that present predicted sequence features together in context and allow judgment to be focused simultaneously on several different types of supporting information. For these purposes, we describe the SFINX package, which allows many different sets of segmental or continuous-curve sequence feature data, generated by individual external programs, to be viewed in combination alongside a sequence dot-plot or a multiple alignment of database matches. The implementation is currently based on extensions to the graphical viewers Dotter and Blixem and scripts that convert data from external programs to a simple generic data definition format called SFS. We describe applications in which dot-plots and flanking database matches provide valuable contextual information for analyses based on compositional and repetitive sequence features. The system is also useful for comparing results from algorithms run with a range of parameters to determine appropriate values for defaults or cutoffs for large-scale genomic analyses.
Affiliation(s)
- E L Sonnhammer
- Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden.
|
48
|
Jeffs AR, Wells E, Morris CM. Nonrandom distribution of interspersed repeat elements in the BCR and ABL1 genes and its relation to breakpoint cluster regions. Genes Chromosomes Cancer 2001; 32:144-54. [PMID: 11550282 DOI: 10.1002/gcc.1176] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022] Open
Abstract
The Philadelphia translocation, t(9;22)(q34;q11), is the microscopically visible product of recombination between two genes, ABL1 on chromosome 9 and BCR on chromosome 22, and gives rise to a functional hybrid BCR-ABL1 gene with demonstrated leukemogenic properties. Breakpoints in BCR occur mostly within one of two regions: a 5 kb major breakpoint cluster region (M-Bcr) and a larger 35 kb minor breakpoint cluster region (m-Bcr) towards the 3' end of the first BCR intron. By contrast, breakpoints in ABL1 are reported to occur more widely across a >200 kb region which spans the large first and second introns. The mechanisms that determine preferential breakage sites in BCR, and which cause recombination between BCR and ABL1, are presently unknown. In some cases, Alu repeats have been identified at or near sequenced breakpoint sites in both genes, providing indications, albeit controversial, that they may be relevant. For the present study, we carried out a detailed analysis of genomic BCR and ABL1 sequences to identify, classify, and locate interspersed repeat sequences and to relate their distribution to precisely mapped BCR-ABL1 recombination sites. Our findings confirm that Alu are the most abundant class of repeat in both genes, but that they occupy fewer sites than previously estimated and that they are distributed nonrandomly. r-Scan statistics were applied to provide a measure of repeat distribution and to evaluate extremes in repeat spacing. A significant lack of Alu elements was observed across the major and minor breakpoint cluster regions of BCR and across a 25-kb region showing a high frequency of breakage in ABL1. These findings counter the suggestion that occurrence of Alu at BCR-ABL1 recombination sites is likely by chance because of the high density of Alu in these two genes. 
Instead, as yet unidentified DNA conformation or nucleotide characteristics peculiar to the preferentially recombining regions, including those Alu elements present within them, more likely influence their fragility.
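The r-scan idea mentioned in the abstract can be sketched as follows, assuming the usual definition: for ordered marker positions, each r-scan length is the distance from a marker to the r-th marker after it, and the smallest and largest values point to clustering and to unusually sparse stretches, respectively. The function name `r_scan_lengths` and the example coordinates are hypothetical, and no significance thresholds are computed here.

```python
def r_scan_lengths(positions, r):
    """Distances spanned by r consecutive gaps between sorted marker positions."""
    pts = sorted(positions)
    # pts[i + r] - pts[i] is the sum of the r gaps starting at marker i.
    return [pts[i + r] - pts[i] for i in range(len(pts) - r)]


# Hypothetical repeat start coordinates (bp) along a gene.
repeat_starts = [100, 150, 160, 900, 950]
lengths = r_scan_lengths(repeat_starts, r=2)
print(min(lengths))  # tightest group of three repeats
print(max(lengths))  # widest stretch covered by three consecutive repeats
```

Judging whether an extreme is significant requires comparing it with its distribution under a random-placement model, which this sketch leaves out.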
Affiliation(s)
- A R Jeffs
- Leukaemia Research Group, Christchurch School of Medicine, Christchurch, New Zealand
|
49
|
Baudouin-Cornu P, Surdin-Kerjan Y, Marlière P, Thomas D. Molecular evolution of protein atomic composition. Science 2001; 293:297-300. [PMID: 11452124 DOI: 10.1126/science.1061052] [Citation(s) in RCA: 99] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]
Abstract
Living organisms encounter various growth conditions in their habitats, raising the question of whether ecological fluctuations could alter biological macromolecules. The advent of complete genome sequences and the characterization of whole metabolic pathways allowed us to search for such ecological imprints. Significant correlations between atomic composition and metabolic function were found in sulfur- and carbon-assimilatory enzymes, which appear depleted in sulfur and carbon, respectively, in both the bacterium Escherichia coli and the eukaryote Saccharomyces cerevisiae. In addition to genetic instructions, genomic data thus also provide paleontological records of environmental nutrient availability and of metabolic costs.
Affiliation(s)
- P Baudouin-Cornu
- Centre de Génétique Moléculaire, Centre National de la Recherche Scientifique, 91 198 Gif-sur-Yvette Cedex, France; Evologic SA, 4 rue Pierre Fontaine, 91000 Evry, France
|
50
|
Su X, Wallenstein S, Bishop D. Nonoverlapping clusters: approximate distribution and application to molecular biology. Biometrics 2001; 57:420-6. [PMID: 11414565 DOI: 10.1111/j.0006-341x.2001.00420.x] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
An approach is developed for the screening of genomic sequence data to identify gene regulatory regions. This approach is based on deciding if putative transcription factor binding sites are clustered together to a greater extent than one would expect by chance. Given n events occurring on an interval of width L (L base pairs), an r:w cluster is defined as r + 1 consecutive events all contained within a window of length wL. Accurate and easily computable approximations are derived for the distribution of the number of nonoverlapping r:w clusters under the model that the positions of the n events have a uniform distribution. Simulations demonstrate that these approximations have greater accuracy than existing methods. The approximation is applied to detect erythroid-specific regulatory regions in genomic DNA sequences, first in an artificial case where r is specified a priori and then as part of an exploratory approach.
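The r:w cluster definition above can be made concrete with a small counting sketch. It follows the stated definition (r + 1 consecutive events inside a window of length wL), but the left-to-right greedy rule used to keep clusters nonoverlapping is an assumption made here for illustration, as are the function name `count_rw_clusters` and the example positions. The approximations for the distribution of this count are not reproduced.

```python
def count_rw_clusters(positions, r, w, L):
    """Count nonoverlapping r:w clusters among event positions on [0, L].

    An r:w cluster is r + 1 consecutive events within a window of length w * L.
    Clusters are taken greedily from left to right and share no events.
    """
    pts = sorted(positions)
    window = w * L
    count = 0
    i = 0
    while i + r < len(pts):
        if pts[i + r] - pts[i] <= window:
            count += 1
            i += r + 1  # skip past every event used by this cluster
        else:
            i += 1
    return count


# Hypothetical binding-site positions on a 1000 bp interval.
sites = [10, 12, 15, 400, 405, 409, 900]
print(count_rw_clusters(sites, r=2, w=0.01, L=1000))
```

Comparing the observed count with its expected distribution under uniformly placed events is what turns this tally into a screening test.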
Affiliation(s)
- X Su
- Department of Biomathematical Sciences, Mount Sinai School of Medicine, New York, New York 10029-6574, USA
|