1
Kumari C, Siddharthan R. MMM and MMMSynth: Clustering of heterogeneous tabular data, and synthetic data generation. PLoS One 2024; 19:e0302271. [PMID: 38630664 PMCID: PMC11023594 DOI: 10.1371/journal.pone.0302271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2023] [Accepted: 03/29/2024] [Indexed: 04/19/2024] Open
Abstract
We provide new algorithms for two tasks relating to heterogeneous tabular datasets: clustering, and synthetic data generation. Tabular datasets typically consist of heterogeneous data types (numerical, ordinal, categorical) in columns, but may also have hidden cluster structure in their rows: for example, they may be drawn from heterogeneous (geographical, socioeconomic, methodological) sources, such that the outcome variable they describe (such as the presence of a disease) may depend not only on the other variables but on the cluster context. Moreover, sharing of biomedical data is often hindered by patient confidentiality laws, and there is current interest in algorithms to generate synthetic tabular data from real data, for example via deep learning. We demonstrate a novel EM-based clustering algorithm, MMM ("Madras Mixture Model"), that outperforms standard algorithms in determining clusters in synthetic heterogeneous data, and recovers structure in real data. Based on this, we demonstrate a synthetic tabular data generation algorithm, MMMSynth, that pre-clusters the input data, and generates cluster-wise synthetic data assuming cluster-specific data distributions for the input columns. We benchmark this algorithm by testing the performance of standard ML algorithms when they are trained on synthetic data and tested on real published datasets. Our synthetic data generation algorithm outperforms other literature tabular-data generators, and approaches the performance of training purely with real data.
Affiliation(s)
- Chandrani Kumari
  - The Institute of Mathematical Sciences, Chennai, India
  - Homi Bhabha National Institute, Mumbai, India
- Rahul Siddharthan
  - The Institute of Mathematical Sciences, Chennai, India
  - Homi Bhabha National Institute, Mumbai, India
2
Narayanan A, Selvakumar P, Siddharthan R, Sanyal K. Identification of C. auris clade 5 isolates using claID. Med Mycol 2024; 62:myae018. [PMID: 38414264 DOI: 10.1093/mmy/myae018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2023] [Revised: 02/05/2024] [Accepted: 02/26/2024] [Indexed: 02/29/2024] Open
Abstract
Candida auris poses threats to the global medical community due to its multidrug resistance, ability to cause nosocomial outbreaks and resistance to common sterilization agents. Different variants that emerged at different geographical zones were classified as clades. Clade-typing becomes necessary to track its spread, possible emergence of new clades, and to predict the properties that exhibit a clade bias. We previously reported a colony-Polymerase Chain Reaction-based, clade-identification method employing whole genome alignments and identification of clade-specific sequences of four major geographical clades. Here, we expand the panel by identifying clade 5 which was later isolated in Iran, using specific primers designed through in silico analyses.
Affiliation(s)
- Aswathy Narayanan
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore, 560064, India
- Pavitra Selvakumar
  - Computational Biology, The Institute of Mathematical Sciences, Chennai, 600113, India
  - Homi Bhabha National Institute, Mumbai, 400094, India
- Rahul Siddharthan
  - Computational Biology, The Institute of Mathematical Sciences, Chennai, 600113, India
  - Homi Bhabha National Institute, Mumbai, 400094, India
- Kaustuv Sanyal
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore, 560064, India
3
Selvakumar P, Siddharthan R. Position-specific evolution in transcription factor binding sites, and a fast likelihood calculation for the F81 model. R Soc Open Sci 2024; 11:231088. [PMID: 38269075 PMCID: PMC10805598 DOI: 10.1098/rsos.231088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 12/20/2023] [Indexed: 01/26/2024]
Abstract
Transcription factor binding sites (TFBS), like other DNA sequence, evolve via mutation and selection relating to their function. Models of nucleotide evolution describe DNA evolution via single-nucleotide mutation. A stationary vector of such a model is the long-term distribution of nucleotides, unchanging under the model. Neutrally evolving sites may have uniform stationary vectors, but one expects that sites within a TFBS instead have stationary vectors reflective of the fitness of various nucleotides at those positions. We introduce 'position-specific stationary vectors' (PSSVs), the collection of stationary vectors at each site in a TFBS locus, analogous to the position weight matrix (PWM) commonly used to describe TFBS. We infer PSSVs for human TFs using two evolutionary models (Felsenstein 1981 and Hasegawa-Kishino-Yano 1985). We find that PSSVs reflect the nucleotide distribution from PWMs, but with reduced specificity. We infer ancestral nucleotide distributions at individual positions and calculate 'conditional PSSVs' conditioned on specific choices of majority ancestral nucleotide. We find that certain ancestral nucleotides exert a strong evolutionary pressure on neighbouring sequence while others have a negligible effect. Finally, we present a fast likelihood calculation for the F81 model on moderate-sized trees that makes this approach feasible for large-scale studies along these lines.
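As a small illustration of what a stationary vector means in the F81 model, the sketch below builds the standard F81 transition matrix, P[i][j] = exp(-μt)·δ_ij + (1 − exp(-μt))·π_j, and checks that the vector π is unchanged by it. This is not the fast likelihood calculation from the paper; the function names and the example π are hypothetical and chosen only for illustration.

```python
import math

def f81_transition(pi, mu_t):
    """Standard F81 transition matrix for stationary vector pi.

    P[i][j] is the probability of nucleotide j after scaled time mu_t,
    given nucleotide i at the start.
    """
    decay = math.exp(-mu_t)
    n = len(pi)
    return [[decay * (1.0 if i == j else 0.0) + (1.0 - decay) * pi[j]
             for j in range(n)]
            for i in range(n)]

def evolve(dist, P):
    """Propagate a nucleotide distribution through one application of P."""
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical stationary vector for one site (order A, C, G, T),
# strongly favouring A, as one might expect inside a binding site.
pi = [0.7, 0.1, 0.1, 0.1]
P = f81_transition(pi, mu_t=0.5)

# pi is stationary: evolving it leaves it unchanged.
evolved = evolve(pi, P)

# Any starting distribution relaxes to pi at long times.
relaxed = evolve([0.0, 1.0, 0.0, 0.0], f81_transition(pi, mu_t=50.0))
```

In this picture a position-specific stationary vector simply assigns a different π to each position of the binding site.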
Affiliation(s)
- Pavitra Selvakumar
  - The Institute of Mathematical Sciences, Chennai, India
  - Homi Bhabha National Institute, Mumbai, India
- Rahul Siddharthan
  - The Institute of Mathematical Sciences, Chennai, India
  - Homi Bhabha National Institute, Mumbai, India
4
Parkhi D, Periyathambi N, Ghebremichael-Weldeselassie Y, Patel V, Sukumar N, Siddharthan R, Narlikar L, Saravanan P. Prediction of postpartum prediabetes by machine learning methods in women with gestational diabetes mellitus. iScience 2023; 26:107846. [PMID: 37767000 PMCID: PMC10520542 DOI: 10.1016/j.isci.2023.107846] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2023] [Revised: 05/27/2023] [Accepted: 09/05/2023] [Indexed: 09/29/2023] Open
Abstract
Early onset of type 2 diabetes and cardiovascular disease are common complications for women diagnosed with gestational diabetes. Prediabetes refers to a condition in which blood glucose levels are higher than normal, but not yet high enough to be diagnosed as type 2 diabetes. Currently, there is no accurate way of knowing which women with gestational diabetes are likely to develop postpartum prediabetes. This study aims to predict the risk of postpartum prediabetes in women diagnosed with gestational diabetes. Our sparse logistic regression approach selects only two variables - antenatal fasting glucose at OGTT and HbA1c soon after the diagnosis of GDM - as relevant, but gives an area under the receiver operating characteristic curve of 0.72, outperforming all other methods. We envision this to be a practical solution, which coupled with a targeted follow-up of high-risk women, could yield better cardiometabolic outcomes in women with a history of GDM.
Affiliation(s)
- Durga Parkhi
  - Populations, Evidence, and Technologies, Division of Health Sciences, University of Warwick, Coventry, UK
- Nishanthi Periyathambi
  - Populations, Evidence, and Technologies, Division of Health Sciences, University of Warwick, Coventry, UK
  - Department of Diabetes, Endocrinology, and Metabolism, George Eliot Hospital, Nuneaton, UK
- Yonas Ghebremichael-Weldeselassie
  - Populations, Evidence, and Technologies, Division of Health Sciences, University of Warwick, Coventry, UK
  - School of Mathematics and Statistics, The Open University, Milton Keynes, UK
- Vinod Patel
  - Department of Diabetes, Endocrinology, and Metabolism, George Eliot Hospital, Nuneaton, UK
- Nithya Sukumar
  - Populations, Evidence, and Technologies, Division of Health Sciences, University of Warwick, Coventry, UK
  - Department of Diabetes, Endocrinology, and Metabolism, George Eliot Hospital, Nuneaton, UK
- Rahul Siddharthan
  - Department of Computational Biology, The Institute of Mathematical Sciences, Chennai, India
- Leelavati Narlikar
  - Department of Data Science, Indian Institute of Science Education and Research, Pune, India
- Ponnusamy Saravanan
  - Populations, Evidence, and Technologies, Division of Health Sciences, University of Warwick, Coventry, UK
  - Department of Diabetes, Endocrinology, and Metabolism, George Eliot Hospital, Nuneaton, UK
5
Vadnala RN, Hannenhalli S, Narlikar L, Siddharthan R. Transcription factors organize into functional groups on the linear genome and in 3D chromatin. Heliyon 2023; 9:e18211. [PMID: 37520992 PMCID: PMC10382302 DOI: 10.1016/j.heliyon.2023.e18211] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/03/2022] [Revised: 07/11/2023] [Accepted: 07/11/2023] [Indexed: 08/01/2023] Open
Abstract
Transcription factors (TFs) and their binding sites have evolved to interact cooperatively or competitively with each other. Here we examine in detail, across multiple cell lines, such cooperation or competition among TFs both in sequential and spatial proximity (using chromatin conformation capture assays), considering in vivo binding data as well as TF binding motifs in DNA. We ascertain significantly co-occurring ("attractive") or avoiding ("repulsive") TF pairs using robust randomized models that retain the essential characteristics of the experimental data. Across human cell lines TFs organize into two groups, with intra-group attraction and inter-group repulsion. This is true for both sequential and spatial proximity, and for both in vivo binding and sequence motifs. Attractive TF pairs exhibit significantly more physical interactions, suggesting an underlying mechanism. The two TF groups differ significantly in their genomic and network properties, as well as in their function: while one group regulates housekeeping function, the other potentially regulates lineage-specific functions that are disrupted in cancer. Weaker binding sites tend to occur in spatially interacting regions of the genome. Our results suggest that a complex pattern of spatial cooperativity of TFs and chromatin has evolved with the genome to support housekeeping and lineage-specific functions.
Affiliation(s)
- Rakesh Netha Vadnala
  - The Institute of Mathematical Sciences, Chennai, India
  - Homi Bhabha National Institute, Mumbai, India
- Leelavati Narlikar
  - Department of Data Science, Indian Institute of Science Education and Research, Pune, India
- Rahul Siddharthan
  - The Institute of Mathematical Sciences, Chennai, India
  - Homi Bhabha National Institute, Mumbai, India
6
Periyathambi N, Parkhi D, Ghebremichael-Weldeselassie Y, Patel V, Sukumar N, Siddharthan R, Narlikar L, Saravanan P. Machine learning prediction of non-attendance to postpartum glucose screening and subsequent risk of type 2 diabetes following gestational diabetes. PLoS One 2022; 17:e0264648. [PMID: 35255105 PMCID: PMC8901061 DOI: 10.1371/journal.pone.0264648] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 11/04/2021] [Accepted: 02/14/2022] [Indexed: 11/19/2022] Open
Abstract
OBJECTIVE: The aim of the present study was to identify the factors associated with non-attendance of immediate postpartum glucose test using a machine learning algorithm following gestational diabetes mellitus (GDM) pregnancy. METHOD: A retrospective cohort study of all GDM women (n = 607) for postpartum glucose test due between January 2016 and December 2019 at the George Eliot Hospital NHS Trust, UK. RESULTS: Sixty-five percent of women attended postpartum glucose test. Type 2 diabetes was diagnosed in 2.8% and 21.6% had persistent dysglycaemia at 6-13 weeks post-delivery. Those who did not attend postpartum glucose test seem to be younger, multiparous, obese, and continued to smoke during pregnancy. They also had higher fasting glucose at antenatal oral glucose tolerance test. Our machine learning algorithm predicted postpartum glucose non-attendance with an area under the receiver operating characteristic curve of 0.72. The model could achieve a sensitivity of 70% with 66% specificity at a risk score threshold of 0.46. A total of 233 (38.4%) women attended subsequent glucose test at least once within the first two years of delivery and 24% had dysglycaemia. Compared to women who attended postpartum glucose test, those who did not attend had higher conversion rate to type 2 diabetes (2.5% vs 11.4%; p = 0.005). CONCLUSION: Postpartum screening following GDM is still poor. Women who did not attend postpartum screening appear to have higher metabolic risk and higher conversion to type 2 diabetes by two years post-delivery. Machine learning model can predict women who are unlikely to attend postpartum glucose test using simple antenatal factors. Enhanced, personalised education of these women may improve postpartum glucose screening.
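To make the reported operating point concrete, the sketch below shows how sensitivity and specificity are obtained from risk scores once a threshold is fixed. The scores and labels are invented toy values, not data from the study, and the function name is hypothetical; only the threshold of 0.46 is taken from the abstract.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are called positive.

    labels use 1 for the positive class (here: non-attendance) and 0 otherwise.
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Toy example only: four true non-attenders followed by four attenders.
scores = [0.9, 0.6, 0.5, 0.3, 0.7, 0.4, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

sens, spec = sensitivity_specificity(scores, labels, threshold=0.46)
```

Lowering the threshold would flag more women as likely non-attenders, raising sensitivity at the cost of specificity; the area under the ROC curve summarises this trade-off over all thresholds.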
Affiliation(s)
- Nishanthi Periyathambi
  - Division of Populations, Evidence, and Technologies of Health Sciences, Warwick Medical School, University of Warwick, Coventry, United Kingdom
  - Department of Diabetes, Endocrinology, and Metabolism, George Eliot Hospital, Nuneaton, United Kingdom
- Durga Parkhi
  - Division of Populations, Evidence, and Technologies of Health Sciences, Warwick Medical School, University of Warwick, Coventry, United Kingdom
- Yonas Ghebremichael-Weldeselassie
  - Division of Populations, Evidence, and Technologies of Health Sciences, Warwick Medical School, University of Warwick, Coventry, United Kingdom
  - School of Mathematics and Statistics, The Open University, Milton Keynes, United Kingdom
- Vinod Patel
  - Department of Diabetes, Endocrinology, and Metabolism, George Eliot Hospital, Nuneaton, United Kingdom
- Nithya Sukumar
  - Division of Populations, Evidence, and Technologies of Health Sciences, Warwick Medical School, University of Warwick, Coventry, United Kingdom
  - Department of Diabetes, Endocrinology, and Metabolism, George Eliot Hospital, Nuneaton, United Kingdom
- Rahul Siddharthan
  - Department of Computational Biology, The Institute of Mathematical Sciences, Chennai, India
  - Homi Bhabha National Institute, Mumbai, India
- Leelavati Narlikar
  - Department of Chemical Engineering, CSIR-National Chemical Laboratory, Pune, India
- Ponnusamy Saravanan
  - Division of Populations, Evidence, and Technologies of Health Sciences, Warwick Medical School, University of Warwick, Coventry, United Kingdom
  - Department of Diabetes, Endocrinology, and Metabolism, George Eliot Hospital, Nuneaton, United Kingdom
7
Narayanan A, Vadnala RN, Ganguly P, Selvakumar P, Rudramurthy SM, Prasad R, Chakrabarti A, Siddharthan R, Sanyal K. Functional and Comparative Analysis of Centromeres Reveals Clade-Specific Genome Rearrangements in Candida auris and a Chromosome Number Change in Related Species. mBio 2021; 12:e00905-21. [PMID: 33975937 PMCID: PMC8262905 DOI: 10.1128/mbio.00905-21] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/30/2021] [Accepted: 04/01/2021] [Indexed: 12/22/2022] Open
Abstract
The thermotolerant multidrug-resistant ascomycete Candida auris rapidly emerged since 2009 causing systemic infections worldwide and simultaneously evolved in different geographical zones. The molecular events that orchestrated this sudden emergence of the killer fungus remain mostly elusive. Here, we identify centromeres in C. auris and related species, using a combined approach of chromatin immunoprecipitation and comparative genomic analyses. We find that C. auris and multiple other species in the Clavispora/Candida clade shared a conserved small regional GC-poor centromere landscape lacking pericentromeres or repeats. Further, a centromere inactivation event led to karyotypic alterations in this species complex. Interspecies genome analysis identified several structural chromosomal changes around centromeres. In addition, centromeres are found to be rapidly evolving loci among the different geographical clades of the same species of C. auris. Finally, we reveal an evolutionary trajectory of the unique karyotype associated with clade 2 that consists of the drug-susceptible isolates of C. auris. IMPORTANCE: Candida auris, the killer fungus, emerged as different geographical clades, exhibiting multidrug resistance and high karyotype plasticity. Chromosomal rearrangements are known to play key roles in the emergence of new species, virulence, and drug resistance in pathogenic fungi. Centromeres, the genomic loci where microtubules attach to separate the sister chromatids during cell division, are known to be hot spots of breaks and downstream rearrangements. We identified the centromeres in C. auris and related species to study their involvement in the evolution and karyotype diversity reported in C. auris. We report conserved centromere features in 10 related species and trace the events that occurred at the centromeres during evolution. We reveal a centromere inactivation-mediated chromosome number change in these closely related species. We also observe that one of the geographical clades, the East Asian clade, evolved along a unique trajectory, compared to the other clades and related species.
Affiliation(s)
- Aswathy Narayanan
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore, India
- Rakesh Netha Vadnala
  - Computational Biology, The Institute of Mathematical Sciences/HBNI, Chennai, India
- Promit Ganguly
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore, India
- Pavitra Selvakumar
  - Computational Biology, The Institute of Mathematical Sciences/HBNI, Chennai, India
- Shivaprakash M Rudramurthy
  - Department of Medical Microbiology, Postgraduate Institute of Medical Education and Research, Chandigarh, India
- Rajendra Prasad
  - Amity Institute of Biotechnology, Amity University Haryana, Haryana, India
- Arunaloke Chakrabarti
  - Department of Medical Microbiology, Postgraduate Institute of Medical Education and Research, Chandigarh, India
- Rahul Siddharthan
  - Computational Biology, The Institute of Mathematical Sciences/HBNI, Chennai, India
- Kaustuv Sanyal
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore, India
  - Osaka University, Suita, Japan
8
Sreekumar L, Kumari K, Guin K, Bakshi A, Varshney N, Thimmappa BC, Narlikar L, Padinhateeri R, Siddharthan R, Sanyal K. Orc4 spatiotemporally stabilizes centromeric chromatin. Genome Res 2021; 31:607-621. [PMID: 33514624 PMCID: PMC8015856 DOI: 10.1101/gr.265900.120] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/12/2020] [Accepted: 01/27/2021] [Indexed: 11/24/2022]
Abstract
The establishment of centromeric chromatin and its propagation by the centromere-specific histone CENPA is mediated by epigenetic mechanisms in most eukaryotes. DNA replication origins, origin binding proteins, and replication timing of centromere DNA are important determinants of centromere function. The epigenetically regulated regional centromeres in the budding yeast Candida albicans have unique DNA sequences that replicate earliest in every chromosome and are clustered throughout the cell cycle. In this study, the genome-wide occupancy of the replication initiation protein Orc4 reveals its abundance at all centromeres in C. albicans. Orc4 is associated with four different DNA sequence motifs, one of which coincides with tRNA genes (tDNA) that replicate early and cluster together in space. Hi-C combined with genome-wide replication timing analyses identify that early replicating Orc4-bound regions interact with themselves stronger than with late replicating Orc4-bound regions. We simulate a polymer model of chromosomes of C. albicans and propose that the early replicating and highly enriched Orc4-bound sites preferentially localize around the clustered kinetochores. We also observe that Orc4 is constitutively localized to centromeres, and both Orc4 and the helicase Mcm2 are essential for cell viability and CENPA stability in C. albicans. Finally, we show that new molecules of CENPA are recruited to centromeres during late anaphase/telophase, which coincides with the stage at which the CENPA-specific chaperone Scm3 localizes to the kinetochore. We propose that the spatiotemporal localization of Orc4 within the nucleus, in collaboration with Mcm2 and Scm3, maintains centromeric chromatin stability and CENPA recruitment in C. albicans.
Affiliation(s)
- Lakshmi Sreekumar
  - Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore 560064, India
- Kiran Kumari
  - Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Mumbai 400076, India
  - IITB-Monash Research Academy, Mumbai 400076, India
  - Department of Chemical Engineering, Monash University, Melbourne 3800, Australia
- Krishnendu Guin
  - Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore 560064, India
- Asif Bakshi
  - Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore 560064, India
- Neha Varshney
  - Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore 560064, India
- Bhagya C Thimmappa
  - Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore 560064, India
- Leelavati Narlikar
  - Department of Chemical Engineering, CSIR-National Chemical Laboratory, Pune 411008, India
- Ranjith Padinhateeri
  - Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Mumbai 400076, India
- Rahul Siddharthan
  - The Institute of Mathematical Sciences/HBNI, Taramani, Chennai 600113, India
- Kaustuv Sanyal
  - Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore 560064, India
  - Graduate School of Frontier Biosciences, Osaka University, Suita, Osaka 565-0871, Japan
9
Nandi S, Potunuru UR, Kumari C, Nathan AA, Gopal J, Menon GI, Siddharthan R, Dixit M, Thangaraj PR. Altered kinetics of circulating progenitor cells in cardiopulmonary bypass (CPB) associated vasoplegic patients: A pilot study. PLoS One 2020; 15:e0242375. [PMID: 33211740 PMCID: PMC7676651 DOI: 10.1371/journal.pone.0242375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2020] [Accepted: 10/31/2020] [Indexed: 11/19/2022] Open
Abstract
Vasoplegia observed post cardiopulmonary bypass (CPB) is associated with substantial morbidity, multiple organ failure and mortality. Circulating counts of hematopoietic stem cells (HSCs) and endothelial progenitor cells (EPC) are potential markers of neo-vascularization and vascular repair. However, the significance of changes in the circulating levels of these progenitors in perioperative CPB, and their association with post-CPB vasoplegia, are currently unexplored. We enumerated HSC and EPC counts, via flow cytometry, at different time-points during CPB in 19 individuals who underwent elective cardiac surgery. These 19 individuals were categorized into two groups based on severity of post-operative vasoplegia, a clinically insignificant vasoplegic Group 1 (G1) and a clinically significant vasoplegic Group 2 (G2). Differential changes in progenitor cell counts during different stages of surgery were compared across these two groups. Machine-learning classifiers (logistic regression and gradient boosting) were employed to determine if differential changes in progenitor counts could aid the classification of individuals into these groups. Enumerating progenitor cells revealed an early and significant increase in the circulating counts of CD34+ and CD34+CD133+ hematopoietic stem cells (HSC) in G1 individuals, while these counts were attenuated in G2 individuals. Additionally, EPCs (CD34+VEGFR2+) were lower in G2 individuals compared to G1. Gradient boosting outperformed logistic regression in assessing the vasoplegia grouping based on the fold change in circulating CD34+ levels. Our findings indicate that a lack of early response of CD34+ cells and CD34+CD133+ HSCs might serve as an early marker for development of clinically significant vasoplegia after CPB.
Affiliation(s)
- Sanhita Nandi
  - Laboratory of Vascular Biology, Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, India
- Uma Rani Potunuru
  - Apollo Hospitals Educational and Research Foundation, Chennai, India
- Abel Arul Nathan
  - Laboratory of Vascular Biology, Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, India
- Jayashree Gopal
  - Department of Endocrinology and Diabetology, Apollo Hospitals, Chennai, India
  - * E-mail: (JG); (MD); (PRT)
- Gautam I. Menon
  - The Institute of Mathematical Sciences (HBNI), Chennai, India
  - Departments of Physics and Biology, Ashoka University, Sonepat, India
- Madhulika Dixit
  - Laboratory of Vascular Biology, Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, India
  - * E-mail: (JG); (MD); (PRT)
- Paul Ramesh Thangaraj
  - Department of Cardiothoracic Surgery, Apollo Hospitals, Chennai, India
  - Department of Mechanical Engineering, Indian Institute of Technology Madras, Chennai, India
  - * E-mail: (JG); (MD); (PRT)
10
Sankaranarayanan SR, Ianiri G, Coelho MA, Reza MH, Thimmappa BC, Ganguly P, Vadnala RN, Sun S, Siddharthan R, Tellgren-Roth C, Dawson TL, Heitman J, Sanyal K. Loss of centromere function drives karyotype evolution in closely related Malassezia species. eLife 2020; 9:e53944. [PMID: 31958060 PMCID: PMC7025860 DOI: 10.7554/elife.53944] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 11/26/2019] [Accepted: 01/20/2020] [Indexed: 12/14/2022] Open
Abstract
Genomic rearrangements associated with speciation often result in variation in chromosome number among closely related species. Malassezia species show variable karyotypes ranging between six and nine chromosomes. Here, we experimentally identified all eight centromeres in M. sympodialis as 3-5-kb long kinetochore-bound regions that span an AT-rich core and are depleted of the canonical histone H3. Centromeres of similar sequence features were identified as CENP-A-rich regions in Malassezia furfur, which has seven chromosomes, and histone H3 depleted regions in Malassezia slooffiae and Malassezia globosa with nine chromosomes each. Analysis of synteny conservation across centromeres with newly generated chromosome-level genome assemblies suggests two distinct mechanisms of chromosome number reduction from an inferred nine-chromosome ancestral state: (a) chromosome breakage followed by loss of centromere DNA and (b) centromere inactivation accompanied by changes in DNA sequence following chromosome-chromosome fusion. We propose that AT-rich centromeres drive karyotype diversity in the Malassezia species complex through breakage and inactivation.
Affiliation(s)
- Sundar Ram Sankaranarayanan
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bengaluru, India
- Giuseppe Ianiri
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, United States
- Marco A Coelho
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, United States
- Md Hashim Reza
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bengaluru, India
- Bhagya C Thimmappa
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bengaluru, India
- Promit Ganguly
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bengaluru, India
- Sheng Sun
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, United States
- Christian Tellgren-Roth
  - National Genomics Infrastructure, Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
- Thomas L Dawson
  - Skin Research Institute Singapore, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  - Department of Drug Discovery, Medical University of South Carolina, School of Pharmacy, Charleston, United States
- Joseph Heitman
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, United States
- Kaustuv Sanyal
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bengaluru, India
11
Agrawal A, Sambare SV, Narlikar L, Siddharthan R. THiCweed: fast, sensitive detection of sequence features by clustering big datasets. Nucleic Acids Res 2019; 46:e29. [PMID: 29267972 PMCID: PMC5861420 DOI: 10.1093/nar/gkx1251] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 06/09/2017] [Accepted: 12/01/2017] [Indexed: 11/19/2022] Open
Abstract
We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions using a divisive hierarchical clustering approach based on sequence similarity within sliding windows, while exploring both strands. THiCweed is specially geared toward data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs, able to process 30 000 peaks in 1–2 h, on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs it is as accurate as, or more accurate than, all other tested programs. THiCweed performs best with large ‘window’ sizes (≥50 bp), much longer than typical binding sites (7–15 bp). On real data it successfully recovers literature motifs, but also uncovers complex sequence characteristics in flanking DNA, variant motifs and secondary motifs even when they occur in <5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq datasets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif finding to give new insights into genomic transcription factor-binding complexity.
Affiliation(s)
- Ankit Agrawal
- Computational Biology Group, The Institute of Mathematical Sciences (HBNI), Chennai 600113, Tamil Nadu, India
| | - Snehal V Sambare
- Computational Biology Group, The Institute of Mathematical Sciences (HBNI), Chennai 600113, Tamil Nadu, India
| | - Leelavati Narlikar
- Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune 411008, Maharashtra, India
| | - Rahul Siddharthan
- Computational Biology Group, The Institute of Mathematical Sciences (HBNI), Chennai 600113, Tamil Nadu, India
|
12
|
Datta V, Hannenhalli S, Siddharthan R. ChIPulate: A comprehensive ChIP-seq simulation pipeline. PLoS Comput Biol 2019; 15:e1006921. [PMID: 30897079 PMCID: PMC6445533 DOI: 10.1371/journal.pcbi.1006921] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/12/2018] [Revised: 04/02/2019] [Accepted: 03/04/2019] [Indexed: 12/17/2022] Open
Abstract
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) is a high-throughput technique to identify genomic regions that are bound in vivo by a particular protein, e.g., a transcription factor (TF). Biological factors, such as chromatin state, indirect and cooperative binding, as well as experimental factors, such as antibody quality, cross-linking, and PCR biases, are known to affect the outcome of ChIP-seq experiments. However, the relative impact of these factors on inferences made from ChIP-seq data is not entirely clear. Here, via a detailed ChIP-seq simulation pipeline, ChIPulate, we assess the impact of various biological and experimental sources of variation on several outcomes of a ChIP-seq experiment, viz., the recoverability of the TF binding motif, accuracy of TF-DNA binding detection, the sensitivity of inferred TF-DNA binding strength, and number of replicates needed to confidently infer binding strength. We find that the TF motif can be recovered despite poor and non-uniform extraction and PCR amplification efficiencies. The recovery of the motif is, however, affected to a larger extent by the fraction of sites that are either cooperatively or indirectly bound. Importantly, our simulations reveal that the number of ChIP-seq replicates needed to accurately measure in vivo occupancy at high-affinity sites is larger than the recommended community standards. Our results establish statistical limits on the accuracy of inferences of protein-DNA binding from ChIP-seq and suggest that increasing the mean extraction efficiency, rather than amplification efficiency, would better improve sensitivity. The source code and instructions for running ChIPulate can be found at https://github.com/vishakad/chipulate.
DNA-binding proteins perform many key roles in biology, such as transcriptional regulation of gene expression and chromatin modification. ChIP-seq (Chromatin immunoprecipitation followed by high-throughput sequencing) is a widely used experimental technique to identify DNA-binding sites of specific proteins of interest, within cells, genome-wide. DNA fragments from genomic regions that are bound by a protein of interest, often a transcription factor (TF), are selectively extracted using specific antibodies, amplified using PCR, and sequenced. The sequences are mapped to the reference genome. Regions where many sequences map, called “peaks”, are used to infer the location of TF-bound loci (peaks), in vivo occupancy at those loci, and the sequence pattern (motif) to which the TF shows a binding affinity. But measurements of TF occupancy and motif inference are vulnerable to several biological and experimental sources of variation that are poorly understood and difficult to assess directly. Here, we simulate key steps of the ChIP-seq protocol with the aim of estimating the relative effects of various sources of variations on motif inference and binding affinity estimations. Besides providing specific insights and recommendations, we provide a general framework to simulate sequence reads in a ChIP-seq experiment, which should considerably aid in the development of software aimed at analyzing ChIP-seq data.
Affiliation(s)
- Vishaka Datta
- Simons Centre for the Study of Living Machines, National Centre for Biological Sciences, TIFR, Bengaluru, Karnataka, India
- * E-mail:
| | - Sridhar Hannenhalli
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
| | - Rahul Siddharthan
- The Institute of Mathematical Sciences/HBNI, Taramani, Chennai, India
|
13
|
Zhu Y, Engström PG, Tellgren-Roth C, Baudo CD, Kennell JC, Sun S, Billmyre RB, Schröder MS, Andersson A, Holm T, Sigurgeirsson B, Wu G, Sankaranarayanan SR, Siddharthan R, Sanyal K, Lundeberg J, Nystedt B, Boekhout T, Dawson TL, Heitman J, Scheynius A, Lehtiö J. Proteogenomics produces comprehensive and highly accurate protein-coding gene annotation in a complete genome assembly of Malassezia sympodialis. Nucleic Acids Res 2017; 45:2629-2643. [PMID: 28100699 PMCID: PMC5389616 DOI: 10.1093/nar/gkx006] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 08/22/2016] [Accepted: 01/16/2017] [Indexed: 11/23/2022] Open
Abstract
Complete and accurate genome assembly and annotation is a crucial foundation for comparative and functional genomics. Despite this, few complete eukaryotic genomes are available, and genome annotation remains a major challenge. Here, we present a complete genome assembly of the skin commensal yeast Malassezia sympodialis and demonstrate how proteogenomics can substantially improve gene annotation. Through long-read DNA sequencing, we obtained a gap-free genome assembly for M. sympodialis (ATCC 42132), comprising eight nuclear and one mitochondrial chromosome. We also sequenced and assembled four M. sympodialis clinical isolates, and showed their value for understanding Malassezia reproduction by confirming four alternative allele combinations at the two mating-type loci. Importantly, we demonstrated how proteomics data could be readily integrated with transcriptomics data in standard annotation tools. This increased the number of annotated protein-coding genes by 14% (from 3612 to 4113), compared to using transcriptomics evidence alone. Manual curation further increased the number of protein-coding genes by 9% (to 4493). All of these genes have RNA-seq evidence and 87% were confirmed by proteomics. The M. sympodialis genome assembly and annotation presented here is at a quality yet achieved only for a few eukaryotic organisms, and constitutes an important reference for future host-microbe interaction studies.
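The percentage increases quoted in this abstract can be checked directly from the gene counts it gives. The short calculation below is only an arithmetic check of those figures, with each increase taken relative to the preceding count (an assumption about how the percentages were computed):

```latex
\[
\frac{4113 - 3612}{3612} = \frac{501}{3612} \approx 0.139 \quad (\text{about } 14\%),
\qquad
\frac{4493 - 4113}{4113} = \frac{380}{4113} \approx 0.092 \quad (\text{about } 9\%).
\]
```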
Affiliation(s)
- Yafeng Zhu
- Science for Life Laboratory, Department of Oncology-Pathology, Karolinska Institutet, 17121 Solna, Sweden
| | - Pär G Engström
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, 17121 Solna, Sweden
| | - Christian Tellgren-Roth
- National Genomics Infrastructure, Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, 75108 Uppsala, Sweden
| | - Charles D Baudo
- Department of Biology, Saint Louis University, St. Louis, MO 63103, USA
| | - John C Kennell
- Department of Biology, Saint Louis University, St. Louis, MO 63103, USA
| | - Sheng Sun
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC 27710, USA
| | - R Blake Billmyre
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC 27710, USA
| | - Markus S Schröder
- School of Biomedical and Biomolecular Science, Conway Institute, University College Dublin, Belfield, Dublin 4, Ireland
| | - Anna Andersson
- Department of Medicine Solna, Translational Immunology Unit, Karolinska Institutet and University Hospital, 17177 Stockholm, Sweden
| | - Tina Holm
- Department of Medicine Solna, Translational Immunology Unit, Karolinska Institutet and University Hospital, 17177 Stockholm, Sweden
| | - Benjamin Sigurgeirsson
- Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology, 17121 Solna, Sweden
| | - Guangxi Wu
- Computational and Systems Biology, Genome Institute of Singapore, Agency for Science, Technology and Research (A*STAR), 138672, Singapore
| | - Sundar Ram Sankaranarayanan
- Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Jakkur, Bangalore 560 064, India
| | - Rahul Siddharthan
- The Institute of Mathematical Sciences/HBNI, Taramani, Chennai 600 113, India
| | - Kaustuv Sanyal
- Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Jakkur, Bangalore 560 064, India
| | - Joakim Lundeberg
- Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology, 17121 Solna, Sweden
| | - Björn Nystedt
- Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University, 75123 Uppsala, Sweden
| | - Teun Boekhout
- CBS-Fungal Biodiversity Centre, Utrecht, 3508, The Netherlands and Institute for Biodiversity and ecosystem Dynamics (IBED), University of Amsterdam, 1012 WX Amsterdam, The Netherlands
| | - Thomas L Dawson
- Institute of Medical Biology, Agency for Science, Technology and Research (A*STAR), 138648, Singapore
| | - Joseph Heitman
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC 27710, USA
| | - Annika Scheynius
- Science for Life Laboratory, Department of Clinical Science and Education, Karolinska Institutet, and Sachs' Children and Youth Hospital, Södersjukhuset, SE-118 83 Stockholm, Sweden
| | - Janne Lehtiö
- Science for Life Laboratory, Department of Oncology-Pathology, Karolinska Institutet, 17121 Solna, Sweden
|
14
|
Chatterjee G, Sankaranarayanan SR, Guin K, Thattikota Y, Padmanabhan S, Siddharthan R, Sanyal K. Repeat-Associated Fission Yeast-Like Regional Centromeres in the Ascomycetous Budding Yeast Candida tropicalis. PLoS Genet 2016; 12:e1005839. [PMID: 26845548 PMCID: PMC4741521 DOI: 10.1371/journal.pgen.1005839] [Citation(s) in RCA: 50] [Impact Index Per Article: 6.3] [Received: 08/04/2015] [Accepted: 01/11/2016] [Indexed: 11/19/2022] Open
Abstract
The centromere, on which kinetochore proteins assemble, ensures precise chromosome segregation. Centromeres are largely specified by the histone H3 variant CENP-A (also known as Cse4 in yeasts). Structurally, centromere DNA sequences are highly diverse in nature. However, the evolutionary consequence of these structural diversities on de novo CENP-A chromatin formation remains elusive. Here, we report the identification of centromeres, as the binding sites of four evolutionarily conserved kinetochore proteins, in the human pathogenic budding yeast Candida tropicalis. Each of the seven centromeres comprises a 2 to 5 kb non-repetitive mid core flanked by 2 to 5 kb inverted repeats. The repeat-associated centromeres of C. tropicalis all share a high degree of sequence conservation with each other and are strikingly diverged from the unique and mostly non-repetitive centromeres of related Candida species: Candida albicans, Candida dubliniensis, and Candida lusitaniae. Using a plasmid-based assay, we further demonstrate that pericentric inverted repeats and the underlying DNA sequence provide a structural determinant in CENP-A recruitment in C. tropicalis, as opposed to epigenetically regulated CENP-A loading at centromeres in C. albicans. Thus, the centromere structure and its influence on de novo CENP-A recruitment have been significantly rewired in closely related Candida species. Strikingly, the centromere structural properties, along with the role of pericentric repeats in de novo CENP-A loading in C. tropicalis, are more reminiscent of those of the distantly related fission yeast Schizosaccharomyces pombe. Taken together, we demonstrate, for the first time, fission yeast-like repeat-associated centromeres in an ascomycetous budding yeast.
Affiliation(s)
- Gautam Chatterjee
- Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Jakkur, Bangalore, India
| | - Sundar Ram Sankaranarayanan
- Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Jakkur, Bangalore, India
| | - Krishnendu Guin
- Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Jakkur, Bangalore, India
| | - Yogitha Thattikota
- Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Jakkur, Bangalore, India
| | - Sreedevi Padmanabhan
- Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Jakkur, Bangalore, India
| | - Rahul Siddharthan
- The Institute of Mathematical Sciences, C.I.T. Campus, Taramani, Chennai, India
| | - Kaustuv Sanyal
- Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Jakkur, Bangalore, India
|
15
|
Jayaraman G, Siddharthan R. Sigma-2: Multiple sequence alignment of non-coding DNA via an evolutionary model. BMC Bioinformatics 2010; 11:464. [PMID: 20846408 PMCID: PMC2949893 DOI: 10.1186/1471-2105-11-464] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/30/2010] [Accepted: 09/16/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: While most multiple sequence alignment programs expect that all or most of their input is known to be homologous, and penalise insertions and deletions, this is not a reasonable assumption for non-coding DNA, which is much less strongly conserved than protein-coding genes. Arguing that the goal of sequence alignment should be the detection of homology and not similarity, we incorporate an evolutionary model into a previously published multiple sequence alignment program for non-coding DNA, Sigma, as a sensitive likelihood-based way to assess the significance of alignments. Version 1 of Sigma was successful in eliminating spurious alignments but exhibited relatively poor sensitivity on synthetic data. Sigma 1 used a p-value (the probability under the "null hypothesis" of non-homology) to assess the significance of alignments, and, optionally, a background model that captured short-range genomic correlations. Sigma version 2, described here, retains these features, but calculates the p-value using a sophisticated evolutionary model that we describe here, and also allows for a transition matrix for different substitution rates from and to different nucleotides. Our evolutionary model takes separate account of mutation and fixation, and can be extended to allow for locally differing functional constraints on sequence.
RESULTS: We demonstrate that, on real and synthetic data, Sigma-2 significantly outperforms other programs in specificity to genuine homology (that is, it minimises alignment of spuriously similar regions that do not have a common ancestry) while it is now as sensitive as the best current programs.
CONCLUSIONS: Comparing these results with an extrapolation of the best results from other available programs, we suggest that conservation rates in intergenic DNA are often significantly over-estimated. It is increasingly important to align non-coding DNA correctly, in regulatory genomics and in the context of whole-genome alignment, and Sigma-2 is an important step in that direction.
Affiliation(s)
- Gayathri Jayaraman
- The Institute of Mathematical Sciences, Taramani, Chennai 600 113, India
| | - Rahul Siddharthan
- The Institute of Mathematical Sciences, Taramani, Chennai 600 113, India
|
16
|
Siddharthan R. Dinucleotide weight matrices for predicting transcription factor binding sites: generalizing the position weight matrix. PLoS One 2010; 5:e9722. [PMID: 20339533 PMCID: PMC2842295 DOI: 10.1371/journal.pone.0009722] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.3] [Received: 09/14/2009] [Accepted: 02/26/2010] [Indexed: 01/27/2023] Open
Abstract
Background: Identifying transcription factor binding sites (TFBS) in silico is key in understanding gene regulation. TFBS are string patterns that exhibit some variability, commonly modelled as “position weight matrices” (PWMs). Though convenient, the PWM has significant limitations, in particular the assumed independence of positions within the binding motif; and predictions based on PWMs are usually not very specific to known functional sites. Analysis here on binding sites in yeast suggests that correlation of dinucleotides is not limited to near-neighbours, but can extend over considerable gaps.
Methodology/Principal Findings: I describe a straightforward generalization of the PWM model, that considers frequencies of dinucleotides instead of individual nucleotides. Unlike previous efforts, this method considers all dinucleotides within an extended binding region, and does not make an attempt to determine a priori the significance of particular dinucleotide correlations. I describe how to use a “dinucleotide weight matrix” (DWM) to predict binding sites, dealing in particular with the complication that its entries are not independent probabilities. Benchmarks show, for many factors, a dramatic improvement over PWMs in precision of predicting known targets. In most cases, significant further improvement arises by extending the commonly defined “core motifs” by about 10bp on either side. Though this flanking sequence shows no strong motif at the nucleotide level, the predictive power of the dinucleotide model suggests that the “signature” in DNA sequence of protein-binding affinity extends beyond the core protein-DNA contact region.
Conclusion/Significance: While computationally more demanding and slower than PWM-based approaches, this dinucleotide method is straightforward, both conceptually and in implementation, and can serve as a basis for future improvements.
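For readers unfamiliar with the baseline being generalized, the following is a minimal sketch of a standard PWM, the model whose position-independence assumption the abstract criticizes. It is not the DWM method of the paper. The function names (`build_pwm`, `score_window`, `best_hit`), the pseudocount of 1, and the uniform background are all illustrative assumptions.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Build per-position log-odds scores from equal-length aligned sites.

    Assumption: a uniform background unless one is supplied.
    """
    background = background or {b: 0.25 for b in BASES}
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        # Each column is scored independently of every other column;
        # this is exactly the independence assumption a PWM makes.
        pwm.append({
            b: math.log2((counts[b] + pseudocount) / total / background[b])
            for b in BASES
        })
    return pwm

def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the best-scoring window in sequence."""
    width = len(pwm)
    return max(
        (score_window(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

Because the score is a plain sum over columns, no correlation between positions can influence it. A dinucleotide model, as described in the abstract, replaces single-base frequencies with frequencies of base pairs.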
|
17
|
Guruharsha KG, Ruiz-Gomez M, Ranganath HA, Siddharthan R, VijayRaghavan K. The complex spatio-temporal regulation of the Drosophila myoblast attractant gene duf/kirre. PLoS One 2009; 4:e6960. [PMID: 19742310 PMCID: PMC2734059 DOI: 10.1371/journal.pone.0006960] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 03/27/2009] [Accepted: 06/09/2009] [Indexed: 12/18/2022] Open
Abstract
A key early player in the regulation of myoblast fusion is the gene dumbfounded (duf, also known as kirre). Duf must be expressed, and function, in founder cells (FCs). A fixed number of FCs are chosen from a pool of equivalent myoblasts and serve to attract fusion-competent myoblasts (FCMs) to fuse with them to form a multinucleate muscle-fibre. The spatial and temporal regulation of duf expression and function are important and play a deciding role in choice of fibre number, location and perhaps size. We have used a combination of bioinformatics and functional enhancer deletion approaches to understand the regulation of duf. By transgenic enhancer-reporter deletion analysis of the duf regulatory region, we found that several distinct enhancer modules regulate duf expression in specific muscle founders of the embryo and the adult. In addition to existing bioinformatics tools, we used a new program for analysis of regulatory sequence, PhyloGibbs-MP, whose development was largely motivated by the requirements of this work. The results complement our deletion analysis by identifying transcription factors whose predicted binding regions match with our deletion constructs. Experimental evidence for the relevance of some of these TF binding sites comes from available ChIP-on-chip from the literature, and from our analysis of localization of myogenic transcription factors with duf enhancer reporter gene expression. Our results demonstrate the complex regulation in each founder cell of a gene that is expressed in all founder cells. They provide evidence for transcriptional control—both activation and repression—as an important player in the regulation of myoblast fusion. The set of enhancer constructs generated will be valuable in identifying novel trans-acting factor-binding sites and chromatin regulation during myoblast fusion in Drosophila. 
Our results and the bioinformatics tools developed provide a basis for the study of the transcriptional regulation of other complex genes.
Affiliation(s)
- K. G. Guruharsha
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, India
- Department of Studies in Zoology, University of Mysore, Manasagangothri, Mysore, India
| | - Mar Ruiz-Gomez
- Centro de Biologia Molecular Severo Ochoa, CSIC and UAM, Cantoblanco, Madrid, Spain
| | - H. A. Ranganath
- Department of Studies in Zoology, University of Mysore, Manasagangothri, Mysore, India
| | - Rahul Siddharthan
- Institute of Mathematical Sciences, CIT Campus, Taramani, Chennai, India
| | - K. VijayRaghavan
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, India
- * E-mail:
|
18
|
Abstract
In this review, we discuss the general problem of understanding transcriptional regulation from DNA sequence and prior information. The main tasks we discuss are predicting local regions of DNA, cis-regulatory modules (CRMs) that contain binding sites for transcription factors (TFs), and predicting individual binding sites. We review various existing methods, and then describe the approach taken by PhyloGibbs, a recent motif-finding algorithm that we developed to predict TF binding sites, and PhyloGibbs-MP, an extension to PhyloGibbs that tackles other tasks in regulatory genomics, particularly prediction of CRMs.
|
19
|
Abstract
PhyloGibbs is a program that uses Gibbs sampling to predict putative binding sites for transcription factors in DNA. It has two notable advances over previous algorithms for this task: it handles phylogenetically related sequence systematically, and it evaluates the significance of each predicted site via statistical sampling. In this article, we explain how to use PhyloGibbs effectively. We describe the essential command-line options in detail, and discuss other considerations that arise in practical situations.
|
20
|
Abstract
Background: Existing tools for multiple-sequence alignment focus on aligning protein sequence or protein-coding DNA sequence, and are often based on extensions to Needleman-Wunsch-like pairwise alignment methods. We introduce a new tool, Sigma, with a new algorithm and scoring scheme designed specifically for non-coding DNA sequence. This problem acquires importance with the increasing number of published sequences of closely-related species. In particular, studies of gene regulation seek to take advantage of comparative genomics, and recent algorithms for finding regulatory sites in phylogenetically-related intergenic sequence require alignment as a preprocessing step. Much can also be learned about evolution from intergenic DNA, which tends to evolve faster than coding DNA. Sigma uses a strategy of seeking the best possible gapless local alignments (a strategy earlier used by DiAlign), at each step making the best possible alignment consistent with existing alignments, and scores the significance of the alignment based on the lengths of the aligned fragments and a background model which may be supplied or estimated from an auxiliary file of intergenic DNA.
Results: Comparative tests of Sigma with five earlier algorithms on synthetic data generated to mimic real data show excellent performance, with Sigma balancing high "sensitivity" (more bases aligned) with effective filtering of "incorrect" alignments. With real data, while "correctness" can't be directly quantified for the alignment, running the PhyloGibbs motif finder on pre-aligned sequence suggests that Sigma's alignments are superior.
Conclusion: By taking into account the peculiarities of non-coding DNA, Sigma fills a gap in the toolbox of bioinformatics.
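To make the idea of a gapless local alignment concrete, here is a minimal sketch that finds the longest exactly matching ungapped segment shared by two sequences. It is a deliberate simplification, not Sigma's algorithm: Sigma scores significance from fragment lengths and a background model, whereas this sketch (with the hypothetical function name `best_gapless_match`) only counts exact matches along each diagonal.

```python
def best_gapless_match(a, b):
    """Return (length, start_a, start_b) of the longest exactly matching
    gapless segment shared by sequences a and b.

    Simplification: exact matches only, no significance scoring.
    """
    best = (0, 0, 0)
    # Each diagonal corresponds to one fixed offset between a and b,
    # so walking a diagonal never introduces a gap.
    for offset in range(-(len(b) - 1), len(a)):
        i = max(offset, 0)   # position in a
        j = i - offset       # position in b
        run = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
                if run > best[0]:
                    best = (run, i - run + 1, j - run + 1)
            else:
                run = 0
            i += 1
            j += 1
    return best
```

The scan costs time proportional to the product of the two lengths. A practical aligner would also need a rule for deciding which matches are long enough to be meaningful, which is the role the abstract assigns to the background model.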
Affiliation(s)
- Rahul Siddharthan
- Institute of Mathematical Sciences, CIT Campus, Taramani, Chennai 600113, India.
|
21
|
Abstract
Genomewide techniques to assay gene expression and transcription factor binding are in widespread use, but are far from providing predictive rules for the function of regulatory DNA. To investigate more intensively the grammar rules for active regulatory sequence, we made libraries from random ligations of a very restricted set of sequences. Working with the yeast Saccharomyces cerevisiae, we developed a novel screen based on the sensitivity of ascospores lacking dityrosine to treatment with lytic enzymes. We tested two separate libraries built by random ligation of a single type of activator site either for a well-characterized sporulation factor, Ndt80, or for a new sporulation-specific regulatory site that we identified and several neutral spacer elements. This selective system achieved up to 1:10^4 enrichment of the artificial sequences that were active during sporulation, allowing a high-throughput analysis of large libraries of synthetic promoters. This is not practical with methods involving direct screening for expression, such as those based on fluorescent reporters. There were very few false positives, since active promoters always passed the screen when retested. The survival rate of our libraries containing roughly equal numbers of spacers and activators was a few percent that of libraries made from activators alone. The sequences of approximately 100 examples of active and inactive promoters could not be distinguished by simple binary rules; instead, the best model for the data was a linear regression fit of a quantitative measure of gene activity to multiple features of the regulatory sequence.
Affiliation(s)
- Martin Ligr
- The Rockefeller University, New York, New York 10021, USA
|
22
|
Abstract
A central problem in the bioinformatics of gene regulation is to find the binding sites for regulatory proteins. One of the most promising approaches toward identifying these short and fuzzy sequence patterns is the comparative analysis of orthologous intergenic regions of related species. This analysis is complicated by various factors. First, one needs to take the phylogenetic relationship between the species into account in order to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to evolutionary proximity. Second, one has to deal with the complexities of multiple alignments of orthologous intergenic regions, and one has to consider the possibility that functional sites may occur outside of conserved segments. Here we present a new motif sampling algorithm, PhyloGibbs, that runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors (TFs) can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model that treats aligned sequences by a model for the evolution of binding sites and "background" intergenic DNA. This model takes the phylogenetic relationship between the species in the alignment explicitly into account. The algorithm uses simulated annealing and Monte Carlo Markov-chain sampling to rigorously assign posterior probabilities to all the binding sites that it reports. In tests on synthetic data and real data from five Saccharomyces species our algorithm performs significantly better than four other motif-finding algorithms, including algorithms that also take phylogeny into account. Our results also show that, in contrast to the other algorithms, PhyloGibbs can make realistic estimates of the reliability of its predictions. 
Our tests suggest that, running on the five-species multiple alignment of a single gene's upstream region, PhyloGibbs on average recovers over 50% of all binding sites in S. cerevisiae at a specificity of about 50%, and 33% of all binding sites at a specificity of about 85%. We also tested PhyloGibbs on collections of multiple alignments of intergenic regions that were recently annotated, based on ChIP-on-chip data, to contain binding sites for the same TF. We compared PhyloGibbs's results with the previous analysis of these data using six other motif-finding algorithms. For 16 of 21 TFs for which all other motif-finding methods failed to find a significant motif, PhyloGibbs did recover a motif that matches the literature consensus. In 11 cases where there was disagreement in the results we compiled lists of known target genes from the literature, and found that running PhyloGibbs on their regulatory regions yielded a binding motif matching the literature consensus in all but one of the cases. Interestingly, these literature gene lists had little overlap with the targets annotated based on the ChIP-on-chip data. The PhyloGibbs code can be downloaded from http://www.biozentrum.unibas.ch/~nimwegen/cgi-bin/phylogibbs.cgi or http://www.imsc.res.in/~rsidd/phylogibbs. The full set of predicted sites from our tests on yeast are available at http://www.swissregulon.unibas.ch.
Affiliation(s)
- Rahul Siddharthan
- Center for Studies in Physics and Biology, The Rockefeller University, New York, New York, United States of America
- Institute of Mathematical Sciences, Taramani, Chennai, India
- Eric D Siggia
- Center for Studies in Physics and Biology, The Rockefeller University, New York, New York, United States of America
- Erik van Nimwegen
- Center for Studies in Physics and Biology, The Rockefeller University, New York, New York, United States of America
- Division of Bioinformatics, Biozentrum, University of Basel, Basel, Switzerland
- * To whom correspondence should be addressed. E-mail:
|
23
|
Siddharthan R. Eastern creeds are less dogmatic about scripture. Nature 2005; 433:355. [PMID: 15674262 DOI: 10.1038/433355d] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
|
24
|
Georges A, Siddharthan R, Florens S. Dynamical mean-field theory of resonating-valence-bond antiferromagnets. Phys Rev Lett 2001; 87:277203. [PMID: 11800912 DOI: 10.1103/physrevlett.87.277203] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Received: 06/25/2001] [Indexed: 05/23/2023]
Abstract
We propose a theory of the spin dynamics of frustrated quantum antiferromagnets, which is based on an effective action for a plaquette embedded in a self-consistent bath. This approach, supplemented by a low-energy projection, is applied to the Kagomé antiferromagnet. We find that a spin-liquid regime extends to very low energy, in which local correlation functions have a slow decay in time, well described by power-law behavior and $\omega/T$ scaling of the response function: $\chi''(\omega) \propto \omega^{-\alpha} F(\omega/T)$.
Affiliation(s)
- A Georges
- Laboratoire de Physique Théorique, Ecole Normale Supérieure, 24 rue Lhomond, 75231 Paris Cedex 05, France
|
25
|
Honig LS, Siddharthan R, Sheremata WA, Sheldon JJ, Sazant A. Multiple sclerosis: correlation of magnetic resonance imaging with cerebrospinal fluid findings. J Neurol Neurosurg Psychiatry 1988; 51:277-80. [PMID: 2450176 PMCID: PMC1031544 DOI: 10.1136/jnnp.51.2.277] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.3] [Indexed: 01/01/2023]
Abstract
MRI examination of 41 patients with clinically definite multiple sclerosis showed white matter lesions of high proton T2 signal consistent with demyelination in 76%, and CSF abnormalities in 76%. Of patients with CSF abnormalities, 26% had normal MRI scans; conversely, 26% of patients with MRI abnormalities had negative CSF studies. Thus a significant number of multiple sclerosis patients had negative results on either MRI or CSF examination, while only 5% had normal results on both tests.
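The percentages in this abstract can be cross-checked with a short calculation. The sketch below is only a consistency check: the patient counts are reconstructed by rounding the reported percentages and are not stated in the abstract, so they should be treated as assumptions.

```python
n_patients = 41

# Counts reconstructed by rounding the reported percentages (assumption:
# the abstract reports percentages only, not these counts).
mri_pos = round(0.76 * n_patients)        # patients with MRI lesions
csf_pos = round(0.76 * n_patients)        # patients with CSF abnormalities
csf_pos_mri_neg = round(0.26 * csf_pos)   # CSF abnormal, MRI normal
mri_pos_csf_neg = round(0.26 * mri_pos)   # MRI abnormal, CSF negative

both_pos = csf_pos - csf_pos_mri_neg      # abnormal on both tests
either_pos = both_pos + csf_pos_mri_neg + mri_pos_csf_neg
both_neg = n_patients - either_pos        # normal on both tests

pct_both_neg = 100 * both_neg / n_patients
print(f"MRI+ {mri_pos}, CSF+ {csf_pos}, both+ {both_pos}, "
      f"neither {both_neg} ({pct_both_neg:.1f}%)")
```

Under these rounded counts, the number of patients normal on both tests comes to about 5% of the cohort, in line with the figure reported in the abstract.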
Affiliation(s)
- L S Honig
- Department of Neurology, University of Miami School of Medicine, FL
|
26
|
Sheldon JJ, Siddharthan R, Tobias J, Sheremata WA, Soila K, Viamonte M. MR imaging of multiple sclerosis: comparison with clinical and CT examinations in 74 patients. AJR Am J Roentgenol 1985; 145:957-64. [PMID: 3876753 DOI: 10.2214/ajr.145.5.957] [Citation(s) in RCA: 34] [Impact Index Per Article: 0.9] [Indexed: 01/07/2023]
Abstract
Magnetic resonance (MR) imaging, the latest test for evaluation of patients with multiple sclerosis (MS), was assessed against clinical evidence in 74 patients with definite or probable MS. MR imaging was positive in 55 (85%) of 65 patients with definite MS but in only one (11%) of nine patients with probable MS. The examination is most likely to be positive when the patient is classified clinically as having definite MS; when the disease is active and not in remission; and if the constellation of symptoms indicates a multiplicity of regions with neurologic dysfunction. The examination was most sensitive for detecting lesions in the cerebral hemispheres, the posterior fossa, and the cervical spinal cord, in that order; it did not detect any lesions in the optic nerves. The paraclinical tests and MR imaging were of equal sensitivity in detecting MS lesions, but the latter method was more specific in localization. Cerebrospinal fluid evaluation was slightly less sensitive than the other two tests. There was no correlation between MR imaging and these examinations. The authors conclude that MR imaging is more sensitive than computed tomography (CT), which was positive in 25% of 59 patients with definite MS; it is always positive when CT is positive; and it probably can replace CT in the diagnosis and follow-up of patients with MS.
|
27
|
|