1
Bello L, Wiedenhöft J, Schliep A. Compressed computations using wavelets for hidden Markov models with continuous observations. PLoS One 2023; 18:e0286074. [PMID: 37279196 DOI: 10.1371/journal.pone.0286074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2022] [Accepted: 05/09/2023] [Indexed: 06/08/2023] Open
Abstract
Compression as an accelerant of computation is increasingly recognized as an important component in engineering fast real-world machine learning methods for big data; cf. its impact on genome-scale approximate string matching. Previous work showed that compression can accelerate algorithms for Hidden Markov Models (HMM) with discrete observations, both for the classical frequentist HMM algorithms (Forward Filtering, Backward Smoothing and Viterbi) and for Gibbs sampling for Bayesian HMM. For Bayesian HMM with continuous-valued observations, compression was shown to greatly accelerate computations for specific types of data. For instance, data from large-scale experiments interrogating structural genetic variation can be assumed to be piece-wise constant with noise, or, equivalently, data generated by HMM with dominant self-transition probabilities. Here we extend the compressive computation approach to the classical frequentist HMM algorithms on continuous-valued observations, providing the first compressive approach for this problem. In a large-scale simulation study, we demonstrate empirically that in many settings compressed HMM algorithms very clearly outperform the classical algorithms with no effect, or only an insignificant one, on the computed probabilities and inferred state paths of maximal likelihood. This provides an efficient approach to big data computations with HMM. An open-source implementation of the method is available from https://github.com/lucabello/wavelet-hmms.
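To make the idea of computing on compressed data concrete, here is a minimal sketch; it is not the authors' implementation. It assumes Gaussian emissions and a block of consecutive observations that all belong to one state. Under that assumption the block's emission log-likelihood depends only on three sufficient statistics (count, sum, sum of squares), so it can be evaluated once per block rather than once per observation. All function names are hypothetical.

```python
import math

def gaussian_loglik_pointwise(xs, mu, sigma):
    """Sum of Gaussian log-densities, visiting every observation."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )

def block_stats(xs):
    """Sufficient statistics of a block: (count, sum, sum of squares)."""
    return len(xs), sum(xs), sum(x * x for x in xs)

def gaussian_loglik_block(stats, mu, sigma):
    """Same log-likelihood, computed from the block statistics only.

    Uses sum((x - mu)^2) = s2 - 2*mu*s1 + n*mu^2.
    """
    n, s1, s2 = stats
    squared_dev = s2 - 2 * mu * s1 + n * mu * mu
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - squared_dev / (2 * sigma ** 2)

# Illustrative block of noisy, roughly constant data (made-up values).
block = [1.9, 2.1, 2.0, 2.2, 1.8]
stats = block_stats(block)  # computed once, reusable for every state
full = gaussian_loglik_pointwise(block, mu=2.0, sigma=0.5)
compressed = gaussian_loglik_block(stats, mu=2.0, sigma=0.5)
```

The statistics are computed once per block and can then be reused for each state's mean and standard deviation, which is where the saving would come from when the data are close to piece-wise constant.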
Affiliation(s)
- Luca Bello
  - Computer Science and Engineering, University of Gothenburg, Chalmers, Gothenburg, Sweden
- John Wiedenhöft
  - Scientific Core Facility Medical Biometry and Statistical Bioinformatics, University Medical Center Göttingen, Göttingen, Germany
- Alexander Schliep
  - Computer Science and Engineering, University of Gothenburg, Chalmers, Gothenburg, Sweden
  - Faculty of Health Sciences, B-TU Cottbus-Senftenberg, Cottbus, Germany
2
Viet Johansson S, Gummesson Svensson H, Bjerrum E, Schliep A, Haghir Chehreghani M, Tyrchan C, Engkvist O. Using Active Learning to Develop Machine Learning Models for Reaction Yield Prediction. Mol Inform 2022; 41:e2200043. [PMID: 35732584 DOI: 10.1002/minf.202200043] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/21/2022] [Accepted: 06/22/2022] [Indexed: 01/05/2023]
Abstract
Computer aided synthesis planning, suggesting synthetic routes for molecules of interest, is a rapidly growing field. The machine learning methods used are often dependent on access to large datasets for training, but finite experimental budgets limit how much data can be obtained from experiments. This suggests the use of schemes for data collection such as active learning, which identifies the data points of highest impact for model accuracy, and which has been used in recent studies with success. However, little has been done to explore the robustness of the methods predicting reaction yield when used together with active learning to reduce the amount of experimental data needed for training. This study aims to investigate the influence of machine learning algorithms and the number of initial data points on reaction yield prediction for two public high-throughput experimentation datasets. Our results show that active learning based on output margin reached a pre-defined AUROC faster than random sampling on both datasets. Analysis of feature importance of the trained machine learning models suggests active learning had a larger influence on the model accuracy when only a few features were important for the model prediction.
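The abstract does not define "output margin" precisely, so the following is only a sketch of one common margin-based selection rule, not the study's procedure. It assumes a classifier that returns class probabilities for each unlabeled candidate and picks the candidates whose two most probable classes are closest, i.e. those the model is least sure about. The function names and the example probabilities are hypothetical.

```python
def margin(probs):
    """Gap between the two largest class probabilities (small = uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def select_by_margin(pool_probs, batch_size):
    """Indices of the `batch_size` pool items with the smallest margin."""
    ranked = sorted(range(len(pool_probs)), key=lambda i: margin(pool_probs[i]))
    return ranked[:batch_size]

# Made-up predicted probabilities for three unlabeled reactions
# (e.g. low yield vs. high yield).
pool = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4]]
chosen = select_by_margin(pool, batch_size=2)
```

Random sampling, the baseline mentioned in the abstract, would instead draw the batch uniformly from the pool without looking at the model's predictions.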
Affiliation(s)
- Simon Viet Johansson
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, SE-431 83, Mölndal, Sweden
  - Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, SE-412 96, Göteborg, Sweden
- Hampus Gummesson Svensson
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, SE-431 83, Mölndal, Sweden
  - Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, SE-412 96, Göteborg, Sweden
- Esben Bjerrum
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, SE-431 83, Mölndal, Sweden
- Alexander Schliep
  - Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, SE-412 96, Göteborg, Sweden
- Morteza Haghir Chehreghani
  - Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, SE-412 96, Göteborg, Sweden
- Christian Tyrchan
  - Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, SE-431 83, Mölndal, Sweden
- Ola Engkvist
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, SE-431 83, Mölndal, Sweden
  - Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, SE-412 96, Göteborg, Sweden
3
Gustafsson J, Norberg P, Qvick-Wester JR, Schliep A. Fast parallel construction of variable-length Markov chains. BMC Bioinformatics 2021; 22:487. [PMID: 34627154 PMCID: PMC8501649 DOI: 10.1186/s12859-021-04387-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2021] [Accepted: 09/20/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Alignment-free methods are a popular approach for comparing biological sequences, including complete genomes. The methods range from probability distributions of sequence composition to first and higher-order Markov chains, where a k-th order Markov chain over DNA has 3 · 4^k formal parameters. To circumvent this exponential growth in parameters, variable-length Markov chains (VLMCs) have gained popularity for applications in molecular biology and other areas. VLMCs adapt the depth depending on sequence context and thus curtail excesses in the number of parameters. The scarcity of fast, let alone parallel, software tools prompted the development of a parallel implementation using lazy suffix trees and a hash-based alternative.
RESULTS: An extensive evaluation was performed on genomes ranging from 12Mbp to 22Gbp. Relevant learning parameters were chosen guided by the Bayesian Information Criterion (BIC) to avoid over-fitting. Our implementation greatly improves upon the state-of-the-art even in serial execution. It exhibits very good parallel scaling, with speed-ups for long sequences close to the optimum indicated by Amdahl's law: 3 for 4 threads and about 6 for 16 threads.
CONCLUSIONS: Our parallel implementation, released as open-source under the GPLv3 license, provides a practically useful alternative to the state-of-the-art which allows the construction of VLMCs even for very large genomes significantly faster than previously possible. Additionally, our parameter selection based on BIC gives guidance to end-users comparing genomes.
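As a worked check of the speed-up figures quoted in the abstract, the sketch below applies Amdahl's law, S(n) = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the work and n the number of threads. The value of p is not given in the abstract; here it is inferred, as an assumption, from the quoted optimum of 3 for 4 threads, and then used to predict the optimum for 16 threads. Function names are hypothetical.

```python
def amdahl_speedup(p, n):
    """Upper bound on speed-up with parallel fraction p on n threads."""
    return 1.0 / ((1.0 - p) + p / n)

def parallel_fraction(speedup, n):
    """Invert Amdahl's law: the p that yields `speedup` on n threads."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)

# Parallel fraction implied by a speed-up of 3 on 4 threads (p = 8/9).
p = parallel_fraction(3.0, 4)

# Predicted bound for 16 threads under the same p.
bound_16 = amdahl_speedup(p, 16)
```

Under this inferred p, the bound for 16 threads works out to 6, which matches the "about 6 for 16 threads" stated in the abstract.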
Affiliation(s)
- Joel Gustafsson
  - Institute of Biomedicine, Department of Infectious Diseases, University of Gothenburg, Gothenburg, Sweden
- Peter Norberg
  - Institute of Biomedicine, Department of Infectious Diseases, University of Gothenburg, Gothenburg, Sweden
- Jan R Qvick-Wester
  - Department of Computer Science and Engineering, University of Gothenburg - Chalmers University of Technology, Gothenburg, Sweden
- Alexander Schliep
  - Department of Computer Science and Engineering, University of Gothenburg - Chalmers University of Technology, Gothenburg, Sweden
4
Dansson HV, Stempfle L, Egilsdóttir H, Schliep A, Portelius E, Blennow K, Zetterberg H, Johansson FD. Predicting progression and cognitive decline in amyloid-positive patients with Alzheimer's disease. Alzheimers Res Ther 2021; 13:151. [PMID: 34488882 PMCID: PMC8422748 DOI: 10.1186/s13195-021-00886-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 03/01/2021] [Accepted: 08/08/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND: In Alzheimer's disease, amyloid-β (Aβ) peptides aggregate in the brain, lowering CSF amyloid levels, a key pathological hallmark of the disease. However, lowered CSF amyloid levels may also be present in cognitively unimpaired elderly individuals. Therefore, it is of great value to explain the variance in disease progression among patients with Aβ pathology.
METHODS: A cohort of n=2293 participants, of whom n=749 were Aβ positive, was selected from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database to study heterogeneity in disease progression for individuals with Aβ pathology. The analysis used baseline clinical variables including demographics, genetic markers, and neuropsychological data to predict how the cognitive ability and AD diagnosis of subjects progressed using statistical models and machine learning. Due to the relatively low prevalence of Aβ pathology, models fit only to Aβ-positive subjects were compared to models fit to an extended cohort including subjects without established Aβ pathology, adjusting for covariate differences between the cohorts.
RESULTS: Aβ pathology status was determined based on the Aβ42/Aβ40 ratio. The best predictive model of change in cognitive test scores for Aβ-positive subjects at the 2-year follow-up achieved an R² score of 0.388, while the best model predicting adverse changes in diagnosis achieved a weighted F1 score of 0.791. Aβ-positive subjects declined faster on average than those without Aβ pathology, but the specific level of CSF Aβ was not predictive of progression rate. When predicting cognitive score change 4 years after baseline, the best model achieved an R² score of 0.325, and it was found that fitting models to the extended cohort improved performance. Moreover, using all clinical variables outperformed the best model based only on a suite of cognitive test scores, which achieved an R² score of 0.228.
CONCLUSION: Our analysis shows that CSF levels of Aβ are not strong predictors of the rate of cognitive decline in Aβ-positive subjects when adjusting for other variables. Baseline assessments of cognitive function account for the majority of variance explained in the prediction of 2-year decline but are insufficient for achieving optimal results in longer-term predictions. Predicting changes both in cognitive test scores and in diagnosis provides multiple perspectives of the progression of potential AD subjects.
Affiliation(s)
- Hákon Valur Dansson
  - Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
- Lena Stempfle
  - Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
- Hildur Egilsdóttir
  - Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
- Alexander Schliep
  - Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
- Erik Portelius
  - Department of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, The Sahlgrenska Academy at the University of Gothenburg, Mölndal, Sweden
  - Clinical Neurochemistry Laboratory, Sahlgrenska University Hospital, Mölndal, Sweden
- Kaj Blennow
  - Department of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, The Sahlgrenska Academy at the University of Gothenburg, Mölndal, Sweden
  - Clinical Neurochemistry Laboratory, Sahlgrenska University Hospital, Mölndal, Sweden
- Henrik Zetterberg
  - Department of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, The Sahlgrenska Academy at the University of Gothenburg, Mölndal, Sweden
  - Clinical Neurochemistry Laboratory, Sahlgrenska University Hospital, Mölndal, Sweden
  - Department of Neurodegenerative Disease, UCL Institute of Neurology, London, UK
  - UK Dementia Research Institute, UCL, London, UK
- Fredrik D Johansson
  - Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
5
Tavara S, Schliep A. Effects of network topology on the performance of consensus and distributed learning of SVMs using ADMM. PeerJ Comput Sci 2021; 7:e397. [PMID: 33817043 PMCID: PMC7959654 DOI: 10.7717/peerj-cs.397] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2020] [Accepted: 01/26/2021] [Indexed: 06/12/2023]
Abstract
The Alternating Direction Method of Multipliers (ADMM) is a popular and promising distributed framework for solving large-scale machine learning problems. We consider decentralized consensus-based ADMM in which nodes may only communicate with one-hop neighbors. This may cause slow convergence. We investigate the impact of network topology on the performance of ADMM-based learning of Support Vector Machines (SVMs) using expander and mean-degree graphs, as well as some of the common modern network topologies. In particular, we investigate to what degree the expansion property of the network influences convergence in terms of iterations, training time and communication time. We furthermore suggest which topology is preferable. Additionally, we provide an implementation that makes these theoretical advances easily available. The results show that the convergence of decentralized ADMM-based learning of SVMs improves on graphs with large spectral gaps and with higher and homogeneous degrees.
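To illustrate why topology matters when nodes may only talk to one-hop neighbors, here is a deliberately simplified sketch. It does not implement ADMM or SVM training; it swaps in plain neighbor averaging, where each node repeatedly replaces its value with the mean of itself and its neighbors, and counts rounds until all nodes agree. The graph constructors, function names and tolerance are assumptions made for illustration.

```python
def ring(n):
    """Ring graph: each node is linked to its two adjacent nodes."""
    return [[(i - 1) % n, (i + 1) % n] for i in range(n)]

def complete(n):
    """Complete graph: each node is linked to every other node."""
    return [[j for j in range(n) if j != i] for i in range(n)]

def consensus_iterations(neighbors, values, tol=1e-6, max_iter=100000):
    """Rounds of one-hop averaging until max - min of node values < tol."""
    x = list(values)
    for it in range(1, max_iter + 1):
        # Every node averages its own value with its neighbors' values.
        x = [
            (x[i] + sum(x[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
            for i in range(len(x))
        ]
        if max(x) - min(x) < tol:
            return it
    return max_iter

start = [float(v) for v in range(16)]  # one distinct value per node
ring_rounds = consensus_iterations(ring(16), start)
complete_rounds = consensus_iterations(complete(16), start)
```

On the complete graph every node sees all values and agrees after a single round, whereas the sparsely connected ring needs many more rounds; this mirrors, in a toy setting, the abstract's point that connectivity properties of the network influence convergence.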
6
Johansson S, Thakkar A, Kogej T, Bjerrum E, Genheden S, Bastys T, Kannas C, Schliep A, Chen H, Engkvist O. AI-assisted synthesis prediction. Drug Discov Today Technol 2020; 32-33:65-72. [PMID: 33386096 DOI: 10.1016/j.ddtec.2020.06.002] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 04/16/2020] [Revised: 06/01/2020] [Accepted: 06/10/2020] [Indexed: 11/25/2022]
Abstract
Application of AI technologies in synthesis prediction has developed very rapidly in recent years. We attempt here to give a comprehensive summary of the latest advances in retro-synthesis planning, forward synthesis prediction, and quantum chemistry-based reaction prediction models. Besides an introduction to the AI/ML models for addressing various synthesis-related problems, the sources of the reaction datasets used in model building are also covered. In addition to the predictive models, robotics-based high-throughput experimentation technology will be another crucial factor for conducting synthesis in an automated fashion. Some state-of-the-art high-throughput experimentation practices carried out in the pharmaceutical industry are highlighted in this chapter to give the reader a sense of how future chemistry will be conducted to make compounds faster and cheaper.
Affiliation(s)
- Simon Johansson
  - Hit Discovery, Discovery Sciences, R&D, AstraZeneca Gothenburg, Sweden
  - Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Amol Thakkar
  - Hit Discovery, Discovery Sciences, R&D, AstraZeneca Gothenburg, Sweden
  - Department of Chemistry and Biochemistry, University of Bern, Freiestrasse 3, 3012 Bern, Switzerland
- Thierry Kogej
  - Hit Discovery, Discovery Sciences, R&D, AstraZeneca Gothenburg, Sweden
- Esben Bjerrum
  - Hit Discovery, Discovery Sciences, R&D, AstraZeneca Gothenburg, Sweden
- Samuel Genheden
  - Hit Discovery, Discovery Sciences, R&D, AstraZeneca Gothenburg, Sweden
- Tomas Bastys
  - Hit Discovery, Discovery Sciences, R&D, AstraZeneca Gothenburg, Sweden
- Christos Kannas
  - Hit Discovery, Discovery Sciences, R&D, AstraZeneca Gothenburg, Sweden
- Alexander Schliep
  - Department of Computer Science and Engineering, University of Gothenburg, Gothenburg, Sweden
- Hongming Chen
  - Centre of Chemistry and Chemical Biology, Guangzhou Regenerative Medicine and Health - Guangdong Laboratory, Guangzhou 510530, China
- Ola Engkvist
  - Hit Discovery, Discovery Sciences, R&D, AstraZeneca Gothenburg, Sweden
7
Judd N, Sauce B, Wiedenhoeft J, Tromp J, Chaarani B, Schliep A, van Noort B, Penttilä J, Grimmer Y, Insensee C, Becker A, Banaschewski T, Bokde ALW, Quinlan EB, Desrivières S, Flor H, Grigis A, Gowland P, Heinz A, Ittermann B, Martinot JL, Paillère Martinot ML, Artiges E, Nees F, Papadopoulos Orfanos D, Paus T, Poustka L, Hohmann S, Millenet S, Fröhner JH, Smolka MN, Walter H, Whelan R, Schumann G, Garavan H, Klingberg T. Cognitive and brain development is independently influenced by socioeconomic status and polygenic scores for educational attainment. Proc Natl Acad Sci U S A 2020; 117:12411-12418. [PMID: 32430323 PMCID: PMC7275733 DOI: 10.1073/pnas.2001228117] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Indexed: 02/07/2023] Open
Abstract
Genetic factors and socioeconomic status (SES) inequalities play a large role in educational attainment, and both have been associated with variations in brain structure and cognition. However, genetics and SES are correlated, and no prior study has assessed their neural associations independently. Here we used a polygenic score for educational attainment (EduYears-PGS), as well as SES, in a longitudinal study of 551 adolescents to tease apart genetic and environmental associations with brain development and cognition. Subjects received a structural MRI scan at ages 14 and 19. At both time points, they performed three working memory (WM) tasks. SES and EduYears-PGS were correlated (r = 0.27) and had both common and independent associations with brain structure and cognition. Specifically, lower SES was related to less total cortical surface area and lower WM. EduYears-PGS was also related to total cortical surface area, but in addition had a regional association with surface area in the right parietal lobe, a region related to nonverbal cognitive functions, including mathematics, spatial cognition, and WM. SES, but not EduYears-PGS, was related to a change in total cortical surface area from age 14 to 19. This study demonstrates a regional association of EduYears-PGS and the independent prediction of SES with cognitive function and brain development. It suggests that the SES inequalities, in particular parental education, are related to global aspects of cortical development, and exert a persistent influence on brain development during adolescence.
Affiliation(s)
- Nicholas Judd
  - Department of Neuroscience, Karolinska Institute, Stockholm, 17165, Sweden
- Bruno Sauce
  - Department of Neuroscience, Karolinska Institute, Stockholm, 17165, Sweden
- John Wiedenhoeft
  - Department of Medical Statistics, University of Göttingen, Göttingen, 37073, Germany
- Jeshua Tromp
  - Department of Cognitive Psychology, Leiden University, Leiden, 2311, The Netherlands
- Bader Chaarani
  - Department of Psychiatry, University of Vermont, Burlington, VT 05405
  - Department of Psychological Science, University of Vermont, Burlington, VT 05405
- Alexander Schliep
  - Department of Computer Science and Engineering, University of Gothenburg, Gothenburg, 41756, Sweden
- Betteke van Noort
  - Hochschule für Gesundheit und Medizin, Medical School Berlin, Berlin, 14197, Germany
- Jani Penttilä
  - Department of Social and Health Care, Psychosocial Services Adolescent Outpatient Clinic, University of Tampere, Lahti, 33100, Finland
- Yvonne Grimmer
  - Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, 69117, Germany
- Corinna Insensee
  - Department of Child and Adolescent Psychiatry and Psychotherapy, University Medical Center, Göttingen, 37075, Germany
- Andreas Becker
  - Department of Child and Adolescent Psychiatry and Psychotherapy, University Medical Center, Göttingen, 37075, Germany
- Tobias Banaschewski
  - Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, 69117, Germany
- Arun L W Bokde
  - Discipline of Psychiatry, School of Medicine, Trinity College Dublin, Dublin, D02 PN40, Ireland
  - Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin, D02 PN40, Ireland
- Erin Burke Quinlan
  - Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, SE5 8AF, United Kingdom
- Sylvane Desrivières
  - Centre for Population Neuroscience and Precision Medicine (PONS), Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, SE5 8AF, United Kingdom
- Herta Flor
  - Institute of Cognitive and Clinical Neuroscience, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, 69117, Germany
  - Department of Psychology, School of Social Sciences, University of Mannheim, Mannheim, 68131, Germany
- Antoine Grigis
  - NeuroSpin, French Alternative Energies and Atomic Energy Commission (CEA), Université Paris-Saclay, F-91191 Gif-sur-Yvette, France
- Penny Gowland
  - Sir Peter Mansfield Imaging Centre, School of Physics and Astronomy, University of Nottingham, Nottingham, NG7 2RD, United Kingdom
- Andreas Heinz
  - Department of Psychiatry and Psychotherapy, Campus Charité Mitte, Charité, Universitätsmedizin Berlin, Berlin, 10117, Germany
- Bernd Ittermann
  - Physikalisch-Technische Bundesanstalt, Berlin, 38116, Germany
- Jean-Luc Martinot
  - INSERM Unit 1000 "Neuroimaging & Psychiatry," Institut National de la Santé et de la Recherche Médicale, University Paris Saclay, University Paris Descartes, Paris, 75006, France
- Marie-Laure Paillère Martinot
  - INSERM Unit 1000 "Neuroimaging & Psychiatry," Institut National de la Santé et de la Recherche Médicale, University Paris Saclay, University Paris Descartes, Paris, 75006, France
  - Department of Child and Adolescent Psychiatry, Pitié-Salpêtrière Hospital, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, 75006, France
- Eric Artiges
  - INSERM Unit 1000 "Neuroimaging & Psychiatry," Institut National de la Santé et de la Recherche Médicale, University Paris Saclay, University Paris Descartes, Paris, 75006, France
- Frauke Nees
  - Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, 69117, Germany
  - Department of Psychology, School of Social Sciences, University of Mannheim, Mannheim, 68131, Germany
- Dimitri Papadopoulos Orfanos
  - NeuroSpin, French Alternative Energies and Atomic Energy Commission (CEA), Université Paris-Saclay, F-91191 Gif-sur-Yvette, France
- Tomáš Paus
  - Bloorview Research Institute, Holland Bloorview Kids Rehabilitation Hospital, University of Toronto, Toronto, ON M6A 2E1, Canada
  - Department of Psychology, University of Toronto, Toronto, ON M6A 2E1, Canada
  - Department of Psychiatry, University of Toronto, Toronto, ON M6A 2E1, Canada
- Luise Poustka
  - Department of Child and Adolescent Psychiatry and Psychotherapy, University Medical Center, Göttingen, 37075, Germany
- Sarah Hohmann
  - Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, 69117, Germany
- Sabina Millenet
  - Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, 69117, Germany
- Juliane H Fröhner
  - Department of Psychiatry and Psychotherapy, Technische Universität Dresden, Dresden, 01087, Germany
- Michael N Smolka
  - Department of Psychiatry, Technische Universität Dresden, Dresden, 01062, Germany
  - Neuroimaging Center, Technische Universität Dresden, Dresden, 01069, Germany
- Henrik Walter
  - Department of Psychiatry and Psychotherapy, Campus Charité Mitte, Charité, Universitätsmedizin Berlin, Berlin, 10117, Germany
- Robert Whelan
  - School of Psychology, Trinity College Dublin, Dublin, D02 PN40, Ireland
  - Global Brain Health Institute, Trinity College Dublin, Dublin, D02 PN40, Ireland
- Gunter Schumann
  - Centre for Population Neuroscience and Precision Medicine (PONS), Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, SE5 8AF, United Kingdom
- Hugh Garavan
  - Department of Psychiatry, University of Vermont, Burlington, VT 05405
  - Department of Psychological Science, University of Vermont, Burlington, VT 05405
- Torkel Klingberg
  - Department of Neuroscience, Karolinska Institute, Stockholm, 17165, Sweden
8
Bakker FT, Antonelli A, Clarke JA, Cook JA, Edwards SV, Ericson PGP, Faurby S, Ferrand N, Gelang M, Gillespie RG, Irestedt M, Lundin K, Larsson E, Matos-Maraví P, Müller J, von Proschwitz T, Roderick GK, Schliep A, Wahlberg N, Wiedenhoeft J, Källersjö M. The Global Museum: natural history collections and the future of evolutionary science and public education. PeerJ 2020; 8:e8225. [PMID: 32025365 PMCID: PMC6993751 DOI: 10.7717/peerj.8225] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 04/18/2019] [Accepted: 11/15/2019] [Indexed: 12/27/2022] Open
Abstract
Natural history museums are unique spaces for interdisciplinary research and educational innovation. Through extensive exhibits and public programming and by hosting rich communities of amateurs, students, and researchers at all stages of their careers, they can provide a place-based window to focus on integration of science and discovery, as well as a locus for community engagement. At the same time, like a synthesis radio telescope, when joined together through emerging digital resources, the global community of museums (the ‘Global Museum’) is more than the sum of its parts, allowing insights and answers to diverse biological, environmental, and societal questions at the global scale, across eons of time, and spanning vast diversity across the Tree of Life. We argue that, whereas natural history collections and museums began with a focus on describing the diversity and peculiarities of species on Earth, they are now increasingly leveraged in new ways that significantly expand their impact and relevance. These new directions include the possibility to ask new, often interdisciplinary questions in basic and applied science, such as in biomimetic design, and by contributing to solutions to climate change, global health and food security challenges. As institutions, they have long been incubators for cutting-edge research in biology while simultaneously providing core infrastructure for research on present and future societal needs. Here we explore how the intersection between pressing issues in environmental and human health and rapid technological innovation have reinforced the relevance of museum collections. We do this by providing examples as food for thought for both the broader academic community and museum scientists on the evolving role of museums. We also identify challenges to the realization of the full potential of natural history collections and the Global Museum to science and society and discuss the critical need to grow these collections. 
We then focus on mapping and modelling of museum data (including place-based approaches and discovery), and explore the main projects, platforms and databases enabling this growth. Finally, we aim to improve relevant protocols for the long-term storage of specimens and tissues, ensuring proper connection with tomorrow’s technologies and hence further increasing the relevance of natural history museums.
Affiliation(s)
- Freek T Bakker
- Biosystematics Group, Wageningen University & Research, Wageningen, The Netherlands
| | | | - Julia A Clarke
- Jackson School of Geosciences, University of Texas at Austin, Austin, TX, United States of America
| | - Joseph A Cook
- Museum of Southwestern Biology, Department of Biology, University of New Mexico, Albuquerque, NM, United States of America
| | - Scott V Edwards
- Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA, United States of America.,Gothenburg Centre for Advanced Studies in Science and Technology, Chalmers University of Technology and University of Gothenburg, Göteborg, Sweden
| | - Per G P Ericson
- Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, Sweden
| | - Søren Faurby
- Department of Biological and Environmental Sciences, Gothenburg Global Biodiversity Centre, University of Gothenburg, Göteborg, Sweden
| | - Nuno Ferrand
- Museu de História Natural e da Ciência, Universidade do Porto, Porto, Portugal
| | - Magnus Gelang
- Department of Zoology, Gothenburg Natural History Museum, Göteborg, Sweden.,Gothenburg Global Biodiversity Centre, University of Gothenburg, Göteborg, Sweden
| | - Rosemary G Gillespie
- Essig Museum of Entomology, Department of Environmental Science, Policy and Management, University of California, Berkeley, Berkeley, CA, United States of America
| | - Martin Irestedt
- Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, Sweden
| | - Kennet Lundin
- Department of Zoology, Gothenburg Natural History Museum, Göteborg, Sweden.,Gothenburg Global Biodiversity Centre, University of Gothenburg, Göteborg, Sweden
| | - Ellen Larsson
- Department of Biological and Environmental Sciences, Gothenburg Global Biodiversity Centre, University of Gothenburg, Göteborg, Sweden.,Gothenburg Global Biodiversity Centre, University of Gothenburg, Göteborg, Sweden
| | - Pável Matos-Maraví
- Biology Centre of the Czech Academy of Sciences, Institute of Entomology, České Budějovice, Czechia
| | - Johannes Müller
- Leibniz-Institut für Evolutions- und Biodiversitätsforschung, Museum für Naturkunde, Berlin, Germany
| | - Ted von Proschwitz
- Department of Zoology, Gothenburg Natural History Museum, Göteborg, Sweden.,Gothenburg Global Biodiversity Centre, University of Gothenburg, Göteborg, Sweden
| | - George K Roderick
- Essig Museum of Entomology, Department of Environmental Science, Policy and Management, University of California, Berkeley, Berkeley, CA, United States of America
| | - Alexander Schliep
- Department of Computer Science and Engineering, University of Gothenburg, Göteborg, Sweden
| | | | - John Wiedenhoeft
- Department of Computer Science and Engineering, University of Gothenburg, Göteborg, Sweden
| | - Mari Källersjö
- Gothenburg Global Biodiversity Centre, University of Gothenburg, Göteborg, Sweden
- Gothenburg Botanical Garden, Göteborg, Sweden
|
9
|
Martinsson J, Schliep A, Eliasson B, Mogren O. Blood Glucose Prediction with Variance Estimation Using Recurrent Neural Networks. J Healthc Inform Res 2019; 4:1-18. [PMID: 35415439 PMCID: PMC8982803 DOI: 10.1007/s41666-019-00059-y] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 12/12/2018] [Revised: 04/26/2019] [Accepted: 10/18/2019] [Indexed: 11/28/2022]
Abstract
Many factors affect blood glucose levels in type 1 diabetics, several of which vary largely both in magnitude and delay of the effect. Modern rapid-acting insulins generally have a peak time after 60–90 min, while carbohydrate intake can affect blood glucose levels more rapidly for high glycemic index foods, or slower for other carbohydrate sources. It is important to have good estimates of the development of glucose levels in the near future both for diabetic patients managing their insulin distribution manually, as well as for closed-loop systems making decisions about the distribution. Modern continuous glucose monitoring systems provide excellent sources of data to train machine learning models to predict future glucose levels. In this paper, we present an approach for predicting blood glucose levels for diabetics up to 1 h into the future. The approach is based on recurrent neural networks trained in an end-to-end fashion, requiring nothing but the glucose level history for the patient. Our approach obtains results that are comparable to the state of the art on the Ohio T1DM dataset for blood glucose level prediction. In addition to predicting the future glucose value, our model provides an estimate of its certainty, helping users to interpret the predicted levels. This is realized by training the recurrent neural network to parameterize a univariate Gaussian distribution over the output. The approach needs no feature engineering or data preprocessing and is computationally inexpensive. We evaluate our method using the standard root-mean-squared error (RMSE) metric, along with a blood glucose-specific metric called the surveillance error grid (SEG). We further study the properties of the distribution that is learned by the model, using experiments that determine the nature of the certainty estimate that the model is able to capture.
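As a rough illustration of the two quantities this abstract names, the sketch below computes the negative log-likelihood of a target under a univariate Gaussian (the usual training loss when a network outputs a mean and a standard deviation) and the RMSE of point predictions. The function names `gaussian_nll` and `rmse` are hypothetical, and nothing here is taken from the authors' implementation; the recurrent network itself is omitted.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of a target y under N(mu, sigma^2).

    mu and sigma stand in for the two outputs of a model that
    parameterizes a univariate Gaussian (an assumption of this sketch).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)


def rmse(targets, predictions):
    """Root-mean-squared error between two equally long sequences."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))
```

With such a loss, a prediction that is far off is penalized much more when the stated standard deviation is small than when it is large, which is what lets the variance output act as a certainty estimate.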
Affiliation(s)
| | | | | | - Olof Mogren
- RISE Research Institutes of Sweden, Gothenburg, Sweden
|
10
|
Wiedenhoeft J, Cagan A, Kozhemyakina R, Gulevich R, Schliep A. Bayesian localization of CNV candidates in WGS data within minutes. Algorithms Mol Biol 2019; 14:20. [PMID: 31572486 PMCID: PMC6757390 DOI: 10.1186/s13015-019-0154-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2018] [Accepted: 08/08/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Full Bayesian inference for detecting copy number variants (CNV) from whole-genome sequencing (WGS) data is still largely infeasible due to computational demands. A recently introduced approach to perform Forward-Backward Gibbs sampling using dynamic Haar wavelet compression has alleviated issues of convergence and, to some extent, speed. Yet, the problem remains challenging in practice. RESULTS In this paper, we propose an improved algorithmic framework for this approach. We provide new space-efficient data structures to query sufficient statistics in logarithmic time, based on a linear-time, in-place transform of the data, which also improves on the compression ratio. We also propose a new approach to efficiently store and update marginal state counts obtained from the Gibbs sampler. CONCLUSIONS Using this approach, we discover several CNV candidates in two rat populations divergently selected for tame and aggressive behavior, consistent with earlier results concerning the domestication syndrome as well as experimental observations. Computationally, we observe a 29.5-fold decrease in memory, an average 5.8-fold speedup, as well as a 191-fold decrease in minor page faults. We also observe that metrics varied greatly in the old implementation, but not the new one. We conjecture that this is due to the better compression scheme. The fully Bayesian segmentation of the entire WGS data set required 3.5 min and 1.24 GB of memory, and can hence be performed on a commodity laptop.
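To make the idea of querying sufficient statistics per block concrete, here is a minimal sketch that uses plain prefix sums. This is a deliberately simpler stand-in and not the in-place, wavelet-based data structure the abstract describes; the names `BlockStats` and `block_log_likelihood` are hypothetical. It relies only on the identity sum((x - mu)^2) = S2 - 2*mu*S1 + n*mu^2, so the Gaussian log-likelihood of a whole block needs just its count, sum, and sum of squares.

```python
import math
from itertools import accumulate


class BlockStats:
    """Prefix sums giving (n, sum, sum of squares) for any block [start, end)."""

    def __init__(self, data):
        self.s1 = [0.0] + list(accumulate(data))
        self.s2 = [0.0] + list(accumulate(x * x for x in data))

    def query(self, start, end):
        n = end - start
        return n, self.s1[end] - self.s1[start], self.s2[end] - self.s2[start]


def block_log_likelihood(stats, start, end, mu, sigma):
    """Log-likelihood of data[start:end] under N(mu, sigma^2), from block statistics only."""
    n, s1, s2 = stats.query(start, end)
    squared_dev = s2 - 2 * mu * s1 + n * mu * mu
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - squared_dev / (2 * sigma ** 2)
```

Prefix sums answer each query in constant time but need two extra arrays and can lose precision on long inputs; the abstract's contribution is precisely a more space-efficient structure, which this sketch does not attempt to reproduce.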
|
11
|
Bravo GA, Antonelli A, Bacon CD, Bartoszek K, Blom MPK, Huynh S, Jones G, Knowles LL, Lamichhaney S, Marcussen T, Morlon H, Nakhleh LK, Oxelman B, Pfeil B, Schliep A, Wahlberg N, Werneck FP, Wiedenhoeft J, Willows-Munro S, Edwards SV. Embracing heterogeneity: coalescing the Tree of Life and the future of phylogenomics. PeerJ 2019; 7:e6399. [PMID: 30783571 PMCID: PMC6378093 DOI: 10.7717/peerj.6399] [Citation(s) in RCA: 67] [Impact Index Per Article: 13.4] [Received: 02/05/2018] [Accepted: 01/07/2019] [Indexed: 12/23/2022] Open
Abstract
Building the Tree of Life (ToL) is a major challenge of modern biology, requiring advances in cyberinfrastructure, data collection, theory, and more. Here, we argue that phylogenomics stands to benefit by embracing the many heterogeneous genomic signals emerging from the first decade of large-scale phylogenetic analysis spawned by high-throughput sequencing (HTS). Such signals include those most commonly encountered in phylogenomic datasets, such as incomplete lineage sorting, but also those reticulate processes emerging with greater frequency, such as recombination and introgression. Here we focus specifically on how phylogenetic methods can accommodate the heterogeneity incurred by such population genetic processes; we do not discuss phylogenetic methods that ignore such processes, such as concatenation or supermatrix approaches or supertrees. We suggest that methods of data acquisition and the types of markers used in phylogenomics will remain restricted until a posteriori methods of marker choice are made possible with routine whole-genome sequencing of taxa of interest. We discuss limitations and potential extensions of a model supporting innovation in phylogenomics today, the multispecies coalescent model (MSC). Macroevolutionary models that use phylogenies, such as character mapping, often ignore the heterogeneity on which building phylogenies increasingly relies; we suggest that assimilating such heterogeneity is an important goal moving forward. Finally, we argue that an integrative cyberinfrastructure linking all steps of the process of building the ToL, from specimen acquisition in the field to publication and tracking of phylogenomic data, as well as a culture that values contributors at each step, are essential for progress.
Affiliation(s)
- Gustavo A. Bravo
- Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
| | - Alexandre Antonelli
- Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
- Gothenburg Global Biodiversity Centre, Göteborg, Sweden
- Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden
- Gothenburg Botanical Garden, Göteborg, Sweden
| | - Christine D. Bacon
- Gothenburg Global Biodiversity Centre, Göteborg, Sweden
- Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden
| | - Krzysztof Bartoszek
- Department of Computer and Information Science, Linköping University, Linköping, Sweden
| | - Mozes P. K. Blom
- Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, Sweden
| | - Stella Huynh
- Institut de Biologie, Université de Neuchâtel, Neuchâtel, Switzerland
| | - Graham Jones
- Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden
| | - L. Lacey Knowles
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, USA
| | - Sangeet Lamichhaney
- Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
| | - Thomas Marcussen
- Centre for Ecological and Evolutionary Synthesis, University of Oslo, Oslo, Norway
| | - Hélène Morlon
- Institut de Biologie, Ecole Normale Supérieure de Paris, Paris, France
| | - Luay K. Nakhleh
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Bengt Oxelman
- Gothenburg Global Biodiversity Centre, Göteborg, Sweden
- Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden
| | - Bernard Pfeil
- Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden
| | - Alexander Schliep
- Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Göteborg, Sweden
| | | | - Fernanda P. Werneck
- Coordenação de Biodiversidade, Programa de Coleções Científicas Biológicas, Instituto Nacional de Pesquisa da Amazônia, Manaus, AM, Brazil
| | - John Wiedenhoeft
- Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Göteborg, Sweden
- Department of Computer Science, Rutgers University, Piscataway, NJ, USA
| | - Sandi Willows-Munro
- School of Life Sciences, University of Kwazulu-Natal, Pietermaritzburg, South Africa
| | - Scott V. Edwards
- Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
- Gothenburg Centre for Advanced Studies in Science and Technology, Chalmers University of Technology and University of Gothenburg, Göteborg, Sweden
|
12
|
Abstract
CNV detection requires a high-quality segmentation of genomic data. In many WGS experiments, sample and control are sequenced together in a multiplexed fashion using DNA barcoding for economic reasons. Using the differential read depth of these two conditions cancels out systematic additive errors. Due to this detrending, the resulting data is appropriate for inference using a hidden Markov model (HMM), arguably one of the principal models for labeled segmentation. However, while the usual frequentist approaches such as Baum-Welch are problematic for several reasons, they are often preferred to Bayesian HMM inference, which normally requires prohibitively long running times and exceeds a typical user's computational resources on genome-scale data. HaMMLET solves this problem using a dynamic wavelet compression scheme, which makes Bayesian segmentation of WGS data feasible on standard consumer hardware.
Affiliation(s)
- John Wiedenhoeft
- Chalmers University of Technology, Gothenburg, Sweden.
- Rutgers University, New Brunswick, NJ, USA.
|
13
|
O N Lopes ID, Schliep A, de L F de Carvalho AP. Automatic learning of pre-miRNAs from different species. BMC Bioinformatics 2016; 17:224. [PMID: 27233515 PMCID: PMC4884428 DOI: 10.1186/s12859-016-1036-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 07/30/2015] [Accepted: 04/12/2016] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Discovery of microRNAs (miRNAs) relies on predictive models for characteristic features from miRNA precursors (pre-miRNAs). The short length of miRNA genes and the lack of pronounced sequence features complicate this task. To accommodate the peculiarities of plant and animal miRNAs systems, tools for both systems have evolved differently. However, these tools are biased towards the species for which they were primarily developed and, consequently, their predictive performance on data sets from other species of the same kingdom might be lower. While these biases are intrinsic to the species, their characterization can lead to computational approaches capable of diminishing their negative effect on the accuracy of pre-miRNAs predictive models. We investigate in this study how 45 predictive models induced for data sets from 45 species, distributed in eight subphyla/classes, perform when applied to a species different from the species used in its induction. RESULTS Our computational experiments show that the separability of pre-miRNAs and pseudo pre-miRNAs instances is species-dependent and no feature set performs well for all species, even within the same subphylum/class. Mitigating this species dependency, we show that an ensemble of classifiers reduced the classification errors for all 45 species. As the ensemble members were obtained using meaningful, and yet computationally viable feature sets, the ensembles also have a lower computational cost than individual classifiers that rely on energy stability parameters, which are of prohibitive computational cost in large scale applications. CONCLUSION In this study, the combination of multiple pre-miRNAs feature sets and multiple learning biases enhanced the predictive accuracy of pre-miRNAs classifiers of 45 species. This is certainly a promising approach to be incorporated in miRNA discovery tools towards more accurate and less species-dependent tools. 
The material to reproduce the results from this paper can be downloaded from http://dx.doi.org/10.5281/zenodo.49754 .
Affiliation(s)
- Ivani de O N Lopes
- Empresa Brasileira de Pesquisa Agropecuária, Embrapa Soja, Caixa Postal 231, Londrina-PR, 86001-970, CEP, Brasil.
| | - Alexander Schliep
- Department of Computer Science, Rutgers University, 110 Frelinghuysen Road, Piscataway, 08854, NJ, USA
| | - André P de L F de Carvalho
- Instituto de Ciências Matemáticas e de Computação, Avenida Trabalhador são-carlense, 400 - Centro, São Carlos SP, Brasil
|
14
|
Lopes IDON, Schliep A, de Carvalho ACPDLF. The discriminant power of RNA features for pre-miRNA recognition. BMC Bioinformatics 2014; 15:124. [PMID: 24884650 PMCID: PMC4046174 DOI: 10.1186/1471-2105-15-124] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Received: 10/25/2013] [Accepted: 04/08/2014] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND Computational discovery of microRNAs (miRNA) is based on pre-determined sets of features from miRNA precursors (pre-miRNA). Some feature sets are composed of sequence-structure patterns commonly found in pre-miRNAs, while others are a combination of more sophisticated RNA features. In this work, we analyze the discriminant power of seven feature sets, which are used in six pre-miRNA prediction tools. The analysis is based on the classification performance achieved with these feature sets for the training algorithms used in these tools. We also evaluate feature discrimination through the F-score and feature importance in the induction of random forests. RESULTS Small or non-significant differences were found among the estimated classification performances of classifiers induced using sets with diversification of features, despite the wide differences in their dimension. Inspired by these results, we obtained a lower-dimensional feature set, which achieved a sensitivity of 90% and a specificity of 95%. These estimates are within 0.1% of the maximal values obtained with any feature set (SELECT, Section "Results and discussion") while it is 34 times faster to compute. Even compared to another feature set (FS2, see Section "Results and discussion"), which is the computationally least expensive feature set of those from the literature which perform within 0.1% of the maximal values, it is 34 times faster to compute. The results obtained by the tools used as references in the experiments carried out showed that five out of these six tools have lower sensitivity or specificity. CONCLUSION In miRNA discovery the number of putative miRNA loci is in the order of millions. Analysis of putative pre-miRNAs using a computationally expensive feature set would be wasteful or even unfeasible for large genomes. 
In this work, we propose a relatively inexpensive feature set and explore most of the learning aspects implemented in current ab-initio pre-miRNA prediction tools, which may lead to the development of efficient ab-initio pre-miRNA discovery tools. The material to reproduce the main results from this paper can be downloaded from http://bioinformatics.rutgers.edu/Static/Software/discriminant.tar.gz.
Affiliation(s)
- Ivani de O N Lopes
- Empresa Brasileira de Pesquisa Agropecuária, Embrapa Soja, Caixa Postal 231, Londrina-PR, CEP 86001-970, Brasil.
|
15
|
|
16
|
Abstract
Motivation: Mapping billions of reads from next generation sequencing experiments to reference genomes is a crucial task, which can require hundreds of hours of running time on a single CPU even for the fastest known implementations. Traditional approaches have difficulties dealing with matches of large edit distance, particularly in the presence of frequent or large insertions and deletions (indels). This is a serious obstacle both in determining the spectrum and abundance of genetic variations and in personal genomics. Results: For the first time, we adopt the approximate string matching paradigm of geometric embedding to read mapping, thus rephrasing it to nearest neighbor queries in a q-gram frequency vector space. Using the L1 distance between frequency vectors has the benefit of providing lower bounds for an edit distance with affine gap costs. Using a cache-oblivious kd-tree, we realize running times that match the state of the art. Additionally, running time and memory requirements are about constant for read lengths between 100 and 1000 bp. We provide a first proof-of-concept that geometric embedding is a promising paradigm for read mapping and that L1 distance might serve to detect structural variations. TreQ, our initial implementation of that concept, performs more accurately than many popular read mappers over a wide range of structural variants. Availability and implementation: TreQ will be released under the GNU Public License (GPL), and precomputed genome indices will be provided for download at http://treq.sf.net. Contact: pavelm@cs.rutgers.edu Supplementary information: Supplementary data are available at Bioinformatics online.
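The embedding idea in this abstract can be sketched in a few lines: count the q-grams of each sequence, compare the count vectors with the L1 distance, and turn that distance into a lower bound on edit distance. The sketch below assumes unit-cost edit distance and the classical q-gram lemma (one edit changes at most q q-grams in each string, so the L1 distance is at most 2q times the edit distance); the affine gap cost bound used by TreQ is not reproduced here, and all function names are hypothetical.

```python
from collections import Counter


def qgram_profile(sequence, q):
    """Count every overlapping substring of length q."""
    return Counter(sequence[i:i + q] for i in range(len(sequence) - q + 1))


def l1_distance(profile_a, profile_b):
    """L1 distance between two q-gram count vectors (missing q-grams count as 0)."""
    return sum(abs(profile_a[g] - profile_b[g]) for g in set(profile_a) | set(profile_b))


def edit_distance_lower_bound(a, b, q):
    """Lower bound on unit-cost edit distance: ceil(L1 / (2 * q))."""
    distance = l1_distance(qgram_profile(a, q), qgram_profile(b, q))
    return -(-distance // (2 * q))  # integer ceiling division
```

Because the bound never overestimates the true distance, candidates whose bound already exceeds an allowed number of edits can be discarded before any alignment is computed, which is what makes nearest neighbor search in the frequency vector space usable as a filter.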
Affiliation(s)
- Md Pavel Mahmud
- Department of Computer Science, Rutgers University, New Jersey, USA.
|
17
|
|
18
|
Affiliation(s)
- Rajat S. Roy
- Department of Computer Science, Rutgers The State University of New Jersey, Piscataway, NJ
| | - Kevin C. Chen
- Department of Genetics, BioMaPS Institute for Quantitative Biology, Rutgers The State University of New Jersey, Piscataway, NJ
| | - Anirvan M. Sengupta
- Department of Physics and Astronomy, BioMaPS Institute for Quantitative Biology, Rutgers The State University of New Jersey, Piscataway, NJ
| | - Alexander Schliep
- Department of Computer Science, BioMaPS Institute for Quantitative Biology, Rutgers The State University of New Jersey, Piscataway, NJ
|
19
|
Mahmud MP, Schliep A. Fast MCMC sampling for hidden Markov Models to determine copy number variations. BMC Bioinformatics 2011; 12:428. [PMID: 22047014 PMCID: PMC3371636 DOI: 10.1186/1471-2105-12-428] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 02/17/2011] [Accepted: 11/02/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Hidden Markov Models (HMM) are often used for analyzing Comparative Genomic Hybridization (CGH) data to identify chromosomal aberrations or copy number variations by segmenting observation sequences. For efficiency reasons the parameters of an HMM are often estimated with maximum likelihood and a segmentation is obtained with the Viterbi algorithm. This introduces considerable uncertainty in the segmentation, which can be avoided with Bayesian approaches integrating out parameters using Markov Chain Monte Carlo (MCMC) sampling. While the advantages of Bayesian approaches have been clearly demonstrated, the likelihood based approaches are still preferred in practice for their lower running times; datasets coming from high-density arrays and next generation sequencing amplify these problems. RESULTS We propose an approximate sampling technique, inspired by compression of discrete sequences in HMM computations and by kd-trees to leverage spatial relations between data points in typical data sets, to speed up the MCMC sampling. CONCLUSIONS We test our approximate sampling method on simulated and biological ArrayCGH datasets and high-density SNP arrays, and demonstrate speed-ups of 10 to 60 and of 90, respectively, while achieving competitive results with the state-of-the-art Bayesian approaches. AVAILABILITY An implementation of our method will be made available as part of the open source GHMM library from http://ghmm.org.
Affiliation(s)
- Md Pavel Mahmud
- Department of Computer Science, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ 08854, USA.
|
20
|
Hafemeister C, Krause R, Schliep A. Selecting oligonucleotide probes for whole-genome tiling arrays with a cross-hybridization potential. IEEE/ACM Trans Comput Biol Bioinform 2011; 8:1642-1652. [PMID: 21358006 DOI: 10.1109/tcbb.2011.39] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2023]
Abstract
For designing oligonucleotide tiling arrays, popular current methods still rely on simple criteria like Hamming distance or longest common factors, neglecting base stacking effects which strongly contribute to binding energies. Consequently, probes are often prone to cross-hybridization which reduces the signal-to-noise ratio and complicates downstream analysis. We propose the first computationally efficient method using hybridization energy to identify specific oligonucleotide probes. Our Cross-Hybridization Potential (CHP) is computed with a Nearest Neighbor Alignment, which efficiently estimates a lower bound for the Gibbs free energy of the duplex formed by two DNA sequences of bounded length. It is derived from our simplified reformulation of t-gap insertion-deletion-like metrics. The computations are accelerated by a filter using weighted ungapped q-grams to arrive at seeds. The computation of the CHP is implemented in our software OSProbes, available under the GPL, which computes sets of viable probe candidates. The user can choose a trade-off between running time and quality of probes selected. We obtain very favorable results in comparison with prior approaches with respect to specificity and sensitivity for cross-hybridization and genome coverage with high-specificity probes. The combination of OSProbes and our Tileomatic method, which computes optimal tiling paths from candidate sets, yields globally optimal tiling arrays, balancing probe distance, hybridization conditions, and uniqueness of hybridization.
Affiliation(s)
- Christoph Hafemeister
- Department of Biology, New York University, 100 Washington Square East, Rm 1009, New York, NY 10003-6688, USA.
|
21
|
Seifert M, Strickert M, Schliep A, Grosse I. Exploiting prior knowledge and gene distances in the analysis of tumor expression profiles with extended Hidden Markov Models. Bioinformatics 2011; 27:1645-52. [PMID: 21511716 DOI: 10.1093/bioinformatics/btr199] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 11/13/2022]
Abstract
MOTIVATION Changes in gene expression levels play a central role in tumors. Additional information about the distribution of gene expression levels and distances between adjacent genes on chromosomes should be integrated into the analysis of tumor expression profiles. RESULTS We use a Hidden Markov Model with distance-scaled transition matrices (DSHMM) to incorporate chromosomal distances of adjacent genes on chromosomes into the identification of differentially expressed genes in breast cancer. We train the DSHMM by integrating prior knowledge about potential distributions of expression levels of differentially expressed and unchanged genes in tumor. We find that especially the combination of these data and to a lesser extent the modeling of distances between adjacent genes contribute to a substantial improvement of the identification of differentially expressed genes in comparison to other existing methods. This performance benefit is also supported by the identification of genes well known to be associated with breast cancer. That suggests applications of DSHMMs for screening of other tumor expression profiles. AVAILABILITY The DSHMM is available as part of the open-source Java library Jstacs (www.jstacs.de/index.php/DSHMM).
Affiliation(s)
- Michael Seifert
- Department of Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.
|
22
|
Schilling R, Costa IG, Schliep A. pGQL: A probabilistic graphical query language for gene expression time courses. BioData Min 2011; 4:9. [PMID: 21501515 PMCID: PMC3096586 DOI: 10.1186/1756-0381-4-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/12/2010] [Accepted: 04/18/2011] [Indexed: 11/24/2022] Open
Abstract
Background Timeboxes are graphical user interface widgets that were proposed to specify queries on time course data. As queries can be very easily defined, an exploratory analysis of time course data is greatly facilitated. While timeboxes are effective, they have no provisions for dealing with noisy data or data with fluctuations along the time axis, which is very common in many applications. In particular, this is true for the analysis of gene expression time courses, which are mostly derived from noisy microarray measurements at few unevenly sampled time points. From a data mining point of view the robust handling of data through a sound statistical model is of great importance. Results We propose probabilistic timeboxes, which correspond to a specific class of Hidden Markov Models, which constitute an established method in data mining. Since HMMs are a particular class of probabilistic graphical models, we call our method Probabilistic Graphical Query Language. Its implementation was realized in the free software package pGQL. We evaluate its effectiveness in exploratory analysis on a yeast sporulation data set. Conclusions We introduce a new approach to define dynamic, statistical queries on time course data. It supports an interactive exploration of reasonably large amounts of data and enables users without expert knowledge to specify fairly complex statistical models with ease. The expressivity of our approach is by its statistical nature greater and more robust with respect to amplitude and frequency fluctuation than the prior, deterministic timeboxes.
Affiliation(s)
- Ruben Schilling
- Max Planck Institute for Molecular Genetics, Department of Computational Biology, Ihnestr. 63-73, 14195 Berlin, Germany.
|
23
|
Hafemeister C, Costa IG, Schönhuth A, Schliep A. Classifying short gene expression time-courses with Bayesian estimation of piecewise constant functions. Bioinformatics 2011; 27:946-52. [PMID: 21266444 DOI: 10.1093/bioinformatics/btr037] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
MOTIVATION Analyzing short time-courses is a frequent and relevant problem in molecular biology, as, for example, 90% of gene expression time-course experiments span at most nine time-points. The biological or clinical questions addressed are elucidating gene regulation by identification of co-expressed genes, predicting response to treatment in clinical, trial-like settings or classifying novel toxic compounds based on similarity of gene expression time-courses to those of known toxic compounds. The latter problem is characterized by irregular and infrequent sample times and a total lack of prior assumptions about the incoming query, which comes in stark contrast to clinical settings and requires implicitly performing a local, gapped alignment of time series. The current state-of-the-art method (SCOW) uses a variant of dynamic time warping and models time series as higher order polynomials (splines). RESULTS We suggest modeling time-courses monitoring response to toxins by piecewise constant functions, which are modeled as left-right Hidden Markov Models. A Bayesian approach to parameter estimation and inference helps to cope with the short, but highly multivariate time-courses. We improve prediction accuracy by 7% and 4%, respectively, when classifying toxicology and stress response data. We also reduce running times by at least a factor of 140; note that reasonable running times are crucial when classifying response to toxins. In conclusion, we have demonstrated that appropriate reduction of model complexity can result in substantial improvements both in classification performance and running time. AVAILABILITY A Python package implementing the methods described is freely available under the GPL from http://bioinformatics.rutgers.edu/Software/MVQueries/.
Affiliation(s)
- Christoph Hafemeister
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
|
24
|
Georgi B, Costa IG, Schliep A. PyMix--the python mixture package--a tool for clustering of heterogeneous biological data. BMC Bioinformatics 2010; 11:9. [PMID: 20053276 PMCID: PMC2823712 DOI: 10.1186/1471-2105-11-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 08/03/2009] [Accepted: 01/06/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Cluster analysis is an important technique for the exploratory analysis of biological data. Such data is often high-dimensional, inherently noisy and contains outliers. This makes clustering challenging. Mixtures are versatile and powerful statistical models which perform robustly for clustering in the presence of noise and have been successfully applied in a wide range of applications. RESULTS PyMix - the Python mixture package implements algorithms and data structures for clustering with basic and advanced mixture models. The advanced models include context-specific independence mixtures, mixtures of dependence trees and semi-supervised learning. PyMix is licenced under the GNU General Public licence (GPL). PyMix has been successfully used for the analysis of biological sequence, complex disease and gene expression data. CONCLUSIONS PyMix is a useful tool for cluster analysis of biological data. Due to the general nature of the framework, PyMix can be applied to a wide range of applications and data sets.
Affiliation(s)
- Benjamin Georgi
- Max Planck Institute for Molecular Genetics, Dept. of Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin.
|
25
|
Georgi B, Schultz J, Schliep A. Partially-supervised protein subclass discovery with simultaneous annotation of functional residues. BMC Struct Biol 2009; 9:68. [PMID: 19857261 PMCID: PMC2777906 DOI: 10.1186/1472-6807-9-68] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 04/03/2009] [Accepted: 10/26/2009] [Indexed: 03/20/2023]
Abstract
BACKGROUND The study of functional subfamilies of protein domain families and the identification of the residues which determine substrate specificity is an important question in the analysis of protein domains. One way to address this question is the use of clustering methods for protein sequence data and approaches to predict functional residues based on such clusterings. The locations of putative functional residues in known protein structures provide insights into how different substrate specificities are reflected on the protein structure level. RESULTS We have developed an extension of the context-specific independence mixture model clustering framework which allows for the integration of experimental data. As these are usually known only for a few proteins, our algorithm implements a partially-supervised learning approach. We discover domain subfamilies and predict functional residues for four protein domain families: phosphatases, pyridoxal dependent decarboxylases, WW and SH3 domains to demonstrate the usefulness of our approach. CONCLUSION The partially-supervised clustering revealed biologically meaningful subfamilies even for highly heterogeneous domains and the predicted functional residues provide insights into the basis of the different substrate specificities.
Affiliation(s)
- Benjamin Georgi
- Max Planck Institute for Molecular Genetics, Dept. of Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin, Germany.
|
26
|
Abstract
Motivation: Personalized medicine based on molecular aspects of diseases, such as gene expression profiling, has become increasingly popular. However, one faces multiple challenges when analyzing clinical gene expression data; most of the well-known theoretical issues such as high dimension of feature spaces versus few examples, noise and missing data apply. Special care is needed when designing classification procedures that support personalized diagnosis and choice of treatment. Here, we particularly focus on classification of interferon-β (IFNβ) treatment response in Multiple Sclerosis (MS) patients which has attracted substantial attention in the recent past. Half of the patients remain unaffected by IFNβ treatment, which is still the standard. For them the treatment should be timely ceased to mitigate the side effects. Results: We propose constrained estimation of mixtures of hidden Markov models as a methodology to classify patient response to IFNβ treatment. The advantages of our approach are that it takes the temporal nature of the data into account and its robustness with respect to noise, missing data and mislabeled samples. Moreover, mixture estimation enables exploring the presence of response sub-groups of patients on the transcriptional level. We clearly outperformed all prior approaches in terms of prediction accuracy, raising it, for the first time, above 90%. Additionally, we were able to identify potentially mislabeled samples and to sub-divide the good responders into two sub-groups that exhibited different transcriptional response programs. This is supported by recent findings on MS pathology and therefore may raise interesting clinical follow-up questions. Availability: The method is implemented in the GQL framework and is available at http://www.ghmm.org/gql. Datasets are available at http://www.cin.ufpe.br/~igcf/MSConst Contact: igcf@cin.ufpe.br Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ivan G Costa
- Center of Informatics, Federal University of Pernambuco, Recife, Brazil.
|
27
|
de Souto MCP, Costa IG, de Araujo DSA, Ludermir TB, Schliep A. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics 2008; 9:497. [PMID: 19038021 PMCID: PMC2632677 DOI: 10.1186/1471-2105-9-497] [Citation(s) in RCA: 261] [Impact Index Per Article: 16.3] [Received: 07/07/2008] [Accepted: 11/27/2008] [Indexed: 11/28/2022] Open
Abstract
Background The use of clustering methods for the discovery of cancer subtypes has drawn a great deal of attention in the scientific community. While bioinformaticians have proposed new clustering methods that take advantage of characteristics of the gene expression data, the medical community has a preference for using "classic" clustering methods. There have been no studies thus far performing a large-scale evaluation of different clustering methods in this context. Results/Conclusion We present the first large-scale analysis of seven different clustering methods and four proximity measures for the analysis of 35 cancer gene expression data sets. Our results reveal that the finite mixture of Gaussians, followed closely by k-means, exhibited the best performance in terms of recovering the true structure of the data sets. These methods also exhibited, on average, the smallest difference between the actual number of classes in the data sets and the best number of clusters as indicated by our validation criteria. Furthermore, hierarchical methods, which have been widely used by the medical community, exhibited a poorer recovery performance than that of the other methods evaluated. Moreover, as a stable basis for the assessment and comparison of different clustering methods for cancer gene expression data, this study provides a common group of data sets (benchmark data sets) to be shared among researchers and used for comparisons with new methods. The data sets analyzed in this study are available at .
Affiliation(s)
- Marcilio C P de Souto
- Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
|
28
|
Abstract
We define new measures of sequence similarity for oligonucleotide probe design. These new measures incorporate the nearest neighbor k-stem motifs in their definition, but can be efficiently computed by means of a bit-vector method. They are not as computationally costly as algorithms that predict nearest neighbor hybridization potential. Our new measures for sequence similarity correlate significantly better with nearest neighbor thermodynamic predictions than either BLAST or the standard edit or insertion-deletion defined similarities already in use in many different probe design applications.
Affiliation(s)
- Anthony J Macula
- Biomathematics Group, SUNY Geneseo, Geneseo, New York 14454, USA.
|
29
|
Abstract
The representation of a genome by oligonucleotide probes is a prerequisite for the analysis of many of its basic properties, such as transcription factor binding sites, chromosomal breakpoints, gene expression of known genes and detection of novel genes, in particular those coding for small RNAs. An ideal representation would consist of a high density set of oligonucleotides with similar melting temperatures that do not cross-hybridize with other regions of the genome and are equidistantly spaced. The implementation of such design is typically called a tiling array or genome array. We formulate the minimal cost tiling path problem for the selection of oligonucleotides from a set of candidates. Computing the selection of probes requires multi-criterion optimization, which we cast into a shortest path problem. Standard algorithms running in linear time allow us to compute globally optimal tiling paths from millions of candidate oligonucleotides on a standard desktop computer for most problem variants. The solutions to this multi-criterion optimization are spatially adaptive to the problem instance. Our formulation incorporates experimental constraints with respect to specific regions of interest and trade offs between hybridization parameters, probe quality and tiling density easily. A web application is available at http://tileomatic.org.
Affiliation(s)
- Alexander Schliep
- Department Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestrasse 69-73, 14195 Berlin, Germany
|
30
|
Abstract
Motivation: The regulation of proliferation and differentiation of embryonic and adult stem cells into mature cells is central to developmental biology. Gene expression measured in distinguishable developmental stages helps to elucidate underlying molecular processes. In previous work we showed that functional gene modules, which act distinctly in the course of development, can be represented by a mixture of trees. In general, the similarities in the gene expression programs of cell populations reflect the similarities in the differentiation path. Results: We propose a novel model for gene expression profiles and an unsupervised learning method to estimate developmental similarity and infer differentiation pathways. We assess the performance of our model on simulated data and compare it with favorable results to related methods. We also infer differentiation pathways and predict functional modules in gene expression data of lymphoid development. Conclusions: We demonstrate for the first time how, in principle, the incorporation of structural knowledge about the dependence structure helps to reveal differentiation pathways and potentially relevant functional gene modules from microarray datasets. Our method applies in any area of developmental biology where it is possible to obtain cells of distinguishable differentiation stages. Availability: The implementation of our method (GPL license), data and additional results are available at http://algorithmics.molgen.mpg.de/Supplements/InfDif/ Contact: filho@molgen.mpg.de, schliep@molgen.mpg.de Supplementary information: Supplementary data is available at Bioinformatics online.
Affiliation(s)
- Ivan G Costa
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
|
31
|
Rungsarityotin W, Krause R, Schödl A, Schliep A. Identifying protein complexes directly from high-throughput TAP data with Markov random fields. BMC Bioinformatics 2007; 8:482. [PMID: 18093306 PMCID: PMC2222659 DOI: 10.1186/1471-2105-8-482] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 05/18/2007] [Accepted: 12/19/2007] [Indexed: 11/10/2022] Open
Abstract
Background Predicting protein complexes from experimental data remains a challenge due to limited resolution and stochastic errors of high-throughput methods. Current algorithms to reconstruct the complexes typically rely on a two-step process. First, they construct an interaction graph from the data, predominantly using heuristics, and subsequently cluster its vertices to identify protein complexes. Results We propose a model-based identification of protein complexes directly from the experimental observations. Our model of protein complexes based on Markov random fields explicitly incorporates false negative and false positive errors and exhibits a high robustness to noise. A model-based quality score for the resulting clusters allows us to identify reliable predictions in the complete data set. Comparisons with prior work on reference data sets show favorable results, particularly for larger unfiltered data sets. Additional information on predictions, including the source code under the GNU Public License, can be found at http://algorithmics.molgen.mpg.de/Static/Supplements/ProteinComplexes. Conclusion We can identify complexes in the data obtained from high-throughput experiments without prior elimination of proteins or weak interactions. The few parameters of our model, which does not rely on heuristics, can be estimated using maximum likelihood without a reference data set. This is particularly important for protein complex studies in organisms that do not have an established reference frame of known protein complexes.
Affiliation(s)
- Wasinee Rungsarityotin
- Max Planck Institute for Molecular Genetics, Department of Computational Molecular Biology, Ihnestr. 73, D-14195 Berlin, Germany.
|
32
|
Abstract
BACKGROUND The regulatory processes that govern cell proliferation and differentiation are central to developmental biology. Particularly well studied in this respect is the lymphoid system due to its importance for basic biology and for clinical applications. Gene expression measured in lymphoid cells in several distinguishable developmental stages helps in the elucidation of underlying molecular processes, which change gradually over time and lock cells in either the B cell, T cell or Natural Killer cell lineages. Large-scale analysis of these gene expression trees requires computational support for tasks ranging from visualization, querying, and finding clusters of similar genes, to answering detailed questions about the functional roles of individual genes. RESULTS We present the first statistical framework designed to analyze gene expression data as it is collected in the course of lymphoid development through clusters of co-expressed genes and additional heterogeneous data. We introduce dependence trees for continuous variates, which model the inherent dependencies during the differentiation process naturally as gene expression trees. Several trees are combined in a mixture model to allow inference of potentially overlapping clusters of co-expressed genes. Additionally, we predict microRNA targets. CONCLUSION Computational results for several data sets from the lymphoid system demonstrate the relevance of our framework. We recover well-known biological facts and identify promising novel regulatory elements of genes and their functional assignments. The implementation of our method (licensed under the GPL) is available at http://algorithmics.molgen.mpg.de/Supplements/ExpLym/.
Affiliation(s)
- Ivan G Costa
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Stefan Roepcke
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Alexander Schliep
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
|
33
|
Abstract
MOTIVATION Besides their prevalent use for analyzing gene expression, microarrays are an efficient tool for biological, medical and industrial applications due to their ability to assess the presence or absence of biological agents, the targets, in a sample. Given a collection of genetic sequences of targets one faces the challenge of finding short oligonucleotides, the probes, which allow detection of targets in a sample. Each hybridization experiment determines whether the probe binds to its corresponding sequence in the target. Depending on the problem, the experiments are conducted using either unique or non-unique probes and usually assume that only one target is present in the sample. The problem at hand is to compute a design, i.e. a minimal set of probes that allows to infer the targets in the sample from the result of the hybridization experiment. If we allow to test for more than one target in the sample, the design of the probe set becomes difficult in the case of non-unique probes. RESULTS Building upon previous work on group testing for microarrays, we describe the first approach to select a minimal probe set for the case of non-unique probes in the presence of a small number of multiple targets in the sample. The approach is based on an ILP formulation and a branch-and-cut algorithm. Our preliminary implementation greatly reduces the number of probes needed while preserving the decoding capabilities. AVAILABILITY http://www.inf.fu-berlin.de/inst/ag-bio
Affiliation(s)
- Gunnar W Klau
- Institute of Computer Graphics and Algorithms, Vienna University of Technology, Vienna, Austria
|
34
|
Abstract
MOTIVATION A positional weight matrix (PWM) is a statistical representation of the binding pattern of a transcription factor estimated from known binding site sequences. Previous studies showed that for factors which bind to divergent binding sites, mixtures of multiple PWMs increase performance. However, estimating a conventional mixture distribution for each position will in many cases cause overfitting. RESULTS We propose a context-specific independence (CSI) mixture model and a learning algorithm based on a Bayesian approach. The CSI model adjusts complexity to fit the amount of variation observed on the sequence level in each position of a site. This not only yields a more parsimonious description of binding patterns, which improves parameter estimates, it also increases robustness as the model automatically adapts the number of components to fit the data. Evaluation of the CSI model on simulated data showed favorable results compared to conventional mixtures. We demonstrate its adaptive properties in a classical model selection setup. The increased parsimony of the CSI model was shown for the transcription factor Leu3 where two binding-energy subgroups were distinguished equally well as with a conventional mixture but requiring 30% fewer parameters. Analysis of the human-mouse conservation of predicted binding sites of 64 JASPAR TFs showed that CSI was as good as or better than a conventional mixture for 89% of the TFs and for 70% for a single PWM model. AVAILABILITY http://algorithmics.molgen.mpg.de/mixture.
Affiliation(s)
- Benjamin Georgi
- Max Planck Institute for Molecular Genetics, Department of Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin, Germany.
|
35
|
Abstract
MOTIVATION The reliable identification of presence or absence of biological agents ("targets"), such as viruses or bacteria, is crucial for many applications from health care to biodiversity. If genomic sequences of targets are known, hybridization reactions between oligonucleotide probes and targets performed on suitable DNA microarrays will allow to infer presence or absence from the observed pattern of hybridization. Targets, for example all known strains of HIV, are often closely related and finding unique probes becomes impossible. The use of non-unique oligonucleotides with more advanced decoding techniques from statistical group testing allows to detect known targets with great success. Of great relevance, however, is the problem of identifying the presence of previously unknown targets or of targets that evolve rapidly. RESULTS We present the first approach to decode hybridization experiments using non-unique probes when targets are related by a phylogenetic tree. Using a Bayesian framework and a Markov chain Monte Carlo approach we are able to identify over 94% of known targets and assign up to 70% of unknown targets to their correct clade in hybridization simulations on biological and simulated data. AVAILABILITY Software implementing the method described in this paper and datasets are available from http://algorithmics.molgen.mpg.de/probetrees.
Affiliation(s)
- Alexander Schliep
- Dept. Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, D-14195 Berlin, Germany.
|
36
|
Pipenbacher P, Schliep A, Schneckener S, Schönhuth A, Schomburg D, Schrader R. ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics 2005; 18 Suppl 2:S182-91. [PMID: 12386002 DOI: 10.1093/bioinformatics/18.suppl_2.s182] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.1] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The problem of finding remote homologues of a given protein sequence via alignment methods is not fully solved. In fact, the task seems to become more difficult with more data. As the size of the database increases, so does the noise level; the highest alignment scores due to random similarities increase and can be higher than the alignment score between true homologues. Comparing two sequences with an arbitrary alignment method yields a similarity value which may indicate an evolutionary relationship between them. A threshold value is usually chosen to distinguish between true homologue relationships and random similarities. To compensate for the higher probability of spurious hits in larger databases, this threshold is increased. Increasing specificity however leads to decreased sensitivity as a matter of principle. Sensitivity can be recovered by utilizing refined protocols. A number of approaches to this challenge have made use of the fact that proteins are often members of some larger protein family. This can be exploited by using position-specific substitution matrices or profiles, or by making use of transitivity of homology. Transitivity refers to the concept of concluding homology between proteins A and C based on homology between A and a third protein B and between B and C. It has been demonstrated that transitivity can lead to substantial improvement in recognition of remote homologues particularly in cases where the alignment score of A and C is below the noise level. A natural limit to the use of transitivity is imposed by domains. Domains, compact independent sub-units of proteins, are often shared between otherwise distinct proteins, and can cause substantial problems by incorrectly linking otherwise unrelated proteins. RESULTS We extend a graph-based clustering algorithm which uses an asymmetric distance measure, scaling similarity values based on the length of the protein sequences compared. 
Additionally, the significance of alignment scores is taken into account and used for a filtering step in the algorithm. Post-processing, to merge further clusters based on profile HMMs is proposed. SCOP sequences and their super-family level classification are used as a test set for a clustering computed with our method for the joint data set containing both SCOP and SWISS-PROT. Note, the joint data set includes all multi-domain proteins, which contain the SCOP domains that are a potential source of incorrect links. Our method compares at high specificities very favorably with PSI-Blast, which is probably the most widely-used tool for finding remote homologues. We demonstrate that using transitivity with as many as twelve intermediate sequences is crucial to achieving this level of performance. Moreover, from analysis of false positives we conclude that our method seems to correctly bound the degree of transitivity used. This analysis also yields explicit guidance in choosing parameters. The heuristics of the asymmetric distance measure used neither solve the multi-domain problem from a theoretical point of view, nor do they avoid all types of problems we have observed in real data. Nevertheless, they do provide a substantial improvement over existing approaches. AVAILABILITY The complete software source is freely available to all users under the GNU General Public License (GPL) from http://www.bioinformatik.uni-koeln.de/~proclust/download/
|
37
|
Abstract
Measuring gene expression over time can provide important insights into basic cellular processes. Identifying groups of genes with similar expression time-courses is a crucial first step in the analysis. As biologically relevant groups frequently overlap, due to genes having several distinct roles in those cellular processes, this is a difficult problem for classical clustering methods. We use a mixture model to circumvent this principal problem, with hidden Markov models (HMMs) as effective and flexible components. We show that the ensuing estimation problem can be addressed with additional labeled data (partially supervised learning of mixtures) through a modification of the Expectation-Maximization (EM) algorithm. Good starting points for the mixture estimation are obtained through a modification to Bayesian model merging, which allows us to learn a collection of initial HMMs. We infer groups from mixtures with a simple information-theoretic decoding heuristic, which quantifies the level of ambiguity in group assignment. The effectiveness is shown with high-quality annotation data. As the HMMs we propose capture asynchronous behavior by design, the groups we find are also asynchronous. Synchronous subgroups are obtained from a novel algorithm based on Viterbi paths. We show the suitability of our HMM mixture approach on biological and simulated data and through the favorable comparison with previous approaches. Software implementing the method is freely available under the GPL from http://ghmm.org/gql.
|
38
|
Abstract
UNLABELLED The Graphical Query Language (GQL) is a set of tools for the analysis of gene expression time-courses. They allow a user to pre-process the data, to query it for interesting patterns, to perform model-based clustering or mixture estimation, to include subsequent refinements of clusters and, finally, to use other biological resources to evaluate the results. Analyses are carried out in a graphical and interactive environment, allowing expert intervention in all stages of the data analysis. AVAILABILITY The GQL package is freely available under the GNU general public license (GPL) at http://www.ghmm.org/gql
Affiliation(s)
- Ivan G Costa
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany
|
39
|
Abstract
MOTIVATION Cellular processes cause changes over time. Observing and measuring those changes over time allows insights into the how and why of regulation. The experimental platform for doing the appropriate large-scale experiments to obtain time-courses of expression levels is provided by microarray technology. However, the proper way of analyzing the resulting time course data is still very much an issue under investigation. The inherent time dependencies in the data suggest that clustering techniques which reflect those dependencies yield improved performance. RESULTS We propose to use Hidden Markov Models (HMMs) to account for the horizontal dependencies along the time axis in time course data and to cope with the prevalent errors and missing values. The HMMs are used within a model-based clustering framework. We are given a number of clusters, each represented by one Hidden Markov Model from a finite collection encompassing typical qualitative behavior. Then, our method finds in an iterative procedure cluster models and an assignment of data points to these models that maximizes the joint likelihood of clustering and models. Partially supervised learning--adding groups of labeled data to the initial collection of clusters--is supported. A graphical user interface allows querying an expression profile dataset for time course similar to a prototype graphically defined as a sequence of levels and durations. We also propose a heuristic approach to automate determination of the number of clusters. We evaluate the method on published yeast cell cycle and fibroblasts serum response datasets, and compare them, with favorable results, to the autoregressive curves method.
Affiliation(s)
- Alexander Schliep
- Max Planck Institute for Molecular Genetics, Department of Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin, Germany.
|
40
|
Abstract
MOTIVATION Genetic regulation of cellular processes is frequently investigated using large-scale gene expression experiments to observe changes in expression over time. This temporal data poses a challenge to classical distance-based clustering methods due to its horizontal dependencies along the time-axis. We propose to use hidden Markov models (HMMs) to explicitly model these time-dependencies. The HMMs are used in a mixture approach that we show to be superior over clustering. Furthermore, mixtures are a more realistic model of the biological reality, as an unambiguous partitioning of genes into clusters of unique functional assignment is impossible. Use of the mixture increases robustness with respect to noise and allows an inference of groups at varying level of assignment ambiguity. A simple approach, partially supervised learning, allows to benefit from prior biological knowledge during the training. Our method allows simultaneous analysis of cyclic and non-cyclic genes and copes well with noise and missing values. RESULTS We demonstrate biological relevance by detection of phase-specific groupings in HeLa time-course data. A benchmark using simulated data, derived using assumptions independent of those in our method, shows very favorable results compared to the baseline supplied by k-means and two prior approaches implementing model-based clustering. The results stress the benefits of incorporating prior knowledge, whenever available. AVAILABILITY A software package implementing our method is freely available under the GNU general public license (GPL) at http://ghmm.org/gql
Affiliation(s)
- Alexander Schliep
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
|
41
|
Schliep A, Torney DC, Rahmann S. Group testing with DNA chips: generating designs and decoding experiments. Proc IEEE Comput Soc Bioinform Conf 2003; 2:84-91. [PMID: 16452782] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/06/2023]
Abstract
DNA microarrays are a valuable tool for massively parallel DNA-DNA hybridization experiments. Currently, most applications rely on the existence of sequence-specific oligonucleotide probes. In large families of closely related target sequences, such as different virus subtypes, the high degree of similarity often makes it impossible to find a unique probe for every target. Fortunately, this is unnecessary. We propose a microarray design methodology based on a group testing approach. While probes might bind to multiple targets simultaneously, a properly chosen probe set can still unambiguously distinguish the presence of one target set from the presence of a different target set. Our method is the first one that explicitly takes cross-hybridization and experimental errors into account while accommodating several targets. The approach consists of three steps: (1) Pre-selection of probe candidates, (2) Generation of a suitable group testing design, and (3) Decoding of hybridization results to infer presence or absence of individual targets. Our results show that this approach is very promising, even for challenging data sets and experimental error rates of up to 5%. On a data set of 28S rDNA sequences we were able to identify 660 sequences, a substantial improvement over a prior approach using unique probes which only identified 408 sequences.
Affiliation(s)
- Alexander Schliep
- Department of Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Ihnestrasse 63-73, D-14195 Berlin, Germany.
|
42
|
Knab B, Schliep A, Steckemetz B, Wichern B. Model-Based Clustering With Hidden Markov Models and its Application to Financial Time-Series Data. Between Data Science and Applied Data Analysis 2003. [DOI: 10.1007/978-3-642-18991-3_64] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 12/23/2022]
|
43
|
Abstract
MOTIVATION: DNA arrays are a very useful tool to quickly identify biological agents present in a given sample, e.g. to identify viruses causing disease, for quality control in the food industry, or to determine bacteria contaminating drinking water. The selection of specific oligos to attach to the array surface is a relevant problem in the experiment design process. Given a set S of genomic sequences (the target sequences), the task is to find at least one oligonucleotide, called a probe, for each sequence in S. This probe will be attached to the array surface, and must be chosen such that it does not hybridize to any sequence other than the intended target. Furthermore, all probes on the array must hybridize to their intended targets under the same reaction conditions, most importantly at the temperature T at which the experiment is conducted. RESULTS: We present an efficient algorithm for the probe design problem. Melting temperatures are calculated for all possible probe-target interactions using an extended nearest-neighbor model, allowing for both non-Watson-Crick base-pairing and unpaired bases within a duplex. To compute temperatures efficiently, a combination of suffix trees and dynamic-programming-based alignment algorithms is introduced. Additional filtering steps during preprocessing increase the speed of the computation. The practicability of the algorithms is demonstrated by two case studies: the identification of HIV-1 subtypes, and of 28S rDNA sequences from ≥400 organisms.
Affiliation(s)
- Lars Kaderali
- Center for Applied Computer Sciences Cologne (ZAIK), University of Cologne, Weyertal 80, 50931 Köln, Germany.
|
44
|
Abstract
MOTIVATION: It is widely believed that for two proteins A and B a sequence identity above some threshold implies structural similarity due to a common evolutionary ancestor. Since this is only a sufficient, but not a necessary condition for structural similarity, the question remains what other criteria can be used to identify remote homologues. Transitivity refers to the concept of deducing a structural similarity between proteins A and C from the existence of a third protein B, such that A and B as well as B and C are homologues, as ascertained if the sequence identity between A and B as well as that between B and C is above the aforementioned threshold. It is not fully understood if transitivity always holds and whether transitivity can be extended ad infinitum. RESULTS: We developed a graph-based clustering approach, where transitivity plays a crucial role. We determined all pair-wise similarities for the sequences in the SwissProt database using the Smith-Waterman local alignment algorithm. This data was transformed into a directed graph, where protein sequences constitute vertices. A directed edge was drawn from vertex A to vertex B if the sequences A and B showed similarity, scaled with respect to the self-similarity of A, above a fixed threshold. Transitivity was important in the clustering process, as intermediate sequences were used, limited, though, by the requirement of having directed paths in both directions between proteins linked over such sequences. The length dependency (implied by the self-similarity) of the scaling of the alignment scores appears to be an effective criterion to avoid clustering errors due to multi-domain proteins. To deal with the resulting large graphs we have developed an efficient library. Methods include the novel graph-based clustering algorithm capable of handling multi-domain proteins and cluster comparison algorithms. Structural Classification of Proteins (SCOP) was used as an evaluation data set for our method, yielding a 24% improvement over pair-wise comparisons in terms of detecting remote homologues. AVAILABILITY: The software is available to academic users on request from the authors. CONTACT: e.bolten@science-factory.com; schliep@zpr.uni-koeln.de; s.schneckener@science-factory.com; d.schomburg@uni-koeln.de; schrader@zpr.uni-koeln.de. SUPPLEMENTARY INFORMATION: http://www.zaik.uni-koeln.de/~schliep/ProtClust.html.
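The edge rule and the two-way path requirement described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' library: it reads "directed paths in both directions" as mutual reachability (strongly connected components), and the scores, the threshold of 0.5, and the function names are hypothetical.

```python
def build_graph(scores, threshold):
    """Draw an edge a -> b if score(a, b) / score(a, a) >= threshold.

    `scores` maps (a, b) pairs to alignment scores and must contain
    the self-score (a, a) for every sequence.
    """
    nodes = {x for pair in scores for x in pair}
    graph = {n: set() for n in nodes}
    for (a, b), s in scores.items():
        if a != b and s / scores[(a, a)] >= threshold:
            graph[a].add(b)
    return graph


def reachable(graph, start):
    """All vertices reachable from `start` along directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def clusters(graph):
    """Group vertices that can reach each other in both directions."""
    reach = {n: reachable(graph, n) for n in graph}
    result, assigned = [], set()
    for n in sorted(graph):
        if n in assigned:
            continue
        comp = {m for m in reach[n] if n in reach[m]}
        result.append(sorted(comp))
        assigned |= comp
    return result


# Hypothetical scores: "m" is a long multi-domain protein sharing one
# domain with the short protein "a"; "a" and "b" are true homologues.
scores = {
    ("a", "a"): 100, ("b", "b"): 110, ("m", "m"): 300,
    ("a", "b"): 80, ("b", "a"): 80,
    ("a", "m"): 90, ("m", "a"): 90,
}
print(clusters(build_graph(scores, threshold=0.5)))
```

Because the score is scaled by the self-similarity of the source vertex, the long protein "m" gets no edge back to "a" and therefore stays in its own cluster, which is the effect the abstract attributes to the length-dependent scaling.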
Affiliation(s)
- E Bolten
- Institut für Biochemie, Universität zu Köln, Weyertal 80, D-50937 Köln, Germany.
|
45
|
Abstract
This paper describes an effective method for extracting as much information as possible from pooling experiments for library screening. Pools are collections of clones, and screening a pool with a probe determines whether any of these clones are positive for the probe. The results of the pool screenings are interpreted, or decoded, to infer which clones are candidates to be positive. These candidate positives are subjected to confirmatory testing. Decoding the pool screening results is complicated by the presence of errors, which typically lead to ambiguities in the inference of positive clones. However, in many applications there are reasonable models for the prior distributions for positives and for errors, and Bayes inference is the preferred method for ranking candidate positives. Because of the combinatoric complexity of the Bayes formulation, we implemented a decoding algorithm using a Markov chain Monte Carlo method. The algorithm was used in screening a library with 1298 clones using 47 pools. We corroborated the posterior probabilities for positives with results from confirmatory screening. We also simulated the screening of a 10-fold coverage library of 33,000 clones using 253 pools. The use of our algorithm, effective under conditions where combinatorial decoding techniques are imprudent, allows the use of fewer pools and also introduces needed robustness.
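To make the Bayes decoding idea concrete, here is a minimal sketch that ranks clones by posterior probability of being positive. It deliberately swaps the Markov chain Monte Carlo method of the paper for brute-force enumeration, which is only feasible for a handful of clones. The error model (independent prior per clone, fixed false-positive and false-negative rates per pool), all numbers, and the function names are assumptions made for illustration.

```python
from itertools import product


def pool_likelihood(state, pools, observed, fp, fn):
    """P(observed pool results | clone states) under independent pool errors."""
    like = 1.0
    for pool, obs in zip(pools, observed):
        truly_positive = any(state[c] for c in pool)
        if truly_positive:
            like *= (1 - fn) if obs else fn
        else:
            like *= fp if obs else (1 - fp)
    return like


def posterior_marginals(n_clones, pools, observed, prior, fp, fn):
    """Exact posterior P(clone i positive | data) by enumerating all states."""
    weights = [0.0] * n_clones
    total = 0.0
    for state in product((0, 1), repeat=n_clones):
        k = sum(state)
        w = prior**k * (1 - prior) ** (n_clones - k)
        w *= pool_likelihood(state, pools, observed, fp, fn)
        total += w
        for i, s in enumerate(state):
            if s:
                weights[i] += w
    return [w / total for w in weights]


# Hypothetical toy screen: 4 clones, 3 overlapping pools.
pools = [(0, 1), (1, 2), (2, 3)]
observed = [1, 1, 0]  # pools 0 and 1 positive, pool 2 negative
marginals = posterior_marginals(4, pools, observed, prior=0.1, fp=0.05, fn=0.1)

# Rank candidate positives for confirmatory testing.
ranking = sorted(range(4), key=lambda i: marginals[i], reverse=True)
print(ranking, [round(m, 3) for m in marginals])
```

Enumeration costs 2^n evaluations for n clones, which is why a sampling method such as MCMC is needed for libraries of realistic size.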
Affiliation(s)
- E Knill
- Los Alamos National Laboratory, New Mexico 87545, USA
|
46
|
Schlue WR, Schliep A, Walz W. Fluorescence marking of neuropile glial cells in the central nervous system of the leech Hirudo medicinalis. Cell Tissue Res 1980; 209:257-69. [PMID: 7397768 DOI: 10.1007/bf00237630] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.4] [Indexed: 01/25/2023]
Abstract
Neuropile glial (NG) cells in the central nervous system of the medicinal leech, Hirudo medicinalis L., were studied by histological and intracellular electrophysiological methods. Potential profiles of single leech ganglia were mapped by advancing an electrolyte-filled microelectrode into the ganglion as far as the NG cell. A small negative potential usually appeared during or immediately after penetration of the ganglion sheath. Most of the ganglia in the chain (ganglia 1-4 and 7-21) have Retzius cell bodies of normal size; in these, the potential associated with the ganglion sheath was followed by a jump to a more negative potential. Superimposed action potentials were associated with entry of the electrode into a Retzius cell. When the electrode tip passed out of the cell into the center of the ganglion, another potential change was observed, namely that to the membrane potential of the anterior NG cell. This membrane potential averaged -60.2 mV and ranged from -50 to -73 mV. In ganglia 5 and 6 the Retzius cell bodies are particularly small, and no changes of potential associated with these cells were observed; the first potential to appear after the electrode passed through the sheath of the ganglion was the membrane potential of the NG cell. Potential profiles like those of ganglia 5 and 6 are recorded in the posterior parts of all ganglia. Potential profiles of single leech ganglia were also recorded with microelectrodes filled with the fluorescent dye Procion Yellow M4-RAN. When the presumed membrane potential of an NG cell appeared, the dye was injected into the ganglion. Subsequent histological examination with the fluorescence microscope revealed that all of the dye was contained in NG cells.
|