1
Spealman P, Naik AW, May GE, Kuersten S, Freeberg L, Murphy RF, McManus J. Conserved non-AUG uORFs revealed by a novel regression analysis of ribosome profiling data. Genome Res 2017; 28:214-222. [PMID: 29254944 PMCID: PMC5793785 DOI: 10.1101/gr.221507.117] [Received: 02/07/2017] [Accepted: 12/11/2017]
Abstract
Upstream open reading frames (uORFs), located in transcript leaders (5' UTRs), are potent cis-acting regulators of translation and mRNA turnover. Recent genome-wide ribosome profiling studies suggest that thousands of uORFs initiate with non-AUG start codons. Although intriguing, these non-AUG uORF predictions have been made without statistical control or validation; thus, the importance of these elements remains to be demonstrated. To address this, we took a comparative genomics approach to study AUG and non-AUG uORFs. We mapped transcription leaders in multiple Saccharomyces yeast species and applied a novel machine learning algorithm (uORF-seqr) to ribosome profiling data to identify statistically significant uORFs. We found that AUG and non-AUG uORFs are both frequently found in Saccharomyces yeasts. Although most non-AUG uORFs are found in only one species, hundreds have either conserved sequence or position within Saccharomyces. uORFs initiating with UUG are particularly common and are shared between species at rates similar to that of AUG uORFs. However, non-AUG uORFs are translated less efficiently than AUG uORFs and are less subject to removal via alternative transcription initiation under normal growth conditions. These results suggest that a subset of non-AUG uORFs may play important roles in regulating gene expression.
Affiliation(s)
- Pieter Spealman
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
- Armaghan W Naik
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
- Gemma E May
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
- Robert F Murphy
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA; Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
- Joel McManus
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
2
Li Y, Majarian TD, Naik AW, Johnson GR, Murphy RF. Point process models for localization and interdependence of punctate cellular structures. Cytometry A 2016; 89:633-43. [PMID: 27327612 DOI: 10.1002/cyto.a.22873] [Received: 01/02/2016] [Revised: 03/09/2016] [Accepted: 04/29/2016]
Abstract
Accurate representations of cellular organization for multiple eukaryotic cell types are required for creating predictive models of dynamic cellular function. To this end, we have previously developed the CellOrganizer platform, an open source system for generative modeling of cellular components from microscopy images. CellOrganizer models capture the inherent heterogeneity in the spatial distribution, size, and quantity of different components among a cell population. Furthermore, CellOrganizer can generate quantitatively realistic synthetic images that reflect the underlying cell population. A current focus of the project is to model the complex, interdependent nature of organelle localization. We built upon previous work on developing multiple non-parametric models of organelles or structures that show punctate patterns. The previous models described the relationships between the subcellular localization of puncta and the positions of cell and nuclear membranes and microtubules. We extend these models to consider the relationship to the endoplasmic reticulum (ER), and to consider the relationship between the positions of different puncta of the same type. Our results do not suggest that the punctate patterns we examined are dependent on ER position or inter- and intra-class proximity. With these results, we built classifiers to update previous assignments of proteins to one of 11 patterns in three distinct cell lines. Our generative models demonstrate the ability to construct statistically accurate representations of puncta localization from simple cellular markers in distinct cell types, capturing the complex phenomena of cellular structure interaction with little human input. This protocol represents a novel approach to vesicular protein annotation, a field that is often neglected in high-throughput microscopy. These results suggest that spatial point process models provide useful insight with respect to the spatial dependence between cellular structures. 
© 2016 International Society for Advancement of Cytometry.
Affiliation(s)
- Ying Li
- State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, 430079, China; Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15213
- Timothy D Majarian
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15213; Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15213
- Armaghan W Naik
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15213
- Gregory R Johnson
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15213
- Robert F Murphy
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15213; Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15213; Departments of Biomedical Engineering and Machine Learning, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15213; Freiburg Institute for Advanced Studies and Faculty of Biology, Albert Ludwig University of Freiburg, Albertstrasse 19, 79104 Freiburg Im Breisgau, Germany
3
Naik AW, Kangas JD, Sullivan DP, Murphy RF. Active machine learning-driven experimentation to determine compound effects on protein patterns. eLife 2016; 5:e10047. [PMID: 26840049 PMCID: PMC4798950 DOI: 10.7554/elife.10047] [Received: 07/13/2015] [Accepted: 01/28/2016]
Abstract
High throughput screening determines the effects of many conditions on a given biological target. Currently, estimating the effects of those conditions on other targets requires either strong modeling assumptions (e.g. similarities among targets) or separate screens. Ideally, data-driven experimentation could be used to learn accurate models for many conditions and targets without doing all possible experiments. We have previously described an active machine learning algorithm that can iteratively choose small sets of experiments to learn models of multiple effects. We now show that, with no prior knowledge and with liquid handling robotics and automated microscopy under its control, this learner accurately learned the effects of 48 chemical compounds on the subcellular localization of 48 proteins while performing only 29% of all possible experiments. The results represent the first practical demonstration of the utility of active learning-driven biological experimentation in which the set of possible phenotypes is unknown in advance. DOI: http://dx.doi.org/10.7554/eLife.10047.001

Biomedical scientists have invested significant effort into making it easy to perform lots of experiments quickly and cheaply. These “high throughput” methods are the workhorses of modern “systems biology” efforts. However, we simply cannot perform an experiment for every possible combination of different cell type, genetic mutation and other conditions. In practice this has led researchers to either exhaustively test a few conditions or targets, or to try to pick the experiments that best allow a particular problem to be explored. But which experiments should we pick? The ones we think we can predict the outcome of accurately, the ones for which we are uncertain what the results will be, or a combination of the two? Humans are not particularly well suited for this task because it requires reasoning about many possible outcomes at the same time.
However, computers are much better at handling statistics for many experiments, and machine learning algorithms allow computers to “learn” how to make predictions and decisions based on the data they’ve previously processed. Previous computer simulations showed that a machine learning approach termed “active learning” could do a good job of picking a series of experiments to perform in order to efficiently learn a model that predicts the results of experiments that were not done. Now, Naik et al. have performed cell biology experiments in which experiments were chosen by an active learning algorithm and then performed using liquid handling robots and an automated microscope. The key idea behind the approach is that you learn more from an experiment you can’t predict (or that you predicted incorrectly) than from just confirming your confident predictions. The results of the robot-driven experiments showed that the active learning approach outperforms strategies a human might use, even when the potential outcomes of individual experiments are not known beforehand. The next challenge is to apply these methods to reduce the cost of achieving the goals of large projects, such as The Cancer Genome Atlas. DOI:http://dx.doi.org/10.7554/eLife.10047.002
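The key idea described above, querying the experiment whose outcome is hardest to predict, can be illustrated with a minimal sketch. This is not the learner used by Naik et al.; the neighbour-averaging predictor, the function names (`predict`, `choose_next`, `run_active_learning`) and the toy compound-by-protein matrix are all assumptions made for illustration only.

```python
from itertools import product


def predict(observed, cell):
    """Estimate P(effect) for a cell from measured cells sharing its row or column."""
    i, j = cell
    neighbours = [v for (r, c), v in observed.items() if r == i or c == j]
    # Laplace-style smoothing: with no evidence the estimate is 0.5 (maximally uncertain).
    return (sum(neighbours) + 1) / (len(neighbours) + 2)


def choose_next(observed, n_rows, n_cols):
    """Pick the unmeasured cell whose predicted probability is closest to 0.5."""
    candidates = [c for c in product(range(n_rows), range(n_cols)) if c not in observed]
    return min(candidates, key=lambda c: abs(predict(observed, c) - 0.5))


def run_active_learning(truth, budget):
    """Measure `budget` cells of `truth`, always querying the most uncertain one."""
    n_rows, n_cols = len(truth), len(truth[0])
    observed = {}
    for _ in range(budget):
        cell = choose_next(observed, n_rows, n_cols)
        # Looking up the toy matrix stands in for running the real experiment.
        observed[cell] = truth[cell[0]][cell[1]]
    return observed


# Hypothetical ground truth: rows are compounds, columns are proteins,
# 1 means the compound changes the protein's localization.
truth = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
observed = run_active_learning(truth, budget=6)
print(f"measured {len(observed)} of {len(truth) * len(truth[0])} cells")
```

Uncertainty sampling is only one of several possible selection rules; it is used here because it matches the intuition in the digest that an unpredictable experiment is more informative than a confirmation.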
Affiliation(s)
- Armaghan W Naik
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, United States; Center for Bioimage Informatics, Carnegie Mellon University, Pittsburgh, United States
- Joshua D Kangas
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, United States; Center for Bioimage Informatics, Carnegie Mellon University, Pittsburgh, United States
- Devin P Sullivan
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, United States; Center for Bioimage Informatics, Carnegie Mellon University, Pittsburgh, United States
- Robert F Murphy
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, United States; Center for Bioimage Informatics, Carnegie Mellon University, Pittsburgh, United States; Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, United States; Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, United States; Machine Learning Department, Carnegie Mellon University, Pittsburgh, United States; Freiburg Institute for Advanced Studies, Albert Ludwig University of Freiburg, Freiburg, Germany; Faculty of Biology, Albert Ludwig University of Freiburg, Freiburg, Germany
4
Temerinac-Ott M, Naik AW, Murphy RF. Deciding when to stop: efficient experimentation to learn to predict drug-target interactions. BMC Bioinformatics 2015; 16:213. [PMID: 26153434 PMCID: PMC4495685 DOI: 10.1186/s12859-015-0650-9] [Received: 11/03/2014] [Accepted: 06/26/2015]
Abstract
BACKGROUND Active learning is a powerful tool for guiding an experimentation process. Instead of doing all possible experiments in a given domain, active learning can be used to pick the experiments that will add the most knowledge to the current model. Especially for drug discovery and development, active learning has been shown to reduce the number of experiments needed to obtain high-confidence predictions. However, in practice, it is crucial to have a method to evaluate the quality of the current predictions and decide when to stop the experimentation process. Only by applying reliable stopping criteria to active learning can time and costs in the experimental process actually be saved. RESULTS We compute active learning traces on simulated drug-target matrices in order to determine a regression model for the accuracy of the active learner. By analyzing the performance of the regression model on simulated data, we design stopping criteria for previously unseen experimental matrices. We demonstrate on four previously characterized drug effect data sets that applying the stopping criteria can result in up to 40% savings of the total experiments for highly accurate predictions. CONCLUSIONS We show that active learning accuracy can be predicted using simulated data and that this yields substantial savings in the number of experiments required to make accurate drug-target predictions.
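A stopping rule of the kind summarized above can be sketched as follows: fit a regression from a quantity observable during learning to the true accuracy on simulated data, then stop once the predicted accuracy reaches a target. The single trace feature, the linear model, the numbers and the names `fit_line` and `should_stop` are all assumptions for illustration; the abstract does not specify the features or the regression form used in the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def should_stop(feature, model, target_accuracy):
    """Stop experimenting once the predicted accuracy reaches the target."""
    slope, intercept = model
    return slope * feature + intercept >= target_accuracy


# Hypothetical simulated traces: the feature is assumed to be the fraction of
# the last batch of predictions that the new measurements confirmed, paired
# with the learner's true accuracy (known only because the data are simulated).
sim_feature = [0.2, 0.4, 0.6, 0.8]
sim_accuracy = [0.3, 0.5, 0.7, 0.9]
model = fit_line(sim_feature, sim_accuracy)

# On a new, unseen matrix only the feature is observable.
print(should_stop(0.5, model, target_accuracy=0.95))
print(should_stop(0.9, model, target_accuracy=0.95))
```

The point of the design is that true accuracy cannot be measured on a real, partially explored matrix, so a model calibrated on simulations has to act as its proxy.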
Affiliation(s)
- Maja Temerinac-Ott
- Freiburg Institute for Advanced Studies, University of Freiburg, Freiburg, Germany.
- Armaghan W Naik
- Computational Biology Department, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, 15213, PA, USA.
- Robert F Murphy
- Freiburg Institute for Advanced Studies, University of Freiburg, Freiburg, Germany.
- Computational Biology Department, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, 15213, PA, USA.
- Departments of Biological Sciences, Biomedical Engineering and Machine Learning, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, 15213, PA, USA.
5
Kangas JD, Naik AW, Murphy RF. Efficient discovery of responses of proteins to compounds using active learning. BMC Bioinformatics 2014; 15:143. [PMID: 24884564 PMCID: PMC4030446 DOI: 10.1186/1471-2105-15-143] [Received: 11/26/2013] [Accepted: 05/07/2014]
Abstract
Background Drug discovery and development has been aided by high throughput screening methods that detect compound effects on a single target. However, when using focused initial screening, undesirable secondary effects are often detected late in the development process after significant investment has been made. An alternative approach would be to screen against undesired effects early in the process, but the number of possible secondary targets makes this prohibitively expensive. Results This paper describes methods for making this global approach practical by constructing predictive models for many target responses to many compounds and using them to guide experimentation. We demonstrate for the first time that by jointly modeling targets and compounds using descriptive features and using active machine learning methods, accurate models can be built by doing only a small fraction of possible experiments. The methods were evaluated by computational experiments using a dataset of 177 assays and 20,000 compounds constructed from the PubChem database. Conclusions An average of nearly 60% of all hits in the dataset were found after exploring only 3% of the experimental space which suggests that active learning can be used to enable more complete characterization of compound effects than otherwise affordable. The methods described are also likely to find widespread application outside drug discovery, such as for characterizing the effects of a large number of compounds or inhibitory RNAs on a large number of cell or tissue phenotypes.
Affiliation(s)
- Robert F Murphy
- Lane Center for Computational Biology, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA.
6
Abstract
High throughput and high content screening involve determination of the effect of many compounds on a given target. As currently practiced, screening for each new target typically makes little use of information from screens of prior targets. Further, choices of compounds to advance to drug development are made without significant screening against off-target effects. The overall drug development process could be made more effective, as well as less expensive and time consuming, if potential effects of all compounds on all possible targets could be considered, yet the cost of such full experimentation would be prohibitive. In this paper, we describe a potential solution: probabilistic models that can be used to predict results for unmeasured combinations, and active learning algorithms for efficiently selecting which experiments to perform in order to build those models and determining when to stop. Using simulated and experimental data, we show that our approaches can produce powerful predictive models without exhaustive experimentation and can learn them much faster than by selecting experiments at random.
Affiliation(s)
- Armaghan W. Naik
- Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Joshua D. Kangas
- Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Christopher J. Langmead
- Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Robert F. Murphy
- Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Departments of Biological Sciences, Biomedical Engineering and Machine Learning, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Freiburg Institute for Advanced Studies and Faculty of Biology, Albert Ludwig University of Freiburg, Freiburg, Germany
7
Coelho LP, Kangas JD, Naik AW, Osuna-Highley E, Glory-Afshar E, Fuhrman M, Simha R, Berget PB, Jarvik JW, Murphy RF. Determining the subcellular location of new proteins from microscope images using local features. Bioinformatics 2013; 29:2343-9. [PMID: 23836142 DOI: 10.1093/bioinformatics/btt392]
Abstract
MOTIVATION Evaluation of previous systems for automated determination of subcellular location from microscope images has been done using datasets in which each location class consisted of multiple images of the same representative protein. Here, we frame a more challenging and useful problem where previously unseen proteins are to be classified. RESULTS Using CD-tagging, we generated two new image datasets for evaluation of this problem, which contain several different proteins for each location class. Evaluation of previous methods on these new datasets showed that it is much harder to train a classifier that generalizes across different proteins than one that simply recognizes a protein it was trained on. We therefore developed and evaluated additional approaches, incorporating novel modifications of local features techniques. These extended the notion of local features to exploit both the protein image and any reference markers that were imaged in parallel. With these, we obtained a large accuracy improvement in our new datasets over existing methods. Additionally, these features help achieve classification improvements for other previously studied datasets. AVAILABILITY The datasets are available for download at http://murphylab.web.cmu.edu/data/. The software was written in Python and C++ and is available under an open-source license at http://murphylab.web.cmu.edu/software/. The code is split into a library, which can be easily reused for other data and a small driver script for reproducing all results presented here. A step-by-step tutorial on applying the methods to new datasets is also available at that address. CONTACT murphy@cmu.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Luis Pedro Coelho
- Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA