1
Abstract
Proteomics is a data-rich science with complex experimental designs and an intricate measurement process. To obtain insights from the large data sets produced, statistical methods, including machine learning, are routinely applied. For a quantity of interest, many of these approaches only produce a point estimate, such as a mean, leaving little room for more nuanced interpretations. By contrast, Bayesian statistics allows quantification of uncertainty through the use of probability distributions. These probability distributions enable scientists to ask complex questions of their proteomics data. Bayesian statistics also offers a modular framework for data analysis by making dependencies between data and parameters explicit. Hence, specifying complex hierarchies of parameter dependencies is straightforward in the Bayesian framework. This allows us to use a statistical methodology which equals, rather than neglects, the sophistication of experimental design and instrumentation present in proteomics. Here, we review Bayesian methods applied to proteomics, demonstrating their potential power, alongside the challenges posed by adopting this new statistical framework. To illustrate our review, we give a walk-through of the development of a Bayesian model for dynamic organic orthogonal phase-separation (OOPS) data.
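As a minimal illustration of the contrast the abstract draws between a point estimate and a full posterior distribution, the following sketch uses a conjugate normal model with known noise. It is not taken from the review: the function name `normal_posterior`, the prior settings, the noise level, and the example log-ratios are all hypothetical values chosen only for illustration.

```python
from statistics import NormalDist, mean


def normal_posterior(observations, prior_mean, prior_sd, noise_sd):
    """Posterior over an unknown mean, assuming a normal prior and
    normal measurement noise with a known standard deviation."""
    n = len(observations)
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / noise_sd ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (
        prior_precision * prior_mean + data_precision * mean(observations)
    )
    return NormalDist(post_mean, post_var ** 0.5)


# Hypothetical log2 abundance ratios for one protein (illustrative only).
log_ratios = [0.8, 1.1, 0.9, 1.4, 1.0]

# A point estimate summarizes the data with a single number.
point_estimate = mean(log_ratios)

# The posterior is a whole distribution, so it can answer further questions.
posterior = normal_posterior(log_ratios, prior_mean=0.0, prior_sd=1.0, noise_sd=0.5)
lower, upper = posterior.inv_cdf(0.025), posterior.inv_cdf(0.975)  # 95% credible interval
prob_positive = 1.0 - posterior.cdf(0.0)  # probability that the true ratio exceeds 0

print(f"point estimate: {point_estimate:.3f}")
print(f"posterior mean: {posterior.mean:.3f}, 95% interval: ({lower:.3f}, {upper:.3f})")
print(f"P(ratio > 0): {prob_positive:.4f}")
```

The posterior mean is pulled slightly toward the prior mean, and the interval and tail probability are the kind of uncertainty statements that a single point estimate cannot provide.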
Affiliation(s)
- Oliver M. Crook
  - Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
- Chun-wa Chung
  - Structural and Biophysical Sciences, GlaxoSmithKline R&D, Stevenage SG1 2NY, United Kingdom
- Charlotte M. Deane
  - Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
2
Lugo-Martinez J, Ruiz-Perez D, Narasimhan G, Bar-Joseph Z. Dynamic interaction network inference from longitudinal microbiome data. Microbiome 2019; 7:54. [PMID: 30940197 PMCID: PMC6446388 DOI: 10.1186/s40168-019-0660-3] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 09/15/2018] [Accepted: 03/11/2019] [Indexed: 05/21/2023]
Abstract
BACKGROUND: Several studies have focused on the microbiota living in environmental niches including human body sites. In many of these studies, researchers collect longitudinal data with the goal of understanding not only the composition of the microbiome but also the interactions between the different taxa. However, analysis of such data is challenging and very few methods have been developed to reconstruct dynamic models from time series microbiome data.
RESULTS: Here, we present a computational pipeline that enables the integration of data across individuals for the reconstruction of such models. Our pipeline starts by aligning the data collected for all individuals. The aligned profiles are then used to learn a dynamic Bayesian network which represents causal relationships between taxa and clinical variables. Testing our methods on three longitudinal microbiome data sets, we show that our pipeline improves upon prior methods developed for this task. We also discuss the biological insights provided by the models, which include several known and novel interactions. The extended CGBayesNets package is freely available under the MIT Open Source license agreement. The source code and documentation can be downloaded from https://github.com/jlugomar/longitudinal_microbiome_analysis_public.
CONCLUSIONS: We propose a computational pipeline for analyzing longitudinal microbiome data. Our results provide evidence that microbiome alignments coupled with dynamic Bayesian networks improve predictive performance over previous methods and enhance our ability to infer biological relationships within the microbiome and between taxa and clinical factors.
Affiliation(s)
- Jose Lugo-Martinez
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, 15213 Pennsylvania USA
- Daniel Ruiz-Perez
  - Bioinformatics Research Group (BioRG), Florida International University, 11200 SW 8th Street, Miami, 33199 Florida USA
- Giri Narasimhan
  - Bioinformatics Research Group (BioRG), Florida International University, 11200 SW 8th Street, Miami, 33199 Florida USA
  - Biomolecular Sciences Institute, Florida International University, Miami, 33199 Florida USA
- Ziv Bar-Joseph
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, 15213 Pennsylvania USA
3
Halloran JT, Rocke DM. Learning Concave Conditional Likelihood Models for Improved Analysis of Tandem Mass Spectra. Adv Neural Inf Process Syst 2018; 31:5420-5430. [PMID: 31745383 PMCID: PMC6863516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
The most widely used technology to identify the proteins present in a complex biological sample is tandem mass spectrometry, which quickly produces a large collection of spectra representative of the peptides (i.e., protein subsequences) present in the original sample. In this work, we greatly expand the parameter learning capabilities of a dynamic Bayesian network (DBN) peptide-scoring algorithm, Didea [25], by deriving emission distributions for which its conditional log-likelihood scoring function remains concave. We show that this class of emission distributions, called Convex Virtual Emissions (CVEs), naturally generalizes the log-sum-exp function while rendering both maximum likelihood estimation and conditional maximum likelihood estimation concave for a wide range of Bayesian networks. Utilizing CVEs in Didea allows efficient learning of a large number of parameters while ensuring global convergence, in stark contrast to Didea's previous parameter learning framework (which could only learn a single parameter using a costly grid search) and other trainable models [12, 13, 14] (which only ensure convergence to local optima). The newly trained scoring function substantially outperforms the state-of-the-art in both scoring function accuracy and downstream Fisher kernel analysis. Furthermore, we significantly improve Didea's runtime performance through successive optimizations to its message passing schedule and derive explicit connections between Didea's new concave score and related MS/MS scoring functions.
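The abstract describes Convex Virtual Emissions as generalizing the log-sum-exp function. For readers unfamiliar with that function, here is a minimal, numerically stable sketch of plain log-sum-exp. It is only the standard textbook function, not Didea's scoring function or the CVE construction, and the helper name `log_sum_exp` is an assumption made for this illustration.

```python
import math


def log_sum_exp(values):
    """Compute log(sum(exp(v) for v in values)) without overflow.

    Subtracting the maximum before exponentiating keeps every exponent
    at or below zero, so math.exp cannot overflow.
    """
    m = max(values)
    if math.isinf(m):
        # All entries are -inf (result -inf) or some entry is +inf (result +inf).
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


# Large inputs would overflow a naive sum of exponentials
# (math.exp(1000.0) raises OverflowError), but the shifted form is safe.
print(log_sum_exp([1000.0, 1000.0]))

# Log-sum-exp is convex: its value at a midpoint is never above the
# average of its values at the endpoints.
x = [0.0, 1.0, 2.0]
y = [3.0, -1.0, 0.5]
midpoint = [(a + b) / 2.0 for a, b in zip(x, y)]
print(log_sum_exp(midpoint) <= (log_sum_exp(x) + log_sum_exp(y)) / 2.0)
```

Because log-sum-exp is convex, its negative is concave, which is the kind of property that makes an objective amenable to optimization with guaranteed global convergence.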
Affiliation(s)
- John T Halloran
  - Department of Public Health Sciences, University of California, Davis
- David M Rocke
  - Department of Public Health Sciences, University of California, Davis
4
Baker JJ, McDaniel D, Cain D, Lee Tao P, Li C, Huang Y, Liu H, Zhu-Shimoni J, Niñonuevo M. Rapid Identification of Disulfide Bonds and Cysteine-Related Variants in an IgG1 Knob-into-Hole Bispecific Antibody Enhanced by Machine Learning. Anal Chem 2018; 91:965-976. [DOI: 10.1021/acs.analchem.8b04071] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 01/02/2023]
Affiliation(s)
- Jordan J. Baker
  - Genentech, 1 DNA Way, South San Francisco, California 94080, United States
- Dana McDaniel
  - Genentech, 1 DNA Way, South San Francisco, California 94080, United States
- David Cain
  - Genentech, 1 DNA Way, South San Francisco, California 94080, United States
- Paula Lee Tao
  - Genentech, 1 DNA Way, South San Francisco, California 94080, United States
- Charlene Li
  - Genentech, 1 DNA Way, South San Francisco, California 94080, United States
- Yuting Huang
  - Genentech, 1 DNA Way, South San Francisco, California 94080, United States
- Hongbin Liu
  - Genentech, 1 DNA Way, South San Francisco, California 94080, United States
- Judith Zhu-Shimoni
  - Genentech, 1 DNA Way, South San Francisco, California 94080, United States
- Milady Niñonuevo
  - Genentech, 1 DNA Way, South San Francisco, California 94080, United States
5
Abstract
Percolator is an important tool for greatly improving the results of a database search and subsequent downstream analysis. Using support vector machines (SVMs), Percolator recalibrates peptide-spectrum matches based on the learned decision boundary between targets and decoys. To improve analysis time for large-scale data sets, we update Percolator's SVM learning engine through software and algorithmic optimizations rather than heuristic approaches that necessitate the careful study of their impact on learned parameters across different search settings and data sets. We show that by optimizing Percolator's original learning algorithm, l2-SVM-MFN, large-scale SVM learning requires only about a third of the original runtime. Furthermore, we show that by employing the widely used Trust Region Newton (TRON) algorithm instead of l2-SVM-MFN, the runtime of large-scale Percolator SVM learning is reduced to about a fifth of the original. Importantly, these speedups only affect the speed at which Percolator converges to a global solution and do not alter recalibration performance. The upgraded versions of both l2-SVM-MFN and TRON are optimized within the Percolator codebase for multithreaded and single-thread use and are available under Apache license at bitbucket.org/jthalloran/percolator_upgrade.
Affiliation(s)
- John T Halloran
  - Department of Public Health Sciences, University of California, Davis, Davis, California 95616, United States
- David M Rocke
  - Division of Biostatistics, University of California, Davis, Davis, California 95616, United States
6
Halloran JT. Analyzing Tandem Mass Spectra Using the DRIP Toolkit: Training, Searching, and Post-Processing. Methods Mol Biol 2018; 1807:163-180. [PMID: 30030810 DOI: 10.1007/978-1-4939-8561-6_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
Tandem mass spectrometry (MS/MS) is a high-throughput technology used to identify the proteins present in a complex, biological sample. Critical to MS/MS is the ability to accurately identify the peptide responsible for producing each observed spectrum. Recently, a dynamic Bayesian network (DBN) approach was shown to achieve state-of-the-art accuracy for this peptide identification problem. Modeling the stochastic process by which a peptide produces an MS/MS spectrum, this DBN for Rapid Identification of Peptides (DRIP) uses probabilistic inference to efficiently determine the most probable alignment between a peptide and an observed spectrum. DRIP's dynamic alignment strategy improves upon standard "static" alignment strategies, which rely on fixed quantization of the temporal axis of MS/MS data, in several significant ways. In particular, DRIP allows learning non-linear shifts of the temporal axis and, owing to the generative nature of the model, accurate feature extraction for substantially improved discriminative analysis (i.e., Percolator post-processing), all of which are supported in the DRIP Toolkit (DTK). Herein we describe how DTK may be used to significantly improve MS/MS identification accuracy, as well as DTK's interactive features for fine-grained analysis, including on the fly inference and plotting attributes.
Affiliation(s)
- John T Halloran
  - Department of Public Health Sciences, University of California, Davis, Davis, CA, USA
7
Halloran JT, Rocke DM. Gradients of Generative Models for Improved Discriminative Analysis of Tandem Mass Spectra. Adv Neural Inf Process Syst 2017; 30:5724-5733. [PMID: 31745382 PMCID: PMC6863505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Tandem mass spectrometry (MS/MS) is a high-throughput technology used to identify the proteins in a complex biological sample, such as a drop of blood. A collection of spectra is generated at the output of the process, each spectrum of which is representative of a peptide (protein subsequence) present in the original complex sample. In this work, we leverage the log-likelihood gradients of generative models to improve the identification of such spectra. In particular, we show that the gradient of a recently proposed dynamic Bayesian network (DBN) [7] may be naturally employed by a kernel-based discriminative classifier. The resulting Fisher kernel substantially improves upon recent attempts to combine generative and discriminative models for post-processing analysis, outperforming all other methods on the evaluated datasets. We extend the improved accuracy offered by the Fisher kernel framework to other search algorithms by introducing Theseus, a DBN representing a large number of widely used MS/MS scoring functions. Furthermore, with gradient ascent and max-product inference at hand, we use Theseus to learn model parameters without any supervision.
Affiliation(s)
- John T Halloran
  - Department of Public Health Sciences, University of California, Davis
- David M Rocke
  - Department of Public Health Sciences, University of California, Davis