1
|
Advances in proteomics: characterization of the innate immune system after birth and during inflammation. Front Immunol 2023; 14:1254948. [PMID: 37868984 PMCID: PMC10587584 DOI: 10.3389/fimmu.2023.1254948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Accepted: 09/26/2023] [Indexed: 10/24/2023] Open
Abstract
Proteomics is the characterization of the protein composition, the proteome, of a biological sample. It involves the large-scale identification and quantification of proteins, peptides, and post-translational modifications. This review focuses on recent developments in mass spectrometry-based proteomics and provides an overview of available methods for sample preparation to study the innate immune system. Recent advancements in the proteomics workflows, including sample preparation, have significantly improved the sensitivity and proteome coverage of biological samples including the technically difficult blood plasma. Proteomics is often applied in immunology and has been used to characterize the levels of innate immune system components after perturbations such as birth or during chronic inflammatory diseases like rheumatoid arthritis (RA) and inflammatory bowel disease (IBD). In cancers, the tumor microenvironment may generate chronic inflammation and release cytokines to the circulation. In these situations, the innate immune system undergoes profound and long-lasting changes, the large-scale characterization of which may increase our biological understanding and help identify components with translational potential for guiding diagnosis and treatment decisions. With the ongoing technical development, proteomics will likely continue to provide increasing insights into complex biological processes and their implications for health and disease. Integrating proteomics with other omics data and utilizing multi-omics approaches have been demonstrated to give additional valuable insights into biological systems.
|
2
|
Insight on physicochemical properties governing peptide MS1 response in HPLC-ESI-MS/MS: A deep learning approach. Comput Struct Biotechnol J 2023; 21:3715-3727. [PMID: 37560124 PMCID: PMC10407266 DOI: 10.1016/j.csbj.2023.07.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2023] [Revised: 07/13/2023] [Accepted: 07/19/2023] [Indexed: 08/11/2023] Open
Abstract
Accurate and absolute quantification of peptides in complex mixtures using quantitative mass spectrometry (MS)-based methods requires foreground knowledge and isotopically labeled standards, thereby increasing analytical expenses, time consumption, and labor, thus limiting the number of peptides that can be accurately quantified. This originates from differential ionization efficiency between peptides and thus, understanding the physicochemical properties that influence the ionization and response in MS analysis is essential for developing less restrictive label-free quantitative methods. Here, we used equimolar peptide pool repository data to develop a deep learning model capable of identifying amino acids influencing the MS1 response. By using an encoder-decoder with an attention mechanism and correlating attention weights with amino acid physicochemical properties, we obtain insight on properties governing the peptide-level MS1 response within the datasets. While the problem cannot be described by one single set of amino acids and properties, distinct patterns were reproducibly obtained. Properties are grouped in three main categories related to peptide hydrophobicity, charge, and structural propensities. Moreover, our model can predict MS1 intensity output under defined conditions based solely on peptide sequence input. Using a refined training dataset, the model predicted log-transformed peptide MS1 intensities with an average error of 9.7 ± 0.5% based on 5-fold cross validation, and outperformed random forest and ridge regression models on both log-transformed and real scale data. This work demonstrates how deep learning can facilitate identification of physicochemical properties influencing peptide MS1 responses, but also illustrates how sequence-based response prediction and label-free peptide-level quantification may impact future workflows within quantitative proteomics.
|
3
|
Missing data in multi-omics integration: Recent advances through artificial intelligence. Front Artif Intell 2023; 6:1098308. [PMID: 36844425 PMCID: PMC9949722 DOI: 10.3389/frai.2023.1098308] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Received: 11/14/2022] [Accepted: 01/23/2023] [Indexed: 02/11/2023] Open
Abstract
Biological systems function through complex interactions between various 'omics (biomolecules), and a more complete understanding of these systems is only possible through an integrated, multi-omic perspective. This has presented the need for the development of integration approaches that are able to capture the complex, often non-linear, interactions that define these biological systems and are adapted to the challenges of combining the heterogeneous data across 'omic views. A principal challenge to multi-omic integration is missing data because not all biomolecules are measured in all samples. Due to cost, instrument sensitivity, or other experimental factors, data for a biological sample may be missing for one or more 'omic technologies. Recent methodological developments in artificial intelligence and statistical learning have greatly facilitated the analyses of multi-omics data; however, many of these techniques assume access to completely observed data. A subset of these methods incorporate mechanisms for handling partially observed samples, and these methods are the focus of this review. We describe recently developed approaches, noting their primary use cases and highlighting each method's approach to handling missing data. We additionally provide an overview of the more traditional missing data workflows and their limitations, and we discuss potential avenues for further developments as well as how the missing data issue and its current solutions may generalize beyond the multi-omics context.
|
4
|
A community resource to mass explore the wheat grain proteome and its application to the late-maturity alpha-amylase (LMA) problem. Gigascience 2022; 12:giad084. [PMID: 37919977 PMCID: PMC10627334 DOI: 10.1093/gigascience/giad084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Revised: 08/02/2023] [Accepted: 09/19/2023] [Indexed: 11/04/2023] Open
Abstract
BACKGROUND Late-maturity alpha-amylase (LMA) is a wheat genetic defect causing the synthesis of high isoelectric point alpha-amylase following a temperature shock during mid-grain development or prolonged cold throughout grain development, both leading to starch degradation. While the physiology is well understood, the biochemical mechanisms involved in grain LMA response remain unclear. We have applied high-throughput proteomics to 4,061 wheat flours displaying a range of LMA activities. Using an array of statistical analyses to select LMA-responsive biomarkers, we have mined them using a suite of tools applicable to wheat proteins. RESULTS We observed that LMA-affected grains activated their primary metabolisms such as glycolysis and gluconeogenesis; TCA cycle, along with DNA- and RNA- binding mechanisms; and protein translation. This logically transitioned to protein folding activities driven by chaperones and protein disulfide isomerase, as well as protein assembly via dimerisation and complexing. The secondary metabolism was also mobilized with the upregulation of phytohormones and chemical and defence responses. LMA further invoked cellular structures, including ribosomes, microtubules, and chromatin. Finally, and unsurprisingly, LMA expression greatly impacted grain storage proteins, as well as starch and other carbohydrates, with the upregulation of alpha-gliadins and starch metabolism, whereas LMW glutenin, stachyose, sucrose, UDP-galactose, and UDP-glucose were downregulated. CONCLUSIONS To our knowledge, this is not only the first proteomics study tackling the wheat LMA issue but also the largest plant-based proteomics study published to date. Logistics, technicalities, requirements, and bottlenecks of such an ambitious large-scale high-throughput proteomics experiment along with the challenges associated with big data analyses are discussed.
|
5
|
PBLMM: Peptide-based linear mixed models for differential expression analysis of shotgun proteomics data. J Cell Biochem 2022; 123:691-696. [PMID: 35132673 DOI: 10.1002/jcb.30225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Revised: 01/23/2022] [Accepted: 01/27/2022] [Indexed: 11/07/2022]
Abstract
Here, we present a peptide-based linear mixed models tool, PBLMM, a standalone desktop application for differential expression analysis of proteomics data. We also provide a Python package that allows streamlined data analysis workflows implementing the PBLMM algorithm. PBLMM is easy to use without scripting experience and calculates differential expression by peptide-based linear mixed regression models. We show that peptide-based models outperform classical methods of statistical inference of differentially expressed proteins. In addition, PBLMM exhibits superior statistical power in situations of low effect size and/or low sample size. Taken together, our tool provides an easy-to-use, high-statistical-power method to infer differentially expressed proteins from proteomics data.
|
6
|
Evaluating Spatiotemporal Dynamics of Phosphorylation of RNA Polymerase II Carboxy-Terminal Domain by Ultraviolet Photodissociation Mass Spectrometry. J Am Chem Soc 2021; 143:8488-8498. [PMID: 34053220 DOI: 10.1021/jacs.1c03321] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/27/2022]
Abstract
The critical role of site-specific phosphorylation in eukaryotic transcription has motivated efforts to decipher the complex phosphorylation patterns exhibited by the carboxyl-terminal domain (CTD) of RNA polymerase II. Phosphorylation remains a challenging post-translational modification to characterize by mass spectrometry owing to the labile phosphate ester linkage and low stoichiometric prevalence, two features that complicate analysis by high-throughput MS/MS methods. Identifying phosphorylation sites represents one significant hurdle in decrypting the CTD phosphorylation, a problem exaggerated by a large number of potential phosphorylation sites. An even greater obstacle is decoding the dynamic phosphorylation pattern along the length of the periodic CTD sequence. Ultraviolet photodissociation (UVPD) is a high-energy ion activation method that provides ample backbone cleavages of peptides while preserving labile post-translational modifications that facilitate their confident localization. Herein, we report a quantitative parallel reaction monitoring (PRM) method developed to monitor spatiotemporal changes in site-specific Ser5 phosphorylation of the CTD by cyclin-dependent kinase 7 (CDK7) using UVPD for sequence identification, phosphosite localization, and differentiation of phosphopeptide isomers. We capitalize on the series of phospho-retaining fragment ions produced by UVPD to create unique transition lists that are pivotal for distinguishing the array of phosphopeptides generated from the CTD.
|
7
|
Abstract
The throughput efficiency and increased depth of coverage provided by isobaric-labeled proteomics measurements have led to increased usage of these techniques. However, the structure of missing data differs from that in unlabeled studies, which prompts this review comparing the efficacy of nine imputation methods on large isobaric-labeled proteomics data sets to guide researchers on the appropriateness of various imputation methods. Imputation methods were evaluated by accuracy, statistical hypothesis test inference, and run time. In general, expectation maximization and random forest imputation methods yielded the best performance, and constant-based methods consistently performed poorly across all data set sizes and percentages of missing values. For data sets with small sample sizes and higher percentages of missing data, results indicate that statistical inference with no imputation may be preferable. On the basis of the findings in this review, there are core imputation methods that perform better for isobaric-labeled proteomics data, but great care and consideration as to whether imputation is the optimal strategy should be given for data sets comprising a small number of samples.
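The accuracy criterion mentioned in this abstract can be illustrated with a minimal sketch: hide some observed values, impute them, and score the imputation by root-mean-square error on the hidden cells. Everything here is an assumption for illustration (the function names, the toy matrix, the choice of a zero constant and a per-protein mean as the two imputers); it does not reproduce the nine methods or the data sets of the review, and neither imputer shown is expectation maximization or random forest.

```python
import math
import random


def mask_values(matrix, fraction, seed=0):
    """Hide a fraction of the cells; return the masked copy and hidden positions."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = rng.sample(cells, int(len(cells) * fraction))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None  # None marks a missing value
    return masked, hidden


def impute_constant(masked, value):
    """Constant-based imputation: every missing cell gets the same value."""
    return [[value if x is None else x for x in row] for row in masked]


def impute_row_mean(masked):
    """Fill each missing cell with the mean of the observed values in its row."""
    out = []
    for row in masked:
        observed = [x for x in row if x is not None]
        fill = sum(observed) / len(observed) if observed else 0.0
        out.append([fill if x is None else x for x in row])
    return out


def rmse(truth, imputed, hidden):
    """Root-mean-square error over the hidden cells only."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in hidden]
    return math.sqrt(sum(squared) / len(squared))


# Toy log-intensity matrix: rows are proteins, columns are samples (made-up values).
truth = [
    [20.1, 20.3, 19.9, 20.2],
    [25.0, 25.2, 24.8, 25.1],
    [15.1, 14.9, 15.0, 15.2],
]
masked, hidden = mask_values(truth, fraction=0.25)
print("row mean RMSE:", rmse(truth, impute_row_mean(masked), hidden))
print("constant RMSE:", rmse(truth, impute_constant(masked, 0.0), hidden))
```

On this toy matrix the constant fill is far from every true value, so its error is much larger than that of the row mean; real comparisons would also need the hypothesis-test and run-time criteria named in the abstract.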
|
8
|
DEqMS: A Method for Accurate Variance Estimation in Differential Protein Expression Analysis. Mol Cell Proteomics 2020; 19:1047-1057. [PMID: 32205417 PMCID: PMC7261819 DOI: 10.1074/mcp.tir119.001646] [Citation(s) in RCA: 94] [Impact Index Per Article: 23.5] [Received: 07/02/2019] [Revised: 03/20/2020] [Indexed: 12/19/2022] Open
Abstract
Quantitative proteomics by mass spectrometry is widely used in biomarker research and basic biology research for investigation of phenotype level cellular events. Despite the wide application, the methodology for statistical analysis of differentially expressed proteins has not been unified. Various methods such as the t test, linear models and mixed effect models are used to define changes in proteomics experiments. However, none of these methods consider the specific structure of MS data. Choices between methods, often originally developed for other types of data, are based on compromises between features such as statistical power, general applicability and user friendliness. Furthermore, whether to include proteins identified with one peptide in statistical analysis of differential protein expression varies between studies. Here we present DEqMS, a robust statistical method developed specifically for differential protein expression analysis in mass spectrometry data. In all data sets investigated there is a clear dependence of variance on the number of PSMs or peptides used for protein quantification. DEqMS takes this feature into account when assessing differential protein expression. This allows for a more accurate data-dependent estimation of protein variance and inclusion of single peptide identifications without increasing false discoveries. The method was tested in several data sets including E. coli proteome spike-in data, using both label-free and TMT-labeled quantification. Compared with previous statistical methods used in quantitative proteomics, DEqMS showed consistently better accuracy in detecting altered protein levels in both label-free and labeled quantitative proteomics data. DEqMS is available as an R package in Bioconductor.
|
9
|
Peptide filtering differently affects the performances of XIC-based quantification methods. J Proteomics 2018; 193:131-141. [PMID: 30312678 DOI: 10.1016/j.jprot.2018.10.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 03/18/2018] [Revised: 10/02/2018] [Accepted: 10/08/2018] [Indexed: 11/20/2022]
Abstract
In bottom-up proteomics, data are acquired on peptides resulting from proteolysis. In XIC-based quantification, the quality of the estimation of protein abundance depends on how peptide data are filtered and on which quantification method is used to express peptide intensity as protein abundance. So far, these two questions have been addressed independently. Here, we studied to what extent the relative performances of the quantification methods depend on the filters applied to peptide intensity data. To this end, we performed a spike-in experiment using Universal Protein Standard to evaluate the performances of five quantification methods in five datasets obtained after application of four peptide filters. Estimated protein abundances were not equally affected by filters depending on the computation mode and the type of data for quantification. Furthermore, we found that filters could have contrasting effects depending on the quantification objective. Intensity modeling proved to be the most robust method, providing the best results in the absence of any filter. However, the different quantification methods can achieve similar performances when appropriate peptide filters are used. Altogether, our findings provide insights into how best to handle intensity data according to the quantification objective and the experimental design. SIGNIFICANCE: We believe that our results are of major importance because they address, as far as we know for the first time, the crossed effects of peptide intensity data filtering and XIC-based quantification methods on protein quantification. While previous papers have dealt with peptide filtering independently of the quantification method, here we combined four peptide filters (based on peptide sharing between proteins, retention time variability, peptide occurrence and peptide intensity profiles) with five XIC-based quantification methods representing different modes of calculating protein abundances from peptide intensities. For these different combinations, we analyzed the quality of protein quantification in terms of precision, accuracy and linearity of response to increasing protein concentration using a spike-in experiment. We showed not only that the effect of filters on the estimation of protein abundances depends on the quantification method, but also that quantification methods can reach similar performances when appropriate peptide filters are used. Also, depending on the quantification objective, i.e. absolute or relative, filters can have contrasting effects, and we demonstrated that protein quantification by peptide intensity modeling was the most robust method.
|
10
|
Statistical Models for the Analysis of Isobaric Tags Multiplexed Quantitative Proteomics. J Proteome Res 2017; 16:3124-3136. [DOI: 10.1021/acs.jproteome.6b01050] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Indexed: 11/29/2022]
|
11
|
Comparative evaluation of label-free quantification methods for shotgun proteomics. RAPID COMMUNICATIONS IN MASS SPECTROMETRY : RCM 2017; 31:606-612. [PMID: 28097710 DOI: 10.1002/rcm.7829] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 08/18/2016] [Revised: 12/24/2016] [Accepted: 01/15/2017] [Indexed: 06/06/2023]
Abstract
RATIONALE Label-free quantification (LFQ) is a popular strategy for shotgun proteomics. A variety of LFQ algorithms have been developed recently. However, a comprehensive comparison of the most commonly used LFQ methods is still rare, in part due to a lack of clear metrics for their evaluation and an annotated and quantitatively well-characterized data set. METHODS Five LFQ methods were compared: spectral counting based algorithms SIN, emPAI, and NSAF, and approaches relying on the extracted ion chromatogram (XIC) intensities, MaxLFQ and Quanti. We used three criteria for performance evaluation: coefficient of variation (CV) of protein abundances between replicates; analysis of variance (ANOVA); and the root-mean-square error of logarithmized calculated concentration ratios, referred to as standard quantification error (SQE). Comparison was performed using a quantitatively annotated publicly available data set. RESULTS The best results in terms of inter-replicate reproducibility were observed for MaxLFQ and NSAF, although they exhibited larger standard quantification errors. Using NSAF, all quantitatively annotated proteins were correctly identified in the Bonferroni-corrected results of the ANOVA test. SIN was found to be the most accurate in terms of SQE. Finally, the current implementations of XIC-based LFQ methods did not outperform the methods based on spectral counting for the data set used in this study. CONCLUSIONS Surprisingly, the performances of XIC-based approaches measured using three independent metrics were found to be comparable with more straightforward and simple MS/MS-based spectral counting approaches. The study revealed no clear leader among the latter. Copyright © 2017 John Wiley & Sons, Ltd.
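Two of the evaluation criteria named in this abstract (CV and SQE) and one of the spectral counting measures (NSAF) are simple enough to sketch. The code below is an illustration under stated assumptions, not the study's implementation: the function names are hypothetical, the logarithm base for SQE is assumed to be 2 because the abstract does not give it, and NSAF is written in its commonly used form (spectral count divided by protein length, normalized over all proteins).

```python
import math
import statistics


def coefficient_of_variation(abundances):
    """CV of one protein's abundance across replicates: sample SD / mean."""
    return statistics.stdev(abundances) / statistics.mean(abundances)


def standard_quantification_error(calculated_ratios, expected_ratios):
    """RMSE between log-transformed calculated and expected concentration ratios.

    Assumption: base-2 logarithm; the abstract only says "logarithmized".
    """
    squared = [
        (math.log2(calc) - math.log2(exp)) ** 2
        for calc, exp in zip(calculated_ratios, expected_ratios)
    ]
    return math.sqrt(sum(squared) / len(squared))


def nsaf(spectral_counts, protein_lengths):
    """Normalized spectral abundance factor for each protein in one run."""
    saf = [count / length for count, length in zip(spectral_counts, protein_lengths)]
    total = sum(saf)
    return [value / total for value in saf]


# Made-up numbers for illustration only.
print(coefficient_of_variation([1.00e6, 1.10e6, 0.95e6]))
print(standard_quantification_error([1.8, 4.4], [2.0, 4.0]))
print(nsaf([10, 20, 5], [100, 400, 50]))
```

NSAF values sum to one within a run, which is what makes them comparable between runs with different total spectral counts.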
|
12
|
Experimental design and data-analysis in label-free quantitative LC/MS proteomics: A tutorial with MSqRob. J Proteomics 2017; 171:23-36. [PMID: 28391044 DOI: 10.1016/j.jprot.2017.04.004] [Citation(s) in RCA: 51] [Impact Index Per Article: 7.3] [Received: 03/01/2017] [Revised: 03/29/2017] [Accepted: 04/01/2017] [Indexed: 12/14/2022]
Abstract
Label-free shotgun proteomics is routinely used to assess proteomes. However, extracting relevant information from the massive amounts of generated data remains difficult. This tutorial provides a strong foundation on analysis of quantitative proteomics data. We provide key statistical concepts that help researchers to design proteomics experiments and we showcase how to analyze quantitative proteomics data using our recent free and open-source R package MSqRob, which was developed to implement the peptide-level robust ridge regression method for relative protein quantification described by Goeminne et al. MSqRob can handle virtually any experimental proteomics design and outputs proteins ordered by statistical significance. Moreover, its graphical user interface and interactive diagnostic plots provide easy inspection and also detection of anomalies in the data and flaws in the data analysis, allowing deeper assessment of the validity of results and a critical review of the experimental design. Our tutorial discusses interactive preprocessing, data analysis and visualization of label-free MS-based quantitative proteomics experiments with simple and more complex designs. We provide well-documented scripts to run analyses in bash mode on GitHub, enabling the integration of MSqRob in automated pipelines on cluster environments (https://github.com/statOmics/MSqRob). SIGNIFICANCE The concepts outlined in this tutorial aid in designing better experiments and analyzing the resulting data more appropriately. The two case studies using the MSqRob graphical user interface will contribute to a wider adaptation of advanced peptide-based models, resulting in higher quality data analysis workflows and more reproducible results in the proteomics community. We also provide well-documented scripts for experienced users that aim at automating MSqRob on cluster environments.
|
13
|
Thousand and one ways to quantify and compare protein abundances in label-free bottom-up proteomics. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2016; 1864:883-95. [PMID: 26947242 DOI: 10.1016/j.bbapap.2016.02.019] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.9] [Received: 11/05/2015] [Revised: 01/21/2016] [Accepted: 02/24/2016] [Indexed: 11/18/2022]
Abstract
How to process and analyze MS data to quantify and statistically compare protein abundances in bottom-up proteomics has been an open debate for nearly fifteen years. Two main approaches are generally used: the first is based on spectral data generated during the process of identification (e.g. peptide counting, spectral counting), while the second makes use of extracted ion currents to quantify chromatographic peaks and infer protein abundances based on peptide quantification. Each of these two approaches covers multiple methods that have been developed during the last decade but were subjected to in-depth evaluation only recently. In this paper, we compiled these different methods as exhaustively as possible. We also summarized the way they address the different problems raised by bottom-up protein quantification such as normalization, the presence of shared peptides, unequal peptide measurability and missing data. This article is part of a Special Issue entitled: Plant Proteomics--a bridge between fundamental processes and crop production, edited by Dr. Hans-Peter Mock.
|
14
|
Peptide-level Robust Ridge Regression Improves Estimation, Sensitivity, and Specificity in Data-dependent Quantitative Label-free Shotgun Proteomics. Mol Cell Proteomics 2015; 15:657-68. [PMID: 26566788 DOI: 10.1074/mcp.m115.055897] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Received: 10/29/2015] [Indexed: 01/22/2023] Open
Abstract
Peptide intensities from mass spectra are increasingly used for relative quantitation of proteins in complex samples. However, numerous issues inherent to the mass spectrometry workflow turn quantitative proteomic data analysis into a crucial challenge. We and others have shown that modeling at the peptide level outperforms classical summarization-based approaches, which typically also discard a lot of proteins at the data preprocessing step. Peptide-based linear regression models, however, still suffer from unbalanced datasets due to missing peptide intensities, outlying peptide intensities and overfitting. Here, we further improve upon peptide-based models by three modular extensions: ridge regression, improved variance estimation by borrowing information across proteins with empirical Bayes and M-estimation with Huber weights. We illustrate our method on the CPTAC spike-in study and on a study comparing wild-type and ArgP knock-out Francisella tularensis proteomes. We show that the fold change estimates of our robust approach are more precise and more accurate than those from state-of-the-art summarization-based methods and peptide-based regression models, which leads to an improved sensitivity and specificity. We also demonstrate that ionization competition effects come already into play at very low spike-in concentrations and confirm that analyses with peptide-based regression methods on peptide intensity values aggregated by charge state and modification status (e.g. MaxQuant's peptides.txt file) are slightly superior to analyses on raw peptide intensity values (e.g. MaxQuant's evidence.txt file).
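Of the three extensions listed in this abstract, M-estimation with Huber weights is the easiest to show in isolation. The sketch below is a generic illustration, not the authors' model: it estimates a robust mean of a handful of values by iteratively reweighted averaging, whereas the paper applies the weights inside a peptide-level regression. The function names, the tuning constant 1.345 (a conventional default), and the MAD-based scale are assumptions made for this example.

```python
import statistics


def huber_weights(scaled_residuals, k=1.345):
    """Huber weight: 1 inside [-k, k], k/|r| outside, so outliers count less."""
    return [1.0 if abs(r) <= k else k / abs(r) for r in scaled_residuals]


def huber_location(values, k=1.345, iterations=50):
    """Robust mean via iteratively reweighted averaging with Huber weights."""
    center = statistics.median(values)
    # Scale from the median absolute deviation; 1.4826 makes it comparable
    # to a standard deviation under normality.
    mad = statistics.median([abs(v - center) for v in values])
    scale = 1.4826 * mad
    if scale == 0:
        return center  # all values (nearly) identical; nothing to reweight
    for _ in range(iterations):
        residuals = [(v - center) / scale for v in values]
        weights = huber_weights(residuals, k)
        center = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return center


# Five consistent log-intensities and one outlying value (made-up numbers).
intensities = [10.0, 10.1, 9.9, 10.2, 9.8, 30.0]
print("plain mean: ", statistics.mean(intensities))
print("Huber mean: ", huber_location(intensities))
```

The plain mean is pulled toward the outlier, while the Huber estimate stays close to the bulk of the data because the outlying value receives a small weight.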
|
15
|
Abstract
Protein expression can be quantified in a high-throughput manner using different types of mass spectrometers. In recent years, label-free methods for determining protein abundance have emerged. Although the expression is initially measured at the peptide level, a common approach is to combine the peptide-level measurements into protein-level values before differential expression analysis. However, this simple combination is prone to inconsistencies between peptides and may lose valuable information. To this end, we introduce here a method for detecting differentially expressed proteins by combining peptide-level expression-change statistics. Using controlled spike-in experiments, we show that the approach of averaging peptide-level expression changes yields more accurate lists of differentially expressed proteins than does the conventional protein-level approach. This is particularly true when there are only a few replicate samples or the differences between the sample groups are small. The proposed technique is implemented in the Bioconductor package PECA, and it can be downloaded from http://www.bioconductor.org.
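The idea of testing each peptide first and only then combining the results per protein can be sketched as follows. This is a simplified illustration and not the PECA implementation: the Welch t-statistic as the peptide-level statistic, the median as the combining rule, the function names, and the toy data are all assumptions, and no p-values are computed.

```python
import math
import statistics
from collections import defaultdict


def welch_t(group_a, group_b):
    """Welch t-statistic for one peptide (log-intensities in two groups).

    Requires at least two replicates per group and non-zero variance overall.
    """
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error


def protein_scores(peptides):
    """Combine peptide-level statistics into one score per protein.

    peptides: iterable of (protein_id, group_a_values, group_b_values).
    The median is an assumed combining rule, chosen for robustness.
    """
    by_protein = defaultdict(list)
    for protein_id, group_a, group_b in peptides:
        by_protein[protein_id].append(welch_t(group_a, group_b))
    return {protein: statistics.median(ts) for protein, ts in by_protein.items()}


# Made-up log2 intensities: P1 changes between groups, P2 does not.
peptides = [
    ("P1", [21.0, 21.2, 20.8], [20.0, 20.1, 19.9]),
    ("P1", [18.1, 18.3, 17.9], [17.0, 17.2, 16.8]),
    ("P2", [15.0, 15.2, 14.8], [15.1, 14.9, 15.0]),
]
print(protein_scores(peptides))
```

Because each peptide is compared with itself across groups, differences in baseline intensity between peptides of the same protein (here about 21 versus 18) do not inflate the variance, which is one reason peptide-level testing can help when replicates are few.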
|
16
|
Evaluation of empirical rule of linearly correlated peptide selection (ERLPS) for proteotypic peptide-based quantitative proteomics. Proteomics 2014; 14:1593-603. [PMID: 24827140 DOI: 10.1002/pmic.201300032] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/21/2013] [Revised: 02/07/2014] [Accepted: 05/09/2014] [Indexed: 11/11/2022]
Abstract
Precise protein quantification is essential in comparative proteomics. Currently, quantification bias is inevitable when using a proteotypic peptide-based quantitative proteomics strategy because of differences in peptide measurability. To improve quantification accuracy, we proposed an "empirical rule for linearly correlated peptide selection (ERLPS)" in quantitative proteomics in our previous work. However, a systematic evaluation of the general application of ERLPS in quantitative proteomics under diverse experimental conditions needs to be conducted. In this study, the practical workflow of ERLPS was explicitly illustrated; different experimental variables, such as MS systems, sample complexities, sample preparations, elution gradients, matrix effects, loading amounts, and other factors, were comprehensively investigated to evaluate the applicability, reproducibility, and transferability of ERLPS. The results demonstrated that ERLPS was highly reproducible and transferable within appropriate loading amounts, and that linearly correlated response peptides should be selected for each specific experiment. ERLPS was applied to proteome samples ranging from yeast to mouse and human, and to quantitative methods from label-free to ¹⁸O/¹⁶O-labeled and SILAC analysis, and enabled accurate measurements for all proteotypic peptide-based quantitative proteomics over a large dynamic range.
|
17
|
General statistical framework for quantitative proteomics by stable isotope labeling. J Proteome Res 2014; 13:1234-47. [PMID: 24512137 DOI: 10.1021/pr4006958] [Citation(s) in RCA: 137] [Impact Index Per Article: 13.7] [Indexed: 01/24/2023]
Abstract
The combination of stable isotope labeling (SIL) with mass spectrometry (MS) allows comparison of the abundance of thousands of proteins in complex mixtures. However, interpretation of the large data sets generated by these techniques remains a challenge because appropriate statistical standards are lacking. Here, we present a generally applicable model that accurately explains the behavior of data obtained using current SIL approaches, including ¹⁸O, iTRAQ, and SILAC labeling, and different MS instruments. The model decomposes the total technical variance into the spectral, peptide, and protein variance components, and its general validity was demonstrated by confronting 48 experimental distributions against 18 different null hypotheses. In addition to its general applicability, the performance of the algorithm was at least similar to that of other existing methods. The model also provides a general framework to integrate quantitative and error information fully, allowing a comparative analysis of the results obtained from different SIL experiments. The model was applied to the global analysis of protein alterations induced by low H₂O₂ concentrations in yeast, demonstrating the increased statistical power that may be achieved by rigorous data integration. Our results highlight the importance of establishing an adequate and validated statistical framework for the analysis of high-throughput data.
|
18
|
|
19
|
A comparative analysis of computational approaches to relative protein quantification using peptide peak intensities in label-free LC-MS proteomics experiments. Proteomics 2012; 13:493-503. [PMID: 23019139 DOI: 10.1002/pmic.201200269] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.0] [Received: 06/29/2012] [Revised: 08/14/2012] [Accepted: 08/22/2012] [Indexed: 12/24/2022]
Abstract
Liquid chromatography coupled with mass spectrometry (LC-MS) is widely used to identify and quantify peptides in complex biological samples. In particular, label-free shotgun proteomics is highly effective for the identification of peptides and subsequently obtaining a global protein profile of a sample. As a result, this approach is widely used for discovery studies. Typically, the objective of these discovery studies is to identify proteins that are affected by some condition of interest (e.g. disease, exposure). However, for complex biological samples, label-free LC-MS proteomics experiments measure peptides and do not directly yield protein quantities. Thus, protein quantification must be inferred from one or more measured peptides. In recent years, many computational approaches to relative protein quantification of label-free LC-MS data have been published. In this review, we examine the most commonly employed quantification approaches to relative protein abundance from peak intensity values, evaluate their individual merits, and discuss challenges in the use of the various computational approaches.
|
20
|
Statistical protein quantification and significance analysis in label-free LC-MS experiments with complex designs. BMC Bioinformatics 2012; 13 Suppl 16:S6. [PMID: 23176351 PMCID: PMC3489535 DOI: 10.1186/1471-2105-13-s16-s6] [Citation(s) in RCA: 98] [Impact Index Per Article: 8.2] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is widely used for quantitative proteomic investigations. The typical output of such studies is a list of identified and quantified peptides. The biological and clinical interest is, however, usually focused on quantitative conclusions at the protein level. Furthermore, many investigations ask complex biological questions by studying multiple interrelated experimental conditions. Therefore, there is a need in the field for generic statistical models to quantify protein levels even in complex study designs. RESULTS We propose a general statistical modeling approach for protein quantification in arbitrary complex experimental designs, such as time course studies, or those involving multiple experimental factors. The approach summarizes the quantitative experimental information from all the features and all the conditions that pertain to a protein. It enables both protein significance analysis between conditions, and protein quantification in individual samples or conditions. We implement the approach in an open-source R-based software package MSstats suitable for researchers with a limited statistics and programming background. CONCLUSIONS We demonstrate, using as examples two experimental investigations with complex designs, that a simultaneous statistical modeling of all the relevant features and conditions yields a higher sensitivity of protein significance analysis and a higher accuracy of protein quantification as compared to commonly employed alternatives. The software is available at http://www.stat.purdue.edu/~ovitek/Software.html.
|
21
|
Proteomics pipeline for biomarker discovery of laser capture microdissected breast cancer tissue. J Mammary Gland Biol Neoplasia 2012; 17:155-64. [PMID: 22644111 PMCID: PMC3428526 DOI: 10.1007/s10911-012-9252-6] [Citation(s) in RCA: 66] [Impact Index Per Article: 5.5] [Received: 03/14/2012] [Accepted: 05/01/2012] [Indexed: 01/15/2023] Open
Abstract
Mass spectrometry (MS)-based label-free proteomics offers an unbiased approach to screen biomarkers related to disease progression and therapy-resistance of breast cancer on the global scale. However, multi-step sample preparation can introduce large variation in generated data, while inappropriate statistical methods will lead to false positive hits. All these issues have hampered the identification of reliable protein markers. A workflow, which integrates reproducible and robust sample preparation and data handling methods, is highly desirable in clinical proteomics investigations. Here we describe a label-free tissue proteomics pipeline, which encompasses laser capture microdissection (LCM) followed by nanoscale liquid chromatography and high resolution MS. This pipeline routinely identifies on average ∼10,000 peptides corresponding to ∼1,800 proteins from sub-microgram amounts of protein extracted from ∼4,000 LCM breast cancer epithelial cells. Highly reproducible abundance data were generated from different technical and biological replicates. As a proof-of-principle, comparative proteome analysis was performed on estrogen receptor α positive or negative (ER+/-) samples, and commonly known differentially expressed proteins related to ER expression in breast cancer were identified. Therefore, we show that our tissue proteomics pipeline is robust and applicable for the identification of breast cancer specific protein markers.
|
22
|
Metaprotein expression modeling for label-free quantitative proteomics. BMC Bioinformatics 2012; 13:74. [PMID: 22559859 PMCID: PMC3436780 DOI: 10.1186/1471-2105-13-74] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 12/12/2011] [Accepted: 05/04/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Label-free quantitative proteomics holds a great deal of promise for the future study of both medicine and biology. However, the data generated are extremely intricate in their correlation structure, and their proper analysis is complex. There are issues with missing identifications. There are high levels of correlation between many, but not all, of the peptides derived from the same protein. Additionally, there may be systematic shifts in the sensitivity of the machine between experiments or even through time within the duration of a single experiment. RESULTS We describe a hierarchical model for analyzing unbiased, label-free proteomics data which utilizes the covariance of peptide expression across samples as well as MS/MS-based identifications to group peptides, a strategy we call metaprotein expression modeling. Our metaprotein model acknowledges the possibility of misidentifications, post-translational modifications and systematic differences between samples due to changes in instrument sensitivity or differences in total protein concentration. In addition, our approach allows us to validate findings from unbiased, label-free proteomics experiments with further unbiased, label-free proteomics experiments. Finally, we demonstrate the clinical/translational utility of the model for building predictors capable of differentiating biological phenotypes as well as for validating those findings in the context of three novel cohorts of patients with Hepatitis C. CONCLUSIONS Mass-spectrometry proteomics is quickly becoming a powerful tool for studying biological and translational questions. Making use of all of the information contained in a particular set of data will be critical to the success of those endeavors. Our proposed model represents an advance in the ability of statistical models of proteomic data to identify and utilize correlation between features. This allows validation of predictors without translation to targeted assays in addition to informing the choice of targets when it is appropriate to generate those assays.
|
23
|
Statistical considerations of optimal study design for human plasma proteomics and biomarker discovery. J Proteome Res 2012; 11:2103-13. [PMID: 22338609 PMCID: PMC3320746 DOI: 10.1021/pr200636x] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Indexed: 01/24/2023]
Abstract
A mass spectrometry-based plasma biomarker discovery workflow was developed to facilitate biomarker discovery. Plasma from either healthy volunteers or patients with pancreatic cancer was 8-plex iTRAQ labeled, fractionated by 2-dimensional reversed phase chromatography and subjected to MALDI ToF/ToF mass spectrometry. Data were processed using a q-value based statistical approach to maximize protein quantification and identification. Technical (between duplicate samples) and biological variance (between and within individuals) were calculated and power analysis was thereby enabled. An a priori power analysis was carried out using samples from healthy volunteers to define sample sizes required for robust biomarker identification. The result was subsequently validated with a post hoc power analysis using a real clinical setting involving pancreatic cancer patients. This demonstrated that six samples per group (e.g., pre- vs post-treatment) may provide sufficient statistical power for most proteins with changes >2 fold. A reference standard allowed direct comparison of protein expression changes between multiple experiments. Analysis of patient plasma prior to treatment identified 29 proteins with significant changes within individual patients. Changes in Peroxiredoxin II levels were confirmed by Western blot. This q-value based statistical approach in combination with reference standard samples can be applied with confidence in the design and execution of clinical studies for predictive, prognostic, and/or pharmacodynamic biomarker discovery. The power analysis provides information required prior to study initiation.
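The a priori power analysis described in this abstract can be illustrated with a textbook sample-size calculation. The sketch below is not the authors' procedure: it assumes a two-group comparison of log2 abundances with equal variance, uses the normal approximation (which slightly underestimates the sample size a t-test would need), and ignores multiple-testing correction. The function name `samples_per_group` and the standard deviation used in the usage note are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_group(fold_change, sd_log2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample test.

    fold_change: smallest fold change to detect (e.g. 2.0).
    sd_log2: assumed standard deviation of log2 abundance within a group
             (technical plus biological); a hypothetical input here.
    """
    delta = abs(math.log2(fold_change))           # effect size on the log2 scale
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    n = 2 * (z_alpha + z_power) ** 2 * sd_log2 ** 2 / delta ** 2
    return math.ceil(n)
```

For example, `samples_per_group(2.0, sd_log2=0.5)` gives the approximate group size for detecting a 2-fold change under an assumed standard deviation of 0.5 log2 units; smaller fold changes or noisier measurements increase the required number.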
|
24
|
High-Dimensional Longitudinal Genomic Data: An analysis used for monitoring viral infections. IEEE SIGNAL PROCESSING MAGAZINE 2012; 29:108-123. [PMID: 24678238 PMCID: PMC3964679 DOI: 10.1109/msp.2011.943009] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 06/03/2023]
|
25
|
Abstract
The continued fast pace of fungal genome sequence generation has enabled proteomic analysis of a wide variety of organisms that span the breadth of the Kingdom Fungi. There is some phylogenetic bias to the current catalog of fungi with reasonable DNA sequence databases (genomic or EST) that could be analyzed at a global proteomic level. However, the rapid development of next generation sequencing platforms has lowered the cost of genome sequencing such that in the near future, having a genome sequence will no longer be a time or cost bottleneck for downstream proteomic (and transcriptomic) analyses. High throughput, nongel-based proteomics offers a snapshot of proteins present in a given sample at a single point in time. There are a number of variations on the general methods and technologies for identifying peptides in a given sample. We present a method that can serve as a "baseline" for proteomic studies of fungi.
|
26
|
Abstract
Motivation: In the analysis of differential peptide peak intensities (i.e. abundance measures), LC-MS analyses with poor quality peptide abundance data can bias downstream statistical analyses and hence the biological interpretation for an otherwise high-quality dataset. Although considerable effort has been placed on assuring the quality of the peptide identification with respect to spectral processing, to date quality assessment of the subsequent peptide abundance data matrix has been limited to a subjective visual inspection of run-by-run correlation or individual peptide components. Identifying statistical outliers is a critical step in the processing of proteomics data as many of the downstream statistical analyses [e.g. analysis of variance (ANOVA)] rely upon accurate estimates of sample variance, and their results are influenced by extreme values. Results: We describe a novel multivariate statistical strategy for the identification of LC-MS runs with extreme peptide abundance distributions. Comparison with the current method (run-by-run correlation) demonstrates a significantly better rate of identification of outlier runs by the multivariate strategy. Simulation studies also suggest that this strategy significantly outperforms correlation alone in the identification of statistically extreme liquid chromatography-mass spectrometry (LC-MS) runs. Availability: https://www.biopilot.org/docs/Software/RMD.php Contact: bj@pnl.gov Supplementary information: Supplementary material is available at Bioinformatics online.
|
27
|
Bioinformatics Tools for Mass Spectrometry-Based High-Throughput Quantitative Proteomics Platforms. CURR PROTEOMICS 2011; 8:125-137. [PMID: 23002391 DOI: 10.2174/157016411795678020] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/16/2022]
Abstract
Determining global proteome changes is important for advancing a systems biology view of cellular processes and for discovering biomarkers. Liquid chromatography, coupled to mass spectrometry, has been widely used as a proteomics technique for discovering differentially expressed proteins in biological samples. However, although a large number of high-throughput studies have identified differentially regulated proteins, only a small fraction of these results have been reproduced and independently verified. The use of different approaches to data processing and analyses is among the factors which contribute to inconsistent conclusions. This perspective provides a comprehensive and critical overview of bioinformatics methods for commonly used mass spectrometry-based quantitative proteomics, employing both stable isotope labeling and label-free approaches. We evaluate the challenges associated with current quantitative proteomics techniques, placing particular emphasis on data analyses. The complexity of processing and interpreting proteomics datasets has become a central issue as sensitivity, mass resolution, mass accuracy and throughput of mass spectrometers have improved. A number of computer programs are available to address these challenges, and are reviewed here. We focus on approaches for signal processing, noise reduction, and methods for protein abundance estimation.
|
28
|
Combined statistical analyses of peptide intensities and peptide occurrences improves identification of significant peptides from MS-based proteomics data. J Proteome Res 2010; 9:5748-56. [PMID: 20831241 PMCID: PMC2974810 DOI: 10.1021/pr1005247] [Citation(s) in RCA: 78] [Impact Index Per Article: 5.6] [Indexed: 11/29/2022]
Abstract
Liquid chromatography-mass spectrometry-based (LC-MS) proteomics uses peak intensities of proteolytic peptides to infer the differential abundance of peptides/proteins. However, substantial run-to-run variability in intensities and observations (presence/absence) of peptides makes data analysis quite challenging. The missing observations in LC-MS proteomics data are difficult to address with traditional imputation-based approaches because the mechanisms by which data are missing are unknown a priori. Data can be missing due to random mechanisms such as experimental error or nonrandom mechanisms such as a true biological effect. We present a statistical approach that uses a test of independence known as a G-test to test the null hypothesis of independence between the number of missing values across experimental groups. We pair the G-test results, evaluating independence of missing data (IMD) with an analysis of variance (ANOVA) that uses only means and variances computed from the observed data. Each peptide is therefore represented by two statistical confidence metrics, one for qualitative differential observation and one for quantitative differential intensity. We use three LC-MS data sets to demonstrate the robustness and sensitivity of the IMD-ANOVA approach. Missing abundance values in LC-MS data are difficult to analyze statistically because the mechanisms by which the data are missing are unknown (processing or biological effect). We present a new approach that pairs a test of independence on missing data to discern qualitative difference across treatment groups with traditional statistical tests that evaluate quantitative differences. The combination of these two statistics yields a more robust statistical description of the data.
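The G-test on missing observations mentioned in this abstract can be sketched for the simplest case of two experimental groups. This is a generic G-test of independence on a 2 x 2 presence/absence table, not the authors' IMD-ANOVA implementation; the function name `g_test_two_groups` and the counts in the usage note are hypothetical. With two groups the statistic has one degree of freedom, so the chi-square tail probability can be computed with `math.erfc` alone.

```python
import math


def g_test_two_groups(observed, group_sizes):
    """G-test of independence between group and peptide presence/absence.

    observed: number of runs in which the peptide was observed, per group.
    group_sizes: total number of runs per group (two groups assumed).
    Returns (G statistic, p-value) with one degree of freedom.
    """
    if len(observed) != 2 or len(group_sizes) != 2:
        raise ValueError("this sketch handles exactly two groups")
    present = list(observed)
    absent = [n - o for n, o in zip(group_sizes, observed)]
    total = sum(group_sizes)
    g = 0.0
    for row in (present, absent):
        row_total = sum(row)
        for count, n in zip(row, group_sizes):
            expected = row_total * n / total
            if count > 0:  # empty cells contribute nothing to G
                g += 2.0 * count * math.log(count / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    # Chi-square survival function for 1 degree of freedom.
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value
```

Calling `g_test_two_groups([9, 2], [10, 10])` tests whether a peptide seen in 9 of 10 runs in one group and 2 of 10 in the other is missing independently of group; a small p-value flags a qualitative difference that an intensity-only ANOVA would not capture.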
|
29
|
Abstract
Spectral count, defined as the total number of spectra identified for a protein, has gained acceptance as a practical, label-free, semiquantitative measure of protein abundance in proteomic studies. In this review, we discuss issues affecting the performance of spectral counting relative to other label-free methods, as well as its limitations. Possible consequences of modifications, which are commonly applied to raw spectral counts to improve abundance estimations, are considered. The use of spectral counting for different types of quantitation studies is explored and critiqued. Different statistical methods and underlying frameworks that have been applied to spectral count analysis are described and compared, and problem areas that undermine confident statistical analysis are considered. Finally, the issue of accurate estimation of false-discovery rates is addressed and identified as a major current challenge in quantitative proteomics.
|
30
|
Abstract
The goal of many LC-MS proteomic investigations is to quantify and compare the abundance of proteins in complex biological mixtures. However, the output of an LC-MS experiment is not a list of proteins, but a list of quantified spectral features. To make protein-level conclusions, researchers typically apply ad hoc rules, or take an average of feature abundance to obtain a single protein-level quantity for each sample. We argue that these two approaches are inadequate. We discuss two statistical models, namely, fixed and mixed effects Analysis of Variance (ANOVA), which view individual features as replicate measurements of a protein's abundance, and explicitly account for this redundancy. We demonstrate, using a spike-in and a clinical data set, that the proposed models improve the sensitivity and specificity of testing, improve the accuracy of patient-specific protein quantifications, and are more robust in the presence of missing data.
|
31
|
Development and evaluation of normalization methods for label-free relative quantification of endogenous peptides. Mol Cell Proteomics 2009; 8:2285-95. [PMID: 19596695 DOI: 10.1074/mcp.m800514-mcp200] [Citation(s) in RCA: 88] [Impact Index Per Article: 5.9] [Indexed: 11/06/2022] Open
Abstract
The performances of 10 different normalization methods on data of endogenous brain peptides produced with label-free nano-LC-MS were evaluated. Data sets originating from three different species (mouse, rat, and Japanese quail), each consisting of 35-45 individual LC-MS analyses, were used in the study. Each sample set contained both technical and biological replicates, and the LC-MS analyses were performed in a randomized block fashion. Peptides in all three data sets were found to display LC-MS analysis order-dependent bias. Global normalization methods will only to some extent correct this type of bias. Only the novel normalization procedure RegrRun (linear regression followed by analysis order normalization) corrected for this type of bias. The RegrRun procedure performed the best of the normalization methods tested and decreased the median S.D. by 43% on average compared with raw data. This method also produced the smallest fraction of peptides with interblock differences while producing the largest fraction of differentially expressed peaks between treatment groups in all three data sets. Linear regression normalization (Regr) performed second best and decreased median S.D. by 38% on average compared with raw data. All other examined methods reduced median S.D. by 20-30% on average compared with raw data.
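The analysis order-dependent bias described in this abstract can be illustrated with a very simple correction. The sketch below is not the RegrRun procedure, whose details are not given here; it only assumes that a single peptide's log intensities drift linearly with run order, fits that trend by ordinary least squares, and subtracts it while preserving the mean. The function name `remove_run_order_trend` is hypothetical.

```python
def remove_run_order_trend(log_intensities):
    """Subtract a fitted linear run-order trend from one peptide's values.

    log_intensities: log-scale intensities listed in LC-MS analysis order,
    with no missing values (an assumption of this sketch).
    """
    n = len(log_intensities)
    if n < 2:
        raise ValueError("need at least two runs to fit a trend")
    mean_x = (n - 1) / 2                      # mean of run indices 0..n-1
    mean_y = sum(log_intensities) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(log_intensities))
    slope = sxy / sxx                         # least-squares drift per run
    # Remove the drift around the centre of the run sequence so the
    # peptide's mean log intensity is unchanged.
    return [y - slope * (x - mean_x) for x, y in enumerate(log_intensities)]
```

A design caveat: if treatment groups are not randomized across the run order, a real group difference can be mistaken for drift and removed, which is one reason the randomized block layout used in the study matters.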
|
32
|
Statistical design of quantitative mass spectrometry-based proteomic experiments. J Proteome Res 2009; 8:2144-56. [PMID: 19222236 DOI: 10.1021/pr8010099] [Citation(s) in RCA: 190] [Impact Index Per Article: 12.7] [Indexed: 11/28/2022]
Abstract
We review the fundamental principles of statistical experimental design, and their application to quantitative mass spectrometry-based proteomics. We focus on class comparison using Analysis of Variance (ANOVA), and discuss how randomization, replication and blocking help avoid systematic biases due to the experimental procedure, and help optimize our ability to detect true quantitative changes between groups. We also discuss the issues of pooling multiple biological specimens for a single mass analysis, and calculation of the number of replicates in a future study. When applicable, we emphasize the parallels between designing quantitative proteomic experiments and experiments with gene expression microarrays, and give examples from that area of research. We illustrate the discussion using theoretical considerations, and using real-data examples of profiling of disease.
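Randomization and blocking, as discussed in this abstract, can be sketched as a run-order generator. The example is a minimal illustration under stated assumptions rather than a method from the paper: each block (for instance one day of instrument time) receives a balanced share of every group, and the acquisition order is shuffled within each block. The function name `randomized_block_order` and the sample labels are hypothetical.

```python
import random


def randomized_block_order(samples_by_group, n_blocks, seed=0):
    """Assign samples to blocks with balanced groups and random run order.

    samples_by_group: dict mapping group name to a list of sample IDs.
    n_blocks: number of blocks (e.g. acquisition days or plates).
    Returns a list of blocks, each a list of (group, sample) tuples in
    the order they should be analyzed.
    """
    rng = random.Random(seed)  # fixed seed so the design is reproducible
    blocks = [[] for _ in range(n_blocks)]
    for group, samples in samples_by_group.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)                  # random allocation to blocks
        for i, sample in enumerate(shuffled):
            blocks[i % n_blocks].append((group, sample))
    for block in blocks:
        rng.shuffle(block)                     # random run order within block
    return blocks
```

With `{"control": ["c1", "c2", "c3", "c4"], "disease": ["d1", "d2", "d3", "d4"]}` and two blocks, each block contains two control and two disease samples, so a block effect such as day-to-day instrument drift is not confounded with the group comparison.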
|
33
|
Platelet proteome changes associated with diabetes and during platelet storage for transfusion. J Proteome Res 2009; 8:2261-72. [PMID: 19267493 DOI: 10.1021/pr800885j] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Indexed: 02/06/2023]
Abstract
Human platelets play a key role in hemostasis and thrombosis and have recently emerged as key regulators of inflammation. Platelets stored for transfusion produce pro-thrombotic and pro-inflammatory mediators implicated in adverse transfusion reactions. Correspondingly, these mediators are central players in pathological conditions including cardiovascular disease, the major cause of death in diabetics. In view of this, a mass spectrometry-based proteomics study was performed on platelets collected from healthy donors and type-2 diabetics and stored for transfusion. Strikingly, our innovative and sensitive proteomic approach identified 122 proteins that were either up- or down-regulated in type-2 diabetics relative to nondiabetic controls and 117 proteins whose abundances changed during a 5-day storage period. Notably, our studies are the first to characterize the proteome of platelets from diabetics before and after storage for transfusion. These identified differences allow us to formulate new hypotheses and experimentation to improve clinical outcomes by targeting "high risk platelets" that render platelet transfusion less effective or even unsafe.
|
34
|
Relationship between Sample Loading Amount and Peptide Identification and Its Effects on Quantitative Proteomics. Anal Chem 2009; 81:1307-14. [DOI: 10.1021/ac801466k] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Indexed: 11/29/2022]
|