1
Frankenfield AM, Yang KL, Mazli WNAB, Shih J, Yu F, Lo E, Nesvizhskii AI, Hao L. Benchmarking SILAC Proteomics Workflows and Data Analysis Platforms. Mol Cell Proteomics 2025; 24:100980. [PMID: 40315959 DOI: 10.1016/j.mcpro.2025.100980] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2024] [Revised: 04/07/2025] [Accepted: 04/28/2025] [Indexed: 05/04/2025] Open
Abstract
Stable isotope labeling by amino acids in cell culture (SILAC) is a powerful metabolic labeling technique with broad applications and various study designs. SILAC proteomics relies on the accurate identification and quantification of all isotopic versions of proteins and peptides during both data acquisition and analysis. However, a comprehensive comparison and evaluation of SILAC data analysis platforms is currently lacking. To address this critical gap and offer practical guidelines for SILAC proteomics data analysis, we designed a comprehensive benchmarking pipeline to evaluate various in vitro SILAC workflows and commonly used data analysis software. Ten different SILAC data analysis workflows using five software packages (MaxQuant, Proteome Discoverer, FragPipe, DIA-NN, and Spectronaut) were evaluated for static and dynamic SILAC labeling with both DDA and DIA methods. For benchmarking, we used both in-house generated and repository SILAC proteomics datasets from HeLa and neuron culture samples. We assessed 12 performance metrics for SILAC proteomics including identification, quantification, accuracy, precision, reproducibility, filtering criteria, missing values, false discovery rate, protein half-life measurement, data completeness, unique software features, and speed of data analysis. Each method/software has its strengths and weaknesses when evaluated for these performance metrics. Most software reaches a dynamic range limit of 100-fold for accurate quantification of light/heavy ratios. We do not recommend using Proteome Discoverer for SILAC DDA analysis despite its wide use in label-free proteomics. To achieve greater confidence in SILAC quantification, researchers could use more than one software package to analyze the same dataset for cross-validation.
In summary, this study offers the first systematic evaluation of various SILAC data analysis platforms, providing practical guidelines to support decision-making in SILAC proteomics study design and data analysis.
Affiliation(s)
- Ashley M Frankenfield
  - Department of Chemistry, George Washington University, Washington, District of Columbia, USA
- Kevin L Yang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
- Jamison Shih
  - Department of Chemistry, George Washington University, Washington, District of Columbia, USA
- Fengchao Yu
  - Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
- Edwin Lo
  - Data Science Institute, University of Chicago, Chicago, Illinois, USA
- Alexey I Nesvizhskii
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA; Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
- Ling Hao
  - Department of Chemistry, George Washington University, Washington, District of Columbia, USA; Department of Chemistry & Biochemistry, University of Maryland, College Park, Maryland, USA
2
Zelter A, Riffle M, Shteynberg DD, Zhong G, Riddle EB, Hoopmann MR, Jaschob D, Moritz RL, Davis TN, MacCoss MJ, Isoherranen N. Detection and Quantification of Drug-Protein Adducts in Human Liver. J Proteome Res 2024; 23:5143-5152. [PMID: 39442081 PMCID: PMC11537226 DOI: 10.1021/acs.jproteome.4c00663] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2024] [Revised: 09/19/2024] [Accepted: 10/10/2024] [Indexed: 10/25/2024]
Abstract
Covalent protein adducts formed by drugs or their reactive metabolites are risk factors for adverse reactions and for inactivation of cytochrome P450 (CYP) enzymes. Characterization of drug-protein adducts is limited by a lack of methods for identifying and quantifying covalent adducts in complex matrices. This study presents a workflow that combines data-dependent and data-independent acquisition (DDA and DIA) based liquid chromatography with tandem mass spectrometry (LC-MS/MS) to detect very low abundance adducts resulting from CYP mediated drug metabolism in human liver microsomes (HLMs). HLMs were incubated with raloxifene as a model compound and adducts were detected in 78 proteins, including CYP3A and CYP2C family enzymes. Experiments with recombinant CYP3A and CYP2C enzymes confirmed adduct formation in all CYPs tested, including CYPs not subject to time-dependent inhibition by raloxifene. These data suggest adducts can be benign. DIA analysis showed variable adduct abundance in many peptides between livers, but no concomitant decrease of unadducted peptides. This study sets a new standard for adduct detection in complex samples, offering insights into the human adductome resulting from reactive metabolite exposure. The methodology presented will aid mechanistic studies to identify, quantify and differentiate between adducts that result in adverse drug reactions and those that are benign.
Affiliation(s)
- Alex Zelter
  - Department of Genome Sciences, Department of Biochemistry, and Department of Pharmaceutics, University of Washington, Seattle, Washington 98195, United States
- Michael Riffle
  - Department of Genome Sciences, Department of Biochemistry, and Department of Pharmaceutics, University of Washington, Seattle, Washington 98195, United States
- Guo Zhong
  - Department of Genome Sciences, Department of Biochemistry, and Department of Pharmaceutics, University of Washington, Seattle, Washington 98195, United States
- Ellen B. Riddle
  - Department of Genome Sciences, Department of Biochemistry, and Department of Pharmaceutics, University of Washington, Seattle, Washington 98195, United States
- Daniel Jaschob
  - Department of Genome Sciences, Department of Biochemistry, and Department of Pharmaceutics, University of Washington, Seattle, Washington 98195, United States
- Robert L. Moritz
  - Institute for Systems Biology, Seattle, Washington 98109, United States
- Trisha N. Davis
  - Department of Genome Sciences, Department of Biochemistry, and Department of Pharmaceutics, University of Washington, Seattle, Washington 98195, United States
- Michael J. MacCoss
  - Department of Genome Sciences, Department of Biochemistry, and Department of Pharmaceutics, University of Washington, Seattle, Washington 98195, United States
- Nina Isoherranen
  - Department of Genome Sciences, Department of Biochemistry, and Department of Pharmaceutics, University of Washington, Seattle, Washington 98195, United States
3
Madej D, Lam H. On the use of tandem mass spectra acquired from samples of evolutionarily distant organisms to validate methods for false discovery rate estimation. Proteomics 2024; 24:e2300398. [PMID: 38491400 DOI: 10.1002/pmic.202300398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Revised: 03/01/2024] [Accepted: 03/06/2024] [Indexed: 03/18/2024]
Abstract
Estimating the false discovery rate (FDR) of peptide identifications is a key step in proteomics data analysis, and many methods have been proposed for this purpose. Recently, an entrapment-inspired protocol to validate methods for FDR estimation appeared in articles showcasing new spectral library search tools. That validation approach involves generating incorrect spectral matches by searching spectra from evolutionarily distant organisms (entrapment queries) against the original target search space. Although this approach may appear similar to the solutions using entrapment databases, it represents a distinct conceptual framework whose correctness has not been verified yet. In this viewpoint, we first discussed the background of the entrapment-based validation protocols and then conducted a few simple computational experiments to verify the assumptions behind them. The results reveal that entrapment databases may, in some implementations, be a reasonable choice for validation, while the assumptions underpinning validation protocols based on entrapment queries are likely to be violated in practice. This article also highlights the need for well-designed frameworks for validating FDR estimation methods in proteomics.
Affiliation(s)
- Dominik Madej
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
- Henry Lam
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
4
Fröhlich K, Fahrner M, Brombacher E, Seredynska A, Maldacker M, Kreutz C, Schmidt A, Schilling O. Data-Independent Acquisition: A Milestone and Prospect in Clinical Mass Spectrometry-Based Proteomics. Mol Cell Proteomics 2024; 23:100800. [PMID: 38880244 PMCID: PMC11380018 DOI: 10.1016/j.mcpro.2024.100800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2024] [Revised: 06/08/2024] [Accepted: 06/13/2024] [Indexed: 06/18/2024] Open
Abstract
Data-independent acquisition (DIA) has revolutionized the field of mass spectrometry (MS)-based proteomics over the past few years. DIA stands out for its ability to systematically sample all peptides in a given m/z range, allowing an unbiased acquisition of proteomics data. This greatly mitigates the issue of missing values and significantly enhances quantitative accuracy, precision, and reproducibility compared to many traditional methods. This review focuses on the critical role of DIA analysis software tools, primarily focusing on their capabilities and the challenges they address in proteomic research. Advances in MS technology, such as trapped ion mobility spectrometry, or high field asymmetric waveform ion mobility spectrometry require sophisticated analysis software capable of handling the increased data complexity and exploiting the full potential of DIA. We identify and critically evaluate leading software tools in the DIA landscape, discussing their unique features, and the reliability of their quantitative and qualitative outputs. We present the biological and clinical relevance of DIA-MS and discuss crucial publications that paved the way for in-depth proteomic characterization in patient-derived specimens. Furthermore, we provide a perspective on emerging trends in clinical applications and present upcoming challenges including standardization and certification of MS-based acquisition strategies in molecular diagnostics. While we emphasize the need for continuous development of software tools to keep pace with evolving technologies, we advise researchers against uncritically accepting the results from DIA software tools. Each tool may have its own biases, and some may not be as sensitive or reliable as others. Our overarching recommendation for both researchers and clinicians is to employ multiple DIA analysis tools, utilizing orthogonal analysis approaches to enhance the robustness and reliability of their findings.
Affiliation(s)
- Klemens Fröhlich
  - Proteomics Core Facility, Biozentrum Basel, University of Basel, Basel, Switzerland
- Matthias Fahrner
  - Institute for Surgical Pathology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany; German Cancer Consortium (DKTK) and Cancer Research Center (DKFZ), Freiburg, Germany
- Eva Brombacher
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, Freiburg, Germany; Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Freiburg, Germany; Spemann Graduate School of Biology and Medicine (SGBM), University of Freiburg, Freiburg, Germany; Faculty of Biology, University of Freiburg, Freiburg, Germany
- Adrianna Seredynska
  - Institute for Surgical Pathology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany; German Cancer Consortium (DKTK) and Cancer Research Center (DKFZ), Freiburg, Germany; Faculty of Biology, University of Freiburg, Freiburg, Germany
- Maximilian Maldacker
  - Institute for Surgical Pathology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany; Faculty of Biology, University of Freiburg, Freiburg, Germany
- Clemens Kreutz
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, Freiburg, Germany; Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Freiburg, Germany
- Alexander Schmidt
  - Proteomics Core Facility, Biozentrum Basel, University of Basel, Basel, Switzerland
- Oliver Schilling
  - Institute for Surgical Pathology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany; German Cancer Consortium (DKTK) and Cancer Research Center (DKFZ), Freiburg, Germany
5
Chen YE, Ge X, Woyshner K, McDermott M, Manousopoulou A, Ficarro SB, Marto JA, Li K, Wang LD, Li JJ. APIR: Aggregating Universal Proteomics Database Search Algorithms for Peptide Identification with FDR Control. Genomics Proteomics Bioinformatics 2024; 22:qzae042. [PMID: 39198030 DOI: 10.1093/gpbjnl/qzae042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Revised: 02/26/2024] [Accepted: 03/11/2024] [Indexed: 09/01/2024]
Abstract
Advances in mass spectrometry (MS) have enabled high-throughput analysis of proteomes in biological systems. The state-of-the-art MS data analysis relies on database search algorithms to quantify proteins by identifying peptide-spectrum matches (PSMs), which convert mass spectra to peptide sequences. Different database search algorithms use distinct search strategies and thus may identify unique PSMs. However, no existing approaches can aggregate all user-specified database search algorithms with a guaranteed increase in the number of identified peptides and a control on the false discovery rate (FDR). To fill in this gap, we proposed a statistical framework, Aggregation of Peptide Identification Results (APIR), that is universally compatible with all database search algorithms. Notably, under an FDR threshold, APIR is guaranteed to identify at least as many, if not more, peptides as individual database search algorithms do. Evaluation of APIR on a complex proteomics standard dataset showed that APIR outpowers individual database search algorithms and empirically controls the FDR. Real data studies showed that APIR can identify disease-related proteins and post-translational modifications missed by some individual database search algorithms. The APIR framework is easily extendable to aggregating discoveries made by multiple algorithms in other high-throughput biomedical data analysis, e.g., differential gene expression analysis on RNA sequencing data. The APIR R package is available at https://github.com/yiling0210/APIR.
Affiliation(s)
- Yiling Elaine Chen
  - Department of Statistics and Data Science, University of California, Los Angeles, CA 90095, USA
- Xinzhou Ge
  - Department of Statistics and Data Science, University of California, Los Angeles, CA 90095, USA
- Kyla Woyshner
  - Department of Immuno-Oncology, Beckman Research Institute, City of Hope National Medical Center, Duarte, CA 91010, USA
- MeiLu McDermott
  - Department of Immuno-Oncology, Beckman Research Institute, City of Hope National Medical Center, Duarte, CA 91010, USA
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Antigoni Manousopoulou
  - Department of Immuno-Oncology, Beckman Research Institute, City of Hope National Medical Center, Duarte, CA 91010, USA
- Scott B Ficarro
  - Department of Cancer Biology and Blais Proteomics Center, Dana-Farber Cancer Institute, Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02215, USA
- Jarrod A Marto
  - Department of Cancer Biology and Blais Proteomics Center, Dana-Farber Cancer Institute, Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02215, USA
- Kexin Li
  - Department of Statistics and Data Science, University of California, Los Angeles, CA 90095, USA
- Leo David Wang
  - Department of Immuno-Oncology, Beckman Research Institute, City of Hope National Medical Center, Duarte, CA 91010, USA
  - Department of Pediatrics, City of Hope National Medical Center, Duarte, CA 91010, USA
- Jingyi Jessica Li
  - Department of Statistics and Data Science, University of California, Los Angeles, CA 90095, USA
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, CA 90095, USA
  - Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
  - Department of Computational Medicine, University of California, Los Angeles, CA 90095, USA
  - Department of Biostatistics, University of California, Los Angeles, CA 90095, USA
6
Freestone J, Noble WS, Keich U. Reinvestigating the Correctness of Decoy-Based False Discovery Rate Control in Proteomics Tandem Mass Spectrometry. J Proteome Res 2024; 23:1907-1914. [PMID: 38687997 DOI: 10.1021/acs.jproteome.3c00902] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/02/2024]
Abstract
Traditional database search methods for the analysis of bottom-up proteomics tandem mass spectrometry (MS/MS) data are limited in their ability to detect peptides with post-translational modifications (PTMs). Recently, "open modification" database search strategies, in which the requirement that the mass of the database peptide closely matches the observed precursor mass is relaxed, have become popular as ways to find a wider variety of types of PTMs. Indeed, in one study, Kong et al. reported that the open modification search tool MSFragger can achieve higher statistical power to detect peptides than a traditional "narrow window" database search. We investigated this claim empirically and, in the process, uncovered a potential general problem with false discovery rate (FDR) control in the machine learning postprocessors Percolator and PeptideProphet. This problem might have contributed to Kong et al.'s report that their empirical results suggest that FDR control in the narrow window setting might generally be compromised. Indeed, reanalyzing the same data while using a more standard form of target-decoy competition-based FDR control, we found that, after accounting for chimeric spectra as well as for the inherent difference in the number of candidates in open and narrow searches, the data does not provide sufficient evidence that FDR control in proteomics MS/MS database search is inherently problematic.
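The difference between a narrow-window and an open search described in this abstract can be illustrated with a minimal sketch. It is not MSFragger's algorithm: the function name `candidate_peptides`, the peptide labels, the masses, and the use of an absolute Dalton window (rather than ppm) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def candidate_peptides(
    precursor_mass: float,
    peptide_masses: Dict[str, float],
    window_da: float,
) -> List[Tuple[str, float]]:
    """Return (peptide, mass_delta) pairs for database peptides whose mass
    lies within +/- window_da of the observed precursor mass."""
    hits = []
    for name, mass in peptide_masses.items():
        delta = precursor_mass - mass
        if abs(delta) <= window_da:
            hits.append((name, round(delta, 4)))
    return hits


# Hypothetical database of unmodified peptide masses in Da (made-up values).
database = {"pep_A": 1000.50, "pep_B": 1200.60}
# Hypothetical precursor: pep_A carrying an unexpected +79.97 Da mass shift.
observed = 1080.47

# Narrow window: the modified peptide has no candidate at all.
narrow = candidate_peptides(observed, database, window_da=0.02)
# Open window: both peptides become candidates, each with a mass delta.
open_search = candidate_peptides(observed, database, window_da=500.0)

print(narrow)
print(open_search)
```

The open search recovers `pep_A` with a mass delta that could correspond to a modification, but it also admits more candidates per spectrum, which is the difference in candidate counts the abstract says must be accounted for when comparing the two settings.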
Affiliation(s)
- Jack Freestone
  - School of Mathematics and Statistics F07, University of Sydney, New South Wales 2006, Australia
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States
- Uri Keich
  - School of Mathematics and Statistics F07, University of Sydney, New South Wales 2006, Australia
7
Lin A, See D, Fondrie WE, Keich U, Noble WS. Target-decoy false discovery rate estimation using Crema. Proteomics 2024; 24:e2300084. [PMID: 38380501 DOI: 10.1002/pmic.202300084] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 06/13/2023] [Revised: 01/06/2024] [Accepted: 01/16/2024] [Indexed: 02/22/2024]
Abstract
Assigning statistical confidence estimates to discoveries produced by a tandem mass spectrometry proteomics experiment is critical to enabling principled interpretation of the results and assessing the cost/benefit ratio of experimental follow-up. The most common technique for computing such estimates is to use target-decoy competition (TDC), in which observed spectra are searched against a database of real (target) peptides and a database of shuffled or reversed (decoy) peptides. TDC procedures for estimating the false discovery rate (FDR) at a given score threshold have been developed for application at the level of spectra, peptides, or proteins. Although these techniques are relatively straightforward to implement, it is common in the literature to skip over the implementation details or even to make mistakes in how the TDC procedures are applied in practice. Here we present Crema, an open-source Python tool that implements several TDC methods of spectrum-, peptide- and protein-level FDR estimation. Crema is compatible with a variety of existing database search tools and provides a straightforward way to obtain robust FDR estimates.
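As a rough illustration of the target-decoy competition idea summarized in this abstract, the sketch below estimates the FDR at a score threshold. It does not use Crema's API: the function names `compete` and `tdc_fdr`, the example scores, the tie-breaking rule, and the common `(decoys + 1) / targets` estimator are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def compete(
    target_scores: Sequence[float], decoy_scores: Sequence[float]
) -> List[Tuple[float, bool]]:
    """For each spectrum, keep the better of its best target and best decoy
    match. Returns (score, is_decoy) pairs. Ties are assigned to the decoy,
    which is a conservative assumption made for this sketch."""
    winners = []
    for t, d in zip(target_scores, decoy_scores):
        if t > d:
            winners.append((t, False))
        else:
            winners.append((d, True))
    return winners


def tdc_fdr(winners: Sequence[Tuple[float, bool]], threshold: float) -> float:
    """Estimate the FDR among target wins scoring at or above the threshold
    as (decoy wins + 1) / target wins, capped at 1."""
    targets = sum(1 for s, is_decoy in winners if s >= threshold and not is_decoy)
    decoys = sum(1 for s, is_decoy in winners if s >= threshold and is_decoy)
    return min(1.0, (decoys + 1) / max(targets, 1))


# Hypothetical best target and decoy scores for six spectra (made-up values).
target_scores = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0]
decoy_scores = [1.0, 2.0, 3.0, 4.0, 7.0, 6.0]

winners = compete(target_scores, decoy_scores)
print(tdc_fdr(winners, 7.0))  # 4 target wins, 1 decoy win -> 0.5
```

The `+ 1` in the numerator keeps the estimate from being optimistic when few decoys pass the threshold; estimators without it also appear in practice, so the choice here is one option rather than a prescribed standard.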
Affiliation(s)
- Andy Lin
  - Chemical and Biological Signatures, Pacific Northwest National Laboratory, Seattle, Washington, USA
- Donavan See
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, USA
- Uri Keich
  - School of Mathematics and Statistics, University of Sydney, Sydney, Australia
- William Stafford Noble
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, USA
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
8
Strauss MT, Bludau I, Zeng WF, Voytik E, Ammar C, Schessner JP, Ilango R, Gill M, Meier F, Willems S, Mann M. AlphaPept: a modern and open framework for MS-based proteomics. Nat Commun 2024; 15:2168. [PMID: 38461149 PMCID: PMC10924963 DOI: 10.1038/s41467-024-46485-4] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Received: 12/20/2022] [Accepted: 02/20/2024] [Indexed: 03/11/2024] Open
Abstract
In common with other omics technologies, mass spectrometry (MS)-based proteomics produces ever-increasing amounts of raw data, making efficient analysis a principal challenge. A plethora of different computational tools can process the MS data to derive peptide and protein identification and quantification. However, in recent years there has been dramatic progress in computer science, including collaboration tools that have transformed research and industry. To leverage these advances, we develop AlphaPept, a Python-based open-source framework for efficient processing of large high-resolution MS data sets. Numba for just-in-time compilation on CPU and GPU achieves hundred-fold speed improvements. AlphaPept uses the Python scientific stack of highly optimized packages, reducing the code base to domain-specific tasks while accessing the latest advances. We provide an easy on-ramp for community contributions through the concept of literate programming, implemented in Jupyter Notebooks. Large datasets can rapidly be processed as shown by the analysis of hundreds of proteomes in minutes per file, many-fold faster than acquisition. AlphaPept can be used to build automated processing pipelines with web-serving functionality and compatibility with downstream analysis tools. It provides easy access via one-click installation, a modular Python library for advanced users, and an open GitHub repository for developers.
Affiliation(s)
- Maximilian T Strauss
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
  - NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
- Isabell Bludau
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Wen-Feng Zeng
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Eugenia Voytik
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Constantin Ammar
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Julia P Schessner
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Florian Meier
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
  - Functional Proteomics, Jena University Hospital, Jena, Germany
- Sander Willems
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Matthias Mann
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
  - NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
9
Yu F, Teo GC, Kong AT, Fröhlich K, Li GX, Demichev V, Nesvizhskii AI. Analysis of DIA proteomics data using MSFragger-DIA and FragPipe computational platform. Nat Commun 2023; 14:4154. [PMID: 37438352 PMCID: PMC10338508 DOI: 10.1038/s41467-023-39869-5] [Citation(s) in RCA: 92] [Impact Index Per Article: 46.0] [Received: 11/07/2022] [Accepted: 06/28/2023] [Indexed: 07/14/2023] Open
Abstract
Liquid chromatography (LC) coupled with data-independent acquisition (DIA) mass spectrometry (MS) has been increasingly used in quantitative proteomics studies. Here, we present a fast and sensitive approach for direct peptide identification from DIA data, MSFragger-DIA, which leverages the unmatched speed of the fragment ion indexing-based search engine MSFragger. Different from most existing methods, MSFragger-DIA conducts a database search of the DIA tandem mass (MS/MS) spectra prior to spectral feature detection and peak tracing across the LC dimension. To streamline the analysis of DIA data and enable easy reproducibility, we integrate MSFragger-DIA into the FragPipe computational platform for seamless support of peptide identification and spectral library building from DIA, data-dependent acquisition (DDA), or both data types combined. We compare MSFragger-DIA with other DIA tools, such as DIA-Umpire based workflow in FragPipe, Spectronaut, DIA-NN library-free, and MaxDIA. We demonstrate the fast, sensitive, and accurate performance of MSFragger-DIA across a variety of sample types and data acquisition schemes, including single-cell proteomics, phosphoproteomics, and large-scale tumor proteome profiling studies.
Affiliation(s)
- Fengchao Yu
  - Department of Pathology, University of Michigan, Ann Arbor, MI, USA
- Guo Ci Teo
  - Department of Pathology, University of Michigan, Ann Arbor, MI, USA
- Andy T Kong
  - Department of Pathology, University of Michigan, Ann Arbor, MI, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Klemens Fröhlich
  - Proteomics Core Facility, Biozentrum, University of Basel, Basel, Switzerland
- Ginny Xiaohe Li
  - Department of Pathology, University of Michigan, Ann Arbor, MI, USA
- Vadim Demichev
  - Department of Biochemistry, Charité - Universitätsmedizin Berlin, Berlin, Germany
  - Department of Biochemistry, University of Cambridge, Cambridge, UK
- Alexey I Nesvizhskii
  - Department of Pathology, University of Michigan, Ann Arbor, MI, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
10
Phlairaharn T, Ye Z, Krismer E, Pedersen AK, Pietzner M, Olsen JV, Schoof EM, Searle BC. Optimizing Linear Ion-Trap Data-Independent Acquisition toward Single-Cell Proteomics. Anal Chem 2023; 95:9881-9891. [PMID: 37338819 DOI: 10.1021/acs.analchem.3c00842] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 06/21/2023]
Abstract
A linear ion trap (LIT) is an affordable, robust mass spectrometer that provides fast scanning speed and high sensitivity, where its primary disadvantage is inferior mass accuracy compared to more commonly used time-of-flight or orbitrap (OT) mass analyzers. Previous efforts to utilize the LIT for low-input proteomics analysis still rely on either built-in OTs for collecting precursor data or OT-based library generation. Here, we demonstrate the potential versatility of the LIT for low-input proteomics as a stand-alone mass analyzer for all mass spectrometry (MS) measurements, including library generation. To test this approach, we first optimized LIT data acquisition methods and performed library-free searches with and without entrapment peptides to evaluate both the detection and quantification accuracy. We then generated matrix-matched calibration curves to estimate the lower limit of quantification using only 10 ng of starting material. While LIT-MS1 measurements provided poor quantitative accuracy, LIT-MS2 measurements were quantitatively accurate down to 0.5 ng on the column. Finally, we optimized a suitable strategy for spectral library generation from low-input material, which we used to analyze single-cell samples by LIT-DIA using LIT-based libraries generated from as few as 40 cells.
Collapse
Affiliation(s)
- Teeradon Phlairaharn
- The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, København 2200, Denmark
- Department of Bioscience, TUM School of Natural Sciences, Technical University of Munich, Garching (bei München) 85748, Germany
- Computational Medicine, Berlin Institute of Health at Charité─Universitätsmedizin Berlin, Berlin 10117, Germany
| | - Zilu Ye
- The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, København 2200, Denmark
| | - Elena Krismer
- The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, København 2200, Denmark
| | - Anna-Kathrine Pedersen
- The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, København 2200, Denmark
| | - Maik Pietzner
- Computational Medicine, Berlin Institute of Health at Charité─Universitätsmedizin Berlin, Berlin 10117, Germany
| | - Jesper V Olsen
- The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, København 2200, Denmark
| | - Erwin M Schoof
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Lyngby 2800, Denmark
| | - Brian C Searle
- Pelotonia Institute for Immuno-Oncology, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio 43210, United States
- Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio 43210, United States
|
11
|
Zhang Q. Mzion enables deep and precise identification of peptides in data-dependent acquisition proteomics. Sci Rep 2023; 13:7056. [PMID: 37120666 PMCID: PMC10148867 DOI: 10.1038/s41598-023-34323-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2023] [Accepted: 04/27/2023] [Indexed: 05/01/2023] Open
Abstract
Sensitive and reliable identification of proteins and peptides forms the basis of proteomics. We introduce Mzion, a new database search tool for data-dependent acquisition (DDA) proteomics. Our tool utilizes an intensity tally strategy and generally achieves higher performance in terms of depth and precision across 20 datasets, ranging from large-scale to single-cell proteomics. Compared to several other search engines, Mzion matches on average 20% more peptide spectra at tryptic enzymatic specificity and 80% more at no enzymatic specificity from six large-scale, global datasets. Mzion also identifies more phosphopeptide spectra that can be explained by fewer proteins, demonstrated by six large-scale, local datasets corresponding to the global data. Our findings highlight the potential of Mzion for improving proteomic analysis and advancing our understanding of protein biology.
Affiliation(s)
- Qiang Zhang
- Division of Endocrinology, Metabolism and Lipid Research, Washington University School of Medicine, St. Louis, MO, USA.
|
12
|
Phlairaharn T, Ye Z, Krismer E, Pedersen AK, Pietzner M, Olsen JV, Schoof EM, Searle BC. Optimizing linear ion trap data independent acquisition towards single cell proteomics. bioRxiv 2023:2023.02.21.529444. [PMID: 36865114 PMCID: PMC9980145 DOI: 10.1101/2023.02.21.529444] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/23/2023]
Abstract
A linear ion trap (LIT) is an affordable, robust mass spectrometer that provides fast scanning speed and high sensitivity, where its primary disadvantage is inferior mass accuracy compared to more commonly used time-of-flight (TOF) or orbitrap (OT) mass analyzers. Previous efforts to utilize the LIT for low-input proteomics analysis still rely on either built-in OTs for collecting precursor data or OT-based library generation. Here, we demonstrate the potential versatility of the LIT for low-input proteomics as a stand-alone mass analyzer for all mass spectrometry measurements, including library generation. To test this approach, we first optimized LIT data acquisition methods and performed library-free searches with and without entrapment peptides to evaluate both the detection and quantification accuracy. We then generated matrix-matched calibration curves to estimate the lower limit of quantification using only 10 ng of starting material. While LIT-MS1 measurements provided poor quantitative accuracy, LIT-MS2 measurements were quantitatively accurate down to 0.5 ng on column. Finally, we optimized a suitable strategy for spectral library generation from low-input material, which we used to analyze single-cell samples by LIT-DIA using LIT-based libraries generated from as few as 40 cells.
|
13
|
The M, Käll L. Integrating Identification and Quantification Uncertainty for Differential Protein Abundance Analysis with Triqler. Methods Mol Biol 2023; 2426:91-117. [PMID: 36308686 DOI: 10.1007/978-1-0716-1967-4_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Protein quantification for shotgun proteomics is a complicated process where errors can be introduced in each of the steps. Triqler is a Python package that estimates and integrates errors of the different parts of the label-free protein quantification pipeline into a single Bayesian model. Specifically, it weighs the quantitative values by the confidence we have in the correctness of the corresponding PSM. Furthermore, it treats missing values in a way that reflects their uncertainty relative to observed values. Finally, it combines these error estimates in a single differential abundance FDR that not only reflects the errors and uncertainties in quantification but also in identification. In this tutorial, we show how to (1) generate input data for Triqler from quantification packages such as MaxQuant and Quandenser, (2) run Triqler and what the different options are, (3) interpret the results, (4) investigate the posterior distributions of a protein of interest in detail, and (5) verify that the hyperparameter estimations are sensible.
Affiliation(s)
- Matthew The
- Chair of Proteomics and Bioanalytics, Technische Universität München, Freising, Germany.
| | - Lukas Käll
- Science for Life Laboratory, KTH Royal Institute of Technology, Solna, Sweden
|
14
|
Hasam S, Emery K, Noble WS, Keich U. A Pipeline for Peptide Detection Using Multiple Decoys. Methods Mol Biol 2023; 2426:25-34. [PMID: 36308683 DOI: 10.1007/978-1-0716-1967-4_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Target-decoy competition has been commonly used for over a decade to control the false discovery rate when analyzing tandem mass spectrometry (MS/MS) data. We recently developed a framework that uses multiple decoys to increase the number of detected peptides in MS/MS data. Here, we present a pipeline of Apache licensed, open-source software that allows the user to readily take advantage of our framework.
Affiliation(s)
- Uri Keich
- University of Sydney, Sydney, NSW, Australia.
|
15
|
Reanalysis of ProteomicsDB Using an Accurate, Sensitive, and Scalable False Discovery Rate Estimation Approach for Protein Groups. Mol Cell Proteomics 2022; 21:100437. [PMID: 36328188 PMCID: PMC9718969 DOI: 10.1016/j.mcpro.2022.100437] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 06/01/2022] [Revised: 10/16/2022] [Accepted: 10/28/2022] [Indexed: 11/07/2022] Open
Abstract
Estimating false discovery rates (FDRs) of protein identification continues to be an important topic in mass spectrometry-based proteomics, particularly when analyzing very large datasets. One performant method for this purpose is the Picked Protein FDR approach which is based on a target-decoy competition strategy on the protein level that ensures that FDRs scale to large datasets. Here, we present an extension to this method that can also deal with protein groups, that is, proteins that share common peptides such as protein isoforms of the same gene. To obtain well-calibrated FDR estimates that preserve protein identification sensitivity, we introduce two novel ideas. First, the picked group target-decoy and second, the rescued subset grouping strategies. Using entrapment searches and simulated data for validation, we demonstrate that the new Picked Protein Group FDR method produces accurate protein group-level FDR estimates regardless of the size of the data set. The validation analysis also uncovered that applying the commonly used Occam's razor principle leads to anticonservative FDR estimates for large datasets. This is not the case for the Picked Protein Group FDR method. Reanalysis of deep proteomes of 29 human tissues showed that the new method identified up to 4% more protein groups than MaxQuant. Applying the method to the reanalysis of the entire human section of ProteomicsDB led to the identification of 18,000 protein groups at 1% protein group-level FDR. The analysis also showed that about 1250 genes were represented by ≥2 identified protein groups. To make the method accessible to the proteomics community, we provide a software tool including a graphical user interface that enables merging results from multiple MaxQuant searches into a single list of identified and quantified protein groups.
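The protein-level target-decoy competition mentioned in this abstract can be illustrated with a minimal sketch. It covers only the general "picked" idea as commonly described (each target protein competes with its own decoy and only the winner is kept before FDR estimation); it does not implement the protein-group extension, the picked group target-decoy step, or rescued subset grouping. The function name, the dictionary inputs, and the assumption that decoys are keyed by the identifier of the target they were derived from are all hypothetical.

```python
def picked_protein_qvalues(target_scores, decoy_scores):
    """Sketch of a picked-style protein FDR (assumed inputs, not the published tool).

    target_scores / decoy_scores: dict mapping protein id -> best score
    (higher is better). A decoy shares the id of the target it mirrors.
    Returns a dict of q-values for target proteins that beat their decoy.
    """
    picked = []
    for prot in set(target_scores) | set(decoy_scores):
        t = target_scores.get(prot, float("-inf"))
        d = decoy_scores.get(prot, float("-inf"))
        # Keep only the winner of each target/decoy pair; ties go to the decoy.
        picked.append((t, prot, False) if t > d else (d, prot, True))

    picked.sort(key=lambda entry: entry[0], reverse=True)

    # Running FDR estimate (decoys / targets) down the ranked list.
    fdrs = []
    targets = decoys = 0
    for _, _, is_decoy in picked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: smallest FDR at this rank or any lower-scoring rank.
    qvalues = {}
    running_min = float("inf")
    for (_, prot, is_decoy), fdr in zip(reversed(picked), reversed(fdrs)):
        running_min = min(running_min, fdr)
        if not is_decoy:
            qvalues[prot] = running_min
    return qvalues
```

Because one entry survives per target/decoy pair, the number of decoys in the ranked list does not grow faster than the number of targets as more data are merged, which is the scaling property the abstract refers to.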
|
16
|
Lee S, Park H, Kim H. False discovery rate estimation using candidate peptides for each spectrum. BMC Bioinformatics 2022; 23:454. [PMID: 36319948 PMCID: PMC9623924 DOI: 10.1186/s12859-022-05002-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2022] [Accepted: 10/25/2022] [Indexed: 11/06/2022] Open
Abstract
BACKGROUND False discovery rate (FDR) estimation is very important in proteomics. The target-decoy strategy (TDS), which is often used for FDR estimation, estimates the FDR under the assumption that when spectra are identified incorrectly, the probabilities of the spectra matching the target or decoy peptides are identical. However, in practice these two probabilities are rarely identical. We propose cTDS (target-decoy strategy with candidate peptides) for accurate estimation of the FDR using the probability that the spectrum is identified incorrectly as a target or decoy peptide. RESULTS For most spectra, the probability of being identified incorrectly as a target or decoy peptide is close to 0.5, but only about 1.14-4.85% of the total spectra have an exact probability of 0.5. We used an entrapment sequence method to demonstrate the accuracy of cTDS. For fixed FDR thresholds (1-10%), the false match rate (FMR) in cTDS is closer than the FMR in TDS. We compared the number of peptide-spectrum matches (PSMs) obtained with TDS and cTDS at a 1% FDR threshold with the HEK293 dataset. In the first and third replications, the number of PSMs obtained with cTDS for the reverse, pseudo-reverse, shuffle, and de Bruijn databases exceeded those obtained with TDS (by about 0.001-0.132%), whereas the pseudo-shuffle database yielded fewer PSMs than TDS (by about 0.05-0.126%). In the second replication, the number of PSMs obtained with cTDS for all databases exceeded that obtained with TDS (by about 0.013-0.274%). CONCLUSIONS When spectra are actually identified incorrectly, most probabilities of the spectra matching a target or decoy peptide are not identical. Therefore, we propose cTDS, which estimates the FDR more accurately using the probability of the spectrum being identified incorrectly as a target or decoy peptide.
Affiliation(s)
- Sangjeong Lee
- Department of Computer Science, Hanyang University, Seoul, 06978 Republic of Korea
| | - Heejin Park
- Department of Computer Science, Hanyang University, Seoul, 06978 Republic of Korea
| | - Hyunwoo Kim
- Biomedical Informatics Team, Korea Institute of Science and Technology Information, Daejeon, 34141 Republic of Korea
|
17
|
Lin A, Short T, Noble WS, Keich U. Improving Peptide-Level Mass Spectrometry Analysis via Double Competition. J Proteome Res 2022; 21:2412-2420. [PMID: 36166314 PMCID: PMC10108709 DOI: 10.1021/acs.jproteome.2c00282] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 11/29/2022]
Abstract
The analysis of shotgun proteomics data often involves generating lists of inferred peptide-spectrum matches (PSMs) and/or of peptides. The canonical approach for generating these discovery lists is by controlling the false discovery rate (FDR), most commonly through target-decoy competition (TDC). At the PSM level, TDC is implemented by competing each spectrum's best-scoring target (real) peptide match with its best match against a decoy database. This PSM-level procedure can be adapted to the peptide level by selecting the top-scoring PSM per peptide prior to FDR estimation. Here, we first highlight and empirically augment a little known previous work by He et al., which showed that TDC-based PSM-level FDR estimates can be liberally biased. We thus propose that researchers instead focus on peptide-level analysis. We then investigate three ways to carry out peptide-level TDC and show that the most common method ("PSM-only") offers the lowest statistical power in practice. An alternative approach that carries out a double competition, first at the PSM and then at the peptide level ("PSM-and-peptide"), is the most powerful method, yielding an average increase of 17% more discovered peptides at 1% FDR threshold relative to the PSM-only method.
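The basic target-decoy competition described in this abstract, and its "PSM-only" adaptation to the peptide level, can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes each input PSM is already the winner of its spectrum's target-versus-decoy competition, uses the common conservative (decoys + 1) / targets estimate, and does not implement the "PSM-and-peptide" double competition, which additionally requires pairing each target peptide with its decoy. The `PSM` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    spectrum_id: str
    peptide: str
    score: float      # higher is better
    is_decoy: bool    # True if the winning match came from the decoy database


def tdc_accept(psms, fdr_threshold=0.01):
    """Return the target matches accepted at the given estimated FDR."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked entries to keep
    for i, p in enumerate(ranked, start=1):
        if p.is_decoy:
            decoys += 1
        else:
            targets += 1
        # Conservative estimate with the +1 correction on the decoy count.
        if targets and (decoys + 1) / targets <= fdr_threshold:
            cutoff = i
    return [p for p in ranked[:cutoff] if not p.is_decoy]


def best_per_peptide(psms):
    """Keep the top-scoring PSM for each peptide (targets and decoys alike)."""
    best = {}
    for p in psms:
        if p.peptide not in best or p.score > best[p.peptide].score:
            best[p.peptide] = p
    return list(best.values())
```

Under these assumptions, a peptide-level list in the "PSM-only" style would be obtained with `tdc_accept(best_per_peptide(psms))`.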
Affiliation(s)
- Andy Lin
- Chemical and Biological Signatures, Pacific Northwest National Laboratory, Seattle, Washington 98109, United States
| | - Temana Short
- School of Mathematics & Statistics, University of Sydney, New South Wales, 2006, Australia
| | - William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States
| | - Uri Keich
- School of Mathematics & Statistics, University of Sydney, New South Wales, 2006, Australia
|
18
|
Freestone J, Short T, Noble WS, Keich U. Group-walk: a rigorous approach to group-wise false discovery rate analysis by target-decoy competition. Bioinformatics 2022; 38:ii82-ii88. [PMID: 36124786 DOI: 10.1093/bioinformatics/btac471] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/25/2022] Open
Abstract
MOTIVATION Target-decoy competition (TDC) is a commonly used method for false discovery rate (FDR) control in the analysis of tandem mass spectrometry data. This type of competition-based FDR control has recently gained significant popularity in other fields after Barber and Candès laid its theoretical foundation in a more general setting that included the feature selection problem. In both cases, the competition is based on a head-to-head comparison between an (observed) target score and a corresponding decoy (knockoff) score. However, the effectiveness of TDC depends on whether the data are homogeneous, which is often not the case: in many settings, the data consist of groups with different score profiles or different proportions of true nulls. In such cases, applying TDC while ignoring the group structure often yields imbalanced lists of discoveries, where some groups might include relatively many false discoveries and other groups include relatively very few. On the other hand, as we show, the alternative approach of applying TDC separately to each group does not rigorously control the FDR. RESULTS We developed Group-walk, a procedure that controls the FDR in the target-decoy/knockoff setting while taking into account a given group structure. Group-walk is derived from the recently developed AdaPT, a general framework for controlling the FDR with side-information. We show using simulated and real datasets that when the data naturally divide into groups with different characteristics, Group-walk can deliver consistent power gains that in some cases are substantial. These groupings include the precursor charge state (4% more discovered peptides at 1% FDR threshold), the peptide length (3.6% increase) and the mass difference due to modifications (26% increase). AVAILABILITY AND IMPLEMENTATION Group-walk is available at https://cran.r-project.org/web/packages/groupwalk/index.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jack Freestone
- School of Mathematics and Statistics F07, University of Sydney, Sydney 2006, Australia
| | - Temana Short
- School of Mathematics and Statistics F07, University of Sydney, Sydney 2006, Australia
| | - Uri Keich
- School of Mathematics and Statistics F07, University of Sydney, Sydney 2006, Australia
|
19
|
Physiological and molecular responses of lobe coral indicate nearshore adaptations to anthropogenic stressors. Sci Rep 2021; 11:3423. [PMID: 33564085 PMCID: PMC7873073 DOI: 10.1038/s41598-021-82569-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 10/02/2020] [Accepted: 01/18/2021] [Indexed: 01/08/2023] Open
Abstract
Corals in nearshore marine environments are increasingly exposed to reduced water quality, which is the primary local threat to Hawaiian coral reefs. It is unclear if corals surviving in such conditions have adapted to withstand sedimentation, pollutants, and other environmental stressors. Lobe coral populations from Maunalua Bay, Hawaii showed clear genetic differentiation between the 'polluted, high-stress' nearshore site and the 'less polluted, lower-stress' offshore site. To understand the driving force of the observed genetic partitioning, reciprocal transplant and common-garden experiments were conducted to assess phenotypic differences between these two populations. Physiological responses differed significantly between the populations, revealing more stress-resilient traits in the nearshore corals. Changes in protein profiles highlighted the inherent differences in the cellular metabolic processes and activities between the two; nearshore corals did not significantly alter their proteome between the sites, while offshore corals responded to nearshore transplantation with increased abundances of proteins associated with detoxification, antioxidant defense, and regulation of cellular metabolic processes. The response differences across multiple phenotypes between the populations suggest local adaptation of nearshore corals to reduced water quality. Our results provide insight into coral’s adaptive potential and its underlying processes, and reveal potential protein biomarkers that could be used to predict resiliency.
|
20
|
Sherafat E, Force J, Măndoiu II. Semi-supervised learning for somatic variant calling and peptide identification in personalized cancer immunotherapy. BMC Bioinformatics 2020; 21:498. [PMID: 33375939 PMCID: PMC7772914 DOI: 10.1186/s12859-020-03813-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 10/05/2020] [Accepted: 10/13/2020] [Indexed: 02/03/2023] Open
Abstract
BACKGROUND Personalized cancer vaccines are emerging as one of the most promising approaches to immunotherapy of advanced cancers. However, only a small proportion of the neoepitopes generated by somatic DNA mutations in cancer cells lead to tumor rejection. Since it is impractical to experimentally assess all candidate neoepitopes prior to vaccination, developing accurate methods for predicting tumor-rejection mediating neoepitopes (TRMNs) is critical for enabling routine clinical use of cancer vaccines. RESULTS In this paper we introduce Positive-unlabeled Learning using AuTOml (PLATO), a general semi-supervised approach to improving accuracy of model-based classifiers. PLATO generates a set of high confidence positive calls by applying a stringent filter to model-based predictions, then rescores remaining candidates by using positive-unlabeled learning. To achieve robust performance on clinical samples with large patient-to-patient variation, PLATO further integrates AutoML hyper-parameter tuning, classification threshold selection based on spies, and support for bootstrapping. CONCLUSIONS Experimental results on real datasets demonstrate that PLATO has improved performance compared to model-based approaches for two key steps in TRMN prediction, namely somatic variant calling from exome sequencing data and peptide identification from MS/MS data.
Affiliation(s)
- Elham Sherafat
- Computer Science and Engineering Department, University of Connecticut, Storrs, CT, 06269, USA
| | - Jordan Force
- Computer Science and Engineering Department, University of Connecticut, Storrs, CT, 06269, USA
| | - Ion I Măndoiu
- Computer Science and Engineering Department, University of Connecticut, Storrs, CT, 06269, USA.
|
21
|
Couté Y, Bruley C, Burger T. Beyond Target-Decoy Competition: Stable Validation of Peptide and Protein Identifications in Mass Spectrometry-Based Discovery Proteomics. Anal Chem 2020; 92:14898-14906. [PMID: 32970414 DOI: 10.1021/acs.analchem.0c00328] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Indexed: 12/17/2022]
Abstract
In bottom-up discovery proteomics, target-decoy competition (TDC) is the most popular method for false discovery rate (FDR) control. Despite unquestionable statistical foundations, this method has drawbacks, including its hitherto unknown intrinsic lack of stability vis-à-vis practical conditions of application. Although some consequences of this instability have already been empirically described, they may have been misinterpreted. This article provides evidence that TDC has become less reliable as the accuracy of modern mass spectrometers improved. We therefore propose to replace TDC by a totally different method to control the FDR at the spectrum, peptide, and protein levels, while benefiting from the theoretical guarantees of the Benjamini-Hochberg framework. As this method is simpler to use, faster to compute, and more stable than TDC, we argue that it is better adapted to the standardization and throughput constraints of current proteomic platforms.
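For readers unfamiliar with the Benjamini-Hochberg framework named in this abstract, the standard step-up adjustment can be sketched as below. This shows only the textbook BH procedure applied to p-values that are assumed to be already available; how the article derives p-values from search-engine scores is not covered here, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An identification would then be retained when its adjusted value is at or below the chosen FDR level, for example 0.01.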
Affiliation(s)
- Yohann Couté
- Université Grenoble Alpes, CNRS, CEA, INSERM, IRIG, BGE, F-38000 Grenoble, France
| | - Christophe Bruley
- Université Grenoble Alpes, CNRS, CEA, INSERM, IRIG, BGE, F-38000 Grenoble, France
| | - Thomas Burger
- Université Grenoble Alpes, CNRS, CEA, INSERM, IRIG, BGE, F-38000 Grenoble, France
|
22
|
Prieto G, Vázquez J. Protein Probability Model for High-Throughput Protein Identification by Mass Spectrometry-Based Proteomics. J Proteome Res 2020; 19:1285-1297. [PMID: 32037837 DOI: 10.1021/acs.jproteome.9b00819] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 12/12/2022]
Abstract
Shotgun proteomics is the method of choice for high-throughput protein identification; however, robust statistical methods are essential to automatize this task while minimizing the number of false identifications. The standard method for estimating the false discovery rate (FDR) of individual identifications and keeping it below a threshold (typically 1%) is the target-decoy approach. However, numerous works have shown that FDR at the protein level may become much larger than FDR at the peptide level. The development of an appropriate scoring model to identify proteins from their peptides using high-throughput shotgun proteomics is highly needed. In this study, we present a novel protein-level scoring algorithm that uses the scores of the identified peptides and maintains all of the properties expected for a true protein probability. We also present a refinement of the picked method to calculate FDR at the protein level. These algorithms can be used together as a robust identification workflow suitable for large-scale proteomics, and we show that the identification performance of this workflow is superior to that of other widely used methods in several samples and using different search engines. Our protein probability model offers the scientific community an algorithm that is easy to integrate into protein identification workflows for the automated analysis of shotgun proteomics data.
Affiliation(s)
- Gorka Prieto
- Department of Communications Engineering, University of the Basque Country (UPV/EHU), 48013 Bilbao, Spain
| | - Jesús Vázquez
- Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), 28049 Madrid, Spain
|
23
|
Abstract
Shotgun proteomics is the method of choice for large-scale protein identification. However, the use of a robust statistical workflow to validate such identification is mandatory to minimize false matches, ambiguities, and amplification of error rates from spectra to proteins. In this chapter we emphasize the key concepts to take into account when processing the output of a search engine to obtain reliable peptide or protein identifications. We assume that the reader is already familiar with tandem mass spectrometry so we can focus on the use of statistical confidence methods. After introducing the key concepts we present different software tools and how to use them with an example dataset.
Affiliation(s)
- Gorka Prieto
- Department of Communications Engineering, Faculty of Engineering of Bilbao, University of the Basque Country (UPV/EHU), Bilbao, Spain.
| | - Jesús Vázquez
- Laboratory of Cardiovascular Proteomics, Centro Nacional de Investigaciones Cardiovasculares (CNIC) and CIBER de Enfermedades Cardiovasculares (CIBERCV), Madrid, Spain
|
24
|
Mikan MP, Harvey HR, Timmins-Schiffman E, Riffle M, May DH, Salter I, Noble WS, Nunn BL. Metaproteomics reveal that rapid perturbations in organic matter prioritize functional restructuring over taxonomy in western Arctic Ocean microbiomes. ISME J 2020; 14:39-52. [PMID: 31492961 PMCID: PMC6908719 DOI: 10.1038/s41396-019-0503-z] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 04/29/2019] [Revised: 07/31/2019] [Accepted: 08/06/2019] [Indexed: 02/05/2023]
Abstract
We examined metaproteome profiles from two Arctic microbiomes during 10-day shipboard incubations to directly track early functional and taxonomic responses to a simulated algal bloom and an oligotrophic control. Using a novel peptide-based enrichment analysis, significant changes (p-value < 0.01) in biological and molecular functions associated with carbon and nitrogen recycling were observed. Within the first day under both organic matter conditions, Bering Strait surface microbiomes increased protein synthesis, carbohydrate degradation, and cellular redox processes while decreasing C1 metabolism. Taxonomic assignments revealed that the core microbiome collectively responded to algal substrates by assimilating carbon before select taxa utilize and metabolize nitrogen intracellularly. Incubations of Chukchi Sea bottom water microbiomes showed similar, but delayed functional responses to identical treatments. Although 24 functional terms were shared between experimental treatments, the timing, and degree of the remaining responses were highly variable, showing that organic matter perturbation directs community functionality prior to alterations to the taxonomic distribution at the microbiome class level. The dynamic responses of these two oceanic microbial communities have important implications for timing and magnitude of responses to organic perturbations within the Arctic Ocean and how community-level functions may forecast biogeochemical gradients in oceans.
Affiliation(s)
- Molly P Mikan
- Ocean, Earth and Atmospheric Sciences, Old Dominion University, 406 Oceanography & Physical Sciences Building, Norfolk, VA, 23529, USA
| | - H Rodger Harvey
- Ocean, Earth and Atmospheric Sciences, Old Dominion University, 406 Oceanography & Physical Sciences Building, Norfolk, VA, 23529, USA
| | - Emma Timmins-Schiffman
- Department of Genome Sciences, University of Washington, William H. Foege Hall, 3720 15th Ave NE, Seattle, WA, 98195, USA
| | - Michael Riffle
- Department of Biochemistry, University of Washington, 1705 NE Pacific St., Seattle, WA, USA
| | - Damon H May
- Department of Genome Sciences, University of Washington, William H. Foege Hall, 3720 15th Ave NE, Seattle, WA, 98195, USA
| | - Ian Salter
- Faroese Marine Research Institute, Nóatún 1, FO-100, Tórshavn, Faroe Islands
- Alfred Wegener Institute Helmholtz Center for Polar and Marine Research, Bremerhaven, Germany
| | - William S Noble
- Department of Genome Sciences, University of Washington, William H. Foege Hall, 3720 15th Ave NE, Seattle, WA, 98195, USA
| | - Brook L Nunn
- Department of Genome Sciences, University of Washington, William H. Foege Hall, 3720 15th Ave NE, Seattle, WA, 98195, USA.
|
25
|
Chen ZL, Meng JM, Cao Y, Yin JL, Fang RQ, Fan SB, Liu C, Zeng WF, Ding YH, Tan D, Wu L, Zhou WJ, Chi H, Sun RX, Dong MQ, He SM. A high-speed search engine pLink 2 with systematic evaluation for proteome-scale identification of cross-linked peptides. Nat Commun 2019; 10:3404. [PMID: 31363125 PMCID: PMC6667459 DOI: 10.1038/s41467-019-11337-z] [Citation(s) in RCA: 294] [Impact Index Per Article: 49.0] [Received: 12/28/2018] [Accepted: 06/20/2019] [Indexed: 01/05/2023] Open
Abstract
We describe pLink 2, a search engine with higher speed and reliability for proteome-scale identification of cross-linked peptides. With a two-stage open search strategy facilitated by fragment indexing, pLink 2 is ~40 times faster than pLink 1 and 3~10 times faster than Kojak. Furthermore, using simulated datasets, synthetic datasets, 15N metabolically labeled datasets, and entrapment databases, four analysis methods were designed to evaluate the credibility of ten state-of-the-art search engines. This systematic evaluation shows that pLink 2 outperforms these methods in precision and sensitivity, especially at proteome scales. Lastly, re-analysis of four published proteome-scale cross-linking datasets with pLink 2 required only a fraction of the time used by pLink 1, with up to 27% more cross-linked residue pairs identified. pLink 2 is therefore an efficient and reliable tool for cross-linking mass spectrometry analysis, and the systematic evaluation methods described here will be useful for future software development. The identification of cross-linked peptides at a proteome scale for interactome analyses represents a complex challenge. Here the authors report an efficient and reliable search engine pLink 2 for proteome-scale cross-linking mass spectrometry analyses, and demonstrate how to systematically evaluate the credibility of search engines.
Affiliation(s)
- Zhen-Lin Chen
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Jia-Ming Meng
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Yong Cao
  - National Institute of Biological Sciences, Beijing, 102206, China
- Ji-Li Yin
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Run-Qian Fang
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Sheng-Bo Fan
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Chao Liu
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Wen-Feng Zeng
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Yue-He Ding
  - National Institute of Biological Sciences, Beijing, 102206, China
- Dan Tan
  - National Institute of Biological Sciences, Beijing, 102206, China
- Long Wu
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Wen-Jing Zhou
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Hao Chi
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Rui-Xiang Sun
  - National Institute of Biological Sciences, Beijing, 102206, China
- Meng-Qiu Dong
  - National Institute of Biological Sciences, Beijing, 102206, China
- Si-Min He
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
|
26
|
The M, Käll L. Integrated Identification and Quantification Error Probabilities for Shotgun Proteomics. Mol Cell Proteomics 2019; 18:561-570. [PMID: 30482846 PMCID: PMC6398204 DOI: 10.1074/mcp.ra118.001018] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 08/06/2018] [Revised: 11/05/2018] [Indexed: 02/02/2023]
Abstract
Protein quantification by label-free shotgun proteomics experiments is plagued by a multitude of error sources. Typical pipelines for identifying differential proteins use intermediate filters to control the error rate. However, they often ignore certain error sources and, moreover, regard filtered lists as completely correct in subsequent steps. These two indiscretions can easily lead to a loss of control of the false discovery rate (FDR). We propose a probabilistic graphical model, Triqler, that propagates error information through all steps, employing distributions in favor of point estimates, most notably for missing value imputation. The model outputs posterior probabilities for fold changes between treatment groups, highlighting uncertainty rather than hiding it. We analyzed 3 engineered data sets and achieved FDR control and high sensitivity, even for truly absent proteins. In a bladder cancer clinical data set we discovered 35 proteins at 5% FDR, whereas the original study discovered 1 and MaxQuant/Perseus 4 proteins at this threshold. Compellingly, these 35 proteins showed enrichment for functional annotation terms, whereas the top ranked proteins reported by MaxQuant/Perseus showed no enrichment. The model executes in minutes and is freely available at https://pypi.org/project/triqler/.
Affiliation(s)
- Matthew The
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden
- Lukas Käll
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden
|
27
|
Hu A, Lu YY, Bilmes J, Noble WS. Joint Precursor Elution Profile Inference via Regression for Peptide Detection in Data-Independent Acquisition Mass Spectra. J Proteome Res 2019; 18:86-94. [PMID: 30362768 PMCID: PMC6465123 DOI: 10.1021/acs.jproteome.8b00365] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/16/2022]
Abstract
In data independent acquisition (DIA) mass spectrometry, precursor scans are interleaved with wide-window fragmentation scans, resulting in complex fragmentation spectra containing multiple coeluting peptide species. In this setting, detecting the isotope distribution profiles of intact peptides in the precursor scans can be a critical initial step in accurate peptide detection and quantification. This peak detection step is particularly challenging when the isotope peaks associated with two different peptide species overlap, or interfere, with one another. We propose a regression model, called Siren, to detect isotopic peaks in precursor DIA data that can explicitly account for interference. We validate Siren's peak-calling performance on a variety of data sets by counting how many of the peaks Siren identifies are associated with confidently detected peptides. In particular, we demonstrate that substituting the Siren regression model in place of the existing peak-calling step in DIA-Umpire leads to improved overall rates of peptide detection.
|
28
|
Chi H, Liu C, Yang H, Zeng WF, Wu L, Zhou WJ, Wang RM, Niu XN, Ding YH, Zhang Y, Wang ZW, Chen ZL, Sun RX, Liu T, Tan GM, Dong MQ, Xu P, Zhang PH, He SM. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat Biotechnol 2018; 36:nbt.4236. [PMID: 30295672 DOI: 10.1038/nbt.4236] [Citation(s) in RCA: 253] [Impact Index Per Article: 36.1] [Received: 12/03/2017] [Accepted: 08/03/2018] [Indexed: 12/27/2022]
Abstract
We present a sequence-tag-based search engine, Open-pFind, to identify peptides in an ultra-large search space that includes coeluting peptides, unexpected modifications and digestions. Our method detects peptides with higher precision and speed than seven other search engines. Open-pFind identified 70-85% of the tandem mass spectra in four large-scale datasets and 14,064 proteins, each supported by at least two protein-unique peptides, in a human proteome dataset.
Affiliation(s)
- Hao Chi
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Chao Liu
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Hao Yang
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Wen-Feng Zeng
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Long Wu
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Wen-Jing Zhou
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Rui-Min Wang
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Xiu-Nan Niu
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Yue-He Ding
  - National Institute of Biological Sciences, Beijing, Beijing, China
- Yao Zhang
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, China
  - State Key Laboratory of Biocontrol and Guangdong Provincial Key Laboratory of Plant Resources, College of Ecology and Evolution, Sun Yat-Sen University, Guangzhou, China
- Zhao-Wei Wang
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Zhen-Lin Chen
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Rui-Xiang Sun
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Tao Liu
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
- Guang-Ming Tan
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
- Meng-Qiu Dong
  - National Institute of Biological Sciences, Beijing, Beijing, China
- Ping Xu
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, China
- Pei-Heng Zhang
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
- Si-Min He
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
|
29
|
The M, Edfors F, Perez-Riverol Y, Payne SH, Hoopmann MR, Palmblad M, Forsström B, Käll L. A Protein Standard That Emulates Homology for the Characterization of Protein Inference Algorithms. J Proteome Res 2018; 17:1879-1886. [PMID: 29631402 DOI: 10.1021/acs.jproteome.7b00899] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 11/28/2022]
Abstract
A natural way to benchmark the performance of an analytical experimental setup is to use samples of known composition and see to what degree one can correctly infer the content of such a sample from the data. For shotgun proteomics, one of the inherent problems of interpreting data is that the measured analytes are peptides and not the actual proteins themselves. As some proteins share proteolytic peptides, there might be more than one possible causative set of proteins resulting in a given set of peptides and there is a need for mechanisms that infer proteins from lists of detected peptides. A weakness of commercially available samples of known content is that they consist of proteins that are deliberately selected for producing tryptic peptides that are unique to a single protein. Unfortunately, such samples do not expose any complications in protein inference. Hence, for a realistic benchmark of protein inference procedures, there is a need for samples of known content where the present proteins share peptides with known absent proteins. Here, we present such a standard, that is based on E. coli expressed human protein fragments. To illustrate the application of this standard, we benchmark a set of different protein inference procedures on the data. We observe that inference procedures excluding shared peptides provide more accurate estimates of errors compared to methods that include information from shared peptides, while still giving a reasonable performance in terms of the number of identified proteins. We also demonstrate that using a sample of known protein content without proteins with shared tryptic peptides can give a false sense of accuracy for many protein inference methods.
Affiliation(s)
- Matthew The
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden
- Fredrik Edfors
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden
- Yasset Perez-Riverol
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Samuel H Payne
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Michael R Hoopmann
  - Institute for Systems Biology, Seattle, Washington 98109, United States
- Magnus Palmblad
  - Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
- Björn Forsström
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden
- Lukas Käll
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden
|
30
|
Ting YS, Egertson JD, Bollinger JG, Searle BC, Payne SH, Noble WS, MacCoss MJ. PECAN: library-free peptide detection for data-independent acquisition tandem mass spectrometry data. Nat Methods 2017; 14:903-908. [PMID: 28783153 PMCID: PMC5578911 DOI: 10.1038/nmeth.4390] [Citation(s) in RCA: 137] [Impact Index Per Article: 17.1] [Received: 06/29/2016] [Accepted: 06/20/2017] [Indexed: 12/18/2022]
Abstract
Data-independent acquisition (DIA) is an emerging mass spectrometry (MS)-based technique for unbiased and reproducible measurement of protein mixtures. DIA tandem mass spectrometry spectra are often highly multiplexed, containing product ions from multiple cofragmenting precursors. Detecting peptides directly from DIA data is therefore challenging; most DIA data analyses require spectral libraries. Here we present PECAN (http://pecan.maccosslab.org), a library-free, peptide-centric tool that robustly and accurately detects peptides directly from DIA data. PECAN reports evidence of detection based on product ion scoring, which enables detection of low-abundance analytes with poor precursor ion signal. We demonstrate the chromatographic peak picking accuracy and peptide detection capability of PECAN, and we further validate its detection with data-dependent acquisition and targeted analyses. Lastly, we used PECAN to build a plasma proteome library from DIA data and to query known sequence variants.
Affiliation(s)
- Ying S Ting
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
- Jarrett D Egertson
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
- James G Bollinger
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
- Brian C Searle
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
- Samuel H Payne
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington, USA
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
  - Department of Computer Science and Engineering, University of Washington, Seattle, Washington, USA
- Michael J MacCoss
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
|
31
|
Levitsky LI, Ivanov MV, Lobas AA, Gorshkov MV. Unbiased False Discovery Rate Estimation for Shotgun Proteomics Based on the Target-Decoy Approach. J Proteome Res 2016; 16:393-397. [DOI: 10.1021/acs.jproteome.6b00144] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.0] [Indexed: 11/28/2022]
Affiliation(s)
- Lev I. Levitsky
  - Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region 141701, Russia
  - V.L. Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow 119991, Russia
- Mark V. Ivanov
  - Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region 141701, Russia
  - V.L. Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow 119991, Russia
- Anna A. Lobas
  - Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region 141701, Russia
  - V.L. Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow 119991, Russia
- Mikhail V. Gorshkov
  - Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region 141701, Russia
  - V.L. Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow 119991, Russia
|
32
|
The M, MacCoss MJ, Noble WS, Käll L. Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0. J Am Soc Mass Spectrom 2016; 27:1719-1727. [PMID: 27572102 PMCID: PMC5059416 DOI: 10.1007/s13361-016-1460-7] [Citation(s) in RCA: 286] [Impact Index Per Article: 31.8] [Received: 04/01/2016] [Revised: 06/15/2016] [Accepted: 07/20/2016] [Indexed: 05/21/2023]
Abstract
Percolator is a widely used software tool that increases yield in shotgun proteomics experiments and assigns reliable statistical confidence measures, such as q values and posterior error probabilities, to peptides and peptide-spectrum matches (PSMs) from such experiments. Percolator's processing speed has been sufficient for typical data sets consisting of hundreds of thousands of PSMs. With our new scalable approach, we can now also analyze millions of PSMs in a matter of minutes on a commodity computer. Furthermore, with the increasing awareness for the need for reliable statistics on the protein level, we compared several easy-to-understand protein inference methods and implemented the best-performing method (grouping proteins by their corresponding sets of theoretical peptides and then considering only the best-scoring peptide for each protein) in the Percolator package. We used Percolator 3.0 to analyze the data from a recent study of the draft human proteome containing 25 million spectra (PM:24870542). The source code and Ubuntu, Windows, MacOS, and Fedora binary packages are available from http://percolator.ms/ under an Apache 2.0 license.
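The protein inference rule named in this abstract (group proteins that share the same set of theoretical peptides, then score each group by its best-scoring peptide) can be sketched as follows. This is an illustration only, not Percolator's code: the function name, the dictionary inputs, and the convention that a higher score is better are all assumptions made for the example.

```python
from collections import defaultdict


def group_and_score(theoretical_peptides, peptide_scores):
    """Group indistinguishable proteins and score each group.

    theoretical_peptides: dict of protein id -> set of theoretical peptides
    peptide_scores: dict of observed peptide -> score (assumed: higher is better)
    Returns a list of (protein group, score), best score first.
    """
    # Proteins with identical theoretical peptide sets cannot be told apart,
    # so they are merged into one group.
    groups = defaultdict(list)
    for protein, peptides in theoretical_peptides.items():
        groups[frozenset(peptides)].append(protein)

    scored = []
    for peptides, proteins in groups.items():
        observed = [peptide_scores[p] for p in peptides if p in peptide_scores]
        if observed:
            # Only the single best-scoring peptide determines the group score.
            scored.append((tuple(sorted(proteins)), max(observed)))

    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


if __name__ == "__main__":
    theoretical = {
        "P1": {"AAK", "CCK"},
        "P2": {"AAK", "CCK"},
        "P3": {"DDK"},
        "P4": {"EEK"},
    }
    scores = {"AAK": 2.0, "CCK": 5.0, "DDK": 3.0}
    print(group_and_score(theoretical, scores))
```

Groups with no observed peptide (P4 in the toy input) are simply omitted here; a real pipeline would also carry decoy groups through the same scoring to estimate error rates.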
Affiliation(s)
- Matthew The
  - Science for Life Laboratory, School of Biotechnology, KTH - Royal Institute of Technology, Box 1031, 17121, Solna, Sweden
- Michael J MacCoss
  - Department of Genome Sciences, School of Medicine, University of Washington, Seattle, WA, 98195, USA
- William S Noble
  - Department of Genome Sciences, School of Medicine, University of Washington, Seattle, WA, 98195, USA
  - Department of Computer Science and Engineering, University of Washington, Seattle, WA, 98195, USA
- Lukas Käll
  - Science for Life Laboratory, School of Biotechnology, KTH - Royal Institute of Technology, Box 1031, 17121, Solna, Sweden
|
33
|
Nardiello D, Natale A, Palermo C, Quinto M, Centonze D. Combined use of peptide ion and normalized delta scores to evaluate milk authenticity by ion-trap based proteomics coupled with error tolerant searching. Talanta 2016; 164:684-692. [PMID: 28107990 DOI: 10.1016/j.talanta.2016.10.102] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/30/2016] [Revised: 10/25/2016] [Accepted: 10/30/2016] [Indexed: 12/17/2022]
Abstract
A fundamental issue in proteomics is the peptide identification by database searching and the assessment of the goodness of fit between experimental and theoretical data. Despite the different number of ways to measure the quality of search results, the definition of a scoring criterion is still highly desirable in ion-trap based proteomics. Indeed, in order to fully take advantage of a low resolution MS/MS dataset, it is essential to strike a balance between greater information capture and reduced number of incorrect peptide assignments. In addition, the development of user-specified rules is a crucial aspect when very similar proteins of the same family are analyzed in order to infer the origin species. In this study, a post-processing validation scheme is provided for the evaluation of proteomic data in shot-gun ion-trap proteomics, when a flexible database searching based on the error tolerant mode is adopted in combination with a low-specificity enzyme to maximize sequence coverage. To validate peptide assignments, we used standard β-casein digested with trypsin/chymotrypsin or trypsin alone and the popular search engine MASCOT to identify the relevant (known) peptide sequences. A linear combination between peptide ion score and normalized delta score (i.e. the difference between the best and the second best ion score, divided by the best score) is proposed to increase the accuracy in sequence assignments from low-resolution tandem mass spectra. Finally, the optimized post-processing database validation was successfully applied to the direct analysis of milk tryptic/chymotryptic digests of different origin, without resorting to two-dimensional electrophoresis that is usually performed for protein separation in ion-trap proteomics. The identification of species-specific amino acidic sequences among the validated peptide spectrum matches has allowed to fully discriminate between the animal species, so evaluating accurately the milk authenticity.
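The normalized delta score defined in this abstract (the difference between the best and second-best ion score, divided by the best score) and its linear combination with the peptide ion score can be sketched as below. The abstract does not give the coefficients of the combination, so `weight_ion` and `weight_delta` are hypothetical placeholders, as are the function names and the choice to treat a spectrum with a single candidate as having a second-best score of zero.

```python
def normalized_delta_score(ion_scores):
    """(best - second best) / best for the candidate ion scores of one spectrum."""
    if not ion_scores:
        raise ValueError("at least one ion score is required")
    ranked = sorted(ion_scores, reverse=True)
    best = ranked[0]
    if best <= 0:
        return 0.0  # assumption: no meaningful separation without a positive best score
    # Assumption: with a single candidate, the second-best score is taken as 0.
    second = ranked[1] if len(ranked) > 1 else 0.0
    return (best - second) / best


def combined_score(ion_scores, weight_ion=1.0, weight_delta=1.0):
    """Linear combination of the best ion score and the normalized delta score.

    The weights are hypothetical; the source abstract does not state them.
    """
    best = max(ion_scores)
    return weight_ion * best + weight_delta * normalized_delta_score(ion_scores)


if __name__ == "__main__":
    candidates = [40.0, 30.0, 12.0]
    print(normalized_delta_score(candidates))
    print(combined_score(candidates, weight_ion=1.0, weight_delta=10.0))
```

A spectrum whose top two candidates score almost equally gets a delta score near 0, so the combined score penalizes ambiguous assignments even when the best ion score is high.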
Affiliation(s)
- Donatella Nardiello
  - Dipartimento di Scienze Agrarie, degli Alimenti e dell'Ambiente and CSRA, Centro Servizi di Ricerca Applicata, Università degli Studi di Foggia, Via Napoli, 25, 71122 Foggia, Italy
- Anna Natale
  - Dipartimento di Scienze Agrarie, degli Alimenti e dell'Ambiente and CSRA, Centro Servizi di Ricerca Applicata, Università degli Studi di Foggia, Via Napoli, 25, 71122 Foggia, Italy
- Carmen Palermo
  - Dipartimento di Scienze Agrarie, degli Alimenti e dell'Ambiente and CSRA, Centro Servizi di Ricerca Applicata, Università degli Studi di Foggia, Via Napoli, 25, 71122 Foggia, Italy
- Maurizio Quinto
  - Dipartimento di Scienze Agrarie, degli Alimenti e dell'Ambiente and CSRA, Centro Servizi di Ricerca Applicata, Università degli Studi di Foggia, Via Napoli, 25, 71122 Foggia, Italy
- Diego Centonze
  - Dipartimento di Scienze Agrarie, degli Alimenti e dell'Ambiente and CSRA, Centro Servizi di Ricerca Applicata, Università degli Studi di Foggia, Via Napoli, 25, 71122 Foggia, Italy
|
34
|
The M, Tasnim A, Käll L. How to talk about protein-level false discovery rates in shotgun proteomics. Proteomics 2016; 16:2461-9. [PMID: 27503675 PMCID: PMC5096025 DOI: 10.1002/pmic.201500431] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Received: 11/12/2015] [Revised: 05/12/2016] [Accepted: 07/20/2016] [Indexed: 12/04/2022]
Abstract
A frequently sought output from a shotgun proteomics experiment is a list of proteins that we believe to have been present in the analyzed sample before proteolytic digestion. The standard technique to control for errors in such lists is to enforce a preset threshold for the false discovery rate (FDR). Many consider protein-level FDRs a difficult and vague concept, as the measurement entities, spectra, are manifestations of peptides and not proteins. Here, we argue that this confusion is unnecessary and provide a framework on how to think about protein-level FDRs, starting from its basic principle: the null hypothesis. Specifically, we point out that two competing null hypotheses are used concurrently in today's protein inference methods, which has gone unnoticed by many. Using simulations of a shotgun proteomics experiment, we show how confusing one null hypothesis for the other can lead to serious discrepancies in the FDR. Furthermore, we demonstrate how the same simulations can be used to verify FDR estimates of protein inference methods. In particular, we show that, for a simple protein inference method, decoy models can be used to accurately estimate protein-level FDRs for both competing null hypotheses.
Affiliation(s)
- Matthew The
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology - KTH, Solna, Sweden
- Ayesha Tasnim
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology - KTH, Solna, Sweden
- Lukas Käll
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology - KTH, Solna, Sweden
|
35
|
May DH, Timmins-Schiffman E, Mikan MP, Harvey HR, Borenstein E, Nunn BL, Noble WS. An Alignment-Free "Metapeptide" Strategy for Metaproteomic Characterization of Microbiome Samples Using Shotgun Metagenomic Sequencing. J Proteome Res 2016; 15:2697-705. [PMID: 27396978 PMCID: PMC5116374 DOI: 10.1021/acs.jproteome.6b00239] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Indexed: 01/29/2023]
Abstract
In principle, tandem mass spectrometry can be used to detect and quantify the peptides present in a microbiome sample, enabling functional and taxonomic insight into microbiome metabolic activity. However, the phylogenetic diversity constituting a particular microbiome is often unknown, and many of the organisms present may not have assembled genomes. In ocean microbiome samples, with particularly diverse and uncultured bacterial communities, it is difficult to construct protein databases that contain the bulk of the peptides in the sample without losing detection sensitivity due to the overwhelming number of candidate peptides for each tandem mass spectrum. We describe a method for deriving "metapeptides" (short amino acid sequences that may be represented in multiple organisms) from shotgun metagenomic sequencing of microbiome samples. In two ocean microbiome samples, we constructed site-specific metapeptide databases to detect more than one and a half times as many peptides as by searching against predicted genes from an assembled metagenome and roughly three times as many peptides as by searching against the NCBI environmental proteome database. The increased peptide yield has the potential to enrich the taxonomic and functional characterization of sample metaproteomes.
Affiliation(s)
- Damon H May
  - Department of Genome Sciences and Department of Computer Science and Engineering, University of Washington, Seattle, Washington 98195-5065, United States
- Emma Timmins-Schiffman
  - Department of Genome Sciences and Department of Computer Science and Engineering, University of Washington, Seattle, Washington 98195-5065, United States
- Molly P Mikan
  - Department of Ocean, Earth & Atmospheric Sciences, Old Dominion University, Norfolk, Virginia 23529, United States
- H Rodger Harvey
  - Department of Ocean, Earth & Atmospheric Sciences, Old Dominion University, Norfolk, Virginia 23529, United States
- Elhanan Borenstein
  - Department of Genome Sciences and Department of Computer Science and Engineering, University of Washington, Seattle, Washington 98195-5065, United States
  - Santa Fe Institute, Santa Fe, New Mexico 87501, United States
- Brook L Nunn
  - Department of Genome Sciences and Department of Computer Science and Engineering, University of Washington, Seattle, Washington 98195-5065, United States
- William S Noble
  - Department of Genome Sciences and Department of Computer Science and Engineering, University of Washington, Seattle, Washington 98195-5065, United States
|
36
|
Savitski MM, Wilhelm M, Hahne H, Kuster B, Bantscheff M. A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets. Mol Cell Proteomics 2015; 14:2394-404. [PMID: 25987413 DOI: 10.1074/mcp.m114.046995] [Citation(s) in RCA: 325] [Impact Index Per Article: 32.5] [Received: 11/30/2014] [Indexed: 02/06/2023]
Abstract
Calculating the number of confidently identified proteins and estimating false discovery rate (FDR) is a challenge when analyzing very large proteomic data sets such as entire human proteomes. Biological and technical heterogeneity in proteomic experiments further add to the challenge and there are strong differences in opinion regarding the conceptual validity of a protein FDR and no consensus regarding the methodology for protein FDR determination. There are also limitations inherent to the widely used classic target-decoy strategy that particularly show when analyzing very large data sets and that lead to a strong over-representation of decoy identifications. In this study, we investigated the merits of the classic, as well as a novel target-decoy-based protein FDR estimation approach, taking advantage of a heterogeneous data collection comprising ∼19,000 LC-MS/MS runs deposited in ProteomicsDB (https://www.proteomicsdb.org). The "picked" protein FDR approach treats target and decoy sequences of the same protein as a pair rather than as individual entities and chooses either the target or the decoy sequence depending on which receives the higher score. We investigated the performance of this approach in combination with q-value based peptide scoring to normalize sample-, instrument-, and search engine-specific differences. The "picked" target-decoy strategy performed best when protein scoring was based on the best peptide q-value for each protein, yielding a stable number of true positive protein identifications over a wide range of q-value thresholds. We show that this simple and unbiased strategy eliminates a conceptual issue in the commonly used "classic" protein FDR approach that causes overprediction of false-positive protein identification in large data sets. The approach scales from small to very large data sets without losing performance, consistently increases the number of true-positive protein identifications and is readily implemented in proteomics analysis software.
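The pairing rule described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `picked_protein_qvalues`, the dictionary inputs, the "higher score is better" convention (for example, a negative log of the best peptide q-value), and the tie handling that favors targets are all assumptions made for illustration.

```python
def picked_protein_qvalues(target_scores, decoy_scores):
    """Toy 'picked' target-decoy protein FDR.

    target_scores / decoy_scores: dict mapping protein id -> score,
    where a higher score is better (an assumption of this sketch).
    Returns a dict mapping each surviving target protein to a q-value.
    """
    picked = []  # entries are (score, is_decoy, protein_id)
    for protein in set(target_scores) | set(decoy_scores):
        t = target_scores.get(protein, float("-inf"))
        d = decoy_scores.get(protein, float("-inf"))
        # Keep only the better-scoring member of each target/decoy pair.
        # Ties go to the target (an assumption of this sketch).
        if t >= d:
            picked.append((t, False, protein))
        else:
            picked.append((d, True, protein))

    # Rank by descending score; at equal score, targets sort first.
    picked.sort(key=lambda e: (-e[0], e[1], e[2]))

    # FDR estimate at each rank: decoys so far / targets so far.
    fdrs = []
    targets = decoys = 0
    for _score, is_decoy, _protein in picked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this rank or any lower rank.
    qvalues = {}
    running_min = float("inf")
    for (_score, is_decoy, protein), fdr in zip(reversed(picked), reversed(fdrs)):
        running_min = min(running_min, fdr)
        if not is_decoy:
            qvalues[protein] = running_min
    return qvalues


if __name__ == "__main__":
    # Hypothetical scores: P3's decoy outscores its target, so P3 is dropped.
    targets = {"P1": 10, "P2": 8, "P3": 3, "P4": 2}
    decoys = {"P1": 1, "P2": 2, "P3": 5, "P4": 1}
    print(picked_protein_qvalues(targets, decoys))
```

Because each protein contributes at most one entry to the ranked list, a low-scoring decoy of a confidently identified target never enters the count, which is the over-representation of decoys that the abstract attributes to the classic approach.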
Affiliation(s)
- Mathias Wilhelm
- §Chair for Proteomics and Bioanalytics, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354 Freising, Germany; ¶SAP SE, Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany
- Hannes Hahne
- §Chair for Proteomics and Bioanalytics, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354 Freising, Germany
- Bernhard Kuster
- §Chair for Proteomics and Bioanalytics, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354 Freising, Germany; ‖Center for Integrated Protein Science Munich, Emil Erlenmeyer Forum 5, 85354 Freising, Germany
- Marcus Bantscheff
- From the ‡Cellzome GmbH, Meyerhofstrasse 1, 69117 Heidelberg, Germany
|
37
|
Howbert JJ, Noble WS. Computing exact p-values for a cross-correlation shotgun proteomics score function. Mol Cell Proteomics 2014; 13:2467-79. [PMID: 24895379 DOI: 10.1074/mcp.o113.036327] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Indexed: 01/25/2023] Open
Abstract
The core of every protein mass spectrometry analysis pipeline is a function that assesses the quality of a match between an observed spectrum and a candidate peptide. We describe a procedure for computing exact p-values for the oldest and still widely used score function, SEQUEST XCorr. The procedure uses dynamic programming to enumerate efficiently the full distribution of scores for all possible peptides whose masses are close to that of the spectrum precursor mass. Ranking identified spectra by p-value rather than XCorr significantly reduces variance because of spectrum-specific effects on the score. In combination with the Percolator postprocessor, the XCorr p-value yields more spectrum and peptide identifications at a fixed false discovery rate than Mascot, X!Tandem, Comet, and MS-GF+ across a variety of data sets.
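The dynamic-programming idea in this abstract can be sketched on a toy problem. The code below is not the published procedure. It assumes integer residue masses, an integer per-mass-bin `evidence` array standing in for a discretized spectrum, and a peptide score equal to the summed evidence at its prefix masses. These simplifications, and the names `score_distribution` and `exact_p_value`, are hypothetical.

```python
from collections import defaultdict


def score_distribution(residue_masses, evidence, target_mass):
    """Count residue sequences of total mass `target_mass` by integer score.

    counts[m][s] is the number of sequences with total mass m and score s,
    where the score sums evidence[] at every prefix mass below target_mass.
    Residue masses must be positive integers (toy assumption).
    """
    counts = [defaultdict(int) for _ in range(target_mass + 1)]
    counts[0][0] = 1  # the empty sequence
    for m in range(target_mass):
        if not counts[m]:
            continue
        for r in residue_masses:
            nm = m + r
            if nm > target_mass:
                continue
            # The full-length mass itself adds no evidence in this toy model.
            gain = evidence[nm] if nm < target_mass else 0
            for s, n in counts[m].items():
                counts[nm][s + gain] += n
    return dict(counts[target_mass])


def exact_p_value(distribution, observed_score):
    """Fraction of all candidate sequences scoring at least observed_score."""
    total = sum(distribution.values())
    if total == 0:
        raise ValueError("no sequence reaches the target mass")
    hits = sum(n for s, n in distribution.items() if s >= observed_score)
    return hits / total


if __name__ == "__main__":
    # Two hypothetical residues of mass 1 and 2; evidence indexed by mass bin.
    dist = score_distribution([1, 2], [0, 1, 2, 0], 3)
    print(dist, exact_p_value(dist, 2))
```

The table is filled once per spectrum, so the complete score distribution is available without enumerating candidate sequences individually, and the p-value is then a single tail sum over that distribution.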
Affiliation(s)
- J Jeffry Howbert
- From the ‡Department of Genome Sciences, University of Washington, Seattle, Washington
- William Stafford Noble
- From the ‡Department of Genome Sciences, University of Washington, Seattle, Washington; §Department of Computer Science and Engineering, University of Washington, Seattle, Washington
|
38
|
Granholm V, Kim S, Navarro JCF, Sjölund E, Smith RD, Käll L. Fast and accurate database searches with MS-GF+Percolator. J Proteome Res 2013; 13:890-7. [PMID: 24344789 DOI: 10.1021/pr400937n] [Citation(s) in RCA: 70] [Impact Index Per Article: 5.8] [Indexed: 01/14/2023]
Abstract
One can interpret fragmentation spectra stemming from peptides in mass-spectrometry-based proteomics experiments using so-called database search engines. Frequently, one also runs post-processors such as Percolator to assess the confidence, infer unique peptides, and increase the number of identifications. A recent search engine, MS-GF+, has shown promising results, due to a new and efficient scoring algorithm. However, MS-GF+ provides few statistical estimates about the peptide-spectrum matches, hence limiting the biological interpretation. Here, we enabled Percolator processing for MS-GF+ output and observed an increased number of identified peptides for a wide variety of data sets. In addition, Percolator directly reports p values and false discovery rate estimates, such as q values and posterior error probabilities, for peptide-spectrum matches, peptides, and proteins, functions that are useful for the whole proteomics community.
Affiliation(s)
- Viktor Granholm
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Solna, Sweden
|
39
|
Serang O, Cansizoglu AE, Käll L, Steen H, Steen JA. Nonparametric Bayesian evaluation of differential protein quantification. J Proteome Res 2013; 12:4556-65. [PMID: 24024742 DOI: 10.1021/pr400678m] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Indexed: 01/19/2023]
Abstract
Arbitrary cutoffs are ubiquitous in quantitative computational proteomics: maximum acceptable MS/MS PSM or peptide q value, minimum ion intensity to calculate a fold change, the minimum number of peptides that must be available to trust the estimated protein fold change (or the minimum number of PSMs that must be available to trust the estimated peptide fold change), and the "significant" fold change cutoff. Here we introduce a novel experimental setup and nonparametric Bayesian algorithm for determining the statistical quality of a proposed differential set of proteins or peptides. By comparing putatively nonchanging case-control evidence to an empirical null distribution derived from a control-control experiment, we successfully avoid some of these common parameters. We then apply our method to evaluating different fold-change rules and find that for our data a 1.2-fold change is the most permissive of the plausible fold-change rules.
Affiliation(s)
- Oliver Serang
- Thermo Fisher Scientific Bremen, Hanna-Kunath-Straße 11, Bremen 28199, Germany
|