1
Madej D, Lam H. PyViscount: Validating False Discovery Rate Estimation Methods via Random Search Space Partition. J Proteome Res 2025; 24:1118-1134. [PMID: 39905949 PMCID: PMC11894659 DOI: 10.1021/acs.jproteome.4c00743] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2024] [Revised: 01/20/2025] [Accepted: 01/28/2025] [Indexed: 02/06/2025]
Abstract
Validating false discovery rate (FDR) estimation is an essential but surprisingly understudied aspect of method development in shotgun proteomics. Currently available validation protocols mostly rely on ground truth data sets, which typically involve manipulating the properties of the search space or query spectra used. As a result, comparing estimated FDR and ground truth-based false discovery proportion values may not be representative of the scenarios involving natural data sets encountered in practice. In this study, we introduce PyViscount, a Python tool implementing a novel validation protocol based on random search space partition, which enables generating a quasi ground-truth using unaltered search spaces of unique candidate peptides and generic data sets of experimental query spectra. Furthermore, validation of existing FDR estimation methods by PyViscount is consistent with alternative validation protocols. The presented novel approach to validation free from the need for synthetic data sets or dubious manipulation of the data may be an attractive alternative for proteomics practitioners, allowing them to obtain deeper insights into the performance of existing and new FDR estimation methods.
Affiliation(s)
- Dominik Madej
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China
- Henry Lam
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China

2
Deshpande AS, Lin A, O'Bryon I, Aufrecht JA, Merkley ED. Emerging protein sequencing technologies: proteomics without mass spectrometry? Expert Rev Proteomics 2025; 22:89-106. [PMID: 40105028 DOI: 10.1080/14789450.2025.2476979] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/11/2024] [Revised: 02/12/2025] [Accepted: 03/03/2025] [Indexed: 03/20/2025]
Abstract
INTRODUCTION: Liquid chromatography-tandem mass spectrometry (LC-MS/MS) has been a leading method for proteomics for 30 years. Advantages provided by LC-MS/MS are offset by significant disadvantages, including cost. Recently, several non-mass spectrometric methods have emerged, but little information is available about their capacity to analyze the complex mixtures routine for mass spectrometry. AREAS COVERED: We review recent non-mass-spectrometric methods for sequencing proteins and peptides, including those using nanopores, sequencing by degradation, reverse translation, and short-epitope mapping, with comments on bioinformatics challenges, fundamental limitations, and areas where new technologies will be more or less competitive with LC-MS/MS. In addition to conventional literature searches, instrument vendor websites, patents, webinars, and preprints were also consulted to give a more up-to-date picture. EXPERT OPINION: Many new technologies are promising. However, demonstrations that they outperform mass spectrometry in terms of peptides and proteins identified have not yet been published, and astute observers note important disadvantages, especially relating to the dynamic range of single-molecule measurements of complex mixtures. Still, even if the performance of emerging methods proves inferior to LC-MS/MS, their low cost could create a different kind of revolution: a dramatic increase in the number of biology laboratories engaging in new forms of proteomics research.
Affiliation(s)
- A S Deshpande
  - Biogeochemical Transformations Group, Pacific Northwest National Laboratory, Richland, Washington, USA
- A Lin
  - Chemical and Biological Signatures Group, Pacific Northwest National Laboratory, Richland, Washington, USA
- I O'Bryon
  - Chemical and Biological Signatures Group, Pacific Northwest National Laboratory, Richland, Washington, USA
- J A Aufrecht
  - Biogeochemical Transformations Group, Pacific Northwest National Laboratory, Richland, Washington, USA
- E D Merkley
  - Chemical and Biological Signatures Group, Pacific Northwest National Laboratory, Richland, Washington, USA

3
Chu F, Lin A. Detecting Human Contaminant Genetically Variant Peptides in Nonhuman Samples. J Proteome Res 2025; 24:579-588. [PMID: 39705712 DOI: 10.1021/acs.jproteome.4c00718] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/22/2024]
Abstract
During proteomics data analysis, experimental spectra are searched against a user-defined protein database consisting of proteins that are reasonably expected to be present in the sample. Typically, this database contains the proteome of the organism under study concatenated with expected contaminants, such as trypsin and human keratins. However, there are additional contaminants that are not commonly added to the database. In this study, we describe a new set of protein contaminants and provide evidence that they can be detected in mass spectrometry-based proteomics data. Specifically, we provide evidence that human genetically variant peptides (GVPs) can be detected in nonhuman samples. GVPs are peptides that contain single amino acid polymorphisms that result from nonsynonymous single nucleotide polymorphisms in protein-coding regions of DNA. We reanalyzed previously collected nonhuman data-dependent acquisition (DDA) and data-independent acquisition (DIA) data sets and detected between 0 and 135 GVPs per data set. In addition, we show that GVPs are unlikely to originate from nonhuman sources and that a subset of eight GVPs are commonly detected across data sets.
Affiliation(s)
- Fanny Chu
  - Chemical and Biological Signatures, Pacific Northwest National Laboratory, Seattle, Washington 98109, United States
- Andy Lin
  - Chemical and Biological Signatures, Pacific Northwest National Laboratory, Seattle, Washington 98109, United States

4
Lin A, See D, Fondrie WE, Keich U, Noble WS. Target-decoy false discovery rate estimation using Crema. Proteomics 2024; 24:e2300084. [PMID: 38380501 DOI: 10.1002/pmic.202300084] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 06/13/2023] [Revised: 01/06/2024] [Accepted: 01/16/2024] [Indexed: 02/22/2024]
Abstract
Assigning statistical confidence estimates to discoveries produced by a tandem mass spectrometry proteomics experiment is critical to enabling principled interpretation of the results and assessing the cost/benefit ratio of experimental follow-up. The most common technique for computing such estimates is to use target-decoy competition (TDC), in which observed spectra are searched against a database of real (target) peptides and a database of shuffled or reversed (decoy) peptides. TDC procedures for estimating the false discovery rate (FDR) at a given score threshold have been developed for application at the level of spectra, peptides, or proteins. Although these techniques are relatively straightforward to implement, it is common in the literature to skip over the implementation details or even to make mistakes in how the TDC procedures are applied in practice. Here we present Crema, an open-source Python tool that implements several TDC methods of spectrum-, peptide- and protein-level FDR estimation. Crema is compatible with a variety of existing database search tools and provides a straightforward way to obtain robust FDR estimates.
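As an illustration of the target-decoy competition (TDC) idea summarized in this abstract, the following is a minimal Python sketch, not Crema's actual API. The function names `compete` and `tdc_fdr` are hypothetical, the `(decoys + 1) / targets` estimator is one commonly used conservative form (the abstract does not say which estimator Crema applies), and ties are sent to the decoy as a simplification.

```python
from typing import List, Tuple

# A PSM winner is (score, is_decoy). Higher scores are assumed to be better.
Winner = Tuple[float, bool]


def compete(target_scores: List[float], decoy_scores: List[float]) -> List[Winner]:
    """For each spectrum, keep the better of its best target and best decoy match.

    Assumption: ties go to the decoy, which is a conservative simplification.
    """
    winners = []
    for t, d in zip(target_scores, decoy_scores):
        if t > d:
            winners.append((t, False))
        else:
            winners.append((d, True))
    return winners


def tdc_fdr(winners: List[Winner], threshold: float) -> float:
    """Estimate the FDR among target winners scoring at or above `threshold`.

    Uses (decoys + 1) / targets, capped at 1.0 (assumed estimator).
    """
    targets = sum(1 for score, is_decoy in winners if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in winners if score >= threshold and is_decoy)
    return min(1.0, (decoys + 1) / max(targets, 1))


if __name__ == "__main__":
    w = compete([5.0, 3.0, 4.0, 1.0], [1.0, 3.5, 2.0, 1.0])
    print(w)
    print(tdc_fdr(w, 3.8))
```

In this toy run, two targets and no decoys score at or above 3.8, so the estimate is (0 + 1) / 2.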
Affiliation(s)
- Andy Lin
  - Chemical and Biological Signatures, Pacific Northwest National Laboratory, Seattle, Washington, USA
- Donavan See
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, USA
- Uri Keich
  - School of Mathematics and Statistics, University of Sydney, Sydney, Australia
- William Stafford Noble
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, USA
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA

5
Bhimani K, Peresadina A, Vozniuk D, Kertész-Farkas A. Exact p-value calculation for XCorr scoring of high-resolution MS/MS data. Proteomics 2024; 24:e2300145. [PMID: 37726251 DOI: 10.1002/pmic.202300145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2023] [Revised: 08/25/2023] [Accepted: 08/28/2023] [Indexed: 09/21/2023]
Abstract
Exact p-value (XPV)-based methods for dot product-like score functions (such as the XCorr score implemented in Tide, SEQUEST, Comet or shared peak count-based scoring in MSGF+ and ASPV) provide a fairly good calibration for peptide-spectrum-match (PSM) scoring in database searching-based MS/MS spectrum data identification. Unfortunately, standard XPV methods, in practice, cannot handle high-resolution fragmentation data produced by state-of-the-art mass spectrometers because having smaller bins increases the number of fragment matches that are assigned to incorrect bins and scored improperly. In this article, we present an extension of the XPV method, called the high-resolution exact p-value (HR-XPV) method, which can be used to calibrate PSM scores of high-resolution MS/MS spectra obtained with dot product-like scoring such as the XCorr. The HR-XPV carries remainder masses throughout the fragmentation, which greatly increases the number of fragments that are properly assigned to the correct bin and thus takes advantage of high-resolution data. Using four mass spectrometry data sets, our experimental results demonstrate that HR-XPV produces well-calibrated scores, which in turn results in more trusted spectrum annotations at any false discovery rate level.
Affiliation(s)
- Kishankumar Bhimani
  - Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University, Moscow, Russian Federation
- Arina Peresadina
  - Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University, Moscow, Russian Federation
- Dmitrii Vozniuk
  - Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University, Moscow, Russian Federation
- Attila Kertész-Farkas
  - Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University, Moscow, Russian Federation

6
Kertesz-Farkas A, Nii Adoquaye Acquaye FL, Bhimani K, Eng JK, Fondrie WE, Grant C, Hoopmann MR, Lin A, Lu YY, Moritz RL, MacCoss MJ, Noble WS. The Crux Toolkit for Analysis of Bottom-Up Tandem Mass Spectrometry Proteomics Data. J Proteome Res 2023; 22:561-569. [PMID: 36598107 DOI: 10.1021/acs.jproteome.2c00615] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/05/2023]
Abstract
The Crux tandem mass spectrometry data analysis toolkit provides a collection of algorithms for analyzing bottom-up proteomics tandem mass spectrometry data. Many publications have described various individual components of Crux, but a comprehensive summary has not been published since 2014. The goal of this work is to summarize the functionality of Crux, focusing on developments since 2014. We begin with empirical results demonstrating our recently implemented speedups to the Tide search engine. Other new features include a new score function in Tide, two new confidence estimation procedures, as well as three new tools: Param-medic for estimating search parameters directly from mass spectrometry data, Kojak for searching cross-linked mass spectra, and DIAmeter for searching data independent acquisition data against a sequence database.
Affiliation(s)
- Attila Kertesz-Farkas
  - Department of Data Analysis and Artificial Intelligence and Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University, 20 Myasnitskaya ulitsa, Moscow 101000, Russia
- Frank Lawrence Nii Adoquaye Acquaye
  - Department of Data Analysis and Artificial Intelligence and Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University, 20 Myasnitskaya ulitsa, Moscow 101000, Russia
- Kishankumar Bhimani
  - Department of Data Analysis and Artificial Intelligence and Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University, 20 Myasnitskaya ulitsa, Moscow 101000, Russia
- Jimmy K Eng
  - Proteomics Resource, University of Washington, 850 Republican Street, Seattle, Washington 98109-4725, United States
- William E Fondrie
  - Talus Bioscience, 550 17th Avenue, Seattle, Washington 98122, United States
- Charles Grant
  - Department of Genome Sciences, University of Washington, 3720 15th Avenue NE, Seattle, Washington 98195, United States
- Michael R Hoopmann
  - Institute for Systems Biology, 401 Terry Avenue N, Seattle, Washington 98109, United States
- Andy Lin
  - Department of Genome Sciences, University of Washington, 3720 15th Avenue NE, Seattle, Washington 98195, United States
- Yang Y Lu
  - Department of Genome Sciences, University of Washington, 3720 15th Avenue NE, Seattle, Washington 98195, United States
- Robert L Moritz
  - Institute for Systems Biology, 401 Terry Avenue N, Seattle, Washington 98109, United States
- Michael J MacCoss
  - Department of Genome Sciences, University of Washington, 3720 15th Avenue NE, Seattle, Washington 98195, United States
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, 3720 15th Avenue NE, Seattle, Washington 98195, United States
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, 185 E Stevens Way NE, Seattle, Washington 98195-2350, United States

7
Lin A, Deatherage Kaiser BL, Hutchison JR, Bilmes JA, Noble WS. MS1Connect: a mass spectrometry run similarity measure. Bioinformatics 2023; 39:7005198. [PMID: 36702456 PMCID: PMC9913042 DOI: 10.1093/bioinformatics/btad058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2022] [Revised: 01/05/2023] [Accepted: 01/24/2023] [Indexed: 01/28/2023] Open
Abstract
MOTIVATION: Interpretation of newly acquired mass spectrometry data can be improved by identifying, from an online repository, previous mass spectrometry runs that resemble the new data. However, this retrieval task requires computing the similarity between an arbitrary pair of mass spectrometry runs. This is particularly challenging for runs acquired using different experimental protocols. RESULTS: We propose a method, MS1Connect, that calculates the similarity between a pair of runs by examining only the intact peptide (MS1) scans, and we show evidence that the MS1Connect score is accurate. Specifically, we show that MS1Connect outperforms several baseline methods on the task of predicting the species from which a given proteomics sample originated. In addition, we show that MS1Connect scores are highly correlated with similarities computed from fragment (MS2) scans, even though these data are not used by MS1Connect. AVAILABILITY AND IMPLEMENTATION: The MS1Connect software is available at https://github.com/bmx8177/MS1Connect. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Andy Lin
  - Chemical and Biological Signatures, Pacific Northwest National Laboratory, Richland, WA 99354, USA
- Janine R Hutchison
  - Chemical and Biological Signatures, Pacific Northwest National Laboratory, Richland, WA 99354, USA
- Jeffrey A Bilmes
  - Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195, USA
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA

8
Lin A, Short T, Noble WS, Keich U. Improving Peptide-Level Mass Spectrometry Analysis via Double Competition. J Proteome Res 2022; 21:2412-2420. [PMID: 36166314 PMCID: PMC10108709 DOI: 10.1021/acs.jproteome.2c00282] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 11/29/2022]
Abstract
The analysis of shotgun proteomics data often involves generating lists of inferred peptide-spectrum matches (PSMs) and/or of peptides. The canonical approach for generating these discovery lists is by controlling the false discovery rate (FDR), most commonly through target-decoy competition (TDC). At the PSM level, TDC is implemented by competing each spectrum's best-scoring target (real) peptide match with its best match against a decoy database. This PSM-level procedure can be adapted to the peptide level by selecting the top-scoring PSM per peptide prior to FDR estimation. Here, we first highlight and empirically augment a little known previous work by He et al., which showed that TDC-based PSM-level FDR estimates can be liberally biased. We thus propose that researchers instead focus on peptide-level analysis. We then investigate three ways to carry out peptide-level TDC and show that the most common method ("PSM-only") offers the lowest statistical power in practice. An alternative approach that carries out a double competition, first at the PSM and then at the peptide level ("PSM-and-peptide"), is the most powerful method, yielding an average increase of 17% more discovered peptides at 1% FDR threshold relative to the PSM-only method.
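The "PSM-and-peptide" double competition described in this abstract can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation. It assumes that the PSM-level competition has already been run, that each target peptide is paired with exactly one decoy peptide (the hypothetical `decoy_of` mapping), that higher scores are better, and that ties go to the decoy.

```python
from typing import Dict, List, Tuple


def psm_and_peptide(
    psm_winners: List[Tuple[str, float]],
    decoy_of: Dict[str, str],
) -> List[Tuple[str, float, bool]]:
    """Second (peptide-level) competition applied to PSM-level winners.

    psm_winners: (peptide, score) pairs that survived PSM-level competition;
                 may contain both target and decoy peptides.
    decoy_of:    hypothetical mapping from each target peptide to its paired decoy.
    Returns (peptide, score, is_decoy) for the winner of each target/decoy pair.
    """
    # Step 1: keep only the top-scoring PSM for every peptide.
    best: Dict[str, float] = {}
    for peptide, score in psm_winners:
        if score > best.get(peptide, float("-inf")):
            best[peptide] = score

    # Step 2: compete each target peptide against its paired decoy peptide.
    result = []
    for target, decoy in decoy_of.items():
        t, d = best.get(target), best.get(decoy)
        if t is None and d is None:
            continue  # neither peptide was matched by any surviving PSM
        if d is None or (t is not None and t > d):
            result.append((target, t, False))
        else:
            result.append((decoy, d, True))  # ties go to the decoy (assumption)
    return result


if __name__ == "__main__":
    winners = [("PEPTIDEK", 3.0), ("PEPTIDEK", 4.5), ("EDITPEPK", 2.0),
               ("SAMPLER", 1.0), ("ELPMASR", 1.5)]
    pairs = {"PEPTIDEK": "EDITPEPK", "SAMPLER": "ELPMASR"}
    print(psm_and_peptide(winners, pairs))
```

The resulting list of peptide-level winners would then be passed to an FDR estimation step that counts decoys and targets above a score threshold.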
Affiliation(s)
- Andy Lin
  - Chemical and Biological Signatures, Pacific Northwest National Laboratory, Seattle, Washington 98109, United States
- Temana Short
  - School of Mathematics & Statistics, University of Sydney, New South Wales, 2006, Australia
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States
- Uri Keich
  - School of Mathematics & Statistics, University of Sydney, New South Wales, 2006, Australia

9
Kudriavtseva P, Kashkinov M, Kertész-Farkas A. Deep Convolutional Neural Networks Help Scoring Tandem Mass Spectrometry Data in Database-Searching Approaches. J Proteome Res 2021; 20:4708-4717. [PMID: 34449232 DOI: 10.1021/acs.jproteome.1c00315] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/27/2022]
Abstract
Spectrum annotation is a challenging task due to the presence of unexpected peptide fragmentation ions as well as the inaccuracy of the detectors of the spectrometers. We present a deep convolutional neural network, called Slider, which learns an optimal feature extraction in its kernels for scoring mass spectrometry (MS)/MS spectra to increase the number of spectrum annotations with high confidence. Experimental results using publicly available data sets show that Slider can annotate slightly more spectra than the state-of-the-art methods (BoltzMatch, Res-EV, Prosit), albeit 2-10 times faster. More interestingly, Slider provides only 2-4% fewer spectrum annotations with low-resolution fragmentation information than other methods with high-resolution information. This means that Slider can exploit nearly as much information from the context of low-resolution spectrum peaks as the high-resolution fragmentation information can provide for other scoring methods. Thus, Slider can be an optimal choice for practitioners using old spectrometers with low-resolution detectors.
Affiliation(s)
- Polina Kudriavtseva
  - Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University, 11 Pokrovsky Bvld., Moscow 109028, Russian Federation
- Matvey Kashkinov
  - Faculty of Computer Science, HSE University, 11 Pokrovsky Bvld., Moscow 109028, Russian Federation
- Attila Kertész-Farkas
  - Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University, 11 Pokrovsky Bvld., Moscow 109028, Russian Federation

10
Lin A, Plubell DL, Keich U, Noble WS. Accurately Assigning Peptides to Spectra When Only a Subset of Peptides Are Relevant. J Proteome Res 2021; 20:4153-4164. [PMID: 34236864 PMCID: PMC8489664 DOI: 10.1021/acs.jproteome.1c00483] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 12/22/2022]
Abstract
The standard proteomics database search strategy involves searching spectra against a peptide database and estimating the false discovery rate (FDR) of the resulting set of peptide-spectrum matches. One assumption of this protocol is that all the peptides in the database are relevant to the hypothesis being investigated. However, in settings where researchers are interested in a subset of peptides, alternative search and FDR control strategies are needed. Recently, two methods were proposed to address this problem: subset-search and all-sub. We show that both methods fail to control the FDR. For subset-search, this failure is due to the presence of "neighbor" peptides, which are defined as irrelevant peptides with a similar precursor mass and fragmentation spectrum as a relevant peptide. Not considering neighbors compromises the FDR estimate because a spectrum generated by an irrelevant peptide can incorrectly match well to a relevant peptide. Therefore, we have developed a new method, "subset-neighbor search" (SNS), that accounts for neighbor peptides. We show evidence that SNS controls the FDR when neighbors are present and that SNS outperforms group-FDR, the only other method that appears to control the FDR relative to a subset of relevant peptides.
Affiliation(s)
- Andy Lin
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Deanna L. Plubell
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Uri Keich
  - School of Mathematics and Statistics, University of Sydney, NSW, Australia
- William S. Noble
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
  - Paul G. Allen School for Computer Science and Engineering, University of Washington, Seattle, WA, USA

11
Abstract
Proteomics studies rely on the accurate assignment of peptides to the acquired tandem mass spectra, a task where machine learning algorithms have proven invaluable. We describe mokapot, which provides a flexible semisupervised learning algorithm that allows for highly customized analyses. We demonstrate some of the unique features of mokapot by improving the detection of RNA-cross-linked peptides from an analysis of RNA-binding proteins and increasing the consistency of peptide detection in a single-cell proteomics study.
Affiliation(s)
- William E. Fondrie
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
- William S. Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States

12
Development of an MS Workflow Based on Combining Database Search Engines for Accurate Protein Identification and Its Validation to Identify the Serum Proteomic Profile in Female Stress Urinary Incontinence. Biomed Res Int 2020. [DOI: 10.1155/2020/8740468] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
A critical stage of shotgun proteomics is database search, a process which attempts to match the experimental spectra to the theoretical one. Given the considerable time and effort spent in analysis, it is self-evident for a researcher to aspire for rigorous computational analysis and a more confident and accurate peptide/protein identification. Mass spectrometry (MS) has been applied across several clinical disciplines. The pathophysiology of Stress Urinary Incontinence (SUI), caused by a damaged pelvic floor, has become a boundless disease altering the quality of life worldwide. Although some studies pointed markers that can be bioindicators for SUI, these findings raise the issue of sensitivity and specificity. Therefore, it is critical to have a sensitive and specific analytical approach to identify markers that have been associated with protective and deleterious associations in disease. Here, we describe our designed and developed workflow for protein identification from tandem mass spectrometry that uses multiple search engines. We apply our workflow to an existing study addressing the pathophysiology of SUI. We demonstrate how using the combined approach together with high-performance computing techniques can surmount the challenges of complex analyses and extended computing time. We also compare the relative performance of each combination. Our results suggest that a combination of MS-GF+ and COMET represents the best sensitivity-specificity trade-off, outperforming all other tested combinations. The approach was also sensitive and accurately identified a set of protein that was shown to be markers for categories of diseases associated with the pathophysiology of SUI. This workflow was developed to encourage proteomic researchers to adopt MS-based techniques for accurate analysis and to promote MS as a routine tool to the clinical cohorts.

13
Sulimov P, Voronkova A, Kertész-Farkas A. Annotation of tandem mass spectrometry data using stochastic neural networks in shotgun proteomics. Bioinformatics 2020; 36:3781-3787. [PMID: 32207518 DOI: 10.1093/bioinformatics/btaa206] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 12/02/2019] [Revised: 03/18/2020] [Accepted: 03/20/2020] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: The discrimination ability of score functions to separate correct from incorrect peptide-spectrum-matches in database-searching-based spectrum identification is hindered by many superfluous peaks belonging to unexpected fragmentation ions or by the lacking peaks of anticipated fragmentation ions. RESULTS: Here, we present a new method, called BoltzMatch, to learn score functions using a particular type of stochastic neural network, called restricted Boltzmann machines, in order to enhance their discrimination ability. BoltzMatch learns chemically explainable patterns among peak pairs in the spectrum data, and it can augment peaks depending on their semantic context or even reconstruct lacking peaks of expected ions during its internal scoring mechanism. As a result, BoltzMatch achieved 50% and 33% more annotations on high- and low-resolution MS2 data than XCorr at a 0.1% false discovery rate in our benchmark; conversely, XCorr yielded the same number of spectrum annotations as BoltzMatch, albeit with 4-6 times more errors. In addition, BoltzMatch alone does yield 14% more annotations than Prosit (which runs with Percolator), and BoltzMatch with Percolator yields 32% more annotations than Prosit at 0.1% FDR level in our benchmark. AVAILABILITY AND IMPLEMENTATION: BoltzMatch is freely available at: https://github.com/kfattila/BoltzMatch. CONTACT: akerteszfarkas@hse.ru. SUPPORTING INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Pavel Sulimov
  - Faculty of Computer Science, School of Data Analysis and Artificial Intelligence, Moscow 101000, Russia
- Anastasia Voronkova
  - Faculty of Computer Science, School of Data Analysis and Artificial Intelligence, Moscow 101000, Russia
- Attila Kertész-Farkas
  - Faculty of Computer Science, School of Data Analysis and Artificial Intelligence, Moscow 101000, Russia

14
Bioinformatics Methods for Mass Spectrometry-Based Proteomics Data Analysis. Int J Mol Sci 2020; 21:ijms21082873. [PMID: 32326049 PMCID: PMC7216093 DOI: 10.3390/ijms21082873] [Citation(s) in RCA: 145] [Impact Index Per Article: 29.0] [Received: 03/18/2020] [Revised: 04/16/2020] [Accepted: 04/18/2020] [Indexed: 01/15/2023] Open
Abstract
Recent advances in mass spectrometry (MS)-based proteomics have enabled tremendous progress in the understanding of cellular mechanisms, disease progression, and the relationship between genotype and phenotype. Though many popular bioinformatics methods in proteomics are derived from other omics studies, novel analysis strategies are required to deal with the unique characteristics of proteomics data. In this review, we discuss the current developments in the bioinformatics methods used in proteomics and how they facilitate the mechanistic understanding of biological processes. We first introduce bioinformatics software and tools designed for mass spectrometry-based protein identification and quantification, and then we review the different statistical and machine learning methods that have been developed to perform comprehensive analysis in proteomics studies. We conclude with a discussion of how quantitative protein data can be used to reconstruct protein interactions and signaling networks.

15
Sulimov P, Kertész-Farkas A. Tailor: A Nonparametric and Rapid Score Calibration Method for Database Search-Based Peptide Identification in Shotgun Proteomics. J Proteome Res 2020; 19:1481-1490. [DOI: 10.1021/acs.jproteome.9b00736] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/14/2022]
Affiliation(s)
- Pavel Sulimov
  - Department of Data Analysis and Artificial Intelligence, Faculty of Computer Science, National Research University Higher School of Economics (HSE), 11 Pokrovsky Boulevard, Moscow 109028, Russian Federation
- Attila Kertész-Farkas
  - Department of Data Analysis and Artificial Intelligence, Faculty of Computer Science, National Research University Higher School of Economics (HSE), 11 Pokrovsky Boulevard, Moscow 109028, Russian Federation

16
Fondrie WE, Noble WS. Machine Learning Strategy That Leverages Large Data sets to Boost Statistical Power in Small-Scale Experiments. J Proteome Res 2020; 19:1267-1274. [PMID: 32009418 PMCID: PMC8455073 DOI: 10.1021/acs.jproteome.9b00780] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 01/26/2023]
Abstract
Machine learning methods have proven invaluable for increasing the sensitivity of peptide detection in proteomics experiments. Most modern tools, such as Percolator and PeptideProphet, use semi-supervised algorithms to learn models directly from the datasets that they analyze. Although these methods are effective for many proteomics experiments, we suspected that they may be suboptimal for experiments of smaller scale. In this work, we found that the power and consistency of Percolator results was reduced as the size of the experiment was decreased. As an alternative, we propose a different operating mode for Percolator: learn a model with Percolator from a large dataset and use the learned model to evaluate the small-scale experiment. We call this a “static modeling” approach, in contrast to Percolator’s usual “dynamic model” that is trained anew for each dataset. We applied this static modeling approach to two settings: small, gel-based experiments and single-cell proteomics. In both cases, static models increased the yield of detected peptides and eliminated the model-induced variability of the standard dynamic approach. These results suggest that static models are a powerful tool for bringing the full benefits of Percolator and other semi-supervised algorithms to small-scale experiments.
Affiliation(s)
- William E Fondrie
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195-5065, United States
- William S Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195-5065, United States
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195-5065, United States