1
|
Deshpande AS, Lin A, O'Bryon I, Aufrecht JA, Merkley ED. Emerging protein sequencing technologies: proteomics without mass spectrometry? Expert Rev Proteomics 2025; 22:89-106. [PMID: 40105028 DOI: 10.1080/14789450.2025.2476979] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/11/2024] [Revised: 02/12/2025] [Accepted: 03/03/2025] [Indexed: 03/20/2025]
Abstract
INTRODUCTION: Liquid chromatography-tandem mass spectrometry (LC-MS/MS) has been a leading method for proteomics for 30 years. Advantages provided by LC-MS/MS are offset by significant disadvantages, including cost. Recently, several non-mass spectrometric methods have emerged, but little information is available about their capacity to analyze the complex mixtures routine for mass spectrometry. AREAS COVERED: We review recent non-mass-spectrometric methods for sequencing proteins and peptides, including those using nanopores, sequencing by degradation, reverse translation, and short-epitope mapping, with comments on bioinformatics challenges, fundamental limitations, and areas where new technologies will be more or less competitive with LC-MS/MS. In addition to conventional literature searches, instrument vendor websites, patents, webinars, and preprints were also consulted to give a more up-to-date picture. EXPERT OPINION: Many new technologies are promising. However, demonstrations that they outperform mass spectrometry in terms of peptides and proteins identified have not yet been published, and astute observers note important disadvantages, especially relating to the dynamic range of single-molecule measurements of complex mixtures. Still, even if the performance of emerging methods proves inferior to LC-MS/MS, their low cost could create a different kind of revolution: a dramatic increase in the number of biology laboratories engaging in new forms of proteomics research.
Affiliation(s)
- A S Deshpande
- Biogeochemical Transformations Group, Pacific Northwest National Laboratory, Richland, Washington, USA
| | - A Lin
- Chemical and Biological Signatures Group, Pacific Northwest National Laboratory, Richland, Washington, USA
| | - I O'Bryon
- Chemical and Biological Signatures Group, Pacific Northwest National Laboratory, Richland, Washington, USA
| | - J A Aufrecht
- Biogeochemical Transformations Group, Pacific Northwest National Laboratory, Richland, Washington, USA
| | - E D Merkley
- Chemical and Biological Signatures Group, Pacific Northwest National Laboratory, Richland, Washington, USA
| |
|
2
|
Gabriel W, González RM, Laposchan S, Riedel E, Dündar G, Poppenberger B, Wilhelm M, Lee CY. Deep Learning Enhances Precision of Citrullination Identification in Human and Plant Tissue Proteomes. Mol Cell Proteomics 2025; 24:100924. [PMID: 39921205 PMCID: PMC11925583 DOI: 10.1016/j.mcpro.2025.100924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2024] [Revised: 01/17/2025] [Accepted: 01/28/2025] [Indexed: 02/10/2025] Open
Abstract
Citrullination is a critical yet understudied post-translational modification (PTM) implicated in various biological processes. Exploring its role in health and disease requires a comprehensive understanding of the prevalence of this PTM at a proteome-wide scale. Although mass spectrometry has enabled the identification of citrullination sites in complex biological samples, it faces significant challenges, including limited enrichment tools and a high rate of false positives due to the identical mass with deamidation (+0.9840 Da) and errors in monoisotopic ion selection. These issues often necessitate manual spectrum inspection, reducing throughput in large-scale studies. In this work, we present a novel data analysis pipeline that incorporates the deep learning model Prosit-Cit into the MS database search workflow to improve both the sensitivity and the precision of citrullination site identification. Prosit-Cit, an extension of the existing Prosit model, has been trained on ∼53,000 spectra from ∼2500 synthetic citrullinated peptides and provides precise predictions for chromatographic retention time and fragment ion intensities of both citrullinated and deamidated peptides. This enhances the accuracy of identification and reduces false positives. Our pipeline demonstrated high precision on the evaluation dataset, recovering the majority of known citrullination sites in human tissue proteomes and improving sensitivity by identifying up to 14 times more citrullinated sites. Sequence motif analysis revealed consistency with previously reported findings, validating the reliability of our approach. Furthermore, extending the pipeline to a tissue proteome dataset of the model plant Arabidopsis thaliana enabled the identification of ∼200 citrullination sites across 169 proteins from 30 tissues, representing the first large-scale citrullination mapping in plants. 
This pipeline can be seamlessly applied to existing proteomics datasets, offering a robust tool for advancing biological discoveries and deepening our understanding of protein citrullination across species.
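The false-positive problem described in this abstract can be illustrated with a little arithmetic. The sketch below is not part of Prosit-Cit or the authors' pipeline; it only compares the +0.9840 Da shift shared by citrullination and deamidation with the spacing between adjacent isotope peaks, using standard monoisotopic values. The function names and the example tolerance are assumptions made for illustration.

```python
# Monoisotopic mass differences in daltons (standard values).
DEAMIDATION_SHIFT = 0.984016   # shared by N/Q deamidation and R citrullination
C13_C12_SPACING = 1.0033548    # spacing between adjacent isotope peaks


def ppm_gap(peptide_mass: float) -> float:
    """Gap, in ppm of the peptide mass, between a +0.984 Da modification
    and a one-isotope (+1.003 Da) error in monoisotopic ion selection."""
    return (C13_C12_SPACING - DEAMIDATION_SHIFT) / peptide_mass * 1e6


def separable(peptide_mass: float, tolerance_ppm: float) -> bool:
    """True when the precursor tolerance (an assumed search setting) is
    tighter than the gap, so the two explanations do not overlap."""
    return ppm_gap(peptide_mass) > tolerance_ppm
```

Because citrullination and deamidation carry the same shift, precursor mass alone can never separate them, which is why the abstract relies on predicted retention times and fragment ion intensities instead.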
Affiliation(s)
- Wassim Gabriel
- Computational Mass Spectrometry, School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Rebecca Meelker González
- Young Investigator Group: Mass Spectrometry in Systems Neurosciences, School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Sophia Laposchan
- Young Investigator Group: Mass Spectrometry in Systems Neurosciences, School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Erik Riedel
- Young Investigator Group: Mass Spectrometry in Systems Neurosciences, School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Gönül Dündar
- Biotechnology of Horticultural Crops, School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Brigitte Poppenberger
- Biotechnology of Horticultural Crops, School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Mathias Wilhelm
- Computational Mass Spectrometry, School of Life Sciences, Technical University of Munich, Freising, Germany; Munich Data Science Institute (MDSI), Technical University of Munich, Garching, Germany.
| | - Chien-Yun Lee
- Young Investigator Group: Mass Spectrometry in Systems Neurosciences, School of Life Sciences, Technical University of Munich, Freising, Germany.
| |
|
3
|
Chu F, Lin A. Detecting Human Contaminant Genetically Variant Peptides in Nonhuman Samples. J Proteome Res 2025; 24:579-588. [PMID: 39705712 DOI: 10.1021/acs.jproteome.4c00718] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/22/2024]
Abstract
During proteomics data analysis, experimental spectra are searched against a user-defined protein database consisting of proteins that are reasonably expected to be present in the sample. Typically, this database contains the proteome of the organism under study concatenated with expected contaminants, such as trypsin and human keratins. However, there are additional contaminants that are not commonly added to the database. In this study, we describe a new set of protein contaminants and provide evidence that they can be detected in mass spectrometry-based proteomics data. Specifically, we provide evidence that human genetically variant peptides (GVPs) can be detected in nonhuman samples. GVPs are peptides that contain single amino acid polymorphisms that result from nonsynonymous single nucleotide polymorphisms in protein-coding regions of DNA. We reanalyzed previously collected nonhuman data-dependent acquisition (DDA) and data-independent acquisition (DIA) data sets and detected between 0 and 135 GVPs per data set. In addition, we show that GVPs are unlikely to originate from nonhuman sources and that a subset of eight GVPs are commonly detected across data sets.
Affiliation(s)
- Fanny Chu
- Chemical and Biological Signatures, Pacific Northwest National Laboratory, Seattle, Washington 98109, United States
| | - Andy Lin
- Chemical and Biological Signatures, Pacific Northwest National Laboratory, Seattle, Washington 98109, United States
| |
|
4
|
Jiang Y, Rex DA, Schuster D, Neely BA, Rosano GL, Volkmar N, Momenzadeh A, Peters-Clarke TM, Egbert SB, Kreimer S, Doud EH, Crook OM, Yadav AK, Vanuopadath M, Hegeman AD, Mayta M, Duboff AG, Riley NM, Moritz RL, Meyer JG. Comprehensive Overview of Bottom-Up Proteomics Using Mass Spectrometry. ACS Measurement Science Au 2024; 4:338-417. [PMID: 39193565 PMCID: PMC11348894 DOI: 10.1021/acsmeasuresciau.3c00068] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/15/2023] [Revised: 05/03/2024] [Accepted: 05/03/2024] [Indexed: 08/29/2024]
Abstract
Proteomics is the large scale study of protein structure and function from biological systems through protein identification and quantification. "Shotgun proteomics" or "bottom-up proteomics" is the prevailing strategy, in which proteins are hydrolyzed into peptides that are analyzed by mass spectrometry. Proteomics studies can be applied to diverse studies ranging from simple protein identification to studies of proteoforms, protein-protein interactions, protein structural alterations, absolute and relative protein quantification, post-translational modifications, and protein stability. To enable this range of different experiments, there are diverse strategies for proteome analysis. The nuances of how proteomic workflows differ may be challenging to understand for new practitioners. Here, we provide a comprehensive overview of different proteomics methods. We cover from biochemistry basics and protein extraction to biological interpretation and orthogonal validation. We expect this Review will serve as a handbook for researchers who are new to the field of bottom-up proteomics.
Affiliation(s)
- Yuming Jiang
- Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
| | - Devasahayam Arokia Balaya Rex
- Center for Systems Biology and Molecular Medicine, Yenepoya Research Centre, Yenepoya (Deemed to be University), Mangalore 575018, India
| | - Dina Schuster
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich 8093, Switzerland
- Department of Biology, Institute of Molecular Biology and Biophysics, ETH Zurich, Zurich 8093, Switzerland
- Laboratory of Biomolecular Research, Division of Biology and Chemistry, Paul Scherrer Institute, Villigen 5232, Switzerland
| | - Benjamin A. Neely
- Chemical Sciences Division, National Institute of Standards and Technology, NIST, Charleston, South Carolina 29412, United States
| | - Germán L. Rosano
- Mass Spectrometry Unit, Institute of Molecular and Cellular Biology of Rosario, Rosario, 2000 Argentina
| | - Norbert Volkmar
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich 8093, Switzerland
| | - Amanda Momenzadeh
- Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
| | - Trenton M. Peters-Clarke
- Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, California, 94158, United States
| | - Susan B. Egbert
- Department of Chemistry, University of Manitoba, Winnipeg, Manitoba, R3T 2N2 Canada
| | - Simion Kreimer
- Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
| | - Emma H. Doud
- Center for Proteome Analysis, Indiana University School of Medicine, Indianapolis, Indiana, 46202-3082, United States
| | - Oliver M. Crook
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
| | - Amit Kumar Yadav
- Translational Health Science and Technology Institute, NCR Biotech Science Cluster 3rd Milestone Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
| | | | - Adrian D. Hegeman
- Departments of Horticultural Science and Plant and Microbial Biology, University of Minnesota, Twin Cities, Minnesota 55108, United States
| | - Martín L. Mayta
- School of Medicine and Health Sciences, Center for Health Sciences Research, Universidad Adventista del Plata, Libertador San Martin 3103, Argentina
- Molecular Biology Department, School of Pharmacy and Biochemistry, Universidad Nacional de Rosario, Rosario 2000, Argentina
| | - Anna G. Duboff
- Department of Chemistry, University of Washington, Seattle, Washington 98195, United States
| | - Nicholas M. Riley
- Department of Chemistry, University of Washington, Seattle, Washington 98195, United States
| | - Robert L. Moritz
- Institute for Systems Biology, Seattle, Washington 98109, United States
| | - Jesse G. Meyer
- Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
| |
|
5
|
Freestone J, Noble WS, Keich U. Analysis of Tandem Mass Spectrometry Data with CONGA: Combining Open and Narrow Searches with Group-Wise Analysis. J Proteome Res 2024; 23:1894-1906. [PMID: 38652578 DOI: 10.1021/acs.jproteome.3c00399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/25/2024]
Abstract
Searching tandem mass spectrometry proteomics data against a database is a well-established method for assigning peptide sequences to observed spectra but typically cannot identify peptides harboring unexpected post-translational modifications (PTMs). Open modification searching aims to address this problem by allowing a spectrum to match a peptide even if the spectrum's precursor mass differs from the peptide mass. However, expanding the search space in this way can lead to a loss of statistical power to detect peptides. We therefore developed a method, called CONGA (combining open and narrow searches with group-wise analysis), that takes into account results from both types of searches (a traditional "narrow window" search and an open modification search) while carrying out rigorous false discovery rate control. The result is an algorithm that provides the best of both worlds: the ability to detect unexpected PTMs without a concomitant loss of power to detect unmodified peptides.
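The difference between a narrow-window and an open modification search comes down to which database peptides are allowed as candidates for a spectrum. The following is a minimal sketch of that candidate selection only, not of CONGA or its group-wise FDR control; the function name, the peptide masses, and the window widths in the usage note are hypothetical values chosen for illustration.

```python
def candidate_peptides(precursor_mass, peptide_masses, window_da):
    """Return (peptide, mass_delta) pairs whose mass lies within
    window_da of the observed precursor mass.

    A small window mimics a narrow search; a large one mimics an open
    modification search, where the delta hints at an unexpected PTM."""
    hits = []
    for peptide, mass in peptide_masses.items():
        delta = precursor_mass - mass
        if abs(delta) <= window_da:
            hits.append((peptide, delta))
    return hits
```

For example, a peptide of mass 1000.0 Da carrying an unexpected modification of roughly 80 Da yields no candidates with a 0.05 Da window but is recovered with a 100 Da window. The wider window also admits many more incorrect candidates, which is the loss of statistical power the abstract refers to.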
Affiliation(s)
- Jack Freestone
- School of Mathematics and Statistics F07, University of Sydney, NSW 2006, Australia
| | - William S Noble
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States
| | - Uri Keich
- School of Mathematics and Statistics F07, University of Sydney, NSW 2006, Australia
| |
|
6
|
Jiang Y, Rex DAB, Schuster D, Neely BA, Rosano GL, Volkmar N, Momenzadeh A, Peters-Clarke TM, Egbert SB, Kreimer S, Doud EH, Crook OM, Yadav AK, Vanuopadath M, Mayta ML, Duboff AG, Riley NM, Moritz RL, Meyer JG. Comprehensive Overview of Bottom-Up Proteomics using Mass Spectrometry. arXiv 2023; arXiv:2311.07791v1. [PMID: 38013887 PMCID: PMC10680866] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Proteomics is the large scale study of protein structure and function from biological systems through protein identification and quantification. "Shotgun proteomics" or "bottom-up proteomics" is the prevailing strategy, in which proteins are hydrolyzed into peptides that are analyzed by mass spectrometry. Proteomics studies can be applied to diverse studies ranging from simple protein identification to studies of proteoforms, protein-protein interactions, protein structural alterations, absolute and relative protein quantification, post-translational modifications, and protein stability. To enable this range of different experiments, there are diverse strategies for proteome analysis. The nuances of how proteomic workflows differ may be challenging to understand for new practitioners. Here, we provide a comprehensive overview of different proteomics methods to aid the novice and experienced researcher. We cover from biochemistry basics and protein extraction to biological interpretation and orthogonal validation. We expect this work to serve as a basic resource for new practitioners in the field of shotgun or bottom-up proteomics.
Affiliation(s)
- Yuming Jiang
- Department of Computational Biomedicine, Cedars Sinai Medical Center
| | - Devasahayam Arokia Balaya Rex
- Center for Systems Biology and Molecular Medicine, Yenepoya Research Centre, Yenepoya (Deemed to be University), Mangalore 575018, India
| | - Dina Schuster
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich 8093, Switzerland; Department of Biology, Institute of Molecular Biology and Biophysics, ETH Zurich, Zurich 8093, Switzerland; Laboratory of Biomolecular Research, Division of Biology and Chemistry, Paul Scherrer Institute, Villigen 5232, Switzerland
| | - Benjamin A. Neely
- Chemical Sciences Division, National Institute of Standards and Technology, NIST Charleston · Funded by NIST
| | - Germán L. Rosano
- Mass Spectrometry Unit, Institute of Molecular and Cellular Biology of Rosario, Rosario, Argentina · Funded by Grant PICT 2019-02971 (Agencia I+D+i)
| | - Norbert Volkmar
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich 8093, Switzerland
| | - Amanda Momenzadeh
- Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California, USA
| | | | - Susan B. Egbert
- Department of Chemistry, University of Manitoba, Winnipeg, Canada
| | - Simion Kreimer
- Smidt Heart Institute, Cedars Sinai Medical Center; Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center
| | - Emma H. Doud
- Center for Proteome Analysis, Indiana University School of Medicine, Indianapolis, Indiana, USA
| | - Oliver M. Crook
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
| | - Amit Kumar Yadav
- Translational Health Science and Technology Institute · Funded by Grant BT/PR16456/BID/7/624/2016 (Department of Biotechnology, India); Grant Translational Research Program (TRP) at THSTI funded by DBT
| | - Muralidharan Vanuopadath
- School of Biotechnology, Amrita Vishwa Vidyapeetham, Kollam-690 525, Kerala, India · Funded by Department of Health Research, Indian Council of Medical Research, Government of India (File No.R.12014/31/2022-HR)
| | - Martín L. Mayta
- School of Medicine and Health Sciences, Center for Health Sciences Research, Universidad Adventista del Plata, Libertador San Martín 3103, Argentina; Molecular Biology Department, School of Pharmacy and Biochemistry, Universidad Nacional de Rosario, Rosario 2000, Argentina
| | - Anna G. Duboff
- Department of Chemistry, University of Washington · Funded by Summer Research Acceleration Fellowship, Department of Chemistry, University of Washington
| | - Nicholas M. Riley
- Department of Chemistry, University of Washington · Funded by National Institutes of Health Grant R00 GM147304
| | - Robert L. Moritz
- Institute for Systems Biology, Seattle, WA, USA, 98109 · Funded by National Institutes of Health Grants R01GM087221, R24GM127667, U19AG023122, S10OD026936; National Science Foundation Award 1920268
| | - Jesse G. Meyer
- Department of Computational Biomedicine, Cedars Sinai Medical Center · Funded by National Institutes of Health Grant R21 AG074234; National Institutes of Health Grant R35 GM142502
| |
|
7
|
Etourneau L, Burger T. Challenging Targets or Describing Mismatches? A Comment on Common Decoy Distribution by Madej et al. J Proteome Res 2022; 21:2840-2845. [PMID: 36305797 DOI: 10.1021/acs.jproteome.2c00279] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/19/2023]
Abstract
In their recent article, Madej et al. (Madej, D.; Wu, L.; Lam, H. Common Decoy Distributions Simplify False Discovery Rate Estimation in Shotgun Proteomics. J. Proteome Res. 2022, 21 (2), 339-348) proposed an original way to solve the recurrent issue of controlling for the false discovery rate (FDR) in peptide-spectrum-match (PSM) validation. Briefly, they proposed to derive a single precise distribution of decoy matches termed the Common Decoy Distribution (CDD) and to use it to control for FDR during a target-only search. Conceptually, this approach is appealing as it takes the best of two worlds, i.e., decoy-based approaches (which leverage a large-scale collection of empirical mismatches) and decoy-free approaches (which are not subject to the randomness of decoy generation while sparing an additional database search). Interestingly, CDD also corresponds to a middle-of-the-road approach in statistics with respect to the two main families of FDR control procedures: Although historically based on estimating the false-positive distribution, FDR control has recently been demonstrated to be possible thanks to competition between the original variables (in proteomics, target sequences) and their fictional counterparts (in proteomics, decoys). Discriminating between these two theoretical trends is of prime importance for computational proteomics. In addition to highlighting why proteomics was a source of inspiration for theoretical biostatistics, it provides practical insights into the improvements that can be made to FDR control methods used in proteomics, including CDD.
Affiliation(s)
- Lucas Etourneau
- Univ. Grenoble Alpes, CNRS, CEA, Inserm, ProFI, FR2048, Grenoble, France
| | - Thomas Burger
- Univ. Grenoble Alpes, CNRS, CEA, Inserm, ProFI, FR2048, Grenoble, France
| |
|
8
|
Lin A, Short T, Noble WS, Keich U. Improving Peptide-Level Mass Spectrometry Analysis via Double Competition. J Proteome Res 2022; 21:2412-2420. [PMID: 36166314 PMCID: PMC10108709 DOI: 10.1021/acs.jproteome.2c00282] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 11/29/2022]
Abstract
The analysis of shotgun proteomics data often involves generating lists of inferred peptide-spectrum matches (PSMs) and/or of peptides. The canonical approach for generating these discovery lists is by controlling the false discovery rate (FDR), most commonly through target-decoy competition (TDC). At the PSM level, TDC is implemented by competing each spectrum's best-scoring target (real) peptide match with its best match against a decoy database. This PSM-level procedure can be adapted to the peptide level by selecting the top-scoring PSM per peptide prior to FDR estimation. Here, we first highlight and empirically augment a little known previous work by He et al., which showed that TDC-based PSM-level FDR estimates can be liberally biased. We thus propose that researchers instead focus on peptide-level analysis. We then investigate three ways to carry out peptide-level TDC and show that the most common method ("PSM-only") offers the lowest statistical power in practice. An alternative approach that carries out a double competition, first at the PSM and then at the peptide level ("PSM-and-peptide"), is the most powerful method, yielding an average increase of 17% more discovered peptides at 1% FDR threshold relative to the PSM-only method.
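The procedure described in this abstract, PSM-level competition followed by selecting the top-scoring PSM per peptide, can be sketched roughly as follows. This is an illustrative sketch and not the authors' implementation: the `PSM` record, the function names, the tie-breaking rule (targets win ties), and the `(decoys + 1) / targets` FDR estimate are all assumptions, and the "PSM-and-peptide" double competition itself is not reproduced here.

```python
from typing import NamedTuple


class PSM(NamedTuple):
    spectrum_id: str
    peptide: str
    score: float
    is_decoy: bool


def psm_level_competition(target_best, decoy_best):
    """Per spectrum, keep the better of its best target and best decoy match.
    Ties go to the target here (an assumption made for simplicity)."""
    winners = []
    for spectrum_id, target in target_best.items():
        decoy = decoy_best.get(spectrum_id)
        if decoy is None or target.score >= decoy.score:
            winners.append(target)
        else:
            winners.append(decoy)
    return winners


def best_psm_per_peptide(psms):
    """Reduce a PSM list to the single top-scoring PSM for each peptide."""
    best = {}
    for psm in psms:
        if psm.peptide not in best or psm.score > best[psm.peptide].score:
            best[psm.peptide] = psm
    return list(best.values())


def accept_at_fdr(items, threshold):
    """Return the targets in the longest top-scoring prefix whose estimated
    FDR, (decoys + 1) / targets, does not exceed the threshold."""
    ranked = sorted(items, key=lambda p: p.score, reverse=True)
    decoys = targets = cutoff = 0
    for rank, psm in enumerate(ranked, start=1):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and (decoys + 1) / targets <= threshold:
            cutoff = rank
    return [p for p in ranked[:cutoff] if not p.is_decoy]
```

A peptide-level list would then be obtained as `accept_at_fdr(best_psm_per_peptide(psm_level_competition(target_best, decoy_best)), 0.01)`, where the two dictionaries map each spectrum identifier to its best target and best decoy PSM.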
Affiliation(s)
- Andy Lin
- Chemical and Biological Signatures, Pacific Northwest National Laboratory, Seattle, Washington 98109, United States
| | - Temana Short
- School of Mathematics & Statistics, University of Sydney, New South Wales, 2006, Australia
| | - William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States
| | - Uri Keich
- School of Mathematics & Statistics, University of Sydney, New South Wales, 2006, Australia
| |
|
9
|
Fancello L, Burger T. An analysis of proteogenomics and how and when transcriptome-informed reduction of protein databases can enhance eukaryotic proteomics. Genome Biol 2022; 23:132. [PMID: 35725496 PMCID: PMC9208142 DOI: 10.1186/s13059-022-02701-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 09/29/2021] [Accepted: 06/09/2022] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND: Proteogenomics aims to identify variant or unknown proteins in bottom-up proteomics, by searching transcriptome- or genome-derived custom protein databases. However, empirical observations reveal that these large proteogenomic databases produce lower-sensitivity peptide identifications. Various strategies have been proposed to avoid this, including the generation of reduced transcriptome-informed protein databases, which only contain proteins whose transcripts are detected in the sample-matched transcriptome. These were found to increase peptide identification sensitivity. Here, we present a detailed evaluation of this approach. RESULTS: We establish that the increased sensitivity in peptide identification is in fact a statistical artifact, directly resulting from the limited capability of target-decoy competition to accurately model incorrect target matches when using excessively small databases. As anti-conservative false discovery rates (FDRs) are likely to hamper the robustness of the resulting biological conclusions, we advocate for alternative FDR control methods that are less sensitive to database size. Nevertheless, reduced transcriptome-informed databases are useful, as they reduce the ambiguity of protein identifications, yielding fewer shared peptides. Furthermore, searching the reference database and subsequently filtering proteins whose transcripts are not expressed reduces protein identification ambiguity to a similar extent, but is more transparent and reproducible. CONCLUSIONS: In summary, using transcriptome information is an interesting strategy that has not been promoted for the right reasons. While the increase in peptide identifications from searching reduced transcriptome-informed databases is an artifact caused by the use of an FDR control method unsuitable to excessively small databases, transcriptome information can reduce the ambiguity of protein identifications.
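The post hoc strategy mentioned in this abstract, searching the reference database first and then filtering out proteins without transcript evidence, can be sketched as a simple set operation. This is a hypothetical illustration rather than the authors' workflow: the function names and the data layout (a mapping from each identified peptide to the set of proteins it could come from) are assumptions.

```python
def filter_by_expression(peptide_to_proteins, expressed):
    """Drop candidate proteins whose transcripts were not detected;
    peptides left with no candidate protein are removed entirely."""
    filtered = {}
    for peptide, proteins in peptide_to_proteins.items():
        kept = proteins & expressed
        if kept:
            filtered[peptide] = kept
    return filtered


def shared_peptide_count(peptide_to_proteins):
    """Number of peptides that map to more than one protein."""
    return sum(1 for proteins in peptide_to_proteins.values() if len(proteins) > 1)
```

Comparing `shared_peptide_count` before and after filtering gives a rough measure of how much protein identification ambiguity the transcriptome information removes, without changing the database used for FDR control.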
Affiliation(s)
- Laura Fancello
- CNRS, CEA, Inserm, BioSanté U1292, Profi FR2048, Université Grenoble Alpes, Grenoble, France
| | - Thomas Burger
- CNRS, CEA, Inserm, BioSanté U1292, Profi FR2048, Université Grenoble Alpes, Grenoble, France.
| |
|
10
|
Heil LR, Fondrie WE, McGann CD, Federation AJ, Noble WS, MacCoss MJ, Keich U. Building Spectral Libraries from Narrow-Window Data-Independent Acquisition Mass Spectrometry Data. J Proteome Res 2022; 21:1382-1391. [PMID: 35549345 DOI: 10.1021/acs.jproteome.1c00895] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/28/2022]
Abstract
Advances in library-based methods for peptide detection from data-independent acquisition (DIA) mass spectrometry have made it possible to detect and quantify tens of thousands of peptides in a single mass spectrometry run. However, many of these methods rely on a comprehensive, high-quality spectral library containing information about the expected retention time and fragmentation patterns of peptides in the sample. Empirical spectral libraries are often generated through data-dependent acquisition and may suffer from biases as a result. Spectral libraries can be generated in silico, but these models are not trained to handle all possible post-translational modifications. Here, we propose a false discovery rate-controlled spectrum-centric search workflow to generate spectral libraries directly from gas-phase fractionated DIA tandem mass spectrometry data. We demonstrate that this strategy is able to detect phosphorylated peptides and can be used to generate a spectral library for accurate peptide detection and quantitation in wide-window DIA data. We compare the results of this search workflow to other library-free approaches and demonstrate that our search is competitive in terms of accuracy and sensitivity. These results demonstrate that the proposed workflow has the capacity to generate spectral libraries while avoiding the limitations of other methods.
Affiliation(s)
- Lilian R Heil
- Department of Genome Sciences, University of Washington, Seattle, Washington 98105, United States
| | - William E Fondrie
- Department of Genome Sciences, University of Washington, Seattle, Washington 98105, United States
| | - Christopher D McGann
- Department of Genome Sciences, University of Washington, Seattle, Washington 98105, United States
| | - Alexander J Federation
- Department of Genome Sciences, University of Washington, Seattle, Washington 98105, United States
| | - William S Noble
- Department of Genome Sciences, University of Washington, Seattle, Washington 98105, United States; Paul G. Allen School for Computer Science and Engineering, University of Washington, Seattle, Washington 98105, United States
| | - Michael J MacCoss
- Department of Genome Sciences, University of Washington, Seattle, Washington 98105, United States
| | - Uri Keich
- School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia
| |
|
11
|
Blakeley-Ruiz JA, Kleiner M. Considerations for Constructing a Protein Sequence Database for Metaproteomics. Comput Struct Biotechnol J 2022; 20:937-952. [PMID: 35242286 PMCID: PMC8861567 DOI: 10.1016/j.csbj.2022.01.018] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 08/26/2021] [Revised: 01/14/2022] [Accepted: 01/18/2022] [Indexed: 12/14/2022] Open
Abstract
Mass spectrometry-based metaproteomics has emerged as a prominent technique for interrogating the functions of specific organisms in microbial communities, in addition to total community function. Identifying proteins by mass spectrometry requires matching mass spectra of fragmented peptide ions to a database of protein sequences corresponding to the proteins in the sample. This sequence database determines which protein sequences can be identified from the measurement, and as such the taxonomic and functional information that can be inferred from a metaproteomics measurement. Thus, the construction of the protein sequence database directly impacts the outcome of any metaproteomics study. Several factors, such as source of sequence information and database curation, need to be considered during database construction to maximize accurate protein identifications traceable to the species of origin. In this review, we provide an overview of existing strategies for database construction and the relevant studies that have sought to test and validate these strategies. Based on this review of the literature and our experience we provide a decision tree and best practices for choosing and implementing database construction strategies.
Affiliation(s)
- J. Alfredo Blakeley-Ruiz
- Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, USA
- Center for Gastrointestinal Biology and Disease, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Corresponding authors at: Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, USA.
| | - Manuel Kleiner
- Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, USA
- Corresponding authors at: Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, USA.
| |
|