1
Raj A, Aggarwal S, Singh P, Yadav AK, Dash D. PgxSAVy: A tool for comprehensive evaluation of variant peptide quality in proteogenomics - catching the (un)usual suspects. Comput Struct Biotechnol J 2024; 23:711-722. [PMID: 38292474 PMCID: PMC10825656 DOI: 10.1016/j.csbj.2023.12.033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2023] [Revised: 12/19/2023] [Accepted: 12/23/2023] [Indexed: 02/01/2024] Open
Abstract
Variant peptides resulting from single nucleotide polymorphisms (SNPs) can lead to aberrant protein functions and have translational potential for disease diagnosis and personalized therapy. Variant peptides detected by proteogenomics are fraught with a high number of false positives, but there is no uniform and comprehensive approach to assess variant quality across analysis pipelines. Despite class-specific FDR along with ad-hoc filters, the problem is far from solved. These protocols are typically manual and tedious, and thus not uniform across labs. We demonstrate that variant peptide rescoring, integrated with intensity, variant event information and search result features, allows better discrimination of correct variant peptides. Implemented into PgxSAVy - a tool for quality control of variant peptides, this method can tackle the high rate of false positives. PgxSAVy provides a rigorous framework for quality control and annotations of variant peptides on the basis of (i) variant quality, (ii) isobaric masses, and (iii) disease annotation. PgxSAVy demonstrated high accuracy by identifying true variants with 98.43% accuracy on simulated data. Large-scale proteogenomic reanalysis of ∼2.8 million spectra (PXD004010 and PXD001468) resulted in 12,705 variant peptide spectrum matches (PSMs), of which PgxSAVy evaluated 3028 (23.8%), 1409 (11.1%) and 8268 (65.1%) as confident, semi-confident and doubtful, respectively. PgxSAVy also annotates the variants based on their pathogenicity and provides support for assisted manual validation. The analysis of proteins carrying variants can provide fine granularity in discovering important pathways. PgxSAVy will advance personalized medicine by providing a comprehensive framework for quality control and prioritization of proteogenomics variants. PgxSAVy is freely available at https://pgxsavy.igib.res.in/ as a webserver and https://github.com/anuragraj/PgxSAVy as a stand-alone tool.
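The class shares quoted in the abstract can be checked against the reported counts. The short Python sketch below does only that arithmetic; the dictionary and variable names are illustrative and are not taken from PgxSAVy.

```python
# Counts of variant PSMs per confidence class, as reported in the abstract.
counts = {"confident": 3028, "semi-confident": 1409, "doubtful": 8268}

# The three classes should add up to the 12,705 variant PSMs reported.
total = sum(counts.values())

# Share of each class in percent, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {counts[label]} of {total} ({pct}%)")
```

The counts sum to the stated total and the rounded shares match the percentages given in the abstract.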
Affiliation(s)
- Anurag Raj
  - G. N. Ramachandran Knowledge Centre for Genomics Informatics, CSIR – Institute of Genomics and Integrative Biology, New Delhi, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
- Suruchi Aggarwal
  - Computational and Mathematical Biology Centre (CMBC), 3rd Milestone, Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
  - Centre for Drug Discovery (CDD), 3rd Milestone, Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
  - Centre for Microbial Research (CMR), Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd Milestone, Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
- Prateek Singh
  - G. N. Ramachandran Knowledge Centre for Genomics Informatics, CSIR – Institute of Genomics and Integrative Biology, New Delhi, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
- Amit Kumar Yadav
  - Computational and Mathematical Biology Centre (CMBC), 3rd Milestone, Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
  - Centre for Drug Discovery (CDD), 3rd Milestone, Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
  - Centre for Microbial Research (CMR), Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd Milestone, Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
- Debasis Dash
  - G. N. Ramachandran Knowledge Centre for Genomics Informatics, CSIR – Institute of Genomics and Integrative Biology, New Delhi, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
2
Lei JT, Jaehnig EJ, Smith H, Holt MV, Li X, Anurag M, Ellis MJ, Mills GB, Zhang B, Labrie M. The Breast Cancer Proteome and Precision Oncology. Cold Spring Harb Perspect Med 2023; 13:a041323. [PMID: 37137501 PMCID: PMC10547392 DOI: 10.1101/cshperspect.a041323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/05/2023]
Abstract
The goal of precision oncology is to translate the molecular features of cancer into predictive and prognostic tests that can be used to individualize treatment leading to improved outcomes and decreased toxicity. Success for this strategy in breast cancer is exemplified by efficacy of trastuzumab in tumors overexpressing ERBB2 and endocrine therapy for tumors that are estrogen receptor positive. However, other effective treatments, including chemotherapy, immune checkpoint inhibitors, and CDK4/6 inhibitors are not associated with strong predictive biomarkers. Proteomics promises another tier of information that, when added to genomic and transcriptomic features (proteogenomics), may create new opportunities to improve both treatment precision and therapeutic hypotheses. Here, we review both mass spectrometry-based and antibody-dependent proteomics as complementary approaches. We highlight how these methods have contributed toward a more complete understanding of breast cancer and describe the potential to guide diagnosis and treatment more accurately.
Affiliation(s)
- Jonathan T Lei
  - Lester and Sue Smith Breast Center and Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Eric J Jaehnig
  - Lester and Sue Smith Breast Center and Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Hannah Smith
  - Knight Cancer Institute, Oregon Health & Science University, Portland, Oregon 97239, USA
- Matthew V Holt
  - Lester and Sue Smith Breast Center and Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Xi Li
  - Knight Cancer Institute, Oregon Health & Science University, Portland, Oregon 97239, USA
- Meenakshi Anurag
  - Lester and Sue Smith Breast Center and Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Matthew J Ellis
  - Lester and Sue Smith Breast Center and Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Gordon B Mills
  - Knight Cancer Institute, Oregon Health & Science University, Portland, Oregon 97239, USA
- Bing Zhang
  - Lester and Sue Smith Breast Center and Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Marilyne Labrie
  - Knight Cancer Institute, Oregon Health & Science University, Portland, Oregon 97239, USA
3
Choong WK, Sung TY. Multiaspect Examinations of Possible Alternative Mappings of Identified Variant Peptides: A Case Study on the HEK293 Cell Line. ACS OMEGA 2022; 7:16454-16467. [PMID: 35601313 PMCID: PMC9118379 DOI: 10.1021/acsomega.2c00466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2022] [Accepted: 04/20/2022] [Indexed: 06/15/2023]
Abstract
Adopting a proteogenomics approach to validate single nucleotide variation events by identifying corresponding single amino acid variant peptides from mass spectrometry (MS)-based proteomics data facilitates translational and clinical research. Although variant peptides are usually identified from MS data with a stringent false discovery rate (FDR), FDR control could fail to eliminate dubious results caused by several issues; thus, postexamination to eliminate dubious results is required. However, comprehensive postexaminations of identification results are still lacking. Therefore, we propose a framework of three bottom-up levels, peptide-spectrum match, peptide, and variant event levels, that consists of rigorous 11-aspect examinations from the MS perspective to further confirm the reliability of variant events. As a proof of concept and to show feasibility, we demonstrate the 11 examinations on the identified variant peptides from an HEK293 cell line data set, where various database search strategies were applied to maximize the number of identified variant PSMs with an FDR <1% for postexaminations. The results showed that the FDR criterion alone is insufficient to validate identified variant peptides and that the 11 postexaminations can reveal low-confidence variant events detected by shotgun proteomics experiments. Therefore, we suggest that postexaminations of identified variant events based on the proposed framework are necessary for proteogenomics studies.
4
Yi X, Gong F, Fu Y. Transfer posterior error probability estimation for peptide identification. BMC Bioinformatics 2020; 21:173. [PMID: 32366221 PMCID: PMC7199311 DOI: 10.1186/s12859-020-3485-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 09/19/2019] [Accepted: 04/08/2020] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND In shotgun proteomics, database searching of tandem mass spectra results in a great number of peptide-spectrum matches (PSMs), many of which are false positives. Quality control of PSMs is a multiple hypothesis testing problem, and the false discovery rate (FDR) or the posterior error probability (PEP) is the commonly used statistical confidence measure. PEP, also called local FDR, can evaluate the confidence of individual PSMs and thus is more desirable than FDR, which evaluates the global confidence of a collection of PSMs. Estimation of PEP can be achieved by decomposing the null and alternative distributions of PSM scores as long as the given data is sufficient. However, in many proteomic studies, only a group (subset) of PSMs, e.g. those with specific post-translational modifications, are of interest. The group can be very small, making the direct PEP estimation by the group data inaccurate, especially for the high-score area where the score threshold is taken. Using the whole set of PSMs to estimate the group PEP is inappropriate either, because the null and/or alternative distributions of the group can be very different from those of combined scores.
RESULTS The transfer PEP algorithm is proposed to more accurately estimate the PEPs of peptide identifications in small groups. Transfer PEP derives the group null distribution through its empirical relationship with the combined null distribution, and estimates the group alternative distribution, as well as the null proportion, using an iterative semi-parametric method. Validated on both simulated data and real proteomic data, transfer PEP showed remarkably higher accuracy than the direct combined and separate PEP estimation methods.
CONCLUSIONS We presented a novel approach to group PEP estimation for small groups and implemented it for the peptide identification problem in proteomics. The methodology of the approach is in principle applicable to the small-group PEP estimation problems in other fields.
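The distinction the abstract draws between PEP (confidence of one PSM) and FDR (confidence of a set of PSMs) can be illustrated with a small Python sketch. It relies on the standard relationship that the expected FDR of an accepted set equals the mean PEP of its members. This is not the transfer PEP algorithm from the paper; the function name `fdr_from_peps` and the input values are invented for illustration.

```python
from typing import List


def fdr_from_peps(peps: List[float]) -> List[float]:
    """Return the estimated FDR of each top-k set of PSMs.

    PSMs are ranked by ascending PEP; the FDR of the k best PSMs is
    estimated as the running mean of their PEPs.
    """
    ordered = sorted(peps)
    fdrs: List[float] = []
    running_sum = 0.0
    for k, pep in enumerate(ordered, start=1):
        running_sum += pep
        fdrs.append(running_sum / k)
    return fdrs


# Hypothetical PEPs for four PSMs (made-up numbers, not from the paper).
example_peps = [0.01, 0.02, 0.30, 0.05]
example_fdrs = fdr_from_peps(example_peps)
print(example_fdrs)
```

In this toy input the PSM with PEP 0.30 is individually unreliable, yet the set containing it still has an estimated FDR below 10%, which is why a per-PSM measure is useful when only a small group of identifications matters.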
Affiliation(s)
- Xinpei Yi
  - National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
- Fuzhou Gong
  - National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
- Yan Fu
  - National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
5
Wen B, Li K, Zhang Y, Zhang B. Cancer neoantigen prioritization through sensitive and reliable proteogenomics analysis. Nat Commun 2020; 11:1759. [PMID: 32273506 PMCID: PMC7145864 DOI: 10.1038/s41467-020-15456-w] [Citation(s) in RCA: 69] [Impact Index Per Article: 17.3] [Received: 07/29/2019] [Accepted: 03/10/2020] [Indexed: 01/01/2023] Open
Abstract
Genomics-based neoantigen discovery can be enhanced by proteomic evidence, but there remains a lack of consensus on the performance of different quality control methods for variant peptide identification in proteogenomics. We propose to use the difference between accurately predicted and observed retention times for each peptide as a metric to evaluate different quality control methods. To this end, we develop AutoRT, a deep learning algorithm with high accuracy in retention time prediction. Analysis of three cancer data sets with a total of 287 tumor samples using different quality control strategies results in substantially different numbers of identified variant peptides and putative neoantigens. Our systematic evaluation, using the proposed retention time metric, provides insights and practical guidance on the selection of quality control strategies. We implement the recommended strategy in a computational workflow named NeoFlow to support proteogenomics-based neoantigen prioritization, enabling more sensitive discovery of putative neoantigens. Identifying mutation-derived neoantigens by proteogenomics requires robust strategies for quality control. Here, the authors propose peptide retention time as an evaluation metric for proteogenomics quality control methods, and develop a deep learning algorithm for accurate retention time prediction.
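The evaluation metric described in the abstract, the difference between predicted and observed retention time per peptide, can be sketched in a few lines of Python. This is only an illustration of the metric: it does not reproduce AutoRT or NeoFlow, the function name `delta_rt` is hypothetical, and the retention times are invented.

```python
from statistics import median
from typing import List, Sequence


def delta_rt(predicted: Sequence[float], observed: Sequence[float]) -> List[float]:
    """Absolute difference between predicted and observed retention times."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return [abs(p - o) for p, o in zip(predicted, observed)]


# Made-up retention times (minutes) for peptides passing two hypothetical
# quality control strategies; a smaller median delta suggests a cleaner set.
strategy_a = delta_rt([10.0, 20.0, 30.0], [10.5, 19.0, 30.5])
strategy_b = delta_rt([10.0, 20.0, 30.0], [14.0, 12.0, 33.0])

print("median delta RT, strategy A:", median(strategy_a))
print("median delta RT, strategy B:", median(strategy_b))
```

Under this reading, a quality control strategy whose accepted variant peptides show larger deviations from the predicted retention time is likely admitting more incorrect identifications.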
Affiliation(s)
- Bo Wen
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Kai Li
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Yun Zhang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Bing Zhang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
6
Tan Z, Zhu J, Stemmer PM, Sun L, Yang Z, Schultz K, Gaffrey MJ, Cesnik AJ, Yi X, Hao X, Shortreed MR, Shi T, Lubman DM. Comprehensive Detection of Single Amino Acid Variants and Evaluation of Their Deleterious Potential in a PANC-1 Cell Line. J Proteome Res 2020; 19:1635-1646. [PMID: 32058723 DOI: 10.1021/acs.jproteome.9b00840] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 02/07/2023]
Abstract
Identifying single amino acid variants (SAAVs) in cancer is critical for precision oncology. Several advanced algorithms are now available to identify SAAVs, but attempts to combine different algorithms and optimize them on large data sets to achieve a more comprehensive coverage of SAAVs have not been implemented. Herein, we report an expanded detection of SAAVs in the PANC-1 cell line using three different strategies, which results in the identification of 540 SAAVs in the mass spectrometry data. Among the set of 540 SAAVs, 79 are evaluated as deleterious SAAVs based on analysis using the novel AssVar software in which one of the driver mutations found in each protein of KRAS, TP53, and SLC37A4 is further validated using independent selected reaction monitoring (SRM) analysis. Our study represents the most comprehensive discovery of SAAVs to date and the first large-scale detection of deleterious SAAVs in the PANC-1 cell line. This work may serve as the basis for future research in pancreatic cancer and personal immunotherapy and treatment.
Affiliation(s)
- Zhijing Tan
  - Department of Surgery, The University of Michigan, Ann Arbor, Michigan 48109, United States
- Jianhui Zhu
  - Department of Surgery, The University of Michigan, Ann Arbor, Michigan 48109, United States
- Paul M Stemmer
  - Institute of Environmental Health Sciences, Wayne State University, Detroit, Michigan 48202, United States
- Liangliang Sun
  - Department of Chemistry, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
- Zhichang Yang
  - Department of Chemistry, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
- Kendall Schultz
  - Integrative Omics Group, Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Matthew J Gaffrey
  - Integrative Omics Group, Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Anthony J Cesnik
  - Department of Genetics, Stanford University, Stanford, California 94305, United States
- Xinpei Yi
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030, United States
- Xiaohu Hao
  - Shanghai Institutes for Biological Science, Chinese Academy of Science, Shanghai 200031, China
- Michael R Shortreed
  - Department of Chemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States
- Tujin Shi
  - Integrative Omics Group, Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- David M Lubman
  - Department of Surgery, The University of Michigan, Ann Arbor, Michigan 48109, United States
7
Beyond Genes: Re-Identifiability of Proteomic Data and Its Implications for Personalized Medicine. Genes (Basel) 2019; 10:genes10090682. [PMID: 31492022 PMCID: PMC6770961 DOI: 10.3390/genes10090682] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/19/2019] [Revised: 08/30/2019] [Accepted: 09/01/2019] [Indexed: 02/07/2023] Open
Abstract
The increasing availability of high throughput proteomics data provides us with opportunities as well as posing new ethical challenges regarding data privacy and re-identifiability of participants. Moreover, the fact that proteomics represents a level between the genotype and the phenotype further exacerbates the situation, introducing dilemmas related to publicly available data, anonymization, ownership of information and incidental findings. In this paper, we try to differentiate proteomics from genomics data and cover the ethical challenges related to proteomics data sharing. Finally, we give an overview of the proposed solutions and the outlook for future studies.
8
Tan Z, Yi X, Carruthers NJ, Stemmer PM, Lubman DM. Single Amino Acid Variant Discovery in Small Numbers of Cells. J Proteome Res 2018; 18:417-425. [PMID: 30404448 DOI: 10.1021/acs.jproteome.8b00694] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 12/17/2022]
Abstract
We have performed deep proteomic profiling down to as few as 9 Panc-1 cells using sample fractionation, TMT multiplexing, and a carrier/reference strategy. Off line fractionation of the TMT-labeled sample pooled with TMT-labeled carrier Panc-1 whole cell proteome was achieved using alkaline reversed phase spin columns. The fractionation in conjunction with the carrier/reference (C/R) proteome allowed us to detect 47 414 unique peptides derived from 6261 proteins, which provided a sufficient coverage to search for single amino acid variants (SAAVs) related to cancer. This high sample coverage is essential in order to detect a significant number of SAAVs. In order to verify genuine SAAVs versus false SAAVs, we used the SAVControl pipeline and found a total of 79 SAAVs from the 9-cell Panc-1 sample and 174 SAAVs from the 5000-cell Panc-1 C/R proteome. The SAAVs as sorted into high confidence and low confidence SAAVs were checked manually. All the high confidence SAAVs were found to be genuine SAAVs, while half of the low confidence SAAVs were found to be false SAAVs mainly related to PTMs. We identified several cancer-related SAAVs including KRAS, which is an important oncoprotein in pancreatic cancer. In addition, we were able to detect sites involved in loss or gain of glycosylation due to the enhanced coverage available in these experiments where we can detect both sites of loss and gain of glycosylation.
Affiliation(s)
- Zhijing Tan
  - Department of Surgery, University of Michigan, Ann Arbor, Michigan 48109, United States
- Xinpei Yi
  - NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Nicholas J Carruthers
  - Institute of Environmental Health Sciences, Wayne State University, Detroit, Michigan 48202, United States
- Paul M Stemmer
  - Institute of Environmental Health Sciences, Wayne State University, Detroit, Michigan 48202, United States
- David M Lubman
  - Department of Surgery, University of Michigan, Ann Arbor, Michigan 48109, United States