1
Hui HWH, Chan WX, Goh WWB. Assessing the impact of batch effect associated missing values on downstream analysis in high-throughput biomedical data. Brief Bioinform 2025; 26:bbaf168. [PMID: 40230039 DOI: 10.1093/bib/bbaf168] [Received: 11/01/2024] [Revised: 03/10/2025] [Accepted: 03/24/2025] [Indexed: 04/16/2025] Open
Abstract
Batch effect associated missing values (BEAMs) are batch-wide missingness induced by the integration of data with different coverage of biomedical features. BEAMs can present substantial challenges in data analysis. This study investigates how BEAMs impact missing value imputation (MVI) and batch effect (BE) correction algorithms (BECAs). Through simulations and analyses of real-world datasets including the Clinical Proteomic Tumour Analysis Consortium (CPTAC), we evaluated six MVI methods: K-nearest neighbors (KNN), Mean, MinProb, Singular Value Decomposition (SVD), Multivariate Imputation by Chained Equations (MICE), and Random Forest (RF), with ComBat and limma as the BECAs. We demonstrated that BEAMs strongly affect MVI performance, resulting in inaccurate imputed values, inflated significant P-values, and compromised BE correction. KNN, SVD, and RF were particularly prone to propagating random signals, resulting in false statistical confidence. While imputation with Mean and MinProb was less detrimental, artifacts were nonetheless introduced. Furthermore, the detrimental effect of BEAMs increased in parallel with their severity in the data. Our findings highlight the necessity of comprehensive assessments and tailored strategies to handle BEAMs in multi-batch datasets to ensure reliable data analysis and interpretation. Future work should investigate more advanced simulations and a variety of dedicated MVI methods to robustly address BEAMs.
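As an illustration of the definition above, the following minimal sketch flags features that are missing from every sample of at least one batch while being observed in another. The function name `find_beams`, the dictionary layout, and the toy values are assumptions made for this example; they are not taken from the study.

```python
from typing import Dict, List, Optional

# Hypothetical layout: feature name -> one value per sample (None = missing).
Matrix = Dict[str, List[Optional[float]]]


def find_beams(data: Matrix, batches: List[str]) -> Dict[str, List[str]]:
    """Return, for each affected feature, the batches in which it is entirely missing."""
    beams: Dict[str, List[str]] = {}
    for feature, values in data.items():
        observed: Dict[str, bool] = {}
        for value, batch in zip(values, batches):
            # A batch counts as observed once any of its samples has a value.
            observed[batch] = observed.get(batch, False) or value is not None
        missing = sorted(b for b, seen in observed.items() if not seen)
        # A feature missing in every batch is not batch-associated, so skip it.
        if missing and len(missing) < len(observed):
            beams[feature] = missing
    return beams


batches = ["A", "A", "B", "B"]
data = {
    "P1": [10.2, 9.8, 10.5, 10.1],  # fully observed
    "P2": [8.1, None, 8.4, 8.0],    # sporadic missingness, not batch-wide
    "P3": [7.3, 7.9, None, None],   # absent from all of batch B
}
beam_features = find_beams(data, batches)
```

Here only `P3` is reported, because `P2` still has an observed value in each batch. A check of this kind could be run before imputation to see how many features an imputation method would have to fill without any within-batch information.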
Affiliation(s)
- Harvard Wai Hann Hui
- Lee Kong Chian School of Medicine, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore
- Wei Xin Chan
- Lee Kong Chian School of Medicine, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore
- Center for Biomedical Informatics, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore
- Wilson Wen Bin Goh
- Lee Kong Chian School of Medicine, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore
- Center for Biomedical Informatics, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Center for Artificial Intelligence in Medicine, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore
- Division of Neurology, Department of Brain Sciences, Faculty of Medicine, Imperial College London, Burlington Danes, The Hammersmith Hospital, Du Cane Road, London W12 0NN, United Kingdom
2
Mou X, Du H, Qiao G, Li J. Evaluation of imputation and imputation-free strategies for differential abundance analysis in metaproteomics data. Brief Bioinform 2025; 26:bbaf141. [PMID: 40254829 PMCID: PMC12009712 DOI: 10.1093/bib/bbaf141] [Received: 01/19/2025] [Revised: 03/03/2025] [Accepted: 03/08/2025] [Indexed: 04/22/2025] Open
Abstract
For metaproteomics data derived from the collective protein composition of dynamic multi-organism systems, the proportion of missing values and dimensions of data exceeds that observed in single-organism experiments. Consequently, evaluations of differential analysis strategies in other mass spectrometry (MS) data (such as proteomics and metabolomics) may not be directly applicable to metaproteomics data. In this study, we systematically evaluated five imputation methods [sample minimum, quantile regression, k-nearest neighbors (KNN), Bayesian principal component analysis (bPCA), random forest (RF)] and six imputation-free methods (moderated t-test, two-part t-test, two-part Wilcoxon test, semiparametric differential abundance analysis, differential abundance analysis with Bayes shrinkage estimation of variance method, and Mixture) for differential analysis in simulated metaproteomic datasets based on both data-dependent acquisition MS experiments and emerging data-independent acquisition experiments. The simulation datasets comprised 588 scenarios by considering the impacts of sample size, fold change between case and control, and missing value ratio at random and nonrandom. Compared to imputation-free methods, KNN, bPCA, and RF imputation performed poorly in datasets with a high missingness ratio and large sample size and resulted in a high false-positive risk. We made empirical recommendations based on the balance of sensitivity in analysis and control of false positives. The moderated t-test was optimal in scenarios of large sample size with a low missingness ratio. The two-part Wilcoxon test was recommended in scenarios of small sample size with a low missingness ratio or large sample size with a high missingness ratio. The comprehensive evaluations in our study can provide guidance for the differential abundance analysis in metaproteomics.
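To make the idea of a two-part test concrete, here is a minimal sketch in the commonly described form: one statistic compares the proportion of observed values between groups, a second compares the observed intensities, and the sum of their squares is referred to a chi-square distribution with 2 degrees of freedom. The function name `two_part_test`, the use of a Welch-type statistic treated as approximately standard normal, and the toy data are assumptions of this example, not the implementation evaluated in the study.

```python
import math
from statistics import mean, variance
from typing import List, Optional, Tuple


def two_part_test(x: List[Optional[float]], y: List[Optional[float]]) -> Tuple[float, float]:
    """Return (chi-square statistic, p-value) for a two-part comparison of x and y."""
    obs_x = [v for v in x if v is not None]
    obs_y = [v for v in y if v is not None]

    # Part 1: pooled two-proportion z statistic on the fraction of observed values.
    p_x, p_y = len(obs_x) / len(x), len(obs_y) / len(y)
    pooled = (len(obs_x) + len(obs_y)) / (len(x) + len(y))
    se_prop = math.sqrt(pooled * (1 - pooled) * (1 / len(x) + 1 / len(y)))
    z_prop = (p_x - p_y) / se_prop if se_prop > 0 else 0.0

    # Part 2: Welch-type statistic on the observed values only
    # (assumption: treated as approximately standard normal).
    z_obs = 0.0
    if len(obs_x) >= 2 and len(obs_y) >= 2:
        se_obs = math.sqrt(variance(obs_x) / len(obs_x) + variance(obs_y) / len(obs_y))
        if se_obs > 0:
            z_obs = (mean(obs_x) - mean(obs_y)) / se_obs

    chi2 = z_prop ** 2 + z_obs ** 2
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-chi2 / 2)
    return chi2, p_value


case = [10.0, 10.5, 9.5, 10.2, 9.8, 10.1]
control = [None, None, None, None, 5.0, 5.5]
chi2, p_value = two_part_test(case, control)
```

When a part cannot be computed (for example, fewer than two observed values in a group), this sketch sets its contribution to zero while keeping 2 degrees of freedom, which is a conservative simplification. No values are imputed at any point, which is what distinguishes this family of tests from impute-then-test pipelines.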
Affiliation(s)
- Xinyi Mou
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
- Haoyu Du
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
- Guanghua Qiao
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
- Jing Li
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
3
Reed ER, Chandler KB, Lopez P, Costello CE, Andersen SL, Perls TT, Li M, Bae H, Soerensen M, Monti S, Sebastiani P. Cross-platform proteomics signatures of extreme old age. GeroScience 2025; 47:1199-1220. [PMID: 39048883 PMCID: PMC11872828 DOI: 10.1007/s11357-024-01286-x] [Received: 04/10/2024] [Accepted: 07/10/2024] [Indexed: 07/27/2024] Open
Abstract
In previous work, we used a SomaLogic platform targeting approximately 5000 proteins to generate a serum protein signature of centenarians that we validated in independent studies that used the same technology. We set out here to validate and possibly expand the results by profiling the serum proteome of a subset of individuals included in the original study using liquid chromatography tandem mass spectrometry (LC-MS/MS). Following pre-processing, the LC-MS/MS data provided quantification of 398 proteins, with only 266 proteins shared by both platforms. At a 1% FDR statistical significance threshold, the analysis of LC-MS/MS data detected 44 proteins associated with extreme old age, including 23 from the original analysis. To identify proteins for which associations between expression and extreme old age were conserved across platforms, we performed inter-study conservation testing of the 266 proteins quantified by both platforms using a method that accounts for the correlation between the results. From these tests, a total of 80 proteins reached 5% FDR statistical significance, and 26 of these proteins had a concordant pattern of gene expression in whole blood generated in an independent set. This signature of 80 proteins points to blood coagulation, IGF signaling, extracellular matrix (ECM) organization, and complement cascade as important pathways whose protein level changes provide evidence for age-related adjustments that distinguish centenarians from younger individuals. The comparison with blood transcriptomics also highlights a possible role for neutrophil degranulation in aging.
Affiliation(s)
- Eric R Reed
- Data Intensive Study Center, Tufts University, Boston, MA, USA
- Kevin B Chandler
- Center for Biomedical Mass Spectrometry, Department of Biochemistry and Cell Biology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Cellular and Molecular Medicine, Florida International University, Miami, FL, USA
- Prisma Lopez
- Institute for Clinical Research and Health Policy Studies, Tufts Medical Center, Boston, MA, USA
- Catherine E Costello
- Center for Biomedical Mass Spectrometry, Department of Biochemistry and Cell Biology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Stacy L Andersen
- Geriatric Section, Department of Medicine, Boston University Chobanian & Avedisian School of Medicine and Boston Medical Center, Boston, MA, USA
- Thomas T Perls
- Geriatric Section, Department of Medicine, Boston University Chobanian & Avedisian School of Medicine and Boston Medical Center, Boston, MA, USA
- Mengze Li
- Division of Computational Biomedicine, Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Harold Bae
- Biostatistics Program, College of Public Health and Human Sciences, Oregon State University, Corvallis, OR, USA
- Mette Soerensen
- Department of Public Health, University of Southern Denmark, Odense, Denmark
- Stefano Monti
- Division of Computational Biomedicine, Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Paola Sebastiani
- Data Intensive Study Center, Tufts University, Boston, MA, USA.
- Institute for Clinical Research and Health Policy Studies, Tufts Medical Center, Boston, MA, USA.
- Department of Medicine, School of Medicine, Tufts University, Boston, MA, USA.
4
Bramer LM, Nakayasu ES, Flores JE, Van Eyk JE, MacCoss MJ, Parikh HM, Metz TO, Webb-Robertson BJM. Data from a multi-year targeted proteomics study of a longitudinal birth cohort of type 1 diabetes. Sci Data 2025; 12:112. [PMID: 39833216 PMCID: PMC11747092 DOI: 10.1038/s41597-024-04249-1] [Received: 08/09/2024] [Accepted: 12/05/2024] [Indexed: 01/22/2025] Open
Abstract
The deployment of liquid chromatography-mass spectrometry-based plasma proteomics experiments in a large cohort is sparse, leading to a lack of data available for benchmarking, method development or validation. Comprising 6,426 plasma analyses, The Environmental Determinants of Diabetes in the Young (TEDDY) proteomics validation study constitutes one of the largest targeted proteomics experiments in the literature to date. The proteomics data from this study were generated over the course of 2.5 years from over 900 study subjects, each providing up to 29 longitudinal samples. The data also include 916 quality control (QC) samples. The targeted mass spectrometry assay comprised 694 peptides mapping to 167 proteins, and the panel was measured in each subject and QC sample. The targeted proteomic dataset presented here can be used as a resource for new computational methods development, such as for batch correction, as well as for benchmarking and comparing the performance of different methods/tools.
Grants
- R01 DK138335 NIDDK NIH HHS
- U01 KD127786-S1 U.S. Department of Health & Human Services | NIH | National Institute of Diabetes and Digestive and Kidney Diseases (National Institute of Diabetes & Digestive & Kidney Diseases)
- U01 DK127786 NIDDK NIH HHS
- U.S. Department of Health & Human Services | NIH | National Institute of Diabetes and Digestive and Kidney Diseases (National Institute of Diabetes & Digestive & Kidney Diseases)
- U.S. Department of Health & Human Services | NIH | Office of Extramural Research, National Institutes of Health (OER)
- National Institutes of Health: U01 DK63829, U01 DK63861, U01 DK63821, U01 DK63865, U01 DK63863, U01 DK63836, U01 DK63790, UC4 DK63829, UC4 DK63861, UC4 DK63821, UC4 DK63865, UC4 DK63863, UC4 DK63836, UC4 DK95300, UC4 DK100238, UC4 DK106955, UC4 DK112243, UC4 DK117483, U01 DK124166, U01 DK128847, and Contract No. HHSN267200700014C from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of Allergy and Infectious Diseases (NIAID), Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), National Institute of Environmental Health Sciences (NIEHS), Centers for Disease Control and Prevention (CDC), and Breakthrough T1D (formerly JDRF). This work is supported in part by the NIH/NCATS Clinical and Translational Science Awards to the University of Florida (UL1 TR000064) and the University of Colorado (UL1 TR002535).
Affiliation(s)
- Lisa M Bramer
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA.
- Ernesto S Nakayasu
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA
- Javier E Flores
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA
- Jennifer E Van Eyk
- Department of Cardiology, Advanced Clinical Biosystem Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Michael J MacCoss
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Hemang M Parikh
- Health Informatics Institute, University of South Florida, Tampa, FL, USA
- Thomas O Metz
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA
5
Moon H, Du J, Lei J, Roeder K. Augmented doubly robust post-imputation inference for proteomic data. bioRxiv 2025:2024.03.23.586387. [PMID: 39868108 PMCID: PMC11761724 DOI: 10.1101/2024.03.23.586387] [Indexed: 01/28/2025]
Abstract
Quantitative measurements produced by mass spectrometry proteomics experiments offer a direct way to explore the role of proteins in molecular mechanisms. However, analysis of such data is challenging due to the large proportion of missing values. A common strategy to address this issue is to utilize an imputed dataset, which often introduces systematic bias into downstream analyses if the imputation errors are ignored. In this paper, we propose a statistical framework inspired by doubly robust estimators that offers valid and efficient inference for proteomic data. Our framework combines powerful machine learning tools, such as variational autoencoders, to augment the imputation quality with high-dimensional peptide data, and a parametric model to estimate the propensity score for debiasing imputed outcomes. Our estimator is compatible with the double machine learning framework and has provable properties. Simulation studies verify its empirical superiority over other existing procedures. In application to both single-cell proteomic data and bulk-cell Alzheimer's Disease data, our method utilizes the imputed data to gain additional, meaningful discoveries and yet maintains good control of false positives.
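For orientation, the generic textbook form of a doubly robust (augmented inverse-probability-weighted) estimator of a mean with missing outcomes is sketched below. The notation is an assumption of this note and is not the paper's own estimator: $Y_i$ is the outcome, $R_i \in \{0, 1\}$ indicates whether $Y_i$ was observed, $\hat m(X_i)$ is the imputed value given covariates $X_i$, and $\hat\pi(X_i)$ is the estimated probability of being observed (the propensity score).

```latex
\hat\mu_{\mathrm{DR}}
  = \frac{1}{n} \sum_{i=1}^{n}
    \left[
      \hat m(X_i)
      + \frac{R_i}{\hat\pi(X_i)} \bigl( Y_i - \hat m(X_i) \bigr)
    \right]
```

The first term uses the imputed value for every sample; the second term corrects it with the reweighted imputation error on the samples that were actually observed. The estimator is consistent if either the imputation model or the propensity model is correctly specified, which is the sense in which it is "doubly robust".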
Affiliation(s)
- Haeun Moon
- Department of Statistics, Seoul National University
- Jin-Hong Du
- Department of Statistics and Data Science, Carnegie Mellon University
- Jing Lei
- Department of Statistics and Data Science, Carnegie Mellon University
- Kathryn Roeder
- Department of Statistics and Data Science, Carnegie Mellon University
6
Schumann Y, Gocke A, Neumann JE. Computational Methods for Data Integration and Imputation of Missing Values in Omics Datasets. Proteomics 2025; 25:e202400100. [PMID: 39740174 DOI: 10.1002/pmic.202400100] [Received: 06/28/2024] [Revised: 11/08/2024] [Accepted: 11/26/2024] [Indexed: 01/02/2025]
Abstract
Molecular profiling of different omic-modalities (e.g., DNA methylomics, transcriptomics, proteomics) in biological systems represents the basis for research and clinical decision-making. Measurement-specific biases, so-called batch effects, often hinder the integration of independently acquired datasets, and missing values further hamper the applicability of typical data processing algorithms. In addition to careful experimental design, well-defined standards in data acquisition and data exchange, the alleviation of these phenomena particularly requires a dedicated data integration and preprocessing pipeline. This review aims to give a comprehensive overview of computational methods for data integration and missing value imputation for omic data analyses. We provide formal definitions for missing value mechanisms and propose a novel statistical taxonomy for batch effects, especially in the presence of missing data. Based on an automated document search and systematic literature review, we describe 32 distinct data integration methods from five main methodological categories, as well as 37 algorithms for missing value imputation from five separate categories. Additionally, this review highlights multiple quantitative evaluation methods to aid researchers in selecting a suitable set of methods for their work. Finally, this work provides an integrated discussion of the relevance of batch effects and missing values in omics with corresponding method recommendations. We then propose a comprehensive three-step workflow from the study conception to final data analysis and deduce perspectives for future research. Eventually, we present a comprehensive flow chart as well as exemplary decision trees to aid practitioners in the selection of specific approaches for imputation and data integration in their studies.
Affiliation(s)
- Yannis Schumann
- IT-Department, Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany
- Antonia Gocke
- Center for Molecular Neurobiology (ZMNH), University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
- Core Facility Mass Spectrometric Proteomics, University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
- Julia E Neumann
- Center for Molecular Neurobiology (ZMNH), University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
- Institute of Neuropathology, University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
7
Niloofar P, Aghdam R, Eslahchi C. GAEM: Genetic Algorithm based Expectation-Maximization for inferring Gene Regulatory Networks from incomplete data. Comput Biol Med 2024; 183:109238. [PMID: 39426072 DOI: 10.1016/j.compbiomed.2024.109238] [Received: 05/16/2024] [Revised: 09/02/2024] [Accepted: 09/30/2024] [Indexed: 10/21/2024]
Abstract
In Bioinformatics, inferring the structure of a Gene Regulatory Network (GRN) from incomplete gene expression data is a difficult task. One popular method for inferring the structure of GRNs is to apply the Path Consistency Algorithm based on Conditional Mutual Information (PCA-CMI). Although PCA-CMI excels at extracting GRN skeletons, it struggles with missing values in datasets. As a result, applying PCA-CMI to infer GRNs necessitates a preprocessing method for data imputation. In this paper, we present the GAEM algorithm, which uses an iterative approach based on a combination of Genetic Algorithm and Expectation-Maximization to infer the structure of GRN from incomplete gene expression datasets. GAEM learns the GRN structure from the incomplete dataset via an algorithm that iteratively updates the imputed values based on the learnt GRN until the convergence criteria are met. We evaluate the performance of this algorithm under various missingness mechanisms (ignorable and nonignorable) and percentages (5%, 15%, and 40%). The traditional approach to handling missing values in gene expression datasets involves estimating them first and then constructing the GRN. However, our methodology differs in that both missing values and the GRN are updated iteratively until convergence. Results from the DREAM3 dataset demonstrate that the GAEM algorithm appears to be a more reliable method overall. Especially for smaller network sizes, GAEM outperforms methods where the incomplete dataset is imputed first, followed by learning the GRN structure from the imputed data. We have implemented the GAEM algorithm within the GAEM R package, which is accessible at the following GitHub repository: https://github.com/parniSDU/GAEM.
Affiliation(s)
- Parisa Niloofar
- Mærsk Mc-Kinney Møller Institute, University of Southern Denmark, Campusvej 55, Odense, 5230, Denmark.
- Rosa Aghdam
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, USA; School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Iran
- Changiz Eslahchi
- Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Iran; School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Iran
8
Harris L, Noble WS. Imputation of cancer proteomics data with a deep model that learns from many datasets. bioRxiv 2024:2024.08.26.609780. [PMID: 39253518 PMCID: PMC11383014 DOI: 10.1101/2024.08.26.609780] [Indexed: 09/11/2024]
Abstract
Missing values are a major challenge in the analysis of mass spectrometry proteomics data. Missing values hinder reproducibility, decrease statistical power for identifying differentially expressed (DE) proteins and make it challenging to analyze low-abundance proteins. We present Lupine, a deep learning-based method for imputing, or estimating, missing values in tandem mass tag (TMT) proteomics data. Lupine is, to our knowledge, the first imputation method that is designed to learn jointly from many datasets, and we provide evidence that this approach leads to more accurate predictions. We validated Lupine by applying it to TMT data from >1,000 cancer patient samples spanning ten cancer types from the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Lupine outperforms the state of the art for TMT imputation, identifies more DE proteins than other methods, corrects for TMT batch effects, and learns a meaningful representation of proteins and patient samples. Lupine is implemented as an open source Python package.
Affiliation(s)
- William S. Noble
- Department of Genome Sciences, University of Washington
- Paul G. Allen School of Computer Science and Engineering, University of Washington
9
Woo DU, Lee Y, Min CW, Kim ST, Kang YJ. RiceProteomeDB (RPDB): a user-friendly database for proteomics data storage, retrieval, and analysis. Sci Rep 2024; 14:3671. [PMID: 38351208 PMCID: PMC10864295 DOI: 10.1038/s41598-024-54151-4] [Received: 11/22/2023] [Accepted: 02/08/2024] [Indexed: 02/16/2024] Open
Abstract
Rice, feeding a significant portion of the world, poses unique proteomic challenges critical to agricultural research and global food security. The complexity of the rice proteome, influenced by various genetic and environmental factors, demands specialized analytical approaches for effective study. The central challenges in rice proteomics lie in developing custom methods suited to the unique aspects of rice biology. These include data preprocessing, method selection, and result validation, all of which are essential for advancing rice research. Our aim is to decode these proteomic intricacies to facilitate breakthroughs in strain improvement, disease resistance, and yield optimization, all vital for combating global food insecurity. To achieve this, we have created the RiceProteomeDB (RPDB), a React + Django database, offering a streamlined and comprehensive platform for the analysis of rice proteomics data. RiceProteomeDB (RPDB) simplifies proteomics data management and analysis. It offers features for data organization, preprocessing, method selection, result validation, and data sharing. Researchers can access processed rice proteomics data, conduct analyses, and explore experimental conditions. The user-friendly web interface enhances navigation and interaction. RPDB fosters collaboration by enabling data sharing and proper acknowledgment of sources, contributing to proteomics research and knowledge dissemination. Availability and implementation: Web application: http://riceproteome.plantprofile.net/ . The web application's source code, user's manual, and sample data: https://github.com/dongu7610/Riceproteome .
Affiliation(s)
- Dong U Woo
- Division of Bio & Medical Bigdata Department (BK4 Program), Gyeongsang National University, 501, Jinju-daero, Jinju-si, Gyeongsangnam-do, 52828, Republic of Korea
- Yejin Lee
- Division of Bio & Medical Bigdata Department (BK4 Program), Gyeongsang National University, 501, Jinju-daero, Jinju-si, Gyeongsangnam-do, 52828, Republic of Korea
- Cheol Woo Min
- Department of Plant Bioscience, Life and Industry Convergence Research Institute, Pusan National University, Milyang, 50463, Republic of Korea
- Sun Tae Kim
- Department of Plant Bioscience, Life and Industry Convergence Research Institute, Pusan National University, Milyang, 50463, Republic of Korea
- Yang Jae Kang
- Division of Bio & Medical Bigdata Department (BK4 Program), Gyeongsang National University, 501, Jinju-daero, Jinju-si, Gyeongsangnam-do, 52828, Republic of Korea.
- Division of Life Science Department, Gyeongsang National University, Jinju, 52828, Republic of Korea.
10
Pasternack H, Polzer M, Gemoll T, Kümpers C, Sauer T, Lazar-Karsten P, Hinrichs S, Bohnet S, Perner S, Dressler FF, Kirfel J. Proteomic analyses identify HK1 and ATP5A to be overexpressed in distant metastases of lung adenocarcinomas compared to matched primary tumors. Sci Rep 2023; 13:20948. [PMID: 38016997 PMCID: PMC10684588 DOI: 10.1038/s41598-023-47767-5] [Received: 06/30/2023] [Accepted: 11/17/2023] [Indexed: 11/30/2023] Open
Abstract
Lung cancer is the leading cause of cancer-related deaths worldwide with lung adenocarcinoma (LUAD) being the most common type. Genomic studies of LUAD have advanced our understanding of its tumor biology and accelerated targeted therapy. However, the proteomic characteristics of LUAD are still insufficiently explored. The prognosis for lung cancer patients is still mostly determined by the stage of disease at the time of diagnosis. Focusing on late-stage metastatic LUAD with poor prognosis, we compared the proteomic profiles of primary tumors and matched distant metastases to identify relevant and potentially druggable differences. We performed high-performance liquid chromatography (HPLC) and electrospray ionization tandem mass spectrometry (ESI-MS/MS) on a total of 38 FFPE (formalin-fixed and paraffin-embedded) samples. Using differential expression analysis and unsupervised clustering we identified several proteins that were differentially regulated in metastases compared to matched primary tumors. Selected proteins (HK1, ATP5A, SRI and ARHGDIB) were subjected to validation by immunoblotting. Thereby, significant differential expression could be confirmed for HK1 and ATP5A, both upregulated in metastases compared to matched primary tumors. Our findings give a better understanding of tumor progression and metastatic spreads in LUAD but also demonstrate considerable inter-individual heterogeneity on the proteomic level.
Affiliation(s)
- Helen Pasternack
- Institute of Pathology, University Hospital Schleswig-Holstein, Campus Luebeck, Luebeck, Germany
- Mirjam Polzer
- Institute of Pathology, University Hospital Schleswig-Holstein, Campus Luebeck, Luebeck, Germany
- Institute of Legal Medicine, University Hospital Münster, Münster, Germany
- Timo Gemoll
- Section for Translational Surgical Oncology and Biobanking, Department of Surgery, University Hospital Schleswig-Holstein, Campus Luebeck, Luebeck, Germany
- Christiane Kümpers
- Institute of Pathology, University Hospital Schleswig-Holstein, Campus Luebeck, Luebeck, Germany
- Thorben Sauer
- Section for Translational Surgical Oncology and Biobanking, Department of Surgery, University Hospital Schleswig-Holstein, Campus Luebeck, Luebeck, Germany
- Pamela Lazar-Karsten
- Institute of Pathology, University Hospital Schleswig-Holstein, Campus Luebeck, Luebeck, Germany
- Sofie Hinrichs
- Institute of Pathology, University Hospital Schleswig-Holstein, Campus Luebeck, Luebeck, Germany
- Sabine Bohnet
- Department of Pulmonology, University Hospital Schleswig-Holstein, Campus Luebeck, Luebeck, Germany
- Sven Perner
- Institute of Pathology, University Hospital Schleswig-Holstein, Campus Luebeck, Luebeck, Germany
- Pathology, Research Center Borstel, Leibniz Lung Center, Borstel, Germany
- Institute of Pathology and Hematopathology, Hamburg, Germany
- Franz Friedrich Dressler
- Institute of Pathology, University Hospital Schleswig-Holstein, Campus Luebeck, Luebeck, Germany
- Institute of Pathology, Charité -Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität Zu Berlin, and Berlin Institute of Health, Berlin, Germany
- Jutta Kirfel
- Institute of Pathology, University Hospital Schleswig-Holstein, Campus Luebeck, Luebeck, Germany.
11
Harris L, Fondrie WE, Oh S, Noble WS. Evaluating Proteomics Imputation Methods with Improved Criteria. J Proteome Res 2023; 22:3427-3438. [PMID: 37861703 PMCID: PMC10949645 DOI: 10.1021/acs.jproteome.3c00205] [Indexed: 10/21/2023]
Abstract
Quantitative measurements produced by tandem mass spectrometry proteomics experiments typically contain a large proportion of missing values. Missing values hinder reproducibility, reduce statistical power, and make it difficult to compare across samples or experiments. Although many methods exist for imputing missing values, in practice, the most commonly used methods are among the worst performing. Furthermore, previous benchmarking studies have focused on relatively simple measurements of error such as the mean-squared error between imputed and held-out values. Here we evaluate the performance of commonly used imputation methods using three practical, "downstream-centric" criteria. These criteria measure the ability to identify differentially expressed peptides, generate new quantitative peptides, and improve the peptide lower limit of quantification. Our evaluation comprises several experiment types and acquisition strategies, including data-dependent and data-independent acquisition. We find that imputation does not necessarily improve the ability to identify differentially expressed peptides but that it can identify new quantitative peptides and improve the peptide lower limit of quantification. We find that MissForest is generally the best performing method per our downstream-centric criteria. We also argue that existing imputation methods do not properly account for the variance of peptide quantifications and highlight the need for methods that do.
Affiliation(s)
- Lincoln Harris
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
- Sewoong Oh
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States
- William S Noble
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States
12
Abstract
Missing values are a notable challenge when analyzing mass spectrometry-based proteomics data. While the field is still actively debating the best practices, the challenge increased with the emergence of mass spectrometry-based single-cell proteomics and the dramatic increase in missing values. A popular approach to deal with missing values is to perform imputation. Imputation has several drawbacks for which alternatives exist, but currently, imputation is still a practical solution widely adopted in single-cell proteomics data analysis. This perspective discusses the advantages and drawbacks of imputation. We also highlight 5 main challenges linked to missing value management in single-cell proteomics. Future developments should aim to solve these challenges, whether it is through imputation or data modeling. The perspective concludes with recommendations for reporting missing values, for reporting methods that deal with missing values, and for proper encoding of missing values.
Affiliation(s)
- Christophe Vanderaa
- Computational Biology and Bioinformatics Unit (CBIO), de Duve Institute, UCLouvain, 1200 Brussels, Belgium
- Laurent Gatto
- Computational Biology and Bioinformatics Unit (CBIO), de Duve Institute, UCLouvain, 1200 Brussels, Belgium
13
Hediyeh-Zadeh S, Webb AI, Davis MJ. MsImpute: Estimation of Missing Peptide Intensity Data in Label-Free Quantitative Mass Spectrometry. Mol Cell Proteomics 2023; 22:100558. [PMID: 37105364 PMCID: PMC10368900 DOI: 10.1016/j.mcpro.2023.100558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 04/18/2023] [Accepted: 04/21/2023] [Indexed: 04/29/2023] Open
Abstract
Mass spectrometry (MS) enables high-throughput identification and quantification of proteins in complex biological samples and can provide insights into the global function of biological systems. Label-free quantification is cost-effective and suitable for the analysis of human samples. Despite rapid developments in label-free data acquisition workflows, the number of proteins quantified across samples can be limited by technical and biological variability. This variation can result in missing values which can in turn challenge downstream data analysis tasks. General purpose or gene expression-specific imputation algorithms are widely used to improve data completeness. Here, we propose an imputation algorithm designated for label-free MS data that is aware of the type of missingness affecting data. On published datasets acquired by data-dependent and data-independent acquisition workflows with variable degrees of biological complexity, we demonstrate that the proposed missing value estimation procedure by barycenter computation competes closely with the state-of-the-art imputation algorithms in differential abundance tasks while outperforming them in the accuracy of variance estimates of the peptide abundance measurements, and better controls the false discovery rate in label-free MS experiments. The barycenter estimation procedure is implemented in the msImpute software package and is available from the Bioconductor repository.
Affiliation(s)
- Soroor Hediyeh-Zadeh
- Bioinformatics Division, WEHI, Melbourne, Australia; Department of Medical Biology, University of Melbourne, Melbourne, Australia; Colonial Foundation Healthy Ageing Centre, WEHI, Melbourne, Australia
- Andrew I Webb
- Department of Medical Biology, University of Melbourne, Melbourne, Australia; Colonial Foundation Healthy Ageing Centre, WEHI, Melbourne, Australia; Advanced Technology and Biology Division, WEHI, Melbourne, Australia
- Melissa J Davis
- Bioinformatics Division, WEHI, Melbourne, Australia; Department of Medical Biology, University of Melbourne, Melbourne, Australia; Department of Clinical Pathology, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Melbourne, Australia; The Diamantina Institute, The University of Queensland, Brisbane, Australia; The South Australian Immunogenomics Cancer Institute, The University of Adelaide, Adelaide, Australia.
14
Kong W, Wong BJH, Hui HWH, Lim KP, Wang Y, Wong L, Goh WWB. ProJect: a powerful mixed-model missing value imputation method. Brief Bioinform 2023:bbad233. [PMID: 37419612 DOI: 10.1093/bib/bbad233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2023] [Revised: 05/24/2023] [Accepted: 06/05/2023] [Indexed: 07/09/2023] Open
Abstract
Missing values (MVs) can adversely impact data analysis and machine-learning model development. We propose a novel mixed-model method for missing value imputation (MVI). This method, ProJect (short for Protein inJection), is a powerful and meaningful improvement over existing MVI methods such as Bayesian principal component analysis (PCA), probabilistic PCA, local least squares and quantile regression imputation of left-censored data. We rigorously tested ProJect on various high-throughput data types, including genomics and mass spectrometry (MS)-based proteomics. Specifically, we utilized renal cancer (RC) data acquired using DIA-SWATH, ovarian cancer (OC) data acquired using DIA-MS, bladder (BladderBatch) and glioblastoma (GBM) microarray gene expression datasets. Our results demonstrate that ProJect consistently performs better than other referenced MVI methods. It achieves the lowest normalized root mean square error (on average, scoring 45.92% less error in RC_C, 27.37% in RC_full, 29.22% in OC, 23.65% in BladderBatch and 20.20% in GBM relative to the closest competing method) and the Procrustes sum of squared error (Procrustes SS) (exhibits 79.71% less error in RC_C, 38.36% in RC_full, 18.13% in OC, 74.74% in BladderBatch and 30.79% in GBM compared to the next best method). ProJect also leads with the highest correlation coefficient among all types of MV combinations (0.64% higher in RC_C, 0.24% in RC_full, 0.55% in OC, 0.39% in BladderBatch and 0.27% in GBM versus the second-best performing method). ProJect's key strength is its ability to handle different types of MVs commonly found in real-world data. Unlike most MVI methods that are designed to handle only one type of MV, ProJect employs a decision-making algorithm that first determines if an MV is missing at random or missing not at random. It then employs targeted imputation strategies for each MV type, resulting in more accurate and reliable imputation outcomes.
An R implementation of ProJect is available at https://github.com/miaomiao6606/ProJect.
Affiliation(s)
- Weijia Kong
- School of Biological Sciences, Nanyang Technological University, Singapore
- Department of Computer Science, National University of Singapore, Singapore
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
- Kai Peng Lim
- School of Biological Sciences, Nanyang Technological University, Singapore
- Yulan Wang
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
- Limsoon Wong
- Department of Computer Science, National University of Singapore, Singapore
- Wilson Wen Bin Goh
- School of Biological Sciences, Nanyang Technological University, Singapore
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
- Center for Biomedical Informatics, Nanyang Technological University, Singapore
15
Jones J, MacKrell EJ, Wang TY, Lomenick B, Roukes ML, Chou TF. Tidyproteomics: an open-source R package and data object for quantitative proteomics post analysis and visualization. BMC Bioinformatics 2023; 24:239. [PMID: 37280522 DOI: 10.1186/s12859-023-05360-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Accepted: 05/25/2023] [Indexed: 06/08/2023] Open
Abstract
BACKGROUND: The analysis of mass spectrometry-based quantitative proteomics data can be challenging given the variety of established analysis platforms, the differences in reporting formats, and a general lack of approachable standardized post-processing analyses such as sample group statistics, quantitative variation and even data filtering. We developed tidyproteomics to facilitate basic analysis, improve data interoperability and potentially ease the integration of new processing algorithms, mainly through the use of a simplified data-object. RESULTS: The R package tidyproteomics was developed as both a framework for standardizing quantitative proteomics data and a platform for analysis workflows, containing discrete functions that can be connected end-to-end, thus making it easier to define complex analyses by breaking them into small stepwise units. Additionally, as with any analysis workflow, choices made during analysis can have large impacts on the results and as such, tidyproteomics allows researchers to string each function together in any order, select from a variety of options and in some cases develop and incorporate custom algorithms. CONCLUSIONS: Tidyproteomics aims to simplify data exploration from multiple platforms, provide control over individual functions and analysis order, and serve as a tool to assemble complex repeatable processing workflows in a logical flow. Datasets in tidyproteomics are easy to work with, have a structure that allows for biological annotations to be added, and come with a framework for developing additional analysis tools. The consistent data structure and accessible analysis and plotting tools also offer a way for researchers to save time on mundane data manipulation tasks.
Affiliation(s)
- Jeff Jones
- Proteome Exploration Laboratory, Beckman Institute, California Institute of Technology, Pasadena, CA, 91125, USA
- Division of Physics, Mathematics and Astronomy, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA, 91125, USA
- Elliot J MacKrell
- Division of Chemistry and Chemical Engineering, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA, 91125, USA
- Ting-Yu Wang
- Proteome Exploration Laboratory, Beckman Institute, California Institute of Technology, Pasadena, CA, 91125, USA
- Brett Lomenick
- Proteome Exploration Laboratory, Beckman Institute, California Institute of Technology, Pasadena, CA, 91125, USA
- Michael L Roukes
- Division of Physics, Mathematics and Astronomy, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA, 91125, USA
- Tsui-Fen Chou
- Proteome Exploration Laboratory, Beckman Institute, California Institute of Technology, Pasadena, CA, 91125, USA
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
16
Gatto L, Aebersold R, Cox J, Demichev V, Derks J, Emmott E, Franks AM, Ivanov AR, Kelly RT, Khoury L, Leduc A, MacCoss MJ, Nemes P, Perlman DH, Petelski AA, Rose CM, Schoof EM, Van Eyk J, Vanderaa C, Yates JR, Slavov N. Initial recommendations for performing, benchmarking and reporting single-cell proteomics experiments. Nat Methods 2023; 20:375-386. [PMID: 36864200 PMCID: PMC10130941 DOI: 10.1038/s41592-023-01785-3] [Citation(s) in RCA: 82] [Impact Index Per Article: 41.0] [Received: 06/06/2022] [Accepted: 01/24/2023] [Indexed: 03/04/2023]
Abstract
Analyzing proteins from single cells by tandem mass spectrometry (MS) has recently become technically feasible. While such analysis has the potential to accurately quantify thousands of proteins across thousands of single cells, the accuracy and reproducibility of the results may be undermined by numerous factors affecting experimental design, sample preparation, data acquisition and data analysis. We expect that broadly accepted community guidelines and standardized metrics will enhance rigor, data quality and alignment between laboratories. Here we propose best practices, quality controls and data-reporting recommendations to assist in the broad adoption of reliable quantitative workflows for single-cell proteomics. Resources and discussion forums are available at https://single-cell.net/guidelines.
Affiliation(s)
- Laurent Gatto
- Computational Biology and Bioinformatics Unit, de Duve Institute, Université Catholique de Louvain, Brussels, Belgium
- Ruedi Aebersold
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
- Juergen Cox
- Max Planck Institute of Biochemistry, Martinsried, Germany
- Jason Derks
- Departments of Bioengineering, Biology, Chemistry and Chemical Biology, Single-Cell Proteomics Center and Barnett Institute, Northeastern University, Boston, MA, USA
- Edward Emmott
- Centre for Proteome Research, Department of Biochemistry and Systems Biology, University of Liverpool, Liverpool, UK
- Alexander M Franks
- Department of Statistics and Applied Probability, University of California Santa Barbara, Santa Barbara, CA, USA
- Alexander R Ivanov
- Department of Chemistry and Chemical Biology, Barnett Institute of Chemical and Biological Analysis, Northeastern University, Boston, MA, USA
- Ryan T Kelly
- Department of Chemistry and Biochemistry, Brigham Young University, Provo, UT, USA
- Luke Khoury
- Departments of Bioengineering, Biology, Chemistry and Chemical Biology, Single-Cell Proteomics Center and Barnett Institute, Northeastern University, Boston, MA, USA
- Andrew Leduc
- Departments of Bioengineering, Biology, Chemistry and Chemical Biology, Single-Cell Proteomics Center and Barnett Institute, Northeastern University, Boston, MA, USA
- Peter Nemes
- Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, USA
- David H Perlman
- Merck Exploratory Science Center, Merck Sharp & Dohme Corp., Cambridge, MA, USA
- Aleksandra A Petelski
- Departments of Bioengineering, Biology, Chemistry and Chemical Biology, Single-Cell Proteomics Center and Barnett Institute, Northeastern University, Boston, MA, USA
- Parallel Squared Technology Institute, Watertown, MA, USA
- Christopher M Rose
- Department of Microchemistry, Proteomics and Lipidomics, Genentech Inc., South San Francisco, CA, USA
- Erwin M Schoof
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Lyngby, Denmark
- Christophe Vanderaa
- Computational Biology and Bioinformatics Unit, de Duve Institute, Université Catholique de Louvain, Brussels, Belgium
- John R Yates
- Departments of Molecular Medicine and Neurobiology, the Scripps Research Institute, La Jolla, CA, USA
- Nikolai Slavov
- Departments of Bioengineering, Biology, Chemistry and Chemical Biology, Single-Cell Proteomics Center and Barnett Institute, Northeastern University, Boston, MA, USA
- Parallel Squared Technology Institute, Watertown, MA, USA
17
Flores JE, Claborne DM, Weller ZD, Webb-Robertson BJM, Waters KM, Bramer LM. Missing data in multi-omics integration: Recent advances through artificial intelligence. Front Artif Intell 2023; 6:1098308. [PMID: 36844425 PMCID: PMC9949722 DOI: 10.3389/frai.2023.1098308] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 11/14/2022] [Accepted: 01/23/2023] [Indexed: 02/11/2023] Open
Abstract
Biological systems function through complex interactions between various 'omics (biomolecules), and a more complete understanding of these systems is only possible through an integrated, multi-omic perspective. This has presented the need for the development of integration approaches that are able to capture the complex, often non-linear, interactions that define these biological systems and are adapted to the challenges of combining the heterogeneous data across 'omic views. A principal challenge to multi-omic integration is missing data because not all biomolecules are measured in all samples. Due to cost, instrument sensitivity, or other experimental factors, data for a biological sample may be missing for one or more 'omic technologies. Recent methodological developments in artificial intelligence and statistical learning have greatly facilitated the analyses of multi-omics data; however, many of these techniques assume access to completely observed data. A subset of these methods incorporate mechanisms for handling partially observed samples, and these methods are the focus of this review. We describe recently developed approaches, noting their primary use cases and highlighting each method's approach to handling missing data. We additionally provide an overview of the more traditional missing data workflows and their limitations, and we discuss potential avenues for further developments as well as how the missing data issue and its current solutions may generalize beyond the multi-omics context.
Affiliation(s)
- Javier E. Flores
- Pacific Northwest National Laboratory, Biological Sciences Division, Earth and Biological Sciences Directorate, Richland, WA, United States
- Daniel M. Claborne
- Pacific Northwest National Laboratory, Artificial Intelligence and Data Analytics Division, National Security Directorate, Richland, WA, United States
- Zachary D. Weller
- Pacific Northwest National Laboratory, Artificial Intelligence and Data Analytics Division, National Security Directorate, Richland, WA, United States
- Bobbie-Jo M. Webb-Robertson
- Pacific Northwest National Laboratory, Biological Sciences Division, Earth and Biological Sciences Directorate, Richland, WA, United States
- Katrina M. Waters
- Pacific Northwest National Laboratory, Biological Sciences Division, Earth and Biological Sciences Directorate, Richland, WA, United States
- Lisa M. Bramer
- Pacific Northwest National Laboratory, Biological Sciences Division, Earth and Biological Sciences Directorate, Richland, WA, United States
18
Chen Y, Lonergan S, Lim KS, Cheng J, Putz AM, Dyck MK, Canada P, Fortin F, Harding JCS, Plastow GS, Dekkers JCM. Plasma protein levels of young healthy pigs as indicators of disease resilience. J Anim Sci 2023; 101:6987177. [PMID: 36638126 PMCID: PMC9977353 DOI: 10.1093/jas/skad014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/16/2022] [Accepted: 01/11/2023] [Indexed: 01/14/2023] Open
Abstract
Selection for disease resilience, which refers to the ability of an animal to maintain performance when exposed to disease, can reduce the impact of infectious diseases. However, direct selection for disease resilience is challenging because nucleus herds must maintain a high health status. A possible solution is indirect selection of indicators of disease resilience. To search for such indicators, we conducted phenotypic and genetic quantitative analyses of the abundances of 377 proteins in plasma samples from 912 young and visually healthy pigs and their relationships with performance and subsequent disease resilience after natural exposure to a polymicrobial disease challenge. Abundances of 100 proteins were significantly heritable (false discovery rate (FDR) <0.10). The abundance of some proteins was or tended to be genetically correlated (rg) with disease resilience, including complement system proteins (rg = -0.24, FDR = 0.001) and IgG heavy chain proteins (rg = -0.68, FDR = 0.22). Gene set enrichment analyses (FDR < 0.2) based on phenotypic and genetic associations of protein abundances with subsequent disease resilience revealed many pathways related to the immune system that were unfavorably associated with subsequent disease resilience, especially the innate immune system. It was not possible to determine whether the observed levels of these proteins reflected baseline levels in these young and visually healthy pigs or were the result of a response to environmental disturbances that the pigs were exposed to before sample collection. Nevertheless, results show that, under these conditions, the abundance of proteins in some immune-related pathways can be used as phenotypic and genetic predictors of disease resilience and have the potential for use in pig breeding and management.
Affiliation(s)
- Yulu Chen
- Department of Animal Science, Iowa State University, Ames, IA, USA
- Steven Lonergan
- Department of Animal Science, Iowa State University, Ames, IA, USA
- Kyu-Sang Lim
- Department of Animal Science, Iowa State University, Ames, IA, USA; Department of Animal Resources Science, Kongju National University, Yesan, Republic of Korea
- Jian Cheng
- Department of Animal Science, Iowa State University, Ames, IA, USA
- Austin M Putz
- Department of Animal Science, Iowa State University, Ames, IA, USA; Hendrix Genetics, Swine Business Unit, Boxmeer, The Netherlands
- Michael K Dyck
- Department of Agriculture, Food and Nutritional Science, University of Alberta, Edmonton, AB, Canada
- PigGen Canada
- PigGen Canada Research Consortium, Guelph, Ontario, Canada
- Frederic Fortin
- Centre de Développement du Porc du Québec Inc., Québec City, Canada
- John C S Harding
- Department of Large Animal Clinical Science, University of Saskatchewan, Saskatoon, SK, Canada
- Graham S Plastow
- Department of Agriculture, Food and Nutritional Science, University of Alberta, Edmonton, AB, Canada
19
Vanderaa C, Gatto L. The Current State of Single-Cell Proteomics Data Analysis. Curr Protoc 2023; 3:e658. [PMID: 36633424 DOI: 10.1002/cpz1.658] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/13/2023]
Abstract
Sound data analysis is essential to retrieve meaningful biological information from single-cell proteomics experiments. This analysis is carried out by computational methods that are assembled into workflows, and their implementations influence the conclusions that can be drawn from the data. In this work, we explore and compare the computational workflows that have been used over the last four years and identify a profound lack of consensus on how to analyze single-cell proteomics data. We highlight the need for benchmarking of computational workflows and standardization of computational tools and data, as well as carefully designed experiments. Finally, we cover the current standardization efforts that aim to fill the gap, list the remaining missing pieces, and conclude with lessons learned from the replication of published single-cell proteomics analyses. © 2023 Wiley Periodicals LLC.
Affiliation(s)
- Christophe Vanderaa
- Computational Biology and Bioinformatics Unit (CBIO), de Duve Institute, Université catholique de Louvain, Belgium
- Laurent Gatto
- Computational Biology and Bioinformatics Unit (CBIO), de Duve Institute, Université catholique de Louvain, Belgium
20
Kong W, Hui HWH, Peng H, Goh WWB. Dealing with missing values in proteomics data. Proteomics 2022; 22:e2200092. [PMID: 36349819 DOI: 10.1002/pmic.202200092] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 07/14/2022] [Revised: 09/15/2022] [Accepted: 10/11/2022] [Indexed: 11/10/2022]
Abstract
Proteomics data are often plagued with missingness issues. These missing values (MVs) threaten the integrity of subsequent statistical analyses by reduction of statistical power, introduction of bias, and failure to represent the true sample. Over the years, several categories of missing value imputation (MVI) methods have been developed and adapted for proteomics data. These MVI methods perform their tasks based on different prior assumptions (e.g., data is normally or independently distributed) and operating principles (e.g., the algorithm is built to address random missingness only), resulting in varying levels of performance even when dealing with the same dataset. Thus, to achieve a satisfactory outcome, a suitable MVI method must be selected. To guide decision making on a suitable MVI method, we provide a decision chart which facilitates strategic considerations on datasets presenting different characteristics. We also bring attention to other issues that can impact proper MVI such as the presence of confounders (e.g., batch effects) which can influence MVI performance. Thus, these too, should be considered during or before MVI.
Affiliation(s)
- Weijia Kong
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Harvard Wai Hann Hui
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Hui Peng
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Wilson Wen Bin Goh
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore; Centre for Biomedical Informatics, Nanyang Technological University, Singapore, Singapore
21
Smith TS, Andrejeva A, Christopher J, Crook OM, Elzek M, Lilley KS. Prior Signal Acquisition Software Versions for Orbitrap Underestimate Low Isobaric Mass Tag Intensities, Without Detriment to Differential Abundance Experiments. ACS Measurement Science Au 2022; 2:233-240. [PMID: 35726249 PMCID: PMC9204819 DOI: 10.1021/acsmeasuresciau.1c00053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2021] [Revised: 01/24/2022] [Accepted: 01/25/2022] [Indexed: 06/15/2023]
Abstract
Tandem mass tags (TMTs) enable simple and accurate quantitative proteomics for multiplexed samples by relative quantification of tag reporter ions. Orbitrap quantification of reporter ions has been associated with a characteristic notch region in intensity distribution, within which few reporter intensities are recorded. This has been resolved in version 3 of the instrument acquisition software Tune. However, 47% of Orbitrap Fusion, Lumos, or Eclipse submissions to PRIDE were generated using prior software versions. To quantify the impact of the notch on existing quantitative proteomics data, we generated a mixed species benchmark and acquired quantitative data using Tune versions 2 and 3. Intensities below the notch are predominantly underestimated with Tune version 2, leading to overestimation of the true differences in intensities between samples. However, when summarizing reporter ion intensities to higher-level features, such as peptides and proteins, few features are significantly affected. Targeted removal of spectra with reporter ion intensities below the notch is not beneficial for differential peptide or protein testing. Overall, we find that the systematic quantification bias associated with the notch is not detrimental for a typical proteomics experiment.
Affiliation(s)
- Tom S. Smith
- MRC Toxicology Unit, University of Cambridge, Cambridge CB2 1QR, U.K.
- Anna Andrejeva
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1QW, U.K.
- Josie Christopher
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1QW, U.K.
- Oliver M. Crook
- Department of Statistics, University of Oxford, Oxford OX1 3LB, U.K.
- Mohamed Elzek
- MRC Toxicology Unit, University of Cambridge, Cambridge CB2 1QR, U.K.
- Kathryn S. Lilley
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1QW, U.K.
22
De La Toba EA, Bell SE, Romanova EV, Sweedler JV. Mass Spectrometry Measurements of Neuropeptides: From Identification to Quantitation. Annual Review of Analytical Chemistry (Palo Alto, Calif.) 2022; 15:83-106. [PMID: 35324254 DOI: 10.1146/annurev-anchem-061020-022048] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/14/2023]
Abstract
Neuropeptides (NPs), a unique class of neuronal signaling molecules, participate in a variety of physiological processes and diseases. Quantitative measurements of NPs provide valuable information regarding how these molecules are differentially regulated in a multitude of neurological, metabolic, and mental disorders. Mass spectrometry (MS) has evolved to become a powerful technique for measuring trace levels of NPs in complex biological tissues and individual cells using both targeted and exploratory approaches. There are inherent challenges to measuring NPs, including their wide endogenous concentration range, transport and postmortem degradation, complex sample matrices, and statistical processing of MS data required for accurate NP quantitation. This review highlights techniques developed to address these challenges and presents an overview of quantitative MS-based measurement approaches for NPs, including the incorporation of separation methods for high-throughput analysis, MS imaging for spatial measurements, and methods for NP quantitation in single neurons.
Affiliation(s)
- Eduardo A De La Toba
- Department of Chemistry, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Beckman Institute for Advanced Science and Technology, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Sara E Bell
- Department of Chemistry, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Beckman Institute for Advanced Science and Technology, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Elena V Romanova
- Department of Chemistry, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Beckman Institute for Advanced Science and Technology, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Jonathan V Sweedler
- Department of Chemistry, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Beckman Institute for Advanced Science and Technology, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
23
Hamood F, Bayer FP, Wilhelm M, Kuster B, The M. SIMSI-Transfer: Software-assisted reduction of missing values in phosphoproteomic and proteomic isobaric labeling data using tandem mass spectrum clustering. Mol Cell Proteomics 2022; 21:100238. [PMID: 35462064 PMCID: PMC9389303 DOI: 10.1016/j.mcpro.2022.100238] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/20/2021] [Revised: 03/18/2022] [Accepted: 03/27/2022] [Indexed: 12/11/2022] Open
Abstract
Isobaric stable isotope labeling techniques such as tandem mass tags (TMTs) have become popular in proteomics because they enable the relative quantification of proteins with high precision from up to 18 samples in a single experiment. While missing values in peptide quantification are rare in a single TMT experiment, they rapidly increase when combining multiple TMT experiments. As the field moves toward analyzing ever higher numbers of samples, tools that reduce missing values also become more important for analyzing TMT datasets. To this end, we developed SIMSI-Transfer (Similarity-based Isobaric Mass Spectra 2 [MS2] Identification Transfer), a software tool that extends our previously developed software MaRaCluster (© Matthew The) by clustering similar tandem MS2 from multiple TMT experiments. SIMSI-Transfer is based on the assumption that similarity-clustered MS2 spectra represent the same peptide. Therefore, peptide identifications made by database searching in one TMT batch can be transferred to another TMT batch in which the same peptide was fragmented but not identified. To assess the validity of this approach, we tested SIMSI-Transfer on masked search engine identification results and recovered >80% of the masked identifications while controlling errors in the transfer procedure to below 1% false discovery rate. Applying SIMSI-Transfer to six published full proteome and phosphoproteome datasets from the Clinical Proteomic Tumor Analysis Consortium led to an increase of 26 to 45% of identified MS2 spectra with TMT quantifications. This significantly decreased the number of missing values across batches and, in turn, increased the number of peptides and proteins identified in all TMT batches by 43 to 56% and 13 to 16%, respectively.
- Spectrum clustering enables peptide identification transfer between LC–MS/MS runs.
- The SIMSI pipeline supports processing full proteome and phosphoproteome data.
- SIMSI increases the number of quantifiable PSMs by 26 to 45%.
- SIMSI reduces missing values in multibatch TMT labeling experiments by up to 21%.
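The transfer step described in this abstract can be illustrated with a small sketch. This is not the SIMSI-Transfer implementation: it assumes each spectrum already carries a cluster ID (produced by MaRaCluster in the paper), and it transfers an identification only when all identified spectra in a cluster agree on one peptide. The `Spectrum` record, its field names, and the unanimity rule are hypothetical choices made for illustration; the published tool controls transfer errors with a false discovery rate procedure that is not reproduced here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Spectrum:
    """Hypothetical record for one MS2 spectrum."""
    scan_id: str
    batch: str
    cluster_id: int
    peptide: Optional[str] = None  # None means no database-search identification


def transfer_identifications(spectra):
    """Return {scan_id: peptide} for spectra that gain an identification."""
    # Collect the distinct peptides identified within each cluster.
    peptides_by_cluster = defaultdict(set)
    for s in spectra:
        if s.peptide is not None:
            peptides_by_cluster[s.cluster_id].add(s.peptide)

    transferred = {}
    for s in spectra:
        candidates = peptides_by_cluster.get(s.cluster_id, set())
        # Assumed rule: transfer only from clusters with exactly one peptide.
        if s.peptide is None and len(candidates) == 1:
            transferred[s.scan_id] = next(iter(candidates))
    return transferred
```

In this sketch, clusters whose identified spectra disagree, and clusters with no identified spectrum at all, are left untouched, which keeps the example conservative at the cost of fewer transfers.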
Affiliation(s)
- Firas Hamood
  - Chair of Proteomics and Bioanalytics, Technical University of Munich, Freising, Germany
- Florian P Bayer
  - Chair of Proteomics and Bioanalytics, Technical University of Munich, Freising, Germany
- Mathias Wilhelm
  - Chair of Proteomics and Bioanalytics, Technical University of Munich, Freising, Germany
- Bernhard Kuster
  - Chair of Proteomics and Bioanalytics, Technical University of Munich, Freising, Germany
- Matthew The
  - Chair of Proteomics and Bioanalytics, Technical University of Munich, Freising, Germany
|
24
|
Plubell DL, Käll L, Webb-Robertson BJM, Bramer LM, Ives A, Kelleher NL, Smith LM, Montine TJ, Wu CC, MacCoss MJ. Putting Humpty Dumpty Back Together Again: What Does Protein Quantification Mean in Bottom-Up Proteomics? J Proteome Res 2022; 21:891-898. [PMID: 35220718 PMCID: PMC8976764 DOI: 10.1021/acs.jproteome.1c00894] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Indexed: 01/16/2023]
Abstract
Bottom-up proteomics provides peptide measurements and has been invaluable for moving proteomics into large-scale analyses. Commonly, a single quantitative value is reported for each protein-coding gene by aggregating peptide quantities into protein groups following protein inference or parsimony. However, given the complexity of both RNA splicing and post-translational protein modification, it is overly simplistic to assume that all peptides that map to a singular protein-coding gene will demonstrate the same quantitative response. By assuming that all peptides from a protein-coding sequence are representative of the same protein, we may miss the discovery of important biological differences. To capture the contributions of existing proteoforms, we need to reconsider the practice of aggregating protein values to a single quantity per protein-coding gene.
Affiliation(s)
- Deanna L. Plubell
  - Department of Genome Sciences, University of Washington, Seattle, WA, 98195 USA
- Lukas Käll
  - Science for Life Laboratory, KTH - Royal Institute of Technology, Box 1031, 17121, Solna, Sweden
- Lisa M. Bramer
  - Pacific Northwest National Laboratory, Richland, WA 99352
- Ashley Ives
  - Proteomics Center of Excellence & Departments of Chemistry and Molecular Biosciences, Northwestern University, Evanston, IL 60208
- Neil L. Kelleher
  - Proteomics Center of Excellence & Departments of Chemistry and Molecular Biosciences, Northwestern University, Evanston, IL 60208
- Lloyd M. Smith
  - Department of Chemistry, University of Wisconsin-Madison, Madison, WI, 53706
- Christine C. Wu
  - Department of Genome Sciences, University of Washington, Seattle, WA, 98195 USA
- Michael J. MacCoss
  - Department of Genome Sciences, University of Washington, Seattle, WA, 98195 USA
|
25
|
Vanderaa C, Gatto L. Replication of single-cell proteomics data reveals important computational challenges. Expert Rev Proteomics 2021; 18:835-843. [PMID: 34602016 DOI: 10.1080/14789450.2021.1988571] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 01/03/2023]
Abstract
INTRODUCTION: Mass spectrometry-based proteomics is actively embracing quantitative, single-cell level analyses. Indeed, recent advances in sample preparation and mass spectrometry (MS) have enabled the emergence of quantitative MS-based single-cell proteomics (SCP). While exciting and promising, SCP still has many rough edges. The current analysis workflows are custom and built from scratch. The field is therefore craving for standardized software that promotes principled and reproducible SCP data analyses.
AREAS COVERED: This special report is the first step toward the formalization and standardization of SCP data analysis. scp, the software that accompanies this work, successfully replicates one of the landmark SCP studies and is applicable to other experiments and designs. We created a repository containing the replicated workflow with comprehensive documentation in order to favor further dissemination and improvements of SCP data analyses.
EXPERT OPINION: Replicating SCP data analyses uncovers important challenges in SCP data analysis. We describe two such challenges in detail: batch correction and data missingness. We provide the current state-of-the-art and illustrate the associated limitations. We also highlight the intimate dependence that exists between batch effects and data missingness and offer avenues for dealing with these exciting challenges.
Affiliation(s)
- Christophe Vanderaa
  - Computational Biology and Bioinformatics Unit (CBIO), de Duve Institute, UCLouvain, Belgium
- Laurent Gatto
  - Computational Biology and Bioinformatics Unit (CBIO), de Duve Institute, UCLouvain, Belgium
|
26
|
Egert J, Brombacher E, Warscheid B, Kreutz C. DIMA: Data-Driven Selection of an Imputation Algorithm. J Proteome Res 2021; 20:3489-3496. [PMID: 34062065 DOI: 10.1021/acs.jproteome.1c00119] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/28/2022]
Abstract
Imputation is a prominent strategy when dealing with missing values (MVs) in proteomics data analysis pipelines. However, the performance of different imputation methods is difficult to assess and varies strongly depending on data characteristics. To overcome this issue, we present the concept of a data-driven selection of an imputation algorithm (DIMA). The performance and broad applicability of DIMA are demonstrated on 142 quantitative proteomics data sets from the PRoteomics IDEntifications (PRIDE) database and on simulated data consisting of 5-50% MVs with different proportions of missing not at random and missing completely at random values. DIMA reliably suggests a high-performing imputation algorithm, which is always among the three best algorithms and results in a root mean square error difference (ΔRMSE) ≤ 10% in 80% of the cases. DIMA implementation is available in MATLAB at github.com/kreutz-lab/OmicsData and in R at github.com/kreutz-lab/DIMAR.
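The general idea of choosing an imputation method from the data itself can be sketched as a mask-and-score loop: hide some observed values, impute them with each candidate, and rank the candidates by RMSE against the hidden truth. The code below is a minimal illustration under stated assumptions and is not the DIMA implementation: the two candidate imputers, the function names, and the uniformly random masking are hypothetical simplifications, and the sketch does not model the missing-not-at-random patterns mentioned in the abstract.

```python
import numpy as np


def impute_row_mean(x):
    """Fill NaNs with each row's observed mean (global mean if a row is empty)."""
    obs = ~np.isnan(x)
    counts = obs.sum(axis=1, keepdims=True)
    sums = np.where(obs, x, 0.0).sum(axis=1, keepdims=True)
    global_mean = sums.sum() / max(int(counts.sum()), 1)
    fill = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
    return np.where(obs, x, fill)


def impute_global_min(x):
    """Fill NaNs with the smallest observed value in the matrix."""
    return np.where(np.isnan(x), np.nanmin(x), x)


def select_imputer(x, imputers, mask_fraction=0.1, seed=0):
    """Mask observed entries at random, impute them, and rank methods by RMSE."""
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(x))
    n_mask = max(1, int(mask_fraction * len(observed)))
    picked = observed[rng.choice(len(observed), size=n_mask, replace=False)]
    rows, cols = picked[:, 0], picked[:, 1]

    truth = x[rows, cols]
    masked = x.copy()
    masked[rows, cols] = np.nan  # hide known values to create a ground truth

    scores = {}
    for name, impute in imputers.items():
        estimate = impute(masked)[rows, cols]
        scores[name] = float(np.sqrt(np.mean((estimate - truth) ** 2)))
    best = min(scores, key=scores.get)
    return best, scores
```

Calling `select_imputer(matrix, {"row_mean": impute_row_mean, "global_min": impute_global_min})` on a float matrix with `NaN` for missing entries returns the name of the lowest-RMSE candidate and the per-method scores; a realistic comparison would mask values according to the missingness pattern observed in the data rather than uniformly.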
Affiliation(s)
- Janine Egert
  - Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany
  - Centre for Integrative Biological Signalling Studies (CIBSS), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany
- Eva Brombacher
  - Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany
  - Centre for Integrative Biological Signalling Studies (CIBSS), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany
  - Spemann Graduate School of Biology and Medicine (SGBM), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany
  - Faculty of Biology, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
- Bettina Warscheid
  - Biochemistry and Functional Proteomics, Institute of Biology II, Faculty of Biology, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
  - Signalling Research Centres BIOSS and CIBSS, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
- Clemens Kreutz
  - Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany
  - Signalling Research Centres BIOSS and CIBSS, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
  - Center for Data Analysis and Modeling (FDM), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
|
27
|
Hutchinson-Bunch C, Sanford JA, Hansen JR, Gritsenko MA, Rodland KD, Piehowski PD, Qian WJ, Adkins JN. Assessment of TMT Labeling Efficiency in Large-Scale Quantitative Proteomics: The Critical Effect of Sample pH. ACS Omega 2021; 6:12660-12666. [PMID: 34056417 PMCID: PMC8154127 DOI: 10.1021/acsomega.1c00776] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/10/2021] [Accepted: 04/26/2021] [Indexed: 06/12/2023]
Abstract
Isobaric labeling via tandem mass tag (TMT) reagents enables sample multiplexing prior to LC-MS/MS, facilitating high-throughput large-scale quantitative proteomics. Consistent and efficient labeling reactions are essential to achieve robust quantification; therefore, embedded in our clinical proteomic protocol is a quality control (QC) sample that contains a small aliquot from each sample within a TMT set, referred to as "Mixing QC." This Mixing QC enables the detection of TMT labeling issues by LC-MS/MS before combining the full samples to allow for salvaging of poor TMT labeling reactions. While TMT labeling is a valuable tool, factors leading to poor reactions are not fully studied. We observed that relabeling does not necessarily rescue TMT reactions and that peptide samples sometimes remained acidic after resuspending in 50 mM HEPES buffer (pH 8.5), which coincided with low labeling efficiency (LE) and relatively low median reporter ion intensities (MRIIs). To obtain a more resilient TMT labeling procedure, we investigated LE, reporter ion missingness, the ratio of mean TMT set MRII to individual channel MRII, and the distribution of log2 reporter ion ratios of Mixing QC samples. We discovered that sample pH is a critical factor in LE, and increasing the buffer concentration in poorly labeled samples before relabeling resulted in the successful rescue of TMT labeling reactions. Moreover, resuspending peptides in 500 mM HEPES buffer for TMT labeling resulted in consistently higher LE and less missing data. By better controlling the sample pH for labeling and implementing multiple methods for assessing labeling quality before combining samples, we demonstrate that robust TMT labeling for large-scale quantitative studies is achievable.
Affiliation(s)
- Chelsea Hutchinson-Bunch
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- James A. Sanford
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Joshua R. Hansen
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Marina A. Gritsenko
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Karin D. Rodland
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Paul D. Piehowski
  - Environmental Molecular Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Wei-Jun Qian
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Joshua N. Adkins
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
|
28
|
Arioli A, Dagliati A, Geary B, Peek N, Kalra PA, Whetton AD, Geifman N. OptiMissP: A dashboard to assess missingness in proteomic data-independent acquisition mass spectrometry. PLoS One 2021; 16:e0249771. [PMID: 33857200 PMCID: PMC8049317 DOI: 10.1371/journal.pone.0249771] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 10/09/2020] [Accepted: 03/24/2021] [Indexed: 11/24/2022] Open
Abstract
Background: Missing values are a key issue in the statistical analysis of proteomic data. Defining the strategy to address missing values is a complex task in each study, potentially affecting the quality of statistical analyses.
Results: We have developed OptiMissP, a dashboard to visually and qualitatively evaluate missingness and guide decision making in the handling of missing values in proteomics studies that use data-independent acquisition mass spectrometry. It provides a set of visual tools to retrieve information about missingness through protein densities and topology-based approaches, and facilitates exploration of different imputation methods and missingness thresholds.
Conclusions: OptiMissP provides support for researchers’ and clinicians’ qualitative assessment of missingness in proteomic datasets in order to define study-specific strategies for the handling of missing values. OptiMissP considers biases in protein distributions related to the choice of imputation method and helps analysts to balance the information loss caused by low missingness thresholds and the noise introduced by selecting high missingness thresholds. This is complemented by topological data analysis which provides additional insight to the structure of the data and their missingness. We use an example in Chronic Kidney Disease to illustrate the main functionalities of OptiMissP.
Affiliation(s)
- Angelica Arioli
  - Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
- Arianna Dagliati
  - Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
  - Division of Informatics, Imaging, and Data Science, School of Health Sciences, The University of Manchester, Manchester, United Kingdom
- Bethany Geary
  - Division of Cancer Sciences, Stoller Biomarker Discovery Centre, Manchester, United Kingdom
- Niels Peek
  - Division of Informatics, Imaging, and Data Science, School of Health Sciences, The University of Manchester, Manchester, United Kingdom
  - NIHR Manchester Biomedical Research Centre, Manchester Academic Health Science Centre, The University of Manchester, Manchester, United Kingdom
- Anthony D. Whetton
  - Division of Cancer Sciences, Stoller Biomarker Discovery Centre, Manchester, United Kingdom
  - NIHR Manchester Biomedical Research Centre, Manchester Academic Health Science Centre, The University of Manchester, Manchester, United Kingdom
  - School of Medical Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, The University of Manchester, Manchester, United Kingdom
- Nophar Geifman
  - Division of Informatics, Imaging, and Data Science, School of Health Sciences, The University of Manchester, Manchester, United Kingdom
|
29
|
Griss J, Schwämmle V. Analysis of Label-Based Quantitative Proteomics Data Using IsoProt. Methods Mol Biol 2021; 2361:61-73. [PMID: 34236655 DOI: 10.1007/978-1-0716-1641-3_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Isobaric labeling has become an essential method for quantitative mass spectrometry-based experiments. This technique allows high-throughput proteomics while providing reasonable coverage of protein measurements across multiple samples. Here, the analysis of isobarically labeled mass spectrometry data with a special focus on quality control and potential pitfalls is discussed. The protocol is based on our fully integrated IsoProt workflow. The concepts discussed are nevertheless applicable to the analysis of any isobarically labeled experiment using alternative computational tools and algorithms.
Affiliation(s)
- Johannes Griss
  - Department of Dermatology, Medical University of Vienna, Vienna, Austria
- Veit Schwämmle
  - Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense, Denmark
|