1. Wei Q, Li J, Ma J, He QY, Zhang G. DeepMS: super-fast peptide identification using end-to-end deep learning method. J Mol Biol 2025:169237. [PMID: 40449612 DOI: 10.1016/j.jmb.2025.169237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2024] [Revised: 05/05/2025] [Accepted: 05/26/2025] [Indexed: 06/03/2025]
Abstract
Mass spectrometry (MS) has emerged as a powerful omics analysis technique, particularly in proteomics, where the initial step involves identifying MS spectra as peptide sequences. However, this process often requires substantial computational resources and expertise, taking hours or even days to complete, thereby limiting the widespread adoption of MS-based omics technologies. To overcome this challenge, we have developed DeepMS, a deep learning-based spectra identification algorithm that overcomes the speed limitations of traditional spectra identification methods. We conducted comprehensive benchmark tests, comparing six deep learning algorithms. Based on the results, we selected the VGG16 algorithm as the core model for DeepMS. This algorithm enables super-fast, end-to-end identification of peptide sequences from MS spectra with high accuracy. DeepMS is adaptable to post-translational modifications, enhancing its versatility. In fact, its identification speed surpasses the generation rate of MS spectra, enabling super-fast identification. Furthermore, we demonstrate the practical application of DeepMS in microorganism detection, highlighting its utility in clinical testing. Through the implementation of DeepMS, our aim is to revolutionize the field of MS-based proteomics and facilitate the broader application of omics technologies, opening new avenues for rapid and efficient analysis in various research and clinical domains.
Affiliation(s)
- Qianzhou Wei
  - Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes and MOE Key Laboratory of Tumor Molecular Biology, Institute of Life and Health Engineering, Jinan University, Guangzhou, 510632, China.
- Jiamin Li
  - Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes and MOE Key Laboratory of Tumor Molecular Biology, Institute of Life and Health Engineering, Jinan University, Guangzhou, 510632, China.
- Jin Ma
  - Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes and MOE Key Laboratory of Tumor Molecular Biology, Institute of Life and Health Engineering, Jinan University, Guangzhou, 510632, China.
- Qing-Yu He
  - Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes and MOE Key Laboratory of Tumor Molecular Biology, Institute of Life and Health Engineering, Jinan University, Guangzhou, 510632, China.
- Gong Zhang
  - Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes and MOE Key Laboratory of Tumor Molecular Biology, Institute of Life and Health Engineering, Jinan University, Guangzhou, 510632, China.

2. Bushuiev R, Bushuiev A, Samusevich R, Brungs C, Sivic J, Pluskal T. Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS. Nat Biotechnol 2025:10.1038/s41587-025-02663-3. [PMID: 40410407 DOI: 10.1038/s41587-025-02663-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/22/2024] [Accepted: 03/31/2025] [Indexed: 05/25/2025]
Abstract
Characterizing biological and environmental samples at a molecular level primarily uses tandem mass spectroscopy (MS/MS), yet the interpretation of tandem mass spectra from untargeted metabolomics experiments remains a challenge. Existing computational methods for predictions from mass spectra rely on limited spectral libraries and on hard-coded human expertise. Here we introduce a transformer-based neural network pre-trained in a self-supervised way on millions of unannotated tandem mass spectra from our GNPS Experimental Mass Spectra (GeMS) dataset mined from the MassIVE GNPS repository. We show that pre-training our model to predict masked spectral peaks and chromatographic retention orders leads to the emergence of rich representations of molecular structures, which we named Deep Representations Empowering the Annotation of Mass Spectra (DreaMS). Further fine-tuning the neural network yields state-of-the-art performance across a variety of tasks. We make our new dataset and model available to the community and release the DreaMS Atlas, a molecular network of 201 million MS/MS spectra constructed using DreaMS annotations.
Affiliation(s)
- Roman Bushuiev
  - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Prague, Czech Republic
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University, Prague, Czech Republic
- Anton Bushuiev
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University, Prague, Czech Republic
- Raman Samusevich
  - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Prague, Czech Republic
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University, Prague, Czech Republic
- Corinna Brungs
  - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Prague, Czech Republic
- Josef Sivic
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University, Prague, Czech Republic.
- Tomáš Pluskal
  - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Prague, Czech Republic.

3. Angelis J, Schröder EA, Xiao Z, Gabriel W, Wilhelm M. Peptide Property Prediction for Mass Spectrometry Using AI: An Introduction to State of the Art Models. Proteomics 2025; 25:e202400398. [PMID: 40211610 PMCID: PMC12076536 DOI: 10.1002/pmic.202400398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2024] [Revised: 03/14/2025] [Accepted: 03/17/2025] [Indexed: 05/15/2025]
Abstract
This review explores state of the art machine learning and deep learning models for peptide property prediction in mass spectrometry-based proteomics, including, but not limited to, models for predicting digestibility, retention time, charge state distribution, collisional cross section, fragmentation ion intensities, and detectability. The combination of these models enables not only the in silico generation of spectral libraries but also finds many additional use cases in the design of targeted assays or data-driven rescoring. This review serves as both an introduction for newcomers and an update for experienced researchers aiming to develop accessible and reproducible models for peptide property predictions. Key limitations of the current models, including difficulties in handling diverse post-translational modifications and instrument variability, highlight the need for large-scale, harmonized datasets, and standardized evaluation metrics for benchmarking.
Affiliation(s)
- Jesse Angelis
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Eva Ayla Schröder
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Zixuan Xiao
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Wassim Gabriel
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Mathias Wilhelm
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
  - Munich Data Science Institute (MDSI), Technical University of Munich, Garching, Germany

4. Yang K, Paulo JA, Gygi SP, Yu Q. Enhanced Sample Multiplexing-Based Targeted Proteomics with Intelligent Data Acquisition. J Am Soc Mass Spectrom 2024; 35:2420-2428. [PMID: 39254261 PMCID: PMC11967381 DOI: 10.1021/jasms.4c00234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/11/2024]
Abstract
Targeted proteomics has been playing an increasingly important role in hypothesis-driven protein research and clinical biomarker discovery. We previously created a workflow, Tomahto, to enable real-time targeted pathway proteomics assays using two-dimensional multiplexing technology. Coupled with the TMT 11-plex reagent, hundreds of proteins of interest from up to 11 samples can be targeted and accurately quantified in a single-shot experiment with remarkable sensitivity. However, room remains to further improve the sensitivity, accuracy, and throughput, especially for targeted studies demanding a high peptide-level success rate. Here, bearing in mind the goal to improve peptide-level targeting, we introduce several new functionalities in Tomahto, featuring the integration of gas-phase fractionation using the FAIMS device, an accompanying software program (TomahtoPrimer) to customize fragmentation for each peptide target, and support for higher multiplexing capacity with the latest TMTpro reagent. We demonstrate that adding these features to the Tomahto platform significantly improves overall success rate from 89% to 98% in a single 60 min targeted assay of 290 peptides across human cell lines, while boosting quantitative accuracy via reducing TMT reporter ion interference.
Affiliation(s)
- Ka Yang
  - Department of Cell Biology, Harvard Medical School, Boston, Massachusetts 02115, United States
- Joao A Paulo
  - Department of Cell Biology, Harvard Medical School, Boston, Massachusetts 02115, United States
- Steven P Gygi
  - Department of Cell Biology, Harvard Medical School, Boston, Massachusetts 02115, United States
- Qing Yu
  - Department of Cell Biology, Harvard Medical School, Boston, Massachusetts 02115, United States
  - Department of Biochemistry and Molecular Biotechnology, University of Massachusetts Chan Medical School, Worcester, Massachusetts 01605, United States

5. Dens C, Adams C, Laukens K, Bittremieux W. Machine Learning Strategies to Tackle Data Challenges in Mass Spectrometry-Based Proteomics. J Am Soc Mass Spectrom 2024; 35:2143-2155. [PMID: 39074335 DOI: 10.1021/jasms.4c00180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/31/2024]
Abstract
In computational proteomics, machine learning (ML) has emerged as a vital tool for enhancing data analysis. Despite significant advancements, the diversity of ML model architectures and the complexity of proteomics data present substantial challenges in the effective development and evaluation of these tools. Here, we highlight the necessity for high-quality, comprehensive data sets to train ML models and advocate for the standardization of data to support robust model development. We emphasize the instrumental role of key data sets like ProteomeTools and MassIVE-KB in advancing ML applications in proteomics and discuss the implications of data set size on model performance, highlighting that larger data sets typically yield more accurate models. To address data scarcity, we explore algorithmic strategies such as self-supervised pretraining and multitask learning. Ultimately, we hope that this discussion can serve as a call to action for the proteomics community to collaborate on data standardization and collection efforts, which are crucial for the sustainable advancement and refinement of ML methodologies in the field.
Affiliation(s)
- Ceder Dens
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Middelheimlaan 1, 2020 Antwerpen, Belgium
- Charlotte Adams
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Middelheimlaan 1, 2020 Antwerpen, Belgium
- Kris Laukens
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Middelheimlaan 1, 2020 Antwerpen, Belgium
- Wout Bittremieux
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Middelheimlaan 1, 2020 Antwerpen, Belgium

6. McWhite CD, Sae-Lee W, Yuan Y, Mallam AL, Gort-Freitas NA, Ramundo S, Onishi M, Marcotte EM. Alternative proteoforms and proteoform-dependent assemblies in humans and plants. Mol Syst Biol 2024; 20:933-951. [PMID: 38918600 PMCID: PMC11297038 DOI: 10.1038/s44320-024-00048-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2023] [Revised: 06/04/2024] [Accepted: 06/06/2024] [Indexed: 06/27/2024]
Abstract
The variability of proteins at the sequence level creates an enormous potential for proteome complexity. Exploring the depths and limits of this complexity is an ongoing goal in biology. Here, we systematically survey human and plant high-throughput bottom-up native proteomics data for protein truncation variants, where substantial regions of the full-length protein are missing from an observed protein product. In humans, Arabidopsis, and the green alga Chlamydomonas, approximately one percent of observed proteins show a short form, which we can assign by comparison to RNA isoforms as either likely deriving from transcript-directed processes or limited proteolysis. While some detected protein fragments align with known splice forms and protein cleavage events, multiple examples are previously undescribed, such as our observation of fibrocystin proteolysis and nuclear translocation in a green alga. We find that truncations occur almost entirely between structured protein domains, even when short forms are derived from transcript variants. Intriguingly, multiple endogenous protein truncations of phase-separating translational proteins resemble cleaved proteoforms produced by enteroviruses during infection. Some truncated proteins are also observed in both humans and plants, suggesting that they date to the last eukaryotic common ancestor. Finally, we describe novel proteoform-specific protein complexes, where the loss of a domain may accompany complex formation.
Affiliation(s)
- Claire D McWhite
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, 08544, USA.
- Wisath Sae-Lee
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, 78712, USA
- Yaning Yuan
  - Department of Biology, Duke University, Durham, NC, 27708, USA
- Anna L Mallam
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, 78712, USA
- Silvia Ramundo
  - Gregor Mendel Institute of Molecular Plant Biology, 1030, Wien, Austria
- Masayuki Onishi
  - Department of Biology, Duke University, Durham, NC, 27708, USA
- Edward M Marcotte
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, 78712, USA

7. Beck A, Muhoberac M, Randolph CE, Beveridge CH, Wijewardhane PR, Kenttämaa HI, Chopra G. Recent Developments in Machine Learning for Mass Spectrometry. ACS Meas Sci Au 2024; 4:233-246. [PMID: 38910862 PMCID: PMC11191731 DOI: 10.1021/acsmeasuresciau.3c00060] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Received: 10/06/2023] [Revised: 12/27/2023] [Accepted: 01/22/2024] [Indexed: 06/25/2024]
Abstract
Statistical analysis and modeling of mass spectrometry (MS) data have a long and rich history with several modern MS-based applications using statistical and chemometric methods. Recently, machine learning (ML) has experienced a renaissance due to advents in computational hardware and the development of new algorithms for artificial neural networks (ANN) and deep learning architectures. Moreover, recent successes of new ANN and deep learning architectures in several areas of science, engineering, and society have further strengthened the ML field. Importantly, modern ML methods and architectures have enabled new approaches for tasks related to MS that are now widely adopted in several popular MS-based subdisciplines, such as mass spectrometry imaging and proteomics. Herein, we aim to provide an introductory summary of the practical aspects of ML methodology relevant to MS. Additionally, we seek to provide an up-to-date review of the most recent developments in ML integration with MS-based techniques while also providing critical insights into the future direction of the field.
Affiliation(s)
- Armen G. Beck
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Matthew Muhoberac
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Caitlin E. Randolph
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Connor H. Beveridge
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Prageeth R. Wijewardhane
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Hilkka I. Kenttämaa
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Gaurav Chopra
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
  - Department of Computer Science (by courtesy), Purdue University, West Lafayette, Indiana 47907, United States
  - Purdue Institute for Drug Discovery, Purdue Institute for Cancer Research, Regenstrief Center for Healthcare Engineering, Purdue Institute for Inflammation, Immunology and Infectious Disease, Purdue Institute for Integrative Neuroscience, West Lafayette, Indiana 47907, United States

8. Hamaneh M, Ogurtsov AY, Obolensky OI, Yu YK. Systematic Assessment of Deep Learning-Based Predictors of Fragmentation Intensity Profiles. J Proteome Res 2024; 23:1983-1999. [PMID: 38728051 PMCID: PMC11165591 DOI: 10.1021/acs.jproteome.3c00857] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Revised: 03/05/2024] [Accepted: 04/16/2024] [Indexed: 06/13/2024]
Abstract
In recent years, several deep learning-based methods have been proposed for predicting peptide fragment intensities. This study aims to provide a comprehensive assessment of six such methods, namely Prosit, DeepMass:Prism, pDeep3, AlphaPeptDeep, Prosit Transformer, and the method proposed by Guan et al. To this end, we evaluated the accuracy of the predicted intensity profiles for close to 1.7 million precursors (including both tryptic and HLA peptides) corresponding to more than 18 million experimental spectra procured from 40 independent submissions to the PRIDE repository that were acquired for different species using a variety of instruments and different dissociation types/energies. Specifically, for each method, distributions of similarity (measured by Pearson's correlation and normalized angle) between the predicted and the corresponding experimental b and y fragment intensities were generated. These distributions were used to ascertain the prediction accuracy and rank the prediction methods for particular types of experimental conditions. The effect of variables like precursor charge, length, and collision energy on the prediction accuracy was also investigated. In addition to prediction accuracy, the methods were evaluated in terms of prediction speed. The systematic assessment of these six methods may help in choosing the right method for MS/MS spectra prediction for particular needs.
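As a rough illustration of the two similarity measures named in this abstract, the sketch below computes Pearson's correlation and a normalized angle between a predicted and an experimental fragment intensity vector. The abstract does not state the exact formula used in the study; the normalized angle here assumes the commonly used form 1 - 2·arccos(cosine similarity)/π, and the function names and example intensities are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length intensity vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def normalized_angle(x, y):
    """Normalized angle similarity (assumed definition, see lead-in).

    Equals 1.0 for vectors pointing in the same direction and 0.0 for
    orthogonal non-negative vectors. Assumes both vectors are non-zero.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


# Hypothetical b/y fragment intensities for one precursor.
predicted = [0.10, 0.80, 0.30, 0.00, 0.55]
experimental = [0.12, 0.75, 0.35, 0.05, 0.50]

print(pearson(predicted, experimental))
print(normalized_angle(predicted, experimental))
```

Collecting these two values over many precursors would give similarity distributions of the kind the abstract describes.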
Affiliation(s)
- Mehdi B. Hamaneh
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
- Aleksey Y. Ogurtsov
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
- Yi-Kuo Yu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States

9. Adams C, Laukens K, Bittremieux W, Boonen K. Machine learning-based peptide-spectrum match rescoring opens up the immunopeptidome. Proteomics 2024; 24:e2300336. [PMID: 38009585 DOI: 10.1002/pmic.202300336] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 09/06/2023] [Revised: 10/18/2023] [Accepted: 10/23/2023] [Indexed: 11/29/2023]
Abstract
Immunopeptidomics is a key technology in the discovery of targets for immunotherapy and vaccine development. However, identifying immunopeptides remains challenging due to their non-tryptic nature, which results in distinct spectral characteristics. Moreover, the absence of strict digestion rules leads to extensive search spaces, further amplified by the incorporation of somatic mutations, pathogen genomes, unannotated open reading frames, and post-translational modifications. This inflation in search space leads to an increase in random high-scoring matches, resulting in fewer identifications at a given false discovery rate. Peptide-spectrum match rescoring has emerged as a machine learning-based solution to address challenges in mass spectrometry-based immunopeptidomics data analysis. It involves post-processing unfiltered spectrum annotations to better distinguish between correct and incorrect peptide-spectrum matches. Recently, features based on predicted peptidoform properties, including fragment ion intensities, retention time, and collisional cross section, have been used to improve the accuracy and sensitivity of immunopeptide identification. In this review, we describe the diverse bioinformatics pipelines that are currently available for peptide-spectrum match rescoring and discuss how they can be used for the analysis of immunopeptidomics data. Finally, we provide insights into current and future machine learning solutions to boost immunopeptide identification.
Affiliation(s)
- Charlotte Adams
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium
  - Laboratory of Protein Science, Proteomics and Epigenetic Signaling (PPES), Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
- Kris Laukens
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium
- Wout Bittremieux
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium
- Kurt Boonen
  - Laboratory of Protein Science, Proteomics and Epigenetic Signaling (PPES), Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
  - ImmuneSpec BV, Niel, Belgium

10. Le NQK. Leveraging transformers-based language models in proteome bioinformatics. Proteomics 2023; 23:e2300011. [PMID: 37381841 DOI: 10.1002/pmic.202300011] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 05/14/2023] [Revised: 06/13/2023] [Accepted: 06/13/2023] [Indexed: 06/30/2023]
Abstract
In recent years, the rapid growth of biological data has increased interest in using bioinformatics to analyze and interpret this data. Proteomics, which studies the structure, function, and interactions of proteins, is a crucial area of bioinformatics. Using natural language processing (NLP) techniques in proteomics is an emerging field that combines machine learning and text mining to analyze biological data. Recently, transformer-based NLP models have gained significant attention for their ability to process variable-length input sequences in parallel, using self-attention mechanisms to capture long-range dependencies. In this review paper, we discuss the recent advancements in transformer-based NLP models in proteome bioinformatics and examine their advantages, limitations, and potential applications to improve the accuracy and efficiency of various tasks. Additionally, we highlight the challenges and future directions of using these models in proteome bioinformatics research. Overall, this review provides valuable insights into the potential of transformer-based NLP models to revolutionize proteome bioinformatics.
Affiliation(s)
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
  - AIBioMed Research Group, Taipei Medical University, Taipei, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, Taiwan

11. Geer LY, Lapin J, Slotta DJ, Mak TD, Stein SE. AIomics: Exploring More of the Proteome Using Mass Spectral Libraries Extended by Artificial Intelligence. J Proteome Res 2023; 22:2246-2255. [PMID: 37232537 PMCID: PMC10542943 DOI: 10.1021/acs.jproteome.2c00807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2023]
Abstract
The unbounded permutations of biological molecules, including proteins and their constituent peptides, present a dilemma in identifying the components of complex biosamples. Sequence search algorithms used to identify peptide spectra can be expanded to cover larger classes of molecules, including more modifications, isoforms, and atypical cleavage, but at the cost of false positives or false negatives due to the simplified spectra they compute from sequence records. Spectral library searching can help solve this issue by precisely matching experimental spectra to library spectra with excellent sensitivity and specificity. However, compiling spectral libraries that span entire proteomes is pragmatically difficult. Neural networks that predict complete spectra containing a full range of annotated and unannotated ions can be used to replace these simplified spectra with libraries of fully predicted spectra, including modified peptides. Using such a network, we created predicted spectral libraries that were used to rescore matches from a sequence search done over a large search space, including a large number of modifications. Rescoring improved the separation of true and false hits by 82%, yielding an 8% increase in peptide identifications, including a 21% increase in nonspecifically cleaved peptides and a 17% increase in phosphopeptides.
Affiliation(s)
- Lewis Y. Geer
  - Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
- Joel Lapin
  - Department of Physics, Georgetown University, Washington, DC 20057, United States
  - Associate, Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
- Douglas J. Slotta
  - Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
- Tytus D. Mak
  - Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
- Stephen E. Stein
  - Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States

12. Neely BA, Dorfer V, Martens L, Bludau I, Bouwmeester R, Degroeve S, Deutsch EW, Gessulat S, Käll L, Palczynski P, Payne SH, Rehfeldt TG, Schmidt T, Schwämmle V, Uszkoreit J, Vizcaíno JA, Wilhelm M, Palmblad M. Toward an Integrated Machine Learning Model of a Proteomics Experiment. J Proteome Res 2023; 22:681-696. [PMID: 36744821 PMCID: PMC9990124 DOI: 10.1021/acs.jproteome.2c00711] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 11/01/2022] [Indexed: 02/07/2023]
Abstract
In recent years machine learning has made extensive progress in modeling many aspects of mass spectrometry data. We brought together proteomics data generators, repository managers, and machine learning experts in a workshop with the goals to evaluate and explore machine learning applications for realistic modeling of data from multidimensional mass spectrometry-based proteomics analysis of any sample or organism. Following this sample-to-data roadmap helped identify knowledge gaps and define needs. Being able to generate bespoke and realistic synthetic data has legitimate and important uses in system suitability, method development, and algorithm benchmarking, while also posing critical ethical questions. The interdisciplinary nature of the workshop informed discussions of what is currently possible and future opportunities and challenges. In the following perspective we summarize these discussions in the hope of conveying our excitement about the potential of machine learning in proteomics and to inspire future research.
Collapse
Affiliation(s)
- Benjamin A. Neely
- National
Institute of Standards and Technology, Charleston, South Carolina 29412, United States
| | - Viktoria Dorfer
- Bioinformatics
Research Group, University of Applied Sciences
Upper Austria, Softwarepark
11, 4232 Hagenberg, Austria
| | - Lennart Martens
- VIB-UGent
Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
- Department
of Biomolecular Medicine, Faculty of Health Sciences and Medicine, Ghent University, 9000 Ghent, Belgium
| | - Isabell Bludau
- Department
of Proteomics and Signal Transduction, Max
Planck Institute of Biochemistry, 82152 Martinsried, Germany
| | - Robbin Bouwmeester
- VIB-UGent
Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
- Department
of Biomolecular Medicine, Faculty of Health Sciences and Medicine, Ghent University, 9000 Ghent, Belgium
| | - Sven Degroeve
- VIB-UGent
Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
- Department
of Biomolecular Medicine, Faculty of Health Sciences and Medicine, Ghent University, 9000 Ghent, Belgium
| | - Eric W. Deutsch
- Institute
for Systems Biology, Seattle, Washington 98109, United States
| | | | - Lukas Käll
- Science
for Life Laboratory, KTH - Royal Institute
of Technology, 171 21 Solna, Sweden
| | - Pawel Palczynski
- Department
of Biochemistry and Molecular Biology, University
of Southern Denmark, 5230 Odense, Denmark
| | - Samuel H. Payne
- Department
of Biology, Brigham Young University, Provo, Utah 84602, United States
| | - Tobias Greisager Rehfeldt
- Institute for Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
| | | | - Veit Schwämmle
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
| | - Julian Uszkoreit
- Medical Proteome Analysis, Center for Protein Diagnostics (ProDi), Ruhr University Bochum, 44801 Bochum, Germany
- Medizinisches Proteom-Center, Medical Faculty, Ruhr University Bochum, 44801 Bochum, Germany
| | - Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Mathias Wilhelm
- Computational Mass Spectrometry, Technical University of Munich (TUM), 85354 Freising, Germany
| | - Magnus Palmblad
- Leiden University Medical Center, Postbus 9600, 2300 RC Leiden, The Netherlands
| |
|
13
|
Álvarez-Urdiola R, Borràs E, Valverde F, Matus JT, Sabidó E, Riechmann JL. Peptidomics Methods Applied to the Study of Flower Development. Methods Mol Biol 2023; 2686:509-536. [PMID: 37540375 DOI: 10.1007/978-1-0716-3299-4_24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/05/2023]
Abstract
Understanding the global and dynamic nature of plant developmental processes requires not only the study of the transcriptome, but also of the proteome, including its largely uncharacterized peptidome fraction. Recent advances in proteomics and high-throughput analyses of translating RNAs (ribosome profiling) have begun to address this issue, evidencing the existence of novel, uncharacterized, and possibly functional peptides. To validate the accumulation in tissues of sORF-encoded polypeptides (SEPs), the basic setup of proteomic analyses (i.e., LC-MS/MS) can be followed. However, the detection of peptides that are small (up to ~100 aa, 6-7 kDa) and novel (i.e., not annotated in reference databases) presents specific challenges that need to be addressed both experimentally and with computational biology resources. Several methods have been developed in recent years to isolate and identify peptides from plant tissues. In this chapter, we outline two different peptide extraction protocols and the subsequent peptide identification by mass spectrometry using the database search or the de novo identification methods.
Affiliation(s)
- Raquel Álvarez-Urdiola
- Centre for Research in Agricultural Genomics (CRAG) CSIC-IRTA-UAB-UB, Edifici CRAG, Campus UAB, Cerdanyola del Vallès, Barcelona, Spain
| | - Eva Borràs
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology, Barcelona, Spain
- Universitat Pompeu Fabra, Barcelona, Spain
| | - Federico Valverde
- Institute for Plant Biochemistry and Photosynthesis CSIC - University of Seville, Seville, Spain
| | - José Tomás Matus
- Centre for Research in Agricultural Genomics (CRAG) CSIC-IRTA-UAB-UB, Edifici CRAG, Campus UAB, Cerdanyola del Vallès, Barcelona, Spain
- Institute for Integrative Systems Biology (I2SysBio), Universitat de València-CSIC, Paterna, Valencia, Spain
| | - Eduard Sabidó
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology, Barcelona, Spain
- Universitat Pompeu Fabra, Barcelona, Spain
| | - José Luis Riechmann
- Centre for Research in Agricultural Genomics (CRAG) CSIC-IRTA-UAB-UB, Edifici CRAG, Campus UAB, Cerdanyola del Vallès, Barcelona, Spain.
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain.
| |
|
14
|
Zeng WF, Zhou XX, Willems S, Ammar C, Wahle M, Bludau I, Voytik E, Strauss MT, Mann M. AlphaPeptDeep: a modular deep learning framework to predict peptide properties for proteomics. Nat Commun 2022; 13:7238. [PMID: 36433986 PMCID: PMC9700817 DOI: 10.1038/s41467-022-34904-3] [Citation(s) in RCA: 63] [Impact Index Per Article: 21.0] [Received: 07/14/2022] [Accepted: 11/10/2022] [Indexed: 11/27/2022] Open
Abstract
Machine learning and in particular deep learning (DL) are increasingly important in mass spectrometry (MS)-based proteomics. Recent DL models can predict the retention time, ion mobility and fragment intensities of a peptide just from the amino acid sequence with good accuracy. However, DL is a very rapidly developing field with new neural network architectures frequently appearing, which are challenging to incorporate for proteomics researchers. Here we introduce AlphaPeptDeep, a modular Python framework built on the PyTorch DL library that learns and predicts the properties of peptides ( https://github.com/MannLabs/alphapeptdeep ). It features a model shop that enables non-specialists to create models in just a few lines of code. AlphaPeptDeep represents post-translational modifications in a generic manner, even if only the chemical composition is known. Extensive use of transfer learning obviates the need for large data sets to refine models for particular experimental conditions. The AlphaPeptDeep models for predicting retention time, collisional cross sections and fragment intensities are at least on par with existing tools. Additional sequence-based properties can also be predicted by AlphaPeptDeep, as demonstrated with a HLA peptide prediction model to improve HLA peptide identification for data-independent acquisition ( https://github.com/MannLabs/PeptDeep-HLA ).
Affiliation(s)
- Wen-Feng Zeng
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Xie-Xuan Zhou
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Sander Willems
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Constantin Ammar
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Maria Wahle
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Isabell Bludau
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Eugenia Voytik
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Maximillian T Strauss
- Proteomics Program, NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Matthias Mann
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany.
- Proteomics Program, NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark.
| |
|