1
Chan CMJ, Madej D, Chung CKJ, Lam H. Deep Learning-Based Prediction of Decoy Spectra for False Discovery Rate Estimation in Spectral Library Searching. J Proteome Res 2025. [PMID: 40252226 DOI: 10.1021/acs.jproteome.4c00304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/21/2025]
Abstract
With the advantage of extensive coverage, predicted spectral libraries are becoming an attractive alternative in proteomic data analysis. As a popular false discovery rate estimation method, target decoy search has been adopted in library search workflows. While existing decoy methods for curated experimental libraries have been tested, their performance in predicted library scenarios remains unknown. Current methods rely on perturbing real spectra templates, limiting the diversity and number of decoy spectra that can be generated for a given library. In this study, we explore the shuffle-and-predict decoy library generation approach, which can generate decoy spectra without the need for template spectra. Our experiments shed light on decoy method performance for predicted library scenarios and demonstrate the quality of predicted decoys in FDR estimation.
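As a rough illustration of the two ideas named in this abstract, the sketch below shuffles a peptide sequence to obtain a decoy sequence and estimates the FDR from target and decoy matches above a score threshold. The function names, the choice to keep the C-terminal residue fixed, and the simple decoys/targets estimator are assumptions made for illustration, not details taken from the paper; in a shuffle-and-predict workflow the shuffled sequence would then be passed to a spectrum prediction model, which is not shown here.

```python
import random


def shuffle_decoy(peptide: str, rng: random.Random) -> str:
    """Return a decoy sequence by shuffling all residues except the last.

    Keeping the C-terminal residue fixed is an assumption made here so that
    tryptic peptides keep their K/R ending.
    """
    if len(peptide) < 3:
        return peptide
    body = list(peptide[:-1])
    rng.shuffle(body)
    return "".join(body) + peptide[-1]


def estimate_fdr(matches, threshold: float) -> float:
    """Estimate FDR as (#decoy matches) / (#target matches) at score >= threshold.

    `matches` is an iterable of (score, is_decoy) pairs (hypothetical format).
    """
    targets = sum(1 for score, is_decoy in matches if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    print(shuffle_decoy("PEPTIDEK", rng))
    psms = [(0.9, False), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
    print(estimate_fdr(psms, 0.6))
```

Fixing the random seed makes the decoy library reproducible, which is convenient when comparing FDR estimates between runs.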
Affiliation(s)
- Chak Ming Jerry Chan
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China 999077
- Dominik Madej
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China 999077
- Chun Kit Jason Chung
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China 999077
- Henry Lam
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China 999077
2
Wang K, Zhu M, Boulila W, Driss M, Gadekallu TR, Chen CM, Wang L, Kumari S, Yiu SM. SeqNovo: De Novo Peptide Sequencing Prediction in IoMT via Seq2Seq. IEEE J Biomed Health Inform 2025; 29:2377-2387. [PMID: 37792659 DOI: 10.1109/jbhi.2023.3321780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/06/2023]
Abstract
In the Internet of Medical Things (IoMT), de novo peptide sequencing prediction is one of the most important techniques for disease prediction, diagnosis, and treatment. Recently, deep-learning-based peptide sequencing prediction has become a new trend. However, most popular deep learning models for peptide sequencing prediction suffer from poor interpretability and a poor ability to capture long-range dependencies. To solve these issues, we propose a model named SeqNovo, which combines the encoding-decoding structure of sequence to sequence (Seq2Seq), the highly nonlinear properties of the multilayer perceptron (MLP), and the ability of the attention mechanism to capture long-range dependencies. SeqNovo uses the MLP to improve feature extraction and the attention mechanism to discover key information. A series of experiments shows that SeqNovo is superior to the Seq2Seq benchmark model, DeepNovo. SeqNovo improves both the accuracy and the interpretability of the predictions, which is expected to support further related research.
3
Wen B, Hsu C, Zeng WF, Riffle M, Chang A, Mudge M, Nunn B, Berg MD, Villén J, MacCoss MJ, Noble WS. Carafe enables high quality in silico spectral library generation for data-independent acquisition proteomics. bioRxiv 2024:2024.10.15.618504. [PMID: 39463980 PMCID: PMC11507862 DOI: 10.1101/2024.10.15.618504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/29/2024]
Abstract
Data-independent acquisition (DIA)-based mass spectrometry is becoming an increasingly popular mass spectrometry acquisition strategy for carrying out quantitative proteomics experiments. Most of the popular DIA search engines make use of in silico generated spectral libraries. However, the generation of high-quality spectral libraries for DIA data analysis remains a challenge, particularly because most such libraries are generated directly from data-dependent acquisition (DDA) data or are from in silico prediction using models trained on DDA data. In this study, we developed Carafe, a tool that generates high-quality experiment-specific in silico spectral libraries by training deep learning models directly on DIA data. We demonstrate the performance of Carafe on a wide range of DIA datasets, where we observe improved fragment ion intensity prediction and peptide detection relative to existing pretrained DDA models.
Affiliation(s)
- Bo Wen
  - Department of Genome Sciences, University of Washington
- Chris Hsu
  - Department of Genome Sciences, University of Washington
- Wen-Feng Zeng
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Germany
- Alexis Chang
  - Department of Genome Sciences, University of Washington
- Miranda Mudge
  - Department of Genome Sciences, University of Washington
- Brook Nunn
  - Department of Genome Sciences, University of Washington
- Judit Villén
  - Department of Genome Sciences, University of Washington
- William S. Noble
  - Department of Genome Sciences, University of Washington
  - Paul G. Allen School of Computer Science and Engineering, University of Washington
4
Dens C, Adams C, Laukens K, Bittremieux W. Machine Learning Strategies to Tackle Data Challenges in Mass Spectrometry-Based Proteomics. J Am Soc Mass Spectrom 2024; 35:2143-2155. [PMID: 39074335 DOI: 10.1021/jasms.4c00180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/31/2024]
Abstract
In computational proteomics, machine learning (ML) has emerged as a vital tool for enhancing data analysis. Despite significant advancements, the diversity of ML model architectures and the complexity of proteomics data present substantial challenges in the effective development and evaluation of these tools. Here, we highlight the necessity for high-quality, comprehensive data sets to train ML models and advocate for the standardization of data to support robust model development. We emphasize the instrumental role of key data sets like ProteomeTools and MassIVE-KB in advancing ML applications in proteomics and discuss the implications of data set size on model performance, highlighting that larger data sets typically yield more accurate models. To address data scarcity, we explore algorithmic strategies such as self-supervised pretraining and multitask learning. Ultimately, we hope that this discussion can serve as a call to action for the proteomics community to collaborate on data standardization and collection efforts, which are crucial for the sustainable advancement and refinement of ML methodologies in the field.
Affiliation(s)
- Ceder Dens
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Middelheimlaan 1, 2020 Antwerpen, Belgium
- Charlotte Adams
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Middelheimlaan 1, 2020 Antwerpen, Belgium
- Kris Laukens
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Middelheimlaan 1, 2020 Antwerpen, Belgium
- Wout Bittremieux
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Middelheimlaan 1, 2020 Antwerpen, Belgium
5
He G, He Q, Cheng J, Yu R, Shuai J, Cao Y. ProPept-MT: A Multi-Task Learning Model for Peptide Feature Prediction. Int J Mol Sci 2024; 25:7237. [PMID: 39000344 PMCID: PMC11241495 DOI: 10.3390/ijms25137237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/28/2024] [Revised: 06/26/2024] [Accepted: 06/28/2024] [Indexed: 07/16/2024] Open
Abstract
In the realm of quantitative proteomics, data-independent acquisition (DIA) has emerged as a promising approach, offering enhanced reproducibility and quantitative accuracy compared to traditional data-dependent acquisition (DDA) methods. However, the analysis of DIA data is currently hindered by its reliance on project-specific spectral libraries derived from DDA analyses, which not only limits proteome coverage but also proves to be a time-intensive process. To overcome these challenges, we propose ProPept-MT, a novel deep learning-based multi-task prediction model designed to accurately forecast key features such as retention time (RT), ion intensity, and ion mobility (IM). Leveraging advanced techniques such as multi-head attention and BiLSTM for feature extraction, coupled with Nash-MTL for gradient coordination, ProPept-MT demonstrates superior prediction performance. Integrating ion mobility alongside RT, mass-to-charge ratio (m/z), and ion intensity forms 4D proteomics. Then, we outline a comprehensive workflow tailored for 4D DIA proteomics research, integrating the use of 4D in silico libraries predicted by ProPept-MT. Evaluation on a benchmark dataset showcases ProPept-MT's exceptional predictive capabilities, with impressive results including a 99.9% Pearson correlation coefficient (PCC) for RT prediction, a median dot product (DP) of 96.0% for fragment ion intensity prediction, and a 99.3% PCC for IM prediction on the test set. Notably, ProPept-MT manifests efficacy in predicting both unmodified and phosphorylated peptides, underscoring its potential as a valuable tool for constructing high-quality 4D DIA in silico libraries.
Affiliation(s)
- Guoqiang He
  - Postgraduate Training Base Alliance, Wenzhou Medical University, Wenzhou 325000, China
  - Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325000, China
- Qingzu He
  - Department of Physics, and Fujian Provincial Key Laboratory for Soft Functional Materials Research, Xiamen University, Xiamen 361005, China
- Jinyan Cheng
  - Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325000, China
- Rongwen Yu
  - Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325000, China
- Jianwei Shuai
  - Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325000, China
- Yi Cao
  - Postgraduate Training Base Alliance, Wenzhou Medical University, Wenzhou 325000, China
  - Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325000, China
6
Liu K, Tao C, Ye Y, Tang H. SpecEncoder: deep metric learning for accurate peptide identification in proteomics. Bioinformatics 2024; 40:i257-i265. [PMID: 38940141 PMCID: PMC11211836 DOI: 10.1093/bioinformatics/btae220] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION: Tandem mass spectrometry (MS/MS) is a crucial technology for large-scale proteomic analysis. Protein database search and spectral library search are commonly used for peptide identification from MS/MS spectra, but both may face challenges due to experimental variations between replicated spectra and similar fragmentation patterns among distinct peptides. To address these challenges, we present SpecEncoder, a deep metric learning approach that transforms MS/MS spectra into robust and sensitive embedding vectors in a latent space. The SpecEncoder model can also embed predicted MS/MS spectra of peptides, enabling a hybrid search approach that combines spectral library and protein database searches for peptide identification.

RESULTS: We evaluated SpecEncoder on three large human proteomics datasets, and the results showed a consistent improvement in peptide identification. For spectral library search, SpecEncoder identifies 1%-2% more unique peptides (and PSMs) than SpectraST. For protein database search, it identifies 6%-15% more unique peptides than MSGF+ enhanced by Percolator. Furthermore, SpecEncoder identified 6%-12% additional unique peptides when utilizing a combined library of experimental and predicted spectra. SpecEncoder can also identify more peptides than deep-learning enhanced methods (MSFragger boosted by MSBooster). These results demonstrate SpecEncoder's potential to enhance peptide identification for proteomic data analyses.

AVAILABILITY AND IMPLEMENTATION: The source code and scripts for SpecEncoder and peptide identification are available on GitHub at https://github.com/lkytal/SpecEncoder. Contact: hatang@iu.edu.
Affiliation(s)
- Kaiyuan Liu
  - Department of Computer Science, Luddy School of Informatics, Computing and Engineering, Indiana University, IN 47408, United States
- Chenghua Tao
  - Department of Computer Science, Luddy School of Informatics, Computing and Engineering, Indiana University, IN 47408, United States
- Yuzhen Ye
  - Department of Computer Science, Luddy School of Informatics, Computing and Engineering, Indiana University, IN 47408, United States
- Haixu Tang
  - Department of Computer Science, Luddy School of Informatics, Computing and Engineering, Indiana University, IN 47408, United States
7
Hamaneh M, Ogurtsov AY, Obolensky OI, Yu YK. Systematic Assessment of Deep Learning-Based Predictors of Fragmentation Intensity Profiles. J Proteome Res 2024; 23:1983-1999. [PMID: 38728051 PMCID: PMC11165591 DOI: 10.1021/acs.jproteome.3c00857] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Revised: 03/05/2024] [Accepted: 04/16/2024] [Indexed: 06/13/2024]
Abstract
In recent years, several deep learning-based methods have been proposed for predicting peptide fragment intensities. This study aims to provide a comprehensive assessment of six such methods, namely Prosit, DeepMass:Prism, pDeep3, AlphaPeptDeep, Prosit Transformer, and the method proposed by Guan et al. To this end, we evaluated the accuracy of the predicted intensity profiles for close to 1.7 million precursors (including both tryptic and HLA peptides) corresponding to more than 18 million experimental spectra procured from 40 independent submissions to the PRIDE repository that were acquired for different species using a variety of instruments and different dissociation types/energies. Specifically, for each method, distributions of similarity (measured by Pearson's correlation and normalized angle) between the predicted and the corresponding experimental b and y fragment intensities were generated. These distributions were used to ascertain the prediction accuracy and rank the prediction methods for particular types of experimental conditions. The effect of variables like precursor charge, length, and collision energy on the prediction accuracy was also investigated. In addition to prediction accuracy, the methods were evaluated in terms of prediction speed. The systematic assessment of these six methods may help in choosing the right method for MS/MS spectra prediction for particular needs.
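The two similarity measures mentioned in this abstract can be sketched as follows for a pair of aligned intensity vectors (one predicted, one experimental). The formula used for the normalized angle, 1 - 2·arccos(cosine similarity)/π, is the definition commonly used for spectral angle in this field and is assumed here; the abstract itself does not spell it out, and the function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length intensity vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def normalized_spectral_angle(x, y):
    """Assumed definition: 1 - 2 * arccos(cosine similarity) / pi.

    For non-negative intensities the result lies in [0, 1], where 1 means
    identical direction and 0 means orthogonal vectors.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return 1.0 - 2.0 * math.acos(cos) / math.pi


if __name__ == "__main__":
    predicted = [0.2, 0.5, 1.0, 0.1]
    observed = [0.25, 0.4, 1.0, 0.0]
    print(pearson(predicted, observed))
    print(normalized_spectral_angle(predicted, observed))
```

Both functions assume the vectors are non-constant and non-zero; a production implementation would need to handle those edge cases explicitly.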
Affiliation(s)
- Mehdi B. Hamaneh
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
- Aleksey Y. Ogurtsov
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
- Yi-Kuo Yu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
8
Adams C, Laukens K, Bittremieux W, Boonen K. Machine learning-based peptide-spectrum match rescoring opens up the immunopeptidome. Proteomics 2024; 24:e2300336. [PMID: 38009585 DOI: 10.1002/pmic.202300336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 10/18/2023] [Accepted: 10/23/2023] [Indexed: 11/29/2023]
Abstract
Immunopeptidomics is a key technology in the discovery of targets for immunotherapy and vaccine development. However, identifying immunopeptides remains challenging due to their non-tryptic nature, which results in distinct spectral characteristics. Moreover, the absence of strict digestion rules leads to extensive search spaces, further amplified by the incorporation of somatic mutations, pathogen genomes, unannotated open reading frames, and post-translational modifications. This inflation in search space leads to an increase in random high-scoring matches, resulting in fewer identifications at a given false discovery rate. Peptide-spectrum match rescoring has emerged as a machine learning-based solution to address challenges in mass spectrometry-based immunopeptidomics data analysis. It involves post-processing unfiltered spectrum annotations to better distinguish between correct and incorrect peptide-spectrum matches. Recently, features based on predicted peptidoform properties, including fragment ion intensities, retention time, and collisional cross section, have been used to improve the accuracy and sensitivity of immunopeptide identification. In this review, we describe the diverse bioinformatics pipelines that are currently available for peptide-spectrum match rescoring and discuss how they can be used for the analysis of immunopeptidomics data. Finally, we provide insights into current and future machine learning solutions to boost immunopeptide identification.
Affiliation(s)
- Charlotte Adams
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium
  - Laboratory of Protein Science, Proteomics and Epigenetic Signaling (PPES), Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
- Kris Laukens
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium
- Wout Bittremieux
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium
- Kurt Boonen
  - Laboratory of Protein Science, Proteomics and Epigenetic Signaling (PPES), Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
  - ImmuneSpec BV, Niel, Belgium
9
Lapin J, Yan X, Dong Q. UniSpec: Deep Learning for Predicting the Full Range of Peptide Fragment Ion Series to Enhance the Proteomics Data Analysis Workflow. Anal Chem 2024. [PMID: 38329031 DOI: 10.1021/acs.analchem.3c02321] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2024]
Abstract
We present UniSpec, an attention-driven deep neural network designed to predict comprehensive collision-induced fragmentation spectra, thereby improving peptide identification in shotgun proteomics. Utilizing a training data set of 1.8 million unique high-quality tandem mass spectra (MS2) from 0.8 million unique peptide ions, UniSpec learned with a peptide fragmentation dictionary encompassing 7919 fragment peaks. Among these, 5712 are neutral loss peaks, with 2310 corresponding to modification-specific neutral losses. Remarkably, UniSpec can predict 73%-77% of fragment intensities based on our NIST reference library spectra, a significant leap from the 35%-45% coverage of only b and y ions. Comparative studies with Prosit elucidate that while both models are strong at predicting their respective fragment ion series, UniSpec particularly shines in generating more complex MS2 spectra with diverse ion annotations. The integration of UniSpec's predictions into shotgun proteomics data analysis boosts the identification rate of tryptic peptides by 48% at a 1% false discovery rate (FDR) and 60% at a more confident 0.1% FDR. Using UniSpec's predicted in-silico spectral library, the search results closely matched those from search engines and experimental spectral libraries used in peptide identification, highlighting its potential as a stand-alone identification tool. The source code and Python scripts are available on GitHub (https://github.com/usnistgov/UniSpec) and Zenodo (https://zenodo.org/records/10452792), and all data sets and analysis results generated in this work were deposited in Zenodo (https://zenodo.org/records/10052268).
Affiliation(s)
- Joel Lapin
  - Department of Physics, Georgetown University, Washington, D.C. 20057, United States
  - Associate, Mass Spectrometry Data Center, Biomolecular Measurement Division, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
- Xinjian Yan
  - Mass Spectrometry Data Center, Biomolecular Measurement Division, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
- Qian Dong
  - Mass Spectrometry Data Center, Biomolecular Measurement Division, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
10
Park J, Jo J, Yoon S. Mass spectra prediction with structural motif-based graph neural networks. Sci Rep 2024; 14:1400. [PMID: 38228685 PMCID: PMC10792027 DOI: 10.1038/s41598-024-51760-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Accepted: 01/09/2024] [Indexed: 01/18/2024] Open
Abstract
Mass spectra, which are agglomerations of ionized fragments from targeted molecules, play a crucial role across various fields for the identification of molecular structures. A prevalent analysis method involves spectral library searches, where unknown spectra are cross-referenced with a database. The effectiveness of such search-based approaches, however, is restricted by the scope of the existing mass spectra database, underscoring the need to expand the database via mass spectra prediction. In this research, we propose the Motif-based Mass Spectrum prediction Network (MoMS-Net), a GNN-based architecture that predicts the mass spectrum pattern using the structural motif information of the molecule. MoMS-Net represents both a molecule and its substructures as graphs, which facilitates the incorporation of long-range dependencies while using less memory than the graph transformer model. We evaluated our model on various types of mass spectra and showed its validity and superiority over conventional models.
Affiliation(s)
- Jiwon Park
  - Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, 08826, Republic of Korea
  - LG Chem, Seoul, 07795, Republic of Korea
- Jeonghee Jo
  - Center for Neuromorphic Engineering, Korea Institute of Science and Technology (KIST), Seoul, 02792, Republic of Korea
- Sungroh Yoon
  - Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, 08826, Republic of Korea
  - Department of Electrical and Computer Engineering, Seoul National University, Seoul, 08826, Republic of Korea
  - Artificial Intelligence Institute, Seoul National University, Seoul, 08826, Republic of Korea
11
Klaproth-Andrade D, Hingerl J, Bruns Y, Smith NH, Träuble J, Wilhelm M, Gagneur J. Deep learning-driven fragment ion series classification enables highly precise and sensitive de novo peptide sequencing. Nat Commun 2024; 15:151. [PMID: 38167372 PMCID: PMC10762064 DOI: 10.1038/s41467-023-44323-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Accepted: 12/08/2023] [Indexed: 01/05/2024] Open
Abstract
Unlike for DNA and RNA, accurate and high-throughput sequencing methods for proteins are lacking, hindering the utility of proteomics in applications where the sequences are unknown including variant calling, neoepitope identification, and metaproteomics. We introduce Spectralis, a de novo peptide sequencing method for tandem mass spectrometry. Spectralis leverages several innovations including a convolutional neural network layer connecting peaks in spectra spaced by amino acid masses, proposing fragment ion series classification as a pivotal task for de novo peptide sequencing, and a peptide-spectrum confidence score. On spectra for which database search provided a ground truth, Spectralis surpassed 40% sensitivity at 90% precision, nearly doubling state-of-the-art sensitivity. Application to unidentified spectra confirmed its superiority and showcased its applicability to variant calling. Altogether, these algorithmic innovations and the substantial sensitivity increase in the high-precision range constitute an important step toward broadly applicable peptide sequencing.
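The idea of relating peaks that are spaced by amino acid masses can be illustrated with a simple search for peak pairs whose m/z difference matches a residue mass. This is only a plain-Python sketch of that underlying observation, assuming singly charged fragments and a fixed absolute tolerance; it is not the convolutional layer used by Spectralis, and the function name and tolerance value are hypothetical.

```python
# Monoisotopic residue masses in Da (leucine and isoleucine are isobaric).
RESIDUE_MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "I/L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def peak_pairs_by_residue(mzs, tol=0.02):
    """Return (low_mz, high_mz, residue) for every pair of peaks whose
    m/z gap matches a residue mass within `tol` Da.

    Assumes singly charged fragment ions, so the m/z gap equals the mass gap.
    """
    peaks = sorted(mzs)
    pairs = []
    for i, low in enumerate(peaks):
        for high in peaks[i + 1:]:
            gap = high - low
            for residue, mass in RESIDUE_MASSES.items():
                if abs(gap - mass) <= tol:
                    pairs.append((low, high, residue))
    return pairs


if __name__ == "__main__":
    for pair in peak_pairs_by_residue([175.119, 246.156, 303.178]):
        print(pair)
```

The quadratic scan is adequate for the few hundred peaks of a typical MS/MS spectrum; a binned or sorted-window search would be the natural next step for larger inputs.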
Affiliation(s)
- Daniela Klaproth-Andrade
  - Computational Molecular Medicine, School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Munich Data Science Institute, Technical University of Munich, Garching, Germany
- Johannes Hingerl
  - Computational Molecular Medicine, School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Yanik Bruns
  - Computational Molecular Medicine, School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Nicholas H Smith
  - Computational Molecular Medicine, School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Jakob Träuble
  - Computational Molecular Medicine, School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Mathias Wilhelm
  - Munich Data Science Institute, Technical University of Munich, Garching, Germany
  - Computational Mass Spectrometry, School of Life Sciences, Technical University of Munich, Freising, Germany
- Julien Gagneur
  - Computational Molecular Medicine, School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Munich Data Science Institute, Technical University of Munich, Garching, Germany
  - Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
  - Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
12
Liu K, Ye Y, Li S, Tang H. Accurate de novo peptide sequencing using fully convolutional neural networks. Nat Commun 2023; 14:7974. [PMID: 38042873 PMCID: PMC10693636 DOI: 10.1038/s41467-023-43010-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2022] [Accepted: 10/29/2023] [Indexed: 12/04/2023] Open
Abstract
De novo peptide sequencing, which does not rely on a comprehensive target sequence database, provides us with a way to identify novel peptides from tandem mass spectra. However, current de novo sequencing algorithms suffer from low accuracy and coverage, which hinders their application in proteomics. In this paper, we present PepNet, a fully convolutional neural network for high accuracy de novo peptide sequencing. PepNet takes an MS/MS spectrum (represented as a high-dimensional vector) as input, and outputs the optimal peptide sequence along with its confidence score. The PepNet model is trained using a total of 3 million high-energy collisional dissociation MS/MS spectra from multiple human peptide spectral libraries. Evaluation results show that PepNet significantly outperforms current best-performing de novo sequencing algorithms (e.g. PointNovo and DeepNovo) in both peptide-level accuracy and positional-level accuracy. PepNet can sequence a large fraction of spectra that were not identified by database search engines, and thus could be used as a complementary tool to database search engines for peptide identification in proteomics. In addition, PepNet runs around 3x and 7x faster than PointNovo and DeepNovo on GPUs, respectively, thus being more suitable for the analysis of large-scale proteomics data.
Affiliation(s)
- Kaiyuan Liu
  - Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, 47408, IN, USA
- Yuzhen Ye
  - Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, 47408, IN, USA
- Sujun Li
  - Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, 47408, IN, USA
  - Dengding BioAI Co., Ltd., Bloomington, USA
- Haixu Tang
  - Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, 47408, IN, USA
13
Chan CMJ, Lam H. Merging Full-Spectrum and Fragment Ion Intensity Predictions from Deep Learning for High-Quality Spectral Libraries. J Proteome Res 2023; 22:3692-3702. [PMID: 37910637 DOI: 10.1021/acs.jproteome.3c00180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2023]
Abstract
Spectral libraries are useful resources in proteomic data analysis. Recent advances in deep learning allow tandem mass spectra of peptides to be predicted from their amino acid sequences. This enables predicted spectral libraries to be compiled, and searching against such libraries has been shown to improve the sensitivity in peptide identification over conventional sequence database searching. However, current prediction models lack support for longer peptides, and thus far, predicted library searching has only been demonstrated for backbone ion-only spectrum prediction methods. Here, we propose a deep learning-based full-spectrum prediction method to generate predicted spectral libraries for peptide identification. We demonstrated the superiority of using full-spectrum libraries over backbone ion-only prediction approaches in spectral library searching. Furthermore, merging spectra from different prediction models, as a form of ensemble learning, can produce improved spectral libraries, in terms of identification sensitivity. We also show that a hybrid library combining predicted and experimental spectra can lead to 20% more confident identifications over experimental library searching or sequence database searching.
Affiliation(s)
- Chak Ming Jerry Chan
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong 999077, China
- Henry Lam
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong 999077, China
14
Yang KL, Yu F, Teo GC, Li K, Demichev V, Ralser M, Nesvizhskii AI. MSBooster: improving peptide identification rates using deep learning-based features. Nat Commun 2023; 14:4539. [PMID: 37500632 PMCID: PMC10374903 DOI: 10.1038/s41467-023-40129-9] [Citation(s) in RCA: 60] [Impact Index Per Article: 30.0] [Received: 11/07/2022] [Accepted: 07/06/2023] [Indexed: 07/29/2023] Open
Abstract
Peptide identification in liquid chromatography-tandem mass spectrometry (LC-MS/MS) experiments relies on computational algorithms for matching acquired MS/MS spectra against sequences of candidate peptides using database search tools, such as MSFragger. Here, we present a new tool, MSBooster, for rescoring peptide-to-spectrum matches using additional features incorporating deep learning-based predictions of peptide properties, such as LC retention time, ion mobility, and MS/MS spectra. We demonstrate the utility of MSBooster, in tandem with MSFragger and Percolator, in several different workflows, including nonspecific searches (immunopeptidomics), direct identification of peptides from data independent acquisition data, single-cell proteomics, and data generated on an ion mobility separation-enabled timsTOF MS platform. MSBooster is fast, robust, and fully integrated into the widely used FragPipe computational platform.
Affiliation(s)
- Kevin L Yang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Fengchao Yu
- Department of Pathology, University of Michigan, Ann Arbor, MI, USA.
| | - Guo Ci Teo
- Department of Pathology, University of Michigan, Ann Arbor, MI, USA
| | - Kai Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Vadim Demichev
- Department of Biochemistry, Charité Universitätsmedizin, Berlin, Germany
- Department of Biochemistry, University of Cambridge, Cambridge, UK
| | - Markus Ralser
- Department of Biochemistry, Charité Universitätsmedizin, Berlin, Germany
- Nuffield Department of Medicine, The Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
- Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Alexey I Nesvizhskii
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
- Department of Pathology, University of Michigan, Ann Arbor, MI, USA.
| |
|
15
|
Abdul-Khalek N, Wimmer R, Overgaard MT, Gregersen Echers S. Insight on physicochemical properties governing peptide MS1 response in HPLC-ESI-MS/MS: A deep learning approach. Comput Struct Biotechnol J 2023; 21:3715-3727. [PMID: 37560124 PMCID: PMC10407266 DOI: 10.1016/j.csbj.2023.07.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2023] [Revised: 07/13/2023] [Accepted: 07/19/2023] [Indexed: 08/11/2023] Open
Abstract
Accurate and absolute quantification of peptides in complex mixtures using quantitative mass spectrometry (MS)-based methods requires foreground knowledge and isotopically labeled standards, thereby increasing analytical expenses, time consumption, and labor, thus limiting the number of peptides that can be accurately quantified. This originates from differential ionization efficiency between peptides and thus, understanding the physicochemical properties that influence the ionization and response in MS analysis is essential for developing less restrictive label-free quantitative methods. Here, we used equimolar peptide pool repository data to develop a deep learning model capable of identifying amino acids influencing the MS1 response. By using an encoder-decoder with an attention mechanism and correlating attention weights with amino acid physicochemical properties, we obtain insight on properties governing the peptide-level MS1 response within the datasets. While the problem cannot be described by one single set of amino acids and properties, distinct patterns were reproducibly obtained. Properties are grouped in three main categories related to peptide hydrophobicity, charge, and structural propensities. Moreover, our model can predict MS1 intensity output under defined conditions based solely on peptide sequence input. Using a refined training dataset, the model predicted log-transformed peptide MS1 intensities with an average error of 9.7 ± 0.5% based on 5-fold cross validation, and outperformed random forest and ridge regression models on both log-transformed and real scale data. This work demonstrates how deep learning can facilitate identification of physicochemical properties influencing peptide MS1 responses, but also illustrates how sequence-based response prediction and label-free peptide-level quantification may impact future workflows within quantitative proteomics.
Affiliation(s)
- Naim Abdul-Khalek
- Department of Chemistry and Bioscience, Aalborg University, Aalborg 9220, Denmark
| | - Reinhard Wimmer
- Department of Chemistry and Bioscience, Aalborg University, Aalborg 9220, Denmark
| | | | | |
|
16
|
Geer LY, Lapin J, Slotta DJ, Mak TD, Stein SE. AIomics: Exploring More of the Proteome Using Mass Spectral Libraries Extended by Artificial Intelligence. J Proteome Res 2023; 22:2246-2255. [PMID: 37232537 PMCID: PMC10542943 DOI: 10.1021/acs.jproteome.2c00807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2023]
Abstract
The unbounded permutations of biological molecules, including proteins and their constituent peptides, present a dilemma in identifying the components of complex biosamples. Sequence search algorithms used to identify peptide spectra can be expanded to cover larger classes of molecules, including more modifications, isoforms, and atypical cleavage, but at the cost of false positives or false negatives due to the simplified spectra they compute from sequence records. Spectral library searching can help solve this issue by precisely matching experimental spectra to library spectra with excellent sensitivity and specificity. However, compiling spectral libraries that span entire proteomes is pragmatically difficult. Neural networks that predict complete spectra containing a full range of annotated and unannotated ions can be used to replace these simplified spectra with libraries of fully predicted spectra, including modified peptides. Using such a network, we created predicted spectral libraries that were used to rescore matches from a sequence search done over a large search space, including a large number of modifications. Rescoring improved the separation of true and false hits by 82%, yielding an 8% increase in peptide identifications, including a 21% increase in nonspecifically cleaved peptides and a 17% increase in phosphopeptides.
Affiliation(s)
- Lewis Y. Geer
- Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
| | - Joel Lapin
- Department of Physics, Georgetown University, Washington, DC 20057, United States
- Associate, Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
| | - Douglas J. Slotta
- Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
| | - Tytus D. Mak
- Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
| | - Stephen E. Stein
- Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
| |
|
17
|
Kirkpatrick J, Stemmer PM, Searle BC, Herring LE, Martin L, Midha MK, Phinney BS, Shan B, Palmblad M, Wang Y, Jagtap PD, Neely BA. 2019 Association of Biomolecular Resource Facilities Multi-Laboratory Data-Independent Acquisition Proteomics Study. J Biomol Tech 2023; 34:3fc1f5fe.9b78d780. [PMID: 37435391 PMCID: PMC10332336 DOI: 10.7171/3fc1f5fe.9b78d780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2023]
Abstract
Despite the advantages of fewer missing values by collecting fragment ion data on all analytes in the sample as well as the potential for deeper coverage, the adoption of data-independent acquisition (DIA) in proteomics core facility settings has been slow. The Association of Biomolecular Resource Facilities conducted a large interlaboratory study to evaluate DIA performance in proteomics laboratories with various instrumentation. Participants were supplied with generic methods and a uniform set of test samples. The resulting 49 DIA datasets act as benchmarks and have utility in education and tool development. The sample set consisted of a tryptic HeLa digest spiked with high or low levels of 4 exogenous proteins. Data are available in MassIVE MSV000086479. Additionally, we demonstrate how the data can be analyzed by focusing on 2 datasets using different library approaches and show the utility of select summary statistics. These data can be used by DIA newcomers, software developers, or DIA experts evaluating performance with different platforms, acquisition settings, and skill levels.
Affiliation(s)
- Joanna Kirkpatrick
- Leibniz Institute on Aging, Fritz Lipmann Institute, 07745 Jena, Germany
- The Francis Crick Institute, London NW1 1AT, United Kingdom
| | | | - Brian C. Searle
- Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio 43210, USA
- Pelotonia Institute for Immuno-Oncology, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio 43210, USA
| | - Laura E. Herring
- UNC Proteomics Core Facility, Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27514, USA
| | | | | | | | - Baozhen Shan
- Bioinformatics Solutions Inc., Waterloo, ON N2L 3K8, Canada
| | - Magnus Palmblad
- Center for Proteomics and Metabolomics, Leiden University Medical Center, 2333 ZC Leiden, The Netherlands
| | - Yan Wang
- National Institute of Dental and Craniofacial Research, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Pratik D. Jagtap
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, Minnesota 55455, USA
| | - Benjamin A. Neely
- National Institute of Standards and Technology, Charleston, South Carolina 29412, USA
| |
|
18
|
Du A, Jia W. New insights into the bioaccessibility and metabolic fates of short-chain bioactive peptides in goat milk using the INFOGEST static digestion model and an improved data acquisition strategy. Food Res Int 2023; 169:112948. [PMID: 37254372 DOI: 10.1016/j.foodres.2023.112948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Revised: 04/14/2023] [Accepted: 05/03/2023] [Indexed: 06/01/2023]
Abstract
The metabolic fates of potentially bioactive short-chain peptides (SCPs; 2 to 4 amino acids) in gastrointestinal digestion have received little attention due to their low concentration and broad suppression during high-resolution mass spectrometry (HRMS) analysis. A tailored workflow integrating mesoporous magnetic solid phase extraction and a novel ion transmission strategy (data-dependent acquisition combined with both an inclusion list and an exclusion list, followed by data-independent acquisition) was used to profile the composition of SCPs during in vitro simulated digestion (LOQ 0.02 to 0.1 μg L⁻¹). A total of 47 dipeptides, 59 tripeptides, and 21 tetrapeptides were identified and quantified from 0.01 to 27.84 mg L⁻¹ (RSD ≤ 9.1%) based on parallel reaction monitoring and an internal standard method. The structural properties of stable SCPs resistant to intestinal digestion were determined by analysis of variance (p < 0.05): a Pro residue at the C-terminal or penultimate position, a slightly greater negative charge at pH 7.0, and fewer C-terminal aliphatic and polar amino acids. The metabolic fates of SCPs varied during digestion, but the content of both total and individual SCPs tended to increase as digestion proceeded. The SCPs were further assessed by a database-driven bioactivity search, which matched a wide variety of bioactivities, predominantly dipeptidyl peptidase (DPP) IV and angiotensin-converting enzyme (ACE) inhibition. This study advances the understanding of the bioaccessibility of food-derived SCPs and provides essential guidelines on the properties of conserved structures in vivo.
Affiliation(s)
- An Du
- School of Food and Biological Engineering, Shaanxi University of Science & Technology, Xi'an 710021, China
| | - Wei Jia
- School of Food and Biological Engineering, Shaanxi University of Science & Technology, Xi'an 710021, China; Shaanxi Research Institute of Agricultural Products Processing Technology, Xi'an 710021, China.
| |
|
19
|
Affiliation(s)
- Bruna Gomes
- From the Departments of Medicine, Genetics, and Biomedical Data Science, Stanford University, Stanford, CA (B.G., E.A.A.); and the Department of Cardiology, Pneumology, and Angiology, Heidelberg University Hospital, Heidelberg, Germany (B.G.)
| | - Euan A Ashley
- From the Departments of Medicine, Genetics, and Biomedical Data Science, Stanford University, Stanford, CA (B.G., E.A.A.); and the Department of Cardiology, Pneumology, and Angiology, Heidelberg University Hospital, Heidelberg, Germany (B.G.)
| |
|
20
|
Cox J. Prediction of peptide mass spectral libraries with machine learning. Nat Biotechnol 2023; 41:33-43. [PMID: 36008611 DOI: 10.1038/s41587-022-01424-w] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 03/01/2022] [Accepted: 07/11/2022] [Indexed: 01/21/2023]
Abstract
The recent development of machine learning methods to identify peptides in complex mass spectrometric data constitutes a major breakthrough in proteomics. Longstanding methods for peptide identification, such as search engines and experimental spectral libraries, are being superseded by deep learning models that allow the fragmentation spectra of peptides to be predicted from their amino acid sequence. These new approaches, including recurrent neural networks and convolutional neural networks, use predicted in silico spectral libraries rather than experimental libraries to achieve higher sensitivity and/or specificity in the analysis of proteomics data. Machine learning is galvanizing applications that involve large search spaces, such as immunopeptidomics and proteogenomics. Current challenges in the field include the prediction of spectra for peptides with post-translational modifications and for cross-linked pairs of peptides. Permeation of machine-learning-based spectral prediction into search engines and spectrum-centric data-independent acquisition workflows for diverse peptide classes and measurement conditions will continue to push sensitivity and dynamic range in proteomics applications in the coming years.
Affiliation(s)
- Jürgen Cox
- Computational Systems Biochemistry Research Group, Max-Planck Institute of Biochemistry, Martinsried, Germany.
- Department of Biological and Medical Psychology, University of Bergen, Bergen, Norway.
| |
|
21
|
McDonnell K, Howley E, Abram F. Critical evaluation of the use of artificial data for machine learning based de novo peptide identification. Comput Struct Biotechnol J 2023; 21:2732-2743. [PMID: 37168871 PMCID: PMC10165132 DOI: 10.1016/j.csbj.2023.04.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Revised: 04/16/2023] [Accepted: 04/16/2023] [Indexed: 05/13/2023] Open
Abstract
Proteins are essential components of all living cells and so the study of their in situ expression, proteomics, has wide reaching applications. Peptide identification in proteomics typically relies on matching high resolution tandem mass spectra to a protein database but can also be performed de novo. While artificial spectra have been successfully incorporated into database search pipelines to increase peptide identification rates, little work has been done to investigate the utility of artificial spectra in the context of de novo peptide identification. Here, we perform a critical analysis of the use of artificial data for the training and evaluation of de novo peptide identification algorithms. First, we classify the different fragment ion types present in real spectra and then estimate the number of spurious matches using random peptides. We then categorise the different types of noise present in real spectra. Finally, we transfer this knowledge to artificial data and test the performance of a state-of-the-art de novo peptide identification algorithm trained using artificial spectra with and without relevant noise addition. Noise supplementation increased artificial training data performance from 30% to 77% of real training data peptide recall. While real data performance was not fully replicated, this work provides the first steps towards an artificial spectrum framework for the training and evaluation of de novo peptide identification algorithms. Further enhanced artificial spectra may allow for more in depth analysis of de novo algorithms as well as alleviating the reliance on database searches for training data.
Affiliation(s)
- Kevin McDonnell
- Functional Environmental Microbiology, School of Natural Sciences, Ryan Institute, University of Galway, Ireland
- School of Computer Science, University of Galway, Ireland
- Corresponding author at: Functional Environmental Microbiology, School of Natural Sciences, Ryan Institute, University of Galway, Ireland.
| | - Enda Howley
- School of Computer Science, University of Galway, Ireland
| | - Florence Abram
- Functional Environmental Microbiology, School of Natural Sciences, Ryan Institute, University of Galway, Ireland
- Corresponding author.
| |
|
22
|
Rehfeldt TG, Krawczyk K, Echers SG, Marcatili P, Palczynski P, Röttger R, Schwämmle V. Variability analysis of LC-MS experimental factors and their impact on machine learning. Gigascience 2022; 12:giad096. [PMID: 37983748 PMCID: PMC10659119 DOI: 10.1093/gigascience/giad096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2023] [Revised: 08/23/2023] [Accepted: 10/11/2023] [Indexed: 11/22/2023] Open
Abstract
BACKGROUND Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data-processing pipeline from raw data analysis to end-user predictions and rescoring. ML models need large-scale datasets for training and repurposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs. RESULTS We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variability in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning to evaluate the benefits of current best practice methods in the field for transfer learning. CONCLUSIONS Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it is important to construct datasets most closely resembling future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it did increase model performance, did not increase model performance compared to a non-pretrained model.
Affiliation(s)
- Tobias Greisager Rehfeldt
- Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
| | - Konrad Krawczyk
- Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
| | | | - Paolo Marcatili
- Department of Health Technology, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
| | - Pawel Palczynski
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
| | - Richard Röttger
- Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
| | - Veit Schwämmle
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
| |
|
23
|
Tsimenidis S, Vrochidou E, Papakostas GA. Omics Data and Data Representations for Deep Learning-Based Predictive Modeling. Int J Mol Sci 2022; 23:12272. [PMID: 36293133 PMCID: PMC9603455 DOI: 10.3390/ijms232012272] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/23/2022] [Revised: 10/03/2022] [Accepted: 10/12/2022] [Indexed: 11/25/2022] Open
Abstract
Medical discoveries mainly depend on the capability to process and analyze biological datasets, which inundate the scientific community and are still expanding as the cost of next-generation sequencing technologies decreases. Deep learning (DL) is a viable method to exploit this massive data stream, since it has advanced quickly through successive innovations. However, an obstacle to scientific progress emerges: applying DL to biology is difficult because both fields are evolving at a breakneck pace, making it hard for an individual to stay at the front lines of both. This paper aims to bridge the gap and help computer scientists bring their valuable expertise into the life sciences. This work provides an overview of the most common types of biological data and data representations that are used to train DL models, with additional information on the models themselves and the various tasks that are being tackled. This is the essential information a DL expert with no background in biology needs in order to participate in DL-based research projects in biomedicine, biotechnology, and drug discovery. Alternatively, this study could also be useful to researchers in biology to understand and utilize the power of DL to gain better insights into and extract important information from the omics data.
Affiliation(s)
| | | | - George A. Papakostas
- MLV Research Group, Department of Computer Science, International Hellenic University, 65404 Kavala, Greece
| |
|
24
|
Yang Y, Qiao L. Data-independent acquisition proteomics methods for analyzing post-translational modifications. Proteomics 2022; 23:e2200046. [PMID: 36036492 DOI: 10.1002/pmic.202200046] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/22/2022] [Revised: 08/20/2022] [Accepted: 08/23/2022] [Indexed: 11/06/2022]
Abstract
Protein post-translational modifications (PTMs) increase the functional diversity of the cellular proteome. Accurate and high throughput identification and quantification of protein PTMs is a key task in proteomics research. Recent advancements in data-independent acquisition (DIA) mass spectrometry (MS) technology have achieved deep coverage and accurate quantification of proteins and PTMs. This review provides an overview of DIA data processing methods that cover three aspects of PTMs analysis, i.e., detection of PTMs, site localization, and characterization of complex modification moieties, such as glycosylation. In addition, a survey of deep learning methods that boost DIA-based PTMs analysis is presented, including in silico spectral library generation, as well as feature scoring and error rate control. The limitations and future directions of DIA methods for PTMs analysis are also discussed. Novel data analysis methods will take advantage of advanced MS instrumentation techniques to empower DIA MS for in-depth and accurate PTMs measurements.
Affiliation(s)
- Yi Yang
- Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
| | - Liang Qiao
- Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
| |
|
25
|
Boiko DA, Kozlov KS, Burykina JV, Ilyushenkova VV, Ananikov VP. Fully Automated Unconstrained Analysis of High-Resolution Mass Spectrometry Data with Machine Learning. J Am Chem Soc 2022; 144:14590-14606. [PMID: 35939718 DOI: 10.1021/jacs.2c03631] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Indexed: 12/24/2022]
Abstract
Mass spectrometry (MS) is a convenient, highly sensitive, and reliable method for the analysis of complex mixtures, which is vital for materials science, life sciences fields such as metabolomics and proteomics, and mechanistic research in chemistry. Although it is one of the most powerful methods for individual compound detection, complete signal assignment in complex mixtures is still a great challenge. The unconstrained formula-generating algorithm, covering the entire spectra and revealing components, is a "dream tool" for researchers. We present the framework for efficient MS data interpretation, describing a novel approach for detailed analysis based on deisotoping performed by gradient-boosted decision trees and a neural network that generates molecular formulas from the fine isotopic structure, approaching the long-standing inverse spectral problem. The methods were successfully tested on three examples: fragment ion analysis in protein sequencing for proteomics, analysis of the natural samples for life sciences, and study of the cross-coupling catalytic system for chemistry.
Affiliation(s)
- Daniil A Boiko
- Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky Prospekt 47, Moscow 119991, Russia
| | - Konstantin S Kozlov
- Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky Prospekt 47, Moscow 119991, Russia
| | - Julia V Burykina
- Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky Prospekt 47, Moscow 119991, Russia
| | - Valentina V Ilyushenkova
- Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky Prospekt 47, Moscow 119991, Russia
| | - Valentine P Ananikov
- Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky Prospekt 47, Moscow 119991, Russia
| |
|
26
|
Zhang R, Peng W, Huang Y, Gautam S, Wang J, Mechref Y, Tang H. A Reciprocal Best-hit Approach to Characterize Isomeric N-Glycans Using Tandem Mass Spectrometry. Anal Chem 2022; 94:10003-10010. [PMID: 35776110 DOI: 10.1021/acs.analchem.2c00229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
Abstract
Glycosylation is a post-translational modification involved in many important biological functions. Aberrant alteration of glycan structure is implicated in the malfunction of cells and has potential significance in the medical diagnosis of complex diseases such as cancer. Liquid chromatography tandem mass spectrometry (LC-MS/MS) has been commonly applied to the analysis of complex glycomic samples. However, the characterization of isomeric glycans from their MS/MS spectra in complex biological samples remains challenging. In this paper, we present a novel reciprocal best-hit glycan-spectrum matching (RB-GSM) approach toward characterizing N-glycans. In this method, the MS/MS spectra in the input data set are evaluated against all glycans with the matched precursor mass using customized scoring functions, where a glycan-spectrum match (GSM) is considered to be true if it is a reciprocal best hit, that is, it receives the highest score among not only the GSMs between the respective spectrum and all matched glycans, but also the GSMs between the respective glycan and all matched MS/MS spectra in the input data set. We evaluated this RB-GSM approach on N-glycan identification using MS/MS spectra acquired from glycan standards as well as those released from the model glycoprotein fetuin, immunoglobulin G, and human serum samples, which showed that RB-GSM is capable of distinguishing isomeric glycans.
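The reciprocal best-hit criterion described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `reciprocal_best_hits`, the dictionary layout, and the toy scores are hypothetical, and the customized scoring functions of RB-GSM are assumed to have been computed already.

```python
from typing import Dict, Set, Tuple


def reciprocal_best_hits(scores: Dict[Tuple[str, str], float]) -> Set[Tuple[str, str]]:
    """Return (spectrum, glycan) pairs that are the top score for both sides.

    `scores` maps a (spectrum_id, glycan_id) candidate pair to its score;
    only pairs with a matching precursor mass are assumed to be present.
    """
    best_for_spectrum: Dict[str, Tuple[str, float]] = {}
    best_for_glycan: Dict[str, Tuple[str, float]] = {}
    for (spectrum, glycan), score in scores.items():
        # Track the best-scoring glycan for each spectrum.
        if spectrum not in best_for_spectrum or score > best_for_spectrum[spectrum][1]:
            best_for_spectrum[spectrum] = (glycan, score)
        # Track the best-scoring spectrum for each glycan.
        if glycan not in best_for_glycan or score > best_for_glycan[glycan][1]:
            best_for_glycan[glycan] = (spectrum, score)
    # Keep a pair only if each side is the other's best hit.
    return {
        (spectrum, glycan)
        for spectrum, (glycan, _) in best_for_spectrum.items()
        if best_for_glycan[glycan][0] == spectrum
    }


# Toy scores (hypothetical): both spectra prefer glycan gA,
# but gA's best spectrum is s1, so only (s1, gA) is accepted.
toy_scores = {
    ("s1", "gA"): 0.9, ("s1", "gB"): 0.4,
    ("s2", "gA"): 0.7, ("s2", "gB"): 0.6,
}
hits = reciprocal_best_hits(toy_scores)
```

In this toy case, spectrum s2 is left unassigned rather than being given its second choice, which is the conservative behavior the reciprocal criterion implies.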
Affiliation(s)
- Rui Zhang
- Department of Computer Science, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington 47408, Indiana, United States
| | - Wenjing Peng
- Department of Chemistry and Biochemistry, Texas Tech University, Lubbock 79409, Texas, United States
| | - Yifan Huang
- Department of Chemistry and Biochemistry, Texas Tech University, Lubbock 79409, Texas, United States
| | - Sakshi Gautam
- Department of Chemistry and Biochemistry, Texas Tech University, Lubbock 79409, Texas, United States
| | - Junyao Wang
- Department of Chemistry and Biochemistry, Texas Tech University, Lubbock 79409, Texas, United States
| | - Yehia Mechref
- Department of Chemistry and Biochemistry, Texas Tech University, Lubbock 79409, Texas, United States
| | - Haixu Tang
- Department of Computer Science, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington 47408, Indiana, United States
| |
|
27
|
Na S, Choi H, Paek E. Deephos: Predicted spectral database search for TMT-labeled phosphopeptides and its false discovery rate estimation. Bioinformatics 2022; 38:2980-2987. [PMID: 35441674 DOI: 10.1093/bioinformatics/btac280] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/12/2021] [Revised: 03/26/2022] [Accepted: 04/14/2022] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Tandem mass tag (TMT)-based tandem mass spectrometry (MS/MS) has become the method of choice for the quantification of post-translational modifications in complex mixtures. Many cancer proteogenomic studies have highlighted the importance of large-scale phosphopeptide quantification coupled with TMT labeling. Herein, we propose a predicted Spectral DataBase (pSDB) search strategy called Deephos that can improve both sensitivity and specificity in identifying MS/MS spectra of TMT-labeled phosphopeptides. RESULTS With deep learning-based fragment ion prediction, we compiled a pSDB of TMT-labeled phosphopeptides generated from ∼8,000 human phosphoproteins annotated in UniProt. Deep learning could successfully recognize the fragmentation patterns altered by both TMT labeling and phosphorylation. In addition, we discuss the decoy spectra for false discovery rate (FDR) estimation in the pSDB search. We show that FDR could be inaccurately estimated by the existing decoy spectra generation methods and propose an innovative method to generate decoy spectra for more accurate FDR estimation. The utilities of Deephos were demonstrated in multi-stage analyses (coupled with database searches) of glioblastoma, acute myeloid leukemia, and breast cancer phosphoproteomes. AVAILABILITY Deephos pSDB and the search software are available at https://github.com/seungjinna/deephos.
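For readers unfamiliar with decoy-based FDR estimation, the following Python sketch shows the generic target-decoy estimate (decoy matches divided by target matches above a score threshold). It is not the decoy generation method proposed in Deephos; it assumes target and decoy libraries of equal size, and the function name `estimate_fdr` and the toy matches are hypothetical.

```python
from typing import List, Tuple


def estimate_fdr(matches: List[Tuple[float, bool]], threshold: float) -> float:
    """Estimate FDR at a score threshold from (score, is_decoy) matches.

    Assumes equal-sized target and decoy libraries, so the number of
    accepted decoys approximates the number of false target matches.
    """
    targets = sum(1 for score, is_decoy in matches if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # No accepted target matches, so nothing to report.
    return decoys / targets


# Toy matches (hypothetical): (score, is_decoy)
toy_matches = [
    (0.95, False), (0.90, False), (0.85, True),
    (0.80, False), (0.70, False), (0.60, True),
]
fdr = estimate_fdr(toy_matches, threshold=0.75)  # 1 decoy / 3 targets
```

The estimate is only as accurate as the decoys are representative of false matches, which is the concern the abstract raises for predicted spectral databases.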
Affiliation(s)
- Seungjin Na
- Institute for Artificial Intelligence Research, Hanyang University, Seoul, 04763, Republic of Korea
| | - Hyunjin Choi
- Department of Automotive Engineering, Hanyang University, Seoul, 04763, Republic of Korea
| | - Eunok Paek
- Institute for Artificial Intelligence Research, Hanyang University, Seoul, 04763, Republic of Korea.,Department of Computer Science, Hanyang University, Seoul, 04763, Republic of Korea
| |
|
28
|
Urban J. A review on recent trends in the phosphoproteomics workflow. From sample preparation to data analysis. Anal Chim Acta 2022; 1199:338857. [PMID: 35227377 DOI: 10.1016/j.aca.2021.338857] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 03/07/2021] [Revised: 07/14/2021] [Accepted: 07/15/2021] [Indexed: 12/12/2022]
|
29
|
Dickinson Q, Meyer JG. Positional SHAP (PoSHAP) for Interpretation of machine learning models trained from biological sequences. PLoS Comput Biol 2022; 18:e1009736. [PMID: 35089914 PMCID: PMC8797255 DOI: 10.1371/journal.pcbi.1009736] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 09/16/2021] [Accepted: 12/09/2021] [Indexed: 11/29/2022] Open
Abstract
Machine learning with multi-layered artificial neural networks, also known as "deep learning," is effective for making biological predictions. However, model interpretation is challenging, especially for sequential input data used with recurrent neural network architectures. Here, we introduce a framework called "Positional SHAP" (PoSHAP) to interpret models trained from biological sequences by utilizing SHapley Additive exPlanations (SHAP) to generate positional model interpretations. We demonstrate this using three long short-term memory (LSTM) regression models that predict peptide properties, including binding affinity to major histocompatibility complexes (MHC), and collisional cross section (CCS) measured by ion mobility spectrometry. Interpretation of these models with PoSHAP reproduced MHC class I (rhesus macaque Mamu-A1*001 and human A*11:01) peptide binding motifs, reflected known properties of peptide CCS, and provided new insights into interpositional dependencies of amino acid interactions. PoSHAP should have widespread utility for interpreting a variety of models trained from biological sequences.
Affiliation(s)
- Quinn Dickinson
- Department of Biochemistry, Medical College of Wisconsin, Milwaukee, Wisconsin
| | - Jesse G. Meyer
- Department of Biochemistry, Medical College of Wisconsin, Milwaukee, Wisconsin
| |
|
30
|
Iannetta AA, Hicks LM. Maximizing Depth of PTM Coverage: Generating Robust MS Datasets for Computational Prediction Modeling. Methods Mol Biol 2022; 2499:1-41. [PMID: 35696073 DOI: 10.1007/978-1-0716-2317-6_1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/15/2023]
Abstract
Post-translational modifications (PTMs) regulate complex biological processes through the modulation of protein activity, stability, and localization. Insights into the specific modification type and localization within a protein sequence can help ascertain functional significance. Computational models are increasingly demonstrated to offer a low-cost, high-throughput method for comprehensive PTM predictions. Algorithms are optimized using existing experimental PTM data, thus accurate prediction performance relies on the creation of robust datasets. Herein, advancements in mass spectrometry-based proteomics technologies to maximize PTM coverage are reviewed. Further, requisite experimental validation approaches for PTM predictions are explored to ensure that follow-up mechanistic studies are focused on accurate modification sites.
Affiliation(s)
- Anthony A Iannetta
- Department of Chemistry, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Leslie M Hicks
- Department of Chemistry, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
| |
|
31
|
Mann M, Kumar C, Zeng WF, Strauss MT. Artificial intelligence for proteomics and biomarker discovery. Cell Syst 2021; 12:759-770. [PMID: 34411543 DOI: 10.1016/j.cels.2021.06.006] [Citation(s) in RCA: 138] [Impact Index Per Article: 34.5] [Received: 03/04/2021] [Revised: 05/07/2021] [Accepted: 06/28/2021] [Indexed: 12/14/2022]
Abstract
There is an avalanche of biomedical data generation and a parallel expansion in computational capabilities to analyze and make sense of these data. Starting with genome sequencing and widely employed deep sequencing technologies, these trends have now taken hold in all omics disciplines and increasingly call for multi-omics integration as well as data interpretation by artificial intelligence technologies. Here, we focus on mass spectrometry (MS)-based proteomics and describe how machine learning and, in particular, deep learning now predicts experimental peptide measurements from amino acid sequences alone. This will dramatically improve the quality and reliability of analytical workflows because experimental results should agree with predictions in a multi-dimensional data landscape. Machine learning has also become central to biomarker discovery from proteomics data, which now starts to outperform existing best-in-class assays. Finally, we discuss model transparency and explainability and data privacy that are required to deploy MS-based biomarkers in clinical settings.
Affiliation(s)
- Matthias Mann
- Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany.
| | - Chanchal Kumar
- Translational Science & Experimental Medicine, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden.
| | - Wen-Feng Zeng
- Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany.
| | | |
|
32
|
Haseeb M, Saeed F. High Performance Computing Framework for Tera-Scale Database Search of Mass Spectrometry Data. Nat Comput Sci 2021; 1:550-561. [PMID: 34723198 PMCID: PMC8554525 DOI: 10.1038/s43588-021-00113-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2020] [Accepted: 07/16/2021] [Indexed: 05/09/2023]
Abstract
Database peptide search algorithms deduce peptides from mass spectrometry (MS) data. There has been substantial effort in improving their computational efficiency to achieve larger and more complex systems biology studies. However, modern serial and high-performance computing (HPC) algorithms exhibit sub-optimal performance mainly due to their ineffective parallel designs (low resource utilization), and high overhead costs. We present an HPC framework, called HiCOPS, for efficient acceleration of the database peptide search algorithms on distributed-memory supercomputers. HiCOPS provides, on average, more than 10-fold improvement in speed, and superior parallel performance over several existing HPC database search software. We also formulate a mathematical model for performance analysis and optimization, and report near-optimal results for several key metrics including strong-scale efficiency, hardware utilization, load-balance, inter-process communication and I/O overheads. The core parallel design, techniques, and optimizations presented in HiCOPS are search-algorithm independent and can be extended to efficiently accelerate the existing and future algorithms and software.
Affiliation(s)
- Muhammad Haseeb
- Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, FL, USA
| | - Fahad Saeed
- Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, FL, USA
- Biomolecular Sciences Institute (BSI), Florida International University, Miami, FL, USA
- Department of Human and Molecular Genetics, Herbert Wertheim School of Medicine, Florida International University, Miami, FL, USA
| |
|
33
|
Lu YY, Bilmes J, Rodriguez-Mias RA, Villén J, Noble WS. DIAmeter: matching peptides to data-independent acquisition mass spectrometry data. Bioinformatics 2021; 37:i434-i442. [PMID: 34252924 PMCID: PMC8686675 DOI: 10.1093/bioinformatics/btab284] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 01/12/2023] Open
Abstract
MOTIVATION Tandem mass spectrometry data acquired using data independent acquisition (DIA) is challenging to interpret because the data exhibits complex structure along both the mass-to-charge (m/z) and time axes. The most common approach to analyzing this type of data makes use of a library of previously observed DIA data patterns (a 'spectral library'), but this approach is expensive because the libraries do not typically generalize well across laboratories. RESULTS Here, we propose DIAmeter, a search engine that detects peptides in DIA data using only a peptide sequence database. Although some existing library-free DIA analysis methods (i) support data generated using both wide and narrow isolation windows, (ii) detect peptides containing post-translational modifications, (iii) analyze data from a variety of instrument platforms and (iv) are capable of detecting peptides even in the absence of detectable signal in the survey (MS1) scan, DIAmeter is the only method that offers all four capabilities in a single tool. AVAILABILITY AND IMPLEMENTATION The open source, Apache licensed source code is available as part of the Crux mass spectrometry analysis toolkit (http://crux.ms). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yang Young Lu
- Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
| | - Jeff Bilmes
- Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA.,Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
| | | | - Judit Villén
- Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
| | - William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA.,Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
| |
|
34
|
Abstract
Mass-spectrometry-based proteomics enables quantitative analysis of thousands of human proteins. However, experimental and computational challenges restrict progress in the field. This review summarizes the recent flurry of machine-learning strategies using artificial deep neural networks (or "deep learning") that have started to break barriers and accelerate progress in the field of shotgun proteomics. Deep learning now accurately predicts physicochemical properties of peptides from their sequence, including tandem mass spectra and retention time. Furthermore, deep learning methods exist for nearly every aspect of the modern proteomics workflow, enabling improved feature selection, peptide identification, and protein inference.
Affiliation(s)
- Jesse G. Meyer
- Department of Biochemistry, Medical College of Wisconsin, Milwaukee, WI 53226, USA
| |
|
35
|
Tarn C, Zeng WF. pDeep3: Toward More Accurate Spectrum Prediction with Fast Few-Shot Learning. Anal Chem 2021; 93:5815-5822. [PMID: 33797898 DOI: 10.1021/acs.analchem.0c05427] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 12/31/2022]
Abstract
Spectrum prediction using deep learning has attracted a lot of attention in recent years. Although existing deep learning methods have dramatically increased the prediction accuracy, there is still considerable space for improvement, which is presently limited by the difference of fragmentation types or instrument settings. In this work, we use the few-shot learning method to fit the data online to make up for the shortcoming. The method is evaluated using ten data sets, where the instruments include Velos, QE, Lumos, and Sciex, with collision energies being differently set. Experimental results show that few-shot learning can achieve higher prediction accuracy with almost negligible computing resources. For example, on the data set from an untrained instrument, Sciex-6600, within about 10 s, the prediction accuracy is increased from 69.7% to 86.4%; on the CID (collision-induced dissociation) data set, the prediction accuracy of the model trained by HCD (higher energy collision dissociation) spectra is increased from 48.0% to 83.9%. It is also shown that the method is not critical to data quality and is sufficiently efficient to fill the accuracy gap. The source code of pDeep3 is available at http://pfind.ict.ac.cn/software/pdeep3.
Affiliation(s)
- Ching Tarn
- Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, 100190, Beijing, China.,University of Chinese Academy of Sciences, 100049, Beijing, China
| | - Wen-Feng Zeng
- Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, 100190, Beijing, China.,University of Chinese Academy of Sciences, 100049, Beijing, China
| |
|
36
|
Chen ZL, Mao PZ, Zeng WF, Chi H, He SM. pDeepXL: MS/MS Spectrum Prediction for Cross-Linked Peptide Pairs by Deep Learning. J Proteome Res 2021; 20:2570-2582. [PMID: 33821641 DOI: 10.1021/acs.jproteome.0c01004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Abstract
In cross-linking mass spectrometry, the identification of cross-linked peptide pairs heavily relies on the ability of a database search engine to measure the similarities between experimental and theoretical MS/MS spectra. However, the lack of accurate ion intensities in theoretical spectra impairs the performance of search engines, in particular, on proteome scales. Here we introduce pDeepXL, a deep neural network to predict MS/MS spectra of cross-linked peptide pairs. To train pDeepXL, we used the transfer-learning technique because it facilitated the training with limited benchmark data of cross-linked peptide pairs. Test results on more than ten data sets showed that pDeepXL accurately predicted the spectra of both noncleavable DSS/BS3/Leiker cross-linked peptide pairs (>80% of predicted spectra have Pearson's r values higher than 0.9) and cleavable DSSO/DSBU cross-linked peptide pairs (>75% of predicted spectra have Pearson's r values higher than 0.9). pDeepXL also achieved accurate prediction on unseen data sets using an online fine-tuning technique. Lastly, integrating pDeepXL into a database search engine increased the number of identified cross-link spectra by 18% on average.
Affiliation(s)
- Zhen-Lin Chen
- Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China.,University of Chinese Academy of Sciences, Beijing 100049, China
| | - Peng-Zhi Mao
- Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China.,University of Chinese Academy of Sciences, Beijing 100049, China
| | - Wen-Feng Zeng
- Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China.,University of Chinese Academy of Sciences, Beijing 100049, China
| | - Hao Chi
- Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China.,University of Chinese Academy of Sciences, Beijing 100049, China
| | - Si-Min He
- Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China.,University of Chinese Academy of Sciences, Beijing 100049, China
| |
|
37
|
Yang M, Zhu Z, Zhuang Z, Bai Y, Wang S, Ge F. Proteogenomic Characterization of the Pathogenic Fungus Aspergillus flavus Reveals Novel Genes Involved in Aflatoxin Production. Mol Cell Proteomics 2020; 20:100013. [PMID: 33568340 PMCID: PMC7950108 DOI: 10.1074/mcp.ra120.002144] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/24/2020] [Revised: 10/06/2020] [Accepted: 11/24/2020] [Indexed: 12/20/2022] Open
Abstract
Aspergillus flavus (A. flavus), a pathogenic fungus, can produce carcinogenic and toxic aflatoxins that are a serious agricultural and medical threat worldwide. Attempts to decipher the aflatoxin biosynthetic pathway have been hampered by the lack of a high-quality genome annotation for A. flavus. To address this gap, we performed a comprehensive proteogenomic analysis using high-accuracy mass spectrometry data for this pathogen. The resulting high-quality data set confirmed the translation of 8724 previously predicted genes and identified 732 novel proteins, 269 splice variants, 447 single amino acid variants, and 188 revised genes. A subset of novel proteins was experimentally validated by RT-PCR and synthetic peptides. Further functional annotation suggested that a number of the identified novel proteins may play roles in aflatoxin biosynthesis and stress responses in A. flavus. This comprehensive strategy also identified a wide range of posttranslational modifications (PTMs), including 3461 modification sites from 1765 proteins. Functional analysis suggested the involvement of these modified proteins in the regulation of cellular metabolic and aflatoxin biosynthetic pathways. Together, we provided a high-quality annotation of the A. flavus genome and revealed novel insights into the mechanisms of aflatoxin production and pathogenicity in this pathogen.
Affiliation(s)
- Mingkun Yang
- School of Life Sciences, and Key Laboratory of Pathogenic Fungi and Mycotoxins of Fujian Province, Fujian Agriculture and Forestry University, Fuzhou, China; State Key Laboratory of Freshwater Ecology and Biotechnology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
| | - Zhuo Zhu
- School of Life Sciences, and Key Laboratory of Pathogenic Fungi and Mycotoxins of Fujian Province, Fujian Agriculture and Forestry University, Fuzhou, China
| | - Zhenhong Zhuang
- School of Life Sciences, and Key Laboratory of Pathogenic Fungi and Mycotoxins of Fujian Province, Fujian Agriculture and Forestry University, Fuzhou, China
| | - Youhuang Bai
- School of Life Sciences, and Key Laboratory of Pathogenic Fungi and Mycotoxins of Fujian Province, Fujian Agriculture and Forestry University, Fuzhou, China
| | - Shihua Wang
- School of Life Sciences, and Key Laboratory of Pathogenic Fungi and Mycotoxins of Fujian Province, Fujian Agriculture and Forestry University, Fuzhou, China.
| | - Feng Ge
- State Key Laboratory of Freshwater Ecology and Biotechnology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China.
| |
|
38
|
Wang L, Liu K, Li S, Tang H. A Fast and Memory-Efficient Spectral Library Search Algorithm Using Locality-Sensitive Hashing. Proteomics 2020; 20:e2000002. [PMID: 32415809 PMCID: PMC7669687 DOI: 10.1002/pmic.202000002] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/01/2020] [Revised: 04/17/2020] [Indexed: 01/07/2023]
Abstract
With the accumulation of MS/MS spectra collected in spectral libraries, the spectral library searching approach emerges as an important approach for peptide identification in proteomics, complementary to the commonly used protein database searching approach, in particular for the proteomic analyses of well-studied model organisms, such as human. Existing spectral library searching algorithms compare a query MS/MS spectrum with each spectrum in the library with matched precursor mass and charge state, which may become computationally intensive with the rapidly growing library size. Here, the software msSLASH, which implements a fast spectral library searching algorithm based on the Locality-Sensitive Hashing (LSH) technique, is presented. The algorithm first converts the library and query spectra into bit-strings using LSH functions, and then computes the similarity between the spectra with highly similar bit-strings. Using the spectral library searching of large real-world MS/MS spectra datasets, it is demonstrated that the algorithm significantly reduced the number of spectral comparisons, and as a result, achieved a 2-9X speedup in comparison with the existing spectral library searching algorithm SpectraST. The spectral searching algorithm is implemented in C/C++, and is ready to be used in proteomic data analyses.
Affiliation(s)
- Lei Wang
- School of Informatics and Computing, Indiana University, Bloomington, IN, 47405, USA
| | - Kaiyuan Liu
- School of Informatics and Computing, Indiana University, Bloomington, IN, 47405, USA
| | - Sujun Li
- School of Informatics and Computing, Indiana University, Bloomington, IN, 47405, USA
| | - Haixu Tang
- School of Informatics and Computing, Indiana University, Bloomington, IN, 47405, USA
| |
|
39
|
Wen B, Zeng W, Liao Y, Shi Z, Savage SR, Jiang W, Zhang B. Deep Learning in Proteomics. Proteomics 2020; 20:e1900335. [PMID: 32939979 PMCID: PMC7757195 DOI: 10.1002/pmic.201900335] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Received: 05/27/2020] [Revised: 09/14/2020] [Indexed: 12/17/2022]
Abstract
Proteomics, the study of all the proteins in biological systems, is becoming a data-rich science. Protein sequences and structures are comprehensively catalogued in online databases. With recent advancements in tandem mass spectrometry (MS) technology, protein expression and post-translational modifications (PTMs) can be studied in a variety of biological systems at the global scale. Sophisticated computational algorithms are needed to translate the vast amount of data into novel biological insights. Deep learning automatically extracts data representations at high levels of abstraction from data, and it thrives in data-rich scientific research domains. Here, a comprehensive overview of deep learning applications in proteomics, including retention time prediction, MS/MS spectrum prediction, de novo peptide sequencing, PTM prediction, major histocompatibility complex-peptide binding prediction, and protein structure prediction, is provided. Limitations and the future directions of deep learning in proteomics are also discussed. This review will provide readers an overview of deep learning and how it can be used to analyze proteomics data.
Affiliation(s)
- Bo Wen
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Wen‐Feng Zeng
- Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Chinese Academy of Sciences, Institute of Computing Technology, Beijing 100190, China
| | - Yuxing Liao
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Zhiao Shi
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Sara R. Savage
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Wen Jiang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Bing Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| |
|