1
Marchetto A, Tirapelle M, Mazzei L, Sorensen E, Besenhard MO. In Silico High-Performance Liquid Chromatography Method Development via Machine Learning. Anal Chem 2025; 97:6991-7001. [PMID: 40152207 PMCID: PMC11983366 DOI: 10.1021/acs.analchem.4c03466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2024] [Revised: 03/11/2025] [Accepted: 03/13/2025] [Indexed: 03/29/2025]
Abstract
High-performance liquid chromatography (HPLC) remains the gold standard for analyzing and purifying molecular components in solutions. However, developing HPLC methods is material- and time-consuming, so computer-aided shortcuts are highly desirable. In line with the digitalization of process development and the growth of HPLC databases, we propose a data-driven methodology to predict molecule retention factors as a function of mobile phase composition without the need for any new experiments, solely relying on molecular descriptors (MDs) obtained via simplified molecular input line entry system (SMILES) string representations of molecules. This new approach combines: (a) quantitative structure-property relationships (QSPR) using MDs to predict solute-dependent parameters in (b) linear solvation energy relationships (LSER) and (c) linear solvent strength (LSS) theory. We demonstrate the potential of this computational methodology using experimental data for retention factors of small molecules made available by the research community for which the MDs were obtained via SMILES string representations determined by the structural formulas of the molecules. This method can be adopted directly to predict elution times of molecular components; however, in combination with first-principle-based mechanistic transport models, the method can also be employed to optimize HPLC methods in-silico. Both options can reduce the experimental load and accelerate HPLC method development significantly, lowering the time and cost of the drug manufacturing cycle and reducing the time to market. Given the growing number and quality of HPLC databases, the predictive power of this methodology will only increase in the coming years.
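The abstract above names linear solvent strength (LSS) theory without stating its equation. As a minimal sketch, assuming the standard isocratic LSS form log10 k = log10 k_w - S * phi (k: retention factor, phi: organic modifier volume fraction), the dependence of retention on mobile phase composition could be illustrated as follows. The function names and all numerical values are hypothetical; in the paper the solute-dependent parameters are predicted from molecular descriptors rather than hard-coded.

```python
def retention_factor(log_kw: float, s: float, phi: float) -> float:
    """Isocratic LSS relation: log10(k) = log10(k_w) - S * phi.

    log_kw : log10 of the retention factor extrapolated to pure water
    s      : solvent strength parameter of the solute (illustrative value)
    phi    : volume fraction of organic modifier, between 0 and 1
    """
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must lie in [0, 1]")
    return 10.0 ** (log_kw - s * phi)


def retention_time(k: float, t0: float) -> float:
    """Retention time from the definition k = (t_R - t0) / t0."""
    return t0 * (1.0 + k)


# Hypothetical solute: log10(k_w) = 2.0, S = 4.0, column dead time 1.5 min.
for phi in (0.3, 0.5, 0.7):
    k = retention_factor(2.0, 4.0, phi)
    print(f"phi={phi:.1f}  k={k:.3f}  t_R={retention_time(k, 1.5):.2f} min")
```

Sweeping phi in this way is the simplest form of in silico screening; optimizing a full method would additionally need the transport models the abstract mentions.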
Affiliation(s)
- Alberto Marchetto
  - Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, U.K.
  - Department of Management, Economics and Industrial Engineering, Politecnico di Milano, Via Raffaele Lambruschini 4/B, Milano 20156, Italy
- Monica Tirapelle
  - Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, U.K.
- Luca Mazzei
  - Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, U.K.
- Eva Sorensen
  - Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, U.K.
- Maximilian O. Besenhard
  - Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, U.K.

2
McKetney J, Miller IJ, Hutton A, Sinitcyn P, Serrano LR, Coon JJ, Meyer JG. Deep Learning Predicts Non-Normal Transmission Distributions in High-Field Asymmetric Waveform Ion Mobility (FAIMS) Directly from Peptide Sequence. Anal Chem 2025; 97:2254-2263. [PMID: 39865577 PMCID: PMC11800176 DOI: 10.1021/acs.analchem.4c05359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2024] [Revised: 01/06/2025] [Accepted: 01/13/2025] [Indexed: 01/28/2025]
Abstract
Peptide ion mobility adds an extra dimension of separation to mass spectrometry-based proteomics. The ability to accurately predict peptide ion mobility would be useful to expedite assay development and to discriminate true answers in a database search. There are methods to accurately predict peptide ion mobility through drift tube devices, but methods to predict mobility through high-field asymmetric waveform ion mobility (FAIMS) are underexplored. Here, we successfully model peptide ions' FAIMS mobility using a multi-label classification scheme to account for non-normal transmission distributions. We trained two models from over 100,000 human peptide precursors: a random forest and a long short-term memory (LSTM) neural network. Both models had different strengths, and the ensemble average of model predictions produced a higher F2 score than either model alone. Finally, we explored cases where the models make mistakes and demonstrated a predictive performance of F2 = 0.66 (AUROC = 0.928) on a new test data set of nearly 40,000 E. coli peptide ions. The deep learning model is easily accessible via https://faims.xods.org.
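The abstract above reports an F2 score for an ensemble average of two models. As a rough sketch of what those two terms mean, assuming binary labels flattened across all classes and a hypothetical 0.5 decision threshold (the abstract does not say how the authors thresholded or aggregated over labels), the computation could look like this. F-beta is (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP); with beta = 2 it weights recall more heavily than precision.

```python
def f_beta(y_true, y_pred, beta=2.0):
    """F-beta score for flat lists of binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta * beta
    denom = (1.0 + b2) * tp + b2 * fn + fp
    return (1.0 + b2) * tp / denom if denom else 0.0


def ensemble_average(probs_a, probs_b, threshold=0.5):
    """Average two models' predicted probabilities, then threshold.

    The threshold value is an assumption made for this illustration.
    """
    return [1 if (a + b) / 2.0 >= threshold else 0
            for a, b in zip(probs_a, probs_b)]


# Made-up probabilities from two hypothetical models for four labels.
truth = [1, 1, 0, 0]
model_a = [0.9, 0.2, 0.6, 0.1]
model_b = [0.7, 0.6, 0.2, 0.3]
combined = ensemble_average(model_a, model_b)
print(combined, f_beta(truth, combined))
```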
Affiliation(s)
- Justin McKetney
  - Department of Biomolecular Chemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States
  - National Center for Quantitative Biology of Complex Systems, Madison, Wisconsin 53706, United States
  - Gladstone Data Science and Biotechnology Institute, The J. David Gladstone Institutes, San Francisco, California 94158, United States
  - Quantitative Bioscience Institute, University of California, San Francisco, California 94158, United States
  - Department of Cellular and Molecular Pharmacology, University of California, San Francisco, California 94158, United States
- Ian J. Miller
  - Department of Biomolecular Chemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States
  - National Center for Quantitative Biology of Complex Systems, Madison, Wisconsin 53706, United States
- Alexandre Hutton
  - Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Pavel Sinitcyn
  - Morgridge Institute for Research, Madison, Wisconsin 53715, United States
- Lia R Serrano
  - Department of Biomolecular Chemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States
  - Department of Chemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States
- Joshua J. Coon
  - Department of Biomolecular Chemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States
  - National Center for Quantitative Biology of Complex Systems, Madison, Wisconsin 53706, United States
  - Morgridge Institute for Research, Madison, Wisconsin 53715, United States
  - Department of Chemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States
- Jesse G. Meyer
  - Department of Biomolecular Chemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States
  - National Center for Quantitative Biology of Complex Systems, Madison, Wisconsin 53706, United States
  - Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States

3
Ma C, Wolfinger R. A prediction model for blood-brain barrier penetrating peptides based on masked peptide transformers with dynamic routing. Brief Bioinform 2023; 24:bbad399. [PMID: 37985456 DOI: 10.1093/bib/bbad399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Revised: 09/26/2023] [Accepted: 10/17/2023] [Indexed: 11/22/2023]
Abstract
Blood-brain barrier penetrating peptides (BBBPs) are short peptide sequences that possess the ability to traverse the selective blood-brain interface, making them valuable drug candidates or carriers for various payloads. However, the in vivo or in vitro validation of BBBPs is resource-intensive and time-consuming, driving the need for accurate in silico prediction methods. Unfortunately, the scarcity of experimentally validated BBBPs hinders the efficacy of current machine-learning approaches in generating reliable predictions. In this paper, we present DeepB3P3, a novel framework for BBBPs prediction. Our contribution encompasses four key aspects. Firstly, we propose a novel deep learning model consisting of a transformer encoder layer, a convolutional network backbone, and a capsule network classification head. This integrated architecture effectively learns representative features from peptide sequences. Secondly, we introduce masked peptides as a powerful data augmentation technique to compensate for small training set sizes in BBBP prediction. Thirdly, we develop a novel threshold-tuning method to handle imbalanced data by approximating the optimal decision threshold using the training set. Lastly, DeepB3P3 provides an accurate estimation of the uncertainty level associated with each prediction. Through extensive experiments, we demonstrate that DeepB3P3 achieves state-of-the-art accuracy of up to 98.31% on a benchmarking dataset, solidifying its potential as a promising computational tool for the prediction and discovery of BBBPs.
Affiliation(s)
- Chunwei Ma
  - JMP Statistical Discovery, LLC, Cary, NC 27513, USA
  - Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY 14260, USA

4
Neely BA, Dorfer V, Martens L, Bludau I, Bouwmeester R, Degroeve S, Deutsch EW, Gessulat S, Käll L, Palczynski P, Payne SH, Rehfeldt TG, Schmidt T, Schwämmle V, Uszkoreit J, Vizcaíno JA, Wilhelm M, Palmblad M. Toward an Integrated Machine Learning Model of a Proteomics Experiment. J Proteome Res 2023; 22:681-696. [PMID: 36744821 PMCID: PMC9990124 DOI: 10.1021/acs.jproteome.2c00711] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 11/01/2022] [Indexed: 02/07/2023]
Abstract
In recent years machine learning has made extensive progress in modeling many aspects of mass spectrometry data. We brought together proteomics data generators, repository managers, and machine learning experts in a workshop with the goals to evaluate and explore machine learning applications for realistic modeling of data from multidimensional mass spectrometry-based proteomics analysis of any sample or organism. Following this sample-to-data roadmap helped identify knowledge gaps and define needs. Being able to generate bespoke and realistic synthetic data has legitimate and important uses in system suitability, method development, and algorithm benchmarking, while also posing critical ethical questions. The interdisciplinary nature of the workshop informed discussions of what is currently possible and future opportunities and challenges. In the following perspective we summarize these discussions in the hope of conveying our excitement about the potential of machine learning in proteomics and to inspire future research.
Affiliation(s)
- Benjamin A. Neely
  - National Institute of Standards and Technology, Charleston, South Carolina 29412, United States
- Viktoria Dorfer
  - Bioinformatics Research Group, University of Applied Sciences Upper Austria, Softwarepark 11, 4232 Hagenberg, Austria
- Lennart Martens
  - VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
  - Department of Biomolecular Medicine, Faculty of Health Sciences and Medicine, Ghent University, 9000 Ghent, Belgium
- Isabell Bludau
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, 82152 Martinsried, Germany
- Robbin Bouwmeester
  - VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
  - Department of Biomolecular Medicine, Faculty of Health Sciences and Medicine, Ghent University, 9000 Ghent, Belgium
- Sven Degroeve
  - VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
  - Department of Biomolecular Medicine, Faculty of Health Sciences and Medicine, Ghent University, 9000 Ghent, Belgium
- Eric W. Deutsch
  - Institute for Systems Biology, Seattle, Washington 98109, United States
- Lukas Käll
  - Science for Life Laboratory, KTH - Royal Institute of Technology, 171 21 Solna, Sweden
- Pawel Palczynski
  - Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
- Samuel H. Payne
  - Department of Biology, Brigham Young University, Provo, Utah 84602, United States
- Tobias Greisager Rehfeldt
  - Institute for Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
- Veit Schwämmle
  - Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
- Julian Uszkoreit
  - Medical Proteome Analysis, Center for Protein Diagnostics (ProDi), Ruhr University Bochum, 44801 Bochum, Germany
  - Medizinisches Proteom-Center, Medical Faculty, Ruhr University Bochum, 44801 Bochum, Germany
- Juan Antonio Vizcaíno
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Mathias Wilhelm
  - Computational Mass Spectrometry, Technical University of Munich (TUM), 85354 Freising, Germany
- Magnus Palmblad
  - Leiden University Medical Center, Postbus 9600, 2300 RC Leiden, The Netherlands

5
Chen W, McCool EN, Sun L, Zang Y, Ning X, Liu X. Evaluation of Machine Learning Models for Proteoform Retention and Migration Time Prediction in Top-Down Mass Spectrometry. J Proteome Res 2022; 21:1736-1747. [PMID: 35616364 PMCID: PMC9250612 DOI: 10.1021/acs.jproteome.2c00124] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Abstract
Reversed-phase liquid chromatography (RPLC) and capillary zone electrophoresis (CZE) are two primary proteoform separation methods in mass spectrometry (MS)-based top-down proteomics. Proteoform retention time (RT) prediction in RPLC and migration time (MT) prediction in CZE provide additional information for accurate proteoform identification and quantification. While existing methods are mainly focused on peptide RT and MT prediction in bottom-up MS, there is still a lack of methods for proteoform RT and MT prediction in top-down MS. We systematically evaluated eight machine learning models and a transfer learning method for proteoform RT prediction and five models and the transfer learning method for proteoform MT prediction. Experimental results showed that a gated recurrent unit (GRU)-based model with transfer learning achieved a high accuracy (R = 0.978) for proteoform RT prediction and that the GRU-based model and a fully connected neural network model obtained a high accuracy of R = 0.982 and 0.981 for proteoform MT prediction, respectively.
Affiliation(s)
- Wenrong Chen
  - Department of BioHealth Informatics, Indiana University-Purdue University Indianapolis, Indianapolis, Indiana 46202, United States
- Elijah N McCool
  - Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States
- Liangliang Sun
  - Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States
- Yong Zang
  - Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, Indiana 46202, United States
- Xia Ning
  - Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio 43210, United States
  - Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, United States
  - Translational Data Analytics Institute, The Ohio State University, Columbus, Ohio 43210, United States
- Xiaowen Liu
  - Tulane Center for Biomedical Informatics and Genomics, Tulane University, New Orleans, Louisiana 70112, United States
  - Deming Department of Medicine, Tulane University, New Orleans, Louisiana 70112, United States

6
Crook OM, Chung CW, Deane CM. Challenges and Opportunities for Bayesian Statistics in Proteomics. J Proteome Res 2022; 21:849-864. [PMID: 35258980 PMCID: PMC8982455 DOI: 10.1021/acs.jproteome.1c00859] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/08/2021] [Indexed: 12/27/2022]
Abstract
Proteomics is a data-rich science with complex experimental designs and an intricate measurement process. To obtain insights from the large data sets produced, statistical methods, including machine learning, are routinely applied. For a quantity of interest, many of these approaches only produce a point estimate, such as a mean, leaving little room for more nuanced interpretations. By contrast, Bayesian statistics allows quantification of uncertainty through the use of probability distributions. These probability distributions enable scientists to ask complex questions of their proteomics data. Bayesian statistics also offers a modular framework for data analysis by making dependencies between data and parameters explicit. Hence, specifying complex hierarchies of parameter dependencies is straightforward in the Bayesian framework. This allows us to use a statistical methodology which equals, rather than neglects, the sophistication of experimental design and instrumentation present in proteomics. Here, we review Bayesian methods applied to proteomics, demonstrating their potential power, alongside the challenges posed by adopting this new statistical framework. To illustrate our review, we give a walk-through of the development of a Bayesian model for dynamic organic orthogonal phase-separation (OOPS) data.
Affiliation(s)
- Oliver M. Crook
  - Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
- Chun-wa Chung
  - Structural and Biophysical Sciences, GlaxoSmithKline R&D, Stevenage SG1 2NY, United Kingdom
- Charlotte M. Deane
  - Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom

7
Hruska M, Holub D. Evaluation of an integrative Bayesian peptide detection approach on a combinatorial peptide library. Eur J Mass Spectrom (Chichester) 2021; 27:217-234. [PMID: 34989269 DOI: 10.1177/14690667211066725] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Detection of peptides lies at the core of bottom-up proteomics analyses. We examined a Bayesian approach to peptide detection, integrating match-based models (fragments, retention time, isotopic distribution, and precursor mass) and peptide prior probability models under a unified probabilistic framework. To assess the relevance of these models and their various combinations, we employed a complete- and a tail-complete search of a low-precursor-mass synthetic peptide library based on oncogenic KRAS peptides. The fragment match was by far the most informative match-based model, while the retention time match was the only remaining such model with an appreciable impact, increasing correct detections by around 8%. A peptide prior probability model built from a reference proteome greatly improved the detection over a uniform prior, essentially transforming de novo sequencing into a reference-guided search. The knowledge of a correct sequence tag in advance of peptide-spectrum matching had only a moderate impact on peptide detection unless the tag was long and of high certainty. The approach also derived more precise error rates on the analyzed combinatorial peptide library than those estimated using PeptideProphet and Percolator, showing its potential applicability for the detection of homologous peptides. Although the approach requires further computational developments for routine data analysis, it illustrates the value of peptide prior probabilities and presents a Bayesian approach for their incorporation into peptide detection.
Affiliation(s)
- Miroslav Hruska
  - Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry, Palacky University, Olomouc, Czech Republic
  - Department of Computer Science, Faculty of Science, Palacky University, Olomouc, Czech Republic
- Dusan Holub
  - Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry, Palacky University, Olomouc, Czech Republic

8
Abstract
Mass-spectrometry-based proteomics enables quantitative analysis of thousands of human proteins. However, experimental and computational challenges restrict progress in the field. This review summarizes the recent flurry of machine-learning strategies using artificial deep neural networks (or "deep learning") that have started to break barriers and accelerate progress in the field of shotgun proteomics. Deep learning now accurately predicts physicochemical properties of peptides from their sequence, including tandem mass spectra and retention time. Furthermore, deep learning methods exist for nearly every aspect of the modern proteomics workflow, enabling improved feature selection, peptide identification, and protein inference.
Affiliation(s)
- Jesse G. Meyer
  - Department of Biochemistry, Medical College of Wisconsin, Milwaukee, WI 53226, USA

9
Li K, Jain A, Malovannaya A, Wen B, Zhang B. DeepRescore: Leveraging Deep Learning to Improve Peptide Identification in Immunopeptidomics. Proteomics 2020; 20:e1900334. [PMID: 32864883 PMCID: PMC7718998 DOI: 10.1002/pmic.201900334] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 02/12/2020] [Revised: 08/27/2020] [Indexed: 12/23/2022]
Abstract
The identification of major histocompatibility complex (MHC)-binding peptides in mass spectrometry (MS)-based immunopeptidomics relies largely on database search engines developed for proteomics data analysis. However, because immunopeptidomics experiments do not involve enzymatic digestion at specific residues, an inflated search space leads to a high false positive rate and low sensitivity in peptide identification. In order to improve the sensitivity and reliability of peptide identification, a post-processing tool named DeepRescore is developed. DeepRescore combines peptide features derived from deep learning predictions, namely accurate retention time and MS/MS spectra predictions, with previously used features to rescore peptide-spectrum matches. Using two public immunopeptidomics datasets, it is shown that rescoring by DeepRescore increases both the sensitivity and reliability of MHC-binding peptide and neoantigen identifications compared to existing methods. It is also shown that the performance improvement is, to a large extent, driven by the deep learning-derived features. DeepRescore is developed using NextFlow and Docker and is available at https://github.com/bzhanglab/DeepRescore.
Affiliation(s)
- Kai Li
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Antrix Jain
  - Mass Spectrometry Proteomics Core, Baylor College of Medicine, Houston, TX 77030, USA
- Anna Malovannaya
  - Mass Spectrometry Proteomics Core, Baylor College of Medicine, Houston, TX 77030, USA
  - Verna and Marrs McLean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, TX 77030, USA
- Bo Wen
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Bing Zhang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA

10
Wen B, Zeng W, Liao Y, Shi Z, Savage SR, Jiang W, Zhang B. Deep Learning in Proteomics. Proteomics 2020; 20:e1900335. [PMID: 32939979 PMCID: PMC7757195 DOI: 10.1002/pmic.201900335] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Received: 05/27/2020] [Revised: 09/14/2020] [Indexed: 12/17/2022]
Abstract
Proteomics, the study of all the proteins in biological systems, is becoming a data-rich science. Protein sequences and structures are comprehensively catalogued in online databases. With recent advancements in tandem mass spectrometry (MS) technology, protein expression and post-translational modifications (PTMs) can be studied in a variety of biological systems at the global scale. Sophisticated computational algorithms are needed to translate the vast amount of data into novel biological insights. Deep learning automatically extracts data representations at high levels of abstraction from data, and it thrives in data-rich scientific research domains. Here, a comprehensive overview of deep learning applications in proteomics, including retention time prediction, MS/MS spectrum prediction, de novo peptide sequencing, PTM prediction, major histocompatibility complex-peptide binding prediction, and protein structure prediction, is provided. Limitations and the future directions of deep learning in proteomics are also discussed. This review will provide readers an overview of deep learning and how it can be used to analyze proteomics data.
Affiliation(s)
- Bo Wen
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Wen-Feng Zeng
  - Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Chinese Academy of Sciences, Institute of Computing Technology, Beijing 100190, China
- Yuxing Liao
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Zhiao Shi
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Sara R. Savage
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Wen Jiang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Bing Zhang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA

11
Machine learning to predict retention time of small molecules in nano-HPLC. Anal Bioanal Chem 2020; 412:7767-7776. [PMID: 32860519 DOI: 10.1007/s00216-020-02905-0] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 05/26/2020] [Revised: 07/29/2020] [Accepted: 08/20/2020] [Indexed: 01/22/2023]
Abstract
Retention time is an important parameter for identification in untargeted LC-MS screening. Precise retention time prediction facilitates the annotation process and is well known for proteomics. However, the lack of available experimental information for a long time has limited the prediction accuracy for small molecules. Recently introduced large databases for small-molecule retention times make possible reliable machine learning-based predictions for the whole diversity of compounds. Applying simple projections may extend these predictions to various LC systems and conditions. In our work, we describe a complex approach to predict retention times for nano-HPLC that includes the consecutive deployment of binary and regression gradient boosting models trained on the METLIN small-molecule dataset and simple projection of the results with a small number of easily available compounds onto nano-HPLC separations. The proposed model outperforms previous attempts to use machine learning for predictions with a 46-s mean absolute error. The overall performance after transfer to nano-LC conditions is less than 155 s (10.8%) in terms of the median absolute (relative) error. To illustrate the applicability of the described approach, we successfully managed to eliminate on average 25 to 42% of false-positives with a filter threshold derived from ROC curves. Thus, the proposed approach should be used in addition to other well-established in silico methods and their integration may broaden the range of correctly identified molecules.
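The abstract above describes projecting predicted retention times onto another LC system using a few reference compounds, and reports median absolute and relative errors. A minimal sketch of that idea follows, assuming the projection is a straight-line least-squares fit (the abstract only says "simple projection", so the linear form is an assumption) and using invented calibration values.

```python
import statistics


def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def median_abs_error(predicted, observed):
    """Median of |predicted - observed|, in the units of the inputs."""
    return statistics.median(abs(p - o) for p, o in zip(predicted, observed))


def median_rel_error(predicted, observed):
    """Median of |predicted - observed| / observed."""
    return statistics.median(abs(p - o) / o for p, o in zip(predicted, observed))


# Invented calibration set: model-predicted RT (s) on the training system
# versus RT (s) measured for the same compounds on the target system.
predicted_src = [120.0, 300.0, 480.0, 660.0]
measured_dst = [400.0, 910.0, 1450.0, 1980.0]
slope, intercept = fit_line(predicted_src, measured_dst)

projected = [slope * rt + intercept for rt in predicted_src]
print(median_abs_error(projected, measured_dst),
      median_rel_error(projected, measured_dst))
```

In practice the error metrics would be computed on compounds held out from the calibration fit, not on the calibration set itself as in this toy example.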

12
Wen B, Li K, Zhang Y, Zhang B. Cancer neoantigen prioritization through sensitive and reliable proteogenomics analysis. Nat Commun 2020; 11:1759. [PMID: 32273506 PMCID: PMC7145864 DOI: 10.1038/s41467-020-15456-w] [Citation(s) in RCA: 89] [Impact Index Per Article: 17.8] [Received: 07/29/2019] [Accepted: 03/10/2020] [Indexed: 01/01/2023]
Abstract
Genomics-based neoantigen discovery can be enhanced by proteomic evidence, but there remains a lack of consensus on the performance of different quality control methods for variant peptide identification in proteogenomics. We propose to use the difference between accurately predicted and observed retention times for each peptide as a metric to evaluate different quality control methods. To this end, we develop AutoRT, a deep learning algorithm with high accuracy in retention time prediction. Analysis of three cancer data sets with a total of 287 tumor samples using different quality control strategies results in substantially different numbers of identified variant peptides and putative neoantigens. Our systematic evaluation, using the proposed retention time metric, provides insights and practical guidance on the selection of quality control strategies. We implement the recommended strategy in a computational workflow named NeoFlow to support proteogenomics-based neoantigen prioritization, enabling more sensitive discovery of putative neoantigens. Identifying mutation-derived neoantigens by proteogenomics requires robust strategies for quality control. Here, the authors propose peptide retention time as an evaluation metric for proteogenomics quality control methods, and develop a deep learning algorithm for accurate retention time prediction.
Affiliation(s)
- Bo Wen
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Kai Li
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Yun Zhang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Bing Zhang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA

13
Ma C, Ren Y, Yang J, Ren Z, Yang H, Liu S. Improved Peptide Retention Time Prediction in Liquid Chromatography through Deep Learning. Anal Chem 2018; 90:10881-10888. [PMID: 30114359 DOI: 10.1021/acs.analchem.8b02386] [Citation(s) in RCA: 88] [Impact Index Per Article: 12.6] [Indexed: 01/03/2023]
Abstract
The accuracy of peptide retention time (RT) prediction models in liquid chromatography (LC) is still not sufficient for wider implementation in proteomics practice. Herein, we propose deep learning as an ideal tool to considerably improve this prediction. A new peptide RT prediction tool, DeepRT, was designed using a capsule network model, and public data sets containing peptides separated by reverse-phase liquid chromatography were used to evaluate the DeepRT performance. Compared with other prevailing RT predictors, DeepRT attained overall improvement in the prediction of peptide RTs with an R² of ∼0.994. Moreover, DeepRT was able to accommodate peptides that were separated by different types of LC, such as strong cation exchange (SCX) and hydrophilic interaction liquid chromatography (HILIC), and to reach RT prediction with R² values of ∼0.996 for SCX and ∼0.993 for HILIC, respectively. If a large peptide data set is available for one type of LC, DeepRT can be promoted to DeepRT(+) using transfer learning. Based on a large peptide data set gained from SWATH, DeepRT(+) further elevated the accuracy of RT prediction for peptides in a small data set and enabled a satisfactory prediction with a limited number of peptides, on the order of hundreds. Further, DeepRT automatically learns retention-related properties of amino acids under different separation mechanisms, which are well consistent with retention coefficients (Rc) of the amino acids. DeepRT was thus proven to be an improved RT predictor with high flexibility and efficiency. DeepRT is available at https://github.com/horsepurve/DeepRTplus.
Affiliation(s)
- Chunwei Ma
  - BGI-Shenzhen, Beishan Industrial Zone 11th Building, Yantian District, Shenzhen, Guangdong 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
- Yan Ren
  - BGI-Shenzhen, Beishan Industrial Zone 11th Building, Yantian District, Shenzhen, Guangdong 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
- Jiarui Yang
  - BGI-Shenzhen, Beishan Industrial Zone 11th Building, Yantian District, Shenzhen, Guangdong 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
- Zhe Ren
  - BGI-Shenzhen, Beishan Industrial Zone 11th Building, Yantian District, Shenzhen, Guangdong 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
- Huanming Yang
  - BGI-Shenzhen, Beishan Industrial Zone 11th Building, Yantian District, Shenzhen, Guangdong 518083, China
  - James D. Watson Institute of Genome Sciences, Hangzhou 310008, China
- Siqi Liu
  - BGI-Shenzhen, Beishan Industrial Zone 11th Building, Yantian District, Shenzhen, Guangdong 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China