1. Pavek JG, Bollis NE, Grimes J, Shortreed MR, Smith LM, Marty MT. A Fast Neural Network for Isotopic Charge State Assignment. J Am Chem Soc 2025. [PMID: 40493377 DOI: 10.1021/jacs.5c03162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2025]
Abstract
Electrospray ionization (ESI) mass spectrometry is an essential technique for chemical analysis in a range of fields. In ESI, analytes can produce multiple charge states, which must be correctly assigned for identification. Existing approaches to charge state assignment can suffer from limited accuracy or poor speed. Here, we developed a fast neural network to perform isotopic envelope charge assignment. The performance of our algorithm, IsoDec, was demonstrated on top-down proteomics spectra collected on diverse instruments. On these highly complex individual spectra, we found that IsoDec correctly assigns more features compared to existing software tools while simultaneously providing improved speed and accuracy. Importantly, this performance enhancement stems directly from the neural network charge assignment approach and not simply from improved scoring and filtering of isotopic envelopes. Finally, when applied to large top-down proteomics data sets, we discovered that database searching of the IsoDec deconvolution output produces proteoform-spectrum matches with a better combination of coverage and accuracy. Overall, IsoDec provides a compelling demonstration of the potential of lightweight neural networks in mass spectrometry data analysis for diverse applications.
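As a rough illustration of the charge assignment problem this abstract describes (and not IsoDec's neural network approach), adjacent isotopic peaks of an ion with charge z are spaced about 1.00335/z apart in m/z, so a simple baseline estimates z from the median peak spacing. The helper name `estimate_charge` and the example peak list are hypothetical.

```python
from statistics import median

# Approximate mass difference between 13C and 12C, in Da.
ISOTOPE_SPACING_DA = 1.00335


def estimate_charge(peak_mzs):
    """Estimate a charge state from isotopic peak spacing (hypothetical helper)."""
    mzs = sorted(peak_mzs)
    if len(mzs) < 2:
        raise ValueError("need at least two isotopic peaks")
    # Differences between neighbouring peaks in the envelope.
    spacings = [b - a for a, b in zip(mzs, mzs[1:])]
    typical = median(spacings)
    if typical <= 0:
        raise ValueError("peaks must be distinct")
    # Spacing is ~1.00335 / z, so z is ~1.00335 / spacing.
    return max(1, round(ISOTOPE_SPACING_DA / typical))


# Made-up envelope whose peaks are ~0.2007 m/z apart, i.e. charge 5.
peaks = [800.0 + i * ISOTOPE_SPACING_DA / 5 for i in range(4)]
print(estimate_charge(peaks))
```

A spacing-based baseline like this tends to struggle with overlapping envelopes and noisy peaks, which is the kind of case the abstract says motivates a learned approach.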
Affiliation(s)
- John G Pavek
  - Department of Chemistry and Biochemistry, University of Arizona, Tucson, Arizona 85721, United States
- Nicholas E Bollis
  - Department of Chemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States
- Josiah Grimes
  - Department of Chemistry and Biochemistry, University of Arizona, Tucson, Arizona 85721, United States
- Michael R Shortreed
  - Department of Chemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States
- Lloyd M Smith
  - Department of Chemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States
- Michael T Marty
  - Department of Chemistry and Biochemistry, University of Arizona, Tucson, Arizona 85721, United States

2. Angelis J, Schröder EA, Xiao Z, Gabriel W, Wilhelm M. Peptide Property Prediction for Mass Spectrometry Using AI: An Introduction to State of the Art Models. Proteomics 2025; 25:e202400398. [PMID: 40211610 PMCID: PMC12076536 DOI: 10.1002/pmic.202400398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2024] [Revised: 03/14/2025] [Accepted: 03/17/2025] [Indexed: 05/15/2025]
Abstract
This review explores state of the art machine learning and deep learning models for peptide property prediction in mass spectrometry-based proteomics, including, but not limited to, models for predicting digestibility, retention time, charge state distribution, collisional cross section, fragmentation ion intensities, and detectability. The combination of these models enables not only the in silico generation of spectral libraries but also finds many additional use cases in the design of targeted assays or data-driven rescoring. This review serves as both an introduction for newcomers and an update for experienced researchers aiming to develop accessible and reproducible models for peptide property predictions. Key limitations of the current models, including difficulties in handling diverse post-translational modifications and instrument variability, highlight the need for large-scale, harmonized datasets, and standardized evaluation metrics for benchmarking.
Affiliation(s)
- Jesse Angelis
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Eva Ayla Schröder
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Zixuan Xiao
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Wassim Gabriel
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Mathias Wilhelm
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
  - Munich Data Science Institute (MDSI), Technical University of Munich, Garching, Germany

3. Javid S, Ramya K, Gulzar Ahmed M, Zahiya N, Thansheefa N, Anas AK, Mahfeela F, Reeha, Sha F, Sultana R. Bioanalysis of antihypertensive drugs by LC-MS: a fleeting look at the regulatory guidelines and artificial intelligence. Bioanalysis 2025; 17:471-487. [PMID: 40256889 PMCID: PMC12026161 DOI: 10.1080/17576180.2025.2489917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2024] [Accepted: 04/03/2025] [Indexed: 04/22/2025] Open
Abstract
Hypertension is a multifaceted cardiovascular disease, a significant risk factor for stroke, heart attack, heart failure, and renal damage. An essential phase in the drug development process is the exploration of effective bioanalytical approaches to investigate drug metabolism and pharmacokinetics precisely. The use of LC-MS has increased significantly over the last 10 years; numerous researchers have made contributions to the field and enhanced the technical capabilities of workflows based on LC-MS. This review provides a critical analysis of the method development and validation of bioanalytical methods using Liquid Chromatography-Mass Spectrometry (LC-MS) of a few antihypertensive drugs, focusing on extraction techniques and validation parameters. Furthermore, it takes a fleeting look at GLP, regulatory guidelines, machine learning, and artificial intelligence in bioanalysis. Despite these advancements, the document identifies gaps in current regulatory guidelines and advocates areas for further research, predominantly concerning matrix effects and the impact of co-medications. The integration of AI tools in LC-MS has shown the potential to revolutionize bioanalytical methods, yet there is still an imperative for global harmonization. We assume that this review will offer a foundation for the research of new strategies and assist in the identification of the optimum relevant methodology parameters for known and emerging antihypertensive drugs.
Affiliation(s)
- Saleem Javid
  - Department of Pharmaceutical Chemistry, Yenepoya Pharmacy College & Research Centre, Mangalore, India
- K. Ramya
  - Department of Pharmaceutical Chemistry, Oxbridge College of Pharmacy, Bengaluru, India
- Mohammed Gulzar Ahmed
  - Department of Pharmaceutics, Yenepoya Pharmacy College & Research Centre, Mangalore, India
- Nafeesath Zahiya
  - Department of Pharmaceutics, Yenepoya Pharmacy College & Research Centre, Mangalore, India
- Nafisa Thansheefa
  - Department of Pharmaceutics, Yenepoya Pharmacy College & Research Centre, Mangalore, India
- Abdul Kadar Anas
  - Department of Pharmaceutics, Yenepoya Pharmacy College & Research Centre, Mangalore, India
- Fathimath Mahfeela
  - Department of Pharmaceutics, Yenepoya Pharmacy College & Research Centre, Mangalore, India
- Reeha
  - Department of Pharmaceutics, Yenepoya Pharmacy College & Research Centre, Mangalore, India
- Fariz Sha
  - Department of Pharmaceutics, Yenepoya Pharmacy College & Research Centre, Mangalore, India
- Rokeya Sultana
  - Department of Pharmacognosy, Yenepoya Pharmacy College & Research Centre, Mangalore, India

4. Jiang G, Gao W, Xu M, Tong M. The analysis of rural revitalization service platform in smart city under back propagation neural network. PLoS One 2025; 20:e0317702. [PMID: 40100863 PMCID: PMC11918382 DOI: 10.1371/journal.pone.0317702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2024] [Accepted: 12/24/2024] [Indexed: 03/20/2025] Open
Abstract
To achieve rural revitalization and enhance the development of rural tourism, this study employs a back propagation neural network (BPNN) to construct a rural revitalization development model. Additionally, the Grey Relation Analysis (GRA) algorithm is used to classify rural revitalization efforts across different cities. Consistency testing is applied to analyze rural revitalization indicators, and a tourism service evaluation model is established to assess rural revitalization tourism services from the perspective of smart cities. The research results indicate that: (1) the training results and expected values of the ten cities are relatively consistent, and the classification of rural revitalization development is good; (2) The five major indicators of tourism information services, tourism security services, tourism transportation services, tourism environment services, and tourism management services all meet the consistency test, and the consistency test results are all less than 0.1, confirming the reliability and effectiveness of the research data; (3) The tourism information and management services are mainly evaluated at level C, accounting for 62% and 62.5% respectively. The tourism transportation and safety services are mainly evaluated at level D, and the model can indicate the level of rural revitalization tourism service; (4) Compared with other algorithms, the GRA-BPNN algorithm performs the best in rural revitalization evaluation, with an accuracy of 92.3%, precision of 91.8%, recall rate of 93.7%, and F1 score of 92.7%. This study optimizes the rural revitalization tourism service platform, enhances the quality of rural tourism, promotes the development of the rural tourism industry, and contributes to the realization of rural revitalization.
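The F1 score quoted in result (4) can be cross-checked from the stated precision and recall, since F1 is their harmonic mean. The short sketch below uses only the figures given in the abstract; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Return the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall quoted in the abstract for the GRA-BPNN algorithm.
f1 = f1_score(0.918, 0.937)
print(f"{f1:.3f}")  # 0.927
```

The result rounds to 92.7%, which agrees with the reported F1 score.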
Affiliation(s)
- Gongyi Jiang
  - Foreign Languages Department, Tourism College of Zhejiang, Hangzhou, China
- Weijun Gao
  - Faculty of Environmental Engineering, The University of Kitakyushu, Kitakyushu, Japan
  - Innovation Institute for Sustainable Maritime Architecture Research and Technology, Qingdao University of Technology, Qingdao, China
- Meng Xu
  - Faculty of Chemical Engineering and Technology, Zhejiang University of Technology, Hangzhou, China
- Mingjia Tong
  - Foreign Languages Department, Tourism College of Zhejiang, Hangzhou, China

5. Movassaghi CS, Sun J, Jiang Y, Turner N, Chang V, Chung N, Chen RJ, Browne EN, Lin C, Schweppe DK, Malaker SA, Meyer JG. Recent Advances in Mass Spectrometry-Based Bottom-Up Proteomics. Anal Chem 2025; 97:4728-4749. [PMID: 40000226 DOI: 10.1021/acs.analchem.4c06750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2025]
Abstract
Mass spectrometry-based proteomics is about 35 years old, and recent progress appears to be speeding up across all subfields. In this review, we focus on advances over the last two years in select areas within bottom-up proteomics, including approaches to high-throughput experiments, data analysis using machine learning, drug discovery, glycoproteomics, extracellular vesicle proteomics, and structural proteomics.
Affiliation(s)
- Cameron S Movassaghi
  - Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Jie Sun
  - Department of Biochemistry & Cellular and Molecular Biology, University of Tennessee, Knoxville, Tennessee 37996, United States
- Yuming Jiang
  - Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Natalie Turner
  - Departments of Molecular Medicine and Neurobiology, Scripps Research Institute, La Jolla, California 92037, United States
- Vincent Chang
  - Department of Chemistry, Yale University, 275 Prospect Street, New Haven, Connecticut 06511, United States
- Nara Chung
  - Department of Chemistry, Yale University, 275 Prospect Street, New Haven, Connecticut 06511, United States
- Ryan J Chen
  - Department of Chemistry, Yale University, 275 Prospect Street, New Haven, Connecticut 06511, United States
- Elizabeth N Browne
  - Department of Chemistry, Yale University, 275 Prospect Street, New Haven, Connecticut 06511, United States
- Chuwei Lin
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98105, United States
- Devin K Schweppe
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98105, United States
- Stacy A Malaker
  - Department of Chemistry, Yale University, 275 Prospect Street, New Haven, Connecticut 06511, United States
- Jesse G Meyer
  - Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States

6. Schneider M, Zolg DP, Samaras P, Ben Fredj S, Bold D, Guevende A, Hogrebe A, Berger MT, Graber M, Sukumar V, Mamisashvili L, Bronsthein I, Eljagh L, Gessulat S, Seefried F, Schmidt T, Frejno M. A Scalable, Web-Based Platform for Proteomics Data Processing, Result Storage and Analysis. J Proteome Res 2025; 24:1241-1249. [PMID: 39982847 PMCID: PMC11894649 DOI: 10.1021/acs.jproteome.4c00871] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2024] [Revised: 12/20/2024] [Accepted: 01/23/2025] [Indexed: 02/23/2025]
Abstract
The exponential increase in proteomics data presents critical challenges for conventional processing workflows. These pipelines often consist of fragmented software packages, glued together using complex in-house scripts or error-prone manual workflows running on local hardware, which are costly to maintain and scale. The MSAID Platform offers a fully automated, managed proteomics data pipeline, consolidating formerly disjointed functions into unified, API-driven services that cover the entire process from raw data to biological insights. Backed by the cloud-native search algorithm CHIMERYS, as well as scalable cloud compute instances and data lakes, the platform facilitates efficient processing of large data sets, automation of processing via the command line, systematic result storage, analysis, and visualization. The data lake supports elastically growing storage and unified query capabilities, facilitating large-scale analyses and efficient reuse of previously processed data, such as aggregating longitudinally acquired studies. Users interact with the platform via a web interface, CLI client, or API, providing flexible, automated access. Readily available tools for accessing result data include browser-based interrogation and one-click visualizations for statistical analysis. The platform streamlines research processes, making advanced and automated proteomic workflows accessible to a broader range of scientists. The MSAID Platform is globally available via https://platform.msaid.io.
7. Huang J, Li Y, Meng B, Zhang Y, Wei Y, Dai X, An D, Zhao Y, Fang X. ProteoNet: A CNN-based framework for analyzing proteomics MS-RGB images. iScience 2024; 27:111362. [PMID: 39679296 PMCID: PMC11638609 DOI: 10.1016/j.isci.2024.111362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 06/15/2024] [Accepted: 11/07/2024] [Indexed: 12/17/2024] Open
Abstract
Proteomics is crucial in clinical research, yet the clinical application of proteomic data remains challenging. Transforming proteomic mass spectrometry (MS) data into red, green, and blue color (MS-RGB) image formats and applying deep learning (DL) techniques has shown great potential to enhance analysis efficiency. However, current DL models often fail to extract subtle, crucial features from MS-RGB data. To address this, we developed ProteoNet, a deep learning framework that refines MS-RGB data analysis. ProteoNet incorporates semantic partitioning, adaptive average pooling, and weighted factors into the Convolutional Neural Network (CNN) model, thus enhancing data analysis accuracy. Our experiments with proteomics data from urine, blood, and tissue samples related to liver, kidney, and thyroid diseases demonstrate that ProteoNet outperforms existing models in accuracy. ProteoNet also provides a direct conversion method for MS-RGB data, enabling a seamless workflow. Moreover, its compatibility with various CNN architectures, including lightweight models like MobileNetV2, underscores its scalability and clinical potential.
Affiliation(s)
- Jinze Huang
  - Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China
- Yimin Li
  - College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
- Bo Meng
  - Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China
- Yong Zhang
  - Institutes for Systems Genetics, West China Hospital, Sichuan University, Chengdu 610041, China
- Yaoguang Wei
  - College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
- Xinhua Dai
  - Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China
- Dong An
  - College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
- Yang Zhao
  - Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China
  - College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
- Xiang Fang
  - Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China

8. Kundu P, Beura S, Mondal S, Das AK, Ghosh A. Machine learning for the advancement of genome-scale metabolic modeling. Biotechnol Adv 2024; 74:108400. [PMID: 38944218 DOI: 10.1016/j.biotechadv.2024.108400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Revised: 05/13/2024] [Accepted: 06/23/2024] [Indexed: 07/01/2024]
Abstract
Constraint-based modeling (CBM) has evolved as the core systems biology tool to map the interrelations between genotype, phenotype, and external environment. The recent advancement of high-throughput experimental approaches and multi-omics strategies has generated a plethora of new and precise information from wide-ranging biological domains. On the other hand, the continuously growing field of machine learning (ML) and its specialized branch of deep learning (DL) provide essential computational architectures for decoding complex and heterogeneous biological data. In recent years, both multi-omics and ML have assisted in the escalation of CBM. Condition-specific omics data, such as transcriptomics and proteomics, helped contextualize the model prediction while analyzing a particular phenotypic signature. At the same time, the advanced ML tools have eased the model reconstruction and analysis to increase the accuracy and prediction power. However, the development of these multi-disciplinary methodological frameworks mainly occurs independently, which limits the concatenation of biological knowledge from different domains. Hence, we have reviewed the potential of integrating multi-disciplinary tools and strategies from various fields, such as synthetic biology, CBM, omics, and ML, to explore the biochemical phenomenon beyond the conventional biological dogma. How the integrative knowledge of these intersected domains has improved bioengineering and biomedical applications has also been highlighted. We categorically explained the conventional genome-scale metabolic model (GEM) reconstruction tools and their improvement strategies through ML paradigms. Further, the crucial role of ML and DL in omics data restructuring for GEM development has also been briefly discussed. Finally, the case-study-based assessment of the state-of-the-art method for improving biomedical and metabolic engineering strategies has been elaborated. 
Therefore, this review demonstrates how integrating experimental and in silico strategies can help map the ever-expanding knowledge of biological systems driven by condition-specific cellular information. This multiview approach will elevate the application of ML-based CBM in the biomedical and bioengineering fields for the betterment of society and the environment.
Affiliation(s)
- Pritam Kundu
  - School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Satyajit Beura
  - Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
- Suman Mondal
  - P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Amit Kumar Das
  - Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
- Amit Ghosh
  - School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
  - P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India

9. Tariq U, Saeed F. Predicting peptide properties from mass spectrometry data using deep attention-based multitask network and uncertainty quantification. bioRxiv 2024:2024.08.21.609035. [PMID: 39229185 PMCID: PMC11370541 DOI: 10.1101/2024.08.21.609035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024]
Abstract
Database search algorithms reduce the number of potential candidate peptides against which scoring needs to be performed using a single (i.e. mass) property for filtering. While useful, filtering based on one property may lead to exclusion of non-abundant spectra and uncharacterized peptides - potentially exacerbating the streetlight effect. Here we present ProteoRift, a novel attention and multitask deep-network, which can predict multiple peptide properties (length, missed cleavages, and modification status) directly from spectra. We demonstrate that ProteoRift can predict these properties with up to 97% accuracy resulting in search-space reduction by more than 90%. As a result, our end-to-end pipeline is shown to exhibit 8x to 12x speedups with peptide deduction accuracy comparable to algorithmic techniques. We also formulate two uncertainty estimation metrics, which can distinguish between in-distribution and out-of-distribution data (ROC-AUC 0.99) and predict high-scoring mass spectra against correct peptide (ROC-AUC 0.94). These models and metrics are integrated in an end-to-end ML pipeline available at https://github.com/pcdslab/ProteoRift.
Affiliation(s)
- Usman Tariq
  - Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL, USA
- Fahad Saeed
  - Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL, USA
  - Biomolecular Sciences Institute (BSI), Florida International University, Miami, FL, USA
  - Department of Human and Molecular Genetics, Herbert Wertheim School of Medicine, Florida International University, Miami, FL, USA

10. Overstreet R, King E, Clopton G, Nguyen J, Ciesielski D. QC-GN²oMS²: a Graph Neural Net for High Resolution Mass Spectra Prediction. J Chem Inf Model 2024; 64:5806-5816. [PMID: 39013165 DOI: 10.1021/acs.jcim.4c00446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/18/2024]
Abstract
Predicting the mass spectrum of a molecular ion is often accomplished via three generalized approaches: rules-based methods for bond breaking, deep learning, or quantum chemical (QC) modeling. Rules-based approaches are often limited by the conditions for different chemical subspaces and perform poorly under chemical regimes with few defined rules. QC modeling is theoretically robust but requires significant amounts of computational time to produce a spectrum for a given target. Among deep learning techniques, graph neural networks (GNNs) have performed better than previous work with fingerprint-based neural networks in mass spectra prediction. To explore this technique further, we investigate the effects of including quantum chemically derived information as edge features in the GNN to increase predictive accuracy. The models we investigated include categorical bond order, bond force constants derived from extended tight-binding (xTB) quantum chemistry, and acyclic bond dissociation energies. We evaluated these models against a control GNN with no edge features in the input graphs. Bond dissociation enthalpies yielded the best improvement with a cosine similarity score of 0.462 relative to the baseline model (0.437). In this work we also apply dynamic graph attention which improves performance on benchmark problems and supports the inclusion of edge features. Between implementations, we investigate the nature of the molecular embedding for spectra prediction and discuss the recognition of fragment topographies in distinct chemistries for further development in tandem mass spectrometry prediction.
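The cosine similarity score used above to compare models can be sketched as follows, assuming the predicted and measured spectra have already been binned onto the same m/z grid. The binning step and the example intensities are assumptions for illustration, not the authors' evaluation code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length intensity vectors."""
    if len(a) != len(b):
        raise ValueError("spectra must share the same m/z bins")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # An all-zero spectrum has no direction; treat it as no similarity.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up binned intensities for a predicted and a measured spectrum.
predicted = [0.0, 0.5, 1.0, 0.0]
measured = [0.0, 1.0, 0.5, 0.2]
print(round(cosine_similarity(predicted, measured), 3))
```

A score of 1 means the two spectra have identical relative intensities across bins, and 0 means they share no peaks.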
Affiliation(s)
- Richard Overstreet
  - Signature Science and Technology Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Ethan King
  - Computing and Analytics Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Grady Clopton
  - Department of Chemistry, Tennessee State University, Nashville, Tennessee 37209, United States
- Julia Nguyen
  - Computing and Analytics Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Danielle Ciesielski
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States

11. Peng S, Rajjou L. Advancing plant biology through deep learning-powered natural language processing. Plant Cell Rep 2024; 43:208. [PMID: 39102077 DOI: 10.1007/s00299-024-03294-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Accepted: 07/19/2024] [Indexed: 08/06/2024]
Abstract
The application of deep learning methods, specifically the utilization of Large Language Models (LLMs), in the field of plant biology holds significant promise for generating novel knowledge on plant cell systems. The LLM framework exhibits exceptional potential, particularly with the development of Protein Language Models (PLMs), allowing for in-depth analyses of nucleic acid and protein sequences. This analytical capacity facilitates the discernment of intricate patterns and relationships within biological data, encompassing multi-scale information within DNA or protein sequences. The contribution of PLMs extends beyond mere sequence patterns and structure--function recognition; it also supports advancements in genetic improvements for agriculture. The integration of deep learning approaches into the domain of plant sciences offers opportunities for major breakthroughs in basic research across multi-scale plant traits. Consequently, the strategic application of deep learning methodologies, particularly leveraging the potential of LLMs, will undoubtedly play a pivotal role in advancing plant sciences, plant production, plant uses and propelling the trajectory toward sustainable agroecological and agro-food transitions.
Affiliation(s)
- Shuang Peng
  - Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin for Plant Sciences (IJPB), 78000, Versailles, France
- Loïc Rajjou
  - Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin for Plant Sciences (IJPB), 78000, Versailles, France

12. Wasilewski T, Kamysz W, Gębicki J. AI-Assisted Detection of Biomarkers by Sensors and Biosensors for Early Diagnosis and Monitoring. Biosensors 2024; 14:356. [PMID: 39056632 PMCID: PMC11274923 DOI: 10.3390/bios14070356] [Citation(s) in RCA: 20] [Impact Index Per Article: 20.0] [Received: 05/09/2024] [Revised: 06/25/2024] [Accepted: 06/28/2024] [Indexed: 07/28/2024]
Abstract
The steady progress in consumer electronics, together with improvement in microflow techniques, nanotechnology, and data processing, has led to implementation of cost-effective, user-friendly portable devices, which play the role of not only gadgets but also diagnostic tools. Moreover, numerous smart devices monitor patients' health, and some of them are applied in point-of-care (PoC) tests as a reliable source of evaluation of a patient's condition. Current diagnostic practices are still based on laboratory tests, preceded by the collection of biological samples, which are then tested in clinical conditions by trained personnel with specialized equipment. In practice, collecting passive/active physiological and behavioral data from patients in real time and feeding them to artificial intelligence (AI) models can significantly improve the decision process regarding diagnosis and treatment procedures via the omission of conventional sampling and diagnostic procedures while also excluding the role of pathologists. A combination of conventional and novel methods of digital and traditional biomarker detection with portable, autonomous, and miniaturized devices can revolutionize medical diagnostics in the coming years. This article focuses on a comparison of traditional clinical practices with modern diagnostic techniques based on AI and machine learning (ML). The presented technologies will bypass laboratories and start being commercialized, which should lead to improvement or substitution of current diagnostic tools. Their application in PoC settings or as a consumer technology accessible to every patient appears to be a real possibility. Research in this field is expected to intensify in the coming years. Technological advancements in sensors and biosensors are anticipated to enable the continuous real-time analysis of various omics fields, fostering early disease detection and intervention strategies. The integration of AI with digital health platforms would enable predictive analysis and personalized healthcare, emphasizing the importance of interdisciplinary collaboration in related scientific fields.
Affiliation(s)
- Tomasz Wasilewski
- Department of Inorganic Chemistry, Faculty of Pharmacy, Medical University of Gdansk, Hallera 107, 80-416 Gdansk, Poland
| | - Wojciech Kamysz
- Department of Inorganic Chemistry, Faculty of Pharmacy, Medical University of Gdansk, Hallera 107, 80-416 Gdansk, Poland
| | - Jacek Gębicki
- Department of Process Engineering and Chemical Technology, Faculty of Chemistry, Gdansk University of Technology, Narutowicza 11/12, 80-233 Gdansk, Poland
|
13
|
Peters-Clarke TM, Coon JJ, Riley NM. Instrumentation at the Leading Edge of Proteomics. Anal Chem 2024; 96:7976-8010. [PMID: 38738990 PMCID: PMC11996003 DOI: 10.1021/acs.analchem.3c04497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/14/2024]
Affiliation(s)
- Trenton M. Peters-Clarke
- Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA
- Department of Biomolecular Chemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Joshua J. Coon
- Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA
- Department of Biomolecular Chemistry, University of Wisconsin-Madison, Madison, WI, USA
- Morgridge Institute for Research, Madison, WI, USA
|
14
|
Siraj A, Bouwmeester R, Declercq A, Welp L, Chernev A, Wulf A, Urlaub H, Martens L, Degroeve S, Kohlbacher O, Sachsenberg T. Intensity and retention time prediction improves the rescoring of protein-nucleic acid cross-links. Proteomics 2024; 24:e2300144. [PMID: 38629965 DOI: 10.1002/pmic.202300144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Revised: 12/29/2023] [Accepted: 01/05/2024] [Indexed: 04/19/2024]
Abstract
In protein-RNA cross-linking mass spectrometry, UV or chemical cross-linking introduces stable bonds between amino acids and nucleic acids in protein-RNA complexes that are then analyzed and detected in mass spectra. This analytical tool delivers valuable information about RNA-protein interactions and RNA docking sites in proteins, both in vitro and in vivo. The identification of cross-linked peptides with oligonucleotides of different length leads to a combinatorial increase in search space. We demonstrate that the peptide retention time prediction tasks can be transferred to the task of cross-linked peptide retention time prediction using a simple amino acid composition encoding, yielding improved identification rates when the prediction error is included in rescoring. For the more challenging task of including fragment intensity prediction of cross-linked peptides in the rescoring, we obtain, on average, a similar improvement. Further improvement in the encoding and fine-tuning of retention time and intensity prediction models might lead to further gains, and merit further research.
Affiliation(s)
- Arslan Siraj
- Department of Computer Science, Applied Bioinformatics, University of Tübingen, Tübingen, Germany
- Institute for Biological and Medical Informatics, University of Tübingen, Tübingen, Germany
| | - Robbin Bouwmeester
- Department of Biomolecular Medicine, Ghent University, Gent, Belgium
- VIB-UGent Center for Medical Biotechnology, VIB, Gent, Belgium
| | - Arthur Declercq
- Department of Biomolecular Medicine, Ghent University, Gent, Belgium
- VIB-UGent Center for Medical Biotechnology, VIB, Gent, Belgium
| | - Luisa Welp
- Bioanalytical Mass Spectrometry, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
- Bioanalytics, Institute of Clinical Chemistry, University Medical Center Göttingen, Göttingen, Germany
| | - Aleksandar Chernev
- Bioanalytical Mass Spectrometry, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
| | - Alexander Wulf
- Bioanalytical Mass Spectrometry, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
| | - Henning Urlaub
- Bioanalytical Mass Spectrometry, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
- Bioanalytics, Institute of Clinical Chemistry, University Medical Center Göttingen, Göttingen, Germany
| | - Lennart Martens
- Department of Biomolecular Medicine, Ghent University, Gent, Belgium
- VIB-UGent Center for Medical Biotechnology, VIB, Gent, Belgium
| | - Sven Degroeve
- Department of Biomolecular Medicine, Ghent University, Gent, Belgium
- VIB-UGent Center for Medical Biotechnology, VIB, Gent, Belgium
| | - Oliver Kohlbacher
- Department of Computer Science, Applied Bioinformatics, University of Tübingen, Tübingen, Germany
- Institute for Biological and Medical Informatics, University of Tübingen, Tübingen, Germany
| | - Timo Sachsenberg
- Department of Computer Science, Applied Bioinformatics, University of Tübingen, Tübingen, Germany
- Institute for Biological and Medical Informatics, University of Tübingen, Tübingen, Germany
|
15
|
Adams C, Laukens K, Bittremieux W, Boonen K. Machine learning-based peptide-spectrum match rescoring opens up the immunopeptidome. Proteomics 2024; 24:e2300336. [PMID: 38009585 DOI: 10.1002/pmic.202300336] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 09/06/2023] [Revised: 10/18/2023] [Accepted: 10/23/2023] [Indexed: 11/29/2023]
Abstract
Immunopeptidomics is a key technology in the discovery of targets for immunotherapy and vaccine development. However, identifying immunopeptides remains challenging due to their non-tryptic nature, which results in distinct spectral characteristics. Moreover, the absence of strict digestion rules leads to extensive search spaces, further amplified by the incorporation of somatic mutations, pathogen genomes, unannotated open reading frames, and post-translational modifications. This inflation in search space leads to an increase in random high-scoring matches, resulting in fewer identifications at a given false discovery rate. Peptide-spectrum match rescoring has emerged as a machine learning-based solution to address challenges in mass spectrometry-based immunopeptidomics data analysis. It involves post-processing unfiltered spectrum annotations to better distinguish between correct and incorrect peptide-spectrum matches. Recently, features based on predicted peptidoform properties, including fragment ion intensities, retention time, and collisional cross section, have been used to improve the accuracy and sensitivity of immunopeptide identification. In this review, we describe the diverse bioinformatics pipelines that are currently available for peptide-spectrum match rescoring and discuss how they can be used for the analysis of immunopeptidomics data. Finally, we provide insights into current and future machine learning solutions to boost immunopeptide identification.
Affiliation(s)
- Charlotte Adams
- Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium
- Laboratory of Protein Science, Proteomics and Epigenetic Signaling (PPES), Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
| | - Kris Laukens
- Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium
| | - Wout Bittremieux
- Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium
| | - Kurt Boonen
- Laboratory of Protein Science, Proteomics and Epigenetic Signaling (PPES), Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
- ImmuneSpec BV, Niel, Belgium
|
16
|
Strauss MT, Bludau I, Zeng WF, Voytik E, Ammar C, Schessner JP, Ilango R, Gill M, Meier F, Willems S, Mann M. AlphaPept: a modern and open framework for MS-based proteomics. Nat Commun 2024; 15:2168. [PMID: 38461149 PMCID: PMC10924963 DOI: 10.1038/s41467-024-46485-4] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Received: 12/20/2022] [Accepted: 02/20/2024] [Indexed: 03/11/2024] Open
Abstract
In common with other omics technologies, mass spectrometry (MS)-based proteomics produces ever-increasing amounts of raw data, making efficient analysis a principal challenge. A plethora of different computational tools can process the MS data to derive peptide and protein identification and quantification. However, in recent years there has been dramatic progress in computer science, including collaboration tools that have transformed research and industry. To leverage these advances, we develop AlphaPept, a Python-based open-source framework for efficient processing of large high-resolution MS data sets. Numba for just-in-time compilation on CPU and GPU achieves hundred-fold speed improvements. AlphaPept uses the Python scientific stack of highly optimized packages, reducing the code base to domain-specific tasks while accessing the latest advances. We provide an easy on-ramp for community contributions through the concept of literate programming, implemented in Jupyter Notebooks. Large datasets can rapidly be processed as shown by the analysis of hundreds of proteomes in minutes per file, many-fold faster than acquisition. AlphaPept can be used to build automated processing pipelines with web-serving functionality and compatibility with downstream analysis tools. It provides easy access via one-click installation, a modular Python library for advanced users, and an open GitHub repository for developers.
Affiliation(s)
- Maximilian T Strauss
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany.
- NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark.
| | - Isabell Bludau
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Wen-Feng Zeng
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Eugenia Voytik
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Constantin Ammar
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Julia P Schessner
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | | | | | - Florian Meier
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Functional Proteomics, Jena University Hospital, Jena, Germany
| | - Sander Willems
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Matthias Mann
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany.
- NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark.
|
17
|
Chandra A, Sharma A, Dehzangi I, Tsunoda T, Sattar A. PepCNN deep learning tool for predicting peptide binding residues in proteins using sequence, structural, and language model features. Sci Rep 2023; 13:20882. [PMID: 38016996 PMCID: PMC10684570 DOI: 10.1038/s41598-023-47624-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 09/07/2023] [Accepted: 11/16/2023] [Indexed: 11/30/2023] Open
Abstract
Protein-peptide interactions play a crucial role in various cellular processes and are implicated in abnormal cellular behaviors leading to diseases such as cancer. Therefore, understanding these interactions is vital for both functional genomics and drug discovery efforts. Despite a significant increase in the availability of protein-peptide complexes, experimental methods for studying these interactions remain laborious, time-consuming, and expensive. Computational methods offer a complementary approach but often fall short in terms of prediction accuracy. To address these challenges, we introduce PepCNN, a deep learning-based prediction model that incorporates structural and sequence-based information from primary protein sequences. By utilizing a combination of half-sphere exposure, position-specific scoring matrices from a multiple-sequence alignment tool, and embeddings from a pre-trained protein language model, PepCNN outperforms state-of-the-art methods in terms of specificity, precision, and AUC. The PepCNN software and datasets are publicly available at https://github.com/abelavit/PepCNN.git.
Affiliation(s)
- Abel Chandra
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.
| | - Alok Sharma
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan.
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
| | - Iman Dehzangi
- Department of Computer Science, Rutgers University, Camden, NJ, USA
- Center for Computational and Integrative Biology, Rutgers University, Camden, USA
| | - Tatsuhiko Tsunoda
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
| | - Abdul Sattar
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
|
18
|
Will A, Oliinyk D, Bleiholder C, Meier F. Peptide collision cross sections of 22 post-translational modifications. Anal Bioanal Chem 2023; 415:6633-6645. [PMID: 37758903 PMCID: PMC10598134 DOI: 10.1007/s00216-023-04957-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 12/23/2022] [Revised: 07/13/2023] [Accepted: 08/23/2023] [Indexed: 09/29/2023]
Abstract
Recent advances have rekindled the interest in ion mobility as an additional dimension of separation in mass spectrometry (MS)-based proteomics. Ion mobility separates ions according to their size and shape in the gas phase. Here, we set out to investigate the effect of 22 different post-translational modifications (PTMs) on the collision cross section (CCS) of peptides. In total, we analyzed ~4300 pairs of matching modified and unmodified peptide ion species by trapped ion mobility spectrometry (TIMS). Linear alignment based on spike-in reference peptides resulted in highly reproducible CCS values with a median coefficient of variation of 0.26%. On a global level, we observed a redistribution in the m/z vs. ion mobility space for modified peptides upon changes in their charge state. Pairwise comparison between modified and unmodified peptides of the same charge state revealed median shifts in CCS between -1.4% (arginine citrullination) and +4.5% (O-GlcNAcylation). In general, increasing modified peptide masses were correlated with higher CCS values, in particular within homologous PTM series. However, investigating the ion populations in more detail, we found that the change in CCS can vary substantially for a given PTM and is partially correlated with the gas phase structure of its unmodified counterpart. In conclusion, our study shows PTM- and sequence-specific effects on the cross section of peptides, which could be further leveraged for proteome-wide PTM analysis.
Affiliation(s)
- Andreas Will
- Functional Proteomics, Jena University Hospital, Am Klinikum 1, 07747, Jena, Germany
| | - Denys Oliinyk
- Functional Proteomics, Jena University Hospital, Am Klinikum 1, 07747, Jena, Germany
| | - Christian Bleiholder
- Department of Chemistry and Biochemistry, Florida State University, Tallahassee, FL, 32304, USA
| | - Florian Meier
- Functional Proteomics, Jena University Hospital, Am Klinikum 1, 07747, Jena, Germany.
|
19
|
Tüting C, Schmidt L, Skalidis I, Sinz A, Kastritis PL. Enabling cryo-EM density interpretation from yeast native cell extracts by proteomics data and AlphaFold structures. Proteomics 2023; 23:e2200096. [PMID: 37016452 DOI: 10.1002/pmic.202200096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Revised: 03/23/2023] [Accepted: 03/24/2023] [Indexed: 04/06/2023]
Abstract
In the cellular context, proteins participate in communities to perform their function. The detection and identification of these communities as well as in-community interactions has long been the subject of investigation, mainly through proteomics analysis with mass spectrometry. With the advent of cryogenic electron microscopy and the "resolution revolution," their visualization has recently been made possible, even in complex, native samples. The advances in both fields have resulted in the generation of large amounts of data, whose analysis requires advanced computation, often employing machine learning approaches to reach the desired outcome. In this work, we first performed a robust proteomics analysis of mass spectrometry (MS) data derived from a yeast native cell extract and used this information to identify protein communities and inter-protein interactions. Cryo-EM analysis of the cell extract provided a reconstruction of a biomolecule at medium resolution (∼8 Å (FSC = 0.143)). Utilizing MS-derived proteomics data and systematic fitting of AlphaFold-predicted atomic models, this density was assigned to the 2.6 MDa complex of yeast fatty acid synthase. Our proposed workflow identifies protein complexes in native cell extracts from Saccharomyces cerevisiae by combining proteomics, cryo-EM, and AI-guided protein structure prediction.
Affiliation(s)
- Christian Tüting
- Interdisciplinary Research Center HALOmem, Charles Tanford Protein Center, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
- Institute of Biochemistry and Biotechnology, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
- Biozentrum, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| | - Lisa Schmidt
- Interdisciplinary Research Center HALOmem, Charles Tanford Protein Center, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
- Institute of Biochemistry and Biotechnology, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| | - Ioannis Skalidis
- Interdisciplinary Research Center HALOmem, Charles Tanford Protein Center, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
- Institute of Biochemistry and Biotechnology, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| | - Andrea Sinz
- Institute of Pharmacy, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
- Center for Structural Mass Spectrometry, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| | - Panagiotis L Kastritis
- Interdisciplinary Research Center HALOmem, Charles Tanford Protein Center, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
- Institute of Biochemistry and Biotechnology, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
- Biozentrum, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
- Institute of Chemical Biology, National Hellenic Research Foundation, Athens, Greece
|
20
|
Ng CCA, Zhou Y, Yao ZP. Algorithms for de-novo sequencing of peptides by tandem mass spectrometry: A review. Anal Chim Acta 2023; 1268:341330. [PMID: 37268337 DOI: 10.1016/j.aca.2023.341330] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 12/22/2022] [Revised: 05/04/2023] [Accepted: 05/06/2023] [Indexed: 06/04/2023]
Abstract
Peptide sequencing is of great significance to fundamental and applied research in fields such as chemical, biological, medicinal and pharmaceutical sciences. With the rapid development of mass spectrometry and sequencing algorithms, de-novo peptide sequencing using tandem mass spectrometry (MS/MS) has become the main method for determining amino acid sequences of novel and unknown peptides. Advanced algorithms allow the amino acid sequence information to be accurately obtained from MS/MS spectra in a short time. In this review, algorithms from exhaustive search to state-of-the-art machine learning and neural networks for high-throughput and automated de-novo sequencing are introduced and compared. Impacts of datasets on algorithm performance are highlighted. The current limitations and promising directions of de-novo peptide sequencing are also discussed in this review.
Affiliation(s)
- Cheuk Chi A Ng
- State Key Laboratory of Chemical Biology and Drug Discovery, and Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong Special Administrative Region of China; Research Institute for Future Food, and Research Center for Chinese Medicine Innovation, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong Special Administrative Region of China; State Key Laboratory of Chinese Medicine and Molecular Pharmacology (Incubation), and Shenzhen Key Laboratory of Food Biological Safety Control, The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, 518057, China
| | - Yin Zhou
- State Key Laboratory of Chemical Biology and Drug Discovery, and Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong Special Administrative Region of China; Research Institute for Future Food, and Research Center for Chinese Medicine Innovation, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong Special Administrative Region of China; State Key Laboratory of Chinese Medicine and Molecular Pharmacology (Incubation), and Shenzhen Key Laboratory of Food Biological Safety Control, The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, 518057, China
| | - Zhong-Ping Yao
- State Key Laboratory of Chemical Biology and Drug Discovery, and Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong Special Administrative Region of China; Research Institute for Future Food, and Research Center for Chinese Medicine Innovation, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong Special Administrative Region of China; State Key Laboratory of Chinese Medicine and Molecular Pharmacology (Incubation), and Shenzhen Key Laboratory of Food Biological Safety Control, The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, 518057, China.
|
21
|
Abdul-Khalek N, Wimmer R, Overgaard MT, Gregersen Echers S. Insight on physicochemical properties governing peptide MS1 response in HPLC-ESI-MS/MS: A deep learning approach. Comput Struct Biotechnol J 2023; 21:3715-3727. [PMID: 37560124 PMCID: PMC10407266 DOI: 10.1016/j.csbj.2023.07.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2023] [Revised: 07/13/2023] [Accepted: 07/19/2023] [Indexed: 08/11/2023] Open
Abstract
Accurate and absolute quantification of peptides in complex mixtures using quantitative mass spectrometry (MS)-based methods requires foreground knowledge and isotopically labeled standards, thereby increasing analytical expenses, time consumption, and labor, thus limiting the number of peptides that can be accurately quantified. This originates from differential ionization efficiency between peptides and thus, understanding the physicochemical properties that influence the ionization and response in MS analysis is essential for developing less restrictive label-free quantitative methods. Here, we used equimolar peptide pool repository data to develop a deep learning model capable of identifying amino acids influencing the MS1 response. By using an encoder-decoder with an attention mechanism and correlating attention weights with amino acid physicochemical properties, we obtain insight on properties governing the peptide-level MS1 response within the datasets. While the problem cannot be described by one single set of amino acids and properties, distinct patterns were reproducibly obtained. Properties are grouped in three main categories related to peptide hydrophobicity, charge, and structural propensities. Moreover, our model can predict MS1 intensity output under defined conditions based solely on peptide sequence input. Using a refined training dataset, the model predicted log-transformed peptide MS1 intensities with an average error of 9.7 ± 0.5% based on 5-fold cross validation, and outperformed random forest and ridge regression models on both log-transformed and real scale data. This work demonstrates how deep learning can facilitate identification of physicochemical properties influencing peptide MS1 responses, but also illustrates how sequence-based response prediction and label-free peptide-level quantification may impact future workflows within quantitative proteomics.
Affiliation(s)
- Naim Abdul-Khalek
- Department of Chemistry and Bioscience, Aalborg University, Aalborg 9220, Denmark
| | - Reinhard Wimmer
- Department of Chemistry and Bioscience, Aalborg University, Aalborg 9220, Denmark
|
22
|
Wilburn DB, Shannon AE, Spicer V, Richards AL, Yeung D, Swaney DL, Krokhin OV, Searle BC. Deep learning from harmonized peptide libraries enables retention time prediction of diverse post translational modifications. bioRxiv 2023:2023.05.30.542978. [PMID: 37398395 PMCID: PMC10312522 DOI: 10.1101/2023.05.30.542978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
In proteomics experiments, peptide retention time (RT) is an orthogonal property to fragmentation when assessing detection confidence. Advances in deep learning enable accurate RT prediction for any peptide from sequence alone, including those yet to be experimentally observed. Here we present Chronologer, an open-source software tool for rapid and accurate peptide RT prediction. Using new approaches to harmonize and false-discovery correct across independently collected datasets, Chronologer is built on a massive database with >2.2 million peptides including 10 common post-translational modification (PTM) types. By linking knowledge learned across diverse peptide chemistries, Chronologer predicts RTs with less than two-thirds the error of other deep learning tools. We show how RT for rare PTMs, such as O-GlcNAc, can be learned with high accuracy using as few as 10-100 example peptides in newly harmonized datasets. This iteratively updatable workflow enables Chronologer to comprehensively predict RTs for PTM-marked peptides across entire proteomes.
|
23
|
Zhang Y, Jian X, Xu L, Zhao J, Lu M, Lin Y, Xie L. iTCep: a deep learning framework for identification of T cell epitopes by harnessing fusion features. Front Genet 2023; 14:1141535. [PMID: 37229205 PMCID: PMC10203616 DOI: 10.3389/fgene.2023.1141535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Accepted: 04/20/2023] [Indexed: 05/27/2023] Open
Abstract
Neoantigens recognized by cytotoxic T cells are effective targets for tumor-specific immune responses for personalized cancer immunotherapy. Quite a few neoantigen identification pipelines and computational strategies have been developed to improve the accuracy of the peptide selection process. However, these methods mainly consider the neoantigen end and ignore the interaction between peptide and TCR and the preference of each residue in TCRs, so the filtered peptides often fail to truly elicit an immune response. Here, we propose a novel encoding approach for peptide-TCR representation. Subsequently, a deep learning framework, namely iTCep, was developed to predict the interactions between peptides and TCRs using fusion features derived from a feature-level fusion strategy. iTCep achieved high predictive performance with AUC up to 0.96 on the testing dataset and above 0.86 on independent datasets, presenting better prediction performance compared with other predictors. Our results provided strong evidence that the iTCep model can be a reliable and robust method for predicting TCR binding specificities of given antigen peptides. One can access iTCep through a user-friendly web server at http://biostatistics.online/iTCep/, which supports prediction modes of peptide-TCR pairs and peptide-only. A stand-alone software program for T cell epitope prediction is also available for convenient installation at https://github.com/kbvstmd/iTCep/.
Affiliation(s)
- Yu Zhang
- School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China
- Shanghai-MOST Key Laboratory of Health and Disease Genomics, Institute of Genome and Bioinformatics, Shanghai Institute for Biomedical and Pharmaceutical Technologies, Shanghai, China
| | - Xingxing Jian
- Shanghai-MOST Key Laboratory of Health and Disease Genomics, Institute of Genome and Bioinformatics, Shanghai Institute for Biomedical and Pharmaceutical Technologies, Shanghai, China
- Bioinformatics Center, National Clinical Research Centre for Geriatric Disorders, Department of Geriatrics, Xiangya Hospital, Central South University, Changsha, Hunan, China
| | - Linfeng Xu
- Shanghai-MOST Key Laboratory of Health and Disease Genomics, Institute of Genome and Bioinformatics, Shanghai Institute for Biomedical and Pharmaceutical Technologies, Shanghai, China
- Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, Institute of Bio-Diversity Science, School of Life Sciences, Fudan University, Shanghai, China
| | - Jingjing Zhao
- Shanghai-MOST Key Laboratory of Health and Disease Genomics, Institute of Genome and Bioinformatics, Shanghai Institute for Biomedical and Pharmaceutical Technologies, Shanghai, China
| | - Manman Lu
- Shanghai-MOST Key Laboratory of Health and Disease Genomics, Institute of Genome and Bioinformatics, Shanghai Institute for Biomedical and Pharmaceutical Technologies, Shanghai, China
| | - Yong Lin
- School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China
| | - Lu Xie
- Shanghai-MOST Key Laboratory of Health and Disease Genomics, Institute of Genome and Bioinformatics, Shanghai Institute for Biomedical and Pharmaceutical Technologies, Shanghai, China
| |
|
24
|
Rafay A, Aziz M, Zia A, Asif AR. Automated Retrieval of Heterogeneous Proteomic Data for Machine Learning. J Pers Med 2023; 13:790. [PMID: 37240960 PMCID: PMC10222177 DOI: 10.3390/jpm13050790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Revised: 04/28/2023] [Accepted: 04/28/2023] [Indexed: 05/28/2023] Open
Abstract
Proteomics instrumentation and the corresponding bioinformatics tools have evolved at a rapid pace in the last 20 years, whereas the exploitation of deep learning techniques in proteomics is on the horizon. The ability to revisit proteomics raw data, in particular, could be a valuable resource for machine learning applications seeking new insight into protein expression and functions of previously acquired data from different instruments under various lab conditions. We map publicly available proteomics repositories (such as ProteomeXchange) and relevant publications to extract MS/MS data to form one large database that contains the patient history and mass spectrometric data acquired for the patient sample. The extracted mapped dataset should enable the research to overcome the issues attached to the dispersions of proteomics data on the internet, which makes it difficult to apply emerging new bioinformatics tools and deep learning algorithms. The workflow proposed in this study enables a linked large dataset of heart-related proteomics data, which could be easily and efficiently applied to machine learning and deep learning algorithms for futuristic predictions of heart diseases and modeling. Data scraping and crawling offer a powerful tool to harvest and prepare the training and test datasets; however, the authors advocate caution because of ethical and legal issues, as well as the need to ensure the quality and accuracy of the data that are being collected.
Affiliation(s)
- Abdul Rafay
- Department for Clinical Chemistry/Interdisciplinary UMG Laboratories, University Medical Center, 37075 Göttingen, Germany
- Future Networks, eScience Group, Gesellschaft für Wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG), 37077 Göttingen, Germany
| | - Muzzamil Aziz
- Future Networks, eScience Group, Gesellschaft für Wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG), 37077 Göttingen, Germany
| | - Amjad Zia
- Department for Clinical Chemistry/Interdisciplinary UMG Laboratories, University Medical Center, 37075 Göttingen, Germany
| | - Abdul R. Asif
- Department for Clinical Chemistry/Interdisciplinary UMG Laboratories, University Medical Center, 37075 Göttingen, Germany
- German Centre for Cardiovascular Research (DZHK), Partner Site Göttingen, 37075 Göttingen, Germany
| |
|
25
|
Letunica N, McCafferty C, Swaney E, Cai T, Monagle P, Ignjatovic V, Attard C. Proteomic Applications and Considerations: From Research to Patient Care. Methods Mol Biol 2023; 2628:181-192. [PMID: 36781786 DOI: 10.1007/978-1-0716-2978-9_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/15/2023]
Abstract
Despite technological advancements in the field of proteomics, the rate at which serum and plasma biomarkers identified using proteomic approaches are translated into clinical use remains extremely low. In this chapter, we describe recent technological advancements and analytical strategies in proteomic methods. We also describe the progress of proteomic blood-based biomarkers to date and discuss what the future of proteomics might entail with the use of multi-omic approaches and implementing machine learning on large proteomic datasets. Lastly, we provide several key considerations for biomarker studies, ranging from sample type to the use of reference samples, in order to achieve progress from bench to bedside, ultimately improving patient diagnosis, disease, and/or therapeutic monitoring and care.
Affiliation(s)
- Natasha Letunica
- Haematology Research, Murdoch Children's Research Institute, Melbourne, VIC, Australia
| | - Conor McCafferty
- Haematology Research, Murdoch Children's Research Institute, Melbourne, VIC, Australia.,Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
| | - Ella Swaney
- Haematology Research, Murdoch Children's Research Institute, Melbourne, VIC, Australia.,Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
| | - Tengyi Cai
- Haematology Research, Murdoch Children's Research Institute, Melbourne, VIC, Australia.,Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
| | - Paul Monagle
- Haematology Research, Murdoch Children's Research Institute, Melbourne, VIC, Australia.,Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia.,Department of Clinical Haematology, Royal Children's Hospital, Melbourne, VIC, Australia.,Kids Cancer Centre, Sydney Children's Hospital, Randwick, NSW, Australia
| | - Vera Ignjatovic
- Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia.,Institute for Clinical and Translational Research, Johns Hopkins All Children's Hospital, St. Petersburg, USA.,Department of Pediatrics, Johns Hopkins University, Baltimore, USA
| | - Chantal Attard
- Haematology Research, Murdoch Children's Research Institute, Melbourne, VIC, Australia.,Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia.,The Royal Children's Hospital, Parkville, VIC, Australia
| |
|
26
|
Rehfeldt T, Gabriels R, Bouwmeester R, Gessulat S, Neely BA, Palmblad M, Perez-Riverol Y, Schmidt T, Vizcaíno JA, Deutsch EW. ProteomicsML: An Online Platform for Community-Curated Data sets and Tutorials for Machine Learning in Proteomics. J Proteome Res 2023; 22:632-636. [PMID: 36693629 PMCID: PMC9903315 DOI: 10.1021/acs.jproteome.2c00629] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 10/04/2022] [Indexed: 01/26/2023]
Abstract
Data set acquisition and curation are often the most difficult and time-consuming parts of a machine learning endeavor. This is especially true for proteomics-based liquid chromatography (LC) coupled to mass spectrometry (MS) data sets, due to the high levels of data reduction that occur between raw data and machine learning-ready data. Since predictive proteomics is an emerging field, when predicting peptide behavior in LC-MS setups, each lab often uses unique and complex data processing pipelines in order to maximize performance, at the cost of accessibility and reproducibility. For this reason we introduce ProteomicsML, an online resource for proteomics-based data sets and tutorials across most of the currently explored physicochemical peptide properties. This community-driven resource makes it simple to access data in easy-to-process formats, and contains easy-to-follow tutorials that allow new users to interact with even the most advanced algorithms in the field. ProteomicsML provides data sets that are useful for comparing state-of-the-art machine learning algorithms, as well as providing introductory material for teachers and newcomers to the field alike. The platform is freely available at https://www.proteomicsml.org/, and we welcome the entire proteomics community to contribute to the project at https://github.com/ProteomicsML/ProteomicsML.
Affiliation(s)
- Tobias G. Rehfeldt
- Institute for Mathematics and Computer Science, University of Southern Denmark, 5000 Odense, Denmark
| | - Ralf Gabriels
- VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
- Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
| | - Robbin Bouwmeester
- VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
- Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
| | | | - Benjamin A. Neely
- National Institute of Standards and Technology, Charleston, South Carolina 29412, United States
| | - Magnus Palmblad
- Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
| | - Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | | | - Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Eric W. Deutsch
- Institute for Systems Biology, Seattle, Washington 98109, United States
| |
|
27
|
Carrillo-Rodriguez P, Selheim F, Hernandez-Valladares M. Mass Spectrometry-Based Proteomics Workflows in Cancer Research: The Relevance of Choosing the Right Steps. Cancers (Basel) 2023; 15:555. [PMID: 36672506 PMCID: PMC9856946 DOI: 10.3390/cancers15020555] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 01/04/2023] [Accepted: 01/12/2023] [Indexed: 01/19/2023] Open
Abstract
The qualitative and quantitative evaluation of proteome changes that condition cancer development can be achieved with liquid chromatography-mass spectrometry (LC-MS). LC-MS-based proteomics strategies are carried out according to predesigned workflows that comprise several steps such as sample selection, sample processing including labeling, MS acquisition methods, statistical treatment, and bioinformatics to understand the biological meaning of the findings and set predictive classifiers. As the choice of best options might not be straightforward, we herein review and assess past and current proteomics approaches for the discovery of new cancer biomarkers. Moreover, we review major bioinformatics tools for interpreting and visualizing proteomics results and suggest the most popular machine learning techniques for the selection of predictive biomarkers. Finally, we consider the approximation of proteomics strategies for clinical diagnosis and prognosis by discussing current barriers and proposals to circumvent them.
Affiliation(s)
- Paula Carrillo-Rodriguez
- Proteomics Unit of University of Bergen (PROBE), University of Bergen, Jonas Lies vei 91, 5009 Bergen, Norway
- Vall d’Hebron Institute of Oncology (VHIO), 08035 Barcelona, Spain
| | - Frode Selheim
- Proteomics Unit of University of Bergen (PROBE), University of Bergen, Jonas Lies vei 91, 5009 Bergen, Norway
| | - Maria Hernandez-Valladares
- Proteomics Unit of University of Bergen (PROBE), University of Bergen, Jonas Lies vei 91, 5009 Bergen, Norway
- Department of Physical Chemistry, University of Granada, Avenida de la Fuente Nueva S/N, 18071 Granada, Spain
- Instituto de Investigación Biosanitaria ibs.GRANADA, 18012 Granada, Spain
| |
|
28
|
Gutiérrez-Mondragón MA, König C, Vellido A. Layer-Wise Relevance Analysis for Motif Recognition in the Activation Pathway of the β2-Adrenergic GPCR Receptor. Int J Mol Sci 2023; 24:1155. [PMID: 36674669 PMCID: PMC9865744 DOI: 10.3390/ijms24021155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2022] [Revised: 12/22/2022] [Accepted: 12/30/2022] [Indexed: 01/11/2023] Open
Abstract
G-protein-coupled receptors (GPCRs) are cell membrane proteins of relevance as therapeutic targets, and are associated with the development of treatments for illnesses such as diabetes, Alzheimer's, or even cancer. Therefore, comprehending the underlying mechanisms of the receptor functional properties is of particular interest in pharmacoproteomics and in disease therapy at large. Their interaction with ligands elicits multiple molecular rearrangements all along their structure, inducing activation pathways that distinctly influence the cell response. In this work, we studied GPCR signaling pathways from molecular dynamics simulations as they provide rich information about the dynamic nature of the receptors. We focused on studying the molecular properties of the receptors using deep-learning-based methods. In particular, we designed and trained a one-dimensional convolutional neural network and illustrated its use in classifying the conformational states (active, intermediate, or inactive) of the β2-adrenergic receptor when bound to the full agonist BI-167107. Through a novel explainability-oriented investigation of the prediction results, we were able to identify and assess the contribution of individual motifs (residues) influencing a particular activation pathway. Consequently, we contribute a methodology that assists in the elucidation of the underlying mechanisms of receptor activation-deactivation.
Affiliation(s)
- Mario A. Gutiérrez-Mondragón
- Computer Science Department, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
- Intelligent Data Science and Artificial Intelligence (IDEAI-UPC) Research Center, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
| | - Caroline König
- Computer Science Department, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
- Intelligent Data Science and Artificial Intelligence (IDEAI-UPC) Research Center, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
| | - Alfredo Vellido
- Computer Science Department, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
- Intelligent Data Science and Artificial Intelligence (IDEAI-UPC) Research Center, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
| |
|
29
|
Cox J. Prediction of peptide mass spectral libraries with machine learning. Nat Biotechnol 2023; 41:33-43. [PMID: 36008611 DOI: 10.1038/s41587-022-01424-w] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 03/01/2022] [Accepted: 07/11/2022] [Indexed: 01/21/2023]
Abstract
The recent development of machine learning methods to identify peptides in complex mass spectrometric data constitutes a major breakthrough in proteomics. Longstanding methods for peptide identification, such as search engines and experimental spectral libraries, are being superseded by deep learning models that allow the fragmentation spectra of peptides to be predicted from their amino acid sequence. These new approaches, including recurrent neural networks and convolutional neural networks, use predicted in silico spectral libraries rather than experimental libraries to achieve higher sensitivity and/or specificity in the analysis of proteomics data. Machine learning is galvanizing applications that involve large search spaces, such as immunopeptidomics and proteogenomics. Current challenges in the field include the prediction of spectra for peptides with post-translational modifications and for cross-linked pairs of peptides. Permeation of machine-learning-based spectral prediction into search engines and spectrum-centric data-independent acquisition workflows for diverse peptide classes and measurement conditions will continue to push sensitivity and dynamic range in proteomics applications in the coming years.
Affiliation(s)
- Jürgen Cox
- Computational Systems Biochemistry Research Group, Max-Planck Institute of Biochemistry, Martinsried, Germany.
- Department of Biological and Medical Psychology, University of Bergen, Bergen, Norway.
| |
|
30
|
Computational Thinking Training and Deep Learning Evaluation Model Construction Based on Scratch Modular Programming Course. Comput Intell Neurosci 2023; 2023:3760957. [PMID: 36873382 PMCID: PMC9977527 DOI: 10.1155/2023/3760957] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2022] [Revised: 08/11/2022] [Accepted: 08/17/2022] [Indexed: 02/25/2023]
Abstract
To improve the algorithmic dimension, critical thinking, and problem-solving ability of computational thinking (CT) in students' programming courses, first, a programming teaching model is constructed based on the Scratch modular programming course. Secondly, the design process of the teaching model and the problem-solving model of visual programming are studied. Finally, a deep learning (DL) evaluation model is constructed, and the effectiveness of the designed teaching model is analyzed and evaluated. The paired-samples t-test result for CT is t = -2.08, P < 0.05. There are significant differences between the results of the two tests, and the designed teaching model can cause changes in students' CT abilities. The results reveal that the effectiveness of the teaching model based on Scratch modular programming has been verified experimentally. The post-test values of the dimensions of algorithmic thinking, critical thinking, collaborative thinking, and problem-solving thinking are all higher than the pretest values, and there are individual differences. The P values are all less than 0.05, which indicates that CT training with the designed teaching model improves the algorithmic dimension, critical thinking, collaborative thinking, and problem-solving ability of students' CT. The post-test values of cognitive load are all lower than the pretest values, indicating that the model has a certain positive effect on reducing cognitive load, and there is a significant difference between the pretest and post-test. In the dimension of creative thinking, the P value is 0.218, and there is no obvious difference in the dimensions of creativity and self-efficacy. It can be found from the DL evaluation that the average value of the DL knowledge and skills dimensions is greater than 3.5, and college students can reach a certain standard level in terms of knowledge and skills. The mean value of the process and method dimensions is about 3.1, and the mean value of the emotional attitudes and values is 2.77. The process and method, as well as emotional attitude and values, need to be strengthened. The DL level of college students is relatively low, and it is necessary to improve their DL level from the perspective of knowledge and skills, processes and methods, emotional attitudes and values. This research makes up for the shortcomings of traditional programming and design software to a certain extent. It has a certain reference value for researchers and teachers to carry out programming teaching practice.
|
31
|
Kong S, Gong P, Zeng WF, Jiang B, Hou X, Zhang Y, Zhao H, Liu M, Yan G, Zhou X, Qiao X, Wu M, Yang P, Liu C, Cao W. pGlycoQuant with a deep residual network for quantitative glycoproteomics at intact glycopeptide level. Nat Commun 2022; 13:7539. [PMID: 36477196 PMCID: PMC9729625 DOI: 10.1038/s41467-022-35172-x] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 05/21/2022] [Accepted: 11/17/2022] [Indexed: 12/12/2022] Open
Abstract
Large-scale intact glycopeptide identification has been advanced by software tools. However, tools for quantitative analysis lag behind, which hinders the exploration of differential site-specific glycosylation. Here, we report pGlycoQuant, a generic tool for both primary and tandem mass spectrometry-based intact glycopeptide quantitation. pGlycoQuant advances glycopeptide matching by applying a deep learning model that reduces missing values by 19-89% compared with Byologic, MSFragger-Glyco, Skyline, and Proteome Discoverer, as well as a Match In Run algorithm for more glycopeptide coverage, greatly expanding the quantitative function of several widely used search engines, including pGlyco 2.0, pGlyco3, Byonic and MSFragger-Glyco. Further application of pGlycoQuant to the N-glycoproteomic study in three different metastatic HCC cell lines quantifies 6435 intact N-glycopeptides and, together with in vitro molecular biology experiments, illustrates site 979-core fucosylation of L1CAM as a potential regulator of HCC metastasis. We expect further applications of the freely available pGlycoQuant in glycoproteomic studies.
Affiliation(s)
- Siyuan Kong
- Shanghai Fifth People's Hospital and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
| | - Pengyun Gong
- School of Engineering Medicine & School of Biological Science and Medical Engineering, Beihang University, Beijing, China
| | - Wen-Feng Zeng
- Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
- Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Biyun Jiang
- Shanghai Fifth People's Hospital and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
| | - Xinhang Hou
- School of Engineering Medicine & School of Biological Science and Medical Engineering, Beihang University, Beijing, China
| | - Yang Zhang
- Shanghai Fifth People's Hospital and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
| | - Huanhuan Zhao
- Shanghai Fifth People's Hospital and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
| | - Mingqi Liu
- Shanghai Fifth People's Hospital and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
| | - Guoquan Yan
- Shanghai Fifth People's Hospital and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
| | - Xinwen Zhou
- Shanghai Fifth People's Hospital and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
| | - Xihua Qiao
- School of Engineering Medicine & School of Biological Science and Medical Engineering, Beihang University, Beijing, China
| | - Mengxi Wu
- Shanghai Fifth People's Hospital and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
| | - Pengyuan Yang
- Shanghai Fifth People's Hospital and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
- NHC Key Laboratory of Glycoconjugates Research, Fudan University, Shanghai, China
| | - Chao Liu
- School of Engineering Medicine & School of Biological Science and Medical Engineering, Beihang University, Beijing, China.
- Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China.
| | - Weiqian Cao
- Shanghai Fifth People's Hospital and Institutes of Biomedical Sciences, Fudan University, Shanghai, China.
- NHC Key Laboratory of Glycoconjugates Research, Fudan University, Shanghai, China.
| |
|
32
|
Gill ML. The rise of the machines in chemistry. Magn Reson Chem 2022; 60:1044-1051. [PMID: 35976263 DOI: 10.1002/mrc.5304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2021] [Revised: 08/07/2022] [Accepted: 08/09/2022] [Indexed: 06/15/2023]
Abstract
The use of artificial intelligence and, more specifically, deep learning methods in chemistry is becoming increasingly common. Applications in informatics fields, such as cheminformatics and proteomics, structural biology, and spectroscopy, including NMR, are on the rise. Recent developments in model architectures, such as graph convolutional neural networks and transformers, have been enabled by advancements in computational hardware and software. However, model architectures with more predictive power often require larger amounts of training data, which can be challenging to acquire, but this requirement can be mitigated through techniques like pretraining and fine-tuning. In spite of these successes, challenges remain, such as normalization and scaling of data, availability of experimentally acquired data, and model explainability.
|
33
|
Desaire H, Go EP, Hua D. Advances, obstacles, and opportunities for machine learning in proteomics. Cell Rep Phys Sci 2022; 3:101069. [PMID: 36381226 PMCID: PMC9648337 DOI: 10.1016/j.xcrp.2022.101069] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 06/16/2023]
Abstract
The fields of proteomics and machine learning are both large disciplines, each producing well over 5,000 publications per year. However, studies combining both fields are still relatively rare, with only about 2% of recent proteomics papers including machine learning. This review, which focuses on the intersection of the fields, is intended to inspire proteomics researchers to develop skills and knowledge in the application of machine learning. A brief tutorial introduction to machine learning is provided, and research advances that rely on both fields, particularly as they relate to proteomics tools development and biomarker discovery, are highlighted. Key knowledge gaps and opportunities for scientific advancement are also enumerated.
Affiliation(s)
- Heather Desaire
- Department of Chemistry, University of Kansas, Lawrence, KS 66045, USA
| | - Eden P. Go
- Department of Chemistry, University of Kansas, Lawrence, KS 66045, USA
| | - David Hua
- Department of Chemistry, University of Kansas, Lawrence, KS 66045, USA
| |
|
34
|
Yang Y, Qiao L. Data-independent acquisition proteomics methods for analyzing post-translational modifications. Proteomics 2022; 23:e2200046. [PMID: 36036492 DOI: 10.1002/pmic.202200046] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/22/2022] [Revised: 08/20/2022] [Accepted: 08/23/2022] [Indexed: 11/06/2022]
Abstract
Protein post-translational modifications (PTMs) increase the functional diversity of the cellular proteome. Accurate and high throughput identification and quantification of protein PTMs is a key task in proteomics research. Recent advancements in data-independent acquisition (DIA) mass spectrometry (MS) technology have achieved deep coverage and accurate quantification of proteins and PTMs. This review provides an overview of DIA data processing methods that cover three aspects of PTMs analysis, i.e., detection of PTMs, site localization, and characterization of complex modification moieties, such as glycosylation. In addition, a survey of deep learning methods that boost DIA-based PTMs analysis is presented, including in silico spectral library generation, as well as feature scoring and error rate control. The limitations and future directions of DIA methods for PTMs analysis are also discussed. Novel data analysis methods will take advantage of advanced MS instrumentation techniques to empower DIA MS for in-depth and accurate PTMs measurements.
Affiliation(s)
- Yi Yang
- Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
| | - Liang Qiao
- Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
| |
|
35
|
Perez-Riverol Y. Proteomic repository data submission, dissemination, and reuse: key messages. Expert Rev Proteomics 2022; 19:297-310. [PMID: 36529941 PMCID: PMC7614296 DOI: 10.1080/14789450.2022.2160324] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/28/2022] [Accepted: 12/07/2022] [Indexed: 12/23/2022]
Abstract
INTRODUCTION: The creation of ProteomeXchange data workflows in 2012 transformed the field of proteomics by standardizing data submission and dissemination and enabling the widespread reanalysis of public MS proteomics data worldwide. ProteomeXchange has triggered a growing trend toward public dissemination of proteomics data, facilitating the assessment, reuse, comparative analyses, and extraction of new findings from public datasets. As of 2022, the consortium comprises PRIDE, PeptideAtlas, MassIVE, jPOST, iProX, and Panorama Public. AREAS COVERED: Here, we review and discuss the current ecosystem of resources, guidelines, and file formats for proteomics data dissemination and reanalysis. Special attention is drawn to new exciting quantitative and post-translational modification-oriented resources. The challenges and future directions of data deposition, including the lack of metadata and of cloud-based and high-performance software solutions for fast and reproducible reanalysis of the available data, are discussed. EXPERT OPINION: The success of ProteomeXchange and the amount of proteomics data available in the public domain have triggered the creation and/or growth of other protein knowledgebase resources. Data reuse is a leading, active, and evolving field, supporting the creation of new formats, tools, and workflows to rediscover and reshape public proteomics data.
Affiliation(s)
- Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
| |
|
36
|
Chen W, McCool EN, Sun L, Zang Y, Ning X, Liu X. Evaluation of Machine Learning Models for Proteoform Retention and Migration Time Prediction in Top-Down Mass Spectrometry. J Proteome Res 2022; 21:1736-1747. [PMID: 35616364 PMCID: PMC9250612 DOI: 10.1021/acs.jproteome.2c00124] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Abstract
Reversed-phase liquid chromatography (RPLC) and capillary zone electrophoresis (CZE) are two primary proteoform separation methods in mass spectrometry (MS)-based top-down proteomics. Proteoform retention time (RT) prediction in RPLC and migration time (MT) prediction in CZE provide additional information for accurate proteoform identification and quantification. While existing methods are mainly focused on peptide RT and MT prediction in bottom-up MS, there is still a lack of methods for proteoform RT and MT prediction in top-down MS. We systematically evaluated eight machine learning models and a transfer learning method for proteoform RT prediction and five models and the transfer learning method for proteoform MT prediction. Experimental results showed that a gated recurrent unit (GRU)-based model with transfer learning achieved a high accuracy (R = 0.978) for proteoform RT prediction and that the GRU-based model and a fully connected neural network model obtained a high accuracy of R = 0.982 and 0.981 for proteoform MT prediction, respectively.
Affiliation(s)
- Wenrong Chen
- Department of BioHealth Informatics, Indiana University-Purdue University Indianapolis, Indianapolis, Indiana 46202, United States
| | - Elijah N McCool
- Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States
| | - Liangliang Sun
- Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States
| | - Yong Zang
- Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, Indiana 46202, United States
| | - Xia Ning
- Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio 43210, United States.,Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, United States.,Translational Data Analytics Institute, The Ohio State University, Columbus, Ohio 43210, United States
| | - Xiaowen Liu
- Tulane Center for Biomedical Informatics and Genomics, Tulane University, New Orleans, Louisiana 70112, United States.,Deming Department of Medicine, Tulane University, New Orleans, Louisiana 70112, United States
| |
|
37
|
García-Consuegra I, Asensio-Peña S, Garrido-Moraga R, Pinós T, Domínguez-González C, Santalla A, Nogales-Gadea G, Serrano-Lorenzo P, Andreu AL, Arenas J, Zugaza JL, Lucia A, Martín MA. Identification of Potential Muscle Biomarkers in McArdle Disease: Insights from Muscle Proteome Analysis. Int J Mol Sci 2022; 23:4650. [PMID: 35563042 PMCID: PMC9100117 DOI: 10.3390/ijms23094650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2021] [Revised: 04/03/2022] [Accepted: 04/18/2022] [Indexed: 02/04/2023] Open
Abstract
Glycogen storage disease type V (GSDV, McArdle disease) is a rare genetic myopathy caused by deficiency of the muscle isoform of glycogen phosphorylase (PYGM). This results in a block in the use of muscle glycogen as an energetic substrate, with subsequent exercise intolerance. The pathobiology of GSDV is still not fully understood, especially with regard to some features such as persistent muscle damage (i.e., even without prior exercise). We aimed at identifying potential muscle protein biomarkers of GSDV by analyzing the muscle proteome and the molecular networks associated with muscle dysfunction in these patients. Muscle biopsies from eight patients and eight healthy controls showing none of the features of McArdle disease, such as frequent contractures and persistent muscle damage, were studied by quantitative protein expression using isobaric tags for relative and absolute quantitation (iTRAQ) followed by artificial neuronal networks (ANNs) and topology analysis. Protein candidate validation was performed by Western blot. Several proteins predominantly involved in the process of muscle contraction and/or calcium homeostasis, such as myosin, sarcoplasmic/endoplasmic reticulum calcium ATPase 1, tropomyosin alpha-1 chain, troponin isoforms, and alpha-actinin-3, showed significantly lower expression levels in the muscle of GSDV patients. These proteins could be potential biomarkers of the persistent muscle damage in the absence of prior exertion reported in GSDV patients. Further studies are needed to elucidate the molecular mechanisms by which PYGM controls the expression of these proteins.
Affiliation(s)
- Inés García-Consuegra
- Mitochondrial and Neuromuscular Disorders Group, Hospital 12 de Octubre Health Research Institute (imas12), 28041 Madrid, Spain; (I.G.-C.); (S.A.-P.); (R.G.-M.); (C.D.-G.); (P.S.-L.); (J.A.); (A.L.)
- Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), 28029 Madrid, Spain;
| | - Sara Asensio-Peña
- Mitochondrial and Neuromuscular Disorders Group, Hospital 12 de Octubre Health Research Institute (imas12), 28041 Madrid, Spain; (I.G.-C.); (S.A.-P.); (R.G.-M.); (C.D.-G.); (P.S.-L.); (J.A.); (A.L.)
| | - Rocío Garrido-Moraga
- Mitochondrial and Neuromuscular Disorders Group, Hospital 12 de Octubre Health Research Institute (imas12), 28041 Madrid, Spain; (I.G.-C.); (S.A.-P.); (R.G.-M.); (C.D.-G.); (P.S.-L.); (J.A.); (A.L.)
| | - Tomàs Pinós
- Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), 28029 Madrid, Spain;
- Mitochondrial and Neuromuscular Disorders Unit, Vall d’Hebron Institut de Recerca, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain
| | - Cristina Domínguez-González
- Mitochondrial and Neuromuscular Disorders Group, Hospital 12 de Octubre Health Research Institute (imas12), 28041 Madrid, Spain; (I.G.-C.); (S.A.-P.); (R.G.-M.); (C.D.-G.); (P.S.-L.); (J.A.); (A.L.)
- Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), 28029 Madrid, Spain;
| | - Alfredo Santalla
- Department of Computer and Sport Sciences, Universidad Pablo de Olavide, 41013 Sevilla, Spain;
| | - Gisela Nogales-Gadea
- Grup de Recerca en Malalties Neuromusculars i Neuropediàtriques, Department of Neurosciences, Institut d’Investigacio en Ciencies de la Salut Germans Trias i Pujol i Campus Can Ruti, Universitat Autònoma de Barcelona, 08916 Barcelona, Spain;
| | - Pablo Serrano-Lorenzo
- Mitochondrial and Neuromuscular Disorders Group, Hospital 12 de Octubre Health Research Institute (imas12), 28041 Madrid, Spain; (I.G.-C.); (S.A.-P.); (R.G.-M.); (C.D.-G.); (P.S.-L.); (J.A.); (A.L.)
- Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), 28029 Madrid, Spain;
| | - Antoni L. Andreu
- EATRIS, European Infrastructure for Translational Medicine, 1019 Amsterdam, The Netherlands;
| | - Joaquín Arenas
- Mitochondrial and Neuromuscular Disorders Group, Hospital 12 de Octubre Health Research Institute (imas12), 28041 Madrid, Spain; (I.G.-C.); (S.A.-P.); (R.G.-M.); (C.D.-G.); (P.S.-L.); (J.A.); (A.L.)
- Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), 28029 Madrid, Spain;
| | - José L. Zugaza
- Achucarro Basque Center for Neuroscience, Science Park of the UPV/EHU, and Department of Genetics, Physical Anthropology, and Animal Physiology, Faculty of Science and Technology, UPV/EHU, 48940 Leioa, Spain;
- IKERBASQUE, Basque Foundation for Science, Plaza Euskadi 5, 48009 Bilbao, Spain
| | - Alejandro Lucia
- Mitochondrial and Neuromuscular Disorders Group, Hospital 12 de Octubre Health Research Institute (imas12), 28041 Madrid, Spain; (I.G.-C.); (S.A.-P.); (R.G.-M.); (C.D.-G.); (P.S.-L.); (J.A.); (A.L.)
- Faculty of Sport Sciences, Universidad Europea de Madrid, 28670 Madrid, Spain
| | - Miguel A. Martín
- Mitochondrial and Neuromuscular Disorders Group, Hospital 12 de Octubre Health Research Institute (imas12), 28041 Madrid, Spain; (I.G.-C.); (S.A.-P.); (R.G.-M.); (C.D.-G.); (P.S.-L.); (J.A.); (A.L.)
- Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), 28029 Madrid, Spain;
| |
|
38
|
Ekvall M, Truong P, Gabriel W, Wilhelm M, Käll L. Prosit Transformer: A transformer for Prediction of MS2 Spectrum Intensities. J Proteome Res 2022; 21:1359-1364. [PMID: 35413196 PMCID: PMC9087333 DOI: 10.1021/acs.jproteome.1c00870] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Machine learning has been an integral part of interpreting data from mass spectrometry (MS)-based proteomics for a long time. Relatively recently, a machine-learning structure appeared successful in other areas of bioinformatics, Transformers. Furthermore, the implementation of Transformers within bioinformatics has become relatively convenient due to transfer learning, i.e., adapting a network trained for other tasks to new functionality. Transfer learning makes these relatively large networks more accessible as it generally requires less data, and the training time improves substantially. We implemented a Transformer based on the pretrained model TAPE to predict MS2 intensities. TAPE is a general model trained to predict missing residues from protein sequences. Despite being trained for a different task, we could modify its behavior by adding a prediction head at the end of the TAPE model and fine-tune it using the spectrum intensity from the training set to the well-known predictor Prosit. We demonstrate that the predictor, which we call Prosit Transformer, outperforms the recurrent neural-network-based predictor Prosit, increasing the median angular similarity on its hold-out set from 0.908 to 0.929. We believe that Transformers will significantly increase prediction accuracy for other types of predictions within MS-based proteomics.
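For readers unfamiliar with the metric, the angular similarity reported above can be illustrated with a minimal sketch. It assumes the commonly used normalized spectral angle, 1 - 2·arccos(cosine)/π, applied to two aligned, non-negative fragment-intensity vectors; the function name and the toy vectors are hypothetical and are not taken from the Prosit Transformer code.

```python
import math

def angular_similarity(predicted, observed):
    """Normalized spectral angle between two aligned intensity vectors.

    For non-negative intensities the result lies in [0, 1], where 1 means
    the two spectra point in exactly the same direction.
    """
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    if norm_p == 0.0 or norm_o == 0.0:
        return 0.0  # an empty spectrum carries no directional information
    # Clamp to guard against floating-point drift outside acos's domain.
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_o)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi

# Hypothetical predicted vs. observed fragment intensities.
score = angular_similarity([0.9, 0.1, 0.4], [1.0, 0.0, 0.5])
```

Because the angle is taken after normalization, the score ignores overall intensity scale and depends only on the relative fragment pattern.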
Affiliation(s)
- Markus Ekvall
- Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, Royal Institute of Technology─KTH, Box 1031, SE-17121 Solna, Sweden
| | - Patrick Truong
- Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, Royal Institute of Technology─KTH, Box 1031, SE-17121 Solna, Sweden
| | - Wassim Gabriel
- Computational Mass Spectrometry, Technical University of Munich (TUM), D-85354 Freising, Germany
| | - Mathias Wilhelm
- Computational Mass Spectrometry, Technical University of Munich (TUM), D-85354 Freising, Germany
| | - Lukas Käll
- Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, Royal Institute of Technology─KTH, Box 1031, SE-17121 Solna, Sweden
| |
|
39
|
Islam Khan MZ, Tam SY, Law HKW. Advances in High Throughput Proteomics Profiling in Establishing Potential Biomarkers for Gastrointestinal Cancer. Cells 2022; 11:973. [PMID: 35326424 PMCID: PMC8946849 DOI: 10.3390/cells11060973] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2021] [Revised: 03/05/2022] [Accepted: 03/08/2022] [Indexed: 12/24/2022] Open
Abstract
Gastrointestinal cancers (GICs) remain the most frequently diagnosed cancers and account for the highest number of cancer-related deaths globally. The prognosis and treatment outcomes of many GICs are poor because most cases are diagnosed in advanced metastatic stages. This is primarily attributed to the lack of effective and reliable early diagnostic biomarkers. The existing biomarkers for GIC diagnosis exhibit inadequate specificity and sensitivity. To improve the early diagnosis of GICs, biomarkers with higher specificity and sensitivity are warranted. Proteomics study and its functional analysis focus on elucidating physiological and biological functions of unknown or annotated proteins and deciphering cellular mechanisms at molecular levels. In addition, quantitative analysis of translational proteomics is a promising approach in enhancing the early identification and proper management of GICs. In this review, we focus on the advances in mass spectrometry along with the quantitative and functional analysis of proteomics data that contribute to the establishment of biomarkers for GICs, including colorectal, gastric, hepatocellular, pancreatic, and esophageal cancer. We also discuss the future challenges in the validation of proteomics-based biomarkers for their translation into clinics.
Affiliation(s)
| | | | - Helen Ka Wai Law
- Department of Health Technology and Informatics, Faculty of Health and Social Sciences, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, China; (M.Z.I.K.); (S.Y.T.)
| |
|
40
|
Abdelrahman A, Viriri S. Kidney Tumor Semantic Segmentation Using Deep Learning: A Survey of State-of-the-Art. J Imaging 2022; 8:55. [PMID: 35324610 PMCID: PMC8954467 DOI: 10.3390/jimaging8030055] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2021] [Revised: 01/26/2022] [Accepted: 02/10/2022] [Indexed: 01/27/2023] Open
Abstract
Cure rates for kidney cancer vary according to stage and grade; hence, accurate diagnostic procedures for early detection and diagnosis are crucial. Some difficulties with manual segmentation have necessitated the use of deep learning models to assist clinicians in effectively recognizing and segmenting tumors. Deep learning (DL), particularly convolutional neural networks, has produced outstanding success in classifying and segmenting images. Simultaneously, researchers in the field of medical image segmentation employ DL approaches to solve problems such as tumor segmentation, cell segmentation, and organ segmentation. Segmentation of tumors semantically is critical in radiation and therapeutic practice. This article discusses current advances in kidney tumor segmentation systems based on DL. We discuss the various types of medical images and segmentation techniques and the assessment criteria for segmentation outcomes in kidney tumor segmentation, highlighting their building blocks and various strategies.
Affiliation(s)
| | - Serestina Viriri
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Durban 4000, South Africa;
| |
|
41
|
Sankara Narayanan P, Runthala A. Accurate computational evolution of proteins and its dependence on deep learning and machine learning strategies. BIOCATAL BIOTRANSFOR 2022. [DOI: 10.1080/10242422.2022.2030317] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
|
42
|
Dickinson Q, Meyer JG. Positional SHAP (PoSHAP) for Interpretation of machine learning models trained from biological sequences. PLoS Comput Biol 2022; 18:e1009736. [PMID: 35089914 PMCID: PMC8797255 DOI: 10.1371/journal.pcbi.1009736] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2021] [Accepted: 12/09/2021] [Indexed: 11/29/2022] Open
Abstract
Machine learning with multi-layered artificial neural networks, also known as "deep learning," is effective for making biological predictions. However, model interpretation is challenging, especially for sequential input data used with recurrent neural network architectures. Here, we introduce a framework called "Positional SHAP" (PoSHAP) to interpret models trained from biological sequences by utilizing SHapley Additive exPlanations (SHAP) to generate positional model interpretations. We demonstrate this using three long short-term memory (LSTM) regression models that predict peptide properties, including binding affinity to major histocompatibility complexes (MHC), and collisional cross section (CCS) measured by ion mobility spectrometry. Interpretation of these models with PoSHAP reproduced MHC class I (rhesus macaque Mamu-A1*001 and human A*11:01) peptide binding motifs, reflected known properties of peptide CCS, and provided new insights into interpositional dependencies of amino acid interactions. PoSHAP should have widespread utility for interpreting a variety of models trained from biological sequences.
Affiliation(s)
- Quinn Dickinson
- Department of Biochemistry, Medical College of Wisconsin, Milwaukee, Wisconsin
| | - Jesse G. Meyer
- Department of Biochemistry, Medical College of Wisconsin, Milwaukee, Wisconsin
| |
|
43
|
Yang Y, Lin L, Qiao L. Deep learning approaches for data-independent acquisition proteomics. Expert Rev Proteomics 2021; 18:1031-1043. [PMID: 34918987 DOI: 10.1080/14789450.2021.2020654] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
INTRODUCTION Data-independent acquisition (DIA) is an emerging technology for large-scale proteomic studies. DIA data analysis methods are evolving rapidly, and deep learning has cut a conspicuous figure in this field. AREAS COVERED This review discusses and provides an overview of the deep learning methods that are used for DIA data analysis, including spectral library prediction, feature scoring, and statistical control in peptide-centric analysis, as well as de novo peptide sequencing. Literature searches were performed for articles, including preprints, up to December 2021 from PubMed, Scopus, and Web of Science databases. EXPERT OPINION While spectral library prediction has broken through the limitation on proteome coverage of experimental libraries, the statistical burden due to the large query space is the remaining challenge of utilizing proteome-wide predicted libraries. Analysis of post-translational modifications is another promising direction of deep learning-based DIA methods.
Affiliation(s)
- Yi Yang
- Department of Chemistry, Shanghai Stomatological Hospital, and Minhang Hospital, Fudan University, Shanghai, China
| | - Ling Lin
- Department of Chemistry, Shanghai Stomatological Hospital, and Minhang Hospital, Fudan University, Shanghai, China
| | - Liang Qiao
- Department of Chemistry, Shanghai Stomatological Hospital, and Minhang Hospital, Fudan University, Shanghai, China
| |
|
44
|
Samukhina YV, Matyushin DD, Grinevich OI, Buryak AK. A Deep Convolutional Neural Network for Prediction of Peptide Collision Cross Sections in Ion Mobility Spectrometry. Biomolecules 2021; 11:1904. [PMID: 34944547 PMCID: PMC8699202 DOI: 10.3390/biom11121904] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2021] [Revised: 12/13/2021] [Accepted: 12/17/2021] [Indexed: 11/26/2022] Open
Abstract
Most frequently, the identification of peptides in mass spectrometry-based proteomics is carried out using high-resolution tandem mass spectrometry. In order to increase the accuracy of analysis, additional information on the peptides such as chromatographic retention time and collision cross section in ion mobility spectrometry can be used. An accurate prediction of the collision cross section values allows erroneous candidates to be rejected using a comparison of the observed values and the predictions based on the amino acids sequence. Recently, a massive high-quality data set of peptide collision cross sections was released. This opens up an opportunity to apply the most sophisticated deep learning techniques for this task. Previously, it was shown that a recurrent neural network allows for predicting these values accurately. In this work, we present a deep convolutional neural network that enables us to predict these values more accurately compared with previous studies. We use a neural network with complex architecture that contains both convolutional and fully connected layers and comprehensive methods of converting a peptide to multi-channel 1D spatial data and vector. The source code and pre-trained model are available online.
Affiliation(s)
| | - Dmitriy D. Matyushin
- A.N. Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, 31 Leninsky Prospect, GSP-1, 119071 Moscow, Russia; (Y.V.S.); (O.I.G.); (A.K.B.)
| | | | | |
|
45
|
Tng SS, Le NQK, Yeh HY, Chua MCH. Improved Prediction Model of Protein Lysine Crotonylation Sites Using Bidirectional Recurrent Neural Networks. J Proteome Res 2021; 21:265-273. [PMID: 34812044 DOI: 10.1021/acs.jproteome.1c00848] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023]
Abstract
Histone lysine crotonylation (Kcr) is a post-translational modification of histone proteins that is involved in the regulation of gene transcription, acute and chronic kidney injury, spermatogenesis, depression, cancer, and so forth. The identification of Kcr sites in proteins is important for characterizing and regulating primary biological mechanisms. The use of computational approaches such as machine learning and deep learning algorithms has emerged in recent years, as traditional wet-lab experiments are time-consuming and costly. We propose as part of this study a deep learning model based on a recurrent neural network (RNN), termed Sohoko-Kcr, for the prediction of Kcr sites. Through the embedded encoding of the peptide sequences, we investigate the efficiency of RNN-based models such as long short-term memory (LSTM), bidirectional LSTM (BiLSTM), and bidirectional gated recurrent unit (BiGRU) networks using cross-validation and independent tests. We also compared Sohoko-Kcr with other published tools to verify the efficiency of our model based on 3-fold, 5-fold, and 10-fold cross-validations using independent set tests. The results show that the BiGRU model consistently displayed outstanding performance and computational efficiency. Based on the proposed model, a webserver called Sohoko-Kcr was deployed for free use and is accessible at https://sohoko-research-9uu23.ondigitalocean.app.
Affiliation(s)
- Sian Soo Tng
- Institute of Systems Science, National University of Singapore, 29 Heng Mui Keng Terrace, Singapore 119620, Singapore
| | - Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan; Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
| | - Hui-Yuan Yeh
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Avenue, Singapore 639818, Singapore
| | - Matthew Chin Heng Chua
- Institute of Systems Science, National University of Singapore, 29 Heng Mui Keng Terrace, Singapore 119620, Singapore
| |
|
46
|
Li F, Dong S, Leier A, Han M, Guo X, Xu J, Wang X, Pan S, Jia C, Zhang Y, Webb GI, Coin LJM, Li C, Song J. Positive-unlabeled learning in bioinformatics and computational biology: a brief review. Brief Bioinform 2021; 23:6415313. [PMID: 34729589 DOI: 10.1093/bib/bbab461] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2021] [Revised: 09/27/2021] [Accepted: 10/07/2021] [Indexed: 12/14/2022] Open
Abstract
Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be potentially mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive and negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications to address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and further developing next-generation PU learning frameworks for critical biological applications.
Affiliation(s)
- Fuyi Li
- Monash University, Australia
| | | | - André Leier
- Department of Genetics, UAB School of Medicine, USA
| | - Meiya Han
- Department of Biochemistry and Molecular Biology, Monash University, Australia
| | | | - Jing Xu
- Computer Science and Technology from Nankai University, China
| | - Xiaoyu Wang
- Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia
| | - Shirui Pan
- University of Technology Sydney (UTS), Ultimo, NSW, Australia
| | - Cangzhi Jia
- College of Science, Dalian Maritime University, China
| | - Yang Zhang
- Northwestern Polytechnical University, China
| | - Geoffrey I Webb
- Faculty of Information Technology at Monash University, Australia
| | - Lachlan J M Coin
- Department of Clinical Pathology, University of Melbourne, Australia
| | - Chen Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Australia
| | - Jiangning Song
- Monash Biomedicine Discovery Institute, Monash University, Melbourne, Australia
| |
|
47
|
Abstract
Direct infusion shotgun proteome analysis (DISPA) is a new paradigm for expedited mass spectrometry-based proteomics, but the original data analysis workflow was onerous. Here, we introduce CsoDIAq, a user-friendly software package for the identification and quantification of peptides and proteins from DISPA data. In addition to establishing a complete and automated analysis workflow with a graphical user interface, CsoDIAq introduces algorithmic concepts to spectrum-spectrum matching to improve peptide identification speed and sensitivity. These include spectra pooling to reduce search time complexity and a new spectrum-spectrum match score called match count and cosine, which improves target discrimination in a target-decoy analysis. Fragment mass tolerance correction also increased the number of peptide identifications. Finally, we adapt CsoDIAq to standard LC-MS DIA and show that it outperforms other spectrum-spectrum matching software.
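The target-decoy analysis mentioned in this abstract can be illustrated with a generic sketch. It assumes the simple estimate FDR ≈ decoys / targets among matches scoring at or above a threshold; the function names, the `(score, is_decoy)` tuple layout, and the example scores are hypothetical, and this is not CsoDIAq's implementation or its match count and cosine score.

```python
def fdr_at_threshold(matches, threshold):
    """Estimate FDR as decoys / targets among matches with score >= threshold.

    `matches` is a list of (score, is_decoy) tuples.
    """
    targets = sum(1 for score, is_decoy in matches
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score >= threshold and is_decoy)
    if targets == 0:
        # No targets pass: FDR is undefined; treat decoy-only sets as infinite.
        return float("inf") if decoys else 0.0
    return decoys / targets

def lowest_threshold_within_fdr(matches, max_fdr=0.01):
    """Return the lowest observed score whose estimated FDR is <= max_fdr,
    or None if no threshold qualifies."""
    best = None
    for score in sorted({s for s, _ in matches}, reverse=True):
        if fdr_at_threshold(matches, score) <= max_fdr:
            best = score
    return best

# Hypothetical spectrum-spectrum match scores; True marks a decoy hit.
example = [(0.9, False), (0.8, False), (0.7, True), (0.6, False)]
cutoff = lowest_threshold_within_fdr(example, max_fdr=0.01)
```

A score that separates targets from decoys more cleanly lets a lower cutoff pass the same FDR limit, which is the sense in which improved target discrimination increases identifications.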
Affiliation(s)
- Caleb W Cranney
- Department of Biochemistry, Medical College of Wisconsin, Milwaukee, Wisconsin 53226, United States
| | - Jesse G Meyer
- Department of Biochemistry, Medical College of Wisconsin, Milwaukee, Wisconsin 53226, United States
| |
|