1
Shi Y, Li K, Ding R, Li X, Cheng Z, Liu J, Liu S, Zhu H, Sun H. Untargeted metabolomics and machine learning unveil the exposome and metabolism linked with the risk of early pregnancy loss. Journal of Hazardous Materials 2025; 488:137362. [PMID: 39892135 DOI: 10.1016/j.jhazmat.2025.137362] [Received: 10/31/2024] [Revised: 01/13/2025] [Accepted: 01/22/2025] [Indexed: 02/03/2025]
Abstract
Early pregnancy loss (EPL) may result from exposure to emerging contaminants (ECs), although the underlying mechanisms remain poorly understood. This case-control study measured over 2000 serum features, including 37 ECs, 6 biochemicals, and 2057 endogenous metabolites, in serum samples collected from 48 EPL patients and healthy pregnant women. The median total concentration of targeted EC in the EPL group (65.9 ng/mL) was significantly higher than in controls (43.0 ng/mL; p < 0.05). Four machine learning algorithms were employed to identify key molecular features and develop EPL risk prediction models. A random forest model based on chemical data achieved a predictive accuracy of 95 %, suggesting a potential association between EPL and chemical exposure, with phthalic acid esters identified as significant contributors. Ninety-five potential metabolite biomarkers were selected, which were predominantly enriched in pathways related to spermidine and spermine biosynthesis, ubiquinone biosynthesis, and pantothenate and coenzyme A biosynthesis. C17-sphinganine was identified as a leading biomarker with an area under the curve of 0.93. Furthermore, exposure to bis(2-ethylhexyl)phthalate was linked to an increased risk of EPL by disrupting lipid metabolism. These findings indicate that combining untargeted metabolomics with machine learning approaches offers novel insights into the mechanisms of EPL related to EC exposure.
Affiliation(s)
- Yixuan Shi
  - MOE Key Laboratory of Pollution Processes and Environmental Criteria, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
- Keyi Li
  - MOE Key Laboratory of Pollution Processes and Environmental Criteria, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
- Ran Ding
  - MOE Key Laboratory of Pollution Processes and Environmental Criteria, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
- Xiaoying Li
  - College of Environmental Science and Engineering, Dalian Maritime University, Dalian 116026, China
- Zhipeng Cheng
  - MOE Key Laboratory of Pollution Processes and Environmental Criteria, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
- Jialan Liu
  - Department of Obstetrics and Gynecology, Tianjin Jinnan Hospital, Tianjin 300350, China
- Shaoxia Liu
  - Department of Obstetrics and Gynecology, Tianjin Jinnan Hospital, Tianjin 300350, China
- Hongkai Zhu
  - MOE Key Laboratory of Pollution Processes and Environmental Criteria, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
- Hongwen Sun
  - MOE Key Laboratory of Pollution Processes and Environmental Criteria, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
2
Angelis J, Schröder EA, Xiao Z, Gabriel W, Wilhelm M. Peptide Property Prediction for Mass Spectrometry Using AI: An Introduction to State of the Art Models. Proteomics 2025; 25:e202400398. [PMID: 40211610 PMCID: PMC12076536 DOI: 10.1002/pmic.202400398] [Received: 11/22/2024] [Revised: 03/14/2025] [Accepted: 03/17/2025] [Indexed: 05/15/2025]
Abstract
This review explores state of the art machine learning and deep learning models for peptide property prediction in mass spectrometry-based proteomics, including, but not limited to, models for predicting digestibility, retention time, charge state distribution, collisional cross section, fragmentation ion intensities, and detectability. The combination of these models enables not only the in silico generation of spectral libraries but also finds many additional use cases in the design of targeted assays or data-driven rescoring. This review serves as both an introduction for newcomers and an update for experienced researchers aiming to develop accessible and reproducible models for peptide property predictions. Key limitations of the current models, including difficulties in handling diverse post-translational modifications and instrument variability, highlight the need for large-scale, harmonized datasets, and standardized evaluation metrics for benchmarking.
Affiliation(s)
- Jesse Angelis
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Eva Ayla Schröder
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Zixuan Xiao
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Wassim Gabriel
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Mathias Wilhelm
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
  - Munich Data Science Institute (MDSI), Technical University of Munich, Garching, Germany
3
Movassaghi CS, Sun J, Jiang Y, Turner N, Chang V, Chung N, Chen RJ, Browne EN, Lin C, Schweppe DK, Malaker SA, Meyer JG. Recent Advances in Mass Spectrometry-Based Bottom-Up Proteomics. Anal Chem 2025; 97:4728-4749. [PMID: 40000226 DOI: 10.1021/acs.analchem.4c06750] [Indexed: 02/27/2025]
Abstract
Mass spectrometry-based proteomics is about 35 years old, and recent progress appears to be speeding up across all subfields. In this review, we focus on advances over the last two years in select areas within bottom-up proteomics, including approaches to high-throughput experiments, data analysis using machine learning, drug discovery, glycoproteomics, extracellular vesicle proteomics, and structural proteomics.
Affiliation(s)
- Cameron S Movassaghi
  - Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Jie Sun
  - Department of Biochemistry & Cellular and Molecular Biology, University of Tennessee, Knoxville, Tennessee 37996, United States
- Yuming Jiang
  - Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Natalie Turner
  - Departments of Molecular Medicine and Neurobiology, Scripps Research Institute, La Jolla, California 92037, United States
- Vincent Chang
  - Department of Chemistry, Yale University, 275 Prospect Street, New Haven, Connecticut 06511, United States
- Nara Chung
  - Department of Chemistry, Yale University, 275 Prospect Street, New Haven, Connecticut 06511, United States
- Ryan J Chen
  - Department of Chemistry, Yale University, 275 Prospect Street, New Haven, Connecticut 06511, United States
- Elizabeth N Browne
  - Department of Chemistry, Yale University, 275 Prospect Street, New Haven, Connecticut 06511, United States
- Chuwei Lin
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98105, United States
- Devin K Schweppe
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98105, United States
- Stacy A Malaker
  - Department of Chemistry, Yale University, 275 Prospect Street, New Haven, Connecticut 06511, United States
- Jesse G Meyer
  - Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
4
Declercq A, Devreese R, Scheid J, Jachmann C, Van Den Bossche T, Preikschat A, Gomez-Zepeda D, Rijal JB, Hirschler A, Krieger JR, Srikumar T, Rosenberger G, Martelli C, Trede D, Carapito C, Tenzer S, Walz JS, Degroeve S, Bouwmeester R, Martens L, Gabriels R. TIMS2Rescore: A Data Dependent Acquisition-Parallel Accumulation and Serial Fragmentation-Optimized Data-Driven Rescoring Pipeline Based on MS2Rescore. J Proteome Res 2025; 24:1067-1076. [PMID: 39915959 PMCID: PMC11894666 DOI: 10.1021/acs.jproteome.4c00609] [Received: 08/09/2024] [Revised: 11/08/2024] [Accepted: 01/27/2025] [Indexed: 03/08/2025]
Abstract
The high throughput analysis of proteins with mass spectrometry (MS) is highly valuable for understanding human biology, discovering disease biomarkers, identifying therapeutic targets, and exploring pathogen interactions. To achieve these goals, specialized proteomics subfields, including plasma proteomics, immunopeptidomics, and metaproteomics, must tackle specific analytical challenges, such as an increased identification ambiguity compared to routine proteomics experiments. Technical advancements in MS instrumentation can mitigate these issues by acquiring more discerning information at higher sensitivity levels. This is exemplified by the incorporation of ion mobility and parallel accumulation and serial fragmentation (PASEF) technologies in timsTOF instruments. In addition, AI-based bioinformatics solutions can help overcome ambiguity issues by integrating more data into the identification workflow. Here, we introduce TIMS2Rescore, a data-driven rescoring workflow optimized for DDA-PASEF data from timsTOF instruments. This platform includes new timsTOF MS2PIP spectrum prediction models and IM2Deep, a new deep learning-based peptide ion mobility predictor. Furthermore, to fully streamline data throughput, TIMS2Rescore directly accepts Bruker raw mass spectrometry data and search results from ProteoScape and many other search engines, including Sage and PEAKS. We showcase TIMS2Rescore performance on plasma proteomics, immunopeptidomics (HLA class I and II), and metaproteomics data sets. TIMS2Rescore is open-source and freely available at https://github.com/compomics/tims2rescore.
Affiliation(s)
- Arthur Declercq
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
  - Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
- Robbe Devreese
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
  - Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
- Jonas Scheid
  - Department of Peptide-based Immunotherapy, Institute of Immunology, University and University Hospital Tübingen, Tübingen 72076, Germany
  - Cluster of Excellence iFIT (ECX2180) Image-Guided and Functionally Instructed Tumor Therapies, University of Tuebingen, Tuebingen 72076, Germany
  - Quantitative Biology Center (QBiC), University of Tübingen, Tübingen 72076, Germany
- Caroline Jachmann
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
  - Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
- Tim Van Den Bossche
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
  - Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
- Annica Preikschat
  - Institute of Immunology, University Medical Center of the Johannes-Gutenberg University, Mainz 55131, Germany
- David Gomez-Zepeda
  - Helmholtz Institute for Translational Oncology Mainz (HI-TRON Mainz) - A Helmholtz Institute of the DKFZ, Mainz 55131, Germany
  - German Cancer Research Center (DKFZ) Heidelberg, Division 191 & Immunopeptidomics Platform, Heidelberg 69120, Germany
- Jeewan Babu Rijal
  - BioOrganic Mass Spectrometry Laboratory (LSMBO), IPHC UMR 7178, University of Strasbourg, CNRS, ProFI FR2048, Strasbourg 67087, France
- Aurélie Hirschler
  - BioOrganic Mass Spectrometry Laboratory (LSMBO), IPHC UMR 7178, University of Strasbourg, CNRS, ProFI FR2048, Strasbourg 67087, France
- Dennis Trede
  - Bruker Daltonics GmbH & Co. KG, Bremen 28359, Germany
- Christine Carapito
  - BioOrganic Mass Spectrometry Laboratory (LSMBO), IPHC UMR 7178, University of Strasbourg, CNRS, ProFI FR2048, Strasbourg 67087, France
- Stefan Tenzer
  - Institute of Immunology, University Medical Center of the Johannes-Gutenberg University, Mainz 55131, Germany
  - Helmholtz Institute for Translational Oncology Mainz (HI-TRON Mainz) - A Helmholtz Institute of the DKFZ, Mainz 55131, Germany
  - Research Center for Immunotherapy (FZI), University Medical Center of the Johannes-Gutenberg University, Mainz 55131, Germany
- Juliane S Walz
  - Department of Peptide-based Immunotherapy, Institute of Immunology, University and University Hospital Tübingen, Tübingen 72076, Germany
  - Cluster of Excellence iFIT (ECX2180) Image-Guided and Functionally Instructed Tumor Therapies, University of Tuebingen, Tuebingen 72076, Germany
  - Clinical Collaboration Unit Translational Immunology, Department of Internal Medicine, University Hospital Tuebingen, Tuebingen 72076, Germany
  - German Cancer Consortium (DKTK) and German Cancer Research Center (DKFZ), partner site Tübingen, Tübingen 72076, Germany
- Sven Degroeve
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
  - Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
- Robbin Bouwmeester
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
  - Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
- Lennart Martens
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
  - Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
  - BioOrganic Mass Spectrometry Laboratory (LSMBO), IPHC UMR 7178, University of Strasbourg, CNRS, ProFI FR2048, Strasbourg 67087, France
- Ralf Gabriels
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
  - Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
5
Nagy K, Sándor P, Vékey K, Drahos L, Révész Á. The Enzyme Effect: Broadening the Horizon of MS Optimization to Nontryptic Digestion in Proteomics. Journal of the American Society for Mass Spectrometry 2025; 36:299-308. [PMID: 39803703 PMCID: PMC11808764 DOI: 10.1021/jasms.4c00396] [Received: 09/25/2024] [Revised: 12/27/2024] [Accepted: 12/31/2024] [Indexed: 02/06/2025]
Abstract
In recent years, alternative enzymes with varied specificities have gained importance in MS-based bottom-up proteomics, offering orthogonal information about biological samples and advantages in certain applications. However, most mass spectrometric workflows are optimized for tryptic digests. This raises the questions of whether enzyme specificity impacts mass spectrometry and if current methods for nontryptic digests are suboptimal. The success of peptide and protein identifications relies on the information content of MS/MS spectra, influenced by collision energy in collision-induced dissociation. We investigated this by conducting LC-MS/MS measurements with different enzymes, including trypsin, Arg-C, Glu-C, Asp-N, and chymotrypsin, at varying collision energies. We analyzed peptide scores for thousands of peptides and determined optimal collision energy (CE) values. Our results showed a linear m/z dependence for all enzymes, with Glu-C, Asp-N, and chymotrypsin requiring significantly lower energies than trypsin and Arg-C. We proposed a tailored CE selection method for these alternative enzymes, applying ca. 20% lower energy compared to tryptic peptides. This would result in a 10-15 eV decrease on a Bruker QTof instrument and a 5-6 NCE% (normalized collision energy) difference on an Orbitrap. The optimized method improved bottom-up proteomics performance by 8-32%, as measured by peptide identification and sequence coverage. The different trends in fragmentation behavior were linked to the effects of C-terminal basic amino acids for Arg-C and trypsin, stabilizing y fragment ions. This optimized method boosts the performance and provides insight into the impact of enzyme specificity. Data sets are available in the MassIVE repository (MSV000095066).
Affiliation(s)
- Kinga Nagy
  - MS Proteomics Research Group, HUN-REN Research Centre for Natural Sciences, Magyar Tudósok körútja 2, H-1117 Budapest, Hungary
  - Hevesy György PhD School of Chemistry, ELTE Eötvös Loránd University, Faculty of Science, Institute of Chemistry, Pázmány Péter sétány 1/A, Budapest H-1117, Hungary
- Péter Sándor
  - MS Proteomics Research Group, HUN-REN Research Centre for Natural Sciences, Magyar Tudósok körútja 2, H-1117 Budapest, Hungary
- Károly Vékey
  - MS Proteomics Research Group, HUN-REN Research Centre for Natural Sciences, Magyar Tudósok körútja 2, H-1117 Budapest, Hungary
- László Drahos
  - MS Proteomics Research Group, HUN-REN Research Centre for Natural Sciences, Magyar Tudósok körútja 2, H-1117 Budapest, Hungary
- Ágnes Révész
  - MS Proteomics Research Group, HUN-REN Research Centre for Natural Sciences, Magyar Tudósok körútja 2, H-1117 Budapest, Hungary
6
Fountzilas E, Pearce T, Baysal MA, Chakraborty A, Tsimberidou AM. Convergence of evolving artificial intelligence and machine learning techniques in precision oncology. NPJ Digit Med 2025; 8:75. [PMID: 39890986 PMCID: PMC11785769 DOI: 10.1038/s41746-025-01471-y] [Received: 06/18/2024] [Accepted: 01/19/2025] [Indexed: 02/03/2025]
Abstract
The confluence of new technologies with artificial intelligence (AI) and machine learning (ML) analytical techniques is rapidly advancing the field of precision oncology, promising to improve diagnostic approaches and therapeutic strategies for patients with cancer. By analyzing multi-dimensional, multiomic, spatial pathology, and radiomic data, these technologies enable a deeper understanding of the intricate molecular pathways, aiding in the identification of critical nodes within the tumor's biology to optimize treatment selection. The applications of AI/ML in precision oncology are extensive and include the generation of synthetic data, e.g., digital twins, in order to provide the necessary information to design or expedite the conduct of clinical trials. Currently, many operational and technical challenges exist related to data technology, engineering, and storage; algorithm development and structures; quality and quantity of the data and the analytical pipeline; data sharing and generalizability; and the incorporation of these technologies into the current clinical workflow and reimbursement models.
Affiliation(s)
- Elena Fountzilas
  - Department of Medical Oncology, St Luke's Clinic, Panorama, Thessaloniki, Greece
- Mehmet A Baysal
  - Department of Investigational Cancer Therapeutics, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Blvd., Houston, TX, USA
- Abhijit Chakraborty
  - Department of Investigational Cancer Therapeutics, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Blvd., Houston, TX, USA
- Apostolia M Tsimberidou
  - Department of Investigational Cancer Therapeutics, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Blvd., Houston, TX, USA
7
Lin J, Liang Z, Liang Y, Cao X, Tang X, Zhuang H, Yin X, Zhao D, Shen L. A systematically investigation of plasma complement and coagulation-related proteins and adiponectin in gestational diabetes mellitus by multiple reaction monitoring technology. Acta Diabetol 2025:10.1007/s00592-025-02451-0. [PMID: 39821309 DOI: 10.1007/s00592-025-02451-0] [Received: 07/25/2024] [Accepted: 01/05/2025] [Indexed: 01/19/2025]
Abstract
BACKGROUND: Gestational diabetes mellitus (GDM) is defined as glucose intolerance resulting in hyperglycaemia of variable severity with onset during pregnancy, and is prevalent worldwide. The study of diagnostic markers of GDM in early pregnancy is important for early diagnosis and early intervention of GDM. The aim of this study was to search for biomarkers of GDM in early and mid-pregnancy using a targeted proteomics approach. METHODS: Through multiple reaction monitoring (MRM) technology and bioinformatics analysis including machine learning, 44 proteins associated with complement and coagulation cascades, and one protein, adiponectin, which is frequently reported to be associated with GDM, were targeted for quantitative analysis, and potential biomarkers were screened. RESULTS: The results showed that 7 and 6 proteins were identified as differentially expressed proteins (DEPs) between pregnant women subsequently diagnosed with GDM and controls during the first trimester, as well as between GDM cases and controls during the second trimester, respectively. Among them, C1QC and CFHR1 may serve as early predictive markers, and C1QC and adiponectin may serve as mid-term diagnostic markers. DISCUSSION: Complement and coagulation-related proteins and adiponectin have been implicated in the pathogenesis of GDM, and some of these proteins have the potential to serve as markers for the prediction or diagnosis of GDM.
Affiliation(s)
- Jing Lin
  - College of Life Science and Oceanography, Shenzhen University, Shenzhen, 518071, P. R. China
- Zhiyuan Liang
  - College of Life Science and Oceanography, Shenzhen University, Shenzhen, 518071, P. R. China
- Yi Liang
  - Department of Clinical Nutrition, Affiliated Hospital of Guizhou Medical University, Guiyang, P. R. China
- Xueshan Cao
  - College of Life Science and Oceanography, Shenzhen University, Shenzhen, 518071, P. R. China
- Xiaoxiao Tang
  - College of Life Science and Oceanography, Shenzhen University, Shenzhen, 518071, P. R. China
- Hongbin Zhuang
  - College of Life Science and Oceanography, Shenzhen University, Shenzhen, 518071, P. R. China
- Xiaoping Yin
  - Department of Obstetrics and Gynecology, Affiliated Hospital of Guizhou Medical University, Guiyang, 550004, P. R. China
- Danqing Zhao
  - Department of Obstetrics and Gynecology, Affiliated Hospital of Guizhou Medical University, Guiyang, 550004, P. R. China
- Liming Shen
  - College of Life Science and Oceanography, Shenzhen University, Shenzhen, 518071, P. R. China
  - Shenzhen-Hong Kong Institute of Brain Science-Shenzhen Fundamental Research Institutions, Shenzhen, 518055, P. R. China
8
Perez-Riverol Y, Bandla C, Kundu D, Kamatchinathan S, Bai J, Hewapathirana S, John N, Prakash A, Walzer M, Wang S, Vizcaíno J. The PRIDE database at 20 years: 2025 update. Nucleic Acids Res 2025; 53:D543-D553. [PMID: 39494541 PMCID: PMC11701690 DOI: 10.1093/nar/gkae1011] [Received: 09/15/2024] [Revised: 10/11/2024] [Accepted: 10/16/2024] [Indexed: 11/05/2024]
Abstract
The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's leading mass spectrometry (MS)-based proteomics data repository and one of the founding members of the ProteomeXchange consortium. This manuscript summarizes the developments in PRIDE resources and related tools for the last three years. The number of submitted datasets to PRIDE Archive (the archival component of PRIDE) has reached on average around 534 datasets per month. This has been possible thanks to continuous improvements in infrastructure such as a new file transfer protocol for very large datasets (Globus), a new data resubmission pipeline and an automatic dataset validation process. Additionally, we will highlight novel activities such as the availability of the PRIDE chatbot (based on the use of open-source Large Language Models), and our work to improve support for MS crosslinking datasets. Furthermore, we will describe how we have increased our efforts to reuse, reanalyze and disseminate high-quality proteomics data into added-value resources such as UniProt, Ensembl and Expression Atlas.
Affiliation(s)
- Yasset Perez-Riverol
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Chakradhar Bandla
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Deepti J Kundu
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Selvakumar Kamatchinathan
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Jingwen Bai
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Suresh Hewapathirana
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Nithu Sara John
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Ananth Prakash
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Mathias Walzer
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Shengbo Wang
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Juan Antonio Vizcaíno
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
9
Stastna M. Post-translational modifications of proteins in cardiovascular diseases examined by proteomic approaches. FEBS J 2025; 292:28-46. [PMID: 38440918 PMCID: PMC11705224 DOI: 10.1111/febs.17108] [Received: 11/23/2023] [Revised: 01/22/2024] [Accepted: 02/20/2024] [Indexed: 03/06/2024]
Abstract
Over 400 different types of post-translational modifications (PTMs) have been reported and over 200 various types of PTMs have been discovered using mass spectrometry (MS)-based proteomics. MS-based proteomics has proven to be a powerful method capable of global PTM mapping with the identification of modified proteins/peptides, the localization of PTM sites and PTM quantitation. PTMs play regulatory roles in protein functions, activities and interactions in various heart related diseases, such as ischemia/reperfusion injury, cardiomyopathy and heart failure. The recognition of PTMs that are specific to cardiovascular pathology and the clarification of the mechanisms underlying these PTMs at molecular levels are crucial for discovery of novel biomarkers and application in a clinical setting. With sensitive MS instrumentation and novel biostatistical methods for precise processing of the data, low-abundance PTMs can be successfully detected and the beneficial or unfavorable effects of specific PTMs on cardiac function can be determined. Moreover, computational proteomic strategies that can predict PTM sites based on MS data have gained an increasing interest and can contribute to characterization of PTM profiles in cardiovascular disorders. More recently, machine learning- and deep learning-based methods have been employed to predict the locations of PTMs and explore PTM crosstalk. In this review article, the types of PTMs are briefly overviewed, approaches for PTM identification/quantitation in MS-based proteomics are discussed and recently published proteomic studies on PTMs associated with cardiovascular diseases are included.
Affiliation(s)
- Miroslava Stastna
  - Institute of Analytical Chemistry of the Czech Academy of Sciences, Brno, Czech Republic
10
Huang J, Li Y, Meng B, Zhang Y, Wei Y, Dai X, An D, Zhao Y, Fang X. ProteoNet: A CNN-based framework for analyzing proteomics MS-RGB images. iScience 2024; 27:111362. [PMID: 39679296 PMCID: PMC11638609 DOI: 10.1016/j.isci.2024.111362] [Received: 01/08/2024] [Revised: 06/15/2024] [Accepted: 11/07/2024] [Indexed: 12/17/2024]
Abstract
Proteomics is crucial in clinical research, yet the clinical application of proteomic data remains challenging. Transforming proteomic mass spectrometry (MS) data into red, green, and blue color (MS-RGB) image formats and applying deep learning (DL) techniques has shown great potential to enhance analysis efficiency. However, current DL models often fail to extract subtle, crucial features from MS-RGB data. To address this, we developed ProteoNet, a deep learning framework that refines MS-RGB data analysis. ProteoNet incorporates semantic partitioning, adaptive average pooling, and weighted factors into the Convolutional Neural Network (CNN) model, thus enhancing data analysis accuracy. Our experiments with proteomics data from urine, blood, and tissue samples related to liver, kidney, and thyroid diseases demonstrate that ProteoNet outperforms existing models in accuracy. ProteoNet also provides a direct conversion method for MS-RGB data, enabling a seamless workflow. Moreover, its compatibility with various CNN architectures, including lightweight models like MobileNetV2, underscores its scalability and clinical potential.
Affiliation(s)
- Jinze Huang
  - Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China
- Yimin Li
  - College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
- Bo Meng
  - Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China
- Yong Zhang
  - Institutes for Systems Genetics, West China Hospital, Sichuan University, Chengdu 610041, China
- Yaoguang Wei
  - College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
- Xinhua Dai
  - Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China
- Dong An
  - College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
- Yang Zhao
  - Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China
  - College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
- Xiang Fang
  - Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing 100029, China
Collapse
|
11
|
Ramanan M, Bettenhausen H, Grigorean G, Diepenbrock C, Fox GP. Barley Grain Proteome Assessment Using Multi-Environment Trial Data and Machine Learning. JOURNAL OF AGRICULTURAL AND FOOD CHEMISTRY 2024; 72:26416-26430. [PMID: 39536264 DOI: 10.1021/acs.jafc.4c07017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2024]
Abstract
Proteomics can be used to assess individual protein abundances, which could reflect genotypic and environmental effects and potentially predict grain/malt quality. In this study, 79 barley grain samples (genotype-location-year combinations) from Californian multi-environment trials (2017-2022) were assessed using liquid chromatography-mass spectrometry. In total, 3104 proteins were identified across all of the samples. Location, genotype, and year explained 26.7, 17.1, and 14.3% of the variance in the relative abundance of individual proteins, respectively. Sixteen proteins with storage, DNA/RNA binding, or enzymatic functions were significantly higher/lower in abundance (compared to the overall mean) in the Yolo 3 and Imperial Valley locations, Butta 12 and LCS Odyssey genotypes, and the 2017-18 and 2021-22 years. Individual protein abundances were reasonably predictive (RMSECV = 1.25-2.04%) for total, alcohol-soluble, and malt protein content and malt fine extract. This study illustrates the role of the environment in the barley proteome and the utility of proteomics and machine learning to predict grain/malt quality.
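The RMSECV figure quoted in the abstract above is a root-mean-square error computed only on held-out predictions from cross-validation. The sketch below shows that calculation for a deliberately simple case. It is an assumption-laden illustration, not the study's pipeline: the single-predictor least-squares model, the fold assignment by index, the function names, and the data values are all hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmsecv(xs, ys, k):
    """RMSE over held-out predictions from k-fold cross-validation."""
    squared_errors = []
    for fold in range(k):
        pairs = list(enumerate(zip(xs, ys)))
        train = [(x, y) for i, (x, y) in pairs if i % k != fold]
        held_out = [(x, y) for i, (x, y) in pairs if i % k == fold]
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        # Errors are measured only on samples the model did not see.
        squared_errors += [(y - (slope * x + intercept)) ** 2 for x, y in held_out]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: one protein's relative abundance vs. grain protein content (%).
abundance = [0.8, 1.1, 1.5, 1.9, 2.4, 2.6]
protein_pct = [9.1, 9.8, 10.4, 11.3, 12.0, 12.6]
print(round(rmsecv(abundance, protein_pct, k=3), 3))
```

The result is in the same units as the predicted trait, which is why an RMSECV can be reported directly as a percentage for protein content.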
Affiliation(s)
- Maany Ramanan
  - Department of Food Science & Technology, University of California, Davis, California 95616-5270, United States
- Harmonie Bettenhausen
  - Hartwick College Center for Craft Food & Beverage, Hartwick College, Oneonta, New York 13820, United States
- Gabriela Grigorean
  - Proteomics Core Facility, University of California, Davis, California 95616, United States
- Christine Diepenbrock
  - Department of Plant Sciences, University of California, Davis, California 95616-5270, United States
- Glen Patrick Fox
  - Department of Food Science & Technology, University of California, Davis, California 95616-5270, United States

12
Sanches PHG, de Melo NC, Porcari AM, de Carvalho LM. Integrating Molecular Perspectives: Strategies for Comprehensive Multi-Omics Integrative Data Analysis and Machine Learning Applications in Transcriptomics, Proteomics, and Metabolomics. BIOLOGY 2024; 13:848. [PMID: 39596803 PMCID: PMC11592251 DOI: 10.3390/biology13110848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2024] [Revised: 07/19/2024] [Accepted: 07/25/2024] [Indexed: 11/29/2024]
Abstract
With the advent of high-throughput technologies, the field of omics has made significant strides in characterizing biological systems at various levels of complexity. Transcriptomics, proteomics, and metabolomics are the three most widely used omics technologies, each providing unique insights into different layers of a biological system. However, analyzing each omics data set separately may not provide a comprehensive understanding of the subject under study. Therefore, integrating multi-omics data has become increasingly important in bioinformatics research. In this article, we review strategies for integrating transcriptomics, proteomics, and metabolomics data, including co-expression analysis, metabolite-gene networks, constraint-based models, pathway enrichment analysis, and interactome analysis. We discuss combined omics integration approaches, correlation-based strategies, and machine learning techniques that utilize one or more types of omics data. By presenting these methods, we aim to provide researchers with a better understanding of how to integrate omics data to gain a more comprehensive view of a biological system, facilitating the identification of complex patterns and interactions that might be missed by single-omics analyses.
Affiliation(s)
- Pedro H. Godoy Sanches
  - MS4Life Laboratory of Mass Spectrometry, Health Sciences Postgraduate Program, São Francisco University, Bragança Paulista 12916-900, SP, Brazil
- Nicolly Clemente de Melo
  - Graduate Program in Biomedicine, São Francisco University, Bragança Paulista 12916-900, SP, Brazil
- Andreia M. Porcari
  - MS4Life Laboratory of Mass Spectrometry, Health Sciences Postgraduate Program, São Francisco University, Bragança Paulista 12916-900, SP, Brazil
- Lucas Miguel de Carvalho
  - Post Graduate Program in Health Sciences, São Francisco University, Bragança Paulista 12916-900, SP, Brazil

13
Neely BA, Perez-Riverol Y, Palmblad M. Quality Control in the Mass Spectrometry Proteomics Core: A Practical Primer. J Biomol Tech 2024; 35:3fc1f5fe.42308a9a. [PMID: 40331211 PMCID: PMC12051443 DOI: 10.7171/3fc1f5fe.42308a9a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2025]
Abstract
The past decade has seen widespread advances in quality control (QC) materials and software tools focused specifically on mass spectrometry-based proteomics, yet the rate of adoption is inconsistent. Despite the fundamental importance of QC, it typically falls behind learning new techniques, instruments, or software. Considering how important QC is in a core setting where data is generated for non-mass spectrometry experts and confidence in delivered results is paramount, we have created this quick-start guide focusing on off-the-shelf QC materials and relatively easy-to-use QC software. We hope that by providing a background on the different levels of QC, different materials and their uses, describing QC design options, and highlighting some current QC software, implementing QC in a core setting will be easier than ever. There continues to be development in each of these areas (such as new materials and software), and the current generation of QC for mass spectrometry-based proteomics is more than capable of conveying confidence in results as well as minimizing laboratory downtime by guiding experimental, technical, and analytical troubleshooting from sample to results.
Affiliation(s)
- Yasset Perez-Riverol
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Magnus Palmblad
  - Center for Proteomics and Metabolomics, Leiden University Medical Center, Leiden, The Netherlands

14
Dens C, Adams C, Laukens K, Bittremieux W. Machine Learning Strategies to Tackle Data Challenges in Mass Spectrometry-Based Proteomics. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2024; 35:2143-2155. [PMID: 39074335 DOI: 10.1021/jasms.4c00180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/31/2024]
Abstract
In computational proteomics, machine learning (ML) has emerged as a vital tool for enhancing data analysis. Despite significant advancements, the diversity of ML model architectures and the complexity of proteomics data present substantial challenges in the effective development and evaluation of these tools. Here, we highlight the necessity for high-quality, comprehensive data sets to train ML models and advocate for the standardization of data to support robust model development. We emphasize the instrumental role of key data sets like ProteomeTools and MassIVE-KB in advancing ML applications in proteomics and discuss the implications of data set size on model performance, highlighting that larger data sets typically yield more accurate models. To address data scarcity, we explore algorithmic strategies such as self-supervised pretraining and multitask learning. Ultimately, we hope that this discussion can serve as a call to action for the proteomics community to collaborate on data standardization and collection efforts, which are crucial for the sustainable advancement and refinement of ML methodologies in the field.
Affiliation(s)
- Ceder Dens
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Middelheimlaan 1, 2020 Antwerpen, Belgium
- Charlotte Adams
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Middelheimlaan 1, 2020 Antwerpen, Belgium
- Kris Laukens
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Middelheimlaan 1, 2020 Antwerpen, Belgium
- Wout Bittremieux
  - Adrem Data Lab, Department of Computer Science, University of Antwerp, Middelheimlaan 1, 2020 Antwerpen, Belgium

15
Peng S, Rajjou L. Advancing plant biology through deep learning-powered natural language processing. PLANT CELL REPORTS 2024; 43:208. [PMID: 39102077 DOI: 10.1007/s00299-024-03294-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Accepted: 07/19/2024] [Indexed: 08/06/2024]
Abstract
The application of deep learning methods, specifically the utilization of Large Language Models (LLMs), in the field of plant biology holds significant promise for generating novel knowledge on plant cell systems. The LLM framework exhibits exceptional potential, particularly with the development of Protein Language Models (PLMs), allowing for in-depth analyses of nucleic acid and protein sequences. This analytical capacity facilitates the discernment of intricate patterns and relationships within biological data, encompassing multi-scale information within DNA or protein sequences. The contribution of PLMs extends beyond mere sequence patterns and structure-function recognition; it also supports advancements in genetic improvements for agriculture. The integration of deep learning approaches into the domain of plant sciences offers opportunities for major breakthroughs in basic research across multi-scale plant traits. Consequently, the strategic application of deep learning methodologies, particularly leveraging the potential of LLMs, will undoubtedly play a pivotal role in advancing plant sciences, plant production, plant uses and propelling the trajectory toward sustainable agroecological and agro-food transitions.
Affiliation(s)
- Shuang Peng
  - Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin for Plant Sciences (IJPB), 78000, Versailles, France
- Loïc Rajjou
  - Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin for Plant Sciences (IJPB), 78000, Versailles, France

16
Gyori BM, Vitek O. Beyond protein lists: AI-assisted interpretation of proteomic investigations in the context of evolving scientific knowledge. Nat Methods 2024; 21:1387-1389. [PMID: 39122950 DOI: 10.1038/s41592-024-02324-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/12/2024]
Affiliation(s)
- Benjamin M Gyori
  - Barnett Institute for Chemical and Biological Analysis, Northeastern University, Boston, MA, USA
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
  - Department of Bioengineering, College of Engineering, Northeastern University, Boston, MA, USA
- Olga Vitek
  - Barnett Institute for Chemical and Biological Analysis, Northeastern University, Boston, MA, USA
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA

17
Smith BJ, Guest PC, Martins-de-Souza D. Maximizing Analytical Performance in Biomolecular Discovery with LC-MS: Focus on Psychiatric Disorders. ANNUAL REVIEW OF ANALYTICAL CHEMISTRY (PALO ALTO, CALIF.) 2024; 17:25-46. [PMID: 38424029 DOI: 10.1146/annurev-anchem-061522-041154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/02/2024]
Abstract
In this review, we discuss the cutting-edge developments in mass spectrometry proteomics and metabolomics that have brought improvements for the identification of new disease-based biomarkers. A special focus is placed on psychiatric disorders, for example, schizophrenia, because they are considered to be not a single disease entity but rather a spectrum of disorders with many overlapping symptoms. This review includes descriptions of various types of commonly used mass spectrometry platforms for biomarker research, as well as complementary techniques to maximize data coverage, reduce sample heterogeneity, and work around potentially confounding factors. Finally, we summarize the different statistical methods that can be used for improving data quality to aid in reliability and interpretation of proteomics findings, as well as to enhance their translatability into clinical use and generalizability to new data sets.
Affiliation(s)
- Bradley J Smith
  - Laboratory of Neuroproteomics, Department of Biochemistry and Tissue Biology, Institute of Biology, University of Campinas, São Paulo, Brazil
- Paul C Guest
  - Laboratory of Neuroproteomics, Department of Biochemistry and Tissue Biology, Institute of Biology, University of Campinas, São Paulo, Brazil
  - Department of Psychiatry, Otto-von-Guericke-University Magdeburg, Magdeburg, Germany
  - Laboratory of Translational Psychiatry, Otto-von-Guericke-University Magdeburg, Magdeburg, Germany
- Daniel Martins-de-Souza
  - Laboratory of Neuroproteomics, Department of Biochemistry and Tissue Biology, Institute of Biology, University of Campinas, São Paulo, Brazil
  - Experimental Medicine Research Cluster, University of Campinas, São Paulo, Brazil
  - National Institute of Biomarkers in Neuropsychiatry, National Council for Scientific and Technological Development, São Paulo, Brazil
  - D'Or Institute for Research and Education, São Paulo, Brazil
  - INCT in Modelling Human Complex Diseases with 3D Platforms (Model3D), São Paulo, Brazil

18
Kalhor M, Lapin J, Picciani M, Wilhelm M. Rescoring Peptide Spectrum Matches: Boosting Proteomics Performance by Integrating Peptide Property Predictors Into Peptide Identification. Mol Cell Proteomics 2024; 23:100798. [PMID: 38871251 PMCID: PMC11269915 DOI: 10.1016/j.mcpro.2024.100798] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2024] [Revised: 05/26/2024] [Accepted: 06/09/2024] [Indexed: 06/15/2024]
Abstract
Rescoring of peptide spectrum matches originating from database search engines enabled by peptide property predictors is exceeding the performance of peptide identification from traditional database search engines. In contrast to the peptide spectrum match scores calculated by traditional database search engines, rescoring peptide spectrum matches generates scores based on comparing observed and predicted peptide properties, such as fragment ion intensities and retention times. These newly generated scores enable a more efficient discrimination between correct and incorrect peptide spectrum matches. This approach was shown to lead to substantial improvements in the number of confidently identified peptides, facilitating the analysis of challenging datasets in various fields such as immunopeptidomics, metaproteomics, proteogenomics, and single-cell proteomics. In this review, we summarize the key elements leading up to the recent introduction of multiple data-driven rescoring pipelines. We provide an overview of relevant post-processing rescoring tools, introduce prominent data-driven rescoring pipelines for various applications, and highlight limitations, opportunities, and future perspectives of this approach and its impact on mass spectrometry-based proteomics.
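The idea of scoring a match by comparing observed and predicted peptide properties can be sketched as follows. This is a minimal illustration under stated assumptions, not any published pipeline: the dictionary keys, the function names, and the example values are hypothetical, and the normalized spectral angle is used here as one commonly used similarity measure between intensity vectors.

```python
import math


def spectral_angle(observed, predicted):
    """Normalized spectral angle; lies in [0, 1] for non-negative intensities."""
    dot = sum(o * p for o, p in zip(observed, predicted))
    norm = math.sqrt(sum(o * o for o in observed)) * math.sqrt(sum(p * p for p in predicted))
    if norm == 0:
        return 0.0
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding drift
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


def rescoring_features(psm):
    """Turn one peptide-spectrum match into features for a downstream classifier."""
    return {
        "search_engine_score": psm["search_engine_score"],
        "spectral_angle": spectral_angle(
            psm["observed_intensities"], psm["predicted_intensities"]
        ),
        "abs_rt_error": abs(psm["observed_rt"] - psm["predicted_rt"]),
    }


# Hypothetical PSM: intensities for the same fragment ions, retention times in minutes.
psm = {
    "search_engine_score": 42.0,
    "observed_intensities": [0.9, 0.1, 0.4, 0.0],
    "predicted_intensities": [1.0, 0.2, 0.3, 0.0],
    "observed_rt": 35.2,
    "predicted_rt": 34.7,
}
print(rescoring_features(psm))
```

A correct match should show a high spectral angle and a small retention-time error, so feeding these features to a classifier alongside the original search-engine score gives it more evidence for separating correct from incorrect matches.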
Affiliation(s)
- Mostafa Kalhor
  - Computational Mass Spectrometry, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Joel Lapin
  - Computational Mass Spectrometry, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Mario Picciani
  - Computational Mass Spectrometry, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Mathias Wilhelm
  - Computational Mass Spectrometry, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
  - Munich Data Science Institute, Technical University of Munich, Garching, Germany

19
Ye J, He X, Wang S, Dong MQ, Wu F, Lu S, Feng F. Test-Time Training for Deep MS/MS Spectrum Prediction Improves Peptide Identification. J Proteome Res 2024; 23:550-559. [PMID: 38153036 DOI: 10.1021/acs.jproteome.3c00229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/29/2023]
Abstract
In bottom-up proteomics, peptide-spectrum matching is critical for peptide and protein identification. Recently, deep learning models have been used to predict tandem mass spectra of peptides, enabling the calculation of similarity scores between the predicted and experimental spectra for peptide-spectrum matching. These models follow the supervised learning paradigm, which trains a general model using paired peptides and spectra from standard data sets and directly employs the model on experimental data. However, this approach can lead to inaccurate predictions due to differences between the training data and the experimental data, such as sample types, enzyme specificity, and instrument calibration. To tackle this problem, we developed a test-time training paradigm that adapts the pretrained model to generate experimental data-specific models, namely, PepT3. PepT3 yields a 10-40% increase in peptide identification depending on the variability in training and experimental data. Intriguingly, when applied to a patient-derived immunopeptidomic sample, PepT3 increases the identification of tumor-specific immunopeptide candidates by 60%. Two-thirds of the newly identified candidates are predicted to bind to the patient's human leukocyte antigen isoforms. To facilitate access of the model and all the results, we have archived all the intermediate files in Zenodo.org with identifier 8231084.
Affiliation(s)
- Jianbai Ye
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui 230026, China
- Xiangnan He
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui 230026, China
- Shujuan Wang
  - National Institute of Biological Sciences, Beijing 102206, China
- Meng-Qiu Dong
  - National Institute of Biological Sciences, Beijing 102206, China
- Feng Wu
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui 230026, China
- Shan Lu
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, California 92093, United States
- Fuli Feng
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui 230026, China

20
Walzer M, Jeong K, Tabb DL, Vizcaíno JA. TopDownApp: An open and modular platform for analysis and visualisation of top-down proteomics data. Proteomics 2024; 24:e2200403. [PMID: 37787899 DOI: 10.1002/pmic.202200403] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/02/2023] [Revised: 09/13/2023] [Accepted: 09/13/2023] [Indexed: 10/04/2023]
Abstract
Although Top-down (TD) proteomics techniques, aimed at the analysis of intact proteins and proteoforms, are becoming increasingly popular, efforts are needed at different levels to generalise their adoption. In this context, there are numerous improvements that are possible in the area of open science practices, including a greater application of the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles. These include, for example, increased data sharing practices and readily available open data standards. Additionally, the field would benefit from the development of open data analysis workflows that can enable data reuse of public datasets, something that is increasingly common in other proteomics fields.
Affiliation(s)
- Mathias Walzer
  - European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
- Kyowon Jeong
  - Applied Bioinformatics, Computer Science Department, University of Tübingen, Tübingen, Germany
- David L Tabb
  - Institut Pasteur, Université Paris Cité, CNRS UAR 2024, Mass Spectrometry for Biology Unit, Paris, France
- Juan Antonio Vizcaíno
  - European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK

21
Webel H, Perez-Riverol Y, Nielsen AB, Rasmussen S. Mass spectrometry-based proteomics data from thousands of HeLa control samples. Sci Data 2024; 11:112. [PMID: 38263211 PMCID: PMC10806275 DOI: 10.1038/s41597-024-02922-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Accepted: 01/05/2024] [Indexed: 01/25/2024]
Abstract
Here we provide a curated, large scale, label free mass spectrometry-based proteomics data set derived from HeLa cell lines for general purpose machine learning and analysis. Data access and filtering is a tedious task, which takes up considerable amounts of time for researchers. Therefore we provide machine based metadata for easy selection and overview along the 7,444 raw files and MaxQuant search output. For convenience, we provide three filtered and aggregated development datasets on the protein groups, peptides and precursors level. Next to providing easy to access training data, we provide a SDRF file annotating each raw file with instrument settings allowing automated reprocessing. We encourage others to enlarge this data set by instrument runs of further HeLa samples from different machine types by providing our workflows and analysis scripts.
Affiliation(s)
- Henry Webel
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark
- Yasset Perez-Riverol
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Annelaura Bach Nielsen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark
- Simon Rasmussen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark
  - The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA

22
Chandra A, Sharma A, Dehzangi I, Tsunoda T, Sattar A. PepCNN deep learning tool for predicting peptide binding residues in proteins using sequence, structural, and language model features. Sci Rep 2023; 13:20882. [PMID: 38016996 PMCID: PMC10684570 DOI: 10.1038/s41598-023-47624-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 09/07/2023] [Accepted: 11/16/2023] [Indexed: 11/30/2023]
Abstract
Protein-peptide interactions play a crucial role in various cellular processes and are implicated in abnormal cellular behaviors leading to diseases such as cancer. Therefore, understanding these interactions is vital for both functional genomics and drug discovery efforts. Despite a significant increase in the availability of protein-peptide complexes, experimental methods for studying these interactions remain laborious, time-consuming, and expensive. Computational methods offer a complementary approach but often fall short in terms of prediction accuracy. To address these challenges, we introduce PepCNN, a deep learning-based prediction model that incorporates structural and sequence-based information from primary protein sequences. By utilizing a combination of half-sphere exposure, position specific scoring matrices from multiple-sequence alignment tool, and embedding from a pre-trained protein language model, PepCNN outperforms state-of-the-art methods in terms of specificity, precision, and AUC. The PepCNN software and datasets are publicly available at https://github.com/abelavit/PepCNN.git.
Affiliation(s)
- Abel Chandra
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
- Alok Sharma
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
  - Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Iman Dehzangi
  - Department of Computer Science, Rutgers University, Camden, NJ, USA
  - Center for Computational and Integrative Biology, Rutgers University, Camden, USA
- Tatsuhiko Tsunoda
  - Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
  - Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
- Abdul Sattar
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia

23
Baker JL. Illuminating the oral microbiome and its host interactions: recent advancements in omics and bioinformatics technologies in the context of oral microbiome research. FEMS Microbiol Rev 2023; 47:fuad051. [PMID: 37667515 PMCID: PMC10503653 DOI: 10.1093/femsre/fuad051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2023] [Revised: 08/02/2023] [Accepted: 09/01/2023] [Indexed: 09/06/2023]
Abstract
The oral microbiota has an enormous impact on human health, with oral dysbiosis now linked to many oral and systemic diseases. Recent advancements in sequencing, mass spectrometry, bioinformatics, computational biology, and machine learning are revolutionizing oral microbiome research, enabling analysis at an unprecedented scale and level of resolution using omics approaches. This review contains a comprehensive perspective of the current state-of-the-art tools available to perform genomics, metagenomics, phylogenomics, pangenomics, transcriptomics, proteomics, metabolomics, lipidomics, and multi-omics analysis on (all) microbiomes, and then provides examples of how the techniques have been applied to research of the oral microbiome, specifically. Key findings of these studies and remaining challenges for the field are highlighted. Although the methods discussed here are placed in the context of their contributions to oral microbiome research specifically, they are pertinent to the study of any microbiome, and the intended audience of this includes researchers who would simply like to get an introduction to microbial omics and/or an update on the latest omics methods. Continued research of the oral microbiota using omics approaches is crucial and will lead to dramatic improvements in human health, longevity, and quality of life.
Affiliation(s)
- Jonathon L Baker
  - Department of Oral Rehabilitation & Biosciences, School of Dentistry, Oregon Health & Science University, 3181 Sam Jackson Park Road, Portland, OR 97202, United States
  - Genomic Medicine Group, J. Craig Venter Institute, La Jolla, CA 92037, United States
  - Department of Pediatrics, UC San Diego School of Medicine, La Jolla, CA 92093, United States

24
Declercq A, Bouwmeester R, Chiva C, Sabidó E, Hirschler A, Carapito C, Martens L, Degroeve S, Gabriels R. Updated MS²PIP web server supports cutting-edge proteomics applications. Nucleic Acids Res 2023:7151340. [PMID: 37140039 DOI: 10.1093/nar/gkad335] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 02/26/2023] [Revised: 04/04/2023] [Accepted: 04/25/2023] [Indexed: 05/05/2023]
Abstract
Interest in the use of machine learning for peptide fragmentation spectrum prediction has been strongly on the rise over the past years, especially for applications in challenging proteomics identification workflows such as immunopeptidomics and the full-proteome identification of data independent acquisition spectra. Since its inception, the MS²PIP peptide spectrum predictor has been widely used for various downstream applications, mostly thanks to its accuracy, ease-of-use, and broad applicability. We here present a thoroughly updated version of the MS²PIP web server, which includes new and more performant prediction models for both tryptic- and non-tryptic peptides, for immunopeptides, and for CID-fragmented TMT-labeled peptides. Additionally, we have also added new functionality to greatly facilitate the generation of proteome-wide predicted spectral libraries, requiring only a FASTA protein file as input. These libraries also include retention time predictions from DeepLC. Moreover, we now provide pre-built and ready-to-download spectral libraries for various model organisms in multiple DIA-compatible spectral library formats. Besides upgrading the back-end models, the user experience on the MS²PIP web server is thus also greatly enhanced, extending its applicability to new domains, including immunopeptidomics and MS3-based TMT quantification experiments. MS²PIP is freely available at https://iomics.ugent.be/ms2pip/.
Affiliation(s)
- Arthur Declercq
  - VIB-UGent Center for Medical Biotechnology, VIB, Belgium
  - Department of Biomolecular Medicine, Ghent University, Belgium
- Robbin Bouwmeester
  - VIB-UGent Center for Medical Biotechnology, VIB, Belgium
  - Department of Biomolecular Medicine, Ghent University, Belgium
- Cristina Chiva
  - Proteomics Unit, Universitat Pompeu Fabra, 08003, Barcelona, Spain
  - Proteomics Unit, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), 08003, Barcelona, Spain
- Eduard Sabidó
  - Proteomics Unit, Universitat Pompeu Fabra, 08003, Barcelona, Spain
  - Proteomics Unit, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), 08003, Barcelona, Spain
- Aurélie Hirschler
  - Laboratoire de Spectrométrie de Masse BioOrganique (LSMBO), Université de Strasbourg, CNRS, France
- Christine Carapito
  - Laboratoire de Spectrométrie de Masse BioOrganique (LSMBO), Université de Strasbourg, CNRS, France
- Lennart Martens
  - VIB-UGent Center for Medical Biotechnology, VIB, Belgium
  - Department of Biomolecular Medicine, Ghent University, Belgium
- Sven Degroeve
  - VIB-UGent Center for Medical Biotechnology, VIB, Belgium
  - Department of Biomolecular Medicine, Ghent University, Belgium
- Ralf Gabriels
  - VIB-UGent Center for Medical Biotechnology, VIB, Belgium
  - Department of Biomolecular Medicine, Ghent University, Belgium
| |
|
25
|
Rehfeldt TG, Krawczyk K, Echers SG, Marcatili P, Palczynski P, Röttger R, Schwämmle V. Variability analysis of LC-MS experimental factors and their impact on machine learning. Gigascience 2023; 12:giad096. [PMID: 37983748 PMCID: PMC10659119 DOI: 10.1093/gigascience/giad096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2023] [Revised: 08/23/2023] [Accepted: 10/11/2023] [Indexed: 11/22/2023] Open
Abstract
BACKGROUND Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data-processing pipeline from raw data analysis to end-user predictions and rescoring. ML models need large-scale datasets for training and repurposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs. RESULTS We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variability in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning to evaluate the benefits of current best practice methods in the field for transfer learning. CONCLUSIONS Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it is important to construct datasets most closely resembling future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it did increase model performance, did not increase model performance compared to a non-pretrained model.
Affiliation(s)
- Tobias Greisager Rehfeldt
  - Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
- Konrad Krawczyk
  - Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
- Paolo Marcatili
  - Department of Health Technology, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
- Pawel Palczynski
  - Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
- Richard Röttger
  - Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
- Veit Schwämmle
  - Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
| |
|