1
Liu X, Zhang H, Zhou W, Zhou Y, Zhang Y, Cao X, Liu M, Peng Y. Machine learning for predicting retention times of chiral analytes chromatographically separated by CMPA technique. J Chromatogr A 2025; 1749:465896. [PMID: 40147253 DOI: 10.1016/j.chroma.2025.465896] [Received: 12/18/2024] [Revised: 03/18/2025] [Accepted: 03/22/2025] [Indexed: 03/29/2025]
Abstract
The chiral mobile phase additive (CMPA) technique is an attractive method for chromatographic enantioseparation of chiral analytes. However, establishing chromatographic separation and analysis methods for given chiral analytes often requires extensive trial-and-error experiments, which are time-consuming and costly. To address this challenge, machine learning (ML) was employed to predict the retention times of R- and S-analytes and thereby facilitate chromatographic enantioseparation. In this study, the enantiomeric retention times of chiral analytes enantioseparated by HPLC using cyclodextrin derivatives as CMPA were recorded, and the molecular descriptors of both the chiral analytes and the CMPA were calculated. Subsequently, several algorithms were employed for model development, with the coefficient of determination (R²) serving as the metric to assess the precision of these models. The findings indicate that the CatBoost model works well in predicting retention times and separability of chiral analytes. This study provides a rapid and efficient method to facilitate the development of the CMPA technique.
Affiliation(s)
- Xiong Liu
  - School of Chemistry and Chemical Engineering, Hunan University of Science and Technology, Xiangtan 411201, Hunan, PR China
- He Zhang
  - School of Chemistry and Chemical Engineering, Hunan University of Science and Technology, Xiangtan 411201, Hunan, PR China
- Wei Zhou
  - Hunan Diantou Education Technology Co., Ltd, Changsha 410221, Hunan, PR China
- Yuying Zhou
  - School of Chemistry and Chemical Engineering, Hunan University of Science and Technology, Xiangtan 411201, Hunan, PR China
- Yuexin Zhang
  - School of Chemistry and Chemical Engineering, Hunan University of Science and Technology, Xiangtan 411201, Hunan, PR China
- Xiaoliang Cao
  - Hunan Diantou Education Technology Co., Ltd, Changsha 410221, Hunan, PR China
- Muqing Liu
  - Hunan Diantou Education Technology Co., Ltd, Changsha 410221, Hunan, PR China
- Yingzi Peng
  - Hunan Diantou Education Technology Co., Ltd, Changsha 410221, Hunan, PR China
2
Xu R, Zhu J. Unveiling the dark matter of the metabolome: A narrative review of bioinformatics tools for LC-HRMS-based compound annotation. Talanta 2025; 295:128327. [PMID: 40393240 DOI: 10.1016/j.talanta.2025.128327] [Received: 03/15/2025] [Revised: 05/07/2025] [Accepted: 05/13/2025] [Indexed: 05/22/2025]
Abstract
Compound annotation, including the unveiling of dark matter in metabolomics studies, is a pivotal undertaking within the metabolomics field, serving as the linchpin for unraveling the identities and attributes of chemical entities. This narrative review examines the evolution of widely adopted compound annotation tools tailored for liquid chromatography-mass spectrometry (LC-MS) data analysis over the past two decades, which has been characterized by a transition from library-based search methodologies to advanced high-throughput approaches. Furthermore, emerging tools originating from both the LC and MS domains are summarized. The synergistic partnership between quantitative structure-retention relationship (QSRR) models and machine learning (ML) techniques is explored, encompassing both conventional methodologies and advanced convolutional neural networks (CNNs). This collaborative framework has played a pivotal role in the precise prediction of retention times. Additionally, the enhanced applicability and extensibility of retention order prediction are emphasized, particularly under the constraints of experimental configurations. Within the domain of mass spectra-based annotation, the foundational task of mapping compound structures to mass spectra is examined, a task traditionally accomplished by aligning experimental data with established standards and libraries. Recent advancements highlight emerging tools that adopt multi-tiered mapping strategies, such as molecular networks and fragmentation trees, or incorporate machine learning to capture complex mapping patterns. This comprehensive examination underscores the pivotal role of compound annotation tools in advancing our understanding of complex LC-MS data matrices and in further assisting the annotation of dark matter in the metabolome.
Affiliation(s)
- Rui Xu
  - Human Nutrition Program, Department of Human Sciences, The Ohio State University, Columbus, OH, 43210, United States; Comprehensive Cancer Center, The Ohio State University, Columbus, OH, 43210, United States
- Jiangjiang Zhu
  - Human Nutrition Program, Department of Human Sciences, The Ohio State University, Columbus, OH, 43210, United States; Comprehensive Cancer Center, The Ohio State University, Columbus, OH, 43210, United States
3
Liao GQ, Tang HM, Yu YD, Fu LZ, Li SJ, Zhu MX. Mass spectrometry-based metabolomic as a powerful tool to unravel the component and mechanism in TCM. Chin Med 2025; 20:62. [PMID: 40355943 PMCID: PMC12067679 DOI: 10.1186/s13020-025-01112-2] [Received: 01/07/2025] [Accepted: 04/21/2025] [Indexed: 05/15/2025] [Open access]
Abstract
Mass spectrometry (MS)-based metabolomics has emerged as a transformative tool for unraveling components and their mechanisms in traditional Chinese medicine (TCM). The integration of advanced analytical platforms, such as LC-MS and GC-MS, coupled with metabolomics, has propelled the qualitative and quantitative characterization of TCM's complex components. This review comprehensively examines the applications of MS-based metabolomics in elucidating TCM efficacy, spanning chemical composition analysis, molecular target identification, mechanism-of-action studies, and syndrome differentiation. Recent innovations in functional metabolomics, spatial metabolomics, single-cell metabolomics, and metabolic flux analysis have further expanded TCM research horizons. The integration of artificial intelligence (AI) and bioinformatics offers promising avenues for overcoming analytical bottlenecks, enhancing database standardization, and driving interdisciplinary breakthroughs. However, challenges remain, including the need for improved data processing standardization, database expansion, and understanding of metabolite-gene-protein interactions. By addressing these gaps, metabolomics can bridge traditional practices and modern biomedical research, fostering global acceptance of TCM. This review highlights the synergy of advanced MS techniques, computational tools, and TCM's holistic philosophy, presenting a forward-looking perspective on its clinical translation and internationalization.
Affiliation(s)
- Guang-Qin Liao
  - Chongqing Academy of Animal Sciences, Chongqing, 402460, China
  - National Center of Technology Innovation for Pigs, Chongqing, 402460, China
- Hong-Mei Tang
  - Chongqing Academy of Animal Sciences, Chongqing, 402460, China
  - National Animal Disease-Chongqing Monitoring Station, Chongqing, 402460, China
- Yuan-Di Yu
  - National Center of Technology Innovation for Pigs, Chongqing, 402460, China
  - National Animal Disease-Chongqing Monitoring Station, Chongqing, 402460, China
- Li-Zhi Fu
  - Chongqing Academy of Animal Sciences, Chongqing, 402460, China
  - Chongqing Research Center of Veterinary Biologicals Engineering and Technology, Chongqing, 402460, China
- Shuang-Jiao Li
  - Chinese Academy of Agricultural Sciences, Beijing, 100061, China
- Mai-Xun Zhu
  - Chongqing Academy of Animal Sciences, Chongqing, 402460, China
  - National Center of Technology Innovation for Pigs, Chongqing, 402460, China
4
Hong Y, Ye Y, Tang H. Machine Learning in Small-Molecule Mass Spectrometry. Annu Rev Anal Chem (Palo Alto Calif) 2025; 18:193-215. [PMID: 40014655 DOI: 10.1146/annurev-anchem-071224-082157] [Indexed: 03/01/2025]
Abstract
Tandem mass spectrometry (MS/MS) is crucial for small-molecule analysis; however, traditional computational methods are limited by incomplete reference libraries and complex data processing. Machine learning (ML) is transforming small-molecule mass spectrometry in three key directions: (a) predicting MS/MS spectra and related physicochemical properties to expand reference libraries, (b) improving spectral matching through automated pattern extraction, and (c) predicting molecular structures of compounds directly from their MS/MS spectra. We review ML approaches for molecular representations [descriptors, simplified molecular-input line-entry system (SMILES) strings, and graphs] and MS/MS spectra representations (using binned vectors and peak lists) along with recent advances in the prediction of spectra, retention times, and collision cross sections, and in spectral matching. Finally, we discuss ML-integrated workflows for chemical formula identification. By addressing the limitations of current methods for compound identification, these ML approaches can greatly enhance the understanding of biological processes and the development of diagnostic and therapeutic tools.
Affiliation(s)
- Yuhui Hong
  - Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, Indiana, USA
- Yuzhen Ye
  - Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, Indiana, USA
- Haixu Tang
  - Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, Indiana, USA
5
Zakir M, LeVatte MA, Wishart DS. RT-Pred: A web server for accurate, customized liquid chromatography retention time prediction of chemicals. J Chromatogr A 2025; 1747:465816. [PMID: 40023050 DOI: 10.1016/j.chroma.2025.465816] [Received: 12/16/2024] [Revised: 02/21/2025] [Accepted: 02/23/2025] [Indexed: 03/04/2025]
Abstract
High-performance liquid chromatography (HPLC) together with mass spectrometry (MS) is routinely used to separate, identify and quantify chemicals. HPLC data also provides retention time (RT) which can be aligned with structural data. Recent developments in machine learning (ML) have improved our ability to predict RTs from known or postulated chemical structures, allowing RT data to be used more effectively in LC-MS-based compound identification. However, RT data is highly specific to each chromatographic method (CM) and hundreds of different CMs with interdependent parameters are used in separations. This has limited the application of ML-based RT predictions in compound identification. Here we introduce an easy-to-use RT prediction webserver (called RT-Pred) that predicts RTs for molecules across most chromatographic setups. RT-Pred not only supports its own in-house CM-specific RT predictors, it allows users to easily train a custom RT-Pred model using their own RT data on their own CM and to predict RTs with that custom model. RT-Pred also supports RT and compound searches against its own database of millions of predicted RTs spanning >40 different CMs. RT-Pred is also uniquely capable of accurately identifying compounds that will elute in the void volume or be retained on the column. Including this void/retained/eluted classifier significantly improves RT-Pred's performance. Tests indicate that RT-Pred had an average coefficient of determination (R²) of 0.95 over 20 different CMs. Comparisons of RT-Pred against other RT predictors showed that RT-Pred achieved lower mean absolute errors and higher R² scores than any other published RT predictor. RT-Pred is freely available at https://rtpred.ca.
Affiliation(s)
- Mahi Zakir
  - Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada
- Marcia A LeVatte
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
- David S Wishart
  - Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada; Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada; Department of Laboratory Medicine and Pathology, University of Alberta, Edmonton, AB T6G 2B7, Canada; Faculty of Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton, AB T6G 2H7, Canada
6
Mazraedoost S, Sedigh Malekroodi H, Žuvela P, Yi M, Liu JJ. Prediction of Chromatographic Retention Time of a Small Molecule from SMILES Representation Using a Hybrid Transformer-LSTM Model. J Chem Inf Model 2025; 65:3343-3356. [PMID: 40152775 DOI: 10.1021/acs.jcim.5c00167] [Indexed: 03/29/2025]
Abstract
Accurate retention time (RT) prediction in liquid chromatography remains a significant consideration in molecular analysis. In this study, we explore the use of a transformer-based language model to predict RTs by treating simplified molecular input line entry system (SMILES) sequences as textual input, an approach that has not been previously utilized in this field. Our architecture combines a pretrained RoBERTa (robustly optimized BERT approach, a variant of BERT) with bidirectional long short-term memory (BiLSTM) networks to predict retention times in reversed-phase high-performance liquid chromatography (RP-HPLC). The METLIN small molecule retention time (SMRT) data set, comprising 77,980 small molecules after preprocessing, was encoded using SMILES notation and processed through a tokenizer to enable molecular representation as sequential data. The proposed transformer-LSTM architecture incorporates layer fusion from multiple transformer layers and bidirectional sequence processing, achieving superior performance compared to existing methods with a mean absolute error (MAE) of 26.23 s, a mean absolute percentage error (MAPE) of 3.25%, and a coefficient of determination (R²) of 0.91. The model's explainability was demonstrated through attention visualization, revealing its focus on key molecular features that can influence RT. Furthermore, we evaluated the model's transfer learning capabilities across ten data sets from the PredRet database, demonstrating robust performance across different chromatographic conditions with consistent improvement over previous approaches. Our results suggest that the hybrid model presents a valuable approach for predicting RT in liquid chromatography, with potential applications in metabolomics and small molecule analysis.
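The three figures of merit quoted in this abstract (MAE, MAPE, and the coefficient of determination) follow standard definitions. The sketch below shows one way to compute them; it is not the authors' evaluation code, and the function name `regression_metrics` and the sample retention times are hypothetical.

```python
def regression_metrics(y_true, y_pred):
    """Return MAE, MAPE (in percent) and R^2 for paired observations.

    Assumes equal-length, non-empty sequences, non-zero observed values
    (needed for MAPE), and observed values that are not all identical
    (needed for R^2).
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    # Mean absolute error, in the same unit as the retention times.
    mae = sum(abs(r) for r in residuals) / n
    # Mean absolute percentage error, relative to the observed values.
    mape = 100.0 * sum(abs(r) / abs(t) for r, t in zip(residuals, y_true)) / n
    # Coefficient of determination: 1 - SS_res / SS_tot.
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mape": mape, "r2": r2}


# Hypothetical observed and predicted retention times in seconds.
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
print(regression_metrics(observed, predicted))
```

MAE keeps the unit of the retention times, whereas MAPE and R² are dimensionless, which is why all three are often reported together.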
Affiliation(s)
- Sargol Mazraedoost
  - Department of Chemical Engineering, Pukyong National University, Busan 48513, Republic of Korea
- Hadi Sedigh Malekroodi
  - Industry 4.0 Convergence Bionics Engineering, Pukyong National University, Busan 48513, Republic of Korea
- Petar Žuvela
  - Department of Chemical Engineering, Pukyong National University, Busan 48513, Republic of Korea
- Myunggi Yi
  - Industry 4.0 Convergence Bionics Engineering, Pukyong National University, Busan 48513, Republic of Korea
  - Major of Biomedical Engineering, Division of Smart Healthcare, Pukyong National University, Busan 48513, Republic of Korea
- J Jay Liu
  - Department of Chemical Engineering, Pukyong National University, Busan 48513, Republic of Korea
  - Institute of Cleaner Production Technology, Pukyong National University, 45, Yongso-Ro, Nam-Gu, Busan 48513, South Korea
7
Marchetto A, Tirapelle M, Mazzei L, Sorensen E, Besenhard MO. In Silico High-Performance Liquid Chromatography Method Development via Machine Learning. Anal Chem 2025; 97:6991-7001. [PMID: 40152207 PMCID: PMC11983366 DOI: 10.1021/acs.analchem.4c03466] [Received: 07/05/2024] [Revised: 03/11/2025] [Accepted: 03/13/2025] [Indexed: 03/29/2025]
Abstract
High-performance liquid chromatography (HPLC) remains the gold standard for analyzing and purifying molecular components in solutions. However, developing HPLC methods is material- and time-consuming, so computer-aided shortcuts are highly desirable. In line with the digitalization of process development and the growth of HPLC databases, we propose a data-driven methodology to predict molecule retention factors as a function of mobile phase composition without the need for any new experiments, solely relying on molecular descriptors (MDs) obtained via simplified molecular input line entry system (SMILES) string representations of molecules. This new approach combines: (a) quantitative structure-property relationships (QSPR) using MDs to predict solute-dependent parameters in (b) linear solvation energy relationships (LSER) and (c) linear solvent strength (LSS) theory. We demonstrate the potential of this computational methodology using experimental data for retention factors of small molecules made available by the research community for which the MDs were obtained via SMILES string representations determined by the structural formulas of the molecules. This method can be adopted directly to predict elution times of molecular components; however, in combination with first-principle-based mechanistic transport models, the method can also be employed to optimize HPLC methods in-silico. Both options can reduce the experimental load and accelerate HPLC method development significantly, lowering the time and cost of the drug manufacturing cycle and reducing the time to market. Given the growing number and quality of HPLC databases, the predictive power of this methodology will only increase in the coming years.
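Of the three ingredients named in this abstract, linear solvent strength (LSS) theory is the one that ties the retention factor to mobile phase composition. In its textbook isocratic form it reads log10(k) = log10(k_w) - S·φ, where φ is the organic modifier fraction. The sketch below illustrates only that relationship and the conversion to retention time, t_R = t_0·(1 + k). It is not the authors' pipeline, and the solute parameters are hypothetical placeholders standing in for values that a QSPR model would supply.

```python
def lss_retention_factor(log_kw, s, phi):
    """Retention factor k from the isocratic LSS relation.

    log_kw: log10 of the retention factor extrapolated to pure water.
    s:      solvent strength parameter of the solute.
    phi:    volume fraction of organic modifier, between 0 and 1.
    """
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must lie between 0 and 1")
    return 10.0 ** (log_kw - s * phi)


def retention_time(k, t0):
    """Convert a retention factor into a retention time, given dead time t0."""
    return t0 * (1.0 + k)


# Hypothetical solute parameters and column dead time (minutes).
LOG_KW, S, T0 = 2.0, 4.0, 1.0
for phi in (0.25, 0.5, 0.75):
    k = lss_retention_factor(LOG_KW, S, phi)
    print(f"phi={phi:.2f}  k={k:.3f}  t_R={retention_time(k, T0):.3f} min")
```

Because the relation is linear in φ on a log scale, two solute-dependent parameters are enough to sweep retention over the whole composition range, which is what makes predicting them from descriptors attractive.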
Affiliation(s)
- Alberto Marchetto
  - Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, U.K.
  - Department of Management, Economics and Industrial Engineering, Politecnico di Milano, Via Raffaele Lambruschini 4/B, Milano 20156, Italy
- Monica Tirapelle
  - Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, U.K.
- Luca Mazzei
  - Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, U.K.
- Eva Sorensen
  - Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, U.K.
- Maximilian O. Besenhard
  - Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, U.K.
8
Kelman MJ, Renaud JB, McCarron P, Hoogstra S, Chow W, Wang J, Varga E, Patriarca A, Vaya AM, Visintin L, Nguyen T, De Boevre M, De Saeger S, Karanghat V, Vuckovic D, McMullin DR, Dall'Asta C, Ayeni K, Warth B, Huang M, Tittlemier S, Mats L, Cao R, Sulyok M, Xu K, Berthiller F, Kuhn M, Cramer B, Ciasca B, Lattanzio V, De Baere S, Croubels S, DesRochers N, Sura S, Bates J, Wright EJ, Thapa I, Blackwell BA, Zhang K, Wong J, Burns L, Borts DJ, Sumarah MW. International interlaboratory study to normalize liquid chromatography-based mycotoxin retention times through implementation of a retention index system. J Chromatogr A 2025; 1745:465732. [PMID: 39913989 DOI: 10.1016/j.chroma.2025.465732] [Received: 12/13/2024] [Revised: 01/17/2025] [Accepted: 01/27/2025] [Indexed: 02/25/2025]
Abstract
Monitoring for mycotoxins in food or feed matrices is necessary to ensure the safety and security of global food systems. Due to a lack of standardized methods and individual laboratory priorities, most institutions have developed their own methods for mycotoxin determinations. Given the diversity of mycotoxin chemical structures and physicochemical properties, searching databases and comparing data between institutions are complicated. We previously introduced incorporating a retention index (RI) system into liquid chromatography-mass spectrometry (LC-MS)-based mycotoxin determinations. To validate this concept, we designed an interlaboratory study where each participating laboratory was sent N-alkylpyridinium-3-sulfonates (NAPS) RI standards and 36 mycotoxin standards for analysis using their pre-optimized LC-MS methods. Data from 44 analytical methods were submitted from 24 laboratories representing various manufacturer platforms, LC columns, and mobile phase compositions. Mycotoxin retention times (tR) were converted to RI values based on their elution relative to the NAPS standards. Trichothecenes (deoxynivalenol, 3-acetyldeoxynivalenol, 15-acetyldeoxynivalenol) showed tR consistency (± 20-50 RI units, 1-5 % median RI) regardless of mobile phase or type of chromatography column in this study. For the remaining mycotoxins tested, the RI values were strongly impacted by the mobile phase composition and column chemistry. The ability to predict tR was evaluated based on the median RI mycotoxin values and the NAPS tR. These values were corrected using Tanimoto coefficients to investigate whether structurally similar compounds could be used as anchors to further improve accuracy. This study demonstrated the power of employing an RI system for mycotoxin determinations, further enhancing the confidence of identifications.
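The abstract states that retention times were converted to RI values from their elution relative to the NAPS standards, without giving the formula. A common way to do this is linear interpolation between the two standards that bracket the analyte. The sketch below assumes that scheme, and also assumes each standard carries a fixed RI (here 100 per step, a Kováts-style convention chosen for illustration). The function name, the standard retention times, and the RI assignments are all hypothetical.

```python
from bisect import bisect_right


def retention_index(t_r, standards):
    """Interpolate a retention index from bracketing RI standards.

    standards: list of (retention_time, retention_index) pairs with
               strictly increasing retention times.
    t_r:       analyte retention time, within the range of the standards.
    """
    times = [t for t, _ in standards]
    if not times[0] <= t_r <= times[-1]:
        raise ValueError("retention time lies outside the calibrated range")

    # Index of the first standard eluting strictly after the analyte.
    hi = bisect_right(times, t_r)
    if hi == len(times):
        # The analyte co-elutes with the last standard.
        hi -= 1
    lo = hi - 1

    t_lo, ri_lo = standards[lo]
    t_hi, ri_hi = standards[hi]
    return ri_lo + (ri_hi - ri_lo) * (t_r - t_lo) / (t_hi - t_lo)


# Hypothetical standards: (retention time in min, assigned RI).
naps = [(2.0, 100.0), (4.0, 200.0), (7.0, 300.0)]
print(retention_index(5.5, naps))
```

Because the index is defined relative to co-injected standards, shifts in absolute retention time between instruments partly cancel, which is the motivation for the interlaboratory comparison described above.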
Affiliation(s)
- M J Kelman
  - London Research and Development Centre, Agriculture and Agri-Food Canada, 1391 Sandford Street, London, Ontario N5V 4T3, Canada
- J B Renaud
  - London Research and Development Centre, Agriculture and Agri-Food Canada, 1391 Sandford Street, London, Ontario N5V 4T3, Canada
- P McCarron
  - Metrology, National Research Council Canada, 1411 Oxford Street, Halifax, Nova Scotia, B3H 3Z1, Canada
- S Hoogstra
  - Agassiz Research and Development Centre, Agriculture and Agri-Food Canada, 6947 Lougheed Hwy., Agassiz, British Columbia V0M 1A2, Canada
- W Chow
  - Calgary Laboratory, Canadian Food Inspection Agency, 3650 36th Street NW, Calgary, Alberta T2L 2L1, Canada
- J Wang
  - Calgary Laboratory, Canadian Food Inspection Agency, 3650 36th Street NW, Calgary, Alberta T2L 2L1, Canada
- E Varga
  - Unit Food Hygiene and Technology, Centre for Food Science and Veterinary Public Health, Clinical Department for Farm Animals and Food System Science, University of Veterinary Medicine Vienna, Veterinärplatz 1, 1210 Vienna, Austria
- A Patriarca
  - Magan Centre of Applied Mycology, Faculty of Engineering and Applied Sciences, Cranfield University, Cranfield, Bedford MK43 0AL, United Kingdom
- A Medina Vaya
  - Magan Centre of Applied Mycology, Faculty of Engineering and Applied Sciences, Cranfield University, Cranfield, Bedford MK43 0AL, United Kingdom
- L Visintin
  - Campus Heymans, Department of Bioanalysis, Centre of Excellence in Mycotoxicology & Public Health, Ghent University, Ottergemsesteenweg, 460, 9000 Ghent, Belgium
- T Nguyen
  - Campus Heymans, Department of Bioanalysis, Centre of Excellence in Mycotoxicology & Public Health, Ghent University, Ottergemsesteenweg, 460, 9000 Ghent, Belgium
- M De Boevre
  - Campus Heymans, Department of Bioanalysis, Centre of Excellence in Mycotoxicology & Public Health, Ghent University, Ottergemsesteenweg, 460, 9000 Ghent, Belgium
- S De Saeger
  - Campus Heymans, Department of Bioanalysis, Centre of Excellence in Mycotoxicology & Public Health, Ghent University, Ottergemsesteenweg, 460, 9000 Ghent, Belgium
- V Karanghat
  - Department of Chemistry and Biochemistry, Concordia University, 7141 Sherbrooke Street West, Montréal, Québec H4B 1R6, Canada
- D Vuckovic
  - Department of Chemistry and Biochemistry, Concordia University, 7141 Sherbrooke Street West, Montréal, Québec H4B 1R6, Canada
- D R McMullin
  - Department of Chemistry, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario K1S 5B6, Canada
- C Dall'Asta
  - Department of Food and Drug, University of Parma, Parco Area delle Scienze 27/A, 43125 Parma, Italy
- K Ayeni
  - Department of Food Chemistry and Toxicology, University of Vienna, Währingerstraße 38, A-1090 Vienna, Austria
- B Warth
  - Department of Food Chemistry and Toxicology, University of Vienna, Währingerstraße 38, A-1090 Vienna, Austria
- M Huang
  - Grain Research Laboratory, Canadian Grain Commission, 1404-303 Main Street, Winnipeg, Manitoba R3C 3G7, Canada
- S Tittlemier
  - Grain Research Laboratory, Canadian Grain Commission, 1404-303 Main Street, Winnipeg, Manitoba R3C 3G7, Canada
- L Mats
  - Guelph Research and Development Center, Agriculture and Agri-Food Canada, 93 Stone Road West, Guelph, Ontario N1G 5C9, Canada
- R Cao
  - Guelph Research and Development Center, Agriculture and Agri-Food Canada, 93 Stone Road West, Guelph, Ontario N1G 5C9, Canada
- M Sulyok
  - Institute of Bioanalytics and Agro-Metabolomics, Department of Agrobiotechnology (IFA-Tulln), University of Natural Resources and Life Sciences, Vienna, Konrad Lorenz Straße 20, 3430 Tulln, Austria
- K Xu
  - Institute of Bioanalytics and Agro-Metabolomics, Department of Agrobiotechnology (IFA-Tulln), University of Natural Resources and Life Sciences, Vienna, Konrad Lorenz Straße 20, 3430 Tulln, Austria
- F Berthiller
  - Institute of Bioanalytics and Agro-Metabolomics, Department of Agrobiotechnology (IFA-Tulln), University of Natural Resources and Life Sciences, Vienna, Konrad Lorenz Straße 20, 3430 Tulln, Austria
- M Kuhn
  - Institute of Food Chemistry, Universität Münster, Corrensstraße 45, 48149 Muenster, Germany
- B Cramer
  - Institute of Food Chemistry, Universität Münster, Corrensstraße 45, 48149 Muenster, Germany
- B Ciasca
  - Institute of Sciences of Food Production, National Research Council, Amendola 122/O 70126 Bari, Italy
- V Lattanzio
  - Institute of Sciences of Food Production, National Research Council, Amendola 122/O 70126 Bari, Italy
- S De Baere
  - Laboratory of Pharmacology and Toxicology, Department of Pathobiology, Pharmacology and Zoological Medicine, Faculty of Veterinary Medicine, Ghent University, Salisburylaan 133, 9820 Merelbeke, Belgium
- S Croubels
  - Laboratory of Pharmacology and Toxicology, Department of Pathobiology, Pharmacology and Zoological Medicine, Faculty of Veterinary Medicine, Ghent University, Salisburylaan 133, 9820 Merelbeke, Belgium
- N DesRochers
  - London Research and Development Centre, Agriculture and Agri-Food Canada, 1391 Sandford Street, London, Ontario N5V 4T3, Canada
- S Sura
  - Morden Research and Development Centre, Agriculture and Agri-Food Canada, 101 Rout, Unit 100, Morden, Manitoba R6M 1Y5, Canada
- J Bates
  - National Research Council Canada, Metrology, 1200 Montreal Road, Ottawa, Ontario K1A 0R6, Canada
- E J Wright
  - Metrology, National Research Council Canada, 1411 Oxford Street, Halifax, Nova Scotia, B3H 3Z1, Canada
- I Thapa
  - Ottawa Research and Development Centre, Agriculture and Agri-Food Canada, 960 Carling Avenue, Ottawa, Ontario, K1A 0C6, Canada
- B A Blackwell
  - Ottawa Research and Development Centre, Agriculture and Agri-Food Canada, 960 Carling Avenue, Ottawa, Ontario, K1A 0C6, Canada
- K Zhang
  - Center for Food Safety and Applied Nutrition, US Food and Drug Administration, 5001 Campus Drive, College Park, Maryland 20740, USA
- J Wong
  - Center for Food Safety and Applied Nutrition, US Food and Drug Administration, 5001 Campus Drive, College Park, Maryland 20740, USA
- L Burns
  - Veterinary Diagnostic Laboratory, Iowa State University, 1850 Christensen Drive, Ames, Iowa 50011-1134, USA
- D J Borts
  - Veterinary Diagnostic Laboratory, Iowa State University, 1850 Christensen Drive, Ames, Iowa 50011-1134, USA
- M W Sumarah
  - London Research and Development Centre, Agriculture and Agri-Food Canada, 1391 Sandford Street, London, Ontario N5V 4T3, Canada
9
Khrisanfov M, Matyushin D, Samokhin A. Finding potentially erroneous entries in METLIN SMRT. J Chromatogr A 2025; 1745:465761. [PMID: 39954582 DOI: 10.1016/j.chroma.2025.465761] [Received: 12/27/2024] [Revised: 02/06/2025] [Accepted: 02/08/2025] [Indexed: 02/17/2025]
Abstract
METLIN SMRT is a widely used dataset of retention times for high-performance liquid chromatography (HPLC). Besides direct application, it is used for training models aimed at predicting retention times in HPLC. Although quite a number of articles feature METLIN SMRT, the pipelines used for filtering out errors are either simplistic or nonexistent. Therefore, a reliable method for filtering potentially erroneous entries is still required. An approach to filter potentially erroneous entries, suggested in our earlier work for a database of gas chromatography retention indexes, was repurposed for METLIN SMRT using five predictive models (GNN, CNN, FCFP, FCD, and CatBoost). The retention times were predicted for the whole dataset using a 5-fold cross-validation strategy. Entries with retention times differing significantly from the predictions obtained from a given model (bottom 5%) were flagged with a "yellow card". This procedure was repeated for each model, leading to a group containing about 1500 entries (or 2% of the dataset) with 5 "yellow cards". According to our estimate (derived from analyzing trends and distributions for groups with varying numbers of "yellow cards"), about 1200 entries were strongly suspected to be erroneous, while 300 were likely predicted inaccurately. This work demonstrates the viability of the approach and its potential to improve the quality of other large-scale chromatography-related databases for both machine learning and experimental use.
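The "yellow card" procedure described above amounts to a per-model outlier vote. The sketch below illustrates the counting step only, and makes several assumptions: predictions are taken as already computed out-of-fold for every entry, "bottom 5%" is read as the 5% of entries with the largest absolute deviation from a model's prediction, and the function name, model names, and data are hypothetical rather than taken from the paper.

```python
def count_yellow_cards(observed, predictions_by_model, worst_fraction=0.05):
    """Count, per entry, how many models place it among their worst predictions.

    observed:             list of recorded retention times.
    predictions_by_model: dict mapping a model name to its list of
                          out-of-fold predictions, aligned with `observed`.
    worst_fraction:       share of entries flagged per model.
    """
    n = len(observed)
    n_flag = max(1, round(n * worst_fraction))
    cards = [0] * n
    for preds in predictions_by_model.values():
        if len(preds) != n:
            raise ValueError("predictions must align with observations")
        errors = [abs(o - p) for o, p in zip(observed, preds)]
        # Indices of the entries with the largest absolute deviation.
        worst = sorted(range(n), key=lambda i: errors[i], reverse=True)[:n_flag]
        for i in worst:
            cards[i] += 1
    return cards


# Hypothetical data: 20 entries, entry 7 disagrees with two of three models.
observed = [100.0 + 10.0 * i for i in range(20)]
model_a = [o + 1.0 for o in observed]
model_b = [o - 2.0 for o in observed]
model_c = [o + 0.5 for o in observed]
model_a[7] += 49.0   # model A misses entry 7 by 50
model_b[7] -= 38.0   # model B misses entry 7 by 40
model_c[3] += 29.5   # model C misses a different entry by 30

cards = count_yellow_cards(observed, {"a": model_a, "b": model_b, "c": model_c})
suspects = [i for i, c in enumerate(cards) if c >= 2]
print(suspects)
```

Requiring agreement across several dissimilar models is what separates a likely data error from an entry that one particular model simply predicts poorly.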
Affiliation(s)
- Mikhail Khrisanfov
  - A.N. Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, Moscow, Russia; Chemistry Department, Lomonosov Moscow State University, Moscow, Russia
- Dmitriy Matyushin
  - A.N. Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, Moscow, Russia
- Andrey Samokhin
  - A.N. Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, Moscow, Russia; Chemistry Department, Lomonosov Moscow State University, Moscow, Russia
10
Nowatzky Y, Russo FF, Lisec J, Kister A, Reinert K, Muth T, Benner P. FIORA: Local neighborhood-based prediction of compound mass spectra from single fragmentation events. Nat Commun 2025; 16:2298. [PMID: 40055306 PMCID: PMC11889238 DOI: 10.1038/s41467-025-57422-4] [Received: 05/16/2024] [Accepted: 02/20/2025] [Indexed: 05/13/2025] [Open access]
Abstract
Non-targeted metabolomics holds great promise for advancing precision medicine and biomarker discovery. However, identifying compounds from tandem mass spectra remains a challenging task due to the incomplete nature of spectral reference libraries. Augmenting these libraries with simulated mass spectra can provide the necessary references to resolve unmatched spectra, but generating high-quality data is difficult. In this study, we present FIORA, an open-source graph neural network designed to simulate tandem mass spectra. Our main contribution lies in utilizing the molecular neighborhood of bonds to learn breaking patterns and derive fragment ion probabilities. FIORA not only surpasses state-of-the-art fragmentation algorithms, ICEBERG and CFM-ID, in prediction quality, but also facilitates the prediction of additional features, such as retention time and collision cross section. Utilizing GPU acceleration, FIORA enables rapid validation of putative compound annotations and large-scale expansion of spectral reference libraries with high-quality predictions.
Affiliation(s)
- Yannek Nowatzky
- Section VP.1 eScience, Federal Institute for Materials Research and Testing (BAM), Berlin, Germany
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
| | - Francesco Friedrich Russo
- Department of Analytical Chemistry and Reference Materials, Organic Trace Analysis and Food Analysis, Federal Institute for Materials Research and Testing (BAM), Berlin, Germany
- Institute of Pharmacy, Freie Universität Berlin, Berlin, Germany
| | - Jan Lisec
- Department of Analytical Chemistry and Reference Materials, Organic Trace Analysis and Food Analysis, Federal Institute for Materials Research and Testing (BAM), Berlin, Germany
| | - Alexander Kister
- Section VP.1 eScience, Federal Institute for Materials Research and Testing (BAM), Berlin, Germany
| | - Knut Reinert
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Thilo Muth
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Data Competence Center MF 2, Robert Koch Institute, Berlin, Germany
| | - Philipp Benner
- Section VP.1 eScience, Federal Institute for Materials Research and Testing (BAM), Berlin, Germany.
| |
|
11
|
Persaud M, Lewis A, Kisiala A, Smith E, Azimychetabi Z, Sultana T, Narine SS, Emery RJN. Untargeted Metabolomics and Targeted Phytohormone Profiling of Sweet Aloes (Euphorbia neriifolia) from Guyana: An Assessment of Asthma Therapy Potential in Leaf Extracts and Latex. Metabolites 2025; 15:177. [PMID: 40137143 PMCID: PMC11943701 DOI: 10.3390/metabo15030177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2025] [Revised: 02/16/2025] [Accepted: 02/25/2025] [Indexed: 03/27/2025] Open
Abstract
Background/Objectives: Euphorbia neriifolia is a succulent plant from the therapeutically rich family of Euphorbia comprising 2000 species globally. E. neriifolia is used in Indigenous Guyanese asthma therapy. Methods: To investigate E. neriifolia's therapeutic potential, traditionally heated leaf, simple leaf, and latex extracts were evaluated for phytohormones and therapeutic compounds. Full scan, data-dependent acquisition, and parallel reaction monitoring modes via liquid chromatography Orbitrap mass spectrometry were used for screening. Results: Pathway analysis of putative features from all extracts revealed a bias towards the phenylpropanoid, terpenoid, and flavonoid biosynthetic pathways. A total of 850 compounds were annotated using various bioinformatics tools, ranging from confidence levels 1 to 3. Lipids and lipid-like molecules (34.35%), benzenoids (10.24%), organic acids and derivatives (12%), organoheterocyclic compounds (12%), and phenylpropanoids and polyketides (10.35%) dominated the contribution of compounds among the 13 superclasses. Semi-targeted screening revealed 14 out of 16 literature-relevant therapeutic metabolites detected, with greater upregulation in traditional heated extracts. Targeted screening of 39 phytohormones resulted in 25 being detected and quantified. Simple leaf extract displayed 4.4 and 45 times greater phytohormone levels than traditional heated leaf and latex extracts, respectively. Simple leaf extracts had the greatest nucleotide and riboside cytokinin and acidic phytohormone levels. In contrast, traditional heated extracts exhibited the highest free base and glucoside cytokinin levels and uniquely contained methylthiolated and aromatic cytokinins while lacking acidic phytohormones. Latex samples had trace gibberellic acid levels, the lowest free base, riboside, and nucleotide levels, with absences of aromatic, glucoside, or methylthiolated cytokinin forms.
Conclusions: In addition to metabolites with possible therapeutic value for asthma treatment, we present the first look at cytokinin phytohormones in the species and Euphorbia genus alongside metabolite screening to present a comprehensive assessment of heated leaf extract used in Indigenous Guyanese asthma therapy.
Affiliation(s)
- Malaika Persaud
- Sustainability Studies Graduate Program, Faculty of Arts and Science, Trent University, Peterborough, ON K9J 0G2, Canada;
| | - Ainsely Lewis
- Department of Biology, Trent University, Peterborough, ON K9J 0G2, Canada; (A.K.); (R.J.N.E.)
- Department of Biology, University of Toronto Mississauga, Mississauga, ON L5L 1C6, Canada
| | - Anna Kisiala
- Department of Biology, Trent University, Peterborough, ON K9J 0G2, Canada; (A.K.); (R.J.N.E.)
| | - Ewart Smith
- Environmental and Life Sciences Graduate Program, Trent University, Peterborough, ON K9J 0G2, Canada; (E.S.); (Z.A.)
| | - Zeynab Azimychetabi
- Environmental and Life Sciences Graduate Program, Trent University, Peterborough, ON K9J 0G2, Canada; (E.S.); (Z.A.)
| | - Tamanna Sultana
- Department of Chemistry, Trent University, Peterborough, ON K9J 0G2, Canada;
| | - Suresh S. Narine
- Trent Centre for Biomaterials Research, Trent University, Peterborough, ON K9J 0G2, Canada;
- Departments of Physics & Astronomy and Chemistry, Trent University, Peterborough, ON K9J 0G2, Canada
| | - R. J. Neil Emery
- Department of Biology, Trent University, Peterborough, ON K9J 0G2, Canada; (A.K.); (R.J.N.E.)
| |
|
12
|
Stienstra CMK, Nazdrajić E, Hopkins WS. From Reverse Phase Chromatography to HILIC: Graph Transformers Power Method-Independent Machine Learning of Retention Times. Anal Chem 2025; 97:4461-4472. [PMID: 39972614 DOI: 10.1021/acs.analchem.4c05859] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
Liquid chromatography (LC) is a cornerstone of analytical separations, but comparing the retention times (RTs) across different LC methods is challenging because of variations in experimental parameters such as column type and solvent gradient. Nevertheless, RTs are powerful metrics in tandem mass spectrometry (MS2) that can reduce false positive rates for metabolite annotation, differentiate isobaric species, and improve peptide identification. Here, we present Graphormer-RT, a novel graph transformer that performs the first single-model method-independent prediction of RTs. We use the RepoRT data set, which contains 142,688 reverse phase (RP) RTs (from 191 methods) and 4,373 HILIC RTs (from 49 methods). Our best RP model (trained and tested on 191 methods) achieved a test set mean absolute error (MAE) of 29.3 ± 0.6 s, comparable performance to the state-of-the-art model, which was only trained on a single LC method. Our best-performing HILIC model achieved a test MAE = 42.4 ± 2.9 s. We expect that Graphormer-RT can be used as an LC "foundation model", where transfer learning can reduce the amount of training data needed for highly accurate "specialist" models applied to method-specific RP and HILIC tasks. These frameworks could enable the machine optimization of automated LC workflows, improved filtration of candidate structures using predicted RTs, and the in silico annotation of unknown analytes in LC-MS2 measurements.
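The idea of filtering candidate structures with predicted RTs, mentioned at the end of this abstract, can be illustrated with a small sketch. It is not part of Graphormer-RT: the function name `filter_candidates`, the candidate labels, and the tolerance window are all assumptions chosen for illustration, and in practice the tolerance would be derived from the model's error on the LC method in question.

```python
from typing import Dict, List


def filter_candidates(observed_rt_s: float,
                      predicted_rt_s: Dict[str, float],
                      tolerance_s: float) -> List[str]:
    """Keep candidate structures whose predicted RT lies within the tolerance window.

    Assumption: `predicted_rt_s` maps a candidate identifier to the RT (in
    seconds) predicted by some model for the same LC method as the observation.
    """
    return [name for name, rt in predicted_rt_s.items()
            if abs(rt - observed_rt_s) <= tolerance_s]


# Hypothetical candidates sharing the same precursor mass.
candidates = {"candidate_A": 310.0, "candidate_B": 450.0, "candidate_C": 250.0}
kept = filter_candidates(observed_rt_s=300.0, predicted_rt_s=candidates, tolerance_s=60.0)
```

A wider window removes fewer true structures but also fewer false candidates, so the choice of tolerance is a trade-off rather than a fixed rule.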
Affiliation(s)
- Cailum M K Stienstra
- Department of Chemistry, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
| | - Emir Nazdrajić
- Department of Chemistry, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
| | - W Scott Hopkins
- Department of Chemistry, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
- WaterFEL Free Electron Laser Laboratory, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
- Centre for Eye and Vision Research, Hong Kong Science Park, New Territories 999077, Hong Kong
| |
|
13
|
Barnabé A, Delcourt V, Loup B, Montanuy W, Trévisiol S, Popot MA, Garcia P, Bailly-Chouriberry L. Convolutional Neural Networks Assisted Peak Classification in Targeted LC-HRMS/MS for Equine Doping Control Screening Analyses. Anal Chem 2025; 97:3236-3241. [PMID: 39901649 DOI: 10.1021/acs.analchem.4c03608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2025]
Abstract
Doping control screening analyses usually involve visual inspection of extracted ion chromatograms (EIC) by a trained analytical chemist, followed by further investigations if needed. This task is both highly repetitive and time-consuming, given the hundreds of compounds and metabolites to be screened in tens of thousands of samples per year. With the recent widespread adoption of machine learning in analytical chemistry and the training of high-performance convolutional neural networks (CNN), these operations can be automated with high accuracy and throughput. Applying this technology to doping control is challenging as the false negative rate (FNR) shall be equal to zero. In this study, we demonstrated that implementing a deep learning strategy for chromatogram classification in equine doping control can be feasible and accurate. We illustrated our findings with a CNN scoring model combined with a linear discriminant analysis (LDA) classifier trained on chromatogram images from our ultra-high-pressure liquid chromatography coupled to high-resolution tandem mass spectrometry (UHPLC-HRMS/MS)-based biotherapeutics screening method. We expect that artificial intelligence (AI) will be a valuable tool for doping control laboratories in the near future.
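The zero-false-negative requirement stated in this abstract can be made concrete with a short sketch. It does not reproduce the authors' CNN and LDA pipeline; it only shows one simple way, on a labeled validation set with hypothetical classifier scores, to pick a decision threshold that misses no positive chromatogram and to see what false positive rate that choice costs.

```python
from typing import List


def zero_fnr_threshold(scores: List[float], labels: List[int]) -> float:
    """Return the highest threshold that still flags every positive sample.

    Assumption: a sample is flagged when score >= threshold, and label 1
    marks a true positive chromatogram.
    """
    positive_scores = [s for s, y in zip(scores, labels) if y == 1]
    if not positive_scores:
        raise ValueError("at least one positive sample is required")
    return min(positive_scores)


def false_positive_rate(scores: List[float], labels: List[int], threshold: float) -> float:
    """Fraction of negative samples that would still be flagged for review."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        return 0.0
    flagged = sum(1 for s in negatives if s >= threshold)
    return flagged / len(negatives)
```

A threshold chosen this way guarantees zero false negatives only on the data it was derived from; new samples can still fall below it, which is why a safety margin or continued human review would normally be kept.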
Affiliation(s)
- Agnès Barnabé
- GIE LCH, Laboratoire des Courses Hippiques, 15 rue de Paradis, 91370 Verrières-le-Buisson, France
| | - Vivian Delcourt
- GIE LCH, Laboratoire des Courses Hippiques, 15 rue de Paradis, 91370 Verrières-le-Buisson, France
| | - Benoit Loup
- GIE LCH, Laboratoire des Courses Hippiques, 15 rue de Paradis, 91370 Verrières-le-Buisson, France
| | - William Montanuy
- GIE LCH, Laboratoire des Courses Hippiques, 15 rue de Paradis, 91370 Verrières-le-Buisson, France
| | - Stéphane Trévisiol
- GIE LCH, Laboratoire des Courses Hippiques, 15 rue de Paradis, 91370 Verrières-le-Buisson, France
| | - Marie-Agnès Popot
- GIE LCH, Laboratoire des Courses Hippiques, 15 rue de Paradis, 91370 Verrières-le-Buisson, France
| | - Patrice Garcia
- GIE LCH, Laboratoire des Courses Hippiques, 15 rue de Paradis, 91370 Verrières-le-Buisson, France
| | | |
|
14
|
Sun FY, Yin YH, Liu HJ, Shen LN, Kang XL, Xin GZ, Liu LF, Zheng JY. ROASMI: accelerating small molecule identification by repurposing retention data. J Cheminform 2025; 17:20. [PMID: 39953609 PMCID: PMC11829455 DOI: 10.1186/s13321-025-00968-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2024] [Accepted: 02/02/2025] [Indexed: 02/17/2025] Open
Abstract
The limited replicability of retention data hinders its application in untargeted metabolomics for small molecule identification. While retention order models hold promise in addressing this issue, their predictive reliability is limited by uncertain generalizability. Here, we present the ROASMI model, which enables reliable prediction of retention order within a well-defined application domain by coupling data-driven molecular representation and mechanistic insights. The generalizability of ROASMI is demonstrated on 71 independent reversed-phase liquid chromatography (RPLC) datasets. The application of ROASMI to four real-world datasets demonstrates its advantages in distinguishing coexisting isomers with similar fragmentation patterns and in annotating detection peaks without informative spectra. ROASMI is flexible enough to be retrained with user-defined reference sets and is compatible with other MS/MS scorers, enabling further improvements in small-molecule identification.
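Retention order models are typically judged by how often they rank pairs of compounds in the same order as the measured retention times. The sketch below computes such a pairwise order accuracy; it is a generic illustration under that assumption, not the evaluation protocol of ROASMI, and the function name is hypothetical.

```python
from itertools import combinations
from typing import List


def pairwise_order_accuracy(measured_rt: List[float], predicted_score: List[float]) -> float:
    """Fraction of compound pairs whose predicted order matches the measured elution order.

    Pairs with identical measured RT are skipped because they carry no order
    information; a tie in the predicted scores counts as a wrong ordering.
    """
    concordant = 0
    total = 0
    for i, j in combinations(range(len(measured_rt)), 2):
        if measured_rt[i] == measured_rt[j]:
            continue
        total += 1
        measured_diff = measured_rt[i] - measured_rt[j]
        predicted_diff = predicted_score[i] - predicted_score[j]
        if measured_diff * predicted_diff > 0:
            concordant += 1
    if total == 0:
        raise ValueError("no pairs with distinct measured retention times")
    return concordant / total
```

Because only the order matters, the predicted scores do not need to be in the same units as the measured retention times, which is what makes this kind of metric comparable across chromatographic systems.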
Affiliation(s)
- Fang-Yuan Sun
- State Key Laboratory of Natural Medicines, Department of Chinese Medicines Analysis, School of Traditional Chinese Pharmacy, China Pharmaceutical University, No. 24 Tongjia Lane, Nanjing, 210009, China
| | - Ying-Hao Yin
- State Key Laboratory of Natural Medicines, Department of Chinese Medicines Analysis, School of Traditional Chinese Pharmacy, China Pharmaceutical University, No. 24 Tongjia Lane, Nanjing, 210009, China
- Shenzhen Key Laboratory of Hospital Chinese Medicine Preparation, Shenzhen Traditional Chinese Medicine Hospital, The Fourth Clinical Medical College of Guangzhou University of Chinese Medicine, Shenzhen, 518033, China
| | - Hui-Jun Liu
- State Key Laboratory of Natural Medicines, Department of Chinese Medicines Analysis, School of Traditional Chinese Pharmacy, China Pharmaceutical University, No. 24 Tongjia Lane, Nanjing, 210009, China
| | - Lu-Na Shen
- State Key Laboratory of Natural Medicines, Department of Chinese Medicines Analysis, School of Traditional Chinese Pharmacy, China Pharmaceutical University, No. 24 Tongjia Lane, Nanjing, 210009, China
| | - Xiu-Lin Kang
- State Key Laboratory of Natural Medicines, Department of Chinese Medicines Analysis, School of Traditional Chinese Pharmacy, China Pharmaceutical University, No. 24 Tongjia Lane, Nanjing, 210009, China
| | - Gui-Zhong Xin
- State Key Laboratory of Natural Medicines, Department of Chinese Medicines Analysis, School of Traditional Chinese Pharmacy, China Pharmaceutical University, No. 24 Tongjia Lane, Nanjing, 210009, China.
| | - Li-Fang Liu
- State Key Laboratory of Natural Medicines, Department of Chinese Medicines Analysis, School of Traditional Chinese Pharmacy, China Pharmaceutical University, No. 24 Tongjia Lane, Nanjing, 210009, China.
| | - Jia-Yi Zheng
- State Key Laboratory of Natural Medicines, Department of Chinese Medicines Analysis, School of Traditional Chinese Pharmacy, China Pharmaceutical University, No. 24 Tongjia Lane, Nanjing, 210009, China.
| |
|
15
|
Cheng G, Wang B, Bai N, Li W. ABCoRT: Retention Time Prediction for Metabolite Identification via Atom-Bond Co-Learning. J Chem Inf Model 2025; 65:1419-1427. [PMID: 39818945 DOI: 10.1021/acs.jcim.4c02179] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2025]
Abstract
Liquid chromatography retention time (RT) prediction plays a crucial role in metabolite identification, a challenging and essential task in untargeted metabolomics. Accurate molecular representation is vital for reliable RT prediction. To address this, we propose a novel molecular representation learning framework, ABCoRT (Atom-Bond Co-learning for Retention Time prediction), designed for predicting metabolite retention times. Our model transforms molecular graphs into dual hypergraphs, enabling the collaborative updating of atomic and bond information within both molecular graphs and hypergraphs, thereby producing highly informative molecular representations. We evaluated ABCoRT on a large-scale Small Molecule Retention Time (SMRT) data set comprising 80,038 molecules. Our model achieved a mean absolute error (MAE) of 25.75 s and a mean relative error (MRE) of 3.24% after removing nonretained molecules. We also fine-tuned pretrained ABCoRT models on six additional data sets from PredRet, achieving the lowest MAEs on five of them. Furthermore, in metabolite screening conducted on the MetaboBASE and RIKEN_PlaSM data sets from the MassBank of North America, ABCoRT demonstrates its capability to filter out 38.35 and 28.46% of candidate compounds, respectively.
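The two error metrics quoted in this abstract can be computed as in the sketch below. The abstract does not spell out the formulas, so the definitions used here (MAE as the mean of absolute errors, MRE as the mean of absolute errors divided by the observed RT) are the commonly used ones and should be read as an assumption.

```python
from typing import List


def mean_absolute_error(observed: List[float], predicted: List[float]) -> float:
    """Mean of |observed - predicted|, in the same unit as the inputs."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mean_relative_error(observed: List[float], predicted: List[float]) -> float:
    """Mean of |observed - predicted| / observed, as a fraction.

    Observed RTs must be non-zero; multiply the result by 100 for a percentage.
    """
    return sum(abs(o - p) / o for o, p in zip(observed, predicted)) / len(observed)
```

MRE weights early-eluting compounds more heavily than MAE does, since the same absolute error is a larger fraction of a short retention time; this is one reason nonretained molecules are often removed before it is reported.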
Affiliation(s)
- Guangbin Cheng
- School of Information Science and Engineering, Yunnan University, Kunming 650091, China
| | - Bingyi Wang
- Yunnan Police College, Kunming 650223, China
- Key Laboratory of Smart Drugs Control (Yunnan Police College), Ministry of Education, Kunming 650223, China
| | - Nannan Bai
- Yunnan Police College, Kunming 650223, China
- Key Laboratory of Smart Drugs Control (Yunnan Police College), Ministry of Education, Kunming 650223, China
| | - Weihua Li
- School of Information Science and Engineering, Yunnan University, Kunming 650091, China
| |
|
16
|
Shi Z, Yi Y, Madrigal E, Hrovat F, Zhang K, Lin J. A generalizable methodology for predicting retention time of small molecule pharmaceutical compounds across reversed-phase HPLC columns. J Chromatogr A 2025; 1742:465628. [PMID: 39798480 DOI: 10.1016/j.chroma.2024.465628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2024] [Revised: 12/11/2024] [Accepted: 12/24/2024] [Indexed: 01/15/2025]
Abstract
Quantitative structure-retention relationship (QSRR) modeling is an active field of research, primarily focused on predicting chromatography retention time (Rt) based on molecular structures of an input analyte on a single or limited number of reversed-phase HPLC (RP-HPLC) columns. However, in pharmaceutical chemistry, manufacturing, and controls (CMC) settings, single-column QSRR models are often insufficient. It is important to translate retention time across different HPLC methods, specifically different stationary phases (SP) and mobile phases (MP), to guide HPLC method development and to bridge organic impurity profiles across different development phases and laboratories. In response to this need, we present a novel approach for retention time transfer across SPs and MPs, without requiring pre-existing Rt data on the target column. To achieve this, we developed an RP-HPLC based Genentech Multi-column Retention Time (GMCRT) database containing 51 small molecule pharmaceutical compounds analyzed on twenty SPs and multiple pH levels. The database incorporated the SP selectivity parameters from the Hydrophobic Subtraction Model (HSM): hydrophobicity (H), steric hindrance (S), hydrogen-bond acidity (A), hydrogen-bond basicity (B), and ionic interaction (C) under two different pHs (2.8 and 7), together with the ethylbenzene (EB) retention factor. Two machine learning approaches, partial least squares (PLS) and artificial neural networks (ANN), were found to improve the accuracy of Rt prediction on new SPs compared to the previously published direct mapping approach, especially when the RP-HPLC columns differ significantly in selectivity. In contrast to that approach, ours does not require pre-existing retention data on the target SPs and is generalizable to any RP-HPLC column with a set of known column selectivity parameters (https://www.hplccolumns.org/). The generalizability is achievable not only via the available retention data correlation among the twenty commonly used RP-HPLC columns in GMCRT, but also via the retrainable mechanism of our ML models: Rt values of the compounds of interest on the source columns are added to GMCRT, and Rt on the target column is then predicted. Thus, we propose a new QSRR framework that incorporates the physicochemical properties of SPs and MPs and makes the retention time prediction transferable across SPs and MPs. Such a framework is expected to open up possibilities for developing more comprehensive and generalizable models, and to streamline RP-HPLC method development and lifecycle management across various pharmaceutical CMC development phases.
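For orientation, the Hydrophobic Subtraction Model that supplies the column parameters named in this abstract is usually written in the literature in the form below. The abstract itself does not state the equation, so this is given as background under that assumption, with the symbols following the common convention: capital letters are column parameters, primed Greek letters are the complementary solute parameters, and retention is expressed relative to ethylbenzene (EB).

```latex
\log \alpha \;=\; \log\!\left(\frac{k}{k_{\mathrm{EB}}}\right)
  \;=\; \eta' H \;-\; \sigma' S^{*} \;+\; \beta' A \;+\; \alpha' B \;+\; \kappa' C
```

Here $k$ is the retention factor of the solute and $k_{\mathrm{EB}}$ that of ethylbenzene on the same column under the same conditions. Because $C$ depends on mobile phase pH, it is tabulated at two pH values, which matches the two pH levels (2.8 and 7) mentioned in the abstract.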
Affiliation(s)
- Zhenqi Shi
- Synthetic Molecule Pharmaceutical Science, gRED, Genentech, Inc., 1 DNA Way, South San Francisco, CA, 94080, United States.
| | - Yuyan Yi
- Synthetic Molecule Pharmaceutical Science, gRED, Genentech, Inc., 1 DNA Way, South San Francisco, CA, 94080, United States
| | - Eddie Madrigal
- Synthetic Molecule Pharmaceutical Science, gRED, Genentech, Inc., 1 DNA Way, South San Francisco, CA, 94080, United States
| | - Frank Hrovat
- Synthetic Molecule Pharmaceutical Science, gRED, Genentech, Inc., 1 DNA Way, South San Francisco, CA, 94080, United States
| | - Kelly Zhang
- Synthetic Molecule Pharmaceutical Science, gRED, Genentech, Inc., 1 DNA Way, South San Francisco, CA, 94080, United States
| | - Jessica Lin
- Synthetic Molecule Pharmaceutical Science, gRED, Genentech, Inc., 1 DNA Way, South San Francisco, CA, 94080, United States.
| |
|
17
|
Noreldeen HAA. Enhancing lipid identification in LC-HRMS data through machine learning-based retention time prediction. J Chromatogr A 2025; 1742:465650. [PMID: 39798479 DOI: 10.1016/j.chroma.2024.465650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2024] [Revised: 12/12/2024] [Accepted: 12/30/2024] [Indexed: 01/15/2025]
Abstract
The comprehensive identification of peaks in untargeted lipidomics using LC-MS/MS remains a significant challenge. Confidence in lipid annotation can be greatly improved by integrating a highly accurate machine learning-based retention time prediction model. Such an approach enables the identification of lipids for understanding pathogenic mechanisms, biomarker discovery, and drug screening. In this study, we developed a machine learning model to predict retention times and facilitate lipid peak annotations in LC-MS-based untargeted lipidomics. Our model achieved high correlation coefficients of 0.998 and 0.990, with mean absolute errors (MAE) of 0.107 min and 0.240 min for the training and test sets, respectively. External validation showed similarly strong performance, with correlations of 0.991 and 0.978, and MAE values of 0.241 min and 0.270 min. We also compared the impact of molecular descriptors and molecular fingerprints on the model's performance, finding that molecular descriptors outperformed molecular fingerprints across all datasets when using Random Forest (RF) for model construction. Notably, this retention time calibration model demonstrates robust performance across chromatographic systems with comparable gradients and flow rates. Overall, this machine learning model enhances lipid annotation accuracy and reduces errors in untargeted lipidomics, improving data analysis across multiple datasets.
|
18
|
Fan M, Sang C, Li H, Wei Y, Zhang B, Xing Y, Zhang J, Yin J, An W, Shao B. Development of an Efficient and Generalized MTSCAM Model to Predict Liquid Chromatography Retention Times of Organic Compounds. RESEARCH (WASHINGTON, D.C.) 2025; 8:0607. [PMID: 39925484 PMCID: PMC11803058 DOI: 10.34133/research.0607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2024] [Revised: 12/02/2024] [Accepted: 01/18/2025] [Indexed: 02/11/2025]
Abstract
Accurate prediction of liquid chromatographic retention times is becoming increasingly important in nontargeted screening applications. Traditional retention time approaches heavily rely on the use of standard compounds, which is limited by the speed of synthesis and manufacture of standard products, and is time-consuming and labor-intensive. Recently, machine learning and artificial intelligence algorithms have been applied to retention time prediction, which show unparalleled advantages over traditional experimental methods. However, existing retention time prediction methods usually suffer from the scarcity of comprehensive training datasets, sparsity of valid data, and lack of classification in datasets, resulting in poor generalization capability and accuracy. In this study, a dataset for 10,905 compounds was constructed including their retention times. Next, an innovative classification system was implemented, classifying 10,905 compounds into a 3-tier hierarchy across 141 classes, based on functional group weighting. Then, data augmentation was performed within each category using simplified molecular input line entry system (SMILES) enumeration combined with structural similarity expansion. Finally, by training the optimal quantitative structure-retention relationship (QSRR) models for each category of compounds and selecting the best-fitting model for prediction via discriminant analysis during the prediction period, a novel and universal high-throughput retention time prediction model was established. The results demonstrate that this model achieves an R² of 0.98 and an average prediction error of 23 s, outperforming currently published models. This study provides a scientific basis for high throughput and rapid prediction of unknown pollutants, data mining, nontargeted screening, etc.
Affiliation(s)
- Mengdie Fan
- National Key Laboratory of Veterinary Public Health Security, College of Veterinary Medicine, China Agricultural University, Beijing Key Laboratory of Detection Technology for Animal-Derived Food Safety, and Beijing Laboratory for Food Quality and Safety, Beijing 100193, China
- Beijing Key Laboratory of Diagnostic and Traceability Technologies for Food Poisoning, Beijing Center for Disease Prevention and Control, Beijing 100013, China
| | - Chenhui Sang
- Beijing Key Laboratory of Diagnostic and Traceability Technologies for Food Poisoning, Beijing Center for Disease Prevention and Control, Beijing 100013, China
- School of Public Health, Capital Medical University, Beijing 100069, China
| | - Hua Li
- National Engineering Research Center of Industrial Wastewater Detoxication and Resource Recovery, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
| | - Yue Wei
- National Engineering Research Center of Industrial Wastewater Detoxication and Resource Recovery, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
| | - Bin Zhang
- National Engineering Research Center of Industrial Wastewater Detoxication and Resource Recovery, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
| | - Yang Xing
- Beijing Key Laboratory of Diagnostic and Traceability Technologies for Food Poisoning, Beijing Center for Disease Prevention and Control, Beijing 100013, China
- School of Public Health, Capital Medical University, Beijing 100069, China
| | - Jing Zhang
- Beijing Key Laboratory of Diagnostic and Traceability Technologies for Food Poisoning, Beijing Center for Disease Prevention and Control, Beijing 100013, China
- School of Public Health, Capital Medical University, Beijing 100069, China
| | - Jie Yin
- Beijing Key Laboratory of Diagnostic and Traceability Technologies for Food Poisoning, Beijing Center for Disease Prevention and Control, Beijing 100013, China
- School of Public Health, Capital Medical University, Beijing 100069, China
| | - Wei An
- National Engineering Research Center of Industrial Wastewater Detoxication and Resource Recovery, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
| | - Bing Shao
- National Key Laboratory of Veterinary Public Health Security, College of Veterinary Medicine, China Agricultural University, Beijing Key Laboratory of Detection Technology for Animal-Derived Food Safety, and Beijing Laboratory for Food Quality and Safety, Beijing 100193, China
- Beijing Key Laboratory of Diagnostic and Traceability Technologies for Food Poisoning, Beijing Center for Disease Prevention and Control, Beijing 100013, China
- School of Public Health, Capital Medical University, Beijing 100069, China
| |
|
19
|
Qu X, Jiang C, Shan M, Ke W, Chen J, Zhao Q, Hu Y, Liu J, Qin LP, Cheng G. Prediction of Proteolysis-Targeting Chimeras Retention Time Using XGBoost Model Incorporated with Chromatographic Conditions. J Chem Inf Model 2025; 65:613-625. [PMID: 39786356 DOI: 10.1021/acs.jcim.4c01732] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2025]
Abstract
Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules that target undruggable proteins, enhance selectivity and prevent target accumulation through catalytic activity. The unique structure of PROTACs presents challenges in structural identification and drug design. Liquid chromatography (LC), combined with mass spectrometry (MS), enhances compound annotation by providing essential retention time (RT) data, especially when MS alone is insufficient. However, predicting RT for PROTACs remains challenging. To address this, we compiled the PROTAC-RT data set from literature and evaluated the performance of four machine learning algorithms, namely extreme gradient boosting (XGBoost), random forest (RF), K-nearest neighbor (KNN) and support vector machines (SVM), and a deep learning model, a fully connected neural network (FCNN), using 24 molecular fingerprints and descriptors. Through screening combinations of molecular fingerprints, descriptors and chromatographic condition descriptors (CCs), we developed an optimized XGBoost model (XGBoost + moe206 + Path + Charge + CCs) that achieved an R² of 0.958 ± 0.027 and an RMSE of 0.934 ± 0.412. After hyperparameter tuning, the model's R² improved to 0.963 ± 0.023, with an RMSE of 0.896 ± 0.374. The model showed strong predictive accuracy under new chromatographic separation conditions and was validated using six experimentally determined compounds. SHapley Additive exPlanations (SHAP) not only highlights the advantages of XGBoost but also emphasizes the importance of CCs and molecular features, such as bond variability, van der Waals surface area, and atomic charge states. The optimized XGBoost model combines moe206, path, charge descriptors, and CCs, providing a fast and precise method for predicting the RT of PROTAC compounds, thus facilitating their annotation.
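The two figures of merit reported in this abstract, the coefficient of determination and the root mean squared error, can be computed as in the sketch below. The definitions are the standard ones; how the authors aggregated them into the reported "±" ranges is not stated in the abstract and is not reproduced here.

```python
import math
from typing import List


def rmse(observed: List[float], predicted: List[float]) -> float:
    """Root mean squared error, in the same unit as the inputs."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(observed))


def r_squared(observed: List[float], predicted: List[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Requires that the observed values are not all identical (SS_tot > 0).
    """
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

Unlike RMSE, R² depends on the spread of the observed retention times, so values from data sets with very different RT ranges are not directly comparable.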
Affiliation(s)
- Xinhao Qu
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, People's Republic of China
| | - Chen Jiang
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, People's Republic of China
- Universal Identification Technology (Hangzhou) Co., Ltd., Hangzhou 311199, China
| | - Mengyi Shan
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, People's Republic of China
| | - Wenhao Ke
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, 1 Xiangshanzhi Road, Hangzhou 310024, China
| | - Jing Chen
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, People's Republic of China
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, People's Republic of China
| | - Qiming Zhao
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, People's Republic of China
| | - Youhong Hu
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, 1 Xiangshanzhi Road, Hangzhou 310024, China
| | - Jia Liu
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, 1 Xiangshanzhi Road, Hangzhou 310024, China
| | - Lu-Ping Qin
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, People's Republic of China
| | - Gang Cheng
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, People's Republic of China
| |
|
20
|
Xu H, Wu W, Chen Y, Zhang D, Mo F. Explicit relation between thin film chromatography and column chromatography conditions from statistics and machine learning. Nat Commun 2025; 16:832. [PMID: 39828717 PMCID: PMC11743788 DOI: 10.1038/s41467-025-56136-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2024] [Accepted: 01/09/2025] [Indexed: 01/22/2025] Open
Abstract
In chemistry, empirical paradigms prevail, especially within the realm of chromatography, where the selection of separation conditions frequently relies on the chemist's experience. However, the underlying rationale for such experiential knowledge has not been established or analysed. This study explicitly elucidates how chemists use thin-layer chromatography (TLC) to determine column chromatography (CC) conditions, employing statistical analysis and machine learning techniques. An experimental dataset of the CC is generated from the automatic platform developed in this study. On this basis, an "artificial intelligence (AI) experience" is generated through a knowledge discovery framework, where the relationship between the retardation factor (RF) value from TLC and retention volume from CC is unveiled in the form of explicit equations. These equations demonstrate satisfactory accuracy and generalizability, providing a scientific basis for the selection of the experimental conditions, and contributing to a better understanding of chromatography.
Affiliation(s)
- Hao Xu
- AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
- BIC-ESAT, ERE, and SKLTCS, College of Engineering, Peking University, 100871, Beijing, P. R. China
- Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, Zhejiang, 315200, P. R. China
| | - Wenchao Wu
- AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
- School of Materials Science and Engineering, Peking University, 100871, Beijing, P. R. China
| | - Yuntian Chen
- Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, Zhejiang, 315200, P. R. China
- Zhejiang Key Laboratory of Industrial Intelligence and Digital Twin, Eastern Institute of Technology, Ningbo, Zhejiang, 315200, China
| | - Dongxiao Zhang
- Zhejiang Key Laboratory of Industrial Intelligence and Digital Twin, Eastern Institute of Technology, Ningbo, Zhejiang, 315200, China.
| | - Fanyang Mo
- AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, 518055, China.
- School of Materials Science and Engineering, Peking University, 100871, Beijing, P. R. China.
- School of Advanced Materials, Peking University Shenzhen Graduate School, Shenzhen, 518055, China.
- Guangdong Provincial Key Laboratory of Nano-Micro Materials Research, Peking University Shenzhen Graduate School, Shenzhen, 518055, China.
| |
|
21
|
Zhao M, Chen Z, Ye D, Yu R, Yang Q. Comprehensive lipidomic profiling of human milk from lactating women across varying lactation stages and gestational ages. Food Chem 2025; 463:141242. [PMID: 39278081 DOI: 10.1016/j.foodchem.2024.141242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Revised: 08/28/2024] [Accepted: 09/09/2024] [Indexed: 09/17/2024]
Abstract
An untargeted lipidomic analysis was conducted to systematically investigate the lipid composition of human milk across different lactation stages and gestational ages. A total of 25 lipid subclasses and 934 lipid species, as well as 90 free fatty acids, were identified. Dynamic changes of the lipids throughout lactation and gestational phases were highlighted. In general, lactation stage introduced more variation in the lipid composition of human milk than gestational age. Most lipids decreased as the milk progressed from the colostral stage to the mature stage, with some reaching a peak at the transitional stage. Significant variations in lipid composition across gestational ages were predominantly evident during the early lactation period. In mature milk, most of the lipids exhibited no statistically discernible differences among gestational ages. These findings offer valuable insights and guidance for tailoring precise nutritional strategies for infants with diverse health needs.
Affiliation(s)
- Min Zhao
- Wuxi School of Medicine, Jiangnan University, Wuxi 214122, China
| | - Zhenying Chen
- Wuxi School of Medicine, Jiangnan University, Wuxi 214122, China
| | - Danni Ye
- Department of Neonatology, Affiliated Women's Hospital of Jiangnan University, Wuxi 214002, China
| | - Renqiang Yu
- Department of Neonatology, Affiliated Women's Hospital of Jiangnan University, Wuxi 214002, China.
| | - Qin Yang
- Wuxi School of Medicine, Jiangnan University, Wuxi 214122, China; Wuxi Translational Medicine Research Center and School of Translational Medicine, Jiangnan University, Wuxi 214122, China.
| |
|
22
|
Metz TO, Chang CH, Gautam V, Anjum A, Tian S, Wang F, Colby SM, Nunez JR, Blumer MR, Edison AS, Fiehn O, Jones DP, Li S, Morgan ET, Patti GJ, Ross DH, Shapiro MR, Williams AJ, Wishart DS. Introducing "Identification Probability" for Automated and Transferable Assessment of Metabolite Identification Confidence in Metabolomics and Related Studies. Anal Chem 2025; 97:1-11. [PMID: 39699939 PMCID: PMC11740175 DOI: 10.1021/acs.analchem.4c04060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2024] [Revised: 12/02/2024] [Accepted: 12/06/2024] [Indexed: 12/20/2024]
Abstract
Methods for assessing compound identification confidence in metabolomics and related studies have been debated and actively researched for the past two decades. The earliest effort in 2007 focused primarily on mass spectrometry and nuclear magnetic resonance spectroscopy and resulted in four recommended levels of metabolite identification confidence, the Metabolite Standards Initiative (MSI) Levels. In 2014, the original MSI Levels were expanded to five levels (including two sublevels) to facilitate communication of compound identification confidence in high resolution mass spectrometry studies. Further refinements in identification levels have occurred, for example to accommodate use of ion mobility spectrometry in metabolomics workflows, and alternate approaches to communicate compound identification confidence have also been developed based on identification points schema. However, neither qualitative levels of identification confidence nor quantitative scoring systems address the degree of ambiguity in compound identifications in the context of the chemical space being considered. Nor are they easily automated or transferable between analytical platforms. In this perspective, we propose that the metabolomics and related communities consider identification probability as an approach for automated and transferable assessment of compound identification and ambiguity in metabolomics and related studies. Identification probability is defined simply as 1/N, where N is the number of compounds in a database that match an experimentally measured molecule within user-defined measurement precision(s), for example mass measurement or retention time accuracy. We demonstrate the utility of identification probability in an in silico analysis of multiproperty reference libraries constructed from a subset of the Human Metabolome Database and computational property predictions, provide guidance to the community in transparent implementation of the concept, and invite the community to further evaluate this concept in parallel with their current preferred methods for assessing metabolite identification confidence.
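As a rough illustration of the 1/N definition, the sketch below counts the entries of a small library that match a measured feature within user-defined tolerances. The function name, tolerance values, and retention times are hypothetical and chosen only for the example; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    mass: float            # monoisotopic mass in Da
    retention_time: float  # minutes (invented values)


def identification_probability(measured_mass, measured_rt, database,
                               mass_tol_ppm=5.0, rt_tol_min=0.5):
    """Return (1/N, matches); the probability is 0.0 when nothing matches."""
    matches = [
        c for c in database
        if abs(c.mass - measured_mass) / measured_mass * 1e6 <= mass_tol_ppm
        and abs(c.retention_time - measured_rt) <= rt_tol_min
    ]
    probability = 1.0 / len(matches) if matches else 0.0
    return probability, matches


# Toy library: leucine and isoleucine are isomers and share one exact mass.
toy_db = [
    Compound("leucine", 131.09463, 4.10),
    Compound("isoleucine", 131.09463, 3.90),
    Compound("creatine", 131.06948, 1.20),
]

# Mass alone cannot separate the two isomers (N = 2).
p_mass_only, _ = identification_probability(131.0946, 4.0, toy_db, rt_tol_min=100.0)
# Adding a tight retention-time window leaves a single candidate (N = 1).
p_with_rt, hits = identification_probability(131.0946, 4.08, toy_db, rt_tol_min=0.1)
print(p_mass_only, p_with_rt, [c.name for c in hits])
```

The value depends entirely on the database searched and the tolerances chosen, which is why the perspective asks for both to be reported transparently.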
Affiliation(s)
- Thomas O. Metz
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
| | - Christine H. Chang
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
| | - Vasuk Gautam
- Department of Biological Sciences, University of Alberta, Edmonton, Alberta T6G 2E9, Canada
| | - Afia Anjum
- Department of Biological Sciences, University of Alberta, Edmonton, Alberta T6G 2E9, Canada
| | - Siyang Tian
- Department of Biological Sciences, University of Alberta, Edmonton, Alberta T6G 2E9, Canada
| | - Fei Wang
- Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada
- Alberta Machine Intelligence Institute, Edmonton, Alberta T5J 1S5, Canada
| | - Sean M. Colby
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
| | - Jamie R. Nunez
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
| | - Madison R. Blumer
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
| | - Arthur S. Edison
- Department of Biochemistry & Molecular Biology, Complex Carbohydrate Research Center and Institute of Bioinformatics, University of Georgia, Athens, Georgia 30602, United States
| | - Oliver Fiehn
- West Coast Metabolomics Center, University of California Davis, Davis, California 95616, United States
| | - Dean P. Jones
- Clinical Biomarkers Laboratory, Department of Medicine, Emory University, Atlanta, Georgia 30322, United States
| | - Shuzhao Li
- The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut 06032, United States
| | - Edward T. Morgan
- Department of Pharmacology and Chemical Biology, Emory University School of Medicine, Atlanta, Georgia 30322, United States
| | - Gary J. Patti
- Center for Mass Spectrometry and Metabolic Tracing, Department of Chemistry, Department of Medicine, Washington University, Saint Louis, Missouri 63105, United States
| | - Dylan H. Ross
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
| | - Madelyn R. Shapiro
- Artificial Intelligence & Data Analytics Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
| | - Antony J. Williams
- U.S. Environmental Protection Agency, Office of Research & Development, Center for Computational Toxicology & Exposure (CCTE), Research Triangle Park, North Carolina 27711, United States
| | - David S. Wishart
- Department of Biological Sciences, University of Alberta, Edmonton, Alberta T6G 2E9, Canada
| |
|
23
|
Kajtazi A, Kajtazi M, Santos Barbetta MF, Bandini E, Eghbali H, Lynen F. Prediction of Retention Indices in LC-HRMS for Enhanced Structural Identification of Organic Micropollutants in Water: Selectivity-Based Filtration. Anal Chem 2025; 97:65-74. [PMID: 39752599 DOI: 10.1021/acs.analchem.4c01784] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/16/2025]
Abstract
Addressing the global challenge of ensuring access to safe drinking water, especially in developing countries, demands cost-effective, eco-friendly, and readily available technologies. The persistence, toxicity, and bioaccumulation potential of organic pollutants arising from various human activities pose substantial hurdles. While high-performance liquid chromatography coupled with high-resolution mass spectrometry (HPLC-HRMS) is a widely utilized technique for identifying pollutants in water, the multitude of structures for a single elemental composition complicates structural identification. Current HRMS and MS/MS databases can often provide hits for known molecules, but these are often erroneous or misleading when authentic standards are unavailable. In this research, a machine-learning algorithm is developed to support the structural elucidation of small organic pollutants in water, with a focus on carbon-, oxygen-, and hydrogen-based molecules weighing less than 500 Da. The approach relies on a comparison of the experimental and predicted retention of the possible structures of unknowns for which an elemental composition was obtained by HRMS. A promising novelty is the improved removal of erroneous structures by combining the retention information obtained from two reversed-phase stationary phases with different selectivities (octadecylsilica, C18, and pentafluorophenylsilica, F5). The study translates retention times into retention indices (RI) for instrument independence and transferability across diverse HPLC-HRMS systems. The predictive algorithm, utilizing retention data and molecular descriptors, accurately predicts retention indices and proves its utility by eliminating incorrect structural formulas through a two-stationary-phase intersection-based filtration. Using a data set of 100 training compounds and 16 external test set compounds, two multiple linear regression (MLR) models, MLR-C18 and MLR-F5, were developed, employing the 16 most influential descriptors out of 5666 screened. MLR-C18 achieves precise RI predictions (R² = 0.97, RMSE = 36, MAE = 26), while MLR-F5, though slightly less accurate, maintains good performance (R² = 0.96, RMSE = 44, MAE = 34). The intersection-based filtration (within ±1.5σ) eliminated more than 70% of impossible structures for a given elemental composition. The model was further applied to the identification of compounds in a drinking water sample to prove its potential. This tool holds significant promise for supporting water quality management and sustainable practices, contributing to faster structural identification of unknown organic micropollutants in water.
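The intersection-based filtration can be pictured as keeping only those candidate structures whose predicted retention index falls inside the ±1.5σ window on both stationary phases. The sketch below is a minimal illustration under stated assumptions: the abstract does not define σ, so the reported RMSE values (36 for C18, 44 for F5) are used as stand-ins, and the candidate names and retention indices are invented.

```python
def within_window(experimental_ri, predicted_ri, sigma, k=1.5):
    """True if the prediction lies within +/- k*sigma of the measured value."""
    return abs(experimental_ri - predicted_ri) <= k * sigma


def intersection_filter(candidates, exp_c18, exp_f5,
                        sigma_c18=36.0, sigma_f5=44.0, k=1.5):
    """Keep candidates whose predicted RI passes on BOTH stationary phases.

    sigma_c18 and sigma_f5 default to the reported RMSE values, which is an
    assumption of this sketch rather than the paper's stated definition.
    """
    pass_c18 = {name for name, (ri_c18, _) in candidates.items()
                if within_window(exp_c18, ri_c18, sigma_c18, k)}
    pass_f5 = {name for name, (_, ri_f5) in candidates.items()
               if within_window(exp_f5, ri_f5, sigma_f5, k)}
    return pass_c18 & pass_f5


# Hypothetical candidates sharing one elemental composition:
# name -> (predicted RI on C18, predicted RI on F5)
candidates = {
    "isomer_A": (510.0, 540.0),
    "isomer_B": (530.0, 700.0),
    "isomer_C": (800.0, 560.0),
    "isomer_D": (900.0, 950.0),
}

kept = intersection_filter(candidates, exp_c18=500.0, exp_f5=550.0)
print(sorted(kept))
```

In this toy case two candidates pass on each phase alone, but only one passes on both, which is the benefit of combining columns with different selectivities.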
Affiliation(s)
- Ardiana Kajtazi
- Separation Science Group, Department of Organic and Macromolecular Chemistry, Ghent University, Krijgslaan 281 S4bis, B-9000 Ghent, Belgium
| | - Marin Kajtazi
- Faculty of Mechanical Engineering and Naval Architecture, University of Zagreb, Ul. Ivana Lučića 5, 10000 Zagreb, Croatia
| | - Maike Felipe Santos Barbetta
- Department of Chemistry, Faculty of Philosophy, Science and Letters at Ribeirão Preto, University of São Paulo, 14040-901 Ribeirão Preto, SP, Brazil
| | - Elena Bandini
- Separation Science Group, Department of Organic and Macromolecular Chemistry, Ghent University, Krijgslaan 281 S4bis, B-9000 Ghent, Belgium
| | - Hamed Eghbali
- Packaging and Specialty Plastics R&D, Dow Benelux B.V., Terneuzen 4530 AA, The Netherlands
| | - Frédéric Lynen
- Separation Science Group, Department of Organic and Macromolecular Chemistry, Ghent University, Krijgslaan 281 S4bis, B-9000 Ghent, Belgium
| |
|
24
|
Kretschmer F, Seipp J, Ludwig M, Klau GW, Böcker S. Coverage bias in small molecule machine learning. Nat Commun 2025; 16:554. [PMID: 39788952 PMCID: PMC11718084 DOI: 10.1038/s41467-024-55462-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Accepted: 12/12/2024] [Indexed: 01/12/2025] Open
Abstract
Small molecule machine learning aims to predict chemical, biochemical, or biological properties from molecular structures, with applications such as toxicity prediction, ligand binding, and pharmacokinetics. A recent trend is developing end-to-end models that avoid explicit domain knowledge. These models assume no coverage bias in training and evaluation data, meaning the data are representative of the true distribution. However, the domain of applicability is rarely considered in such models. Here, we investigate how well large-scale datasets cover the space of known biomolecular structures. For doing so, we propose a distance measure based on solving the Maximum Common Edge Subgraph (MCES) problem, which aligns well with chemical similarity. Although this method is computationally hard, we introduce an efficient approach combining Integer Linear Programming and heuristic bounds. Our findings reveal that many widely-used datasets lack uniform coverage of biomolecular structures, limiting the predictive power of models trained on them. We propose two additional methods to assess whether training datasets diverge from known molecular distributions, potentially guiding future dataset creation to improve model performance.
Affiliation(s)
- Fleming Kretschmer
- Chair for Bioinformatics, Institute for Computer Science, Friedrich Schiller University Jena, Jena, Germany
| | - Jan Seipp
- Algorithmic Bioinformatics, Institute for Computer Science, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Marcus Ludwig
- Chair for Bioinformatics, Institute for Computer Science, Friedrich Schiller University Jena, Jena, Germany
- Currently at Bright Giant, Jena, Germany
| | - Gunnar W Klau
- Algorithmic Bioinformatics, Institute for Computer Science, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Sebastian Böcker
- Chair for Bioinformatics, Institute for Computer Science, Friedrich Schiller University Jena, Jena, Germany.
| |
|
25
|
Belaid WF, Dekhira A, Lesot P, Ferroukhi O. Development of deep learning software to improve HPLC and GC predictions using a new crown-ether based mesogenic stationary phase and beyond. J Chromatogr A 2025; 1739:465476. [PMID: 39566284 DOI: 10.1016/j.chroma.2024.465476] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Revised: 10/23/2024] [Accepted: 10/25/2024] [Indexed: 11/22/2024]
Abstract
The application of AI to analytical and separative sciences is a recent challenge that offers new perspectives in terms of data prediction. In this work, we report AI-based software, named Chrompredict 1.0, which is based on chromatographic data from a novel mesogenic crown ether stationary phase (CESP). Its molecular design represents a significant advancement due to the unique combination of properties and binding capabilities, including the formation of a cavity, mesogenic behavior via mobile chains, and a range of polar and non-polar interactions (aromatic rings, N=N and C=O double bonds, alkyl chains, π-π interactions, and hydrogen bonding). The mesogenic phase is effective in both normal and reversed-phase chromatography, enhancing the software's adaptability across diverse datasets. Here we introduce for the first time a scientific approach that integrates deep learning techniques with the novel CESP, which demonstrates exceptional thermal and analytical performance in both liquid chromatography modes, especially in the separation of complex hydrocarbon isomers. This ability enables the results obtained with CESP to extend across various types of stationary phases. Leveraging these insights, a comprehensive chromatographic dataset on a series of aromatic and polyaromatic molecules interacting with our CESP was used to train a Deep Learning Model (DLM). This model is embedded within user-friendly software, Chrompredict 1.0, designed for predicting chromatographic parameters (MAE = 0.042, R² = 0.95) by selecting chemical descriptors directly from SMILES notation. It offers a deeper understanding of molecular structure and interactions through exploratory data analysis, identifying key factors affecting model accuracy and chromatographic behavior. Users can configure hyperparameters, choose from six machine learning models, and compare their performance with the DLM. Chrompredict 1.0 excels in retention behavior prediction for compounds with known structures, and it accurately predicts chromatographic retention and thermal characteristics at different temperatures in HPLC and GC. The model has been successfully tested with the METLIN database of 1,023 small molecules of diverse structures and polarities (R² > 0.75, error range ±7.8 s). Overall, the CESP, combined with Chrompredict 1.0, offers a robust tool for intelligent chromatographic analysis, encompassing chemo-informatics, statistical analysis, and graphical capabilities across a broad range of compounds and stationary phases.
Affiliation(s)
- Warda Fella Belaid
- Laboratory of Chromatography, Faculty of Chemistry, University of Sciences and Technology Houari Boumedienne, USTHB, B.P. 32 El-Alia, Bab-Ezzouar, Algiers 16111, Algeria
| | - Azeddine Dekhira
- Laboratory of Computational Theoretical Chemistry and Photonics, Faculty of Chemistry, University of Sciences and Technology Houari Boumedienne, USTHB, B.P. 32 El-Alia, Bab-Ezzouar, Algiers 16111, Algeria
| | - Philippe Lesot
- Institut de Chimie Moléculaire et des Matériaux d'Orsay (ICMMO), UMR-CNRS 8182, Faculté des Sciences d'Orsay, Equipe RMN en Milieu Orienté, Université Paris-Saclay, Site Henri Moissan (HM-1), Bureau 0209 - RDC, 17-19, Avenue des Sciences, Orsay 91400, France; Centre National de la Recherche Scientifique (CNRS), 3, Rue Michel Ange, Paris 75016, France
| | - Ouassila Ferroukhi
- Laboratory of Chromatography, Faculty of Chemistry, University of Sciences and Technology Houari Boumedienne, USTHB, B.P. 32 El-Alia, Bab-Ezzouar, Algiers 16111, Algeria.
| |
|
26
|
Meister I, Boccard J, Rudaz S. Extracting Knowledge from MS Clinical Metabolomic Data: Processing and Analysis Strategies. Methods Mol Biol 2025; 2855:539-554. [PMID: 39354326 DOI: 10.1007/978-1-0716-4116-3_29] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/03/2024]
Abstract
Assessing potential alterations of metabolic pathways using large-scale approaches plays a central role in clinical research today. Because several thousand mass features can be measured for each sample with separation techniques hyphenated to mass spectrometry (MS) detection, adapted strategies have to be implemented to detect altered pathways and help to elucidate the mechanisms of pathologies. These procedures include peak detection, sample alignment, normalization, statistical analysis, and metabolite annotation. Considerable advances have been made in recent years in analytics, bioinformatics, and chemometrics, allowing massive and complex metabolomic data to be handled more adequately with automated processing and data analysis workflows. Recent developments and remaining challenges related to MS signal processing, metabolite annotation, and biomarker discovery based on statistical models are illustrated in this chapter in light of their application to clinical research.
Affiliation(s)
- Isabel Meister
- School of Pharmaceutical Sciences, University of Geneva, University of Lausanne, Geneva, Switzerland
- Swiss Centre for Applied Human Toxicology (SCAHT), Universities of Basel and Geneva, Basel, Switzerland
| | - Julien Boccard
- School of Pharmaceutical Sciences, University of Geneva, University of Lausanne, Geneva, Switzerland
- Swiss Centre for Applied Human Toxicology (SCAHT), Universities of Basel and Geneva, Basel, Switzerland
| | - Serge Rudaz
- School of Pharmaceutical Sciences, University of Geneva, University of Lausanne, Geneva, Switzerland.
- Swiss Centre for Applied Human Toxicology (SCAHT), Universities of Basel and Geneva, Basel, Switzerland.
| |
|
27
|
Xie J, Chen S, Zhao L, Dong X. Application of artificial intelligence to quantitative structure-retention relationship calculations in chromatography. J Pharm Anal 2025; 15:101155. [PMID: 39896319 PMCID: PMC11782803 DOI: 10.1016/j.jpha.2024.101155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2024] [Revised: 11/09/2024] [Accepted: 11/20/2024] [Indexed: 02/04/2025] Open
Abstract
Quantitative structure-retention relationship (QSRR) is an important tool in chromatography. QSRR examines the correlation between molecular structures and their retention behaviors during chromatographic separation. This approach involves developing models for predicting the retention time (RT) of analytes, thereby accelerating method development and facilitating compound identification. In addition, QSRR can be used to study compound retention mechanisms and support drug screening efforts. This review provides a comprehensive analysis of QSRR workflows and applications, with a special focus on the role of artificial intelligence, an area not thoroughly explored in previous reviews. Moreover, we discuss current limitations in RT prediction and propose promising solutions. Overall, this review offers a fresh perspective on future QSRR research, encouraging the development of innovative strategies that enable the diverse applications of QSRR models in chromatographic analysis.
Affiliation(s)
- Jingru Xie
- School of Medicine, Shanghai University, Shanghai, 200444, China
- Department of Pharmacy, Shanghai Baoshan Luodian Hospital, Baoshan District, Shanghai, 201908, China
- Luodian Clinical Drug Research Center, Institute for Translational Medicine Research, Shanghai University, Shanghai, 200444, China
| | - Si Chen
- School of Medicine, Shanghai University, Shanghai, 200444, China
- Luodian Clinical Drug Research Center, Institute for Translational Medicine Research, Shanghai University, Shanghai, 200444, China
| | - Liang Zhao
- School of Medicine, Shanghai University, Shanghai, 200444, China
- Department of Pharmacy, Shanghai Baoshan Luodian Hospital, Baoshan District, Shanghai, 201908, China
- Luodian Clinical Drug Research Center, Institute for Translational Medicine Research, Shanghai University, Shanghai, 200444, China
| | - Xin Dong
- School of Medicine, Shanghai University, Shanghai, 200444, China
- Luodian Clinical Drug Research Center, Institute for Translational Medicine Research, Shanghai University, Shanghai, 200444, China
- Suzhou Innovation Center of Shanghai University, Suzhou, 215000, Jiangsu, China
| |
|
28
|
Hupatz H, Rahu I, Wang WC, Peets P, Palm EH, Kruve A. Critical review on in silico methods for structural annotation of chemicals detected with LC/HRMS non-targeted screening. Anal Bioanal Chem 2025; 417:473-493. [PMID: 39138659 DOI: 10.1007/s00216-024-05471-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Revised: 07/22/2024] [Accepted: 07/24/2024] [Indexed: 08/15/2024]
Abstract
Non-targeted screening with liquid chromatography coupled to high-resolution mass spectrometry (LC/HRMS) is increasingly leveraging in silico methods, including machine learning, to obtain candidate structures for structural annotation of LC/HRMS features and their further prioritization. Candidate structures are commonly retrieved based on the tandem mass spectral information either from spectral or structural databases; however, the vast majority of the detected LC/HRMS features remain unannotated, constituting what we refer to as a part of the unknown chemical space. Recently, the exploration of this chemical space has become accessible through generative models. Furthermore, the evaluation of the candidate structures benefits from the complementary empirical analytical information such as retention time, collision cross section values, and ionization type. In this critical review, we provide an overview of the current approaches for retrieving and prioritizing candidate structures. These approaches come with their own set of advantages and limitations, as we showcase in the example of structural annotation of ten known and ten unknown LC/HRMS features. We emphasize that these limitations stem from both experimental and computational considerations. Finally, we highlight three key considerations for the future development of in silico methods.
Affiliation(s)
- Henrik Hupatz
- Department of Materials and Environmental Chemistry, Stockholm University, Svante Arrhenius Väg 16, 114 18, Stockholm, Sweden
- Stockholm University Center for Circular and Sustainable Systems (SUCCeSS), Stockholm University, 106 91, Stockholm, Sweden
| | - Ida Rahu
- Department of Materials and Environmental Chemistry, Stockholm University, Svante Arrhenius Väg 16, 114 18, Stockholm, Sweden.
| | - Wei-Chieh Wang
- Department of Materials and Environmental Chemistry, Stockholm University, Svante Arrhenius Väg 16, 114 18, Stockholm, Sweden
| | - Pilleriin Peets
- Institute of Biodiversity, Faculty of Biological Science, Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07743, Jena, Germany
| | - Emma H Palm
- Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6 Avenue du Swing, 4367, Belvaux, Luxembourg
| | - Anneli Kruve
- Department of Materials and Environmental Chemistry, Stockholm University, Svante Arrhenius Väg 16, 114 18, Stockholm, Sweden.
- Stockholm University Center for Circular and Sustainable Systems (SUCCeSS), Stockholm University, 106 91, Stockholm, Sweden.
- Department of Environmental Science, Stockholm University, Svante Arrhenius Väg 8, 114 18, Stockholm, Sweden.
| |
|
29
|
Matyushin DD, Burov IA, Sholokhova AY. Uncertainty Quantification and Flagging of Unreliable Predictions in Predicting Mass Spectrometry-Related Properties of Small Molecules Using Machine Learning. Int J Mol Sci 2024; 25:13077. [PMID: 39684785 DOI: 10.3390/ijms252313077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2024] [Revised: 11/28/2024] [Accepted: 12/04/2024] [Indexed: 12/18/2024] Open
Abstract
Mass spectral identification (in particular, in metabolomics) can be refined by comparing the observed and predicted properties of molecules, such as chromatographic retention. Significant advancements have been made in predicting these values using machine learning and deep learning. Usually, model predictions either do not contain any indication of the possible error (uncertainty), or only one criterion is used for this purpose. Two values that allow the uncertainty to be estimated are the spread of the predictions of the several models included in an ensemble and the molecular similarity between the considered molecule and the most "similar" molecule from the training set. The Euclidean distance between vectors, calculated based on real-valued molecular descriptors, can be used for the assessment of molecular similarity. Another factor indicating uncertainty is the cluster to which the molecule belongs (data set clustering). Together, all three factors can be used as features for the uncertainty assessment model. Classification models that predict whether a prediction belongs to the worst 15% were obtained. The area under the receiver operating characteristic curve is in the range of 0.73-0.82 for the considered tasks: the prediction of retention indices in gas chromatography, retention times in liquid chromatography, and collision cross-sections in ion mobility spectroscopy.
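Two of the three uncertainty features described above, together with the "worst 15%" labelling, can be sketched in a few lines. This is a minimal illustration with invented numbers and hypothetical function names, not the authors' implementation; the cluster-membership feature and the classifier itself are omitted.

```python
import math
import statistics


def ensemble_spread(predictions):
    """Standard deviation of the predictions made by the ensemble members."""
    return statistics.pstdev(predictions)


def nearest_training_distance(descriptor, training_descriptors):
    """Euclidean distance to the closest training-set descriptor vector."""
    return min(math.dist(descriptor, t) for t in training_descriptors)


def flag_worst_fraction(abs_errors, fraction=0.15):
    """Label the largest `fraction` of absolute errors as unreliable (True).

    Ties at the threshold are all flagged, so slightly more than
    `fraction` of the entries can be labelled True.
    """
    n_flag = max(1, round(fraction * len(abs_errors)))
    threshold = sorted(abs_errors, reverse=True)[n_flag - 1]
    return [e >= threshold for e in abs_errors]


# Invented example: three ensemble members and two training descriptors.
spread = ensemble_spread([10.0, 12.0, 14.0])
distance = nearest_training_distance((4.0, 5.0), [(0.0, 0.0), (1.0, 1.0)])

# Invented absolute errors for 20 molecules; the top 15% (3 of 20) are flagged.
labels = flag_worst_fraction([float(i) for i in range(1, 21)])
print(round(spread, 3), distance, sum(labels))
```

The `labels` list plays the role of the classification target, while `spread` and `distance` are examples of the input features such a classifier would receive.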
Affiliation(s)
- Dmitriy D Matyushin
- A.N. Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, 31 Leninsky Prospect, GSP-1, 119071 Moscow, Russia
| | - Ivan A Burov
- A.N. Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, 31 Leninsky Prospect, GSP-1, 119071 Moscow, Russia
| | - Anastasia Yu Sholokhova
- A.N. Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, 31 Leninsky Prospect, GSP-1, 119071 Moscow, Russia
| |
|
30
|
Kensert A, Desmet G, Cabooter D. MolGraph: a Python package for the implementation of molecular graphs and graph neural networks with TensorFlow and Keras. J Comput Aided Mol Des 2024; 39:3. [PMID: 39636382 DOI: 10.1007/s10822-024-00578-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Accepted: 10/28/2024] [Indexed: 12/07/2024]
Abstract
Molecular machine learning (ML) has proven important for tackling various molecular problems, such as predicting molecular properties based on molecular descriptors or fingerprints. Since relatively recently, graph neural network (GNN) algorithms have been implemented for molecular ML, showing comparable or superior performance to descriptor or fingerprint-based approaches. Although various tools and packages exist to apply GNNs in molecular ML, a new GNN package, named MolGraph, was developed in this work with the motivation to create GNN model pipelines highly compatible with the TensorFlow and Keras application programming interface (API). MolGraph also implements a module to accommodate the generation of small molecular graphs, which can be passed to a GNN algorithm to solve a molecular ML problem. To validate the GNNs, benchmarking was conducted using the datasets from MoleculeNet, as well as three chromatographic retention time datasets. The benchmarking results demonstrate that the GNNs performed in line with expectations. Additionally, the GNNs proved useful for molecular identification and improved interpretability of chromatographic retention time data. MolGraph is available at https://github.com/akensert/molgraph . Installation, tutorials and implementation details can be found at https://molgraph.readthedocs.io/en/latest/ .
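To make the idea of passing a small molecular graph to a GNN concrete, the sketch below performs one sum-aggregation message-passing round on a toy graph in plain Python. It deliberately does not use MolGraph, TensorFlow, or Keras, and it does not reflect MolGraph's API; the function name, the one-hot features, and the choice of ethanol's heavy atoms as the example are assumptions made for illustration.

```python
def message_passing_step(node_features, edges):
    """One sum-aggregation round: each node adds its neighbours' feature vectors to its own."""
    updated = [list(f) for f in node_features]
    for a, b in edges:  # each edge is an undirected bond between atoms a and b
        for k in range(len(node_features[a])):
            updated[a][k] += node_features[b][k]
            updated[b][k] += node_features[a][k]
    return updated


# Heavy atoms of ethanol (C-C-O); features are one-hot [is_carbon, is_oxygen].
nodes = [[1, 0], [1, 0], [0, 1]]
bonds = [(0, 1), (1, 2)]
print(message_passing_step(nodes, bonds))  # [[2, 0], [2, 1], [1, 1]]
```

After one round, the middle carbon's vector already reflects that it is bonded to one carbon and one oxygen. A real GNN layer would additionally apply learned weights and a nonlinearity.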
Affiliation(s)
- Alexander Kensert
- Pharmaceutical and Pharmacological Sciences, KU Leuven, Herestraat 49, 3000, Leuven, Belgium.
- Chemical Engineering, Vrije Universiteit Brussel, Pleinlaan 2, 1050, Brussel, Belgium.
| | - Gert Desmet
- Chemical Engineering, Vrije Universiteit Brussel, Pleinlaan 2, 1050, Brussel, Belgium
| | - Deirdre Cabooter
- Pharmaceutical and Pharmacological Sciences, KU Leuven, Herestraat 49, 3000, Leuven, Belgium
| |
|
31
|
Sarkar S, Zheng X, Clair GC, Kwon YM, You Y, Swensen AC, Webb-Robertson BJM, Nakayasu ES, Qian WJ, Metz TO. Exploring new frontiers in type 1 diabetes through advanced mass-spectrometry-based molecular measurements. Trends Mol Med 2024; 30:1137-1151. [PMID: 39152082 PMCID: PMC11631641 DOI: 10.1016/j.molmed.2024.07.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Revised: 07/17/2024] [Accepted: 07/22/2024] [Indexed: 08/19/2024]
Abstract
Type 1 diabetes (T1D) is a devastating autoimmune disease for which advanced mass spectrometry (MS) methods are increasingly used to identify new biomarkers and better understand underlying mechanisms. For example, integration of MS analysis and machine learning has identified multimolecular biomarker panels. In mechanistic studies, MS has contributed to the discovery of neoepitopes, and pathways involved in disease development and identifying therapeutic targets. However, challenges remain in understanding the role of tissue microenvironments, spatial heterogeneity, and environmental factors in disease pathogenesis. Recent advancements in MS, such as ultra-fast ion-mobility separations, and single-cell and spatial omics, can play a central role in addressing these challenges. Here, we review recent advancements in MS-based molecular measurements and their role in understanding T1D.
Affiliation(s)
- Soumyadeep Sarkar
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, 99352, USA
| | - Xueyun Zheng
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, 99352, USA
| | - Geremy C Clair
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, 99352, USA
| | - Yu Mi Kwon
- Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA, 99352, USA
| | - Youngki You
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, 99352, USA
| | - Adam C Swensen
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, 99352, USA
| | | | - Ernesto S Nakayasu
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, 99352, USA.
| | - Wei-Jun Qian
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, 99352, USA.
| | - Thomas O Metz
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, 99352, USA.
| |
|
32
|
Klein J, Lam H, Mak TD, Bittremieux W, Perez-Riverol Y, Gabriels R, Shofstahl J, Hecht H, Binz PA, Kawano S, Van Den Bossche T, Carver J, Neely BA, Mendoza L, Suomi T, Claeys T, Payne T, Schulte D, Sun Z, Hoffmann N, Zhu Y, Neumann S, Jones AR, Bandeira N, Vizcaíno JA, Deutsch EW. The Proteomics Standards Initiative Standardized Formats for Spectral Libraries and Fragment Ion Peak Annotations: mzSpecLib and mzPAF. Anal Chem 2024; 96:18491-18501. [PMID: 39514576 PMCID: PMC11579979 DOI: 10.1021/acs.analchem.4c04091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2024] [Revised: 10/16/2024] [Accepted: 11/01/2024] [Indexed: 11/16/2024]
Abstract
Mass spectral libraries are collections of reference spectra, usually associated with specific analytes from which the spectra were generated, that are used for further downstream analysis of new spectra. There are many different formats used for encoding spectral libraries, but none have undergone a standardization process to ensure broad applicability to many applications. As part of the Human Proteome Organization Proteomics Standards Initiative (PSI), we have developed a standardized format for encoding spectral libraries, called mzSpecLib (https://psidev.info/mzSpecLib). It is primarily a data model that flexibly encodes metadata about the library entries using the extensible PSI-MS controlled vocabulary and can be encoded in and converted between different serialization formats. We have also developed a standardized data model and serialization for fragment ion peak annotations, called mzPAF (https://psidev.info/mzPAF). It is defined as a separate standard, since it may be used for other applications besides spectral libraries. The mzSpecLib and mzPAF standards are compatible with existing PSI standards such as ProForma 2.0 and the Universal Spectrum Identifier. The mzSpecLib and mzPAF standards have been primarily defined for peptides in proteomics applications with basic small molecule support. They could be extended in the future to other fields that need to encode spectral libraries for nonpeptidic analytes.
Affiliation(s)
- Joshua Klein
- Program for Bioinformatics, Boston University, Boston, Massachusetts 02215, United States
| | - Henry Lam
- Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, 999077 Hong Kong, P. R. China
| | - Tytus D. Mak
- Mass Spectrometry Data Center, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
| | - Wout Bittremieux
- Department of Computer Science, University of Antwerp, 2020 Antwerpen, Belgium
| | - Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Ralf Gabriels
- VIB-UGent Center for Medical Biotechnology, VIB, 9052 Ghent, Belgium
- Department of Biomolecular Medicine, Faculty of Medicine and Health Sciences, Ghent University, 9052 Ghent, Belgium
| | - Jim Shofstahl
- Thermo Fisher Scientific, 355 River Oaks Parkway, San Jose, California 95134, United States
| | - Helge Hecht
- RECETOX, Faculty of Science, Masaryk University, Kotlářská 2, 60200 Brno, Czech Republic
| | | | - Shin Kawano
- Database Center for Life Science, Joint Support Center for Data Science Research, Research Organization of Information and Systems, Chiba 277-0871, Japan
- School of Frontier Engineering, Kitasato University, Sagamihara 252-0373, Japan
| | - Tim Van Den Bossche
- VIB-UGent Center for Medical Biotechnology, VIB, 9052 Ghent, Belgium
- Department of Biomolecular Medicine, Faculty of Medicine and Health Sciences, Ghent University, 9052 Ghent, Belgium
| | - Jeremy Carver
- Center for Computational Mass Spectrometry, Department of Computer Science and Engineering, University of California, San Diego, California 92093-0404, United States
| | - Benjamin A. Neely
- National Institute of Standards and Technology (NIST) Charleston, Charleston, South Carolina 29412, United States
| | - Luis Mendoza
- Institute for Systems Biology, Seattle, Washington 98109, United States
| | - Tomi Suomi
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, FI-20520 Turku, Finland
| | - Tine Claeys
- VIB-UGent Center for Medical Biotechnology, VIB, 9052 Ghent, Belgium
- Department of Biomolecular Medicine, Faculty of Medicine and Health Sciences, Ghent University, 9052 Ghent, Belgium
| | - Thomas Payne
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Douwe Schulte
- Biomolecular Mass Spectrometry and Proteomics, Bijvoet Center for Biomolecular Research and Utrecht Institute of Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH, Utrecht, The Netherlands
| | - Zhi Sun
- Institute for Systems Biology, Seattle, Washington 98109, United States
| | - Nils Hoffmann
- Institute for Bio- and Geosciences (IBG-5), Forschungszentrum Jülich GmbH, 52428 Jülich, Germany
| | - Yunping Zhu
- National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, #38, Life Science Park, Changping District, Beijing 102206, China
| | - Steffen Neumann
- Computational Plant Biochemistry, Leibniz Institute of Plant Biochemistry, 06120 Halle, Germany
- German Centre for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, 04103 Leipzig, Germany
| | - Andrew R. Jones
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 3BX, United Kingdom
| | - Nuno Bandeira
- Center for Computational Mass Spectrometry, Department of Computer Science and Engineering, University of California, San Diego, California 92093-0404, United States
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California 92093, United States
| | - Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Eric W. Deutsch
- Institute for Systems Biology, Seattle, Washington 98109, United States
| |
|
33
|
Kavianpour B, Piadeh F, Gheibi M, Ardakanian A, Behzadian K, Campos LC. Applications of artificial intelligence for chemical analysis and monitoring of pharmaceutical and personal care products in water and wastewater: A review. Chemosphere 2024; 368:143692. [PMID: 39515544 DOI: 10.1016/j.chemosphere.2024.143692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2024] [Revised: 09/15/2024] [Accepted: 11/04/2024] [Indexed: 11/16/2024]
Abstract
Specifying and interpreting the occurrence of emerging pollutants is essential for assessing treatment processes and plants, conducting wastewater-based epidemiology, and advancing environmental toxicology research. In recent years, artificial intelligence (AI) has been increasingly applied to enhance chemical analysis and monitoring of contaminants in environmental water and wastewater. However, their specific roles targeting pharmaceuticals and personal care products (PPCPs) have not been reviewed sufficiently. This review aims to narrow the gap by highlighting, scoping, and discussing the incorporation of AI during the detection and quantification of PPCPs when utilising chemical analysis equipment and interpreting their monitoring data for the first time. In the chemical analysis of PPCPs, AI-assisted prediction of chromatographic retention times and collision cross-sections (CCS) in suspect and non-target screenings using high-resolution mass spectrometry (HRMS) enhances detection confidence, reduces analysis time, and lowers costs. AI also aids in interpreting spectroscopic analysis results. However, this approach still cannot be applied in all matrices, as it offers lower sensitivity than liquid chromatography coupled with tandem or HRMS. For the interpretation of monitoring of PPCPs, unsupervised AI methods have recently presented the capacity to survey regional or national community health and socioeconomic factors. Nevertheless, as a challenge, long-term monitoring data sources are not given in the literature, and more comparative AI studies are needed for both chemical analysis and monitoring. Finally, AI assistance anticipates more frequent applications of CCS prediction to enhance detection confidence and the use of AI methods in data processing for wastewater-based epidemiology and community health surveillance.
Affiliation(s)
- Babak Kavianpour
- School of Computing and Engineering, University of West London, St Mary's Rd, London W5 5RF, UK
| | - Farzad Piadeh
- School of Computing and Engineering, University of West London, St Mary's Rd, London W5 5RF, UK; Centre for Engineering Research, School of Physics, Engineering and Computer Science, University of Hertfordshire, Hatfield, AL10 9AB, UK
| | - Mohammad Gheibi
- Institute for Nanomaterials, Advanced Technologies and Innovation, Technical University of Liberec, 46117, Liberec, Czech Republic
| | - Atiyeh Ardakanian
- School of Computing and Engineering, University of West London, St Mary's Rd, London W5 5RF, UK
| | - Kourosh Behzadian
- School of Computing and Engineering, University of West London, St Mary's Rd, London W5 5RF, UK; Centre for Urban Sustainability and Resilience, Department of Civil, Environmental and Geomatic Engineering, University College London, London WC1E6BT, UK.
| | - Luiza C Campos
- Centre for Urban Sustainability and Resilience, Department of Civil, Environmental and Geomatic Engineering, University College London, London WC1E6BT, UK
| |
|
34
|
Kumari P, Guilherme MSR, Choudhary P, Van Laethem T, Fillet M, Hubert P, Sacre PY, Hubert C. Transfer Learning Approach to Multitarget QSRR Modeling in RPLC. J Chem Inf Model 2024; 64:7447-7456. [PMID: 39284310 DOI: 10.1021/acs.jcim.4c00608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2024]
Abstract
Quantitative structure-retention relationship (QSRR) modeling is a valuable technique for predicting the retention times of small molecules. It aims to bridge the gap between molecular structure and chromatographic behavior, offering invaluable insights for analytical chemistry. Given the challenge of simultaneous target prediction with variable experimental conditions and the scarcity of comprehensive data sets for such predictive modeling in chromatography, this study introduces a transfer learning-based multitarget QSRR approach to enhance retention time prediction. Through a comparative study of four models, both with and without the transfer learning approach, the performance of both single and multitarget QSRR was evaluated based on Mean Squared Error (MSE) and R2 metrics. Individual models were also tested for their performance against benchmark studies in this field. The findings suggest that transfer learning-based multitarget models exhibit potential for enhanced accuracy in predicting retention times of small molecules, presenting a promising avenue for QSRR modeling. These models will be highly beneficial for optimizing experimental conditions in method development through better retention time predictions in Reversed-Phase Liquid Chromatography (RPLC). The reliable and effective predictive capabilities of these models make them valuable tools for pharmaceutical research and development endeavors.
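The two evaluation metrics named in this abstract, mean squared error and the coefficient of determination R2, can be computed as in the sketch below. The function names and the toy retention times are assumptions for illustration. The abstract does not say how the metrics were aggregated across targets, so the sketch handles a single target only.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted retention times."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0]   # hypothetical retention times (min)
predicted = [1.0, 2.0, 4.0]  # hypothetical model output (min)
print(r_squared(observed, predicted))  # 0.5
```

A lower MSE and an R2 closer to 1 both indicate better agreement. R2 is undefined when all observed values are identical, because SS_tot is then zero.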
Affiliation(s)
- Priyanka Kumari
- Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, CIRM, Liège, Belgium 4000
- Laboratory for the Analysis of Medicines, CIRM, Liège, Belgium 4000
| | | | | | - Thomas Van Laethem
- Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, CIRM, Liège, Belgium 4000
| | - Marianne Fillet
- Laboratory for the Analysis of Medicines, CIRM, Liège, Belgium 4000
| | - Philippe Hubert
- Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, CIRM, Liège, Belgium 4000
| | - Pierre Yves Sacre
- Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, CIRM, Liège, Belgium 4000
| | - Cedric Hubert
- Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, CIRM, Liège, Belgium 4000
| |
|
35
|
Liu Y, Yoshizawa AC, Ling Y, Okuda S. Insights into predicting small molecule retention times in liquid chromatography using deep learning. J Cheminform 2024; 16:113. [PMID: 39375739 PMCID: PMC11460055 DOI: 10.1186/s13321-024-00905-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Accepted: 09/13/2024] [Indexed: 10/09/2024] Open
Abstract
In untargeted metabolomics, structures of small molecules are annotated using liquid chromatography-mass spectrometry by leveraging information from the molecular retention time (RT) in the chromatogram and m/z (formerly called "mass-to-charge ratio") in the mass spectrum. However, correct identification of metabolites is challenging due to the vast array of small molecules. Therefore, various in silico tools for mass spectrometry peak alignment and compound prediction have been developed; however, the list of candidate compounds remains extensive. Accurate RT prediction is important to exclude false candidates and facilitate metabolite annotation. Recent advancements in artificial intelligence (AI) have led to significant breakthroughs in the use of deep learning models in various fields. Release of a large RT dataset has mitigated the bottlenecks limiting the application of deep learning models, thereby improving their application in RT prediction tasks. This review lists the databases that can be used to expand training datasets and addresses the issue of molecular representation inconsistencies in datasets. It also discusses the application of AI technology for RT prediction, particularly in the 5 years following the release of the METLIN small molecule RT dataset. This review provides a comprehensive overview of the AI applications used for RT prediction, highlighting the progress and remaining challenges. SCIENTIFIC CONTRIBUTION: This article focuses on the advancements in small molecule retention time prediction in computational metabolomics over the past five years, with a particular emphasis on the application of AI technologies in this field. It reviews the publicly available datasets for small molecule retention time, the molecular representation methods, and the AI algorithms applied in recent studies. Furthermore, it discusses the effectiveness of these models in assisting with the annotation of small molecule structures and the challenges that must be addressed to achieve practical applications.
Affiliation(s)
- Yuting Liu
- Medical AI Center, Niigata University School of Medicine, Niigata City, Niigata, 951-8514, Japan
| | - Akiyasu C Yoshizawa
- Medical AI Center, Niigata University School of Medicine, Niigata City, Niigata, 951-8514, Japan
| | - Yiwei Ling
- Medical AI Center, Niigata University School of Medicine, Niigata City, Niigata, 951-8514, Japan
| | - Shujiro Okuda
- Medical AI Center, Niigata University School of Medicine, Niigata City, Niigata, 951-8514, Japan.
| |
|
36
|
Zhang Y, Liu F, Li XQ, Gao Y, Li KC, Zhang QH. Retention time dataset for heterogeneous molecules in reversed-phase liquid chromatography. Sci Data 2024; 11:946. [PMID: 39209861 PMCID: PMC11362277 DOI: 10.1038/s41597-024-03780-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Accepted: 08/14/2024] [Indexed: 09/04/2024] Open
Abstract
Quantitative structure-property relationships have been extensively studied in the field of predicting retention times in liquid chromatography (LC). However, making transferable predictions is inherently complex because retention times are influenced by both the structure of the molecule and the chromatographic method used. Despite decades of development and numerous published machine learning models, the practical application of predicting small molecule retention time remains limited. The resulting models are typically limited to specific chromatographic conditions and the molecules used in their training and evaluation. Here, we have developed a comprehensive dataset comprising over 10,000 experimental retention times. These times were derived from 30 different reversed-phase liquid chromatography methods and pertain to a collection of 343 small molecules representing a wide range of chemical structures. These chromatographic methods encompass common LC setups for studying the retention behavior of small molecules. They offer a wide range of examples for modeling retention time with different LC setups.
Affiliation(s)
- Yan Zhang
- Key Laboratory of Groundwater Conservation of MWR, China University of Geosciences, Beijing, 100083, People's Republic of China
- Division of Chemical Metrology and Analytical Science, National Institute of Metrology, Beijing, 100029, People's Republic of China
- Key Laboratory of Chemical Metrology and Applications on Nutrition and Health for State Market Regulation, Beijing, 100029, China
| | - Fei Liu
- Key Laboratory of Groundwater Conservation of MWR, China University of Geosciences, Beijing, 100083, People's Republic of China.
| | - Xiu Qin Li
- Division of Chemical Metrology and Analytical Science, National Institute of Metrology, Beijing, 100029, People's Republic of China
- Key Laboratory of Chemical Metrology and Applications on Nutrition and Health for State Market Regulation, Beijing, 100029, China
| | - Yan Gao
- Division of Chemical Metrology and Analytical Science, National Institute of Metrology, Beijing, 100029, People's Republic of China
- Key Laboratory of Chemical Metrology and Applications on Nutrition and Health for State Market Regulation, Beijing, 100029, China
| | - Kang Cong Li
- Division of Chemical Metrology and Analytical Science, National Institute of Metrology, Beijing, 100029, People's Republic of China
- Key Laboratory of Chemical Metrology and Applications on Nutrition and Health for State Market Regulation, Beijing, 100029, China
| | - Qing He Zhang
- Division of Chemical Metrology and Analytical Science, National Institute of Metrology, Beijing, 100029, People's Republic of China.
- Key Laboratory of Chemical Metrology and Applications on Nutrition and Health for State Market Regulation, Beijing, 100029, China.
| |
|
37
|
Chou L, Zhang S, Luo W, Zhu W, Guo J, Tu K, Tan H, Wang C, Wei S, Yu H, Zhang X, Shi W. Identification of Key Toxic Substances Considering Metabolic Activation: A Combination of Transcriptome and Nontarget Analysis. Environ Sci Technol 2024; 58:14831-14842. [PMID: 39120612 DOI: 10.1021/acs.est.4c03683] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/10/2024]
Abstract
There have been numerous studies using effect-directed analysis (EDA) to identify key toxic substances present in source and drinking water, but none of these studies have considered the effects of metabolic activation. This study developed a comprehensive method including a pretreatment process based on an in vitro metabolic activation system, a comprehensive biological effect evaluation based on concentration-dependent transcriptome (CDT), and a chemical feature identification based on nontarget chemical analysis (NTA), to evaluate the changes in the toxic effects and differences in the chemical composition after metabolism. Models for matching metabolites and precursors as well as data-driven identification methods were further constructed to identify toxic metabolites and key toxic precursor substances in drinking water samples from the Yangtze River. After metabolism, the metabolic samples showed a general trend of reduced toxicity in terms of overall biological potency (mean: 3.2-fold). However, metabolic activation led to an increase in some types of toxic effects, including pathways such as excision repair, mismatch repair, protein processing in endoplasmic reticulum, nucleotide excision repair, and DNA replication. Meanwhile, metabolic samples showed a decrease (17.8%) in the number of peaks and average peak area after metabolism, while overall polarity, hydrophilicity, and average molecular weight increased slightly (10.3%). Based on the models for matching of metabolites and precursors and the data-driven identification methods, 32 chemicals were efficiently identified as key toxic substances as main contributors to explain the different transcriptome biological effects such as cellular component, development, and DNA damage related, including 15 industrial compounds, 7 PPCPs, 6 pesticides, and 4 natural products. This study avoids the process of structure elucidation of toxic metabolites and can trace them directly to the precursors based on MS spectra, providing a new idea for the identification of key toxic pollutants of metabolites.
Affiliation(s)
- Liben Chou
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, China
| | - Shaoqing Zhang
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, China
| | - Wenrui Luo
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, China
| | - Wenxuan Zhu
- Department of Mathematics, Statistics, and Computer Science, Macalester College, Saint Paul, Minnesota 55105, United States
| | - Jing Guo
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, China
| | - Keng Tu
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, China
| | - Haoyue Tan
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, China
| | - Chang Wang
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, Institute of Environment and Health, Jianghan University, Wuhan 430056, China
| | - Si Wei
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, China
- Jiangsu Province Ecology and Environment Protection Key Laboratory of Chemical Safety and Health Risk, Nanjing 210023, China
| | - Hongxia Yu
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, China
- Jiangsu Province Ecology and Environment Protection Key Laboratory of Chemical Safety and Health Risk, Nanjing 210023, China
| | - Xiaowei Zhang
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, China
- Jiangsu Province Ecology and Environment Protection Key Laboratory of Chemical Safety and Health Risk, Nanjing 210023, China
| | - Wei Shi
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, China
- Jiangsu Province Ecology and Environment Protection Key Laboratory of Chemical Safety and Health Risk, Nanjing 210023, China
| |
|
38
|
Bosten E, Pardon M, Chen K, Koppen V, Van Herck G, Hellings M, Cabooter D. Assisted Active Learning for Model-Based Method Development in Liquid Chromatography. Anal Chem 2024; 96:13699-13709. [PMID: 38979746 DOI: 10.1021/acs.analchem.4c02700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/10/2024]
Abstract
In recent decades, there has been a growing interest in fully automated methods for tackling complex optimization problems across various fields. Active learning (AL) and its variant, assisted active learning (AAL), incorporating guidance or assistance from external sources into the learning process, play key roles in this automation by enabling the autonomous selection of optimal experimental conditions to efficiently explore the problem space. These approaches are particularly valuable in situations wherein experimentation is costly or time-consuming. This study explores the application of AAL in model-based method development (MD) for liquid chromatography (LC) by using Bayesian statistics to incorporate historical data and analyte information for the generation of initial retention models. The process involves updating the model parameters based on new experiments, coupled with an active data selection method to choose the most informative experiment to run in a subsequent step. This iterative process balances model exploitation and experimental exploration until a satisfactory separation is achieved. The effectiveness of this approach is demonstrated via two practical examples, resulting in optimized separations in a limited number of experiments by optimizing the gradient slope. It is shown that the ability of AAL to leverage past knowledge and compound information to improve accuracy and reduce experimental runs offers a flexible alternative approach to fixed design methods.
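The "active data selection" step described in this abstract, choosing the most informative experiment to run next, can be illustrated with a generic uncertainty-sampling rule: pick the candidate condition where the current models disagree most. This is only a sketch of that general idea, not the selection criterion used in the paper. The linear retention model, the use of a handful of parameter samples to stand in for a Bayesian posterior, and all names are hypothetical.

```python
import statistics


def predict_log_k(params, x):
    """Hypothetical linear retention model: log k = intercept + slope * x."""
    intercept, slope = params
    return intercept + slope * x


def most_informative_condition(posterior_samples, candidates):
    """Return the candidate condition whose predicted retention varies most across samples."""
    def spread(x):
        return statistics.pvariance([predict_log_k(p, x) for p in posterior_samples])
    return max(candidates, key=spread)


# Two hypothetical parameter samples that agree at x = 0 and diverge as x grows.
samples = [(1.0, -1.0), (1.0, -2.0)]
print(most_informative_condition(samples, [0.0, 0.5, 1.0]))  # 1.0
```

In an iterative loop, the selected experiment would be run, the parameter samples updated with the new measurement, and the selection repeated until the separation is satisfactory.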
Affiliation(s)
- Emery Bosten
- Department for Pharmaceutical and Pharmacological Sciences, Pharmaceutical Analysis, University of Leuven (KU Leuven), Herestraat 49, 3000 Leuven, Belgium
- Therapeutics Development & Supply, Janssen Pharmaceutica, Turnhoutseweg 30, B-2340 Beerse, Belgium
| | - Marie Pardon
- Department for Pharmaceutical and Pharmacological Sciences, Pharmaceutical Analysis, University of Leuven (KU Leuven), Herestraat 49, 3000 Leuven, Belgium
| | - Kai Chen
- Therapeutics Development & Supply, Janssen Pharmaceutica, Turnhoutseweg 30, B-2340 Beerse, Belgium
| | - Valerie Koppen
- Therapeutics Development & Supply, Janssen Pharmaceutica, Turnhoutseweg 30, B-2340 Beerse, Belgium
| | - Gerd Van Herck
- Therapeutics Development & Supply, Janssen Pharmaceutica, Turnhoutseweg 30, B-2340 Beerse, Belgium
| | - Mario Hellings
- Therapeutics Development & Supply, Janssen Pharmaceutica, Turnhoutseweg 30, B-2340 Beerse, Belgium
| | - Deirdre Cabooter
- Department for Pharmaceutical and Pharmacological Sciences, Pharmaceutical Analysis, University of Leuven (KU Leuven), Herestraat 49, 3000 Leuven, Belgium
| |
Collapse
|
39
|
Beck AG, Fine J, Aggarwal P, Regalado EL, Levorse D, De Jesus Silva J, Sherer EC. Machine learning models and performance dependency on 2D chemical descriptor space for retention time prediction of pharmaceuticals. J Chromatogr A 2024; 1730:465109. [PMID: 38968662 DOI: 10.1016/j.chroma.2024.465109] [Received: 03/25/2024] [Revised: 06/17/2024] [Accepted: 06/18/2024] [Indexed: 07/07/2024]
Abstract
The predictive modeling of liquid chromatography methods can be an invaluable asset, potentially saving countless hours of labor while also reducing solvent consumption and waste. Tasks such as physicochemical screening and preliminary method screening systems where large amounts of chromatography data are collected from fast and routine operations are particularly well suited for both leveraging large datasets and benefiting from predictive models. Therefore, the generation of predictive models for retention time is an active area of development. However, for these predictive models to gain acceptance, researchers first must have confidence in model performance and the computational cost of building them should be minimal. In this study, a simple and cost-effective workflow for the development of machine learning models to predict retention time using only Molecular Operating Environment 2D descriptors as input for support vector regression is developed. Furthermore, we investigated the relative performance of models based on molecular descriptor space by utilizing uniform manifold approximation and projection and clustering with Gaussian mixture models to identify chemically distinct clusters. Results outlined herein demonstrate that local models trained on clusters in chemical space perform equivalently when compared to models trained on all data. Through 10-fold cross-validation on a comprehensive set containing 67,950 of our company's proprietary analytes, these models achieved coefficients of determination of 0.84 and 3 % error in terms of retention time. This promising statistical significance is found to translate from cross-validation to prospective prediction on an external test set of pharmaceutically relevant analytes. The observed equivalency of global and local modeling of large datasets is retained with METLIN's SMRT dataset, thereby confirming the wider applicability of the developed machine learning workflows for global models.
Affiliation(s)
- Armen G Beck
  - Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ 07065, USA
- Jonathan Fine
  - Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ 07065, USA
- Pankaj Aggarwal
  - Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ 07065, USA
- Erik L Regalado
  - Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ 07065, USA
- Dorothy Levorse
  - Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ 07065, USA
- Edward C Sherer
  - Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ 07065, USA

40
Zeng Z, Huo J, Zhang Y, Shi Y, Wu Z, Yang Q, Zhang X. Study on the correlation and difference of qualitative information among three types of UPLC-HRMS and potential generalization in metabolites annotation. J Chromatogr B Analyt Technol Biomed Life Sci 2024; 1243:124219. [PMID: 38943690 DOI: 10.1016/j.jchromb.2024.124219] [Received: 03/12/2024] [Revised: 05/24/2024] [Accepted: 06/24/2024] [Indexed: 07/01/2024]
Abstract
The variation of qualitative information among different types of mainstream hyphenated instruments of ultra-performance liquid chromatography coupled to high-resolution mass spectrometry (UPLC-HRMS) makes data sharing and standardization, and further comparison of results consistency in metabolite annotation not easy to attain. In this work, a quantitative study of correlation and difference was first achieved to systematically investigate the variation of retention time (tR), precursor ion (MS1), and product fragment ions (MS2) generated by three typical UPLC-HRMS instruments commonly used in metabolomics area. In terms of the findings of systematic and correlated variation of tR, MS1, and MS2 between different instruments, a computational strategy for integrated metabolite annotation was proposed to reduce the influence of differential ions, which made full use of the characteristic (common) and non-common fragments for scoring assessment. The regular variations of MS2 among three instruments under four collision energy voltages of high, medium, low, and hybrid levels were respectively inspected with three technical replicates at each level. These discoveries could improve general metabolite annotation with a known database and similarity comparison. It should provide the potential for metabolite annotation to generalize qualitative information obtained under different experimental conditions or using instruments from various manufacturers, which is still a big headache in untargeted metabolomics. The mixture of standard compounds and serum samples with the addition of standards were applied to demonstrate the principle and performance of the proposed method. The results showed that it could be an optional strategy for general use in HRMS-based metabolomics to offset the difference in metabolite annotation. It has some potential in untargeted metabolomics.
Affiliation(s)
- Zhongda Zeng
  - College of Environmental and Chemical Engineering, Dalian University, Dalian 116622, China
- Jinfeng Huo
  - College of Environmental and Chemical Engineering, Dalian University, Dalian 116622, China
- Yuxi Zhang
  - Dalian ChemDataSolution Information Technology Co. Ltd., Dalian 116023, China
- Yingjiao Shi
  - College of Environmental and Chemical Engineering, Dalian University, Dalian 116622, China
- Zeying Wu
  - School of Chemical Engineering and Material Sciences, Changzhou Institute of Technology, Changzhou 213032, China
- Qianxu Yang
  - Technology Center of China Tobacco Yunnan Industrial Co. Ltd., Kunming 650231, China
- Xiaodan Zhang
  - Key Laboratory of Plant Secondary Metabolism and Regulation of Zhejiang Province, College of Life Sciences and Medicine, Zhejiang Sci-Tech University, Hangzhou 310018, China

41
Cano-Prieto C, Undabarrena A, de Carvalho AC, Keasling JD, Cruz-Morales P. Triumphs and Challenges of Natural Product Discovery in the Postgenomic Era. Annu Rev Biochem 2024; 93:411-445. [PMID: 38639989 DOI: 10.1146/annurev-biochem-032620-104731] [Indexed: 04/20/2024]
Abstract
Natural products have played significant roles as medicine and food throughout human history. Here, we first provide a brief historical overview of natural products, their classification and biosynthetic origins, and the microbiological and genetic methods used for their discovery. We also describe and discuss the technologies that revolutionized the field, which transitioned from classic genetics to genome-centric discovery approximately two decades ago. We then highlight the most recent advancements and approaches in the current postgenomic era, in which genome mining is a standard operation and high-throughput analytical methods allow parallel discovery of genes and molecules at an unprecedented pace. Finally, we discuss the new challenges faced by the field of natural products and the future of systematic heterologous expression and strain-independent discovery, which promises to deliver more molecules in vials than ever before.
Affiliation(s)
- Carolina Cano-Prieto
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
- Agustina Undabarrena
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
- Ana Calheiros de Carvalho
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
- Jay D Keasling
  - Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
  - Center for Synthetic Biochemistry, Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Shenzhen, China
  - Joint BioEnergy Institute, Lawrence Berkeley National Laboratory, Emeryville, California, USA
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
  - Department of Bioengineering, University of California, Berkeley, California, USA
  - Department of Chemical and Biomolecular Engineering, University of California, Berkeley, California, USA
- Pablo Cruz-Morales
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark

42
Metz TO, Chang CH, Gautam V, Anjum A, Tian S, Wang F, Colby SM, Nunez JR, Blumer MR, Edison AS, Fiehn O, Jones DP, Li S, Morgan ET, Patti GJ, Ross DH, Shapiro MR, Williams AJ, Wishart DS. Introducing 'identification probability' for automated and transferable assessment of metabolite identification confidence in metabolomics and related studies. bioRxiv: the preprint server for biology 2024:2024.07.30.605945. [PMID: 39131324 PMCID: PMC11312557 DOI: 10.1101/2024.07.30.605945] [Indexed: 08/13/2024]
Abstract
Methods for assessing compound identification confidence in metabolomics and related studies have been debated and actively researched for the past two decades. The earliest effort in 2007 focused primarily on mass spectrometry and nuclear magnetic resonance spectroscopy and resulted in four recommended levels of metabolite identification confidence - the Metabolite Standards Initiative (MSI) Levels. In 2014, the original MSI Levels were expanded to five levels (including two sublevels) to facilitate communication of compound identification confidence in high resolution mass spectrometry studies. Further refinement in identification levels have occurred, for example to accommodate use of ion mobility spectrometry in metabolomics workflows, and alternate approaches to communicate compound identification confidence also have been developed based on identification points schema. However, neither qualitative levels of identification confidence nor quantitative scoring systems address the degree of ambiguity in compound identifications in context of the chemical space being considered, are easily automated, or are transferable between analytical platforms. In this perspective, we propose that the metabolomics and related communities consider identification probability as an approach for automated and transferable assessment of compound identification and ambiguity in metabolomics and related studies. Identification probability is defined simply as 1/N, where N is the number of compounds in a reference library or chemical space that match to an experimentally measured molecule within user-defined measurement precision(s), for example mass measurement or retention time accuracy, etc. 
We demonstrate the utility of identification probability in an in silico analysis of multi-property reference libraries constructed from the Human Metabolome Database and computational property predictions, provide guidance to the community in transparent implementation of the concept, and invite the community to further evaluate this concept in parallel with their current preferred methods for assessing metabolite identification confidence.
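The 1/N definition in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `LibraryEntry` type, the function name, the toy library values, the tolerance settings, and the choice to return 0.0 when nothing matches are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LibraryEntry:
    """Hypothetical reference-library record (illustrative fields only)."""
    name: str
    mass: float            # monoisotopic mass in Da
    retention_time: float  # retention time in minutes


def identification_probability(library, mass, retention_time,
                               mass_tol_ppm, rt_tol_min):
    """Return (1/N, matches), where N is the number of library entries
    that agree with the measurement within the given tolerances.

    Assumption: when N == 0 the probability is reported as 0.0; the
    abstract does not say how an empty match set should be handled.
    """
    matches = [
        entry for entry in library
        if abs(entry.mass - mass) / mass * 1e6 <= mass_tol_ppm
        and abs(entry.retention_time - retention_time) <= rt_tol_min
    ]
    n = len(matches)
    return (1.0 / n if n else 0.0), matches


# Toy library with made-up values, used only to exercise the function.
library = [
    LibraryEntry("candidate A", 180.0634, 2.10),
    LibraryEntry("candidate B", 180.0634, 2.15),
    LibraryEntry("candidate C", 180.0634, 5.40),
    LibraryEntry("candidate D", 194.0790, 2.12),
]

# Mass alone cannot separate A, B and C; a tight retention-time window
# leaves two candidates, so the probability is 1/2.
probability, candidates = identification_probability(
    library, mass=180.0634, retention_time=2.12,
    mass_tol_ppm=5.0, rt_tol_min=0.1,
)
```

Widening the retention-time window to 5 minutes in this toy case admits candidate C as well and lowers the value to 1/3, which mirrors the abstract's point that the result depends on the user-defined measurement precision and on the chemical space being searched.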
Affiliation(s)
- Thomas O. Metz
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA USA
- Christine H. Chang
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA USA
- Vasuk Gautam
  - Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada
- Afia Anjum
  - Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada
- Siyang Tian
  - Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada
- Fei Wang
  - Department of Computing Science, University of Alberta, Edmonton, AB, Canada
  - Alberta Machine Intelligence Institute, Edmonton, AB, Canada
- Sean M. Colby
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA USA
- Jamie R. Nunez
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA USA
- Madison R. Blumer
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA USA
- Arthur S. Edison
  - Department of Biochemistry & Molecular Biology, Complex Carbohydrate Research Center and Institute of Bioinformatics, University of Georgia, Athens, GA, USA
- Oliver Fiehn
  - West Coast Metabolomics Center, University of California Davis, Davis, CA, USA
- Dean P. Jones
  - Clinical Biomarkers Laboratory, Department of Medicine, Emory University, Atlanta, Georgia, USA
- Shuzhao Li
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Edward T. Morgan
  - Department of Pharmacology and Chemical Biology, Emory University School of Medicine, Atlanta, Georgia, USA
- Gary J. Patti
  - Center for Mass Spectrometry and Metabolic Tracing, Department of Chemistry, Department of Medicine, Washington University, Saint Louis, Missouri, USA
- Dylan H. Ross
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA USA
- Madelyn R. Shapiro
  - Artificial Intelligence & Data Analytics Division, Pacific Northwest National Laboratory, Richland, WA USA
- Antony J. Williams
  - U.S. Environmental Protection Agency, Office of Research & Development, Center for Computational Toxicology & Exposure (CCTE), Research Triangle Park, NC USA
- David S. Wishart
  - Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada

43
Catacutan DB, Alexander J, Arnold A, Stokes JM. Machine learning in preclinical drug discovery. Nat Chem Biol 2024:10.1038/s41589-024-01679-1. [PMID: 39030362 DOI: 10.1038/s41589-024-01679-1] [Received: 04/27/2023] [Accepted: 06/13/2024] [Indexed: 07/21/2024]
Abstract
Drug-discovery and drug-development endeavors are laborious, costly and time consuming. These programs can take upward of 12 years and cost US $2.5 billion, with a failure rate of more than 90%. Machine learning (ML) presents an opportunity to improve the drug-discovery process. Indeed, with the growing abundance of public and private large-scale biological and chemical datasets, ML techniques are becoming well positioned as useful tools that can augment the traditional drug-development process. In this Perspective, we discuss the integration of algorithmic methods throughout the preclinical phases of drug discovery. Specifically, we highlight an array of ML-based efforts, across diverse disease areas, to accelerate initial hit discovery, mechanism-of-action (MOA) elucidation and chemical property optimization. With advances in the application of ML across diverse therapeutic areas, we posit that fully ML-integrated drug-discovery pipelines will define the future of drug-development programs.
Affiliation(s)
- Denise B Catacutan
  - Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada
  - Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
  - David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Jeremie Alexander
  - Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada
  - Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
  - David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Autumn Arnold
  - Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada
  - Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
  - David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Jonathan M Stokes
  - Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada
  - Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
  - David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada

44
Wang X, Liu S, Chen S, He X, Duan W, Wang S, Zhao J, Zhang L, Chen Q, Xiong C. Prediction of adsorption performance of ZIF-67 for malachite green based on artificial neural network using L-BFGS algorithm. Journal of Hazardous Materials 2024; 473:134629. [PMID: 38762987 DOI: 10.1016/j.jhazmat.2024.134629] [Received: 01/22/2024] [Revised: 05/05/2024] [Accepted: 05/14/2024] [Indexed: 05/21/2024]
Abstract
Given the necessity and urgency in removing organic pollutants such as malachite green (MG) from the environment, it is vital to screen high-capacity adsorbents using artificial neural network (ANN) methods quickly and accurately. In this study, a series of ZIF-67 were synthesized, which adsorption properties for organic pollutants, especially MG, were systematically evaluated and determined as 241.720 mg g⁻¹ (25 ℃, 2 h). The adsorption process was more consistent with pseudo-second-order kinetics and Langmuir adsorption isotherm, which correlation coefficients were 0.995 and 0.997, respectively. The chemisorption mechanism was considered to be π-π stacking interaction between imidazole and aromatic ring. Then, a Python-based neural network model using the Limited-memory BFGS algorithm was constructed by collecting the crucial structural parameters of ZIF-67 and the experimental data of batch adsorption. The model, optimized extensively, outperformed similar Matlab-based ANN with a coefficient of determination of 0.9882 and mean square error of 0.0009 in predicting ZIF-67 adsorption of MG. Furthermore, the model demonstrated a good generalization ability in the predictive training of other organic pollutants. In brief, ANN was successfully separated from the Matlab platform, providing a robust framework for high-precision prediction of organic pollutants and guiding the synthesis of adsorbents.
Affiliation(s)
- Xiaoqing Wang
  - School of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China; Zhejiang Longsheng Group Co., Ltd, Shaoxing 312300, China
- Shangkun Liu
  - School of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Shaolei Chen
  - School of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Xubin He
  - Zhejiang Longsheng Group Co., Ltd, Shaoxing 312300, China
- Wenjing Duan
  - School of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Siyuan Wang
  - School of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Junzi Zhao
  - School of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Liangquan Zhang
  - School of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Qing Chen
  - Department of Applied Chemistry, Zhejiang Gongshang University, Hangzhou 310023, China
- Chunhua Xiong
  - Department of Applied Chemistry, Zhejiang Gongshang University, Hangzhou 310023, China

45
Nash W, Ngere JB, Najdekr L, Dunn WB. Characterization of Electrospray Ionization Complexity in Untargeted Metabolomic Studies. Anal Chem 2024; 96:10935-10942. [PMID: 38917347 PMCID: PMC11238156 DOI: 10.1021/acs.analchem.4c00966] [Received: 02/21/2024] [Revised: 05/31/2024] [Accepted: 06/11/2024] [Indexed: 06/27/2024]
Abstract
The annotation of metabolites detected in LC-MS-based untargeted metabolomics studies routinely applies accurate m/z of the intact metabolite (MS1) as well as chromatographic retention time and MS/MS data. Electrospray ionization and transfer of ions through the mass spectrometer can result in the generation of multiple "features" derived from the same metabolite with different m/z values but the same retention time. The complexity of the different charged and neutral adducts, in-source fragments, and charge states has not been previously and deeply characterized. In this paper, we report the first large-scale characterization using publicly available data sets derived from different research groups, instrument manufacturers, LC assays, sample types, and ion modes. 271 m/z differences relating to different metabolite feature pairs were reported, and 209 were annotated. The results show a wide range of different features being observed with only a core 32 m/z differences reported in >50% of the data sets investigated. There were no patterns reporting specific m/z differences that were observed in relation to ion mode, instrument manufacturer, LC assay type, and mammalian sample type, although some m/z differences were related to study group (mammal, microbe, plant) and mobile phase composition. The results provide the metabolomics community with recommendations of adducts, in-source fragments, and charge states to apply in metabolite annotation workflows.
Affiliation(s)
- William J. Nash
  - School of Biosciences, University of Birmingham, Birmingham, West Midlands B15 2TT, United Kingdom
- Judith B. Ngere
  - School of Biosciences, University of Birmingham, Birmingham, West Midlands B15 2TT, United Kingdom
- Lukas Najdekr
  - Institute of Molecular and Translational Medicine, Palacký University Olomouc, Olomouc 779 00, Czech Republic
- Warwick B. Dunn
  - School of Biosciences, University of Birmingham, Birmingham, West Midlands B15 2TT, United Kingdom
  - Centre for Metabolomics Research, Department of Biochemistry, Cell and Systems Biology, Institute of Systems, Molecular, and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom

46
Zhao D, Wang C, Qu H, Rao Q, Shen B, Jiang Y, Gong J, Wang Y, Geng D, Hong R, Lu T, Ni Q, Deng X. Glycomol: A pervasive tool for structure predication of natural saponin products basing on MS data. J Pharm Anal 2024; 14:100897. [PMID: 39036467 PMCID: PMC11259924 DOI: 10.1016/j.jpha.2023.11.004] [Received: 07/06/2023] [Revised: 10/23/2023] [Accepted: 11/15/2023] [Indexed: 07/23/2024]
Affiliation(s)
- Daotong Zhao
  - Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing, 100700, China
  - School of Life Sciences, Beijing University of Chinese Medicine, Beijing, 100029, China
- Chunguo Wang
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 100029, China
  - Beijing Research Institute of Chinese Medicine, Beijing University of Chinese Medicine, Beijing, 100029, China
- Hanyun Qu
  - School of Life Sciences, Beijing University of Chinese Medicine, Beijing, 100029, China
- Qinling Rao
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 100029, China
- Bingqing Shen
  - Beijing Research Institute of Chinese Medicine, Beijing University of Chinese Medicine, Beijing, 100029, China
- Yinan Jiang
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 100029, China
- Jiayu Gong
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 100029, China
- Yumiao Wang
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 100029, China
- Di Geng
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 100029, China
- Rui Hong
  - Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu, 215123, China
- Tao Lu
  - School of Life Sciences, Beijing University of Chinese Medicine, Beijing, 100029, China
- Qing Ni
  - Guang'anmen Hospital, China Academy of Chinese Medical Sciences, Beijing, 100053, China
- Xinqi Deng
  - Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing, 100700, China

47
Bouwmeester R, Richardson K, Denny R, Wilson ID, Degroeve S, Martens L, Vissers JPC. Predicting ion mobility collision cross sections and assessing prediction variation by combining conventional and data driven modeling. Talanta 2024; 274:125970. [PMID: 38621320 DOI: 10.1016/j.talanta.2024.125970] [Received: 11/02/2023] [Revised: 03/01/2024] [Accepted: 03/20/2024] [Indexed: 04/17/2024]
Abstract
The use of collision cross section (CCS) values derived from ion mobility studies is proving to be an increasingly important tool in the characterization and identification of molecules detected in complex mixtures. Here, a novel machine learning (ML) based method for predicting CCS integrating both molecular modeling (MM) and ML methodologies has been devised and shown to be able to accurately predict CCS values for singly charged small molecular weight molecules from a broad range of chemical classes. The model performed favorably compared to existing models, improving compound identifications for isobaric analytes in terms of ranking and assigning identification probability values to the annotation. Furthermore, charge localization was seen to be correlated with CCS prediction accuracy and with gas-phase proton affinity demonstrating the potential to provide a proxy for prediction error based on chemical structural properties. The presented approach and findings represent a further step towards accurate prediction and application of computationally generated CCS values.
Affiliation(s)
- Robbin Bouwmeester
  - VIB-UGent Center for Medical Biotechnology, Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
- Ian D Wilson
  - Computational & Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College, United Kingdom
- Sven Degroeve
  - VIB-UGent Center for Medical Biotechnology, Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
- Lennart Martens
  - VIB-UGent Center for Medical Biotechnology, Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, Ghent, Belgium

48
Bandini E, Castellano Ontiveros R, Kajtazi A, Eghbali H, Lynen F. Physicochemical modelling of the retention mechanism of temperature-responsive polymeric columns for HPLC through machine learning algorithms. J Cheminform 2024; 16:72. [PMID: 38907264 PMCID: PMC11193285 DOI: 10.1186/s13321-024-00873-6] [Received: 08/14/2023] [Accepted: 06/14/2024] [Indexed: 06/23/2024]
Abstract
Temperature-responsive liquid chromatography (TRLC) offers a promising alternative to reversed-phase liquid chromatography (RPLC) for environmentally friendly analytical techniques by utilizing pure water as a mobile phase, eliminating the need for harmful organic solvents. TRLC columns, packed with temperature-responsive polymers coupled to silica particles, exhibit a unique retention mechanism influenced by temperature-induced polymer hydration. An investigation of the physicochemical parameters driving separation at high and low temperatures is crucial for better column manufacturing and selectivity control. Assessment of predictability using a dataset of 139 molecules analyzed at different temperatures elucidated the molecular descriptors (MDs) relevant to retention mechanisms. Linear regression, support vector regression (SVR), and tree-based ensemble models were evaluated, with no standout performer. The precision, accuracy, and robustness of models were validated through metrics, such as r and mean absolute error (MAE), and statistical analysis. At 45 °C, logP predominantly influenced retention, akin to reversed-phase columns, while at 5 °C, complex interactions with lipophilic and negative MDs, along with specific functional groups, dictated retention. These findings provide deeper insights into TRLC mechanisms, facilitating method development and maximizing column potential.
Affiliation(s)
- Elena Bandini
  - Separation Science Group, Department of Organic and Macromolecular Chemistry, University of Ghent, Krijgslaan 281 S4bis, Ghent, 9000, Belgium
- Rodrigo Castellano Ontiveros
  - School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, 11428, Sweden
- Ardiana Kajtazi
  - Separation Science Group, Department of Organic and Macromolecular Chemistry, University of Ghent, Krijgslaan 281 S4bis, Ghent, 9000, Belgium
- Hamed Eghbali
  - Packaging and Specialty Plastics R&D, Dow Benelux B.V., Terneuzen, 4530 AA, the Netherlands
- Frédéric Lynen
  - Separation Science Group, Department of Organic and Macromolecular Chemistry, University of Ghent, Krijgslaan 281 S4bis, Ghent, 9000, Belgium

49
de Cripan SM, Arora T, Olomí A, Canela N, Siuzdak G, Domingo-Almenara X. Predicting the Predicted: A Comparison of Machine Learning-Based Collision Cross-Section Prediction Models for Small Molecules. Anal Chem 2024; 96:9088-9096. [PMID: 38783786 PMCID: PMC11154685 DOI: 10.1021/acs.analchem.4c00630] [Received: 02/01/2024] [Revised: 05/09/2024] [Accepted: 05/10/2024] [Indexed: 05/25/2024]
Abstract
The application of machine learning (ML) to -omics research is growing at an exponential rate owing to the increasing availability of large amounts of data for model training. Specifically, in metabolomics, ML has enabled the prediction of tandem mass spectrometry and retention time data. More recently, due to the advent of ion mobility, new ML models have been introduced for collision cross-section (CCS) prediction, but those have been trained with different and relatively small data sets covering a few thousand small molecules, which hampers their systematic comparison. Here, we compared four existing ML-based CCS prediction models and their capacity to predict CCS values using the recently introduced METLIN-CCS data set. We also compared them with simple linear models and with ML models that used fingerprints as regressors. We analyzed the role of the structural diversity of the data on which the ML models are trained and explored the practical application of these models for metabolite annotation using CCS values. Results showed a limited capability of the existing models to achieve the necessary accuracy to be adopted for routine metabolomics analysis. We showed that for a particular molecule, this accuracy could only be improved when models were trained with a large number of structurally similar counterparts. Therefore, we suggest that current annotation capabilities will only be significantly altered with models trained with heterogeneous data sets composed of large homogeneous hubs of structurally similar molecules to those being predicted.
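The abstract above mentions using predicted CCS values for metabolite annotation. One common way to do this is to discard candidate structures whose predicted CCS deviates from the measured value by more than a chosen relative tolerance. The sketch below illustrates that idea only; the function names, candidate labels, CCS values, and the 3% tolerance are hypothetical assumptions and do not describe the procedure used in the paper.

```python
def ccs_relative_error(measured, predicted):
    """Relative deviation of a predicted CCS from the measured CCS, in percent."""
    return abs(predicted - measured) / measured * 100.0


def filter_candidates(measured_ccs, candidates, tolerance_pct):
    """Keep candidates whose predicted CCS lies within the tolerance.

    `candidates` maps a candidate name to its predicted CCS value.
    """
    return [
        name
        for name, predicted in candidates.items()
        if ccs_relative_error(measured_ccs, predicted) <= tolerance_pct
    ]


# Hypothetical measured CCS and predicted CCS values (illustrative only).
measured = 150.0
predicted_ccs = {
    "candidate_A": 151.5,  # 1.0 % deviation
    "candidate_B": 160.0,  # about 6.7 % deviation
    "candidate_C": 147.0,  # 2.0 % deviation
}

kept = filter_candidates(measured, predicted_ccs, tolerance_pct=3.0)
print(kept)
```

The tighter the tolerance that the prediction accuracy supports, the more candidates can be excluded, which is why prediction error directly limits how useful CCS is for annotation.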
Affiliation(s)
- Sara M. de Cripan
- Computational Metabolomics for Systems Biology Lab, Eurecat, Technology Centre of Catalonia, Barcelona 08005, Catalonia, Spain
- Centre for Omics Sciences (COS), Unique Scientific and Technical Infrastructures (ICTS), Eurecat, Technology Centre of Catalonia & Rovira i Virgili University Joint Unit, Reus 43204, Catalonia, Spain
- Department of Electrical, Electronic and Control Engineering (DEEEA), Universitat Rovira i Virgili, Tarragona 43007, Catalonia, Spain
| | - Trisha Arora
- Computational Metabolomics for Systems Biology Lab, Eurecat, Technology Centre of Catalonia, Barcelona 08005, Catalonia, Spain
- Centre for Omics Sciences (COS), Unique Scientific and Technical Infrastructures (ICTS), Eurecat, Technology Centre of Catalonia & Rovira i Virgili University Joint Unit, Reus 43204, Catalonia, Spain
- Department of Electrical, Electronic and Control Engineering (DEEEA), Universitat Rovira i Virgili, Tarragona 43007, Catalonia, Spain
| | - Adrià Olomí
- Computational Metabolomics for Systems Biology Lab, Eurecat, Technology Centre of Catalonia, Barcelona 08005, Catalonia, Spain
- Centre for Omics Sciences (COS), Unique Scientific and Technical Infrastructures (ICTS), Eurecat, Technology Centre of Catalonia & Rovira i Virgili University Joint Unit, Reus 43204, Catalonia, Spain
| | - Núria Canela
- Centre for Omics Sciences (COS), Unique Scientific and Technical Infrastructures (ICTS), Eurecat, Technology Centre of Catalonia & Rovira i Virgili University Joint Unit, Reus 43204, Catalonia, Spain
| | - Gary Siuzdak
- Scripps Center of Metabolomics and Mass Spectrometry, Department of Chemistry, Molecular and Computational Biology, Scripps Research Institute, La Jolla, California 92037, United States
| | - Xavier Domingo-Almenara
- Computational Metabolomics for Systems Biology Lab, Eurecat, Technology Centre of Catalonia, Barcelona 08005, Catalonia, Spain
- Centre for Omics Sciences (COS), Unique Scientific and Technical Infrastructures (ICTS), Eurecat, Technology Centre of Catalonia & Rovira i Virgili University Joint Unit, Reus 43204, Catalonia, Spain
- Department of Electrical, Electronic and Control Engineering (DEEEA), Universitat Rovira i Virgili, Tarragona 43007, Catalonia, Spain
| |
|
50
|
Kostyukevich Y, Osipenko S, Borisova L, Kireev A. In-Electrospray source Hydrogen/Deuterium exchange coupled to multistage fragmentation for the investigation of the protonation and fragmentation pathways of gas phase ions. JOURNAL OF MASS SPECTROMETRY : JMS 2024; 59:e5032. [PMID: 38736146 DOI: 10.1002/jms.5032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 04/02/2024] [Indexed: 05/14/2024]
Abstract
Identification of molecules in complex natural matrices relies on matching the fragmentation spectra of ions under investigation and the spectra acquired for the corresponding analytical standards. Currently, there are many databases of experimentally measured tandem mass spectrometry spectra (such as NIST, MzCloud, and Metlin), and considerable progress has been made in the development of software for predicting tandem mass spectrometry fragments in silico using combinatorial, machine learning, and quantum chemistry approaches (such as MetFrag, CFM-ID, and QCxMS). However, in electrospray ionization, molecules can be ionized (protonated or deprotonated) at different sites, and the fragmentation spectra of the resulting ions differ. Here, we use the combination of the in-ESI source hydrogen/deuterium exchange reaction and MSn fragmentation to investigate the fragmentation pathways of different protomers of organic molecules. It is shown that the distribution of deuterium in the fragment ions reflects the presence of different protomers. For several molecules, the distribution of deuterium was traced up to the MS5 level of fragmentation, revealing many unusual and unexpected effects. For example, we investigated the loss of HF from the ciprofloxacin and norfloxacin ions and observed that for ions protonated at the -COOH group, the eliminated hydrogen always comes from the -NH group. When ions are protonated at another site, the eliminated hydrogen comes from the -NH group with a probability of 30% and from other sites on the molecule with a probability of 70%. Such effects were not described previously. Quantum chemical simulation was used for the verification of the protonated structures and simulation of the corresponding fragmentation spectra.
Affiliation(s)
| | - Sergey Osipenko
- Skolkovo Institute of Science and Technology, Moscow, Russia
| | | | - Albert Kireev
- Skolkovo Institute of Science and Technology, Moscow, Russia
| |
|