1
Liu Y, De Vijlder T, Bittremieux W, Laukens K, Heyndrickx W. Current and future deep learning algorithms for tandem mass spectrometry (MS/MS)-based small molecule structure elucidation. RAPID COMMUNICATIONS IN MASS SPECTROMETRY : RCM 2025; 39 Suppl 1:e9120. [PMID: 33955607 DOI: 10.1002/rcm.9120] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Received: 01/13/2021] [Revised: 04/13/2021] [Accepted: 04/29/2021] [Indexed: 06/12/2023]
Abstract
RATIONALE Structure elucidation of small molecules has been one of the cornerstone applications of mass spectrometry for decades. Despite the increasing availability of software tools, structure elucidation from tandem mass spectrometry (MS/MS) data remains a challenging task, leaving many spectra unidentified. However, as an increasing number of reference MS/MS spectra are being curated at a repository scale and shared on public servers, there is an exciting opportunity to develop powerful new deep learning (DL) models for automated structure elucidation.
ARCHITECTURES Recent early-stage DL frameworks mostly follow a "two-step approach" that translates MS/MS spectra to database structures after first predicting molecular descriptors. The related architectures could suffer from: (1) computational complexity because of the separate training of descriptor-specific classifiers, (2) the high-dimensional nature of mass spectral data and information loss due to data preprocessing, (3) the low substructure coverage and class imbalance of predefined molecular fingerprints. Inspired by successful DL frameworks employed in drug discovery fields, we have conceptualized and designed hypothetical DL architectures to tackle the above issues. For (1), we recommend multitask learning to achieve better performance with fewer classifiers by grouping structurally related descriptors. For (2) and (3), we introduce feature engineering to extract condensed and higher-order information from spectra and structure data. For instance, encoding spectra with subtrees and pre-calculated spectral patterns adds peak interactions to the model input. Encoding structures with graph convolutional networks incorporates connectivity within a molecule. The joint embedding of spectra and structures can enable simultaneous spectral library and molecular database search.
CONCLUSIONS In principle, given enough training data, adapted DL architectures, optimal hyperparameters and computing power, DL frameworks can predict small molecule structures, completely or at least partially, from MS/MS spectra. However, their performance and general applicability should be fairly evaluated against classical machine learning frameworks.
Affiliation(s)
- Wout Bittremieux
  - University of Antwerp, Antwerp, Belgium
  - Biomedical Informatics Network Antwerpen (biomina), University of Antwerp, Antwerp, Belgium
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, San Diego, CA, USA
- Kris Laukens
  - University of Antwerp, Antwerp, Belgium
  - Biomedical Informatics Network Antwerpen (biomina), University of Antwerp, Antwerp, Belgium
2
Zheng F, You L, Zhao X, Lu X, Xu G. Predicting Tandem Mass Spectra of Small Molecules Using Graph Embedding of Precursor-Product Ion Pair Graph. Anal Chem 2024; 96:19190-19195. [PMID: 39575948 DOI: 10.1021/acs.analchem.4c04375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/11/2024]
Abstract
Liquid chromatography-mass spectrometry (LC-MS)-based metabolomics identification relies heavily on high-quality MS/MS data; MS/MS prediction is a good way to address this issue. However, the accuracy and resolution of the prediction, and its correlation with chemical structures, remain unsolved problems. In this study, we have developed an MS/MS prediction method, PPGB-MS2, which transforms MS/MS prediction into fragment intensity prediction. The concept of precursor-product ion pair graph bags (PPGBs) was introduced to represent fragments, achieving a uniform representation of precursor and product ion structures and MS/MS fragmentation information. The chemical structure information is kept before it is incorporated into machine learning models. Due to the PPGB representation, graph neural networks (GNNs) can be utilized to achieve MS/MS fragment intensity prediction. The system was trained and evaluated using [M+H]+ and [M-H]- data acquired by an Agilent QTOF 6530 in the NIST 20 tandem MS database. Results demonstrated that the average cosine similarity is 0.71 in the test set, which is higher than that of classical MS/MS prediction methods. PPGB-MS2 also achieves high-resolution MS/MS prediction due to its effective management of the correspondence between fragments and structures.
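The cosine similarity quoted in this abstract compares a predicted spectrum with a measured one. A minimal sketch of how such a score is commonly computed follows, assuming spectra are lists of (m/z, intensity) pairs that are binned at a fixed width. The function names, the 1.0 m/z bin width, and the example peaks are illustrative assumptions and are not taken from PPGB-MS2.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks that fall in the same m/z bin."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)  # hypothetical fixed-width binning
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine of the angle between two binned intensity vectors (0.0 if either is empty)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up peaks, for illustration only.
    measured = [(91.05, 100.0), (119.08, 45.0), (147.04, 20.0)]
    predicted = [(91.05, 80.0), (119.08, 60.0), (165.07, 10.0)]
    print(round(cosine_similarity(measured, predicted), 3))
```

Averaging this score over all test spectra gives a single summary figure of the kind reported. Real evaluations often add intensity scaling or tolerance-based peak matching, which this sketch omits.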
Affiliation(s)
- Fujian Zheng
  - CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, 457 Zhongshan Road, Dalian 116023, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - Liaoning Province Key Laboratory of Metabolomics, Dalian 116023, China
- Lei You
  - CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, 457 Zhongshan Road, Dalian 116023, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - Liaoning Province Key Laboratory of Metabolomics, Dalian 116023, China
- Xinjie Zhao
  - CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, 457 Zhongshan Road, Dalian 116023, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - Liaoning Province Key Laboratory of Metabolomics, Dalian 116023, China
- Xin Lu
  - CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, 457 Zhongshan Road, Dalian 116023, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - Liaoning Province Key Laboratory of Metabolomics, Dalian 116023, China
- Guowang Xu
  - CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, 457 Zhongshan Road, Dalian 116023, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - Liaoning Province Key Laboratory of Metabolomics, Dalian 116023, China
3
Russo FF, Nowatzky Y, Jaeger C, Parr MK, Benner P, Muth T, Lisec J. Machine learning methods for compound annotation in non-targeted mass spectrometry-A brief overview of fingerprinting, in silico fragmentation and de novo methods. RAPID COMMUNICATIONS IN MASS SPECTROMETRY : RCM 2024; 38:e9876. [PMID: 39180507 DOI: 10.1002/rcm.9876] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2024] [Revised: 07/03/2024] [Accepted: 07/12/2024] [Indexed: 08/26/2024]
Abstract
Non-targeted screenings (NTS) are essential tools in different fields, such as forensics, health and environmental sciences. NTSs often employ mass spectrometry (MS) methods due to their high throughput and sensitivity in comparison to, for example, nuclear magnetic resonance-based methods. Because the identification of mass spectral signals, called annotation, is labour intensive, it has become a target for supporting tools based on machine learning (ML). However, both the diversity of mass spectral signals and the sheer quantity of different ML tools developed for compound annotation make it challenging for researchers to maintain a comprehensive overview of the field. In this work, we illustrate which ML-based methods are available for compound annotation in non-targeted MS experiments and provide a nuanced comparison of the ML models used in MS data analysis, unravelling their unique features and performance metrics. This overview helps researchers apply these tools judiciously in their daily research. This review also offers a detailed exploration of methods and datasets to show gaps in current methods and promising target areas, offering a starting point for developers intending to improve existing methodologies.
Affiliation(s)
- Francesco F Russo
  - Department of Analytical Chemistry and Reference Materials, Organic Trace Analysis and Food Analysis, Bundesanstalt für Materialforschung und -prüfung (BAM), Berlin, Germany
- Yannek Nowatzky
  - eScience, Bundesanstalt für Materialprüfung und -forschung, Berlin, Germany
- Carsten Jaeger
  - Department of Analytical Chemistry and Reference Materials, Environmental Analysis, Bundesanstalt für Materialforschung und -prüfung (BAM), Berlin, Germany
- Maria K Parr
  - Institute of Pharmacy, Pharmaceutical and Medicinal Chemistry (Pharmaceutical Analyses), Freie Universität, Berlin, Germany
- Phillipp Benner
  - eScience, Bundesanstalt für Materialprüfung und -forschung, Berlin, Germany
- Thilo Muth
  - Department MF 2, Domain Specific Data Competence Centre, Robert Koch Institut, Berlin, Germany
- Jan Lisec
  - Department of Analytical Chemistry and Reference Materials, Organic Trace Analysis and Food Analysis, Bundesanstalt für Materialforschung und -prüfung (BAM), Berlin, Germany
4
Beck A, Muhoberac M, Randolph CE, Beveridge CH, Wijewardhane PR, Kenttämaa HI, Chopra G. Recent Developments in Machine Learning for Mass Spectrometry. ACS MEASUREMENT SCIENCE AU 2024; 4:233-246. [PMID: 38910862 PMCID: PMC11191731 DOI: 10.1021/acsmeasuresciau.3c00060] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Received: 10/06/2023] [Revised: 12/27/2023] [Accepted: 01/22/2024] [Indexed: 06/25/2024]
Abstract
Statistical analysis and modeling of mass spectrometry (MS) data have a long and rich history with several modern MS-based applications using statistical and chemometric methods. Recently, machine learning (ML) has experienced a renaissance due to advances in computational hardware and the development of new algorithms for artificial neural networks (ANN) and deep learning architectures. Moreover, recent successes of new ANN and deep learning architectures in several areas of science, engineering, and society have further strengthened the ML field. Importantly, modern ML methods and architectures have enabled new approaches for tasks related to MS that are now widely adopted in several popular MS-based subdisciplines, such as mass spectrometry imaging and proteomics. Herein, we aim to provide an introductory summary of the practical aspects of ML methodology relevant to MS. Additionally, we seek to provide an up-to-date review of the most recent developments in ML integration with MS-based techniques while also providing critical insights into the future direction of the field.
Affiliation(s)
- Armen G. Beck
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Matthew Muhoberac
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Caitlin E. Randolph
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Connor H. Beveridge
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Prageeth R. Wijewardhane
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Hilkka I. Kenttämaa
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Gaurav Chopra
  - Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
  - Department of Computer Science (by courtesy), Purdue University, West Lafayette, Indiana 47907, United States
  - Purdue Institute for Drug Discovery, Purdue Institute for Cancer Research, Regenstrief Center for Healthcare Engineering, Purdue Institute for Inflammation, Immunology and Infectious Disease, Purdue Institute for Integrative Neuroscience, West Lafayette, Indiana 47907 United States
5
Sandström H, Rissanen M, Rousu J, Rinke P. Data-Driven Compound Identification in Atmospheric Mass Spectrometry. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024; 11:e2306235. [PMID: 38095508 PMCID: PMC10885664 DOI: 10.1002/advs.202306235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Revised: 11/04/2023] [Indexed: 02/24/2024]
Abstract
Aerosol particles found in the atmosphere affect the climate and worsen air quality. To mitigate these adverse impacts, aerosol particle formation and aerosol chemistry in the atmosphere need to be better mapped out and understood. Currently, mass spectrometry is the single most important analytical technique in atmospheric chemistry and is used to track and identify compounds and processes. Large amounts of data are collected in each measurement of current time-of-flight and orbitrap mass spectrometers using modern rapid data acquisition practices. However, compound identification remains a major bottleneck during data analysis due to lacking reference libraries and analysis tools. Data-driven compound identification approaches could alleviate the problem, yet remain rare to non-existent in atmospheric science. In this perspective, the authors review the current state of data-driven compound identification with mass spectrometry in atmospheric science and discuss current challenges and possible future steps toward a digital era for atmospheric mass spectrometry.
Affiliation(s)
- Hilda Sandström
  - Department of Applied Physics, Aalto University, P.O. Box 11000, FI-00076, Aalto, Espoo, Finland
- Matti Rissanen
  - Aerosol Physics Laboratory, Tampere University, FI-33720, Tampere, Finland
  - Department of Chemistry, University of Helsinki, P.O. Box 55, A.I. Virtasen aukio 1, FI-00560, Helsinki, Finland
- Juho Rousu
  - Department of Computer Science, Aalto University, P.O. Box 11000, FI-00076, Aalto, Espoo, Finland
- Patrick Rinke
  - Department of Applied Physics, Aalto University, P.O. Box 11000, FI-00076, Aalto, Espoo, Finland
6
Heid E, Greenman KP, Chung Y, Li SC, Graff DE, Vermeire FH, Wu H, Green WH, McGill CJ. Chemprop: A Machine Learning Package for Chemical Property Prediction. J Chem Inf Model 2024; 64:9-17. [PMID: 38147829 PMCID: PMC10777403 DOI: 10.1021/acs.jcim.3c01250] [Citation(s) in RCA: 94] [Impact Index Per Article: 94.0] [Received: 08/08/2023] [Revised: 12/04/2023] [Accepted: 12/05/2023] [Indexed: 12/28/2023]
Abstract
Deep learning has become a powerful and frequently employed tool for the prediction of molecular properties, thus creating a need for open-source and versatile software solutions that can be operated by nonexperts. Among the current approaches, directed message-passing neural networks (D-MPNNs) have proven to perform well on a variety of property prediction tasks. The software package Chemprop implements the D-MPNN architecture and offers simple, easy, and fast access to machine-learned molecular properties. Compared to its initial version, we present a multitude of new Chemprop functionalities such as the support of multimolecule properties, reactions, atom/bond-level properties, and spectra. Further, we incorporate various uncertainty quantification and calibration methods along with related metrics as well as pretraining and transfer learning workflows, improved hyperparameter optimization, and other customization options concerning loss functions or atom/bond features. We benchmark D-MPNN models trained using Chemprop with the new reaction, atom-level, and spectra functionality on a variety of property prediction data sets, including MoleculeNet and SAMPL, and observe state-of-the-art performance on the prediction of water-octanol partition coefficients, reaction barrier heights, atomic partial charges, and absorption spectra. Chemprop enables out-of-the-box training of D-MPNN models for a variety of problem settings in fast, user-friendly, and open-source software.
Affiliation(s)
- Esther Heid
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Institute of Materials Chemistry, TU Wien, 1060 Vienna, Austria
- Kevin P. Greenman
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Yunsie Chung
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Shih-Cheng Li
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Chemical Engineering, National Taiwan University, Taipei 10617, Taiwan
- David E. Graff
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, United States
- Florence H. Vermeire
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Chemical Engineering, KU Leuven, Celestijnenlaan 200F, B-3001 Leuven, Belgium
- Haoyang Wu
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- William H. Green
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Charles J. McGill
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Chemical and Life Science Engineering, Virginia Commonwealth University, Richmond, Virginia 23284, United States
7
Joint structural annotation of small molecules using liquid chromatography retention order and tandem mass spectrometry data. NAT MACH INTELL 2022. [DOI: 10.1038/s42256-022-00577-2] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Indexed: 12/24/2022]
Abstract
Structural annotation of small molecules in biological samples remains a key bottleneck in untargeted metabolomics, despite rapid progress in predictive methods and tools during the past decade. Liquid chromatography–tandem mass spectrometry, one of the most widely used analysis platforms, can detect thousands of molecules in a sample, the vast majority of which remain unidentified even with best-of-class methods. Here we present LC-MS2Struct, a machine learning framework for structural annotation of small-molecule data arising from liquid chromatography–tandem mass spectrometry (LC-MS2) measurements. LC-MS2Struct jointly predicts the annotations for a set of mass spectrometry features in a sample, using a novel structured prediction model trained to optimally combine the output of state-of-the-art MS2 scorers and observed retention orders. We evaluate our method on a dataset covering all publicly available reversed-phase LC-MS2 data in the MassBank reference database, including 4,327 molecules measured using 18 different LC conditions from 16 contributors, greatly expanding the chemical analytical space covered in previous multi-MS2 scorer evaluations. LC-MS2Struct obtains significantly higher annotation accuracy than earlier methods and improves the annotation accuracy of state-of-the-art MS2 scorers by up to 106%. The use of stereochemistry-aware molecular fingerprints improves prediction performance, which highlights limitations in existing approaches and has strong implications for future computational LC-MS2 developments.
8
Ljoncheva M, Stepišnik T, Kosjek T, Džeroski S. Machine learning for identification of silylated derivatives from mass spectra. J Cheminform 2022; 14:62. [PMID: 36109826 PMCID: PMC9476372 DOI: 10.1186/s13321-022-00636-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/13/2022] [Accepted: 07/31/2022] [Indexed: 11/10/2022] Open
Abstract
Motivation
Compound structure identification increasingly relies on sophisticated computational tools, among which machine learning tools are a recent addition that is quickly gaining importance. These tools, of which the method titled Compound Structure Identification:Input Output Kernel Regression (CSI:IOKR) is an excellent example, have been used to elucidate compound structure from mass spectral (MS) data with significant accuracy, confidence and speed. They have, however, largely focused on data coming from liquid chromatography coupled to tandem mass spectrometry (LC–MS).
Gas chromatography coupled to mass spectrometry (GC–MS) is an alternative which offers several advantages as compared to LC–MS, including higher data reproducibility. Of special importance is the substantial compound coverage offered by GC–MS, further expanded by derivatization procedures, such as silylation, which can improve the volatility, thermal stability and chromatographic peak shape of semi-volatile analytes. Despite these advantages and the increasing size of compound databases and MS libraries, GC–MS data have not yet been used by machine learning approaches to compound structure identification.
Results
This study presents a successful application of the CSI:IOKR machine learning method for the identification of environmental contaminants from GC–MS spectra. We use CSI:IOKR as an alternative to exhaustive search of MS libraries, independent of instrumental platform and data processing software. We use a comprehensive dataset of GC–MS spectra of trimethylsilyl derivatives and their molecular structures, derived from a large commercially available MS library, to train a model that maps between spectra and molecular structures. We test the learned model on a different dataset of GC–MS spectra of trimethylsilyl derivatives of environmental contaminants, generated in-house and made publicly available. The results show that 37% (resp. 50%) of the tested compounds are correctly ranked among the top 10 (resp. 20) candidate compounds suggested by the model. Even though spectral comparisons with reference standards or de novo structural elucidations are necessary to validate the predictions, machine learning provides efficient candidate prioritization and reduces the time spent on compound annotation.
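The top-10 and top-20 figures in this abstract are top-k ranking accuracies: the fraction of query spectra whose correct compound appears among the first k candidates returned by the model. A minimal sketch of that metric follows, assuming each query comes with a ranked candidate list and a known correct identifier. The function name and the toy identifiers are hypothetical and are not taken from the CSI:IOKR implementation.

```python
def top_k_accuracy(ranked_candidates, true_ids, k):
    """Fraction of queries whose true identifier is among the first k ranked candidates."""
    if len(ranked_candidates) != len(true_ids):
        raise ValueError("one ranked candidate list is required per query")
    if not true_ids:
        return 0.0
    hits = sum(
        1
        for candidates, truth in zip(ranked_candidates, true_ids)
        if truth in candidates[:k]
    )
    return hits / len(true_ids)


if __name__ == "__main__":
    # Toy data: three query spectra, each with candidates ranked best-first.
    rankings = [
        ["cmpd_A", "cmpd_B", "cmpd_C"],
        ["cmpd_X", "cmpd_Y", "cmpd_Z"],
        ["cmpd_P", "cmpd_Q", "cmpd_R"],
    ]
    truths = ["cmpd_B", "cmpd_Z", "cmpd_M"]  # the third truth is never proposed
    for k in (1, 2, 3):
        print(k, top_k_accuracy(rankings, truths, k))
```

Reporting the metric at several values of k, as the abstract does, shows how much manual inspection a user would still need before reaching the correct structure.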
9
Tian Z, Liu F, Li D, Fernie AR, Chen W. Strategies for structure elucidation of small molecules based on LC–MS/MS data from complex biological samples. Comput Struct Biotechnol J 2022; 20:5085-5097. [PMID: 36187931 PMCID: PMC9489805 DOI: 10.1016/j.csbj.2022.09.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 05/18/2022] [Revised: 09/03/2022] [Accepted: 09/03/2022] [Indexed: 11/06/2022] Open
Abstract
LC–MS/MS is a major analytical platform for metabolomics, which has become a recent hotspot in the research fields of life and environmental sciences. However, structure elucidation of small molecules based on LC–MS/MS data remains a major challenge in the chemical and biological interpretation of untargeted metabolomics datasets. In recent years, several strategies for structure elucidation using LC–MS/MS data from complex biological samples have been proposed. These strategies can be categorized into two types, one based on structure annotation of mass spectra and the other on retention time prediction. They have helped many scientists conduct research in metabolite-related fields and are indispensable for the development of future tools. Here, we summarize the characteristics of the current tools and strategies for structure elucidation of small molecules based on LC–MS/MS data, and discuss directions and perspectives for improving their power.
10
Bach E, Rogers S, Williamson J, Rousu J. Probabilistic framework for integration of mass spectrum and retention time information in small molecule identification. Bioinformatics 2021; 37:1724-1731. [PMID: 33244585 PMCID: PMC8289373 DOI: 10.1093/bioinformatics/btaa998] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/18/2020] [Revised: 10/27/2020] [Accepted: 11/17/2020] [Indexed: 11/14/2022] Open
Abstract
Motivation
Identification of small molecules in a biological sample remains a major bottleneck in molecular biology, despite a decade of rapid development of computational approaches for predicting molecular structures using mass spectrometry (MS) data. Recently, there has been increasing interest in utilizing other information sources, such as liquid chromatography (LC) retention time (RT), to improve identifications solely based on MS information, such as precursor mass-per-charge and tandem mass spectrometry (MS2).
Results
We put forward a probabilistic modelling framework to integrate MS and RT data of multiple features in an LC-MS experiment. We model the MS measurements and all pairwise retention order information as a Markov random field and use efficient approximate inference for scoring and ranking potential molecular structures. Our experiments show improved identification accuracy by combining MS2 data and retention orders using our approach, thereby outperforming state-of-the-art methods. Furthermore, we demonstrate the benefit of our model when only a subset of LC-MS features has MS2 measurements available besides MS1.
Availability and implementation
Software and data are freely available at https://github.com/aalto-ics-kepaco/msms_rt_score_integration.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Eric Bach
  - Department of Computer Science, School of Science, Aalto University, Espoo, Finland
- Simon Rogers
  - School of Computing Science, University of Glasgow, Glasgow, UK
- John Williamson
  - School of Computing Science, University of Glasgow, Glasgow, UK
- Juho Rousu
  - Department of Computer Science, School of Science, Aalto University, Espoo, Finland
11
Krettler CA, Thallinger GG. A map of mass spectrometry-based in silico fragmentation prediction and compound identification in metabolomics. Brief Bioinform 2021; 22:6184408. [PMID: 33758925 DOI: 10.1093/bib/bbab073] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 11/26/2020] [Revised: 01/29/2021] [Accepted: 02/12/2021] [Indexed: 12/27/2022] Open
Abstract
Metabolomics, the comprehensive study of the metabolome, and lipidomics, the large-scale study of pathways and networks of cellular lipids, are major driving forces in enabling personalized medicine. Complicated and error-prone data analysis remains a bottleneck, however, especially for identifying novel metabolites. Comparing experimental mass spectra to curated databases containing reference spectra has been the gold standard for identification of compounds, but constructing such databases is a costly and time-demanding task. Many software applications try to circumvent this process by utilizing cutting-edge advances in computational methods, including quantum chemistry and machine learning, and simulate mass spectra by performing theoretical, so-called in silico fragmentations of compounds. Other solutions concentrate directly on experimental spectra and try to identify structural properties by investigating reoccurring patterns and the relationships between them. The considerable progress made in the field allows recent approaches to provide valuable clues to expedite annotation of experimental mass spectra. This review sheds light on individual strengths and weaknesses of these tools, and attempts to evaluate them, especially in view of lipidomics, when considering complex mixtures found in biological samples as well as mass spectrometer inter-instrument variability.
Affiliation(s)
- Christoph A Krettler
  - Institute of Biomedical Informatics, Graz University of Technology, Stremayrgasse 16/I, 8010, Graz, Austria
  - Omics Center Graz, BioTechMed-Graz, Stiftingtalstrasse 24, 8010, Graz, Austria
- Gerhard G Thallinger
  - Institute of Biomedical Informatics, Graz University of Technology, Stremayrgasse 16/I, 8010, Graz, Austria
  - Omics Center Graz, BioTechMed-Graz, Stiftingtalstrasse 24, 8010, Graz, Austria
12
Perez De Souza L, Alseekh S, Brotman Y, Fernie AR. Network-based strategies in metabolomics data analysis and interpretation: from molecular networking to biological interpretation. Expert Rev Proteomics 2020; 17:243-255. [PMID: 32380880 DOI: 10.1080/14789450.2020.1766975] [Citation(s) in RCA: 84] [Impact Index Per Article: 16.8] [Indexed: 12/28/2022]
Abstract
INTRODUCTION Metabolomics has become a crucial part of systems biology; however, data analysis is still often undertaken in a reductionist way focusing on changes in individual metabolites. Whilst such approaches indeed provide relevant insights into the metabolic phenotype of an organism, the intricate nature of metabolic relationships may be better explored when considering the whole system.
AREAS COVERED This review highlights multiple network strategies that can be applied for metabolomics data analysis from different perspectives, including association networks based on quantitative information, mass spectra similarity networks to assist metabolite annotation, and biochemical networks for systematic data interpretation. We also highlight some relevant insights into metabolic organization obtained through the exploration of such approaches.
EXPERT OPINION Network-based analysis is an established method that allows the identification of non-intuitive metabolic relationships as well as the identification of unknown compounds in mass spectrometry. Additionally, the representation of data from metabolomics within the context of metabolic networks is intuitive and allows for the use of statistical analysis that can better summarize relevant metabolic changes from a systematic perspective.
Collapse
Affiliation(s)
- Leonardo Perez De Souza
  - Department of molecular physiology, Max-Planck-Institute of Molecular Plant Physiology, Potsdam-Golm, Germany
- Saleh Alseekh
  - Department of molecular physiology, Max-Planck-Institute of Molecular Plant Physiology, Potsdam-Golm, Germany; Department of plant metabolomics, Centre of Plant Systems Biology and Biotechnology, Plovdiv, Bulgaria
- Yariv Brotman
  - Department of Life Sciences, Ben-Gurion University of the Negev, Beersheba, Israel
- Alisdair R Fernie
  - Department of molecular physiology, Max-Planck-Institute of Molecular Plant Physiology, Potsdam-Golm, Germany; Department of plant metabolomics, Centre of Plant Systems Biology and Biotechnology, Plovdiv, Bulgaria
13
O'Shea K, Misra BB. Software tools, databases and resources in metabolomics: updates from 2018 to 2019. Metabolomics 2020; 16:36. [PMID: 32146531 DOI: 10.1007/s11306-020-01657-3] [Citation(s) in RCA: 63] [Impact Index Per Article: 12.6] [Received: 12/03/2019] [Accepted: 03/01/2020] [Indexed: 12/24/2022]
Abstract
Metabolomics has evolved as a discipline from a discovery and functional genomics tool, and is now a cornerstone in the era of big data-driven precision medicine. Sample preparation strategies and analytical technologies have seen enormous growth, and keeping pace with data analytics is challenging, to say the least. This review introduces and briefly presents around 100 metabolomics software resources, tools, databases, and other utilities that have surfaced or have improved in 2019. Table 1 provides the computational dependencies of the tools, categorizes the resources based on utility and ease of use, and provides hyperlinks to webpages where the tools can be downloaded or used. This review intends to keep the community of metabolomics researchers up to date with all the software tools, resources, and databases developed in 2019, in one place.
Affiliation(s)
- Keiron O'Shea
  - Institute of Biological, Environmental, and Rural Studies, Aberystwyth University, Ceredigion, Wales, SY23 3DA, UK
- Biswapriya B Misra
  - Center for Precision Medicine, Department of Internal Medicine, Section of Molecular Medicine, Wake Forest School of Medicine, Medical Center Boulevard, Winston-Salem, NC, 27157, USA
14
González-Riano C, Dudzik D, Garcia A, Gil-de-la-Fuente A, Gradillas A, Godzien J, López-Gonzálvez Á, Rey-Stolle F, Rojo D, Ruperez FJ, Saiz J, Barbas C. Recent Developments along the Analytical Process for Metabolomics Workflows. Anal Chem 2019; 92:203-226. [PMID: 31625723 DOI: 10.1021/acs.analchem.9b04553] [Citation(s) in RCA: 67] [Impact Index Per Article: 11.2] [Indexed: 12/14/2022]
Affiliation(s)
- Carolina González-Riano
  - Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
- Danuta Dudzik
  - Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain; Department of Biopharmaceutics and Pharmacodynamics, Faculty of Pharmacy, Medical University of Gdańsk, 80-210 Gdańsk, Poland
- Antonia Garcia
  - Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
- Alberto Gil-de-la-Fuente
  - Department of Information Technology, Escuela Politécnica Superior, Universidad San Pablo-CEU, 28003 Madrid, Spain
- Ana Gradillas
  - Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
- Joanna Godzien
  - Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain; Clinical Research Centre, Medical University of Bialystok, 15-089 Bialystok, Poland
- Ángeles López-Gonzálvez
  - Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
- Fernanda Rey-Stolle
  - Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
- David Rojo
  - Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
- Francisco J Ruperez
  - Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
- Jorge Saiz
  - Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
- Coral Barbas
  - Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
15
Brouard C, Bassé A, d'Alché-Buc F, Rousu J. Improved Small Molecule Identification through Learning Combinations of Kernel Regression Models. Metabolites 2019; 9:E160. [PMID: 31374904 PMCID: PMC6724104 DOI: 10.3390/metabo9080160] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 06/29/2019] [Revised: 07/30/2019] [Accepted: 07/31/2019] [Indexed: 01/15/2023]
Abstract
In small molecule identification from tandem mass (MS/MS) spectra, input-output kernel regression (IOKR) currently provides the state-of-the-art combination of fast training and prediction and high identification rates. The IOKR approach can be simply understood as predicting a fingerprint vector from the MS/MS spectrum of the unknown molecule, and solving a pre-image problem to find the molecule with the most similar fingerprint. In this paper, we bring forward the following improvements to the IOKR framework: firstly, we formulate the IOKRreverse model that can be understood as mapping molecular structures into the MS/MS feature space and solving a pre-image problem to find the molecule whose predicted spectrum is the closest to the input MS/MS spectrum. Secondly, we introduce an approach to combine several IOKR and IOKRreverse models computed from different input and output kernels, called IOKRfusion. The method is based on minimizing structured Hinge loss of the combined model using a mini-batch stochastic subgradient optimization. Our experiments show a consistent improvement of top-k accuracy both in positive and negative ionization mode data.
Affiliation(s)
- Céline Brouard
  - Unité de Mathématiques et Informatique Appliquées de Toulouse, UR 875, INRA, 31326 Castanet Tolosan, France
- Antoine Bassé
  - LTCI, Télécom Paris, Institut Polytechnique de Paris, 75634 Paris, France
- Juho Rousu
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, 00076 Espoo, Finland