1
Chau HYK, Zhang X, Ressom HW. Deep Learning-Based Molecular Fingerprint Prediction for Metabolite Annotation. Metabolites 2025; 15:132. [PMID: 39997757 PMCID: PMC11857613 DOI: 10.3390/metabo15020132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2025] [Revised: 02/07/2025] [Accepted: 02/10/2025] [Indexed: 02/26/2025] Open
Abstract
Background/Objectives: Liquid chromatography coupled with mass spectrometry (LC-MS) is a commonly used platform for many metabolomics studies. However, metabolite annotation has been a major bottleneck in these studies, in part because publicly available spectral libraries are limited and consist of tandem mass spectrometry (MS/MS) data acquired from just a fraction of known compounds. Deep learning methods are increasingly reported as an alternative to spectral matching because of their ability to map complex relationships between molecular fingerprints and mass spectrometric measurements. The objectives of this study are to investigate deep learning methods for molecular fingerprint prediction based on MS/MS spectra and to rank putative metabolite IDs according to the similarity of their known and predicted molecular fingerprints. Methods: We trained three types of deep learning methods to model the relationships between molecular fingerprints and MS/MS spectra. Prior to training, various data processing steps, including scaling, binning, and filtering, were performed on MS/MS spectra obtained from the National Institute of Standards and Technology (NIST), MassBank of North America (MoNA), and the Human Metabolome Database (HMDB). Furthermore, the most relevant m/z bins and molecular fingerprints were selected. The trained deep learning models were evaluated on ranking putative metabolite IDs obtained from a compound database for the challenges in the Critical Assessment of Small Molecule Identification (CASMI) 2016, CASMI 2017, and CASMI 2022 benchmark datasets. Results: Feature selection methods effectively reduced redundant molecular and spectral features prior to model training. Deep learning methods trained with the truncated features performed comparably to CSI:FingerID in ranking putative metabolite IDs. Conclusion: The results demonstrate the promising potential of deep learning methods for metabolite annotation.
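The ranking step described in this abstract can be illustrated with a minimal sketch. It assumes fingerprints are stored as sets of "on" bit indices and that similarity is measured with the Tanimoto coefficient; the abstract does not name the similarity measure, so that choice, the function names, and the toy candidates are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def rank_candidates(
    predicted: Set[int], candidates: Dict[str, Set[int]]
) -> List[Tuple[str, float]]:
    """Sort candidate compounds by similarity to the predicted fingerprint, best first."""
    scored = [(name, tanimoto(predicted, fp)) for name, fp in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical data: a fingerprint predicted from an MS/MS spectrum and
# the known fingerprints of three candidates pulled from a compound database.
predicted_fp = {1, 4, 7, 9}
candidate_fps = {
    "candidate_A": {1, 4, 7, 9, 12},
    "candidate_B": {2, 3, 9},
    "candidate_C": {1, 4, 8},
}

ranking = rank_candidates(predicted_fp, candidate_fps)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In practice the predicted fingerprint would come from the trained model and would usually be a vector of probabilities rather than hard bits, so a thresholding or probabilistic scoring step would precede this ranking.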
Affiliation(s)
- Habtom W. Ressom
- Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, DC 20057, USA; (H.Y.K.C.); (X.Z.)
2
Adduri AK, McNutt AT, Ellington CN, Suraparaju K, Fang N, Yan D, Krummenacher B, Li S, Bodden C, Xing EP, Behsaz B, Koes D, Mohimani H. Interpretable adenylation domain specificity prediction using protein language models. bioRxiv 2025:2025.01.13.632878. [PMID: 39868251 PMCID: PMC11761653 DOI: 10.1101/2025.01.13.632878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2025]
Abstract
Natural products have long been a rich source of diverse and clinically effective drug candidates. Non-ribosomal peptides (NRPs), polyketides (PKs), and NRP-PK hybrids are three classes of natural products that display a broad range of bioactivities, including antibiotic, antifungal, anticancer, and immunosuppressant activities. However, discovering these compounds through traditional bioactivity-guided techniques is costly and time-consuming, often resulting in the rediscovery of known molecules. Consequently, genome mining has emerged as a high-throughput strategy to screen hundreds of thousands of microbial genomes to identify their potential to produce novel natural products. Adenylation domains (A-domains) play a key role in the biosynthesis of NRPs and NRP-PKs by recruiting substrates to incrementally build the final structure. We propose MASPR, a machine learning method that leverages protein language models for accurate and interpretable predictions of A-domain substrate specificities. MASPR demonstrates superior accuracy and generalization over existing methods and is capable of predicting substrates not present in its training data (zero-shot classification). We use MASPR to develop Seq2Hybrid, an efficient algorithm to predict the structure of hybrid NRP-PK natural products from microbial genomes. Using Seq2Hybrid, we propose putative biosynthetic gene clusters for the orphan natural products Octaminomycin A, Dityromycin, SW-163B, and JBIR-39.
Affiliation(s)
- Abhinav K Adduri
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Andrew T McNutt
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Caleb N Ellington
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Krish Suraparaju
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Nan Fang
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Donghui Yan
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Benjamin Krummenacher
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Sitong Li
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Camilla Bodden
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Eric P Xing
- Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
- Department of Machine Learning, Carnegie Mellon University, Pittsburgh, PA, USA
- Bahar Behsaz
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- David Koes
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Hosein Mohimani
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
3
Bui-Thi D, Liu Y, Lippens JL, Laukens K, De Vijlder T. TransExION: a transformer based explainable similarity metric for comparing IONS in tandem mass spectrometry. J Cheminform 2024; 16:61. [PMID: 38807166 PMCID: PMC11134763 DOI: 10.1186/s13321-024-00858-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Accepted: 05/12/2024] [Indexed: 05/30/2024] Open
Abstract
Small molecule identification is a crucial task in analytical chemistry and life sciences. One of the most commonly used technologies to elucidate small molecule structures is mass spectrometry. Spectral library search of product ion spectra (MS/MS) is a popular strategy to identify compounds or find structural analogues. This approach relies on the assumption that spectral similarity and structural similarity are correlated. However, popular spectral similarity measures, usually calculated based on identical fragment matches between the MS/MS spectra, do not always accurately reflect the structural similarity. In this study, we propose TransExION, a Transformer based Explainable similarity metric for IONS. TransExION detects related fragments between MS/MS spectra through their mass difference and uses these to estimate spectral similarity. These related fragments can be nearly identical, but can also share a substructure. TransExION also provides a post-hoc explanation of its estimation, which can be used to support scientists in evaluating the spectral library search results and thus in structure elucidation of unknown molecules. Our model has a Transformer based architecture and is trained on data derived from GNPS MS/MS libraries. The experimental results show that it improves on existing spectral similarity measures in searching and interpreting structural analogues as well as in molecular networking. SCIENTIFIC CONTRIBUTION: We propose a transformer-based spectral similarity metric that improves the comparison of small molecule tandem mass spectra. We provide a post hoc explanation that can serve as a good starting point for unknown spectra annotation based on database spectra.
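The idea of relating fragments through their mass difference can be shown with a small sketch. This is not the TransExION model, which learns the relevant relationships with a Transformer; it is a rule-based illustration that assumes a single, user-supplied mass shift and a fixed m/z tolerance. The function name, the peak lists, and the shift value are hypothetical.

```python
from typing import List, Tuple


def related_fragments(
    query_mz: List[float],
    library_mz: List[float],
    shift: float,
    tol: float = 0.01,
) -> List[Tuple[float, float, str]]:
    """Pair fragments whose m/z values agree directly or after a given mass shift."""
    pairs = []
    for q in query_mz:
        for lib in library_mz:
            diff = q - lib
            if abs(diff) <= tol:
                # Same m/z within tolerance: a nearly identical fragment.
                pairs.append((q, lib, "identical"))
            elif abs(diff - shift) <= tol:
                # Offset by the assumed shift: fragments that may share a substructure.
                pairs.append((q, lib, "shifted"))
    return pairs


# Hypothetical fragment m/z lists for a query spectrum and a library spectrum;
# 14.02 stands in for an assumed mass difference between the two molecules.
query = [91.05, 119.05, 165.07]
library = [91.05, 105.03, 151.05]

pairs = related_fragments(query, library, shift=14.02)
for q, lib, kind in pairs:
    print(f"{q:.2f} <-> {lib:.2f} ({kind})")
```

A conventional similarity score built only on the "identical" pairs would miss the two shifted matches here, which is the gap that mass-difference-aware metrics aim to close.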
Affiliation(s)
- Danh Bui-Thi
- Computer Science Department, University of Antwerp, Middelheimlaan 1, 2020, Antwerp, Belgium
- Youzhong Liu
- Therapeutic Development and Supply, Janssen Pharmaceutica N.V., Turnhoutseweg 30, 2340, Beerse, Belgium
- Jennifer L Lippens
- Therapeutic Development and Supply, Janssen Pharmaceutica N.V., Turnhoutseweg 30, 2340, Beerse, Belgium
- Kris Laukens
- Computer Science Department, University of Antwerp, Middelheimlaan 1, 2020, Antwerp, Belgium
- Thomas De Vijlder
- Therapeutic Development and Supply, Janssen Pharmaceutica N.V., Turnhoutseweg 30, 2340, Beerse, Belgium
4
Heid E, Greenman KP, Chung Y, Li SC, Graff DE, Vermeire FH, Wu H, Green WH, McGill CJ. Chemprop: A Machine Learning Package for Chemical Property Prediction. J Chem Inf Model 2024; 64:9-17. [PMID: 38147829 PMCID: PMC10777403 DOI: 10.1021/acs.jcim.3c01250] [Citation(s) in RCA: 94] [Impact Index Per Article: 94.0] [Received: 08/08/2023] [Revised: 12/04/2023] [Accepted: 12/05/2023] [Indexed: 12/28/2023]
Abstract
Deep learning has become a powerful and frequently employed tool for the prediction of molecular properties, thus creating a need for open-source and versatile software solutions that can be operated by nonexperts. Among the current approaches, directed message-passing neural networks (D-MPNNs) have proven to perform well on a variety of property prediction tasks. The software package Chemprop implements the D-MPNN architecture and offers simple, easy, and fast access to machine-learned molecular properties. Compared to its initial version, we present a multitude of new Chemprop functionalities such as the support of multimolecule properties, reactions, atom/bond-level properties, and spectra. Further, we incorporate various uncertainty quantification and calibration methods along with related metrics as well as pretraining and transfer learning workflows, improved hyperparameter optimization, and other customization options concerning loss functions or atom/bond features. We benchmark D-MPNN models trained using Chemprop with the new reaction, atom-level, and spectra functionality on a variety of property prediction data sets, including MoleculeNet and SAMPL, and observe state-of-the-art performance on the prediction of water-octanol partition coefficients, reaction barrier heights, atomic partial charges, and absorption spectra. Chemprop enables out-of-the-box training of D-MPNN models for a variety of problem settings in fast, user-friendly, and open-source software.
Affiliation(s)
- Esther Heid
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Institute of Materials Chemistry, TU Wien, 1060 Vienna, Austria
- Kevin P. Greenman
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Yunsie Chung
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Shih-Cheng Li
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemical Engineering, National Taiwan University, Taipei 10617, Taiwan
- David E. Graff
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, United States
- Florence H. Vermeire
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemical Engineering, KU Leuven, Celestijnenlaan 200F, B-3001 Leuven, Belgium
- Haoyang Wu
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- William H. Green
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Charles J. McGill
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemical and Life Science Engineering, Virginia Commonwealth University, Richmond, Virginia 23284, United States
5
Rutz A, Wolfender JL. Automated Composition Assessment of Natural Extracts: Untargeted Mass Spectrometry-Based Metabolite Profiling Integrating Semiquantitative Detection. J Agric Food Chem 2023; 71:18010-18023. [PMID: 37949451 PMCID: PMC10683005 DOI: 10.1021/acs.jafc.3c03099] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/12/2023] [Revised: 09/19/2023] [Accepted: 09/22/2023] [Indexed: 11/12/2023]
Abstract
Recent developments in mass spectrometry-based metabolite profiling allow unprecedented qualitative coverage of complex biological extract composition. However, the electrospray ionization used in metabolite profiling generates multiple artifactual signals for a single analyte. This leads to thousands of signals per analysis without satisfactory means of filtering those corresponding to abundant constituents. Generic approaches are therefore needed for the qualitative and quantitative annotation of a broad range of relevant constituents. For this, we used an analytical platform combining liquid chromatography-mass spectrometry (LC-MS) with Charged Aerosol Detection (CAD). We established a generic metabolite profiling for the concomitant recording of qualitative MS data and semiquantitative CAD profiles. The MS features (recorded in high-resolution tandem MS) are grouped and annotated using state-of-the-art tools. To efficiently attribute features to their corresponding extracted and integrated CAD peaks, a custom signal pretreatment and peak-shape comparison workflow is built. This strategy allows us to automatically contextualize features at both major and minor metabolome levels, together with a detailed reporting of their annotation including relevant orthogonal information (taxonomy, retention time). Signals not attributed to CAD peaks are considered minor metabolites. Results are illustrated on an ethanolic extract of Swertia chirayita (Roxb.) H. Karst., a bitter plant of industrial interest, exhibiting the typical complexity of plant extracts as a proof of concept. This generic qualitative and quantitative approach paves the way to automatically assess the composition of single natural extracts of interest or broader collections, thus facilitating new ingredient registrations or natural-extracts-based drug discovery campaigns.
Affiliation(s)
- Adriano Rutz
- School of Pharmaceutical Sciences, University of Geneva, 1211 Geneva, Switzerland
- Institute of Pharmaceutical Sciences of Western Switzerland, University of Geneva, 1211 Geneva, Switzerland
- Institute of Molecular Systems Biology, ETH Zürich, 8093 Zürich, Switzerland
- Jean-Luc Wolfender
- School of Pharmaceutical Sciences, University of Geneva, 1211 Geneva, Switzerland
- Institute of Pharmaceutical Sciences of Western Switzerland, University of Geneva, 1211 Geneva, Switzerland