1
Making 1H-1H Couplings More Accessible and Accurate with Selective 2DJ NMR Experiments Aided by 13C Satellites. Anal Chem 2024; 96:7056-7064. [PMID: 38666447 DOI: 10.1021/acs.analchem.4c00315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
1H-1H coupling constants are one of the primary sources of information for nuclear magnetic resonance (NMR) structural analysis. Several selective 2DJ experiments have been proposed that allow for their individual measurement at pure shift resolution. However, all of these experiments fail in the not uncommon case when coupled protons have very close chemical shifts. First, the coupling between protons with overlapping multiplets is inaccessible due to the inability of a frequency-selective pulse to invert just one of them. Second, the strong coupling condition affects the accuracy of coupling measurements involving third spins. These shortcomings impose a limit on the effectiveness of state-of-the-art experiments, such as G-SERF or PSYCHEDELIC. Here, we introduce two new and complementary selective 2DJ experiments that we coin SERFBIRD and SATASERF. These experiments overcome the aforementioned issues by utilizing the 13C satellite signals at natural isotope abundance, which resolves the chemical shift degeneracy. We demonstrate the utility of these experiments on the tetrasaccharide stachyose and the challenging case of norcamphor, for the latter achieving measurement of all JHH couplings, while only a few were accessible with PSYCHEDELIC. The new experiments are applicable to any organic compound and will prove valuable for configurational and conformational analyses.
2
From NMR to AI: Designing a Novel Chemical Representation to Enhance Machine Learning Predictions of Physicochemical Properties. J Chem Inf Model 2024; 64:3302-3321. [PMID: 38529877 DOI: 10.1021/acs.jcim.3c02039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/27/2024]
Abstract
A novel approach to the utilization of nuclear magnetic resonance (NMR) spectroscopy data in the prediction of logD through machine learning algorithms is shown. In the analysis, a data set of 754 chemical compounds, organized into 30 clusters, was evaluated using advanced machine learning models, such as Support Vector Regression (SVR), Gradient Boosting, and AdaBoost, and comprehensive validation and testing methods were employed, including 10-fold cross-validation, bootstrapping, and leave-one-out. The study revealed the superior performance of the Bucket Integration method for dimensionality reduction, consistently yielding the lowest root mean square error (RMSE) across all data sets and normalization schemes. The SVR prediction models demonstrated remarkable computational efficiency and low cost, with the best RMSE value reaching 0.66. Our best model outperformed existing tools like JChem Suite's logD Predictor (0.91) and CplogD (1.27), and a comparison with traditional molecular representations yielded a comparable RMSE (0.50), emphasizing the robustness of our NMR data integration. The widespread availability of NMR data in pharmaceutical and industrial research presents an untapped resource for predictive modeling, highlighting the need for accessible methodologies like ours that complement the analytical toolbox beyond conventional 2D approaches. Our approach, designed to leverage the rich spatial data from NMR spectroscopy, provides additional insights and enriches drug discovery and computational chemistry with a freely accessible tool.
3
Polymorph Identification for Flexible Molecules: Linear Regression Analysis of Experimental and Calculated Solution- and Solid-State NMR Data. J Phys Chem A 2024; 128:1793-1816. [PMID: 38427685 PMCID: PMC10945485 DOI: 10.1021/acs.jpca.3c07732] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2023] [Revised: 02/06/2024] [Accepted: 02/07/2024] [Indexed: 03/03/2024]
Abstract
The Δδ regression approach of Blade et al. [J. Phys. Chem. A 2020, 124(43), 8959-8977] for accurately discriminating between solid forms using a combination of experimental solution- and solid-state NMR data with density functional theory (DFT) calculation is here extended to molecules with multiple conformational degrees of freedom, using furosemide polymorphs as an exemplar. As before, the differences in measured 1H and 13C chemical shifts between solution-state NMR and solid-state magic-angle spinning (MAS) NMR (Δδexperimental) are compared to those determined by gauge-including projector augmented wave (GIPAW) calculations (Δδcalculated) by regression analysis and a t-test, allowing the correct furosemide polymorph to be precisely identified. Monte Carlo random sampling is used to calculate solution-state NMR chemical shifts, reducing computation times by avoiding the need to systematically sample the multidimensional conformational landscape that furosemide occupies in solution. The solvent conditions should be chosen to match the molecule's charge state between the solution and solid states. The Δδ regression approach indicates whether or not correlations between Δδexperimental and Δδcalculated are statistically significant; the approach is differently sensitive to the popular root mean squared error (RMSE) method, being shown to exhibit a much greater dynamic range. An alternative method for estimating solution-state NMR chemical shifts by approximating the measured solution-state dynamic 3D behavior with an ensemble of 54 furosemide crystal structures (polymorphs and cocrystals) from the Cambridge Structural Database (CSD) was also successful in this case, suggesting new avenues for this method that may overcome its current dependency on the prior determination of solution dynamic 3D structures.
4
A Very Deep Graph Convolutional Network for 13C NMR Chemical Shift Calculations with Density Functional Theory Level Performance for Structure Assignment. JOURNAL OF NATURAL PRODUCTS 2024. [PMID: 38359467 DOI: 10.1021/acs.jnatprod.3c00862] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/17/2024]
Abstract
Nuclear magnetic resonance (NMR) chemical shift calculations are powerful tools for structure elucidation and have been extensively employed in both natural product and synthetic chemistry. However, density functional theory (DFT) NMR chemical shift calculations are usually time-consuming, while fast data-driven methods often lack reliability, making it challenging to apply them to computationally intensive tasks with a high requirement on quality. Herein, we have constructed a 54-layer-deep graph convolutional network for 13C NMR chemical shift calculations, which achieved high accuracy with low time-cost and performed competitively with DFT NMR chemical shift calculations on structure assignment benchmarks. Our model utilizes a semiempirical method, GFN2-xTB, and is compatible with a broad variety of organic systems, including those composed of hundreds of atoms or elements ranging from H to Rn. We used this model to resolve the controversial J/K ring junction problem of maitotoxin, which is the largest whole molecule assigned by NMR calculations to date. This model has been developed into user-friendly software, providing a useful tool for routine rapid structure validation and assignation as well as a new approach to elucidate the large structures that were previously unsuitable for NMR calculations.
5
Frontiers of molecular crystal structure prediction for pharmaceuticals and functional organic materials. Chem Sci 2023; 14:13290-13312. [PMID: 38033897 PMCID: PMC10685338 DOI: 10.1039/d3sc03903j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Accepted: 11/02/2023] [Indexed: 12/02/2023]
Abstract
The reliability of organic molecular crystal structure prediction has improved tremendously in recent years. Crystal structure predictions for small, mostly rigid molecules are quickly becoming routine. Structure predictions for larger, highly flexible molecules are more challenging, but their crystal structures can also now be predicted with increasing rates of success. These advances are ushering in a new era where crystal structure prediction drives the experimental discovery of new solid forms. After briefly discussing the computational methods that enable successful crystal structure prediction, this perspective presents case studies from the literature that demonstrate how state-of-the-art crystal structure prediction can transform how scientists approach problems involving the organic solid state. Applications to pharmaceuticals, porous organic materials, photomechanical crystals, organic semi-conductors, and nuclear magnetic resonance crystallography are included. Finally, efforts to improve our understanding of which predicted crystal structures can actually be produced experimentally and other outstanding challenges are discussed.
6
NMR shift prediction from small data quantities. J Cheminform 2023; 15:114. [PMID: 38012793 PMCID: PMC10683292 DOI: 10.1186/s13321-023-00785-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2023] [Accepted: 11/16/2023] [Indexed: 11/29/2023]
Abstract
Prediction of chemical shift in NMR using machine learning methods is typically done with the maximum amount of data available to achieve the best results. In some cases, such large amounts of data are not available, e.g. for heteronuclei. We demonstrate a novel machine learning model that is able to achieve better results than other models for relevant datasets with comparatively low amounts of data. We show this by predicting [Formula: see text] and [Formula: see text] NMR chemical shifts of small molecules in specific solvents.
7
Molecular Graph-Based Deep Learning Algorithm Facilitates an Imaging-Based Strategy for Rapid Discovery of Small Molecules Modulating Biomolecular Condensates. J Med Chem 2023; 66:15084-15093. [PMID: 37937963 PMCID: PMC10810226 DOI: 10.1021/acs.jmedchem.3c00490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2023]
Abstract
Biomolecular condensates are proposed to cause diseases, such as cancer and neurodegeneration, by concentrating proteins at abnormal subcellular loci. Imaging-based compound screens have been used to identify small molecules that reverse or promote biomolecular condensates. However, limitations of conventional imaging-based methods restrict the screening scale. Here, we used a graph convolutional network (GCN)-based computational approach and identified small molecule candidates that reduce the nuclear liquid-liquid phase separation of TAR DNA-binding protein 43 (TDP-43), an essential protein that undergoes phase transition in neurodegenerative diseases. We demonstrated that the GCN-based deep learning algorithm is suitable for spatial information extraction from the molecular graph. Thus, this is a promising method to identify small molecule candidates with novel scaffolds. Furthermore, we validated that these candidates do not affect the normal splicing function of TDP-43. Taken together, a combination of an imaging-based screen and a GCN-based deep learning method dramatically improves the speed and accuracy of the compound screen for biomolecular condensates.
8
Machine learning-assisted structure annotation of natural products based on MS and NMR data. Nat Prod Rep 2023; 40:1735-1753. [PMID: 37519196 DOI: 10.1039/d3np00025g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/01/2023]
Abstract
Covering: up to March 2023. Machine learning (ML) has emerged as a popular tool for analyzing the structures of natural products (NPs). This review presents a summary of the recent advancements in ML-assisted mass spectrometry (MS) and nuclear magnetic resonance (NMR) data analysis to establish the chemical structures of NPs. First, ML-based MS/MS analyses that rely on library matching are discussed, which involve the utilization of ML algorithms to calculate similarity, predict the MS/MS fragments, and form molecular fingerprints. Then, ML-assisted MS/MS structural annotation without library matching is reviewed. Furthermore, the cases of ML algorithms in assisting structural studies of NPs based on NMR are discussed from four perspectives: NMR prediction, functional group identification, structural categorization and quantum chemical calculation. Finally, the review concludes with a discussion of the challenges and the trends associated with the structural establishment of NPs based on ML algorithms.
9
Deep-Learning-Based Mixture Identification for Nuclear Magnetic Resonance Spectroscopy Applied to Plant Flavors. Molecules 2023; 28:7380. [PMID: 37959799 PMCID: PMC10648966 DOI: 10.3390/molecules28217380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2023] [Revised: 10/25/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023]
Abstract
Nuclear magnetic resonance (NMR) is a crucial technique for analyzing mixtures consisting of small molecules, providing non-destructive, fast, reproducible, and unbiased benefits. However, it is challenging to perform mixture identification because of the offset of chemical shifts and peak overlaps that often exist in mixtures such as plant flavors. Here, we propose a deep-learning-based mixture identification method (DeepMID) that can be used to identify plant flavors (mixtures) in a formulated flavor (mixture consisting of several plant flavors) without the need to know the specific components in the plant flavors. A pseudo-Siamese convolutional neural network (pSCNN) and a spatial pyramid pooling (SPP) layer were used to solve the problems due to their high accuracy and robustness. The DeepMID model is trained, validated, and tested on an augmented data set containing 50,000 pairs of formulated and plant flavors. We demonstrate that DeepMID can achieve excellent prediction results in the augmented test set: ACC = 99.58%, TPR = 99.48%, FPR = 0.32%; and two experimentally obtained data sets: one shows ACC = 97.60%, TPR = 92.81%, FPR = 0.78% and the other shows ACC = 92.31%, TPR = 80.00%, FPR = 0.00%. In conclusion, DeepMID is a reliable method for identifying plant flavors in formulated flavors based on NMR spectroscopy, which can assist researchers in accelerating the design of flavor formulations.
10
Rapid prediction of full spin systems using uncertainty-aware machine learning. Chem Sci 2023; 14:10902-10913. [PMID: 37829025 PMCID: PMC10566464 DOI: 10.1039/d3sc01930f] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Accepted: 09/15/2023] [Indexed: 10/14/2023]
Abstract
Accurate simulation of solution NMR spectra requires knowledge of all chemical shift and scalar coupling parameters, traditionally accomplished by heuristic-based techniques or ab initio computational chemistry methods. Here we present a novel machine learning technique which combines uncertainty-aware deep learning with rapid estimates of conformational geometries to generate Full Spin System Predictions with UnCertainty (FullSSPrUCe). We improve on previous state of the art in accuracy on chemical shift values, predicting protons to within 0.209 ppm and carbons to within 1.213 ppm. Further, we are able to predict all scalar coupling values, unlike previous GNN models, achieving 3JHH accuracies between 0.838 Hz and 1.392 Hz on small experimental datasets. Our uncertainty quantification shows a strong, useful correlation with accuracy, with the most confident predictions having significantly reduced error, including our top-80% most confident proton shift predictions having an average error of only 0.140 ppm. We also properly handle stereoisomerism and intelligently augment experimental data with ab initio data through disagreement regularization to account for deficiencies in training data.
11
Disorder and Oxide Ion Diffusion Mechanism in La1.54Sr0.46Ga3O7.27 Melilite from Nuclear Magnetic Resonance. J Am Chem Soc 2023; 145:21817-21831. [PMID: 37782307 PMCID: PMC10571088 DOI: 10.1021/jacs.3c04821] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Indexed: 10/03/2023]
Abstract
Layered tetrahedral network melilite is a promising structural family of fast ion conductors that exhibits the flexibility required to accommodate interstitial oxide anions, leading to excellent ionic transport properties at moderate temperatures. Here, we present a combined experimental and computational magic angle spinning (MAS) nuclear magnetic resonance (NMR) approach which aims at elucidating the local configurational disorder and oxide ion diffusion mechanism in a key member of this structural family possessing the La1.54Sr0.46Ga3O7.27 composition. 17O and 71Ga MAS NMR spectra display complex spectral line shapes that could be accurately predicted using a computational ensemble-based approach to model site disorder across multiple cationic and anionic sites, thereby enabling the assignment of bridging/nonbridging oxygens and the identification of distinct gallium coordination environments. The 17O and 71Ga MAS NMR spectra of La1.54Sr0.46Ga3O7.27 display additional features not observed for the parent LaSrGa3O7 phase, which are attributed to interstitial oxide ions incorporated upon cation doping and stabilized by the formation of five-coordinate Ga centers conferring framework flexibility. 17O high-temperature (HT) MAS NMR experiments capture exchange within the bridging oxygens at 130 °C and reveal coalescence of all oxygen signals in La1.54Sr0.46Ga3O7.27 at approximately 300 °C, indicative of the participation of both interstitial and framework oxide ions in the transport process. These results, further supported by the coalescence of the 71Ga resonances in the 71Ga HT MAS NMR spectra of La1.54Sr0.46Ga3O7.27, unequivocally provide evidence of the conduction mechanism in this melilite phase and highlight the potential of MAS NMR spectroscopy to enhance the understanding of ionic motion in solid electrolytes.
12
MIM-ML: A Novel Quantum Chemical Fragment-Based Random Forest Model for Accurate Prediction of NMR Chemical Shifts of Nucleic Acids. J Chem Theory Comput 2023; 19:6632-6642. [PMID: 37703522 DOI: 10.1021/acs.jctc.3c00563] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 09/15/2023]
Abstract
We developed a random forest machine learning (ML) model for the prediction of 1H and 13C NMR chemical shifts of nucleic acids. Our ML model is trained entirely on reproducing computed chemical shifts obtained previously on 10 nucleic acids using a Molecules-in-Molecules (MIM) fragment-based density functional theory (DFT) protocol including microsolvation effects. Our ML model includes structural descriptors as well as electronic descriptors from an inexpensive low-level semiempirical calculation (GFN2-xTB) and is trained on a relatively small number of DFT chemical shifts (2080 1H chemical shifts and 1780 13C chemical shifts on the 10 nucleic acids). The ML model is then used to make chemical shift predictions on 8 new nucleic acids ranging in size from 600 to 900 atoms, and these predictions are compared directly to experimental data. Though no experimental data was used in the training, the performance of our model is excellent (mean absolute deviation of 0.34 ppm for 1H chemical shifts and 2.52 ppm for 13C chemical shifts for the test set), despite the test set containing some nonstandard structures. A simple analysis suggests that both structural and electronic descriptors are critical for achieving reliable predictions. This is the first attempt to combine ML with fragment-based DFT calculations to predict experimental chemical shifts accurately, making the MIM-ML model a valuable tool for NMR predictions of nucleic acids.
13
Coarse-grained versus fully atomistic machine learning for zeolitic imidazolate frameworks. Chem Commun (Camb) 2023; 59:11405-11408. [PMID: 37668310 PMCID: PMC10513772 DOI: 10.1039/d3cc02265j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Accepted: 08/22/2023] [Indexed: 09/06/2023]
Abstract
Zeolitic imidazolate frameworks are widely thought of as being analogous to inorganic AB2 phases. We test the validity of this assumption by comparing simplified and fully atomistic machine-learning models for local environments in ZIFs. Our work addresses the central question to what extent chemical information can be "coarse-grained" in hybrid framework materials.
14
Atomic-level structure determination of amorphous molecular solids by NMR. Nat Commun 2023; 14:5138. [PMID: 37612269 PMCID: PMC10447443 DOI: 10.1038/s41467-023-40853-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 05/08/2023] [Accepted: 08/10/2023] [Indexed: 08/25/2023]
Abstract
Structure determination of amorphous materials remains challenging, owing to the disorder inherent to these materials. Nuclear magnetic resonance (NMR) powder crystallography is a powerful method to determine the structure of molecular solids, but disorder leads to a high degree of overlap between measured signals, and prevents the unambiguous identification of a single modeled periodic structure as representative of the whole material. Here, we determine the atomic-level ensemble structure of the amorphous form of the drug AZD4625 by combining solid-state NMR experiments with molecular dynamics (MD) simulations and machine-learned chemical shifts. By considering the combined shifts of all 1H and 13C atomic sites in the molecule, we determine the structure of the amorphous form by identifying an ensemble of local molecular environments that are in agreement with experiment. We then extract and analyze preferred conformations and intermolecular interactions in the amorphous sample in terms of the stabilization of the amorphous form of the drug.
15
Abstract
NMR spectroscopy undoubtedly plays a central role in determining molecular structures across different chemical disciplines, and the accurate computational prediction of NMR parameters is highly desirable. In this work, a new Δ-machine learning approach is presented to correct DFT-computed NMR chemical shifts using input features from the calculation and in addition highly accurate reference data at the CCSD(T)/pcSseg-2 level of theory with a basis set extrapolation scheme. The model is trained on a data set containing 1000 optimized and geometrically distorted structures of small organic molecules comprising most elements of the first three periods and containing data for 7090 1H and 4230 13C NMR chemical shifts. Applied to the PBE0/pcSseg-2 method, the mean absolute deviation (MAD) on the internal NMR shift test set is reduced by 81% for 1H and 92% for 13C at virtually no additional computational cost. For 12 different DFT functional and basis set combinations, the MAD of the ML-corrected NMR shifts ranges from 0.021 to 0.039 ppm (1H) and from 0.38 to 1.07 ppm (13C). Importantly, the new method consistently outperforms the simple and widely used linear regression correction technique. This behavior is reproduced on three different external benchmark sets, confirming the generality and robustness of the correction scheme, which can easily be applied in DFT-based spectral simulations.
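For orientation, the "simple and widely used linear regression correction technique" that this abstract uses as its baseline can be sketched as an ordinary least-squares fit of reference shifts against computed shifts. This is a minimal illustration and not the Δ-ML model of the paper; the function names and all five shift values below are hypothetical and chosen only to make the example run.

```python
def fit_linear_correction(calc, ref):
    """Ordinary least-squares fit of ref ≈ slope * calc + intercept."""
    n = len(calc)
    mean_x = sum(calc) / n
    mean_y = sum(ref) / n
    sxx = sum((x - mean_x) ** 2 for x in calc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calc, ref))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_absolute_deviation(pred, ref):
    """Mean absolute deviation (MAD) between predicted and reference shifts."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


# Hypothetical 1H shifts in ppm (illustrative values, not from the paper).
calc_shifts = [1.10, 2.35, 3.80, 5.45, 7.60]  # e.g. raw DFT-computed shifts
ref_shifts = [0.95, 2.10, 3.45, 5.00, 7.00]   # e.g. high-level reference shifts

slope, intercept = fit_linear_correction(calc_shifts, ref_shifts)
corrected = [slope * x + intercept for x in calc_shifts]

mad_raw = mean_absolute_deviation(calc_shifts, ref_shifts)
mad_corrected = mean_absolute_deviation(corrected, ref_shifts)
print(f"MAD raw: {mad_raw:.3f} ppm, MAD after linear correction: {mad_corrected:.3f} ppm")
```

A linear correction of this kind can only remove systematic scaling and offset errors; the abstract reports that the Δ-ML correction consistently outperforms it.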
16
Predicting 195Pt NMR Chemical Shifts in Water-Soluble Inorganic/Organometallic Complexes with a Fast and Simple Protocol Combining Semiempirical Modeling and Machine Learning. Chemphyschem 2023:e202200940. [PMID: 36806426 DOI: 10.1002/cphc.202200940] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2022] [Revised: 02/20/2023] [Accepted: 02/20/2023] [Indexed: 02/23/2023]
Abstract
Water-soluble Pt complexes are key components in medicinal chemistry and catalysis. The well-known cisplatin family of anticancer drugs and industrial hydrosilylation catalysts are two leading examples. On the molecular level, the activity mechanisms of such complexes mostly involve changes in the Pt coordination sphere. Using 195Pt NMR spectroscopy for operando monitoring would be a valuable tool for uncovering the activity mechanisms; however, reliable approaches for the rapid correlation of Pt complex structure with 195Pt chemical shifts are very challenging and not available for everyday research practice. While NMR shielding is a response property, molecular 3D structure determines NMR spectra, as widely known, which allows us to build up 3D structure to 195Pt chemical shift correlations. Accordingly, we present a new workflow for the determination of lowest-energy configurational/conformational isomers based on the GFN2-xTB semiempirical method and prediction of corresponding chemical shifts with a Machine Learning (ML) model tuned for Pt complexes. The workflow was designed for the prediction of 195Pt chemical shifts of water-soluble Pt(II) and Pt(IV) anionic, neutral, and cationic complexes with halide, NO2-, (di)amino, and (di)carboxylate ligands with chemical shift values ranging from -6293 to 7090 ppm. The model offered an accuracy (normalized root-mean-square deviation/RMSD) of 1.08%/145.02 ppm on the held-out test set.
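The two accuracy figures quoted above can be related by a short calculation. The sketch below assumes that the normalized RMSD is the RMSD divided by the full chemical shift range stated in the abstract (-6293 to 7090 ppm); the abstract does not spell out the normalization, and the function name is hypothetical, but this assumption reproduces the quoted percentage.

```python
def normalized_rmsd_percent(rmsd_ppm, shift_min_ppm, shift_max_ppm):
    """Express an RMSD as a percentage of the chemical shift range.

    Assumption: normalization by (max - min) of the shift range.
    """
    return 100.0 * rmsd_ppm / (shift_max_ppm - shift_min_ppm)


# Values taken from the abstract: RMSD 145.02 ppm, range -6293 to 7090 ppm.
nrmsd = normalized_rmsd_percent(145.02, -6293.0, 7090.0)
print(f"{nrmsd:.2f}%")  # prints 1.08%
```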
17
Abstract
Glycans, carbohydrate molecules in the realm of biology, are present as biomedically important glycoconjugates, and a characteristic aspect is that their structures in many instances are branched. In determining the primary structure of a glycan, the sugar components including the absolute configuration and ring form, anomeric configuration, linkage(s), sequence, and substituents should be elucidated. Solution state NMR spectroscopy offers a unique opportunity to resolve all these aspects at atomic resolution. During the last two decades, advances in both NMR experiments and spectrometer hardware have made it possible to unravel carbohydrate structure more efficiently. These developments applicable to glycans include, inter alia, NMR experiments that reduce spectral overlap, use selective excitations, record tilted projections of multidimensional spectra, acquire spectra by multiple receivers, utilize polarization by fast-pulsing techniques, concatenate pulse-sequence modules to acquire several spectra in a single measurement, acquire pure shift correlated spectra devoid of scalar couplings, employ stable isotope labeling to efficiently obtain homo- and/or heteronuclear correlations, as well as those that rely on dipolar cross-correlated interactions for sequential information. Refined computer programs for NMR spin simulation and chemical shift prediction aid the structural elucidation of glycans, which are notorious for their limited spectral dispersion. Hardware developments include cryogenically cold probes and dynamic nuclear polarization techniques, both resulting in enhanced sensitivity, as well as ultrahigh field NMR spectrometers with a 1H NMR resonance frequency higher than 1 GHz, thus improving resolution of resonances. Taken together, the developments have made and will in the future make it possible to elucidate carbohydrate structure in great detail, thereby forming the basis for understanding of how glycans interact with other molecules.
18
May the Force (Field) Be with You: On the Importance of Conformational Searches in the Prediction of NMR Chemical Shifts. Mar Drugs 2022; 20:699. [PMID: 36355022 PMCID: PMC9694776 DOI: 10.3390/md20110699] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/18/2022] [Revised: 10/29/2022] [Accepted: 11/04/2022] [Indexed: 09/21/2023]
Abstract
NMR data prediction is increasingly important in structure elucidation. The impact of force field selection was assessed, along with geometry and energy cutoffs. Based on the conclusions, we propose a new approach named mix-J-DP4, which provides a remarkable increase in the confidence level of complex stereochemical assignments (100% in our molecular test set) with a very modest increment in computational cost.
19
Prediction of 15N chemical shifts by machine learning. MAGNETIC RESONANCE IN CHEMISTRY : MRC 2022; 60:1087-1092. [PMID: 34407565 DOI: 10.1002/mrc.5208] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/29/2021] [Revised: 06/16/2021] [Accepted: 08/13/2021] [Indexed: 06/13/2023]
Abstract
We demonstrate the potential for machine learning systems to predict three-dimensional (3D)-relevant NMR properties beyond traditional 1H- and 13C-based data, with comparable accuracy to density functional theory (DFT) (but orders of magnitude faster). Predictions of DFT-calculated 15N chemical shifts for 3D molecular structures can be achieved using a machine learning system, IMPRESSION (Intelligent Machine PREdiction of Shift and Scalar information Of Nuclei), with an accuracy of 6.12-ppm mean absolute error (∼1% of the δ15N chemical shift range) and an error of less than 20 ppm for 95% of the chemical shifts. It provides less accurate raw predictions of experimental chemical shifts, due to the limited size and chemical space diversity of the training dataset used in its creation, coupled with the limitations of the underlying DFT methodology in reproducing experiment.
20
Prediction of chemical shift in NMR: A review. MAGNETIC RESONANCE IN CHEMISTRY : MRC 2022; 60:1021-1031. [PMID: 34787335 DOI: 10.1002/mrc.5234] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 09/07/2021] [Revised: 11/10/2021] [Accepted: 11/11/2021] [Indexed: 06/13/2023]
Abstract
Calculation of solution-state NMR parameters, including chemical shift values and scalar coupling constants, is often a crucial step for unambiguous structure assignment. Data-driven (sometimes called empirical) methods leverage databases of known parameter values to estimate parameters for unknown or novel molecules. This is in contrast to popular ab initio techniques that use detailed quantum computational chemistry calculations to arrive at parameter estimates. Data-driven methods have the potential to be considerably faster than ab initio techniques and have been the subject of renewed interest over the past decade with the rise of high-quality databases of NMR parameters and novel machine learning methods. Here, we review these methods, their strengths and pitfalls, and the databases they are built on.
|
21
|
A Machine Learning Model of Chemical Shifts for Chemically and Structurally Diverse Molecular Solids. THE JOURNAL OF PHYSICAL CHEMISTRY. C, NANOMATERIALS AND INTERFACES 2022; 126:16710-16720. [PMID: 36237276 PMCID: PMC9549463 DOI: 10.1021/acs.jpcc.2c03854] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/03/2022] [Revised: 08/24/2022] [Indexed: 06/16/2023]
Abstract
Nuclear magnetic resonance (NMR) chemical shifts are a direct probe of local atomic environments and can be used to determine the structure of solid materials. However, the substantial computational cost required to predict accurate chemical shifts is a key bottleneck for NMR crystallography. We recently introduced ShiftML, a machine-learning model of chemical shifts in molecular solids, trained on minimum-energy geometries of materials composed of C, H, N, O, and S that provides rapid chemical shift predictions with density functional theory (DFT) accuracy. Here, we extend the capabilities of ShiftML to predict chemical shifts for both finite temperature structures and more chemically diverse compounds, while retaining the same speed and accuracy. For a benchmark set of 13 molecular solids, we find a root-mean-squared error of 0.47 ppm with respect to experiment for 1H shift predictions (compared to 0.35 ppm for explicit DFT calculations), while reducing the computational cost by over four orders of magnitude.
|
22
|
Deep Learning-Based Method for Compound Identification in NMR Spectra of Mixtures. MOLECULES (BASEL, SWITZERLAND) 2022; 27:molecules27123653. [PMID: 35744782 PMCID: PMC9227391 DOI: 10.3390/molecules27123653] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/18/2022] [Revised: 06/03/2022] [Accepted: 06/05/2022] [Indexed: 11/16/2022]
Abstract
Nuclear magnetic resonance (NMR) spectroscopy is highly unbiased and reproducible, which makes it a powerful tool for analyzing mixtures of small molecules. However, compound identification in NMR spectra of mixtures is highly challenging because of chemical shift variations of the same compound in different mixtures and peak overlap among molecules. Here, we present a pseudo-Siamese convolutional neural network method (pSCNN) to identify compounds in mixtures for NMR spectroscopy. A data augmentation method was implemented for the superposition of several NMR spectra sampled from a spectral database with random noise. The augmented dataset was split and used to train, validate and test the pSCNN model. Two experimental NMR datasets (flavor mixtures and additional flavor mixture) were acquired to benchmark its performance in real applications. The results show that the proposed method can achieve good performance in the augmented test set (ACC = 99.80%, TPR = 99.70% and FPR = 0.10%), the flavor mixtures dataset (ACC = 97.62%, TPR = 96.44% and FPR = 2.29%) and the additional flavor mixture dataset (ACC = 91.67%, TPR = 100.00% and FPR = 10.53%). We have demonstrated that the translational invariance of convolutional neural networks can solve the chemical shift variation problem in NMR spectra. In summary, pSCNN is an off-the-shelf method to identify compounds in mixtures for NMR spectroscopy because of its accuracy in compound identification and robustness to chemical shift variation.
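The ACC, TPR and FPR values reported above are standard binary-classification rates. A minimal sketch of how they follow from confusion-matrix counts is given below; the function name and the counts are hypothetical and are not the paper's data.

```python
def classification_rates(tp, fp, tn, fn):
    """Return (accuracy, true-positive rate, false-positive rate).

    tp/fn count compounds that are present and were found/missed;
    fp/tn count compounds that are absent and were reported/rejected.
    """
    acc = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)   # also called sensitivity or recall
    fpr = fp / (fp + tn)   # share of absent compounds wrongly reported
    return acc, tpr, fpr


# Hypothetical counts, for illustration only.
acc, tpr, fpr = classification_rates(tp=9, fp=2, tn=8, fn=1)
```

Note that a TPR of 100% can coexist with a non-zero FPR, as in the additional flavor mixture result: every present compound is found, but some absent ones are also reported.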
|
23
|
GRAN3SAT: Creating Flexible Higher-Order Logic Satisfiability in the Discrete Hopfield Neural Network. MATHEMATICS 2022. [DOI: 10.3390/math10111899] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
One of the main problems in representing information in the form of nonsystematic logic is the lack of flexibility, which leads to potential overfitting. Although nonsystematic logic improves the representation of the conventional k Satisfiability, the formulations of the first, second, and third-order logical structures are very predictable. This paper proposed a novel higher-order logical structure, named G-Type Random k Satisfiability, by capitalizing on the new random feature of the first, second, and third-order clauses. The proposed logic was implemented into the Discrete Hopfield Neural Network as a symbolic logical rule. The proposed logic in Discrete Hopfield Neural Networks was evaluated using different parameter settings, such as different orders of clauses, different proportions between positive and negative literals, relaxation, and differing numbers of learning trials. Each evaluation utilized various performance metrics, such as learning error, testing error, weight error, energy analysis, and similarity analysis. In addition, the flexibility of the proposed logic was compared with current state-of-the-art logic rules. Based on the simulation, the proposed logic was reported to be more flexible, and produced higher solution diversity.
|
24
|
Regression Machine Learning Models Used to Predict DFT-Computed NMR Parameters of Zeolites. COMPUTATION 2022. [DOI: 10.3390/computation10050074] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
Machine learning approaches can drastically decrease the computational time for the predictions of spectroscopic properties in materials, while preserving the quality of the computational approaches. We studied the performance of kernel-ridge regression (KRR) and gradient boosting regressor (GBR) models trained on the isotropic shielding values, computed with density-functional theory (DFT), in a series of different known zeolites containing out-of-frame metal cations or fluorine anion and organic structure-directing cations. The smooth overlap of atomic positions (SOAP) descriptors were computed from the DFT-optimised Cartesian coordinates of each atom in the zeolite crystal cells. The use of these descriptors as inputs in both machine learning regression methods led to the prediction of the DFT isotropic shielding values with mean errors within 0.6 ppm. The results showed that the GBR model scales better than the KRR model.
|
25
|
ML- J-DP4: An Integrated Quantum Mechanics-Machine Learning Approach for Ultrafast NMR Structural Elucidation. Org Lett 2022; 24:7487-7491. [PMID: 35508069 DOI: 10.1021/acs.orglett.2c01251] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Abstract
A new tool, ML-J-DP4, provides an efficient and accurate method for determining the most likely structure of complex molecules within minutes using standard computational resources. The workflow involves combining fast Karplus-type J calculations with NMR chemical shift predictions at the cheapest HF/STO-3G level enhanced using machine learning (ML), all embedded in the J-DP4 formalism. Our ML provides accurate predictions, which compare favorably with other ML methods.
|
26
|
Graph Convolutional Network-Based Screening Strategy for Rapid Identification of SARS-CoV-2 Cell-Entry Inhibitors. J Chem Inf Model 2022; 62:1988-1997. [PMID: 35404596 PMCID: PMC9016773 DOI: 10.1021/acs.jcim.2c00222] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2022] [Indexed: 11/29/2022]
Abstract
The cell entry of SARS-CoV-2 has emerged as an attractive drug development target. We previously reported that the entry of SARS-CoV-2 depends on the cell surface heparan sulfate proteoglycan (HSPG) and the cortex actin, which can be targeted by therapeutic agents identified by conventional drug repurposing screens. However, this drug identification strategy requires laborious, time-consuming library screening, and often only a limited number of compounds can be screened. As an alternative approach, we developed and trained a graph convolutional network (GCN)-based classification model using information extracted from experimentally identified HSPG and actin inhibitors. This method allowed us to virtually screen 170,000 compounds, resulting in ∼2000 potential hits. A hit confirmation assay with the uptake of a fluorescently labeled HSPG cargo further shortlisted 256 active compounds. Among them, 16 compounds had modest to strong inhibitory activities against the entry of SARS-CoV-2 pseudotyped particles into Vero E6 cells. These results establish a GCN-based virtual screen workflow for rapid identification of new small molecule inhibitors against validated drug targets.
|
27
|
The DP5 probability, quantification and visualisation of structural uncertainty in single molecules. Chem Sci 2022; 13:3507-3518. [PMID: 35432857 PMCID: PMC8943899 DOI: 10.1039/d1sc04406k] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2021] [Accepted: 02/24/2022] [Indexed: 12/22/2022] Open
Abstract
Whenever a new molecule is made, a chemist will justify the proposed structure by analysing the NMR spectra. The widely-used DP4 algorithm will choose the best match from a series of possibilities, but draws no conclusions from a single candidate structure. Here we present the DP5 probability, a step-change in the quantification of molecular uncertainty: given one structure and one 13C NMR spectrum, DP5 gives the probability of the structure being correct. We show the DP5 probability can rapidly differentiate between structure proposals indistinguishable by NMR to an expert chemist. We also show in a number of challenging examples the DP5 probability may prevent incorrect structures being published and later reassigned. DP5 will prove extremely valuable in fields such as discovery-driven automated chemical synthesis and drug development. Alongside the DP4-AI package, DP5 can help guide synthetic chemists when resolving the most subtle structural uncertainty. The DP5 system is available at https://github.com/Goodman-lab/DP5.
|
28
|
Label-Free and In Situ Identification of Cells via Combinational Machine Learning Models. SMALL METHODS 2022; 6:e2101405. [PMID: 34954897 DOI: 10.1002/smtd.202101405] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/11/2021] [Indexed: 06/14/2023]
Abstract
Cell identification and counting in living and coculture systems are crucial in cell interaction studies, but current methods primarily rely on complicated and time-consuming staining techniques. Here, a label-free method to precisely recognize, identify, and instantly count cells in situ in coculture systems via combinational machine learning models is presented. A convolutional neural network (CNN) model is first used to generate virtual images of cell nuclei based on unlabeled phase-contrast images. Coordinates of all the cells are then returned according to the virtual nucleus images using two clustering algorithms. Finally, phase-contrast images of single cells are cropped based on the coordinates and sent into another CNN model for cell-type identification. This combinational approach is highly automatic and efficient, and requires few to no manual annotations of images in the training phase. It shows practical performance in different cell culture conditions including cell ratios, densities, and substrate materials, having great potential in real-time cell tracking and analysis.
|
29
|
A framework for automated structure elucidation from routine NMR spectra. Chem Sci 2021; 12:15329-15338. [PMID: 34976353 PMCID: PMC8635205 DOI: 10.1039/d1sc04105c] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2021] [Accepted: 11/08/2021] [Indexed: 12/25/2022] Open
Abstract
Methods to automate structure elucidation that can be applied broadly across chemical structure space have the potential to greatly accelerate chemical discovery. NMR spectroscopy is the most widely used and arguably the most powerful method for elucidating structures of organic molecules. Here we introduce a machine learning (ML) framework that provides a quantitative probabilistic ranking of the most likely structural connectivity of an unknown compound when given routine, experimental one dimensional 1H and/or 13C NMR spectra. In particular, our ML-based algorithm takes input NMR spectra and (i) predicts the presence of specific substructures out of hundreds of substructures it has learned to identify; (ii) annotates the spectrum to label peaks with predicted substructures; and (iii) uses the substructures to construct candidate constitutional isomers and assign to them a probabilistic ranking. Using experimental spectra and molecular formulae for molecules containing up to 10 non-hydrogen atoms, the correct constitutional isomer was the highest-ranking prediction made by our model in 67.4% of the cases and one of the top-ten predictions in 95.8% of the cases. This advance will aid in solving the structure of unknown compounds, and thus further the development of automated structure elucidation tools that could enable the creation of fully autonomous reaction discovery platforms. A machine learning model and graph generator were able to accurately predict the presence of nearly 1000 substructures and the connectivity of small organic molecules from experimental 1D NMR data.
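The "highest-ranking" and "top-ten" percentages above are top-k accuracies over ranked candidate lists. A minimal sketch of that metric follows; the function name and the SMILES strings are hypothetical examples, not the framework's code or data.

```python
def top_k_accuracy(ranked_candidates, true_structures, k):
    """Fraction of cases whose true structure appears in the first k candidates.

    ranked_candidates: one list per unknown, best-scoring candidate first.
    true_structures:   the correct structure for each unknown, same order.
    """
    hits = sum(
        truth in ranking[:k]
        for ranking, truth in zip(ranked_candidates, true_structures)
    )
    return hits / len(true_structures)


# Hypothetical ranked isomer lists (as SMILES strings), for illustration only.
rankings = [
    ["CCO", "COC"],      # correct answer ranked first
    ["CC=O", "C1CO1"],   # correct answer ranked second
    ["CCN", "CNC"],      # correct answer not in the list
]
truths = ["CCO", "C1CO1", "CCCN"]

top1 = top_k_accuracy(rankings, truths, k=1)
top2 = top_k_accuracy(rankings, truths, k=2)
```

A real evaluation would compare canonicalised structures rather than raw strings, since one molecule can be written as several different SMILES.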
|
30
|
Do Double-Hybrid Exchange-Correlation Functionals Provide Accurate Chemical Shifts? A Benchmark Assessment for Proton NMR. J Chem Theory Comput 2021; 17:6876-6885. [PMID: 34637284 DOI: 10.1021/acs.jctc.1c00604] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
A benchmark density functional theory (DFT) study of 1H NMR chemical shifts for data sets comprising 200 chemical shifts, including complex natural products, has been carried out to assess the performance of DFT methods. Two new benchmark data sets, NMRH33 and NMRH148, have been established. The meta-GGA revTPSS performs remarkably well against the NMRH33 benchmark set (mean absolute deviation (MAD), 0.10 ppm; maximum deviation (max), 0.26 ppm) with the smallest MAD of all evaluated functionals. The best-performing double-hybrid density functional (DHDF), revDSD-BLYP (MAD, 0.16 ppm; max, 0.35 ppm), performs similarly to hybrid-GGA methods (e.g., mPW1PW91/6-311G(d) (MAD, 0.15 ppm; max, 0.36 ppm)), but at a considerably higher computational cost. The results indicate that currently available double-hybrid DFT methods offer no benefit over GGA (including hybrid and meta) functionals in the calculation of 1H NMR chemical shifts.
|
31
|
Real-time prediction of 1H and 13C chemical shifts with DFT accuracy using a 3D graph neural network. Chem Sci 2021; 12:12012-12026. [PMID: 34667567 PMCID: PMC8457395 DOI: 10.1039/d1sc03343c] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2021] [Accepted: 07/19/2021] [Indexed: 11/23/2022] Open
Abstract
Nuclear magnetic resonance (NMR) is one of the primary techniques used to elucidate the chemical structure, bonding, stereochemistry, and conformation of organic compounds. The distinct chemical shifts in an NMR spectrum depend upon each atom's local chemical environment and are influenced by both through-bond and through-space interactions with other atoms and functional groups. The in silico prediction of NMR chemical shifts using quantum mechanical (QM) calculations is now commonplace in aiding organic structural assignment since spectra can be computed for several candidate structures and then compared with experimental values to find the best possible match. However, the computational demands of calculating multiple structural- and stereo-isomers, each of which may typically exist as an ensemble of rapidly-interconverting conformations, are expensive. Additionally, the QM predictions themselves may lack sufficient accuracy to identify a correct structure. In this work, we address both of these shortcomings by developing a rapid machine learning (ML) protocol to predict 1H and 13C chemical shifts through an efficient graph neural network (GNN) using 3D structures as input. Transfer learning with experimental data is used to improve the final prediction accuracy of a model trained using QM calculations. When tested on the CHESHIRE dataset, the proposed model predicts observed 13C chemical shifts with comparable accuracy to the best-performing DFT functionals (1.5 ppm) in around 1/6000 of the CPU time. An automated prediction webserver and graphical interface are accessible online at http://nova.chem.colostate.edu/cascade/. 
We further demonstrate the model in three applications: first, we use the model to decide the correct organic structure from candidates through experimental spectra, including complex stereoisomers; second, we automatically detect and revise incorrect chemical shift assignments in a popular NMR database, the NMRShiftDB; and third, we use NMR chemical shifts as descriptors for determination of the sites of electrophilic aromatic substitution. From quantum chemical and experimental NMR data, a 3D graph neural network, CASCADE, has been developed to predict carbon and proton chemical shifts. Stereoisomers and conformers of organic molecules can be correctly distinguished.
|
32
|
Predicting scalar coupling constants by graph angle-attention neural network. Sci Rep 2021; 11:18686. [PMID: 34548513 PMCID: PMC8455698 DOI: 10.1038/s41598-021-97146-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2021] [Accepted: 08/17/2021] [Indexed: 01/20/2023] Open
Abstract
The scalar coupling constant (SCC), directly measured by nuclear magnetic resonance (NMR) spectroscopy, is a key parameter for molecular structure analysis and is widely used to predict unknown molecular structures. Restricted by the high cost of NMR experiments, it is impossible to measure the SCC of unknown molecules on a large scale. Using density functional theory (DFT) to theoretically calculate the SCC of molecules is incredibly challenging, due to the substantial cost in computational time and space. Graph neural networks (GNN) of artificial intelligence (AI) have great potential in constructing molecular-like topology models, which endows them with the ability to rapidly predict SCC through data-driven machine learning methods while avoiding time-consuming quantum chemical calculations. With a priori knowledge of angles, we propose a graph angle-attention neural network (GAANN) model to predict SCC by means of some easily accessible related information. GAANN, with a multilayer message-passing network and a self-attention mechanism, can accurately simulate the molecular-like topological structure and predict molecular properties. Our simulations show that the prediction accuracy by GAANN, with the log(MAE) = −2.52, is close to that by DFT calculations. Different from conventional AI methods, GAANN, which combines the AI method with quantum chemistry theory (Karplus equation), has strong physicochemical interpretability with respect to angles. From an AI perspective, we find that bond angle has the highest correlation with the SCC among all angle features (dihedral angle, bond angle, geometric angles) about multiple coupling types in the small molecule datasets.
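For orientation, the Karplus equation mentioned above relates a three-bond coupling constant to a dihedral angle through the general form J(φ) = A cos²φ + B cos φ + C. The sketch below evaluates that form; the coefficient values are placeholders chosen for illustration, not parameters from GAANN or from any published parameterisation.

```python
import math


def karplus_j(dihedral_deg, a, b, c):
    """Three-bond coupling (Hz) from the Karplus form A*cos^2(phi) + B*cos(phi) + C."""
    cos_phi = math.cos(math.radians(dihedral_deg))
    return a * cos_phi ** 2 + b * cos_phi + c


# Placeholder coefficients in Hz (assumed for illustration only).
A, B, C = 8.0, -1.0, 1.0

j_syn = karplus_j(0.0, A, B, C)       # eclipsed arrangement
j_gauche = karplus_j(60.0, A, B, C)   # gauche arrangement
j_anti = karplus_j(180.0, A, B, C)    # antiperiplanar arrangement
```

With coefficients of this general shape the coupling is large near 0° and 180° and small near 90°, which is why measured couplings carry conformational information.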
|
33
|
Predicting chemical shifts with graph neural networks. Chem Sci 2021; 12:10802-10809. [PMID: 34476061 PMCID: PMC8372537 DOI: 10.1039/d1sc01895g] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2021] [Accepted: 07/09/2021] [Indexed: 02/02/2023] Open
Abstract
Inferring molecular structure from Nuclear Magnetic Resonance (NMR) measurements requires an accurate forward model that can predict chemical shifts from 3D structure. Current forward models are limited to specific molecules like proteins and state-of-the-art models are not differentiable. Thus they cannot be used with gradient methods like biased molecular dynamics. Here we use graph neural networks (GNNs) for NMR chemical shift prediction. Our GNN can model chemical shifts accurately and capture important phenomena like hydrogen bonding induced downfield shift between multiple proteins, secondary structure effects, and predict shifts of organic molecules. Previous empirical NMR models of protein NMR have relied on careful feature engineering with domain expertise. These GNNs are trained from data alone with no feature engineering yet are as accurate and can work on arbitrary molecular structures. The models are also efficient, able to compute one million chemical shifts in about 5 seconds. This work enables a new category of NMR models that have multiple interacting types of macromolecules.
|
34
|
A community-powered search of machine learning strategy space to find NMR property prediction models. PLoS One 2021; 16:e0253612. [PMID: 34283864 PMCID: PMC8291653 DOI: 10.1371/journal.pone.0253612] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2021] [Accepted: 06/08/2021] [Indexed: 01/21/2023] Open
Abstract
The rise of machine learning (ML) has created an explosion in the potential strategies for using data to make scientific predictions. For physical scientists wishing to apply ML strategies to a particular domain, it can be difficult to assess in advance what strategy to adopt within a vast space of possibilities. Here we outline the results of an online community-powered effort to swarm search the space of ML strategies and develop algorithms for predicting atomic-pairwise nuclear magnetic resonance (NMR) properties in molecules. Using an open-source dataset, we worked with Kaggle to design and host a 3-month competition which received 47,800 ML model predictions from 2,700 teams in 84 countries. Within 3 weeks, the Kaggle community produced models with comparable accuracy to our best previously published 'in-house' efforts. A meta-ensemble model constructed as a linear combination of the top predictions has a prediction accuracy which exceeds that of any individual model, 7-19x better than our previous state-of-the-art. The results highlight the potential of transformer architectures for predicting quantum mechanical (QM) molecular properties.
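A meta-ensemble built as a linear combination of model predictions, as described above, can be sketched in a few lines. The weights here are fixed by hand and the numbers are invented; in practice the weights would be fitted on held-out data, and nothing below reproduces the competition's actual models.

```python
def linear_ensemble(predictions, weights):
    """Weighted sum of several models' predictions for the same targets.

    predictions: list of per-model prediction lists, all the same length.
    weights:     one weight per model.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    return [
        sum(w * p for w, p in zip(weights, values))
        for values in zip(*predictions)
    ]


def mae(predicted, reference):
    """Mean absolute error between predictions and reference values."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


# Invented coupling constants (Hz), for illustration only.
truth = [1.0, 2.0, 3.0]
model_a = [1.2, 1.8, 3.2]
model_b = [0.9, 2.3, 2.9]

combined = linear_ensemble([model_a, model_b], weights=[0.5, 0.5])
```

The combination helps in this toy case because the two models' errors partly cancel; models with strongly correlated errors gain much less from averaging.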
|
35
|
Computer Assisted Structure Elucidation (CASE): Current and future perspectives. MAGNETIC RESONANCE IN CHEMISTRY : MRC 2021; 59:669-690. [PMID: 33197069 DOI: 10.1002/mrc.5115] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/20/2020] [Revised: 10/31/2020] [Accepted: 11/08/2020] [Indexed: 06/11/2023]
Abstract
The first efforts for the development of methods for Computer-Assisted Structure Elucidation (CASE) were published more than 50 years ago. CASE expert systems based on one-dimensional (1D) and two-dimensional (2D) Nuclear Magnetic Resonance (NMR) data have matured considerably by now. The structures of a great number of complex natural products have been elucidated and/or revised using such programs. In this article, we discuss the most likely directions in which CASE will evolve. We act on the premise that a synergistic interaction exists between CASE, new NMR experiments, and methods of computational chemistry, which are continuously being improved. The new developments in NMR experiments (long-range correlation experiments, pure-shift methods, coupling constants measurement and prediction, residual dipolar couplings [RDCs], and residual chemical shift anisotropies [RCSAs]), evolution of density functional theory (DFT), and machine learning algorithms will have an influence on CASE systems and vice versa. This is true also for new techniques for chemical analysis (Atomic Force Microscopy [AFM], "crystalline sponge" X-ray analysis, and micro-Electron Diffraction [micro-ED]), which will be used in combination with expert systems. We foresee that CASE will be utilized widely and become a routine tool for NMR spectroscopists and analysts in academic and industrial laboratories. We believe that the "golden age" of CASE is still in the future.
|
36
|
Solubility Prediction from Molecular Properties and Analytical Data Using an In-phase Deep Neural Network (Ip-DNN). ACS OMEGA 2021; 6:14278-14287. [PMID: 34124451 PMCID: PMC8190808 DOI: 10.1021/acsomega.1c01035] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/25/2021] [Accepted: 04/28/2021] [Indexed: 06/12/2023]
Abstract
Materials informatics is an emerging field that allows us to predict the properties of materials and has been applied in various research and development fields, such as materials science. In particular, solubility factors such as the Hansen and Hildebrand solubility parameters (HSPs and SP, respectively) and Log P are important values for understanding the physical properties of various substances. In this study, we succeeded at establishing a solubility prediction tool using a unique machine learning method called the in-phase deep neural network (ip-DNN), which starts exclusively from the analytical input data (e.g., NMR information, refractive index, and density) to predict solubility by predicting intermediate elements, such as molecular components and molecular descriptors, in the multiple-step method. For improving the level of accuracy of the prediction, intermediate regression models were employed when performing in-phase machine learning. In addition, we developed a website dedicated to the established solubility prediction method, which is freely available at "http://dmar.riken.jp/matsolca/".
|
37
|
Revving up 13C NMR shielding predictions across chemical space: benchmarks for atoms-in-molecules kernel machine learning with new data for 134 kilo molecules. MACHINE LEARNING: SCIENCE AND TECHNOLOGY 2021. [DOI: 10.1088/2632-2153/abe347] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022] Open
Abstract
The requirement for accelerated and quantitatively accurate screening of nuclear magnetic resonance spectra across the small molecules chemical compound space is two-fold: (1) a robust ‘local’ machine learning (ML) strategy capturing the effect of the neighborhood on an atom’s ‘near-sighted’ property—chemical shielding; (2) an accurate reference dataset generated with a state-of-the-art first-principles method for training. Herein we report the QM9-NMR dataset comprising isotropic shielding of over 0.8 million C atoms in 134k molecules of the QM9 dataset in gas and five common solvent phases. Using these data for training, we present benchmark results for the prediction transferability of kernel-ridge regression models with popular local descriptors. Our best model, trained on 100k samples, accurately predicts isotropic shielding of 50k ‘hold-out’ atoms with a mean error of less than 1.9 ppm. For the rapid prediction of new query molecules, the models were trained on geometries from an inexpensive theory. Furthermore, by using a Δ-ML strategy, we quench the error below 1.4 ppm. Finally, we test the transferability on non-trivial benchmark sets that include benchmark molecules comprising 10–17 heavy atoms and drugs.
|
38
|
Improved Prediction of Carbonless NMR Spectra by the Machine Learning of Theoretical and Fragment Descriptors for Environmental Mixture Analysis. Anal Chem 2021; 93:6901-6906. [PMID: 33929838 DOI: 10.1021/acs.analchem.1c00756] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
As the first multidimensional NMR approach, 2D J-resolved (2DJ) spectroscopy is distinguished by signal resolution and detection sensitivity with remarkable advantages for the exhaustive evaluation of complex mixtures and environmental samples due to its carbonless feature without the requirement of 13C connectivity. Generally, the 2DJ signal assignment of metabolic mixtures is problematic in spite of references to experimental NMR databases, owing to the existence of metabolic "dark matter." In this study, a new method to predict 2DJ spectra was developed with a combination of quantum mechanical (QM) computation and machine learning (ML). The predictive accuracy of J-coupling constants was evaluated using validated data. The root-mean-square deviation (RMSD) for QM computation was 3.52 Hz, while the RMSD for QM + ML was 1.21 Hz, indicating a substantial increase in predictive accuracy. The proposed model was applied to predict the 2DJ spectra of 60 standard substances and 55 components of seawater. Furthermore, two practical environmental samples were used to evaluate the robustness of the constructed predictive model. A J-coupling tree and J-split spectra produced from QM + ML of aliphatic moieties had good consistency with the experimental data, as compared with the theoretical data produced by QM computation. The predicted J-coupling tree for the J-coupling multiplet analysis of freely rotating bonds in the complex mixture, which is traditionally difficult, was interpretable. In addition, in silico identification of the J-split 1H NMR signals, which was independent of experimental databases, aided in the discovery of new components in a mixture.
|
39
|
Transfer Learning from Simulation to Experimental Data: NMR Chemical Shift Predictions. J Phys Chem Lett 2021; 12:3662-3668. [PMID: 33826849 DOI: 10.1021/acs.jpclett.1c00578] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
An accurate prediction of chemical shifts (δ) to elucidate molecular structures has been a challenging problem. Recently, novel machine learning architectures have achieved accurate prediction performance, but the difficulty of building a huge chemical database limits the applicability of machine learning approaches. In this work, we demonstrate that the prior knowledge gained from the simulation database is successfully transferred into the problem of predicting an experimentally measured δ. Although the simulation and experimental databases are vastly different in chemical perspectives, reliable accuracy for δ is achieved by additional training with a small number of randomly sampled experimental data. Furthermore, the prior knowledge allows us to successfully train the model on the more focused chemical space that the experimental database sparsely covers. The proposed approach, the knowledge transfer from the simulation database, can be utilized to enhance the usability of the local experimental database.
|
40
|
Decomposition Factor Analysis Based on Virtual Experiments throughout Bayesian Optimization for Compost-Degradable Polymers. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11062820] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 01/16/2023]
Abstract
Bio-based polymers have been considered as an alternative to oil-based materials for their "carbon-neutral," environmentally degradable features. However, degradation is a complex system in which environmental factors and preparation conditions are involved, and the relationship between degradation and these factors/conditions has not yet been clarified. Moreover, an efficient system that addresses multiple degradation factors has not been developed for practical use. Thus, we constructed a decomposition degree predictive model to explore degradation factors based on analytical data and experimental conditions. The predictive model was constructed by machine learning using a dataset. The objective variable was the molecular weight, and the explanatory variables were the moisture content in a compost environment, degradation period, degree of crystallinity pre-experiment, and features of solid-state nuclear magnetic resonance spectra. The good accuracy of this predictive model was confirmed by statistical variables. The moisture content in the compost environment was a critical factor for considering initial degradation; specific scores revealed the contribution of degradation factors. Furthermore, the optimum decomposition degree, various analytical values, and experimental conditions were predictable when this predictive model was combined with Bayesian optimization. Information obtained from virtual experiments is expected to promote the material design and development of bio-based plastics.
|
41
|
High-field and benchtop NMR spectroscopy for the characterization of new psychoactive substances. Forensic Sci Int 2021; 321:110718. [PMID: 33601154 DOI: 10.1016/j.forsciint.2021.110718] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/02/2020] [Revised: 01/28/2021] [Accepted: 01/30/2021] [Indexed: 12/18/2022]
Abstract
New psychoactive substances (NPS) have become a serious threat to public health in Europe due to their ability to be sold in the street or on the darknet. Regulating NPS is an urgent priority but comes with a number of analytical challenges since they are structurally similar to legal products. A number of analytical techniques can be used for identifying NPS, among which NMR spectroscopy is a gold standard. High-field NMR is typically used for structural elucidation in combination with other techniques such as GC-MS and infrared spectroscopy, together with databases. In addition to their strong ability to elucidate molecular structures, high-field NMR techniques are the gold standard for quantification without any physical isolation procedure and with a single internal standard. However, high-field NMR remains expensive, and emerging "benchtop" NMR instruments, which are cheaper and transportable, can be considered valuable alternatives to high-field NMR. Indeed, benchtop NMR, which emerged about ten years ago, makes it possible to carry out structural elucidation and quantification of NPS despite the gap in resolution and sensitivity as compared to high-field NMR. This review describes recent advances in the field of NMR applied to the characterization of NPS. High-field NMR methods are first described in view of their complementarity with other analytical methods, focusing on both structural and quantitative aspects. The second part of the review highlights how emerging benchtop NMR approaches could act as a game changer in the field of forensics.
|
42
|
|
43
|
Predicting Density Functional Theory-Quality Nuclear Magnetic Resonance Chemical Shifts via Δ-Machine Learning. J Chem Theory Comput 2021; 17:826-840. [DOI: 10.1021/acs.jctc.0c00979] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Indexed: 12/12/2022]
|
44
|
Application of INADEQUATE NMR techniques for directly tracing out the carbon skeleton of a natural product. PHYTOCHEMICAL ANALYSIS : PCA 2021; 32:7-23. [PMID: 32671944 DOI: 10.1002/pca.2976] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/14/2020] [Revised: 06/25/2020] [Accepted: 06/26/2020] [Indexed: 06/11/2023]
Abstract
INTRODUCTION Nuclear magnetic resonance (NMR) measurement of 1JCC coupling by two-dimensional (2D) INADEQUATE (incredible natural abundance double quantum transfer experiment), which is a special case of double-quantum (DQ) spectroscopy that offers unambiguous determination of 13C-13C spin-spin connectivities through the DQ transitions of the spin system, is especially suited to solving structures rich in quaternary carbons and poor in hydrogen content (Crews rule). OBJECTIVE To review published literature on the application of NMR methods to determine structure in the liquid-state, which specifically considers the interaction of a pair of carbon-13 (13C) nuclei adjacent to one another, to allow direct tracing out of contiguous carbon connectivity using 2D INADEQUATE. METHODOLOGY A comprehensive literature search was implemented with various databases: Web of Knowledge, PubMed and SciFinder, and other relevant published materials including published monographs. The keywords used, in various combinations, with INADEQUATE being present in all combinations, in the search were 2D NMR, 1JCC coupling, natural product, structure elucidation, 13C-13C connectivity, cryoprobe and CASE (computer-assisted structure elucidation)/PANACEA (protons and nitrogen and carbon et alia). RESULTS The 2D INADEQUATE continues to solve "intractable" problems in natural product chemistry, and using milligram quantities with cryoprobe techniques combined with CASE/PANACEA experiments can increase machine time efficiency. The 13C-13C-based structural elucidation by dissolution single-scan dynamic nuclear polarisation NMR can overcome disadvantages of 13C insensitivity at natural abundance.
Selected examples have demonstrated the trajectory of INADEQUATE spectroscopy from structural determination to clarification of metabolomics analysis and use of DFT (density functional theory) and coupling constants to clarify the connectivity, hybridisation and stereochemistry within natural products. CONCLUSIONS Somewhat neglected over the years because of perceived lack of sensitivity, the 2D INADEQUATE NMR technique has re-emerged as a useful tool for solving natural product structures that are rich in quaternary carbons and poor in hydrogen content.
|
45
|
Are Computational Methods Useful for Structure Elucidation of Large and Flexible Molecules? Belizentrin as a Case Study. Org Lett 2020; 23:503-507. [PMID: 33382270 DOI: 10.1021/acs.orglett.0c04016] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 01/30/2023]
Abstract
Quantum mechanical NMR methods are progressively becoming decisive in structure elucidation. However, problems arise using low-level calculations for complex molecules, whereas methods using higher levels of theory are not practical for large molecules. This report outlines a synergistic effort employing computationally inexpensive quantum mechanical NMR calculations with conformer selection incorporating 3JHH values as a way to solve the structure of large, complex, and highly flexible molecules using readily available computational resources with belizentrin as a case study.
|
46
|
Abstract
We introduce new and robust decompositions of mean-field Hartree-Fock and Kohn-Sham density functional theory relying on the use of localized molecular orbitals and physically sound charge population protocols. The new lossless property decompositions, which allow for partitioning one-electron reduced density matrices into either bond-wise or atomic contributions, are compared to alternatives from the literature with regard to both molecular energies and dipole moments. Besides commenting on possible applications as an interpretative tool in the rationalization of certain electronic phenomena, we demonstrate how decomposed mean-field theory makes it possible to expose and amplify compositional features in the context of machine-learned quantum chemistry. This is made possible by improving upon the granularity of the underlying data. On the basis of our preliminary proof-of-concept results, we conjecture that many of the structure-property inferences in existence today may be further refined by efficiently leveraging an increase in dataset complexity and richness.
|
47
|
Toward Accurate Predictions of Atomic Properties via Quantum Mechanics Descriptors Augmented Graph Convolutional Neural Network: Application of This Novel Approach in NMR Chemical Shifts Predictions. J Phys Chem Lett 2020; 11:9812-9818. [PMID: 33151693 DOI: 10.1021/acs.jpclett.0c02654] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 06/11/2023]
Abstract
In this study, an augmented Graph Convolutional Network (GCN) with quantum mechanics (QM) descriptors was reported for its accurate predictions of NMR chemical shifts with respect to experimental values. The prediction errors of 13C/1H NMR chemical shifts can be as small as 2.14/0.11 ppm. There are two crucial characteristics for this modified GCN: in one aspect, such a novel neural network could efficiently extract the overall molecule structure information; in another aspect, it could accurately solve the chemical environment of the target atom. As there exists an imperfect linear regression between the experimental NMR chemical shifts (δ) and the density functional theory (DFT) calculated isotropic shielding constants (σ), the inclusion of QM descriptors within GCN can largely improve its performance. Moreover, few-shot learning also becomes feasible with these descriptors. The success of this novel GCN in chemical shifts predictions also indicates its potential applicability for other computational studies.
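The linear relationship between calculated shielding constants (σ) and experimental shifts (δ) mentioned in this abstract can be sketched as an ordinary least-squares fit. This is only an illustration of the regression step, not the GCN model: the data are hypothetical and constructed to lie exactly on a line, whereas real data leave residuals (the "imperfect" part that the QM descriptors are meant to address). The names `fit_line` and `predict_shift` are illustrative.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_shift(sigma_value, slope, intercept):
    """Convert a calculated shielding constant into a predicted shift."""
    return slope * sigma_value + intercept


# Hypothetical 13C data: shielding constants (ppm) and matching shifts (ppm),
# generated from delta = -0.95 * sigma + 180 purely for illustration.
sigma = [170.0, 150.0, 120.0, 60.0]
delta = [18.5, 37.5, 66.0, 123.0]

slope, intercept = fit_line(sigma, delta)
print(slope, intercept, predict_shift(100.0, slope, intercept))
```

In practice the residuals of such a fit, rather than the fit itself, are what a learned correction would target.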
|
48
|
Deconvolution of fast exchange equilibrium states in NMR spectroscopy using virtual reference standards and probability theory. Org Biomol Chem 2020; 18:6927-6934. [PMID: 32936188 DOI: 10.1039/d0ob01459a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
A methodology for deconvolution of fast exchange equilibrium states in NMR spectroscopy (DFEQNMR) was developed based on DFT-GIAO NMR chemical shift prediction and a probability theory algorithm. Proof-of-concept studies were performed to estimate the protonation state of N-containing organic molecules involving fast proton exchange equilibrium and evaluate the solution tautomerism of a purine derivative. DFT-GIAO calculations were optimized to achieve good accuracy in 13C, 1H and 15N chemical shift prediction for protonated species. The probability theory algorithm enabled the determination of solution species ratios and yielded 95% confidence regions by comparing experimental and simulated chemical shift data sets. The calculation showed good accuracy for model partial salts with various functionalities and application in structure elucidation of complex natural product partial salts was also demonstrated. This method showed promising potential in acquisition of important insight into fast exchange equilibrium systems with only one experimental NMR chemical shift data set.
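The idea behind deconvoluting a fast-exchange equilibrium can be sketched under a simple assumption: each observed shift is the population-weighted average of the shifts of the two exchanging species. The sketch below estimates the population by least squares. It swaps in plain least squares for the probability theory algorithm described in the abstract and yields no confidence region; the shift values and the name `fit_population` are hypothetical.

```python
def fit_population(observed, species_a, species_b):
    """Least-squares fraction p of species A, assuming
    observed[k] = p * species_a[k] + (1 - p) * species_b[k] for each nucleus k."""
    numerator = sum(
        (obs - b) * (a - b) for obs, a, b in zip(observed, species_a, species_b)
    )
    denominator = sum((a - b) ** 2 for a, b in zip(species_a, species_b))
    if denominator == 0.0:
        raise ValueError("species shifts are identical; population is undetermined")
    # Clamp to the physically meaningful range [0, 1].
    return min(1.0, max(0.0, numerator / denominator))


# Hypothetical predicted shifts (ppm) for a protonated (A) and neutral (B) form.
shifts_a = [150.0, 120.0, 8.0]
shifts_b = [140.0, 125.0, 7.0]
# Hypothetical observed shifts, constructed for a 30:70 mixture of A and B.
shifts_observed = [143.0, 123.5, 7.3]

fraction_a = fit_population(shifts_observed, shifts_a, shifts_b)
print(fraction_a)
```

Nuclei whose shifts differ most between the two species dominate the estimate, which is why shifts near the exchanging site carry the most weight.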
|
49
|
NMR Calculations with Quantum Methods: Development of New Tools for Structural Elucidation and Beyond. Acc Chem Res 2020; 53:1922-1932. [PMID: 32794691 DOI: 10.1021/acs.accounts.0c00365] [Citation(s) in RCA: 61] [Impact Index Per Article: 15.3] [Indexed: 01/03/2023]
Abstract
Structural elucidation is an important and challenging stage in the discovery of new organic molecules. Single-crystal X-ray analysis provides the most unquestionable results, though in practice the availability of suitable crystals limits its broad use. On the other hand, NMR spectroscopy has become the leading and universal technique to accomplish the task. Despite continuous advances in the field, the misinterpretation of NMR data is commonplace, evidenced by the large number of erroneous structures being published in top journals. Quantum calculations of NMR chemical shifts and scalar coupling constants emerged as ideal complements to facilitate the elucidation process when experimental NMR data is inconclusive. Since seminal reports demonstrated that affordable DFT methods provide NMR predictions accurate enough to differentiate among closely related isomers, the discipline has experienced substantial growth. The impact has been felt in different areas, and nowadays the results of such calculations are routinely seen in high impact literature. This Account describes our investigations in the field of quantum NMR calculations, focusing on the development of tools for structural elucidation and practical applications. We pioneered the use of artificial intelligence methods in the development of novel strategies of structural validation. Our first generation of trained artificial neural networks (ANNs) showed excellent ability to identify mistakes at the atom connectivity level, whereas the use of multidimensional pattern recognition pushed the performance to the stereochemical limit. In a conceptually different approach, we developed DP4+, an updated version of the DP4 probability used to determine the most likely structure among two or more candidates when one set of experimental data is available.
Increasing the level of theory in NMR calculations and including unscaled data in the formalism improved the performance of the method, further validated to settle the configuration of challenging motifs such as spiroepoxides or Mosher's derivatives. One of the limitations of DP4+ is related to the relatively large computational cost involved in obtaining DFT-optimized geometries, which led to the development of a fast variant including the valuable information provided by coupling constants (J-DP4 method). These tools were explored to suggest the most probable structure of controversial natural or unnatural products originally misassigned, with some predictions further validated by synthesis (as in the case of pseudorubriflordilactone B). The possibility of predicting the structure of a natural product without requiring authentic sample was investigated in collaboration with Prof. Pilli (UNICAMP, Brazil) in the computer-guided total synthesis and stereochemical revisions of several natural products. Despite these advances, there remain considerable challenges, such as the case of configurational assessment of polar systems featuring multiple intramolecular hydrogen bonding interactions because of the poor energy predictions provided by most DFT methods. In our latest work, we tackle this problem by averaging the results provided by randomly generated ensembles, paving the way for a new paradigm in quantum NMR-assisted structural elucidation.
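The general idea of ranking candidate structures against one set of experimental shifts, as a DP4-style probability does, can be sketched as follows. This is a simplified illustration and not the DP4 or DP4+ formalism: it assumes a zero-mean Gaussian error model with a single hypothetical standard deviation, omits empirical scaling and the unscaled-data terms, and uses invented shift values. The function names are illustrative.

```python
import math


def gaussian_tail(error, sigma):
    """Probability of an error at least this large in magnitude on one side of a
    zero-mean Gaussian with standard deviation sigma (assumed error model)."""
    return 1.0 - 0.5 * (1.0 + math.erf(abs(error) / (sigma * math.sqrt(2.0))))


def candidate_probabilities(calculated_sets, experimental, sigma):
    """Normalise the per-candidate products of tail probabilities so that they
    sum to 1 (Bayes' rule with equal prior weight for every candidate)."""
    scores = []
    for calculated in calculated_sets:
        score = 1.0
        for calc, expt in zip(calculated, experimental):
            score *= gaussian_tail(calc - expt, sigma)
        scores.append(score)
    total = sum(scores)
    return [score / total for score in scores]


# Hypothetical 13C shifts (ppm) for two candidate structures and one experiment.
candidates = [[10.0, 20.0], [12.0, 25.0]]
experiment = [10.5, 20.5]

probabilities = candidate_probabilities(candidates, experiment, sigma=2.0)
print(probabilities)
```

The candidate whose calculated shifts deviate least from experiment receives the larger share of the probability; with many nuclei the product is better accumulated in log space to avoid underflow.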
|
50
|
Learning to Make Chemical Predictions: the Interplay of Feature Representation, Data, and Machine Learning Methods. Chem 2020; 6:1527-1542. [PMID: 32695924 PMCID: PMC7373218 DOI: 10.1016/j.chempr.2020.05.014] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 12/23/2022]
Abstract
Recently, supervised machine learning has been ascending in providing new predictive approaches for chemical, biological and materials sciences applications. In this Perspective we focus on the interplay of machine learning methods with chemically motivated descriptors and the size and type of data sets needed for molecular property prediction. Using Nuclear Magnetic Resonance chemical shift prediction as an example, we demonstrate that success is predicated on the choice of feature-extracted or real-space representations of chemical structures, whether the molecular property data is abundant and/or experimentally or computationally derived, and how these together will influence the correct choice of popular machine learning methods drawn from deep learning, random forests, or kernel methods.
|