1
Tetko IV, van Deursen R, Godin G. Be aware of overfitting by hyperparameter optimization! J Cheminform 2024; 16:139. [PMID: 39654058 PMCID: PMC11629497 DOI: 10.1186/s13321-024-00934-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2024] [Accepted: 11/22/2024] [Indexed: 12/12/2024] Open
Abstract
Hyperparameter optimization is very frequently employed in machine learning. However, an optimization of a large space of parameters could result in overfitting of models. In recent studies on solubility prediction the authors collected seven thermodynamic and kinetic solubility datasets from different data sources. They used state-of-the-art graph-based methods and compared models developed for each dataset using different data cleaning protocols and hyperparameter optimization. In our study we showed that hyperparameter optimization did not always result in better models, possibly due to overfitting when using the same statistical measures. Similar results could be calculated using pre-set hyperparameters, reducing the computational effort by around 10,000 times. We also extended the previous analysis by adding a representation learning method based on Natural Language Processing of SMILES called Transformer CNN. We show that across all analyzed sets using exactly the same protocol, Transformer CNN provided better results than graph-based methods for 26 out of 28 pairwise comparisons while using only a tiny fraction of the time required by other methods. Last but not least we stressed the importance of comparing calculation results using exactly the same statistical measures.
Scientific Contribution: We showed that models with pre-optimized hyperparameters can suffer from overfitting and that using pre-set hyperparameters yields similar performances but four orders of magnitude faster. Transformer CNN provided significantly higher accuracy compared to other investigated methods.
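The selection effect behind this warning can be illustrated with a small simulation. The sketch below is not the authors' protocol: it assumes a set of hypothetical hyperparameter configurations that all have the same true skill (each prediction is the target plus Gaussian noise) and picks the configuration with the lowest validation RMSE. The function name `simulate_selection` and all parameter values are illustrative assumptions.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def simulate_selection(n_configs=200, n_valid=50, n_test=50, noise=1.0, seed=0):
    """Pick the best of many equally skilled configurations on a validation set.

    Returns (validation RMSE of the winner, test RMSE of the winner,
    mean validation RMSE over all configurations).
    """
    rng = random.Random(seed)
    y_valid = [rng.gauss(0.0, 1.0) for _ in range(n_valid)]
    y_test = [rng.gauss(0.0, 1.0) for _ in range(n_test)]

    results = []
    for _ in range(n_configs):
        # Assumption: every configuration has identical true skill,
        # so differences between them are pure sampling noise.
        pred_valid = [y + rng.gauss(0.0, noise) for y in y_valid]
        pred_test = [y + rng.gauss(0.0, noise) for y in y_test]
        results.append((rmse(y_valid, pred_valid), rmse(y_test, pred_test)))

    # "Hyperparameter optimization": keep the configuration that looks best on validation.
    best_valid, best_test = min(results, key=lambda r: r[0])
    mean_valid = sum(r[0] for r in results) / n_configs
    return best_valid, best_test, mean_valid


if __name__ == "__main__":
    best_valid, best_test, mean_valid = simulate_selection()
    print(f"winner validation RMSE: {best_valid:.3f}")
    print(f"winner test RMSE:       {best_test:.3f}")
    print(f"mean validation RMSE:   {mean_valid:.3f}")
```

Because the winner is chosen on validation noise, its validation RMSE is by construction no higher than the average over configurations, while its test RMSE is expected to drift back toward the common skill level. The larger the search space, the stronger this optimism can become.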
Affiliation(s)
- Igor V Tetko
- Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich - Deutsches Forschungszentrum Für Gesundheit Und Umwelt (GmbH), 86764, Neuherberg, Germany.
- BIGCHEM GmbH, Valerystr. 49, 85716, Unterschleißheim, Germany.
2
Ramos MC, White AD. Predicting small molecules solubility on endpoint devices using deep ensemble neural networks. Digital Discovery 2024; 3:786-795. [PMID: 38638648 PMCID: PMC11022985 DOI: 10.1039/d3dd00217a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Accepted: 03/07/2024] [Indexed: 04/20/2024]
Abstract
Aqueous solubility is a valuable yet challenging property to predict. Computing solubility using first-principles methods requires accounting for the competing effects of entropy and enthalpy, resulting in long computations for relatively poor accuracy. Data-driven approaches, such as deep learning, offer improved accuracy and computational efficiency but typically lack uncertainty quantification. Additionally, ease of use remains a concern for any computational technique, resulting in the sustained popularity of group-based contribution methods. In this work, we addressed these problems with a deep learning model with predictive uncertainty that runs on a static website (without a server). This approach moves computing needs onto the website visitor without requiring installation, removing the need to pay for and maintain servers. Our model achieves satisfactory results in solubility prediction. Furthermore, we demonstrate how to create molecular property prediction models that balance uncertainty and ease of use. The code is available at https://github.com/ur-whitelab/mol.dev, and the model is useable at https://mol.dev.
Affiliation(s)
- Mayk Caldas Ramos
- Chemical Engineer Department, University of Rochester, Rochester, NY 14642, USA
- Andrew D White
- Chemical Engineer Department, University of Rochester, Rochester, NY 14642, USA
3
Kim Y, Jung H, Kumar S, Paton RS, Kim S. Designing solvent systems using self-evolving solubility databases and graph neural networks. Chem Sci 2024; 15:923-939. [PMID: 38239675 PMCID: PMC10793204 DOI: 10.1039/d3sc03468b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Accepted: 12/04/2023] [Indexed: 01/22/2024] Open
Abstract
Designing solvent systems is key to achieving the facile synthesis and separation of desired products from chemical processes, so many machine learning models have been developed to predict solubilities. However, breakthroughs are needed to address deficiencies in the model's predictive accuracy and generalizability; this can be addressed by expanding and integrating experimental and computational solubility databases. To maximize predictive accuracy, these two databases should not be trained separately, and they should not be simply combined without reconciling the discrepancies from different magnitudes of errors and uncertainties. Here, we introduce self-evolving solubility databases and graph neural networks developed through semi-supervised self-training approaches. Solubilities from quantum-mechanical calculations are referred to during semi-supervised learning, but they are not directly added to the experimental database. Dataset augmentation is performed from 11 637 experimental solubilities to >900 000 data points in the integrated database, while correcting for the discrepancies between experiment and computation. Our model was successfully applied to study solvent selection in organic reactions and separation processes. The accuracy (mean absolute error around 0.2 kcal mol⁻¹ for the test set) is quantitatively useful in exploring Linear Free Energy Relationships between reaction rates and solvation free energies for 11 organic reactions. Our model also accurately predicted the partition coefficients of lignin-derived monomers and drug-like molecules. While there is room for expanding solubility predictions to transition states, radicals, charged species, and organometallic complexes, this approach will be attractive to predictive chemistry areas where experimental, computational, and other heterogeneous data should be combined.
Affiliation(s)
- Yeonjoon Kim
- Department of Chemistry, Colorado State University, Fort Collins, CO 80523, USA
- Department of Chemistry, Pukyong National University, Busan 48513, Republic of Korea
- Hojin Jung
- Department of Chemistry, Colorado State University, Fort Collins, CO 80523, USA
- Sabari Kumar
- Department of Chemistry, Colorado State University, Fort Collins, CO 80523, USA
- Robert S Paton
- Department of Chemistry, Colorado State University, Fort Collins, CO 80523, USA
- Seonah Kim
- Department of Chemistry, Colorado State University, Fort Collins, CO 80523, USA
4
Ahmad W, Tayara H, Shim H, Chong KT. SolPredictor: Predicting Solubility with Residual Gated Graph Neural Network. Int J Mol Sci 2024; 25:715. [PMID: 38255790 PMCID: PMC10815788 DOI: 10.3390/ijms25020715] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 12/06/2023] [Revised: 12/26/2023] [Accepted: 01/04/2024] [Indexed: 01/24/2024] Open
Abstract
Computational methods play a pivotal role in the pursuit of efficient drug discovery, enabling the rapid assessment of compound properties before costly and time-consuming laboratory experiments. With the advent of technology and large data availability, machine and deep learning methods have proven efficient in predicting molecular solubility. High-precision in silico solubility prediction has revolutionized drug development by enhancing formulation design, guiding lead optimization, and predicting pharmacokinetic parameters. These benefits yield considerable cost and time savings and a more efficient, shorter drug development process. The proposed SolPredictor is designed with the aim of developing a computational model for solubility prediction. The model is based on residual graph neural network convolution (RGNN). The RGNNs were designed to capture long-range dependencies in graph-structured data. Residual connections enable information to be utilized over various layers, allowing the model to capture and preserve essential features and patterns scattered throughout the network. The two largest datasets available to date are compiled, and the model uses a simplified molecular-input line-entry system (SMILES) representation. In ten-fold cross-validation, SolPredictor achieves a Pearson correlation coefficient R² of 0.79 ± 0.02 and a root mean square error (RMSE) of 1.03 ± 0.04. The proposed model was evaluated using five independent datasets. Error analysis, hyperparameter optimization analysis, and model explainability were used to determine the molecular features that were most valuable for prediction.
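The two reported statistics can be reproduced for any set of predictions with a few lines of code. The sketch below is an illustration, not the SolPredictor code: it computes the squared Pearson correlation and the RMSE in plain Python and summarizes per-fold scores as mean and sample standard deviation. Reading the "±" as a standard deviation over folds is an assumption, as are the function names and the example fold scores.

```python
import math
from statistics import mean, stdev


def rmse(y_true, y_pred):
    """Root mean square error, in the units of the target (e.g. log S)."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    mt, mp = mean(y_true), mean(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


def summarize_folds(fold_scores):
    """Mean and sample standard deviation of per-fold scores (assumed meaning of '±')."""
    return mean(fold_scores), stdev(fold_scores)


if __name__ == "__main__":
    # Hypothetical predictions that are perfectly correlated but systematically biased.
    y_true = [1.0, 2.0, 3.0, 4.0]
    y_pred = [3.0, 5.0, 7.0, 9.0]
    print(pearson_r2(y_true, y_pred))  # correlation ignores the bias
    print(rmse(y_true, y_pred))        # RMSE does not
    print(summarize_folds([0.80, 0.78, 0.79]))  # hypothetical per-fold R² values
```

The biased example shows why the two numbers are reported together: a squared Pearson correlation is insensitive to a linear shift or rescaling of the predictions, whereas RMSE penalizes it directly.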
Affiliation(s)
- Waqar Ahmad
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Republic of Korea
- HyunJoo Shim
- School of Pharmacy, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Republic of Korea
5
Chaka MD, Mekonnen YS, Wu Q, Geffe CA. Advancing energy storage through solubility prediction: leveraging the potential of deep learning. Phys Chem Chem Phys 2023; 25:31836-31847. [PMID: 37966375 DOI: 10.1039/d3cp03992g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2023]
Abstract
Solubility prediction plays a crucial role in energy storage applications, such as redox flow batteries, because it directly affects the efficiency and reliability. Researchers have developed various methods that utilize quantum calculations and descriptors to predict the aqueous solubilities of organic molecules. Notably, machine learning models based on descriptors have shown promise for solubility prediction. As deep learning tools, graph neural networks (GNNs) have emerged to capture complex structure-property relationships for material property prediction. Specifically, MolGAT, a type of GNN model, was designed to incorporate n-dimensional edge attributes, enabling the modeling of intricacies in molecular graphs and enhancing the prediction capabilities. In a previous study, MolGAT successfully screened 23 467 promising redox-active molecules from a database of over 500 000 compounds, based on redox potential predictions. This study focused on applying the MolGAT model to predict the aqueous solubility (log S) of a broad range of organic compounds, including those previously screened for redox activity. The model was trained on a diverse sample of 8494 organic molecules from AqSolDB and benchmarked against literature data, demonstrating superior accuracy compared with other state-of-the-art graph-based and descriptor-based models. Subsequently, the trained MolGAT model was employed to screen redox-active organic compounds identified in the first phase of high-throughput virtual screening, targeting favorable solubility in energy storage applications. The second round of screening, which considered solubility, yielded 12 332 promising redox-active and soluble organic molecules suitable for use in aqueous redox flow batteries.
Thus, the two-phase high-throughput virtual screening approach utilizing MolGAT, specifically trained for redox potential and solubility, is an effective strategy for selecting suitable intrinsically soluble redox-active molecules from extensive databases, potentially advancing energy storage through reliable material development. This indicates that the model is reliable for predicting the solubility of various molecules and provides valuable insights for energy storage, pharmaceutical, environmental, and chemical applications.
Affiliation(s)
- Mesfin Diro Chaka
- Department of Physics, College of Natural and Computational Sciences, Addis Ababa University, P. O. Box 1176, Addis Ababa, Ethiopia.
- Computational Data Science Program, College of Natural and Computational Sciences, Addis Ababa University, P. O. Box 1176, Addis Ababa, Ethiopia
- Yedilfana Setarge Mekonnen
- Center for Environmental Science, College of Natural and Computational Sciences, Addis Ababa University, P. O. Box 1176, Addis Ababa, Ethiopia
- Qin Wu
- Center for Functional Nanomaterials, Brookhaven National Laboratory, Upton, NY 11973, USA
- Chernet Amente Geffe
- Department of Physics, College of Natural and Computational Sciences, Addis Ababa University, P. O. Box 1176, Addis Ababa, Ethiopia.
6
Reinhardt A, Chew PY, Cheng B. A streamlined molecular-dynamics workflow for computing solubilities of molecular and ionic crystals. J Chem Phys 2023; 159:184110. [PMID: 37962445 DOI: 10.1063/5.0173341] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2023] [Accepted: 10/20/2023] [Indexed: 11/15/2023] Open
Abstract
Computing the solubility of crystals in a solvent using atomistic simulations is notoriously challenging due to the complexities and convergence issues associated with free-energy methods, as well as the slow equilibration in direct-coexistence simulations. This paper introduces a molecular-dynamics workflow that simplifies and robustly computes the solubility of molecular or ionic crystals. This method is considerably more straightforward than the state-of-the-art, as we have streamlined and optimised each step of the process. Specifically, we calculate the chemical potential of the crystal using the gas-phase molecule as a reference state, and employ the S0 method to determine the concentration dependence of the chemical potential of the solute. We use this workflow to predict the solubilities of sodium chloride in water, urea polymorphs in water, and paracetamol polymorphs in both water and ethanol. Our findings indicate that the predicted solubility is sensitive to the chosen potential energy surface. Furthermore, we note that the harmonic approximation often fails for both molecular crystals and gas molecules at or above room temperature, and that the assumption of an ideal solution becomes less valid for highly soluble substances.
Affiliation(s)
- Aleks Reinhardt
- Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- Pin Yu Chew
- Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- Bingqing Cheng
- Institute of Science and Technology Austria, Am Campus 1, 3400 Klosterneuburg, Austria
7
Conn JM, Carter JW, Conn JJA, Subramanian V, Baxter A, Engkvist O, Llinas A, Ratkova EL, Pickett SD, McDonagh JL, Palmer DS. Blinded Predictions and Post Hoc Analysis of the Second Solubility Challenge Data: Exploring Training Data and Feature Set Selection for Machine and Deep Learning Models. J Chem Inf Model 2023; 63:1099-1113. [PMID: 36758178 PMCID: PMC9976279 DOI: 10.1021/acs.jcim.2c01189] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 02/11/2023]
Abstract
Accurate methods to predict solubility from molecular structure are highly sought after in the chemical sciences. To assess the state of the art, the American Chemical Society organized a "Second Solubility Challenge" in 2019, in which competitors were invited to submit blinded predictions of the solubilities of 132 drug-like molecules. In the first part of this article, we describe the development of two models that were submitted to the Blind Challenge in 2019 but which have not previously been reported. These models were based on computationally inexpensive molecular descriptors and traditional machine learning algorithms and were trained on a relatively small data set of 300 molecules. In the second part of the article, to test the hypothesis that predictions would improve with more advanced algorithms and higher volumes of training data, we compare these original predictions with those made after the deadline using deep learning models trained on larger solubility data sets consisting of 2999 and 5697 molecules. The results show that there are several algorithms that are able to obtain near state-of-the-art performance on the solubility challenge data sets, with the best model, a graph convolutional neural network, resulting in an RMSE of 0.86 log units. Critical analysis of the models reveals systematic differences between the performance of models using certain feature sets and training data sets. The results suggest that careful selection of high quality training data from relevant regions of chemical space is critical for prediction accuracy but that other methodological issues remain problematic for machine learning solubility models, such as the difficulty in modeling complex chemical spaces from sparse training data sets.
Affiliation(s)
- Jonathan G. M. Conn
- Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
- James W. Carter
- Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
- Justin J. A. Conn
- Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
- Vigneshwari Subramanian
- Drug Metabolism and Pharmacokinetics, Research and Early Development, Respiratory & Immunology, BioPharmaceuticals R&D, AstraZeneca, Pepparedsleden 1, SE-431 83 Göteborg, Sweden
- Andrew Baxter
- GSK Medicines Research Centre, Gunnels Wood Road, Stevenage SG1 2NY, U.K.
- Ola Engkvist
- Medicinal Chemistry, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, SE-431 50 Göteborg, Sweden
- Department of Computer Science and Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden
- Antonio Llinas
- Drug Metabolism and Pharmacokinetics, Research and Early Development, Respiratory & Immunology, BioPharmaceuticals R&D, AstraZeneca, Pepparedsleden 1, SE-431 83 Göteborg, Sweden
- Ekaterina L. Ratkova
- Medicinal Chemistry, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, SE-431 50 Göteborg, Sweden
- Stephen D. Pickett
- Computational Sciences, GlaxoSmithKline R&D Pharmaceuticals, Stevenage SG1 2NY, U.K.
- James L. McDonagh
- IBM Research Europe, Hartree Centre, SciTech Daresbury, Warrington, Cheshire WA4 4AD, U.K.
- David S. Palmer
- Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
8
Ahmad W, Tayara H, Chong KT. Attention-Based Graph Neural Network for Molecular Solubility Prediction. ACS Omega 2023; 8:3236-3244. [PMID: 36713733 PMCID: PMC9878542 DOI: 10.1021/acsomega.2c06702] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 10/18/2022] [Accepted: 12/23/2022] [Indexed: 06/18/2023]
Abstract
Drug discovery (DD) research is aimed at the discovery of new medications. Solubility is an important physicochemical property in drug development. Active pharmaceutical ingredients (APIs) are essential substances for high drug efficacy. During DD research, aqueous solubility (AS) is a key physicochemical attribute required for API characterization. High-precision in silico solubility prediction reduces the experimental cost and time of drug development. Several artificial tools have been employed for solubility prediction using machine learning and deep learning techniques. This study aims to create different deep learning models that can predict the solubility of a wide range of molecules using the largest currently available solubility data set. Simplified molecular-input line-entry system (SMILES) strings were used as the molecular representation, and models were developed using simple graph convolution, graph isomorphism network, graph attention network, and AttentiveFP network. Based on the performance of the models, the AttentiveFP-based network model was finally selected. The model was trained and tested on 9943 compounds. On 62 anticancer compounds, the model performed best, with Pearson correlation R² and root-mean-square error values of 0.52 and 0.61, respectively. AS prediction can be improved by improving the graph algorithm or by adding more molecular properties.
Affiliation(s)
- Waqar Ahmad
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
9
Chew PY, Reinhardt A. Phase diagrams-Why they matter and how to predict them. J Chem Phys 2023; 158:030902. [PMID: 36681642 DOI: 10.1063/5.0131028] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 12/23/2022] Open
Abstract
Understanding the thermodynamic stability and metastability of materials can help us to, for example, gauge whether crystalline polymorphs in pharmaceutical formulations are likely to be durable. It can also help us to design experimental routes to novel phases with potentially interesting properties. In this Perspective, we provide an overview of how thermodynamic phase behavior can be quantified both in computer simulations and machine-learning approaches to determine phase diagrams, as well as combinations of the two. We review the basic workflow of free-energy computations for condensed phases, including some practical implementation advice, ranging from the Frenkel-Ladd approach to thermodynamic integration and to direct-coexistence simulations. We illustrate the applications of such methods on a range of systems from materials chemistry to biological phase separation. Finally, we outline some challenges, questions, and practical applications of phase-diagram determination which we believe are likely to be possible to address in the near future using such state-of-the-art free-energy calculations, which may provide fundamental insight into separation processes using multicomponent solvents.
Affiliation(s)
- Pin Yu Chew
- Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- Aleks Reinhardt
- Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
10
Hamre JR, Jafri MS. Optimizing peptide inhibitors of SARS-Cov-2 nsp10/nsp16 methyltransferase predicted through molecular simulation and machine learning. Informatics in Medicine Unlocked 2022; 29:100886. [PMID: 35252541 PMCID: PMC8883729 DOI: 10.1016/j.imu.2022.100886] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/23/2021] [Revised: 02/04/2022] [Accepted: 02/16/2022] [Indexed: 11/30/2022] Open
Abstract
Coronaviruses, including the recent pandemic strain SARS-Cov-2, use a multifunctional 2'-O-methyltransferase (2'-O-MTase) to restrict the host defense mechanism and to methylate RNA. The nonstructural protein 16 2'-O-MTase (nsp16) becomes active when nonstructural protein 10 (nsp10) and nsp16 interact. Novel peptide drugs have shown promise in the treatment of numerous diseases and new research has established that nsp10 derived peptides can disrupt viral methyltransferase activity via interaction of nsp16. This study had the goal of optimizing new analogous nsp10 peptides that have the ability to bind nsp16 with equal to or higher affinity than those naturally occurring. The following research demonstrates that in silico molecular simulations can shed light on peptide structures and predict the potential of new peptides to interrupt methyltransferase activity via the nsp10/nsp16 interface. The simulations suggest that misalignments at residues F68, H80, I81, D94, and Y96 or rotation at H80 abrogate MTase function. We develop a new set of peptides based on conserved regions of the nsp10 protein in the Coronaviridae species and test these to known MTase variant values. This results in the prediction that the H80R variant is a solid new candidate for potential new testing. We envision that this new lead is the beginning of a reputable foundation of a new computational method that combats coronaviruses and that is beneficial for new peptide drug development.
Affiliation(s)
- John R Hamre
- School of Systems Biology, George Mason University, Fairfax, VA, 22030, USA
- M Saleet Jafri
- School of Systems Biology, George Mason University, Fairfax, VA, 22030, USA
- Center for Biomedical Engineering and Technology, University of Maryland School of Medicine, Baltimore, MD, 21201, USA
11
Blow KE, Quigley D, Sosso GC. The seven deadly sins: When computing crystal nucleation rates, the devil is in the details. J Chem Phys 2021; 155:040901. [PMID: 34340373 DOI: 10.1063/5.0055248] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 12/20/2022] Open
Abstract
The formation of crystals has proven to be one of the most challenging phase transformations to quantitatively model (let alone to actually understand), be it by means of the latest experimental technique or the full arsenal of enhanced sampling approaches at our disposal. One of the most crucial quantities involved with the crystallization process is the nucleation rate, a single elusive number that is supposed to quantify the average probability for a nucleus of critical size to occur within a certain volume and time span. A substantial amount of effort has been devoted to attempt a connection between the crystal nucleation rates computed by means of atomistic simulations and their experimentally measured counterparts. Sadly, this endeavor almost invariably fails to some extent, with the venerable classical nucleation theory typically blamed as the main culprit. Here, we review some of the recent advances in the field, focusing on a number of perhaps more subtle details that are sometimes overlooked when computing nucleation rates. We believe it is important for the community to be aware of the full impact of aspects, such as finite size effects and slow dynamics, that often introduce inconspicuous and yet non-negligible sources of uncertainty into our simulations. In fact, it is key to obtain robust and reproducible trends to be leveraged so as to shed new light on the kinetics of a process, that of crystal nucleation, which is involved in countless practical applications, from the formulation of pharmaceutical drugs to the manufacturing of nano-electronic devices.
Affiliation(s)
- Katarina E Blow
- Department of Physics, University of Warwick, Coventry CV4 7AL, United Kingdom
- David Quigley
- Department of Physics, University of Warwick, Coventry CV4 7AL, United Kingdom
- Gabriele C Sosso
- Department of Chemistry, University of Warwick, Coventry CV4 7AL, United Kingdom
12
Prediction of Protein Solubility Based on Sequence Feature Fusion and DDcCNN. Interdiscip Sci 2021; 13:703-716. [PMID: 34236625 DOI: 10.1007/s12539-021-00456-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2021] [Revised: 06/21/2021] [Accepted: 06/23/2021] [Indexed: 10/20/2022]
Abstract
BACKGROUND Prediction of protein solubility is an indispensable prerequisite for pharmaceutical research and production. The objective of this work is to design a new model for predicting protein solubility by using protein sequence feature fusion and deep dual-channel convolutional neural networks (DDcCNN) to improve the performance of existing prediction models. METHODS The redundancy of the raw protein data is reduced by CD-HIT. Four subsequences are built from each protein sequence: one global and three local. The global subsequence is the entire protein sequence, and the local subsequences are obtained by moving a sliding window with some rules. Using G-gap to extract the features of these four subsequences, a mixed matrix is constructed as the input of one channel, which is composed of three convolutional layers. Additional features are extracted by the SCRATCH tool as the input of another channel, which consists of a single convolution, in order to find hidden relationships and improve the accuracy of the predictor. The outputs of the two parallel channels are concatenated as the input of the hidden layer, and the prediction of protein solubility is obtained in the output layer. The best protein solubility prediction model is obtained through comparative experiments with different frameworks. RESULTS The performance indicators of the DDcCNN model (our design) are as follows: accuracy of 77.82%, Matthews correlation coefficient of 0.57, sensitivity of 76.13% and specificity of 79.32%. The results of comparative experiments show that the overall performance of the DDcCNN model is better than that of existing models (GCNN, LCNN and PCNN). The related models and data are publicly deposited at http://www.ddccnn.wang.
CONCLUSION The satisfactory performance of the DDcCNN model reveals that these features and flexible computational methodologies can reinforce existing prediction models for better prediction of protein solubility. The model could be applied in several settings, such as preselecting initial targets that are soluble or altering the solubility of target proteins, and can thus help to reduce production cost.
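The four indicators quoted in the RESULTS all derive from the same binary confusion matrix. The sketch below is a generic illustration, not the DDcCNN evaluation code: the function name `binary_metrics` and the example counts are hypothetical, and the abstract does not report the underlying confusion matrix.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient
    from the counts of a binary confusion matrix (soluble = positive class)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # fraction of soluble proteins recovered
    specificity = tn / (tn + fp)  # fraction of insoluble proteins recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    for name, value in binary_metrics(tp=40, fp=5, tn=45, fn=10).items():
        print(f"{name}: {value:.4f}")
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which is why it is often reported alongside sensitivity and specificity when classes are imbalanced.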
13
Fowles DJ, Palmer DS, Guo R, Price SL, Mitchell JBO. Toward Physics-Based Solubility Computation for Pharmaceuticals to Rival Informatics. J Chem Theory Comput 2021; 17:3700-3709. [PMID: 33988381 PMCID: PMC8190954 DOI: 10.1021/acs.jctc.1c00130] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 01/11/2023]
Abstract
We demonstrate that physics-based calculations of intrinsic aqueous solubility can rival cheminformatics-based machine learning predictions. A proof-of-concept was developed for a physics-based approach via a sublimation thermodynamic cycle, building upon previous work that relied upon several thermodynamic approximations, notably the 2RT approximation, and limited conformational sampling. Here, we apply improvements to our sublimation free-energy model with the use of crystal phonon mode calculations to capture the contributions of the vibrational modes of the crystal. Including these improvements with lattice energies computed using the model-potential-based Ψmol method leads to accurate estimates of sublimation free energy. Combining these with hydration free energies obtained from either molecular dynamics free-energy perturbation simulations or density functional theory calculations, solubilities comparable to both experiment and informatics predictions are obtained. The application to coronene, succinic acid, and the pharmaceutical desloratadine shows how the methods must be adapted for the adoption of different conformations in different phases. The approach has the flexibility to extend to applications that cannot be covered by informatics methods.
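For orientation, the sublimation thermodynamic cycle and the 2RT approximation named in this abstract are commonly written as follows. This is a sketch of the standard relations from the general literature on the approach, not a quotation from the paper; the symbols ($S_0$ for intrinsic solubility, $V_m$ for the molar volume of the crystal, $E_{\mathrm{latt}}$ for the lattice energy) and the choice of standard state marked by the asterisk are assumptions introduced here.

```latex
\begin{align}
\Delta G_{\mathrm{sol}}^{*} &= \Delta G_{\mathrm{sub}}^{*} + \Delta G_{\mathrm{hyd}}^{*} \\
S_{0} &= \frac{1}{V_{m}} \exp\!\left(-\frac{\Delta G_{\mathrm{sol}}^{*}}{RT}\right) \\
\Delta H_{\mathrm{sub}} &\approx -E_{\mathrm{latt}} - 2RT
\end{align}
```

The third relation is the approximation that the abstract says is replaced by explicit crystal phonon mode calculations. Because the solubility depends exponentially on the solution free energy, an error of about 5.7 kJ/mol at 298 K corresponds to roughly one log unit in $S_0$.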
Affiliation(s)
- Daniel J Fowles
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow, Scotland G1 1XL, U.K.
- David S Palmer
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow, Scotland G1 1XL, U.K.
- Rui Guo
  - Department of Chemistry, University College London, 20 Gordon Street, London WC1H 0AJ, U.K.
- Sarah L Price
  - Department of Chemistry, University College London, 20 Gordon Street, London WC1H 0AJ, U.K.
- John B O Mitchell
  - EaStCHEM School of Chemistry and Biomedical Sciences Research Complex, University of St Andrews, St Andrews, Scotland KY16 9ST, U.K.
|
14
|
Francoeur PG, Koes DR. SolTranNet-A Machine Learning Tool for Fast Aqueous Solubility Prediction. J Chem Inf Model 2021; 61:2530-2536. [PMID: 34038123 DOI: 10.1021/acs.jcim.1c00331] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 11/28/2022]
Abstract
While accurate prediction of aqueous solubility remains a challenge in drug discovery, machine learning (ML) approaches have become increasingly popular for this task. For instance, in the Second Challenge to Predict Aqueous Solubility (SC2), all groups utilized machine learning methods in their submissions. We present SolTranNet, a molecule attention transformer to predict aqueous solubility from a molecule's SMILES representation. Atypically, we demonstrate that larger models perform worse at this task, with SolTranNet's final architecture having 3,393 parameters while outperforming linear ML approaches. SolTranNet has a 3-fold scaffold split cross-validation root-mean-square error (RMSE) of 1.459 on AqSolDB and an RMSE of 1.711 on a withheld test set. We also demonstrate that, when used as a classifier to filter out insoluble compounds, SolTranNet achieves a sensitivity of 94.8% on the SC2 data set and is competitive with the other methods submitted to the competition. SolTranNet is distributed via pip, and its source code is available at https://github.com/gnina/SolTranNet.
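The scaffold-split cross-validation and RMSE mentioned in this abstract can be sketched as follows. The sketch assumes a scaffold key has already been computed for each molecule (for example with a cheminformatics toolkit); the function names `scaffold_kfold` and `rmse`, the greedy fold assignment, and the toy scaffold labels are illustrative assumptions and do not reproduce SolTranNet's actual splitting code.

```python
from collections import defaultdict
from math import sqrt


def scaffold_kfold(scaffolds, k=3):
    """Assign sample indices to k folds so that no scaffold spans two folds.

    Greedy heuristic (an assumption, not the paper's procedure): place the
    largest scaffold groups first, each into the currently smallest fold.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    folds = [[] for _ in range(k)]
    for scaffold in sorted(groups, key=lambda s: (-len(groups[s]), s)):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(groups[scaffold])
    return folds


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


# Usage with made-up scaffold labels: each fold serves once as the test set.
scaffolds = ["benzene", "benzene", "pyridine", "indole", "indole", "indole", "furan"]
folds = scaffold_kfold(scaffolds, k=3)
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
```

Keeping each scaffold inside a single fold means that the test molecules are structurally less similar to the training molecules than in a random split, which usually gives a more conservative error estimate.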
Affiliation(s)
- Paul G Francoeur
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- David R Koes
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
|
15
|
Synergistic Computational Modeling Approaches as Team Players in the Game of Solubility Predictions. J Pharm Sci 2020; 110:22-34. [PMID: 33217423 DOI: 10.1016/j.xphs.2020.10.068] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 08/20/2020] [Revised: 10/23/2020] [Accepted: 10/28/2020] [Indexed: 11/23/2022]
Abstract
Several approaches to predict and model drug solubility have been used in the drug discovery and development processes during the last decades. Each of these approaches has its own benefits and place, and they are typically used as standalone approaches rather than in concert. Their synergistic effects are often overlooked, partly because computational experts are needed to perform the modeling and simulations and to analyze the data obtained. Here we provide our views on how these different approaches can be used to retrieve more information on drug solubility, ranging from multivariate data analysis through thermodynamic cycle modeling to molecular dynamics simulations. We discuss aqueous solubility as well as solubility in more complex mixed solvents and media with colloidal structures present. We conclude that the field of computational pharmaceutics is in its early days but has a bright future ahead. However, education of computational formulators with broad knowledge of modeling and simulation approaches is imperative if computational pharmaceutics is to reach its full potential.
|
16
|
Ansari N, Karmakar T, Parrinello M. Molecular Mechanism of Gas Solubility in Liquid: Constant Chemical Potential Molecular Dynamics Simulations. J Chem Theory Comput 2020; 16:5279-5286. [PMID: 32551636 DOI: 10.1021/acs.jctc.0c00450] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
Accurate prediction of gas solubility in a liquid is crucial in many areas of chemistry, and a detailed understanding of the molecular mechanism of the gas solvation continues to be an active area of research. Here, we extend the idea of the constant chemical potential molecular dynamics (CμMD) approach to the calculation of the gas solubility in the liquid under constant gas chemical potential conditions. As a representative example, we utilize this method to calculate the isothermal solubility of carbon dioxide in water. Additionally, we provide microscopic insight into the mechanism of solvation that preferentially occurs in areas of the surface where the hydrogen network is broken.
Affiliation(s)
- Narjes Ansari
  - Department of Chemistry and Applied Biosciences, ETH Zurich, 8092 Zurich, Switzerland; Facoltà di informatica, Istituto di Scienze Computazionali, Università della Svizzera Italiana, CH-6900 Lugano, Switzerland
- Tarak Karmakar
  - Department of Chemistry and Applied Biosciences, ETH Zurich, 8092 Zurich, Switzerland; Facoltà di informatica, Istituto di Scienze Computazionali, Università della Svizzera Italiana, CH-6900 Lugano, Switzerland
- Michele Parrinello
  - Department of Chemistry and Applied Biosciences, ETH Zurich, 8092 Zurich, Switzerland; Facoltà di informatica, Istituto di Scienze Computazionali, Università della Svizzera Italiana, CH-6900 Lugano, Switzerland; Italian Institute of Technology, Via Morego 30, 16163 Genova, Italy
|
17
|
Wyttenbach N, Niederquell A, Kuentz M. Machine Estimation of Drug Melting Properties and Influence on Solubility Prediction. Mol Pharm 2020; 17:2660-2671. [DOI: 10.1021/acs.molpharmaceut.0c00355] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 12/22/2022]
Affiliation(s)
- Nicole Wyttenbach
  - Roche Pharmaceutical Research & Early Development, Pre-Clinical CMC, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4000 Basel, Switzerland
- Andreas Niederquell
  - University of Applied Sciences and Arts Northwestern Switzerland, Institute of Pharma Technology, Hofackerstr. 30, CH-4132 Muttenz, Switzerland
- Martin Kuentz
  - University of Applied Sciences and Arts Northwestern Switzerland, Institute of Pharma Technology, Hofackerstr. 30, CH-4132 Muttenz, Switzerland
|
18
|
Mulligan VK. The emerging role of computational design in peptide macrocycle drug discovery. Expert Opin Drug Discov 2020; 15:833-852. [PMID: 32345066 DOI: 10.1080/17460441.2020.1751117] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/21/2022]
Abstract
Drug discovery is a laborious process with rising cost per new drug. Peptide macrocycles are promising therapeutics, though conformational flexibility can reduce target affinity and specificity. Recent computational advancements address this problem by enabling rational design of rigidly folded peptide macrocycles. AREAS COVERED This review summarizes currently approved peptide macrocycle therapeutics and discusses advantages of mesoscale drugs over small molecules or protein therapeutics. It describes the history, rationale, and state of the art of computational tools, such as Rosetta, that allow the design of rigidly structured peptide macrocycles. The emerging pipeline for designing peptide macrocycle drugs is described, including current challenges in designing permeable molecules that can emulate the chameleonic behavior of natural macrocycles. Prospects for reducing computational cost and improving accuracy with emerging computational technologies are also discussed. EXPERT OPINION To embrace computational design of peptide macrocycle drugs, we must shift current attitudes regarding the role of computation in drug discovery, and move beyond Lipinski's rules. This technology has the potential to shift failures to earlier in silico stages of the drug discovery process, improving success rates in costly clinical trials. Given the available tools, now is the time for drug developers to incorporate peptide macrocycle design into drug discovery pipelines.
Affiliation(s)
- Vikram K Mulligan
  - Systems Biology, Center for Computational Biology, Flatiron Institute, New York, NY, USA
|
19
|
Anwar J, Leitold C, Peters B. Solid–solid phase equilibria in the NaCl–KCl system. J Chem Phys 2020; 152:144109. [DOI: 10.1063/5.0003224] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/23/2023] Open
Affiliation(s)
- Jamshed Anwar
  - Department of Chemistry, Lancaster University, Lancaster LA1 4YW, United Kingdom
- Christian Leitold
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
- Baron Peters
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
  - Department of Chemistry and Biochemistry, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
|
20
|
Abramov YA, Sun G, Zeng Q, Zeng Q, Yang M. Guiding Lead Optimization for Solubility Improvement with Physics-Based Modeling. Mol Pharm 2020; 17:666-673. [DOI: 10.1021/acs.molpharmaceut.9b01138] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 01/30/2023]
Affiliation(s)
- Yuriy A. Abramov
  - XtalPi Inc., 245 Main Street, Cambridge, Massachusetts 02142, United States
  - Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, North Carolina 27599, United States
- Guangxu Sun
  - XtalPi Inc., Shenzhen Jingtai Technology Co., Ltd., Floor 4, No. 9, Hualian Industrial Zone, Dalang Street, Longhua District, Shenzhen 518100, China
- Qiao Zeng
  - XtalPi Inc., Shenzhen Jingtai Technology Co., Ltd., Floor 4, No. 9, Hualian Industrial Zone, Dalang Street, Longhua District, Shenzhen 518100, China
- Qun Zeng
  - XtalPi Inc., Shenzhen Jingtai Technology Co., Ltd., Floor 4, No. 9, Hualian Industrial Zone, Dalang Street, Longhua District, Shenzhen 518100, China
- Mingjun Yang
  - XtalPi Inc., Shenzhen Jingtai Technology Co., Ltd., Floor 4, No. 9, Hualian Industrial Zone, Dalang Street, Longhua District, Shenzhen 518100, China
|
21
|
Kumoro AC, Retnowati DS, Ratnawati R, Widiyanti M. Estimation of aqueous solubility of starch from various botanical sources using Flory Huggins theory approach. Chem Eng Commun 2019. [DOI: 10.1080/00986445.2019.1691539] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/25/2022]
Affiliation(s)
- Andri Cahyo Kumoro
  - Department of Chemical Engineering, Faculty of Engineering, Universitas Diponegoro, Semarang, Indonesia
  - Institute of Food and Remedies Bio-Materials, Universitas Diponegoro, Semarang, Indonesia
- Diah Susetyo Retnowati
  - Department of Chemical Engineering, Faculty of Engineering, Universitas Diponegoro, Semarang, Indonesia
- Ratnawati Ratnawati
  - Department of Chemical Engineering, Faculty of Engineering, Universitas Diponegoro, Semarang, Indonesia
- Marissa Widiyanti
  - Department of Chemical Engineering, Faculty of Engineering, Universitas Diponegoro, Semarang, Indonesia
|
22
|
Boothroyd S, Anwar J. Solubility prediction for a soluble organic molecule via chemical potentials from density of states. J Chem Phys 2019; 151:184113. [PMID: 31731842 DOI: 10.1063/1.5117281] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022] Open
Abstract
While the solubility of a substance is a fundamental property of widespread significance, its prediction from first principles (starting from only the knowledge of the molecular structure of the solute and solvent) remains a challenge. Recently, we proposed a robust and efficient method to predict the solubility from the density of states of a solute-solvent system using classical molecular simulation. The efficiency, and indeed the generality, of the method has now been enhanced by extending it to calculate solution chemical potentials (rather than probability distributions as done previously), from which solubility may be accessed. The method has been employed to predict the chemical potential of Form 1 of urea in both water and methanol for a range of concentrations at ambient conditions and for two charge models. The chemical potential calculations were validated by thermodynamic integration with the two sets of values being in excellent agreement. The solubility determined from the chemical potentials for urea in water ranged from 0.46 to 0.50 mol kg⁻¹, while that for urea in methanol ranged from 0.62 to 0.85 mol kg⁻¹, over the temperature range 298-328 K. In common with other recent studies of solubility prediction from molecular simulation, the predicted solubilities differ markedly from experimental values, reflecting limitations of current forcefields.
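The step from chemical potentials to solubility can be illustrated with a short sketch: the solubility is the concentration at which the chemical potential of the solute in solution equals that of the solid. Everything concrete below is an assumption made for illustration, including the function names, the bisection search, and the ideal-solution toy model with its made-up parameters; none of it reproduces the density-of-states method or the urea data of the paper.

```python
import math

R = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def solubility_from_mu(mu_solution, mu_solid, c_lo, c_hi, tol=1e-10):
    """Find the concentration where mu_solution(c) equals mu_solid by bisection.

    Assumes mu_solution increases monotonically with c on [c_lo, c_hi].
    """
    if mu_solution(c_lo) > mu_solid or mu_solution(c_hi) < mu_solid:
        raise ValueError("equilibrium concentration is not bracketed")
    while c_hi - c_lo > tol:
        mid = 0.5 * (c_lo + c_hi)
        if mu_solution(mid) < mu_solid:
            c_lo = mid  # solution still undersaturated at mid
        else:
            c_hi = mid  # solution supersaturated at mid
    return 0.5 * (c_lo + c_hi)


def ideal_mu(c, mu0=-50.0, temperature=298.0):
    """Toy ideal-solution model (hypothetical values): mu0 + RT ln(c)."""
    return mu0 + R * temperature * math.log(c)


# Usage: made-up solid chemical potential of -51.0 kJ/mol.
c_sat = solubility_from_mu(ideal_mu, -51.0, 1e-3, 10.0)
```

In practice the solution chemical potential would be available only at a handful of simulated concentrations, so an interpolating function fitted to those points would take the place of `ideal_mu`.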
Affiliation(s)
- Simon Boothroyd
  - Chemical Theory and Computation, Department of Chemistry, Lancaster University, Lancaster LA1 4YB, United Kingdom
- Jamshed Anwar
  - Chemical Theory and Computation, Department of Chemistry, Lancaster University, Lancaster LA1 4YB, United Kingdom
|