1
Batool M, Azam NA, Zhu J, Haraguchi K, Zhao L, Akutsu T. A unified approach to inferring chemical compounds with the desired aqueous solubility. J Cheminform 2025; 17:37. [PMID: 40140978 PMCID: PMC11938699 DOI: 10.1186/s13321-025-00966-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2024] [Accepted: 02/02/2025] [Indexed: 03/28/2025] Open
Abstract
Aqueous solubility (AS) is a key physicochemical property that plays a crucial role in drug discovery and material design. We report a novel unified approach to predict and infer chemical compounds with the desired AS based on simple deterministic graph-theoretic descriptors, multiple linear regression (MLR), and mixed integer linear programming (MILP). Descriptors selected by a forward stepwise procedure enabled the simplest regression model, MLR, to achieve good prediction accuracy compared to existing approaches, with accuracy in the range [0.7191, 0.9377] for 29 diverse datasets. By simulating these descriptors and learning models as MILPs, we inferred mathematically exact and optimal compounds with the desired AS, prescribed structures, and up to 50 non-hydrogen atoms in a reasonable time range of [6, 1166] seconds. These findings indicate a strong correlation between the simple graph-theoretic descriptors and the AS of compounds, potentially leading to a deeper understanding of their AS without relying on widely used complicated chemical descriptors and complex machine learning models that are computationally expensive and therefore difficult to use for inference. An implementation of the proposed approach is available at https://github.com/ku-dml/mol-infer/tree/master/AqSol.
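As a rough illustration of the modeling idea in this abstract, the sketch below runs a greedy forward stepwise selection over candidate descriptors and refits a multiple linear regression at each step. The toy descriptor matrix, the use of training R² as the selection criterion, and the helper names `fit_mlr`, `r2`, and `forward_stepwise` are assumptions made for illustration; the abstract does not state the authors' selection criterion or descriptors, and the MILP inference stage is not shown.

```python
def fit_mlr(X, y):
    """Ordinary least squares via the normal equations (intercept first)."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(M[k][c]))
        M[c], M[piv] = M[piv], M[c]
        for k in range(p):
            if k != c:
                f = M[k][c] / M[c][c]
                M[k] = [a - f * b for a, b in zip(M[k], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


def r2(X, y, cols):
    """Training R^2 of an MLR model restricted to the descriptor columns `cols`."""
    sub = [[r[c] for c in cols] for r in X]
    w = fit_mlr(sub, y)
    pred = [w[0] + sum(wi * xi for wi, xi in zip(w[1:], r)) for r in sub]
    mean = sum(y) / len(y)
    ss_res = sum((t - q) ** 2 for t, q in zip(y, pred))
    ss_tot = sum((t - mean) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def forward_stepwise(X, y, k):
    """Greedily add the descriptor that most improves R^2 until k are chosen."""
    chosen = []
    while len(chosen) < k:
        rest = [c for c in range(len(X[0])) if c not in chosen]
        best = max(rest, key=lambda c: r2(X, y, chosen + [c]))
        chosen.append(best)
    return chosen


# Toy data (hypothetical): three descriptors, target depends on columns 0 and 2.
X = [[1, 5, 0], [2, 3, 1], [3, 8, 0], [4, 1, 1], [5, 7, 1], [6, 2, 0]]
y = [2 * a - 3 * c + 1 for a, _, c in X]

selected = forward_stepwise(X, y, 2)
print(selected)  # prints [0, 2]
```

In practice the selection criterion would normally be a cross-validated score rather than training R², which always favors adding more descriptors.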
Affiliation(s)
- Muniba Batool
  - Discrete Mathematics and Computational Intelligence Laboratory, Department of Mathematics, Quaid-i-Azam University, Islamabad, Pakistan
- Naveed Ahmed Azam
  - Discrete Mathematics and Computational Intelligence Laboratory, Department of Mathematics, Quaid-i-Azam University, Islamabad, Pakistan
- Jianshen Zhu
  - Discrete Mathematics Laboratory, Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, 606-8501, Kyoto, Japan
- Kazuya Haraguchi
  - Discrete Mathematics Laboratory, Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, 606-8501, Kyoto, Japan
- Liang Zhao
  - Graduate School of Advanced Integrated Studies in Human Survivability (Shishu-Kan), Kyoto University, 606-8306, Kyoto, Japan
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, 611-0011, Uji, Japan
2
Pitakbut T, Munkert J, Xi W, Wei Y, Fuhrmann G. Utilizing machine learning-based QSAR model to overcome standalone consensus docking limitation in beta-lactamase inhibitors screening: a proof-of-concept study. BMC Chem 2024; 18:249. [PMID: 39707439 DOI: 10.1186/s13065-024-01324-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2024] [Accepted: 10/16/2024] [Indexed: 12/23/2024] Open
Abstract
In virtual drug screening, consensus docking is a standard in silico approach that combines the results of at least two optimized docking experiments. By its mathematical nature, consensus docking therefore has a lower success rate than the best individual docking method, an unavoidable limitation. This study aims to overcome this drawback via random forest, an ensemble machine learning model. First, in vitro beta-lactamase inhibitory screening was performed using an in-house chemical library. The in vitro results were later used for validation. Next, we optimized docking protocols for the AutoDock Vina and DOCK6 programs. With an appropriate scoring function, we found that DOCK6 could identify up to 70% of all active molecules, double the rate obtained with an inappropriate scoring function. Further consensus analysis reduced the success rate to 50%. At the same time, the false positive rate fell to 16%, which was experimentally favorable for a drug search. Finally, we trained two quantitative structure-activity relationship (QSAR) models, using logistic regression as a reference model and random forest as a test model. After combining consensus docking results, the random forest-based QSAR model outperformed logistic regression by restoring the success rate to 70% while maintaining a low false positive rate of around 21%. In conclusion, this study demonstrated, as a proof of concept, the benefit of using a random forest (machine learning)-based QSAR model to overcome a standard consensus docking limitation in the search for beta-lactamase inhibitors.
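The limitation described in this abstract can be illustrated with set arithmetic. The sketch below assumes that consensus means keeping only compounds flagged by both programs (an intersection), that the success rate is the fraction of in vitro actives recovered, and that the false positive rate is the fraction of inactives flagged. The compound IDs, hit lists, and the helper `rates` are invented toy values, not data from the study.

```python
def rates(predicted, actives, library):
    """Return (success rate, false positive rate) for a set of predicted hits."""
    inactives = library - actives
    success = len(predicted & actives) / len(actives)
    false_pos = len(predicted & inactives) / len(inactives)
    return success, false_pos


# Toy library of 20 compounds; IDs 0-9 are treated as active in vitro.
library = set(range(20))
actives = set(range(10))

# Hypothetical hit lists from two docking programs.
hits_a = {0, 1, 2, 3, 4, 5, 6, 10, 11, 12}
hits_b = {0, 1, 2, 3, 4, 7, 8, 12, 13, 14, 15}

# Consensus as an intersection: a compound must be flagged by both programs.
consensus = hits_a & hits_b

for name, hits in [("A", hits_a), ("B", hits_b), ("consensus", consensus)]:
    print(name, rates(hits, actives, library))
```

Because the intersection is a subset of each hit list, its success rate can never exceed that of either program, while its false positive rate can only stay the same or drop. That trade-off is what a downstream classifier would need to recover from.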
Affiliation(s)
- Thanet Pitakbut
  - Department of Biology, Pharmaceutical Biology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Staudtstr. 5, 91058, Erlangen, Germany
  - Shenzhen Key Laboratory of Intelligent Bioinformatics and Center for High-Performance Computing, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Jennifer Munkert
  - Department of Biology, Pharmaceutical Biology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Staudtstr. 5, 91058, Erlangen, Germany
  - FAU NeW - Research Center New Bioactive Compounds, Nikolaus-Fiebiger-Str. 10, 91058, Erlangen, Germany
- Wenhui Xi
  - Shenzhen Key Laboratory of Intelligent Bioinformatics and Center for High-Performance Computing, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Yanjie Wei
  - Shenzhen Key Laboratory of Intelligent Bioinformatics and Center for High-Performance Computing, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Gregor Fuhrmann
  - Department of Biology, Pharmaceutical Biology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Staudtstr. 5, 91058, Erlangen, Germany
  - FAU NeW - Research Center New Bioactive Compounds, Nikolaus-Fiebiger-Str. 10, 91058, Erlangen, Germany
3
Chen H, Lu D, Xiao Z, Li S, Zhang W, Luan X, Zhang W, Zheng G. Comprehensive applications of the artificial intelligence technology in new drug research and development. Health Inf Sci Syst 2024; 12:41. [PMID: 39130617 PMCID: PMC11310389 DOI: 10.1007/s13755-024-00300-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Accepted: 07/27/2024] [Indexed: 08/13/2024] Open
Abstract
Purpose: The target-based strategy is a prevalent means of drug research and development (R&D), since targets provide the effector molecules of drug action and offer the foundation of pharmacological investigation. Recently, artificial intelligence (AI) technology has been utilized in various stages of drug R&D, where AI-assisted experimental methods show higher efficiency than purely experimental ones. There is a critical need for a comprehensive review of AI applications in drug R&D for the biopharmaceutical field. Methods: Relevant literature about AI-assisted drug R&D was collected from public databases (including Google Scholar, Web of Science, PubMed, IEEE Xplore Digital Library, Springer, and ScienceDirect) through a keyword search strategy with the following terms [("Artificial Intelligence" OR "Knowledge Graph" OR "Machine Learning") AND ("Drug Target Identification" OR "New Drug Development")]. Results: In this review, we first introduced common strategies and novel trends of drug R&D, followed by a characteristic description of AI algorithms widely used in drug R&D. Subsequently, we depicted detailed applications of AI algorithms in target identification, lead compound identification and optimization, drug repurposing, and drug analytical platform construction. Finally, we discussed the challenges and prospects of AI-assisted methods for drug discovery. Conclusion: Collectively, this review provides a comprehensive overview of AI applications in drug R&D and presents future perspectives for the biopharmaceutical field, which may promote the development of the drug industry.
Affiliation(s)
- Hongyu Chen
  - Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
- Dong Lu
  - Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
- Ziyi Xiao
  - Johns Hopkins Bloomberg School of Public Health, Baltimore, MD USA
- Shensuo Li
  - Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
- Wen Zhang
  - Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
- Xin Luan
  - Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
- Weidong Zhang
  - Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
- Guangyong Zheng
  - Shanghai Frontiers Science Center for Chinese Medicine Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
4
Zhao J, Hermans E, Sepassi K, Tistaert C, Bergström CAS, Ahmad M, Larsson P. Effect of Data Quality and Data Quantity on the Estimation of Intrinsic Solubility: Analysis Based on a Single-Source Data Set. Mol Pharm 2024; 21:5261-5271. [PMID: 39267585 PMCID: PMC11462503 DOI: 10.1021/acs.molpharmaceut.4c00685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2024] [Revised: 09/05/2024] [Accepted: 09/05/2024] [Indexed: 09/17/2024]
Abstract
Aqueous solubility is one of the most important physicochemical properties of drug molecules and a major driving force for oral drug absorption. To date, the performance of in silico models for the estimation of solubility for novel chemical space is limited. To investigate possible reasons and remedies for this, the Johnson and Johnson in-house aqueous solubility data with over 40,000 compounds was leveraged. All data were generated through the same high-throughput assay, providing a unique opportunity to explore the relationship between data quality, quantity, and model estimations. Six intrinsic solubility data sets with different sizes and noise levels were generated by making use of three different approaches: (i) inclusion or exclusion of amorphous solid residue, (ii) measured or experimental log D to identify the intrinsic solubility, and (iii) adopting or omitting a quality check process in the data processing workflow. A random forest regressor was trained on the data sets with three different sets of descriptors calculated from RDKit, ADMET predictor, or Mordred, and the performances were evaluated with nested cross-validation as well as ten refined test sets. The models confirm, as expected, that with the same data set size, high-quality data leads to better model performance. Models trained with larger data sets containing analytical variability can nonetheless give estimations as accurate as those of models trained with small, clean, and diverse data sets. However, noise introduced by the presence of amorphous solid after the solubility measurement in the training data set cannot be overcome by increasing data size, because it introduces a systematic positive error into the data set, confirming the importance of critical data review. Finally, two top-performing models were tested on the first test set from the second solubility challenge, achieving RMSE values of 0.74 and 0.72 and fractions of predictions within log S ± 0.5 of 46 and 48%, respectively. These results demonstrated improved performance compared to those reported in the findings of the competition, highlighting that a single-source curated data set can enhance the prediction of intrinsic solubility.
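The two evaluation figures quoted in this abstract, RMSE and the percentage of predictions within ±0.5 log units of the measured log S, can be computed as in the sketch below. The measured and predicted values are invented toy numbers, and the helper names `rmse` and `fraction_within` are hypothetical; nothing here reproduces the study's data or models.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fraction_within(y_true, y_pred, tol=0.5):
    """Fraction of predictions whose absolute error is at most `tol` log units."""
    return sum(abs(t - p) <= tol for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy log S values (hypothetical), in log10(mol/L).
measured = [-3.0, -4.5, -2.0, -5.0]
predicted = [-3.5, -4.0, -3.0, -5.0]

print(round(rmse(measured, predicted), 3))              # prints 0.612
print(100 * fraction_within(measured, predicted), "%")  # prints 75.0 %
```

The two metrics answer different questions: RMSE is dominated by the largest errors, whereas the within-tolerance fraction reports how often a prediction is usable at a fixed accuracy threshold.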
Affiliation(s)
- Jiaxi Zhao
  - Department of Pharmacy, Uppsala University, 751 23 Uppsala, Sweden
- Eline Hermans
  - Pharmaceutical & Material Sciences, Janssen Pharmaceutica NV, B-2340 Beerse, Belgium
- Kia Sepassi
  - Discovery Pharmaceutics, Janssen Research & Development, LLC, La Jolla, California 92121, United States
- Christophe Tistaert
  - Pharmaceutical & Material Sciences, Janssen Pharmaceutica NV, B-2340 Beerse, Belgium
- Mazen Ahmad
  - In Silico Discovery, Janssen Pharmaceutica NV, B-2340 Beerse, Belgium
- Per Larsson
  - Department of Pharmacy, Uppsala University, 751 23 Uppsala, Sweden
5
Kim Y, Jung H, Kumar S, Paton RS, Kim S. Designing solvent systems using self-evolving solubility databases and graph neural networks. Chem Sci 2024; 15:923-939. [PMID: 38239675 PMCID: PMC10793204 DOI: 10.1039/d3sc03468b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Accepted: 12/04/2023] [Indexed: 01/22/2024] Open
Abstract
Designing solvent systems is key to achieving the facile synthesis and separation of desired products from chemical processes, so many machine learning models have been developed to predict solubilities. However, breakthroughs are needed to address deficiencies in the model's predictive accuracy and generalizability; this can be addressed by expanding and integrating experimental and computational solubility databases. To maximize predictive accuracy, these two databases should not be trained separately, and they should not be simply combined without reconciling the discrepancies from different magnitudes of errors and uncertainties. Here, we introduce self-evolving solubility databases and graph neural networks developed through semi-supervised self-training approaches. Solubilities from quantum-mechanical calculations are referred to during semi-supervised learning, but they are not directly added to the experimental database. Dataset augmentation is performed from 11 637 experimental solubilities to >900 000 data points in the integrated database, while correcting for the discrepancies between experiment and computation. Our model was successfully applied to study solvent selection in organic reactions and separation processes. The accuracy (mean absolute error around 0.2 kcal mol⁻¹ for the test set) is quantitatively useful in exploring Linear Free Energy Relationships between reaction rates and solvation free energies for 11 organic reactions. Our model also accurately predicted the partition coefficients of lignin-derived monomers and drug-like molecules. While there is room for expanding solubility predictions to transition states, radicals, charged species, and organometallic complexes, this approach will be attractive to predictive chemistry areas where experimental, computational, and other heterogeneous data should be combined.
Affiliation(s)
- Yeonjoon Kim
  - Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
  - Department of Chemistry, Pukyong National University Busan 48513 Republic of Korea
- Hojin Jung
  - Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
- Sabari Kumar
  - Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
- Robert S Paton
  - Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
- Seonah Kim
  - Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
6
Gong Y, Ding W, Wang P, Wu Q, Yao X, Yang Q. Evaluating Machine Learning Methods of Analyzing Multiclass Metabolomics. J Chem Inf Model 2023; 63:7628-7641. [PMID: 38079572 DOI: 10.1021/acs.jcim.3c01525] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 12/26/2023]
Abstract
Multiclass metabolomic studies have become popular for revealing the differences in multiple stages of complex diseases, various lifestyles, or the effects of specific treatments. In multiclass metabolomics, there are multiple data manipulation steps for analyzing raw data, which consist of data filtering, the imputation of missing values, data normalization, marker identification, sample separation, classification, and so on. In each step, several to dozens of machine learning methods can be chosen for the given data set, with potentially hundreds or thousands of method combinations in the whole data processing chain. Therefore, a clear understanding of these machine learning methods is helpful for selecting an appropriate method combination for obtaining stable and reliable analytical results of specific data. However, there has rarely been an overall introduction or evaluation of these methods based on multiclass metabolomic data. Herein, detailed descriptions of these machine learning methods in multiple data manipulation steps are reviewed. Moreover, an assessment of these methods was performed using a benchmark data set for multiclass metabolomics. First, 12 imputation methods for imputing missing values were evaluated based on the PSS (Procrustes statistical shape analysis) and NRMSE (normalized root-mean-square error) values. Second, 17 normalization methods for processing multiclass metabolomic data were evaluated by applying the PMAD (pooled median absolute deviation) value. Third, different methods of identifying markers of multiclass metabolomics were evaluated based on the CWrel (relative weighted consistency) value. Fourth, nine classification methods for constructing multiclass models were assessed using the AUC (area under the curve) value. Performance evaluations of machine learning methods are highly recommended to select the most appropriate method combination before performing the final analysis of the given data. Overall, detailed descriptions and evaluation of various machine learning methods are expected to improve analyses of multiclass metabolomic data.
Affiliation(s)
- Yaguo Gong
  - State Key Laboratory of Quality Research in Chinese Medicine, School of Pharmacy, Macau University of Science and Technology, Macau 999078, China
- Wei Ding
  - State Key Laboratory of Quality Research in Chinese Medicine, School of Pharmacy, Macau University of Science and Technology, Macau 999078, China
- Panpan Wang
  - College of Chemistry and Pharmaceutical Engineering, Huanghuai University, Zhumadian 463000, China
- Qibiao Wu
  - State Key Laboratory of Quality Research in Chinese Medicine, School of Pharmacy, Macau University of Science and Technology, Macau 999078, China
- Xiaojun Yao
  - Centre for Artificial Intelligence Driven Drug Discovery, Faculty of Applied Sciences, Macao Polytechnic University, Macao 999078, China
- Qingxia Yang
  - Zhejiang Provincial Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
  - Department of Bioinformatics, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
7
Hong RS, Rojas AV, Bhardwaj RM, Wang L, Mattei A, Abraham NS, Cusack KP, Pierce MO, Mondal S, Mehio N, Bordawekar S, Kym PR, Abel R, Sheikh AY. Free Energy Perturbation Approach for Accurate Crystalline Aqueous Solubility Predictions. J Med Chem 2023; 66:15883-15893. [PMID: 38016916 DOI: 10.1021/acs.jmedchem.3c01339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2023]
Abstract
Early assessment of crystalline thermodynamic solubility continues to be elusive for drug discovery and development despite its critical importance, especially for the ever-increasing fraction of poorly soluble drug candidates. Here we present a detailed evaluation of a physics-based free energy perturbation (FEP+) approach for computing the thermodynamic aqueous solubility. The predictive power of this approach is assessed across diverse chemical spaces, spanning pharmaceutically relevant literature compounds and more complex AbbVie compounds. Our approach achieves predictive (RMSE = 0.86) and differentiating power (R² = 0.69) and therefore provides notably improved correlations to experimental solubility compared to state-of-the-art machine learning approaches that utilize quantum mechanics-based descriptors. The importance of explicit considerations of crystalline packing in predicting solubility by the FEP+ approach is also highlighted in this study. Finally, we show how computed energetics, including hydration and sublimation free energies, can provide further insights into molecule design to feed the medicinal chemistry DMTA cycle.
Affiliation(s)
- Richard S Hong
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- Ana V Rojas
  - Schrödinger Inc., 1540 Broadway 24th Floor, New York, New York 10036, United States
- Rajni Miglani Bhardwaj
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- Lingle Wang
  - Schrödinger Inc., 1540 Broadway 24th Floor, New York, New York 10036, United States
- Alessandra Mattei
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- Nathan S Abraham
  - Ventus Therapeutics, 100 Beaver St, Waltham, Massachusetts 02453, United States
- Kevin P Cusack
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- M Olivia Pierce
  - Bristol Myers Squibb, 100 Binney Street, Cambridge, Massachusetts 02142, United States
- Sayan Mondal
  - Schrödinger Inc., 1540 Broadway 24th Floor, New York, New York 10036, United States
- Nada Mehio
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- Shailendra Bordawekar
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- Philip R Kym
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- Robert Abel
  - Schrödinger Inc., 1540 Broadway 24th Floor, New York, New York 10036, United States
- Ahmad Y Sheikh
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States