1. Dey S, Wallqvist A, AbdulHameed MDM. Developing muscarinic receptor M1 classification models utilizing transfer learning and generative AI techniques. Sci Rep 2025; 15:16486. [PMID: 40355481 PMCID: PMC12069682 DOI: 10.1038/s41598-025-00972-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2025] [Accepted: 05/02/2025] [Indexed: 05/14/2025] Open
Abstract
Muscarinic receptor subtype 1 (M1) is a G protein-coupled receptor (GPCR) and a key pharmacological target for peripheral neuropathy, chronic obstructive pulmonary disease, nerve agent exposures, and cognitive disorders. Screening and identifying compounds with potential to interact with M1 will aid in rational drug design for these disorders. In this work, we developed machine learning-based M1 classification models utilizing publicly available bioactivity data. As inactive compounds are rarely reported in the literature, we encountered the problem of imbalanced datasets. We investigated two strategies to overcome this bottleneck: 1) transfer learning and 2) using generative models to oversample the inactive class. Our analysis shows that these approaches reduced misclassification of the inactive class not only for M1 but also for other GPCR targets. Overall, we have developed classification models for M1 receptor that will enable rapid screening of large chemical databases and advance drug discovery.
Affiliation(s)
- Souvik Dey
  - Department of Defense Biotechnology High Performance Computing Software Applications Institute, Defense Health Agency Research and Development, Medical Research and Development Command, 504 Scott Street, Fort Detrick, MD, 21702-5012, USA
  - The Henry M. Jackson Foundation for the Advancement of Military Medicine, Inc., Bethesda, MD, USA
- Anders Wallqvist
  - Department of Defense Biotechnology High Performance Computing Software Applications Institute, Defense Health Agency Research and Development, Medical Research and Development Command, 504 Scott Street, Fort Detrick, MD, 21702-5012, USA
- Mohamed Diwan M AbdulHameed
  - Department of Defense Biotechnology High Performance Computing Software Applications Institute, Defense Health Agency Research and Development, Medical Research and Development Command, 504 Scott Street, Fort Detrick, MD, 21702-5012, USA
  - The Henry M. Jackson Foundation for the Advancement of Military Medicine, Inc., Bethesda, MD, USA

2. Li J, Zhang J, Guo R, Dai J, Niu Z, Wang Y, Wang T, Jiang X, Hu W. Progress of machine learning in the application of small molecule druggability prediction. Eur J Med Chem 2025; 285:117269. [PMID: 39808972 DOI: 10.1016/j.ejmech.2025.117269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2024] [Revised: 01/07/2025] [Accepted: 01/08/2025] [Indexed: 01/16/2025]
Abstract
Machine learning (ML) has become an important tool for predicting the pharmaceutical properties of small molecules. Recent advancements in ML algorithms enable the rapid and accurate evaluation of solubility, activity, toxicity, pharmacokinetics, and other molecular properties through ML-based models. By conducting virtual screening of drug targets and elucidating drug-target protein interactions, researchers can conduct preliminary evaluations of the activity and safety of compounds from the ultra-large drug compound libraries, thereby accelerating the screening process for lead compounds. Moreover, ML leverages existing experimental data to train and generate new datasets, addressing the challenge of limited compounds and protein target data. This review provided a concise overview of ML applications in predicting small molecule properties, focusing on model construction principles, molecular feature selection, and other essential aspects. It also discussed the potential applications of ML in the screening of pharmaceutical small molecules.
Affiliation(s)
- Junyao Li
  - School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou, China
  - School of Life Sciences, Huaiyin Normal University, Huaian, 223300, China
  - Institute of Translational Medicine, School of Medicine, Yangzhou University, Yangzhou, 225009, China
- Jianmei Zhang
  - School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou, China
- Rui Guo
  - School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou, China
  - Institute of Translational Medicine, School of Medicine, Yangzhou University, Yangzhou, 225009, China
- Jiawei Dai
  - Institute of Translational Medicine, School of Medicine, Yangzhou University, Yangzhou, 225009, China
- Zhiqiang Niu
  - Institute of Translational Medicine, School of Medicine, Yangzhou University, Yangzhou, 225009, China
- Yan Wang
  - School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou, China
- Taoyun Wang
  - School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou, China
- Xiaojian Jiang
  - School of Life Sciences, Huaiyin Normal University, Huaian, 223300, China
- Weicheng Hu
  - Institute of Translational Medicine, School of Medicine, Yangzhou University, Yangzhou, 225009, China

3. Wellnitz J, Jain S, Hochuli JE, Maxfield T, Muratov EN, Tropsha A, Zakharov AV. One size does not fit all: revising traditional paradigms for assessing accuracy of QSAR models used for virtual screening. J Cheminform 2025; 17:7. [PMID: 39819357 PMCID: PMC11740363 DOI: 10.1186/s13321-025-00948-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2024] [Accepted: 01/03/2025] [Indexed: 01/19/2025] Open
Abstract
Traditional best practices for quantitative structure activity relationship (QSAR) modeling recommend dataset balancing and balanced accuracy (BA) as the key desired objective of model development. This study explores the value of the conventional norms in the context of using QSAR models for virtual screening of modern large and ultra-large chemical libraries. For this increasingly common task, we now recommend the use of models with the highest positive predictive value (PPV) built on imbalanced training sets as preferred virtual screening tools. This recommendation stems from practical considerations of how the results of virtual screening are used in experimental laboratories where only a small fraction of virtually screened molecules can be tested using standard well plates. As a proof of concept, we have developed QSAR models for five expansive datasets with different ratios of active and inactive molecules and compared model performance in virtual screening using BA, PPV, and other metrics. We show that training on imbalanced datasets achieves a hit rate at least 30% higher than using balanced datasets, and that the PPV metric captured this difference of performance with no parameter tuning. Importantly, hit rates were estimated for top scoring compounds organized in batches of the size of plates (for instance, 128 molecules) used in the experimental high throughput screening. Based on the results of our studies, we posit that QSAR models trained on imbalanced datasets with the highest PPV should be relied upon to identify and test hit compounds in early drug discovery studies.
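The difference between balanced accuracy and PPV on a plate-sized batch can be illustrated with a small calculation. The sketch below is not the authors' code: the function names (`ppv`, `balanced_accuracy`, `hit_rate_top_n`), the toy labels and scores, and the 0.5 decision threshold are all assumptions introduced here for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = active, 0 = inactive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def ppv(y_true, y_pred):
    """Positive predictive value: fraction of predicted actives that are true actives."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


def hit_rate_top_n(y_true, scores, n):
    """Fraction of true actives among the n highest-scoring compounds (one 'plate')."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:n]
    return sum(label for _, label in top) / len(top) if top else 0.0


# Hypothetical toy data: 8 compounds, 2 of them active.
y_true = [1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.1, 0.4, 0.2, 0.3, 0.05, 0.0]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("PPV:", ppv(y_true, y_pred))
print("Balanced accuracy:", balanced_accuracy(y_true, y_pred))
print("Hit rate in top 2:", hit_rate_top_n(y_true, scores, 2))
```

In a real screen the batch size `n` would match the plate format (the abstract mentions 128 molecules as an example), and the hit rate in that batch is the quantity PPV is meant to track.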
Affiliation(s)
- James Wellnitz
  - Division of Chemical Biology and Medicinal Chemistry, Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, 27599, USA
- Sankalp Jain
  - National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, MD, 20850, USA
- Joshua E Hochuli
  - Division of Chemical Biology and Medicinal Chemistry, Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, 27599, USA
- Travis Maxfield
  - Division of Chemical Biology and Medicinal Chemistry, Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, 27599, USA
- Eugene N Muratov
  - Division of Chemical Biology and Medicinal Chemistry, Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, 27599, USA
- Alexander Tropsha
  - Division of Chemical Biology and Medicinal Chemistry, Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, 27599, USA
- Alexey V Zakharov
  - National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, MD, 20850, USA

4. Yin Y, Lam HYI, Mu Y, Li HY, Kong AWK. Advancing Bioactivity Prediction Through Molecular Docking and Self-Attention. IEEE J Biomed Health Inform 2024; 28:7599-7610. [PMID: 39178096 DOI: 10.1109/jbhi.2024.3448455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/25/2024]
Abstract
Bioactivity refers to the ability of a substance to induce biological effects within living systems, often describing the influence of molecules, drugs, or chemicals on organisms. In drug discovery, predicting bioactivity streamlines early-stage candidate screening by swiftly identifying potential active molecules. The popular deep learning methods in bioactivity prediction primarily model the ligand structure-bioactivity relationship under the premise of Quantitative Structure-Activity Relationship (QSAR). However, bioactivity is determined by multiple factors, including not only the ligand structure but also drug-target interactions, signaling pathways, reaction environments, pharmacokinetic properties, and species differences. Our study first integrates drug-target interactions into bioactivity prediction using protein-ligand complex data from molecular docking. We devise a Drug-Target Interaction Graph Neural Network (DTIGN), infusing interatomic forces into intermolecular graphs. DTIGN employs multi-head self-attention to identify native-like binding pockets and poses within molecular docking results. To validate the fidelity of the self-attention mechanism, we gather ground truth data from crystal structure databases. Subsequently, we employ these limited native structures to refine bioactivity prediction via semi-supervised learning. For this study, we establish a unique benchmark dataset for evaluating bioactivity prediction models in the context of protein-ligand complexes, showcasing the superior performance of our method (with an average improvement of 27.03%) through comparison with 9 leading deep learning-based bioactivity prediction methods.

5. Gangwal A, Lavecchia A. Unleashing the power of generative AI in drug discovery. Drug Discov Today 2024; 29:103992. [PMID: 38663579 DOI: 10.1016/j.drudis.2024.103992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2024] [Revised: 03/22/2024] [Accepted: 04/18/2024] [Indexed: 05/04/2024]
Abstract
Artificial intelligence (AI) is revolutionizing drug discovery by enhancing precision, reducing timelines and costs, and enabling AI-driven computer-aided drug design. This review focuses on recent advancements in deep generative models (DGMs) for de novo drug design, exploring diverse algorithms and their profound impact. It critically analyses the challenges that are intricately interwoven into these technologies, proposing strategies to unlock their full potential. It features case studies of both successes and failures in advancing drugs to clinical trials with AI assistance. Last, it outlines a forward-looking plan for optimizing DGMs in de novo drug design, thereby fostering faster and more cost-effective drug development.
Affiliation(s)
- Amit Gangwal
  - Department of Natural Product Chemistry, Shri Vile Parle Kelavani Mandal's Institute of Pharmacy, Dhule 424001, Maharashtra, India
- Antonio Lavecchia
  - "Drug Discovery" Laboratory, Department of Pharmacy, University of Naples Federico II, I-80131 Naples, Italy

6. Brocidiacono M, Francoeur P, Aggarwal R, Popov KI, Koes DR, Tropsha A. BigBind: Learning from Nonstructural Data for Structure-Based Virtual Screening. J Chem Inf Model 2024; 64:2488-2495. [PMID: 38113513 DOI: 10.1021/acs.jcim.3c01211] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 12/21/2023]
Abstract
Deep learning methods that predict protein-ligand binding have recently been used for structure-based virtual screening. Many such models have been trained using protein-ligand complexes with known crystal structures and activities from the PDBbind data set. However, because PDBbind only includes 20K complexes, models typically fail to generalize to new targets, and model performance is on par with models trained with only ligand information. Conversely, the ChEMBL database contains a wealth of chemical activity information but includes no information about binding poses. We introduce BigBind, a data set that maps ChEMBL activity data to proteins from the CrossDocked data set. BigBind comprises 583K ligand activities and includes 3D structures of the protein binding pockets. Additionally, we augmented the data by adding an equal number of putative inactives for each target. Using this data, we developed Banana (basic neural network for binding affinity), a neural network-based model to classify active from inactive compounds, defined by a 10 μM cutoff. Our model achieved an AUC of 0.72 on BigBind's test set, while a ligand-only model achieved an AUC of 0.59. Furthermore, Banana achieved competitive performance on the LIT-PCBA benchmark (median EF1% 1.81) while running 16,000 times faster than molecular docking with Gnina. We suggest that Banana, as well as other models trained on this data set, will significantly improve the outcomes of prospective virtual screening tasks.
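The EF1% figure quoted above is an enrichment factor: the hit rate among the top-scoring 1% of compounds divided by the hit rate in the whole library. A minimal sketch of that calculation follows, assuming the common textbook definition; the function name `enrichment_factor`, the rounding rule for the cutoff, and the toy data are assumptions made here, and the benchmark's own evaluation code may differ in detail.

```python
def enrichment_factor(y_true, scores, fraction=0.01):
    """Enrichment factor at the given top fraction of a score-ranked library.

    y_true: 1 for active, 0 for inactive. scores: higher means 'more likely active'.
    """
    n_total = len(y_true)
    n_actives = sum(y_true)
    if n_total == 0 or n_actives == 0:
        return 0.0
    # Assumed convention: round to the nearest compound, but keep at least one.
    n_top = max(1, round(n_total * fraction))
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    actives_in_top = sum(label for _, label in ranked[:n_top])
    return (actives_in_top / n_top) / (n_actives / n_total)


# Hypothetical library: 100 compounds, 10 actives, scores already in descending order.
scores = list(range(100, 0, -1))
labels = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 85

print("EF10%:", enrichment_factor(labels, scores, fraction=0.10))
print("EF1%:", enrichment_factor(labels, scores, fraction=0.01))
```

An enrichment factor of 1 corresponds to random selection, so values above 1 indicate that the ranking concentrates actives at the top of the list.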
Affiliation(s)
- Michael Brocidiacono
  - Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
- Paul Francoeur
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Rishal Aggarwal
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Konstantin I Popov
  - Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
- David Ryan Koes
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Alexander Tropsha
  - Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States

7. Cui S, Gao Y, Huang Y, Shen L, Zhao Q, Pan Y, Zhuang S. Advances and applications of machine learning and deep learning in environmental ecology and health. Environmental Pollution (Barking, Essex: 1987) 2023; 335:122358. [PMID: 37567408 DOI: 10.1016/j.envpol.2023.122358] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/11/2023] [Revised: 08/02/2023] [Accepted: 08/08/2023] [Indexed: 08/13/2023]
Abstract
Machine learning (ML) and deep learning (DL) offer clear advantages in data analysis (e.g., feature extraction, clustering, classification, regression, image recognition and prediction) and in risk assessment and management in environmental ecology and health (EEH). Considering the rapid growth and increasing complexity of data in EEH, it is important to summarize recent advances and applications of ML and DL in EEH. This review summarized the basic processes and fundamental algorithms of ML and DL modeling, and indicated the urgent needs for ML and DL in EEH. Recent research hotspots such as environmental ecology and restoration, environmental fate of new pollutants, chemical exposures and risks, and chemical hazard identification and control were highlighted. Various applications of ML and DL in EEH demonstrate their versatility and technological revolution, and present some challenges. The perspectives of ML and DL in EEH were further outlined to promote the innovative analysis and cultivation of the ML-driven research paradigm.
Affiliation(s)
- Shixuan Cui
  - Key Laboratory of Environment Remediation and Ecological Health, Ministry of Education, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
  - Women's Hospital, School of Medicine, Zhejiang University, Hangzhou, 310006, China
- Yuchen Gao
  - Key Laboratory of Environment Remediation and Ecological Health, Ministry of Education, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
- Yizhou Huang
  - Women's Hospital, School of Medicine, Zhejiang University, Hangzhou, 310006, China
- Lilai Shen
  - Key Laboratory of Environment Remediation and Ecological Health, Ministry of Education, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
- Qiming Zhao
  - Key Laboratory of Environment Remediation and Ecological Health, Ministry of Education, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
- Yaru Pan
  - Key Laboratory of Environment Remediation and Ecological Health, Ministry of Education, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
- Shulin Zhuang
  - Key Laboratory of Environment Remediation and Ecological Health, Ministry of Education, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
  - Women's Hospital, School of Medicine, Zhejiang University, Hangzhou, 310006, China

8. Nittinger E, Clark A, Gaulton A, Zdrazil B. Biomedical data analyses facilitated by open cheminformatics workflows. J Cheminform 2023; 15:46. [PMID: 37069670 PMCID: PMC10108476 DOI: 10.1186/s13321-023-00718-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 04/19/2023] Open
Affiliation(s)
- Eva Nittinger
  - Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
- Alex Clark
  - Research Informatics, Collaborative Drug Discovery, Inc., Ottawa, Canada
- Barbara Zdrazil
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK

9. Kalemati M, Zamani Emani M, Koohi S. BiComp-DTA: Drug-target binding affinity prediction through complementary biological-related and compression-based featurization approach. PLoS Comput Biol 2023; 19:e1011036. [PMID: 37000857 PMCID: PMC10096306 DOI: 10.1371/journal.pcbi.1011036] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 06/05/2022] [Revised: 04/12/2023] [Accepted: 03/20/2023] [Indexed: 04/03/2023] Open
Abstract
Drug-target binding affinity prediction plays a key role in the early stage of drug discovery. Numerous experimental and data-driven approaches have been developed for predicting drug-target binding affinity. However, experimental methods rely heavily on the limited structural information from drug-target pairs, domain knowledge, and time-consuming assays. On the other hand, learning-based methods have shown an acceptable prediction performance. However, most of them utilize several simple and complex types of protein and drug compound data, ranging from protein sequences to the topology of a graph representation of drug compounds, and employ multiple deep neural networks for encoding and feature extraction, which leads to computational overhead. In this study, we propose a unified measure for protein sequence encoding, named BiComp, which provides compression-based and evolutionary-related features from the protein sequences. Specifically, we employ Normalized Compression Distance and Smith-Waterman measures for capturing complementary information from the algorithmic information theory and biological domains, respectively. We utilize the proposed measure to encode the input proteins feeding a new deep neural network-based method for drug-target binding affinity prediction, named BiComp-DTA. BiComp-DTA is evaluated utilizing four benchmark datasets for drug-target binding affinity prediction. Compared to the state-of-the-art methods, which employ complex models for protein encoding and feature extraction, BiComp-DTA provides superior efficiency in terms of accuracy, runtime, and the number of trainable parameters. The latter achievement facilitates fast execution of BiComp-DTA on a normal desktop computer. As a comparative study, we evaluate BiComp's efficiency against its components for drug-target binding affinity prediction. The results show superior accuracy of BiComp due to the orthogonality and complementary nature of Smith-Waterman and Normalized Compression Distance measures for protein sequences. Such a protein sequence encoding provides efficient representation with no need for multiple sources of information, deep domain knowledge, and complex neural networks.
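For readers unfamiliar with the Normalized Compression Distance mentioned in the abstract, it is usually defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length of a string. The sketch below illustrates that standard definition only. The choice of `zlib` as the compressor, the function names, and the toy sequences are assumptions made here; the abstract does not say which compressor or preprocessing BiComp uses.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical toy 'sequences' over the 20-letter amino acid alphabet.
seq_a = b"ACDEFGHIKLMNPQRSTVWY" * 10
seq_b = b"YWVTSRQPNMLKIHGFEDCA" * 10

print("NCD(a, a):", ncd(seq_a, seq_a))
print("NCD(a, b):", ncd(seq_a, seq_b))
```

Values near 0 indicate that one sequence adds little new information given the other, while values near 1 indicate that the two share little compressible structure. Real compressors only approximate the ideal measure, so results depend on the compressor and on sequence length.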
Affiliation(s)
- Mahmood Kalemati
  - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Mojtaba Zamani Emani
  - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Somayyeh Koohi
  - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

10. Ricardo F, Ruiz-Puentes P, Reyes LH, Cruz JC, Alvarez O, Pradilla D. Estimation and prediction of the air–water interfacial tension in conventional and peptide surface-active agents by random Forest regression. Chem Eng Sci 2023. [DOI: 10.1016/j.ces.2022.118208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]

11. Fassio AV, Shub L, Ponzoni L, McKinley J, O’Meara MJ, Ferreira RS, Keiser MJ, de Melo Minardi RC. Prioritizing Virtual Screening with Interpretable Interaction Fingerprints. J Chem Inf Model 2022; 62:4300-4318. [DOI: 10.1021/acs.jcim.2c00695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Affiliation(s)
- Alexandre V. Fassio
  - São Carlos Institute of Physics, University of São Paulo, São Carlos, São Paulo 13563-120, Brazil
  - Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais 31270-901, Brazil
- Laura Shub
  - Department of Pharmaceutical Chemistry, Department of Bioengineering & Therapeutic Sciences, Institute for Neurodegenerative Diseases, Kavli Institute for Fundamental Neuroscience, Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California 94143, United States
- Luca Ponzoni
  - Department of Pharmaceutical Chemistry, Department of Bioengineering & Therapeutic Sciences, Institute for Neurodegenerative Diseases, Kavli Institute for Fundamental Neuroscience, Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California 94143, United States
- Jessica McKinley
  - Gilead Sciences, Inc., Foster City, California 94404, United States
- Matthew J. O’Meara
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, United States
- Rafaela S. Ferreira
  - Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais 31270-901, Brazil
- Michael J. Keiser
  - Department of Pharmaceutical Chemistry, Department of Bioengineering & Therapeutic Sciences, Institute for Neurodegenerative Diseases, Kavli Institute for Fundamental Neuroscience, Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California 94143, United States
- Raquel C. de Melo Minardi
  - Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais 31270-901, Brazil

12. López-López E, Fernández-de Gortari E, Medina-Franco JL. Yes SIR! On the structure-inactivity relationships in drug discovery. Drug Discov Today 2022; 27:2353-2362. [PMID: 35561964 DOI: 10.1016/j.drudis.2022.05.005] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 02/02/2022] [Revised: 04/09/2022] [Accepted: 05/05/2022] [Indexed: 12/12/2022]
Abstract
In analogy with structure-activity relationships (SARs), which are at the core of medicinal chemistry, studying structure-inactivity relationships (SIRs) is essential to understanding and predicting biological activity. Current computational methods should predict or distinguish 'activity' and 'inactivity' with the same confidence because both concepts are complementary. However, the lack of inactivity data, in particular in the public domain, limits the development of predictive models and its broad application. In this review, we encourage the scientific community to disclose and analyze high-confidence activity data considering both the labeled 'active' and 'inactive' compounds.
Affiliation(s)
- Edgar López-López
  - DIFACQUIM Research Group, Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico
  - Department of Chemistry and Graduate Program in Pharmacology, Center for Research and Advanced Studies of the National Polytechnic Institute, Mexico City 07000, Mexico
- Eli Fernández-de Gortari
  - Department of Nanosafety, International Iberian Nanotechnology Laboratory, Braga 4715-330, Portugal
- José L Medina-Franco
  - DIFACQUIM Research Group, Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico

13. Duan C, Nandy A, Kulik HJ. Machine Learning for the Discovery, Design, and Engineering of Materials. Annu Rev Chem Biomol Eng 2022; 13:405-429. [PMID: 35320698 DOI: 10.1146/annurev-chembioeng-092320-120230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Machine learning (ML) has become a part of the fabric of high-throughput screening and computational discovery of materials. Despite its increasingly central role, challenges remain in fully realizing the promise of ML. This is especially true for the practical acceleration of the engineering of robust materials and the development of design strategies that surpass trial and error or high-throughput screening alone. Depending on the quantity being predicted and the experimental data available, ML can either outperform physics-based models, be used to accelerate such models, or be integrated with them to improve their performance. We cover recent advances in algorithms and in their application that are starting to make inroads toward (a) the discovery of new materials through large-scale enumerative screening, (b) the design of materials through identification of rules and principles that govern materials properties, and (c) the engineering of practical materials by satisfying multiple objectives. We conclude with opportunities for further advancement to realize ML as a widespread tool for practical computational materials design. Expected final online publication date for the Annual Review of Chemical and Biomolecular Engineering, Volume 13 is October 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Affiliation(s)
- Chenru Duan
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Aditya Nandy
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Heather J Kulik
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

14. Patrick Walters W. Comparing classification models-a practical tutorial. J Comput Aided Mol Des 2021; 36:381-389. [PMID: 34549368 DOI: 10.1007/s10822-021-00417-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/27/2021] [Accepted: 08/18/2021] [Indexed: 01/17/2023]
Abstract
While machine learning models have become a mainstay in Cheminformatics, the field has yet to agree on standards for model evaluation and comparison. In many cases, authors compare methods by performing multiple folds of cross-validation and reporting the mean value for an evaluation metric such as the area under the receiver operating characteristic. These comparisons of mean values often lack statistical rigor and can lead to inaccurate conclusions. In the interest of encouraging best practices, this tutorial provides an example of how multiple methods can be compared in a statistically rigorous fashion.

15. Nandy A, Duan C, Taylor MG, Liu F, Steeves AH, Kulik HJ. Computational Discovery of Transition-metal Complexes: From High-throughput Screening to Machine Learning. Chem Rev 2021; 121:9927-10000. [PMID: 34260198 DOI: 10.1021/acs.chemrev.1c00347] [Citation(s) in RCA: 107] [Impact Index Per Article: 26.8] [Indexed: 12/13/2022]
Abstract
Transition-metal complexes are attractive targets for the design of catalysts and functional materials. The behavior of the metal-organic bond, while very tunable for achieving target properties, is challenging to predict and necessitates searching a wide and complex space to identify needles in haystacks for target applications. This review will focus on the techniques that make high-throughput search of transition-metal chemical space feasible for the discovery of complexes with desirable properties. The review will cover the development, promise, and limitations of "traditional" computational chemistry (i.e., force field, semiempirical, and density functional theory methods) as it pertains to data generation for inorganic molecular discovery. The review will also discuss the opportunities and limitations in leveraging experimental data sources. We will focus on how advances in statistical modeling, artificial intelligence, multiobjective optimization, and automation accelerate discovery of lead compounds and design rules. The overall objective of this review is to showcase how bringing together advances from diverse areas of computational chemistry and computer science has enabled the rapid uncovering of structure-property relationships in transition-metal chemistry. We aim to highlight how unique considerations in motifs of metal-organic bonding (e.g., variable spin and oxidation state, and bonding strength/nature) set them and their discovery apart from more commonly considered organic molecules. We will also highlight how uncertainty and relative data scarcity in transition-metal chemistry motivate specific developments in machine learning representations, model training, and in computational chemistry. Finally, we will conclude with an outlook on areas of opportunity for the accelerated discovery of transition-metal complexes.
Affiliation(s)
- Aditya Nandy
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Chenru Duan
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Michael G Taylor
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Fang Liu
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Adam H Steeves
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Heather J Kulik
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
|
16
|
Duan C, Liu F, Nandy A, Kulik HJ. Putting Density Functional Theory to the Test in Machine-Learning-Accelerated Materials Discovery. J Phys Chem Lett 2021; 12:4628-4637. [PMID: 33973793 DOI: 10.1021/acs.jpclett.1c00631] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Indexed: 05/20/2023]
Abstract
Accelerated discovery with machine learning (ML) has begun to provide the advances in efficiency needed to overcome the combinatorial challenge of computational materials design. Nevertheless, ML-accelerated discovery both inherits the biases of training data derived from density functional theory (DFT) and leads to many attempted calculations that are doomed to fail. Many compelling functional materials and catalytic processes involve strained chemical bonds, open-shell radicals and diradicals, or metal-organic bonds to open-shell transition-metal centers. Although promising targets, these materials present unique challenges for electronic structure methods and combinatorial challenges for their discovery. In this Perspective, we describe the advances needed in accuracy, efficiency, and approach beyond what is typical in conventional DFT-based ML workflows. These challenges have begun to be addressed through ML models trained to predict the results of multiple methods or the differences between them, enabling quantitative sensitivity analysis. For DFT to be trusted for a given data point in a high-throughput screen, it must pass a series of tests. ML models that predict the likelihood of calculation success and detect the presence of strong correlation will enable rapid diagnoses and adaptation strategies. These "decision engines" represent the first steps toward autonomous workflows that avoid the need for expert determination of the robustness of DFT-based materials discoveries.
Affiliation(s)
- Chenru Duan
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Fang Liu
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Aditya Nandy
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Heather J Kulik
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
|