1
Lv Q, Chen G, Yang Z, Zhong W, Chen CYC. Meta-MolNet: A Cross-Domain Benchmark for Few Examples Drug Discovery. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:4849-4863. [PMID: 40038923 DOI: 10.1109/tnnls.2024.3359657] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Indexed: 03/06/2025]
Abstract
Predicting the pharmacological activity, toxicity, and pharmacokinetic properties of molecules is a central task in drug discovery. Existing machine learning methods are transferred from one resource rich molecular property to another data scarce property in the same scaffold dataset. However, existing models may produce fragile and highly uncertain predictions for new scaffold molecules. And these models were tested on different benchmarks, which seriously affected the quality of their evaluation results. In this article, we introduce Meta-MolNet, a collection of data benchmark and algorithms, which is a standard benchmark platform for measuring model generalization and uncertainty quantification capabilities. Meta-MolNet manages a wide range of molecular datasets with high ratio of molecules/scaffolds, which often leads to more difficult data shift and generalization problems. Furthermore, we propose a graph attention network based on cross-domain meta-learning, Meta-GAT, which uses bilevel optimization to learn meta-knowledge from the scaffold family molecular dataset in the source domain. Meta-GAT benefits from meta-knowledge that reduces the requirement of sample complexity to enable reliable predictions of new scaffold molecules in the target domain through internal iteration of a few examples. We evaluate existing methods as baselines for the community, and the Meta-MolNet benchmark demonstrates the effectiveness of measuring the proposed algorithm in domain generalization and uncertainty quantification. Extensive experiments demonstrate that the Meta-GAT model has state-of-the-art domain generalization performance and robustly estimates uncertainty under few examples constraints. By publishing AI-ready data, evaluation frameworks, and baseline results, we hope to see the Meta-MolNet suite become a comprehensive resource for the AI-assisted drug discovery community. Meta-MolNet is freely accessible at https://github.com/lol88/Meta-MolNet.
2
Ochoa R, Deibler K. PepFuNN: Novo Nordisk Open-Source Toolkit to Enable Peptide in Silico Analysis. J Pept Sci 2025; 31:e3666. [PMID: 39777768 PMCID: PMC11706630 DOI: 10.1002/psc.3666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2024] [Revised: 12/04/2024] [Accepted: 12/09/2024] [Indexed: 01/11/2025]
Abstract
We present PepFuNN, a new open-source version of the PepFun package with functions to study the chemical space of peptide libraries and perform structure-activity relationship analyses. PepFuNN is a Python package comprising five modules to study peptides with natural amino acids and, in some cases, sequences with non-natural amino acids based on the availability of a public monomer dictionary. The modules allow calculating physicochemical properties, performing similarity analysis using different peptide representations, clustering peptides using molecular fingerprints or calculated descriptors, designing peptide libraries based on specific requirements, and extracting matched pairs from experimental campaigns to guide the selection of the most relevant mutations when designing new rounds. The code and tutorials are available at https://github.com/novonordisk-research/pepfunn.
Affiliation(s)
- Kristine Deibler
  - Novo Nordisk Research Center Seattle, Novo Nordisk A/S, Seattle, Washington, USA
3
Dosajh A, Agrawal P, Chatterjee P, Priyakumar UD. Modern machine learning methods for protein property prediction. Curr Opin Struct Biol 2025; 90:102990. [PMID: 39881454 DOI: 10.1016/j.sbi.2025.102990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2024] [Revised: 12/06/2024] [Accepted: 01/04/2025] [Indexed: 01/31/2025]
Abstract
Recent progress and development of artificial intelligence and machine learning (AI/ML) techniques have enabled addressing complex biomolecular problems. AI/ML models learn the underlying distribution of data they are trained on and when exposed to new inputs, they make predictions based on patterns and relationships previously observed in the training set. Further, generative artificial intelligence (GenAI) can be used to accurately generate protein structure or sequence from specific selected properties. This review specifically focuses on the applications of AI/ML in predicting important functional properties of proteins, and the potential prospects of reverse-engineering in depicting the sequence and structure, from available protein-property information.
Affiliation(s)
- Arjun Dosajh
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, Telangana, India
- Prakul Agrawal
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, Telangana, India
- Prathit Chatterjee
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, Telangana, India
- U Deva Priyakumar
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, Telangana, India
4
Kashafutdinova IM, Poyezzhayeva A, Gimadiev T, Madzhidov T. Active learning approaches in molecule pKi prediction. Mol Inform 2025; 44:e202400154. [PMID: 39105614 DOI: 10.1002/minf.202400154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2024] [Revised: 06/21/2024] [Accepted: 06/23/2024] [Indexed: 08/07/2024]
Abstract
During the early stages of drug design, identifying compounds with suitable bioactivities is crucial. Given the vast array of potential drug databases, it's feasible to assay only a limited subset of candidates. The optimal method for selecting the candidates, aiming to minimize the overall number of assays, involves an active learning (AL) approach. In this work, we benchmarked a range of AL strategies with two main objectives: (1) to identify a strategy that ensures high model performance and (2) to select molecules with desired properties using minimal assays. To evaluate the different AL strategies, we employed the simulated AL workflow based on "virtual" experiments. These experiments leveraged ChEMBL datasets, which come with known biological activity values for the molecules. Furthermore, for classification tasks, we proposed the hybrid selection strategy that unified both exploration and exploitation AL strategies into a single acquisition function, defined by parameters n and c. We have also shown that popular minimal margin and maximal variance selection approaches for exploration selection correspond to minimization of the hybrid acquisition function with n=1 and 2 respectively. The balance between the exploration and exploitation strategies can be adjusted using a coefficient (c), making the optimal strategy selection straightforward. The primary strength of the hybrid selection method lies in its adaptability; it offers the flexibility to adjust the criteria for molecule selection based on the specific task by modifying the value of the contribution coefficient. Our analysis revealed that, in regression tasks, AL strategies didn't succeed at ensuring high model performance, however, they were successful in selecting molecules with desired properties using minimal number of tests. 
In analogous experiments in classification tasks, exploration strategy and the hybrid selection function with a constant c<1 (for n=1) and c≤0.2 (for n=2) were effective in achieving the goal of constructing a high-performance predictive model using minimal data. When searching for molecules with desired properties, exploitation, and the hybrid function with c≥1 (n=1) and c≥0.7 (n=2) demonstrated efficiency identifying molecules in fewer iterations compared to random selection method. Notably, when the hybrid function was set to an intermediate coefficient value (c=0.7), it successfully addressed both tasks simultaneously.
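The abstract does not state the closed form of the hybrid acquisition function. One form consistent with its description is score(p) = |p - 0.5|^n - c*p, where p is the predicted probability of the active class. With c = 0, n = 1 reduces to minimal-margin selection and n = 2 to maximal-variance selection, because p(1 - p) = 1/4 - (p - 1/2)^2. The Python sketch below uses that assumed form; `hybrid_score` and `select_batch` are hypothetical names, not the authors' code.

```python
def hybrid_score(p, n=1, c=0.0):
    """Assumed hybrid acquisition score for a predicted active-class probability p.

    The first term favors uncertain predictions (exploration); the second,
    weighted by the contribution coefficient c, favors confident actives
    (exploitation). Candidates with the lowest score are selected first.
    """
    return abs(p - 0.5) ** n - c * p


def select_batch(probs, k, n=1, c=0.0):
    """Return the indices of the k candidates with the lowest hybrid score."""
    ranked = sorted(range(len(probs)), key=lambda i: hybrid_score(probs[i], n, c))
    return ranked[:k]


# Toy predicted probabilities for five unassayed molecules (made-up values).
probs = [0.05, 0.48, 0.90, 0.60, 0.99]
explore = select_batch(probs, k=2, n=1, c=0.0)  # nearest to the decision boundary
exploit = select_batch(probs, k=2, n=1, c=2.0)  # most confidently active
```

Under this assumed form, raising c moves the minimizer of the score from p = 0.5 toward p = 1, which is one way to read the abstract's statement that c tunes the balance between exploration and exploitation.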
Affiliation(s)
- I M Kashafutdinova
  - A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan, 420008, Russia
- A Poyezzhayeva
  - A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan, 420008, Russia
- T Gimadiev
  - A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan, 420008, Russia
- T Madzhidov
  - Chemistry Solutions, Elsevier, London, EC2Y 5AS, UK
5
Ansari M, White AD. Learning peptide properties with positive examples only. Digital Discovery 2024; 3:977-986. [PMID: 38756224 PMCID: PMC11094695 DOI: 10.1039/d3dd00218g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2023] [Accepted: 03/30/2024] [Indexed: 05/18/2024]
Abstract
Deep learning can create accurate predictive models by exploiting existing large-scale experimental data, and guide the design of molecules. However, a major barrier is the requirement of both positive and negative examples in the classical supervised learning frameworks. Notably, most peptide databases come with missing information and low number of observations on negative examples, as such sequences are hard to obtain using high-throughput screening methods. To address this challenge, we solely exploit the limited known positive examples in a semi-supervised setting, and discover peptide sequences that are likely to map to certain antimicrobial properties via positive-unlabeled learning (PU). In particular, we use the two learning strategies of adapting base classifier and reliable negative identification to build deep learning models for inferring solubility, hemolysis, binding against SHP-2, and non-fouling activity of peptides, given their sequence. We evaluate the predictive performance of our PU learning method and show that by only using the positive data, it can achieve competitive performance when compared with the classical positive-negative (PN) classification approach, where there is access to both positive and negative examples.
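The reliable-negative-identification strategy named in the abstract can be illustrated with a two-step toy sketch in Python. It is not the authors' deep learning model: a simple centroid-based scorer stands in for the classifier, each peptide is reduced to a single made-up scalar feature, and the names `reliable_negatives` and `fit_pu_classifier` are hypothetical.

```python
def centroid(values):
    """Mean of a list of scalar features."""
    return sum(values) / len(values)


def reliable_negatives(positives, unlabeled, fraction=0.5):
    """Step 1: rank unlabeled points by how positive-like they look and keep
    the least positive-like fraction as reliable negatives."""
    c_pos, c_unl = centroid(positives), centroid(unlabeled)

    def score(x):
        # Higher score = closer to the positive centroid than to the unlabeled one.
        return abs(x - c_unl) - abs(x - c_pos)

    ranked = sorted(unlabeled, key=score)
    n_keep = max(1, int(len(unlabeled) * fraction))
    return ranked[:n_keep]


def fit_pu_classifier(positives, unlabeled, fraction=0.5):
    """Step 2: fit a nearest-centroid classifier on positives vs. reliable negatives."""
    negatives = reliable_negatives(positives, unlabeled, fraction)
    c_pos, c_neg = centroid(positives), centroid(negatives)
    return lambda x: abs(x - c_pos) < abs(x - c_neg)


# Toy data: the unlabeled pool hides two positives (5.2 and 5.8).
positives = [5.0, 5.5, 6.0]
unlabeled = [0.0, 0.5, 1.0, 5.2, 5.8, 0.2]
rn = reliable_negatives(positives, unlabeled)
is_positive = fit_pu_classifier(positives, unlabeled)
```

The point of the two steps is that the final classifier never treats the hidden positives in the unlabeled pool as negatives, which is the failure mode of naively labeling all unlabeled data as negative.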
Affiliation(s)
- Mehrad Ansari
  - Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
- Andrew D White
  - Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
6
Dou B, Zhu Z, Merkurjev E, Ke L, Chen L, Jiang J, Zhu Y, Liu J, Zhang B, Wei GW. Machine Learning Methods for Small Data Challenges in Molecular Science. Chem Rev 2023; 123:8736-8780. [PMID: 37384816 PMCID: PMC10999174 DOI: 10.1021/acs.chemrev.3c00189] [Citation(s) in RCA: 79] [Impact Index Per Article: 39.5] [Indexed: 07/01/2023]
Abstract
Small data are often used in scientific and engineering research due to the presence of various constraints, such as time, cost, ethics, privacy, security, and technical limitations in data acquisition. However, because big data have been the focus for the past decade, small data and their challenges have received little attention, even though they are technically more severe in machine learning (ML) and deep learning (DL) studies. Overall, the small data challenge is often compounded by issues such as data diversity, imputation, noise, imbalance, and high dimensionality. Fortunately, the current big data era is characterized by technological breakthroughs in ML, DL, and artificial intelligence (AI), which enable data-driven scientific discovery, and many advanced ML and DL technologies developed for big data have inadvertently provided solutions for small data problems. As a result, significant progress has been made in ML and DL for small data challenges in the past decade. In this review, we summarize and analyze several emerging potential solutions to small data challenges in molecular science, including chemical and biological sciences. We review both basic machine learning algorithms, such as linear regression, logistic regression (LR), k-nearest neighbor (KNN), support vector machine (SVM), kernel learning (KL), random forest (RF), and gradient boosting trees (GBT), and more advanced techniques, including artificial neural network (ANN), convolutional neural network (CNN), U-Net, graph neural network (GNN), generative adversarial network (GAN), long short-term memory (LSTM), autoencoder, transformer, transfer learning, active learning, graph-based semi-supervised learning, combining deep learning with traditional machine learning, and physical model-based data augmentation. We also briefly discuss the latest advances in these methods. Finally, we conclude the survey with a discussion of promising trends in small data challenges in molecular science.
Affiliation(s)
- Bozheng Dou
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Zailiang Zhu
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Ekaterina Merkurjev
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Lu Ke
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Long Chen
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Jian Jiang
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Yueying Zhu
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Jie Liu
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Bengong Zhang
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
7
Deng H, Ding M, Wang Y, Li W, Liu G, Tang Y. ACP-MLC: A two-level prediction engine for identification of anticancer peptides and multi-label classification of their functional types. Comput Biol Med 2023; 158:106844. [PMID: 37058760 DOI: 10.1016/j.compbiomed.2023.106844] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/07/2023] [Revised: 03/09/2023] [Accepted: 03/30/2023] [Indexed: 04/07/2023]
Abstract
Anticancer peptides (ACPs), a series of short bioactive peptides, are promising candidates in fighting cancer due to their high activity, low toxicity, and low likelihood of causing drug resistance. The accurate identification of ACPs and classification of their functional types is of great importance for investigating their mechanisms of action and developing peptide-based anticancer therapies. Here, we provide a computational tool, called ACP-MLC, to address binary classification and multi-label classification of ACPs for a given peptide sequence. Briefly, ACP-MLC is a two-level prediction engine, in which the first-level model predicts whether a query sequence is an ACP or not using the random forest algorithm, and the second-level model predicts which tissue types the sequence might target using the binary relevance algorithm. Developed and evaluated on high-quality datasets, ACP-MLC yielded an area under the receiver operating characteristic curve (AUC) of 0.888 on the independent test set for the first-level prediction, and obtained a Hamming loss of 0.157, a subset accuracy of 0.577, a macro F1-score of 0.802, and a micro F1-score of 0.826 on the independent test set for the second-level prediction. A systematic comparison demonstrated that ACP-MLC outperformed existing binary classifiers and other multi-label learning classifiers for ACP prediction. Finally, we interpreted the important features of ACP-MLC by the SHAP method. User-friendly software and the datasets are available at https://github.com/Nicole-DH/ACP-MLC. We believe that ACP-MLC would be a powerful tool in ACP discovery.
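Two of the multi-label metrics reported above are easy to confuse, so a minimal Python sketch of their standard definitions may help. Hamming loss is the fraction of individual label assignments that are wrong, while subset accuracy counts a sample as correct only if every one of its labels matches. The label matrices below are made-up toy values, not data from the paper.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual (sample, label) entries that are predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(row_t == row_p for row_t, row_p in zip(y_true, y_pred))
    return exact / len(y_true)


# Two peptides, three hypothetical tissue-type labels each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
loss = hamming_loss(y_true, y_pred)        # one wrong entry out of six
accuracy = subset_accuracy(y_true, y_pred)  # one of two samples fully correct
```

A single wrong label costs little in Hamming loss but fails the whole sample under subset accuracy, which is why a model can show a low Hamming loss alongside a modest subset accuracy.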
8
Eastwood JRB, Weisberg EI, Katz D, Zuckermann RN, Kirshenbaum K. Guidelines for designing peptoid structures: Insights from the Peptoid Data Bank. Pept Sci (Hoboken) 2023. [DOI: 10.1002/pep2.24307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/11/2023]
Affiliation(s)
- Dana Katz
  - Department of Chemistry, New York University, New York, New York, USA
- Kent Kirshenbaum
  - Department of Chemistry, New York University, New York, New York, USA
9
Guo JL, Januszyk M, Longaker MT. Machine Learning in Tissue Engineering. Tissue Eng Part A 2023; 29:2-19. [PMID: 35943870 PMCID: PMC9885550 DOI: 10.1089/ten.tea.2022.0128] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 06/24/2022] [Accepted: 08/02/2022] [Indexed: 02/03/2023] Open
Abstract
Machine learning (ML) and artificial intelligence have accelerated scientific discovery, augmented clinical practice, and deepened fundamental understanding of many biological phenomena. ML technologies have now been applied to diverse areas of tissue engineering research, including biomaterial design, scaffold fabrication, and cell/tissue modeling. Emerging ML-empowered strategies include machine-optimized polymer synthesis, predictive modeling of scaffold fabrication processes, complex analyses of structure-function relationships, and deep learning of spatialized cell phenotypes and tissue composition. The emergence of ML in tissue engineering, while relatively recent, has already enabled increasingly complex and multivariate analyses of the relationships between biological, chemical, and physical factors in driving tissue regenerative outcomes. This review highlights the novel methodologies, emerging strategies, and areas of potential growth within this rapidly evolving area of research.
Impact statement
Machine learning (ML) has accelerated scientific discovery and augmented clinical practice across multiple fields. Now, ML has driven exciting new paradigms in tissue engineering research, including machine-optimized biomaterial design, predictive modeling of scaffold fabrication, and spatiotemporal analysis of cell and tissue systems. The emergence of ML in tissue engineering, while relatively recent, has already enabled increasingly complex analyses of the relationships between biological, chemical, and physical factors in driving tissue regenerative outcomes. This review highlights the novel methodologies, emerging strategies, and areas of potential growth within this rapidly evolving area of research.
Affiliation(s)
- Jason L. Guo
  - Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University School of Medicine, Stanford, California, USA
- Michael Januszyk
  - Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University School of Medicine, Stanford, California, USA
- Michael T. Longaker
  - Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University School of Medicine, Stanford, California, USA
10
Zhang Y, Qiu L, Ren Y, Cheng Z, Li L, Yao S, Zhang C, Luo Z, Lu H. A meta-learning approach to improving radiation response prediction in cancers. Comput Biol Med 2022; 150:106163. [PMID: 37070625 DOI: 10.1016/j.compbiomed.2022.106163] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/25/2022] [Revised: 09/18/2022] [Accepted: 10/01/2022] [Indexed: 11/03/2022]
Abstract
PURPOSE Predicting the efficacy of radiotherapy in individual patients has drawn widespread attention, but the limited sample size remains a bottleneck for utilizing high-dimensional multi-omics data to guide personalized radiotherapy. We hypothesize that the recently developed meta-learning framework could address this limitation.
METHODS AND MATERIALS By combining gene expression, DNA methylation, and clinical data of 806 patients who had received radiotherapy from The Cancer Genome Atlas (TCGA), we applied the Model-Agnostic Meta-Learning (MAML) framework to tasks consisting of pan-cancer data, to obtain the best initial parameters of a neural network for a specific cancer with a smaller number of samples. The performance of the meta-learning framework was compared with four traditional machine learning methods based on two training schemes, and tested on the Cancer Cell Line Encyclopedia (CCLE) and Chinese Glioma Genome Atlas (CGGA) datasets. Moreover, the biological significance of the models was investigated by survival analysis and feature interpretation.
RESULTS The mean AUC (area under the ROC curve) [95% confidence interval] of our models across nine cancer types was 0.702 [0.691-0.713], an improvement of 0.166 on average over the four other machine learning methods across the two training schemes. Our models performed significantly better (p < 0.05) in seven cancer types and performed comparably to the other predictors in the remaining two cancer types. The more pan-cancer samples were used to transfer meta-knowledge, the greater the performance improvement (p < 0.05). The predicted response scores that our models generated were negatively correlated with the cell radiosensitivity index in four cancer types (p < 0.05), while the correlation was not statistically significant in the other three cancer types. Moreover, the predicted response scores were shown to be prognostic factors in seven cancer types, and eight potential radiosensitivity-related genes were identified.
CONCLUSIONS For the first time, we established the meta-learning approach to improving individual radiation response prediction by transferring common knowledge from pan-cancer data with MAML framework. The results demonstrated the superiority, generalizability, and biological significance of our approach.
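The MAML idea of learning initial parameters that adapt quickly to a small new task can be sketched in Python on a deliberately tiny problem. The sketch assumes a one-parameter model y = w * x with mean squared error, for which the second-order meta-gradient has a closed form; the tasks, learning rates, and function names are illustrative assumptions and are unrelated to the multi-omics network used in the paper. For simplicity each toy task reuses its support set as its query set.

```python
def grad(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return 2.0 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


def hessian(xs):
    """Second derivative of the same loss; constant for this linear model."""
    return 2.0 * sum(x * x for x in xs) / len(xs)


def maml_train(tasks, w0=0.0, inner_lr=0.05, outer_lr=0.1, steps=100):
    """tasks: list of (support_x, support_y, query_x, query_y) tuples."""
    for _ in range(steps):
        meta_grad = 0.0
        for sx, sy, qx, qy in tasks:
            # Inner loop: one gradient step on the task's support set.
            w_task = w0 - inner_lr * grad(w0, sx, sy)
            # Outer gradient: chain rule through the inner step (second-order term).
            meta_grad += grad(w_task, qx, qy) * (1.0 - inner_lr * hessian(sx))
        # Outer loop: update the shared initialization.
        w0 -= outer_lr * meta_grad / len(tasks)
    return w0


def adapt(w0, xs, ys, inner_lr=0.05, steps=1):
    """Fine-tune the meta-learned initialization on a new task."""
    w = w0
    for _ in range(steps):
        w -= inner_lr * grad(w, xs, ys)
    return w


xs = [1.0, 2.0]
# Three source tasks with slopes 1, 2, and 3.
tasks = [(xs, [a * x for x in xs], xs, [a * x for x in xs]) for a in (1.0, 2.0, 3.0)]
w_init = maml_train(tasks)
# A new task with slope 2.5 and only two examples.
w_new = adapt(w_init, xs, [2.5 * x for x in xs], steps=5)
```

In this toy setting the meta-learned initialization settles between the source-task slopes, so a handful of gradient steps on two examples already moves it toward the new task's slope.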
11
Nguyen D, Tao L, Li Y. Integration of Machine Learning and Coarse-Grained Molecular Simulations for Polymer Materials: Physical Understandings and Molecular Design. Front Chem 2022; 9:820417. [PMID: 35141207 PMCID: PMC8819075 DOI: 10.3389/fchem.2021.820417] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 11/24/2021] [Accepted: 12/31/2021] [Indexed: 12/21/2022] Open
Abstract
In recent years, the synthesis of monomer sequence-defined polymers has expanded into broad-spectrum applications in biomedical, chemical, and materials science fields. Pursuing the characterization and inverse design of these polymer systems requires our fundamental understanding not only at the individual monomer level, but also considering the chain scales, such as polymer configuration, self-assembly, and phase separation. However, our accessibility to this field is still rudimentary due to the limitations of traditional design approaches, the complexity of chemical space along with the burdened cost and time issues that prevent us from unveiling the underlying monomer sequence-structure-property relationships. Fortunately, thanks to the recent advancements in molecular dynamics simulations and machine learning (ML) algorithms, the bottlenecks in the tasks of establishing the structure-function correlation of the polymer chains can be overcome. In this review, we will discuss the applications of the integration between ML techniques and coarse-grained molecular dynamics (CGMD) simulations to solve the current issues in polymer science at the chain level. In particular, we focus on the case studies in three important topics-polymeric configuration characterization, feed-forward property prediction, and inverse design-in which CGMD simulations are leveraged to generate training datasets to develop ML-based surrogate models for specific polymer systems and designs. By doing so, this computational hybridization allows us to well establish the monomer sequence-functional behavior relationship of the polymers as well as guide us toward the best polymer chain candidates for the inverse design in undiscovered chemical space with reasonable computational cost and time. 
Even though there are still limitations and challenges ahead in this field, we finally conclude that this CGMD/ML integration is very promising, not only in the attempt of bridging the monomeric and macroscopic characterizations of polymer materials, but also enabling further tailored designs for sequence-specific polymers with superior properties in many practical applications.
Affiliation(s)
- Danh Nguyen
  - Department of Mechanical Engineering, University of Connecticut, Mansfield, CT, United States
- Lei Tao
  - Department of Mechanical Engineering, University of Connecticut, Mansfield, CT, United States
- Ying Li
  - Department of Mechanical Engineering, University of Connecticut, Mansfield, CT, United States
  - Polymer Program, Institute of Materials Science, University of Connecticut, Mansfield, CT, United States
12
Shekar V, Nicholas G, Ani Najeeb M, Zeile M, Yu V, Wang X, Slack D, Li Z, Nega PW, Chan E, Norquist AJ, Schrier J, Friedler SA. Active Meta-Learning for Predicting and Selecting Perovskite Crystallization Experiments. J Chem Phys 2022; 156:064108. [DOI: 10.1063/5.0076636] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022] Open
Affiliation(s)
- Vincent Yu
  - Haverford College, United States of America
- Zhi Li
  - E O Lawrence Berkeley National Laboratory, United States of America
- Philip W. Nega
  - E O Lawrence Berkeley National Laboratory, United States of America
- Emory Chan
  - Lawrence Berkeley National Laboratory, United States of America
- Joshua Schrier
  - Department of Chemistry, Fordham University - Rose Hill Campus, United States of America
13
Tynes M, Gao W, Burrill DJ, Batista ER, Perez D, Yang P, Lubbers N. Pairwise Difference Regression: A Machine Learning Meta-algorithm for Improved Prediction and Uncertainty Quantification in Chemical Search. J Chem Inf Model 2021; 61:3846-3857. [PMID: 34347460 DOI: 10.1021/acs.jcim.1c00670] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 11/28/2022]
Abstract
Machine learning (ML) plays a growing role in the design and discovery of chemicals, aiming to reduce the need to perform expensive experiments and simulations. ML for such applications is promising but difficult, as models must generalize to vast chemical spaces from small training sets and must have reliable uncertainty quantification metrics to identify and prioritize unexplored regions. Ab initio computational chemistry and chemical intuition alike often take advantage of differences between chemical conditions, rather than their absolute structure or state, to generate more reliable results. We have developed an analogous comparison-based approach for ML regression, called pairwise difference regression (PADRE), which is applicable to arbitrary underlying learning models and operates on pairs of input data points. During training, the model learns to predict differences between all possible pairs of input points. During prediction, the test points are paired with all training set points, giving rise to a set of predictions that can be treated as a distribution of which the mean is treated as a final prediction and the dispersion is treated as an uncertainty measure. Pairwise difference regression was shown to reliably improve the performance of the random forest algorithm across five chemical ML tasks. Additionally, the pair-derived dispersion is both well correlated with model error and performs well in active learning. We also show that this method is competitive with state-of-the-art neural network techniques. Thus, pairwise difference regression is a promising tool for candidate selection algorithms used in chemical discovery.
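The training and prediction procedure described in this abstract can be sketched in Python as follows. Two parts are assumptions made for illustration: the pair encoding (both points plus their difference) and the base learner, where a 1-nearest-neighbour lookup stands in for the random forest used in the paper. The class name `PairwiseDifferenceRegressor` is hypothetical.

```python
import math
from statistics import mean, pstdev


def pair_features(a, b):
    """Encode a pair as both points plus their difference (an assumed encoding)."""
    return list(a) + list(b) + [ai - bi for ai, bi in zip(a, b)]


class PairwiseDifferenceRegressor:
    def fit(self, X, y):
        self.X, self.y = X, y
        # Training set: every ordered pair (i, j) with target y_i - y_j.
        self.pairs = [(pair_features(xi, xj), yi - yj)
                      for xi, yi in zip(X, y)
                      for xj, yj in zip(X, y)]
        return self

    def _predict_difference(self, feats):
        # Stand-in base learner: 1-nearest-neighbour lookup over training pairs.
        return min(self.pairs, key=lambda pair: math.dist(pair[0], feats))[1]

    def predict(self, x):
        # Pair the query with every training point; each pair gives one estimate
        # y_j + predicted (y_query - y_j).
        estimates = [yj + self._predict_difference(pair_features(x, xj))
                     for xj, yj in zip(self.X, self.y)]
        # Mean is the prediction; dispersion is the uncertainty measure.
        return mean(estimates), pstdev(estimates)


# Toy data following y = 2 * x.
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [2.0 * row[0] for row in X]
model = PairwiseDifferenceRegressor().fit(X, y)
seen_pred, seen_spread = model.predict([2.0])
new_pred, new_spread = model.predict([2.5])
```

Because any regressor can fill the role of `_predict_difference`, the wrapper is model-agnostic, which matches the abstract's description of the method as a meta-algorithm. Note that the number of training pairs grows quadratically with the training set size.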
Affiliation(s)
- Michael Tynes
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
  - Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
- Wenhao Gao
  - Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Daniel J Burrill
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
  - Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
- Enrique R Batista
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
  - Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
- Danny Perez
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
- Ping Yang
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
- Nicholas Lubbers
  - Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States