1
Tan YS, Singh C, Nasseri K, Agarwal A, Duncan J, Ronen O, Epland M, Kornblith A, Yu B. Fast Interpretable Greedy-Tree Sums. Proc Natl Acad Sci U S A 2025; 122:e2310151122. [PMID: 39951504 PMCID: PMC11848335 DOI: 10.1073/pnas.2310151122] [Received: 07/12/2023] [Accepted: 12/10/2024] [Indexed: 02/16/2025] Open
Abstract
Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in high-stakes domains such as medicine. In such settings, practitioners often use highly interpretable decision tree models, but these suffer from inductive bias against additive structure. To overcome this bias, we propose Fast Interpretable Greedy-Tree Sums (FIGS), which generalizes the Classification and Regression Trees (CART) algorithm to simultaneously grow a flexible number of trees in summation. By combining logical rules with addition, FIGS adapts to additive structure while remaining highly interpretable. Experiments on real-world datasets show FIGS achieves state-of-the-art prediction performance. To demonstrate the usefulness of FIGS in high-stakes domains, we adapt FIGS to learn clinical decision instruments (CDIs), which are tools for guiding decision-making. Specifically, we introduce a variant of FIGS known as Group Probability-Weighted Tree Sums (G-FIGS) that accounts for heterogeneity in medical data. G-FIGS derives CDIs that reflect domain knowledge and enjoy improved specificity (by up to 20% over CART) without sacrificing sensitivity or interpretability. Theoretically, we prove that FIGS learns components of additive models, a property we refer to as disentanglement. Further, we show (under oracle conditions) that tree-sum models leverage disentanglement to generalize more efficiently than single tree models when fitted to additive regression functions. Finally, to avoid overfitting with an unconstrained number of splits, we develop Bagging-FIGS, an ensemble version of FIGS that borrows the variance reduction techniques of random forests. Bagging-FIGS performs competitively with random forests and XGBoost on real-world datasets.
Affiliation(s)
- Yan Shuo Tan
  - Department of Statistics and Data Science, National University of Singapore, Singapore 119077, Republic of Singapore
- Chandan Singh
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
  - Microsoft Research, Redmond, WA 98052
- Keyan Nasseri
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
- Abhineet Agarwal
  - Statistics Department, University of California, Berkeley, CA 94720
- James Duncan
  - Graduate Group in Biostatistics, University of California, Berkeley, CA 94720
- Omer Ronen
  - Statistics Department, University of California, Berkeley, CA 94720
- Aaron Kornblith
  - Department of Emergency Medicine, University of California, San Francisco, CA 94113
  - Department of Pediatrics, University of California, San Francisco, CA 94113
- Bin Yu
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
  - Microsoft Research, Redmond, WA 98052
  - Statistics Department, University of California, Berkeley, CA 94720
2
Chernigovskaya M, Pavlović M, Kanduri C, Gielis S, Robert P, Scheffer L, Slabodkin A, Haff IH, Meysman P, Yaari G, Sandve GK, Greiff V. Simulation of adaptive immune receptors and repertoires with complex immune information to guide the development and benchmarking of AIRR machine learning. Nucleic Acids Res 2025; 53:gkaf025. [PMID: 39873270 PMCID: PMC11773363 DOI: 10.1093/nar/gkaf025] [Received: 11/04/2023] [Accepted: 01/25/2025] [Indexed: 01/30/2025] Open
Abstract
Machine learning (ML) has shown great potential in the adaptive immune receptor repertoire (AIRR) field. However, there is a lack of large-scale ground-truth experimental AIRR data suitable for AIRR-ML-based disease diagnostics and therapeutics discovery. Simulated ground-truth AIRR data are required to complement the development and benchmarking of robust and interpretable AIRR-ML methods where experimental data is currently inaccessible or insufficient. The challenge for simulated data to be useful is incorporating key features observed in experimental repertoires. These features, such as antigen or disease-associated immune information, cause AIRR-ML problems to be challenging. Here, we introduce LIgO, a software suite, which simulates AIRR data for the development and benchmarking of AIRR-ML methods. LIgO incorporates different types of immune information both on the receptor and the repertoire level and preserves native-like generation probability distribution. Additionally, LIgO assists users in determining the computational feasibility of their simulations. We show two examples where LIgO supports the development and validation of AIRR-ML methods: (i) how individuals carrying out-of-distribution immune information impacts receptor-level prediction performance and (ii) how immune information co-occurring in the same AIRs impacts the performance of conventional receptor-level encoding and repertoire-level classification approaches. LIgO guides the advancement and assessment of interpretable AIRR-ML methods.
Affiliation(s)
- Maria Chernigovskaya
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, 0372, Norway
- Milena Pavlović
  - Department of Informatics, University of Oslo, Oslo, 0373, Norway
  - UiO:RealArt Convergence Environment, University of Oslo, Oslo, 0373, Norway
- Chakravarthi Kanduri
  - Department of Informatics, University of Oslo, Oslo, 0373, Norway
  - UiO:RealArt Convergence Environment, University of Oslo, Oslo, 0373, Norway
- Sofie Gielis
  - Department of Mathematics and Computer Science, University of Antwerp, Antwerp, 2020, Belgium
- Philippe A Robert
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, 0372, Norway
  - Department of Biomedicine, University of Basel, Basel, 4031, Switzerland
- Lonneke Scheffer
  - Department of Informatics, University of Oslo, Oslo, 0373, Norway
- Andrei Slabodkin
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, 0372, Norway
- Pieter Meysman
  - Department of Mathematics and Computer Science, University of Antwerp, Antwerp, 2020, Belgium
- Gur Yaari
  - Faculty of Engineering, Bar-Ilan University, Ramat Gan, 5290002, Israel
- Geir Kjetil Sandve
  - Department of Informatics, University of Oslo, Oslo, 0373, Norway
  - UiO:RealArt Convergence Environment, University of Oslo, Oslo, 0373, Norway
- Victor Greiff
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, 0372, Norway
3
Peterson RA, McGrath M, Cavanaugh JE. Can a Transparent Machine Learning Algorithm Predict Better than Its Black Box Counterparts? A Benchmarking Study Using 110 Data Sets. Entropy (Basel, Switzerland) 2024; 26:746. [PMID: 39330080 PMCID: PMC11431724 DOI: 10.3390/e26090746] [Received: 06/27/2024] [Revised: 08/27/2024] [Accepted: 08/28/2024] [Indexed: 09/28/2024]
Abstract
We developed a novel machine learning (ML) algorithm with the goal of producing transparent models (i.e., understandable by humans) while also flexibly accounting for nonlinearity and interactions. Our method is based on ranked sparsity, and it allows for flexibility and user control in varying the shade of the opacity of black box machine learning methods. The main tenet of ranked sparsity is that an algorithm should be more skeptical of higher-order polynomials and interactions a priori compared to main effects, and hence, the inclusion of these more complex terms should require a higher level of evidence. In this work, we put our new ranked sparsity algorithm (as implemented in the open source R package, sparseR) to the test in a predictive model "bakeoff" (i.e., a benchmarking study of ML algorithms applied "out of the box", that is, with no special tuning). Algorithms were trained on a large set of simulated and real-world data sets from the Penn Machine Learning Benchmarks database, addressing both regression and binary classification problems. We evaluated the extent to which our human-centered algorithm can attain predictive accuracy that rivals popular black box approaches such as neural networks, random forests, and support vector machines, while also producing more interpretable models. Using out-of-bag error as a meta-outcome, we describe the properties of data sets in which human-centered approaches can perform as well as or better than black box approaches. We found that interpretable approaches predicted optimally or within 5% of the optimal method in most real-world data sets. We provide a more in-depth comparison of the performances of random forests to interpretable methods for several case studies, including exemplars in which algorithms performed similarly, and several cases when interpretable methods underperformed. 
This work provides a strong rationale for including human-centered transparent algorithms such as ours in predictive modeling applications.
Affiliation(s)
- Ryan A Peterson
  - Department of Biostatistics & Informatics, Colorado School of Public Health, University of Colorado, Anschutz Medical Campus, 13001 E. 17th Pl, Aurora, CO 80045, USA
- Max McGrath
  - Department of Biostatistics & Informatics, Colorado School of Public Health, University of Colorado, Anschutz Medical Campus, 13001 E. 17th Pl, Aurora, CO 80045, USA
- Joseph E Cavanaugh
  - Department of Biostatistics, College of Public Health, University of Iowa, 145 N. Riverside Dr., Iowa City, IA 52245, USA
4
Tjaden J, Tjaden B. MLpronto: A tool for democratizing machine learning. PLoS One 2023; 18:e0294924. [PMID: 38032968 PMCID: PMC10688639 DOI: 10.1371/journal.pone.0294924] [Received: 03/26/2023] [Accepted: 11/11/2023] [Indexed: 12/02/2023] Open
Abstract
The democratization of machine learning is a popular and growing movement. In a world with a wealth of publicly available data, it is important that algorithms for analysis of data are accessible and usable by everyone. We present MLpronto, a system for machine learning analysis that is designed to be easy to use so as to facilitate engagement with machine learning algorithms. With its web interface, MLpronto requires no computer programming or machine learning background, and it normally returns results in a matter of seconds. As input, MLpronto takes a file of data to be analyzed. MLpronto then executes some of the more commonly used supervised machine learning algorithms on the data and reports the results of the analyses. As part of its execution, MLpronto generates computer programming code corresponding to its machine learning analysis, which it also supplies as output. Thus, MLpronto can be used as a no-code solution for citizen data scientists with no machine learning or programming background, as an educational tool for those learning about machine learning, and as a first step for those who prefer to engage with programming code in order to facilitate rapid development of machine learning projects. MLpronto is freely available for use at https://mlpronto.org/.
Affiliation(s)
- Jacob Tjaden
  - Computer Science Department, Colby College, Waterville, ME, United States of America
- Brian Tjaden
  - Department of Computer Science, Wellesley College, Wellesley, MA, United States of America
5
John M, Schuhmacher J, Barkoutsos P, Tavernelli I, Tacchino F. Optimizing Quantum Classification Algorithms on Classical Benchmark Datasets. Entropy (Basel, Switzerland) 2023; 25:860. [PMID: 37372204 PMCID: PMC10297005 DOI: 10.3390/e25060860] [Received: 04/20/2023] [Revised: 05/24/2023] [Accepted: 05/24/2023] [Indexed: 06/29/2023]
Abstract
The discovery of quantum algorithms offering provable advantages over the best known classical alternatives, together with the parallel ongoing revolution brought about by classical artificial intelligence, motivates a search for applications of quantum information processing methods to machine learning. Among several proposals in this domain, quantum kernel methods have emerged as particularly promising candidates. However, while some rigorous speedups on certain highly specific problems have been formally proven, only empirical proof-of-principle results have been reported so far for real-world datasets. Moreover, no systematic procedure is known, in general, to fine-tune and optimize the performance of kernel-based quantum classification algorithms. At the same time, certain limitations, such as kernel concentration effects that hinder the trainability of quantum classifiers, have also been recently pointed out. In this work, we propose several general-purpose optimization methods and best practices designed to enhance the practical usefulness of fidelity-based quantum classification algorithms. Specifically, we first describe a data pre-processing strategy that, by preserving the relevant relationships between data points when processed through quantum feature maps, substantially alleviates the effect of kernel concentration on structured datasets. We also introduce a classical post-processing method that, based on standard fidelity measures estimated on a quantum processor, yields non-linear decision boundaries in the feature Hilbert space, thus achieving the quantum counterpart of the radial basis functions technique that is widely employed in classical kernel methods. Finally, we apply the so-called quantum metric learning protocol to engineer and adjust trainable quantum embeddings, demonstrating substantial performance improvements on several paradigmatic real-world classification tasks.
Affiliation(s)
- Manuel John
  - IBM Quantum, IBM Research Europe - Zurich, 8803 Rüschlikon, Switzerland
  - Institute for Theoretical Physics, ETH Zürich, 8093 Zurich, Switzerland
- Ivano Tavernelli
  - IBM Quantum, IBM Research Europe - Zurich, 8803 Rüschlikon, Switzerland
6
Bertini C, Leporini R. Quantum-Inspired Applications for Classification Problems. Entropy (Basel, Switzerland) 2023; 25:404. [PMID: 36981293 PMCID: PMC10047587 DOI: 10.3390/e25030404] [Received: 01/16/2023] [Revised: 02/19/2023] [Accepted: 02/22/2023] [Indexed: 06/18/2023]
Abstract
In the context of quantum-inspired machine learning, quantum state discrimination is a useful tool for classification problems. We implement a local approach combining the k-nearest neighbors algorithm with some quantum-inspired classifiers. We compare the performance with respect to well-known classifiers applied to benchmark datasets.
Affiliation(s)
- Cesarino Bertini
  - Department of Management, University of Bergamo, via dei Caniana 2, I-24127 Bergamo, Italy
- Roberto Leporini
  - Department of Economics, University of Bergamo, via dei Caniana 2, I-24127 Bergamo, Italy
7
Sha Z, Chen Y, Hu T. NSPA: characterizing the disease association of multiple genetic interactions at single-subject resolution. Bioinformatics Advances 2023; 3:vbad010. [PMID: 36818729 PMCID: PMC9927570 DOI: 10.1093/bioadv/vbad010] [Received: 10/31/2022] [Revised: 01/02/2023] [Accepted: 02/02/2023] [Indexed: 02/10/2023]
Abstract
Motivation: The interaction between genetic variables is one of the major barriers to characterizing the genetic architecture of complex traits. To account for epistasis, network science approaches are increasingly being used in research to elucidate the genetic architecture of complex diseases. Network science approaches associate the disease susceptibility of genetic variables with their topological importance in the network. However, this network only represents genetic interactions and does not describe how these interactions contribute to disease association at the subject scale. We propose the Network-based Subject Portrait Approach (NSPA) and an accompanying feature transformation method to determine the collective risk impact of multiple genetic interactions for each subject.
Results: The feature transformation method converts genetic variants of subjects into new values that capture how genetic variables interact with others to contribute to a subject's disease association. We apply this approach to synthetic and genetic datasets and learn that (1) the disease association can be captured using multiple disjoint sets of genetic interactions and (2) the feature transformation method based on NSPA improves predictive performance compared with using the original genetic variables. Our findings confirm the role of genetic interaction in complex disease and provide a novel approach for gene-disease association studies to identify genetic architecture in the context of epistasis.
Availability and implementation: The code for NSPA is available at https://github.com/MIB-Lab/Network-based-Subject-Portrait-Approach.
Contact: ting.hu@queensu.ca.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Zhendong Sha
  - School of Computing, Queen’s University, Kingston, Ontario, Canada K7L 2N8
- Yuanzhu Chen
  - School of Computing, Queen’s University, Kingston, Ontario, Canada K7L 2N8
- Ting Hu
  - To whom correspondence should be addressed.
8
Alòs J, Ansótegui C, Torres E. Interpretable decision trees through MaxSAT. Artif Intell Rev 2022; 56:1-21. [PMID: 36590759 PMCID: PMC9794111 DOI: 10.1007/s10462-022-10377-0] [Accepted: 12/12/2022] [Indexed: 12/29/2022]
Abstract
We present an approach to improve the accuracy-interpretability trade-off of Machine Learning (ML) Decision Trees (DTs). In particular, we apply Maximum Satisfiability technology to compute Minimum Pure DTs (MPDTs). We improve the runtime of previous approaches and show that these MPDTs can outperform the accuracy of DTs generated with the ML framework sklearn.
Affiliation(s)
- Josep Alòs
  - Logic & Optimization Group (LOG), University of Lleida, Lleida, Spain
- Carlos Ansótegui
  - Logic & Optimization Group (LOG), University of Lleida, Lleida, Spain
- Eduard Torres
  - Logic & Optimization Group (LOG), University of Lleida, Lleida, Spain
9
Duong-Trung N, Born S, Kim JW, Schermeyer MT, Paulick K, Borisyak M, Cruz-Bournazou MN, Werner T, Scholz R, Schmidt-Thieme L, Neubauer P, Martinez E. When Bioprocess Engineering Meets Machine Learning: A Survey from the Perspective of Automated Bioprocess Development. Biochem Eng J 2022. [DOI: 10.1016/j.bej.2022.108764] [Indexed: 12/14/2022]
10
Leporini R, Pastorello D. An efficient geometric approach to quantum-inspired classifications. Sci Rep 2022; 12:8781. [PMID: 35610272 PMCID: PMC9130267 DOI: 10.1038/s41598-022-12392-1] [Received: 12/08/2021] [Accepted: 05/05/2022] [Indexed: 11/09/2022] Open
Abstract
Optimal measurements for the discrimination of quantum states are useful tools for classification problems. In order to exploit the potential of quantum computers, feature vectors have to be encoded into quantum states represented by density operators. However, quantum-inspired classifiers based on nearest mean and on Helstrom state discrimination are implemented on classical computers. We show a geometric approach that improves the efficiency of quantum-inspired classification in terms of space and time acting on quantum encoding and allows one to compare classifiers correctly in the presence of multiple preparations of the same quantum state as input. We also introduce the nearest mean classification based on Bures distance, Hellinger distance and Jensen-Shannon distance comparing the performance with respect to well-known classifiers applied to benchmark datasets.
Affiliation(s)
- Roberto Leporini
  - Department of Economics, University of Bergamo, via dei Caniana 2, 24127, Bergamo, Italy
- Davide Pastorello
  - Department of Information Engineering and Computer Science, University of Trento, via Sommarive 9, 38123, Povo, Italy
11
La Cava W, Burlacu B, Virgolin M, Kommenda M, Orzechowski P, de França FO, Jin Y, Moore JH. Contemporary Symbolic Regression Methods and their Relative Performance. Advances in Neural Information Processing Systems 2021; 2021:1-16. [PMID: 38715933 PMCID: PMC11074949] [Indexed: 05/12/2024]
Abstract
Many promising approaches to symbolic regression have been presented in recent years, yet progress in the field continues to suffer from a lack of uniform, robust, and transparent benchmarking standards. We address this shortcoming by introducing an open-source, reproducible benchmarking platform for symbolic regression. We assess 14 symbolic regression methods and 7 machine learning methods on a set of 252 diverse regression problems. Our assessment includes both real-world datasets with no known model form as well as ground-truth benchmark problems. For the real-world datasets, we benchmark the ability of each method to learn models with low error and low complexity relative to state-of-the-art machine learning methods. For the synthetic problems, we assess each method's ability to find exact solutions in the presence of varying levels of noise. Under these controlled experiments, we conclude that the best performing methods for real-world regression combine genetic algorithms with parameter estimation and/or semantic search drivers. When tasked with recovering exact equations in the presence of noise, we find that several approaches perform similarly. We provide a detailed guide to reproducing this experiment and contributing new methods, and encourage other researchers to collaborate with us on a common and living symbolic regression benchmark.
Affiliation(s)
- Bogdan Burlacu
  - Josef Ressel Center for Symbolic Regression, University of Applied Sciences Upper Austria
- Marco Virgolin
  - Life Sciences and Health Group, Centrum Wiskunde & Informatica
- Michael Kommenda
  - Josef Ressel Center for Symbolic Regression, University of Applied Sciences Upper Austria
- Ying Jin
  - Department of Statistics, Stanford University
- Jason H Moore
  - Institute for Biomedical Informatics, University of Pennsylvania