1. Bassetti D, Pospíšil L, Horenko I. On Entropic Learning from Noisy Time Series in the Small Data Regime. Entropy (Basel, Switzerland) 2024; 26:553. [PMID: 39056915] [PMCID: PMC11276242] [DOI: 10.3390/e26070553] [Received: 05/08/2024] [Revised: 06/24/2024] [Accepted: 06/25/2024]
Abstract
In this work, we present a novel methodology for the supervised classification of time-ordered noisy data; we call this methodology Entropic Sparse Probabilistic Approximation with Markov regularization (eSPA-Markov). It is an extension of entropic learning methodologies, allowing the simultaneous learning of segmentation patterns, entropy-optimal feature space discretizations, and Bayesian classification rules. We prove the conditions for the existence and uniqueness of the learning problem solution and propose a one-shot numerical learning algorithm that, in the leading order, scales linearly in dimension. We show how this technique can be used for the computationally scalable identification of persistent (metastable) regime affiliations and regime switches from high-dimensional non-stationary and noisy time series, i.e., when the size of the data statistics is small compared to their dimensionality and when the noise variance is larger than the variance in the signal. We demonstrate its performance on a set of toy learning problems, comparing eSPA-Markov to state-of-the-art techniques, including deep learning and random forests. Finally, we show how this technique can be used for the analysis of noisy time series from DNA and RNA Nanopore sequencing.
Affiliation(s)
- Davide Bassetti
- Faculty of Mathematics, RPTU Kaiserslautern-Landau, Gottlieb-Daimler-Str. 48, 67663 Kaiserslautern, Germany
- Lukáš Pospíšil
- Department of Mathematics, Faculty of Civil Engineering, VŠB-TUO, Ludvika Podeste 1875/17, 708 33 Ostrava, Czech Republic
- Illia Horenko
- Faculty of Mathematics, RPTU Kaiserslautern-Landau, Gottlieb-Daimler-Str. 48, 67663 Kaiserslautern, Germany
2. Vecchi E, Bassetti D, Graziato F, Pospíšil L, Horenko I. Gauge-Optimal Approximate Learning for Small Data Classification. Neural Comput 2024; 36:1198-1227. [PMID: 38669692] [DOI: 10.1162/neco_a_01664] [Received: 11/01/2023] [Accepted: 01/16/2024]
Abstract
Small data learning problems are characterized by a significant discrepancy between the limited number of response variable observations and the large feature space dimension. In this setting, the common learning tools struggle to identify the features important for the classification task from those that bear no relevant information and cannot derive an appropriate learning rule that allows discriminating among different classes. As a potential solution to this problem, here we exploit the idea of reducing and rotating the feature space in a lower-dimensional gauge and propose the gauge-optimal approximate learning (GOAL) algorithm, which provides an analytically tractable joint solution to the dimension reduction, feature segmentation, and classification problems for small data learning problems. We prove that the optimal solution of the GOAL algorithm consists of piecewise-linear functions in the Euclidean space and that it can be approximated through a monotonically convergent algorithm that presents, under the assumption of a discrete segmentation of the feature space, a closed-form solution for each optimization substep and an overall linear iteration cost scaling. The GOAL algorithm has been compared to other state-of-the-art machine learning tools on both synthetic data and challenging real-world applications from climate science and bioinformatics (i.e., prediction of the El Niño Southern Oscillation and inference of epigenetically induced gene-activity networks from limited experimental data). The experimental results show that the proposed algorithm outperforms the reported best competitors for these problems in both learning performance and computational cost.
Affiliation(s)
- Edoardo Vecchi
- Università della Svizzera Italiana, Faculty of Informatics, Institute of Computing, 6962 Lugano, Switzerland
- Davide Bassetti
- Technical University of Kaiserslautern, Faculty of Mathematics, Group of Mathematics of AI, 67663 Kaiserslautern, Germany
- Lukáš Pospíšil
- VSB Ostrava, Department of Mathematics, Ludvika Podeste 1875/17, 708 33 Ostrava, Czech Republic
- Illia Horenko
- Technical University of Kaiserslautern, Faculty of Mathematics, Group of Mathematics of AI, 67663 Kaiserslautern, Germany
3.
Abstract
Regression learning is one of the long-standing problems in statistics, machine learning, and deep learning (DL). We show that writing this problem as a probabilistic expectation over (unknown) feature probabilities, thus increasing the number of unknown parameters and seemingly making the problem more complex, actually leads to its simplification and allows incorporating the physical principle of entropy maximization. It helps decompose a very general setting of this learning problem (including discretization, feature selection, and learning multiple piecewise-linear regressions) into an iterative sequence of simple substeps, which are either analytically solvable or cheaply computable through an efficient second-order numerical solver with a sublinear cost scaling. This leads to the computationally cheap and robust non-DL second-order Sparse Probabilistic Approximation for Regression Task Analysis (SPARTAn) algorithm, which can be efficiently applied to problems with millions of feature dimensions on a commodity laptop, where state-of-the-art learning tools would require supercomputers. SPARTAn is compared to a range of commonly used regression learning tools on synthetic problems and on the prediction of the El Niño Southern Oscillation, the dominant interannual mode of tropical climate variability. The obtained SPARTAn learners provide more predictive, sparse, and physically explainable data descriptions, clearly discerning the important role of ocean temperature variability at the thermocline in the equatorial Pacific. SPARTAn provides an easily interpretable description of the timescales by which these thermocline temperature features evolve and eventually express at the surface, thereby enabling enhanced predictability of the key drivers of the interannual climate.
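To illustrate the "multiple piecewise-linear regressions" substep mentioned in this abstract, one can fit a closed-form least-squares line per segment once a discrete segmentation is given. This is only a hypothetical one-dimensional sketch (names and data are our own; the published SPARTAn algorithm additionally learns the segmentation and the feature probabilities jointly):

```python
def fit_piecewise_linear(x, y, assign, K):
    """Fit one least-squares line per segment of a 1D signal.

    `assign[i]` gives the segment index of point i (taken as given
    here; SPARTAn would infer it). Each segment is assumed to be
    nonempty with at least two distinct x-values.
    Returns a list of (slope, intercept) pairs, one per segment.
    """
    models = []
    for k in range(K):
        xs = [xi for xi, a in zip(x, assign) if a == k]
        ys = [yi for yi, a in zip(y, assign) if a == k]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        # Closed-form simple linear regression on this segment.
        sxx = sum((xi - mx) ** 2 for xi in xs)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
        slope = sxy / sxx
        models.append((slope, my - slope * mx))
    return models

# Two segments with opposite trends recover slopes +1 and -1.
x = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
y = [0.0, 1.0, 2.0, -10.0, -11.0, -12.0]
models = fit_piecewise_linear(x, y, [0, 0, 0, 1, 1, 1], K=2)
```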
4. Horenko I, Pospíšil L, Vecchi E, Albrecht S, Gerber A, Rehbock B, Stroh A, Gerber S. Low-Cost Probabilistic 3D Denoising with Applications for Ultra-Low-Radiation Computed Tomography. J Imaging 2022; 8:156. [PMID: 35735955] [PMCID: PMC9224620] [DOI: 10.3390/jimaging8060156] [Received: 03/21/2022] [Revised: 05/18/2022] [Accepted: 05/19/2022]
Abstract
We propose a pipeline for synthetic generation of personalized Computed Tomography (CT) images, with a radiation exposure evaluation and a lifetime attributable risk (LAR) assessment. We perform a patient-specific performance evaluation for a broad range of denoising algorithms (including the most popular deep learning denoising approaches, wavelets-based methods, methods based on Mumford-Shah denoising, etc.), focusing both on assessing the capability to reduce the patient-specific CT-induced LAR and on computational cost scalability. We introduce a parallel Probabilistic Mumford-Shah denoising model (PMS) and show that it markedly outperforms the compared common denoising methods in denoising quality and cost scaling. In particular, we show that it allows an approximately 22-fold robust patient-specific LAR reduction for infants and a 10-fold LAR reduction for adults. Using a normal laptop, the proposed algorithm for PMS allows cheap and robust (with a multiscale structural similarity index >90%) denoising of very large 2D videos and 3D images (with over 10^7 voxels) that are subject to ultra-strong noise (Gaussian and non-Gaussian) for signal-to-noise ratios far below 1.0. The code is provided for open access.
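A minimal 1D cousin of the Mumford-Shah idea referenced in this abstract is the piecewise-constant (Potts) fit, which can be solved exactly by dynamic programming. This toy sketch is ours, not the paper's 3D probabilistic model: it minimizes the squared reconstruction error plus a penalty `gamma` per jump.

```python
def potts_denoise(y, gamma):
    """Exact 1D piecewise-constant denoising: minimize
    sum_i (u_i - y_i)^2 + gamma * (number of jumps in u)
    by dynamic programming over the last segment start."""
    n = len(y)

    def sse(l, r):
        # Squared error of fitting the mean on y[l..r] inclusive.
        seg = y[l:r + 1]
        m = sum(seg) / len(seg)
        return sum((v - m) ** 2 for v in seg)

    best = [0.0] * (n + 1)   # best[i]: optimal cost of prefix y[:i]
    cut = [0] * (n + 1)      # cut[i]: start of the last segment
    for i in range(1, n + 1):
        best[i] = float("inf")
        for l in range(i):
            # A jump penalty applies to every segment except the first.
            cost = best[l] + sse(l, i - 1) + (gamma if l > 0 else 0.0)
            if cost < best[i]:
                best[i], cut[i] = cost, l

    # Backtrack and emit each segment's mean as the denoised value.
    out, i = [0.0] * n, n
    while i > 0:
        l = cut[i]
        m = sum(y[l:i]) / (i - l)
        for j in range(l, i):
            out[j] = m
        i = l
    return out

# A noisy two-level signal is recovered as two flat segments.
noisy = [0.1, -0.1, 0.0, 5.1, 4.9, 5.0]
clean = potts_denoise(noisy, gamma=1.0)
```

The quadratic number of `sse` evaluations keeps this sketch O(n^2) or worse; the scalability claims in the abstract refer to the authors' parallel 3D formulation, not to this illustration.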
Affiliation(s)
- Illia Horenko
- Faculty of Mathematics, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Correspondence: (I.H.); (S.G.)
- Lukáš Pospíšil
- Department of Mathematics, VSB Ostrava, Ludvika Podeste 1875/17, 708 33 Ostrava, Czech Republic
- Edoardo Vecchi
- Institute of Computing, Faculty of Informatics, Università della Svizzera Italiana (USI), 6962 Viganello, Switzerland
- Steffen Albrecht
- Institute of Physiology, University Medical Center of the Johannes Gutenberg-University Mainz, 55128 Mainz, Germany
- Alexander Gerber
- Institute of Occupational Medicine, Faculty of Medicine, GU Frankfurt, 60590 Frankfurt am Main, Germany
- Beate Rehbock
- Lung Radiology Center Berlin, 10627 Berlin, Germany
- Albrecht Stroh
- Institute of Pathophysiology, University Medical Center of the Johannes Gutenberg-University Mainz, 55128 Mainz, Germany
- Susanne Gerber
- Institute for Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz, 55128 Mainz, Germany
- Correspondence: (I.H.); (S.G.)
5. Vecchi E, Pospíšil L, Albrecht S, O'Kane TJ, Horenko I. eSPA+: Scalable Entropy-Optimal Machine Learning Classification for Small Data Problems. Neural Comput 2022; 34:1220-1255. [PMID: 35344997] [DOI: 10.1162/neco_a_01490] [Received: 08/26/2021] [Accepted: 12/20/2021]
Abstract
Classification problems in the small data regime (with a small data statistics size T and a relatively large feature space dimension D) impose challenges for the common machine learning (ML) and deep learning (DL) tools. The standard learning methods from these areas tend to show a lack of robustness when applied to data sets with significantly fewer data points than dimensions and quickly reach the overfitting bound, thus leading to poor performance beyond the training set. To tackle this issue, we propose eSPA+, a significant extension of the recently formulated entropy-optimal scalable probabilistic approximation algorithm (eSPA). Specifically, we propose to change the order of the optimization steps and replace the most computationally expensive subproblem of eSPA with its closed-form solution. We prove that with these two enhancements, eSPA+ moves from the polynomial to the linear class of complexity scaling algorithms. On several small data learning benchmarks, we show that the eSPA+ algorithm achieves a many-fold speed-up with respect to eSPA and even better performance results when compared to a wide array of ML and DL tools. In particular, we benchmark eSPA+ against the standard eSPA and the main classes of common learning algorithms in the small data regime: various forms of support vector machines, random forests, and long short-term memory algorithms. In all the considered applications, the common learning methods and eSPA are markedly outperformed by eSPA+, which achieves significantly higher prediction accuracy with an orders-of-magnitude lower computational cost.
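The two ingredients named in this abstract, a discretization of the feature space into boxes and a closed-form Bayesian classification rule per box, can be caricatured in a few lines. This is a toy sketch under our own naming (`fit_espa_like`, fixed box centers), not the published eSPA+ algorithm, which learns the discretization and the rule jointly with entropic regularization:

```python
def fit_espa_like(X, y, centers, n_classes):
    """Assign each point to its nearest box center, then estimate
    Lam[k][c] ~ P(class c | box k) as empirical frequencies; this
    frequency estimate is the kind of closed-form substep the
    abstract alludes to (box centers are fixed here for brevity)."""
    K = len(centers)
    counts = [[0] * n_classes for _ in range(K)]
    for x, c in zip(X, y):
        # Nearest box by squared Euclidean distance.
        k = min(range(K),
                key=lambda j: sum((xi - ci) ** 2
                                  for xi, ci in zip(x, centers[j])))
        counts[k][c] += 1
    Lam = []
    for row in counts:
        t = sum(row)
        # Uniform fallback for empty boxes.
        Lam.append([r / t if t else 1.0 / n_classes for r in row])
    return Lam

# Two well-separated clusters yield a deterministic rule per box.
X = [(0.0,), (0.1,), (1.0,), (0.9,)]
y = [0, 0, 1, 1]
Lam = fit_espa_like(X, y, centers=[(0.0,), (1.0,)], n_classes=2)
```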
Affiliation(s)
- Edoardo Vecchi
- Università della Svizzera Italiana, Faculty of Informatics, TI-6900 Lugano, Switzerland
- Lukáš Pospíšil
- VSB Ostrava, Department of Mathematics, Ludvika Podeste 1875/17, 708 33 Ostrava, Czech Republic
- Steffen Albrecht
- University Medical Center of the Johannes Gutenberg-Universität, Institute of Physiology, 55128 Mainz, Germany
- Illia Horenko
- Università della Svizzera Italiana, Faculty of Informatics, TI-6900 Lugano, Switzerland
6. Cheap robust learning of data anomalies with analytically solvable entropic outlier sparsification. Proc Natl Acad Sci U S A 2022; 119:e2119659119. [PMID: 35197293] [PMCID: PMC8917346] [DOI: 10.1073/pnas.2119659119] [Accepted: 01/30/2022]
Abstract
Entropic outlier sparsification (EOS) is proposed as a cheap and robust computational strategy for learning in the presence of data anomalies and outliers. EOS builds on the derived analytic solution of the (weighted) expected loss minimization problem subject to Shannon entropy regularization. The identified closed-form solution is proven to impose additional costs that depend linearly on statistics size and are independent of data dimension. The obtained analytic results also explain why mixtures of spherically symmetric Gaussians, used heuristically in many popular data analysis algorithms, represent an optimal and least-biased choice for the nonparametric probability distributions when working with squared Euclidean distances. The performance of EOS is compared to a range of commonly used tools on synthetic problems and on partially mislabeled supervised classification problems from biomedicine. Applying EOS for coinference of data anomalies during learning is shown to allow reaching an accuracy of 97%±2% when predicting patient mortality after heart failure, statistically significantly outperforming the predictive performance of common learning tools on the same data.
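The closed-form solution mentioned in this abstract has a simple shape: minimizing the entropy-regularized weighted loss sum over the probability simplex yields a softmax of the negative per-sample losses. The following is a minimal sketch of that idea under our own naming (the regularization strength `eps` is a free choice, and the paper's full EOS procedure does more than this single step):

```python
import math

def entropic_outlier_weights(losses, eps=1.0):
    """Minimize sum_i w_i * L_i + eps * sum_i w_i * log(w_i) over
    the probability simplex; the stationarity conditions give the
    closed-form w_i proportional to exp(-L_i / eps)."""
    m = min(losses)  # shift losses for numerical stability
    unnorm = [math.exp(-(l - m) / eps) for l in losses]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# A sample with an anomalously large loss receives a vanishing
# weight, i.e., it is "sparsified away" during learning.
w = entropic_outlier_weights([0.1, 0.2, 0.1, 25.0], eps=0.5)
```

The cost of this step is a single pass over the samples, which matches the abstract's claim of a linear dependence on statistics size and independence of data dimension.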
7. Gerber S, Pospisil L, Sys S, Hewel C, Torkamani A, Horenko I. Co-Inference of Data Mislabelings Reveals Improved Models in Genomics and Breast Cancer Diagnostics. Front Artif Intell 2022; 4:739432. [PMID: 35072059] [PMCID: PMC8766632] [DOI: 10.3389/frai.2021.739432] [Received: 07/10/2021] [Accepted: 11/19/2021]
Abstract
Mislabeling of cases as well as controls in case-control studies is a frequent source of strong bias in prognostic and diagnostic tests and algorithms. Common data processing methods available to researchers in the biomedical community do not allow for consistent and robust treatment of labeled data in situations where both the case and the control groups contain a non-negligible proportion of mislabeled data instances. This is an especially prominent issue in studies regarding late-onset conditions, where individuals who may convert to cases may populate the control group, and for screening studies that often have high false-positive/false-negative rates. To address this problem, we propose a method for a simultaneous robust inference of Lasso-reduced discriminative models and of latent group-specific mislabeling risks, not requiring any exactly labeled data. We apply it to a standard breast cancer imaging dataset and infer the mislabeling probabilities (being rates of false-negative and false-positive core-needle biopsies) together with a small set of simple diagnostic rules, outperforming the state-of-the-art BI-RADS diagnostics on these data. The inferred mislabeling rates for breast cancer biopsies agree with the published purely empirical studies. Applying the method to human genomic data from a healthy-ageing cohort reveals a previously unreported compact combination of single-nucleotide polymorphisms that are strongly associated with a healthy-ageing phenotype for Caucasians. It determines that 7.5% of Caucasians in the 1000 Genomes dataset (selected as a control group) carry a pattern characteristic of healthy ageing.
Affiliation(s)
- Susanne Gerber
- Institute of Human Genetics, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
- Correspondence: Susanne Gerber; Illia Horenko
- Lukas Pospisil
- Faculty of Informatics, Institute of Computational Science, Università della Svizzera Italiana, Lugano, Switzerland
- Stanislav Sys
- Institute of Human Genetics, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
- Charlotte Hewel
- Institute of Human Genetics, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
- Ali Torkamani
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, United States
- Illia Horenko
- Faculty of Informatics, Institute of Computational Science, Università della Svizzera Italiana, Lugano, Switzerland
- Correspondence: Susanne Gerber; Illia Horenko
8. Pfenninger M, Reuss F, Kiebler A, Schönnenbeck P, Caliendo C, Gerber S, Cocchiararo B, Reuter S, Blüthgen N, Mody K, Mishra B, Bálint M, Thines M, Feldmeyer B. Genomic basis for drought resistance in European beech forests threatened by climate change. eLife 2021; 10:e65532. [PMID: 34132196] [PMCID: PMC8266386] [DOI: 10.7554/elife.65532] [Received: 12/07/2020] [Accepted: 06/07/2021]
Abstract
In the course of global climate change, Central Europe is experiencing more frequent and prolonged periods of drought. The drought years 2018 and 2019 affected European beeches (Fagus sylvatica L.) differently: even in the same stand, drought-damaged trees neighboured healthy trees, suggesting that the genotype rather than the environment was responsible for this conspicuous pattern. We used this natural experiment to study the genomic basis of drought resistance with Pool-GWAS. Contrasting the extreme phenotypes identified 106 significantly associated single-nucleotide polymorphisms (SNPs) throughout the genome. Most annotated genes with associated SNPs (>70%) were previously implicated in the drought reaction of plants. Non-synonymous substitutions led either to a functional amino acid exchange or to premature termination. An SNP assay with 70 loci allowed correct prediction of the drought phenotype in 98.6% of a validation sample of 92 trees. Drought resistance in European beech is a moderately polygenic trait that should respond well to natural selection, selective management, and breeding.
Affiliation(s)
- Markus Pfenninger
- Molecular Ecology, Senckenberg Biodiversity and Climate Research Centre, Frankfurt am Main, Germany
- Institute for Organismic and Molecular Evolution, Johannes Gutenberg University, Mainz, Germany
- LOEWE Centre for Translational Biodiversity Genomics, Frankfurt am Main, Germany
- Friederike Reuss
- Molecular Ecology, Senckenberg Biodiversity and Climate Research Centre, Frankfurt am Main, Germany
- Angelika Kiebler
- Molecular Ecology, Senckenberg Biodiversity and Climate Research Centre, Frankfurt am Main, Germany
- Philipp Schönnenbeck
- Molecular Ecology, Senckenberg Biodiversity and Climate Research Centre, Frankfurt am Main, Germany
- Institute of Human Genetics, University Medical Center, Johannes Gutenberg University, Mainz, Germany
- Cosima Caliendo
- Molecular Ecology, Senckenberg Biodiversity and Climate Research Centre, Frankfurt am Main, Germany
- Institute of Human Genetics, University Medical Center, Johannes Gutenberg University, Mainz, Germany
- Susanne Gerber
- Institute of Human Genetics, University Medical Center, Johannes Gutenberg University, Mainz, Germany
- Berardino Cocchiararo
- LOEWE Centre for Translational Biodiversity Genomics, Frankfurt am Main, Germany
- Conservation Genetics Section, Senckenberg Research Institute and Natural History Museum Frankfurt, Gelnhausen, Germany
- Sabrina Reuter
- Ecological Networks lab, Department of Biology, Technische Universität Darmstadt, Darmstadt, Germany
- Nico Blüthgen
- Ecological Networks lab, Department of Biology, Technische Universität Darmstadt, Darmstadt, Germany
- Karsten Mody
- Ecological Networks lab, Department of Biology, Technische Universität Darmstadt, Darmstadt, Germany
- Department of Applied Ecology, Hochschule Geisenheim University, Geisenheim, Germany
- Bagdevi Mishra
- Biological Archives, Senckenberg Biodiversity and Climate Research Centre, Frankfurt am Main, Germany
- Miklós Bálint
- LOEWE Centre for Translational Biodiversity Genomics, Frankfurt am Main, Germany
- Functional Environmental Genomics, Senckenberg Biodiversity and Climate Research Centre, Frankfurt am Main, Germany
- Agricultural Sciences, Nutritional Sciences, and Environmental Management, Universität Giessen, Giessen, Germany
- Marco Thines
- LOEWE Centre for Translational Biodiversity Genomics, Frankfurt am Main, Germany
- Biological Archives, Senckenberg Biodiversity and Climate Research Centre, Frankfurt am Main, Germany
- Institute for Ecology, Evolution and Diversity, Johann Wolfgang Goethe-University, Frankfurt am Main, Germany
- Barbara Feldmeyer
- Molecular Ecology, Senckenberg Biodiversity and Climate Research Centre, Frankfurt am Main, Germany
9. Rodrigues DR, Everschor-Sitte K, Gerber S, Horenko I. A deeper look into natural sciences with physics-based and data-driven measures. iScience 2021; 24:102171. [PMID: 33665584] [PMCID: PMC7907479] [DOI: 10.1016/j.isci.2021.102171]
Abstract
With the development of machine learning in recent years, it is possible to glean much more information from an experimental data set. In this perspective, we discuss some state-of-the-art data-driven tools for analyzing latent effects in data and explain their applicability in the natural sciences, focusing on two recently introduced, physics-motivated, computationally cheap tools: latent entropy and latent dimension. We exemplify their capabilities on several examples from the natural sciences and show that they reveal so far unobserved features such as a gradient in a magnetic measurement and a latent network of glymphatic channels in mouse brain microscopy data. What sets these techniques apart is the relaxation of restrictive assumptions typical of many machine learning models, instead incorporating aspects that best fit the dynamical systems at hand.
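As a loose, self-invented proxy for the kind of entropy-based measure this perspective discusses, one can compute the Shannon entropy of an empirical distribution of latent-state affiliations; the paper's latent-entropy measure is more involved, so treat this purely as an illustration of the underlying quantity:

```python
import math
from collections import Counter

def affiliation_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of
    latent-state labels: 0 when one state dominates entirely,
    log2(K) when K states are used equally often."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

h_uniform = affiliation_entropy([0, 1, 0, 1])  # two equally used states
h_single = affiliation_entropy([0, 0, 0, 0])   # a single state
```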
Affiliation(s)
- Davi Röhe Rodrigues
- Institute of Physics, Johannes Gutenberg University of Mainz, 55128 Mainz, Germany
- Susanne Gerber
- Institute of Human Genetics, University Medical Center of the Johannes Gutenberg University Mainz, 55131 Mainz, Germany
- Illia Horenko
- Università della Svizzera Italiana, Faculty of Informatics, Via G. Buffi 13, 6900 Lugano, Switzerland
10. Weißbach S, Sys S, Hewel C, Todorov H, Schweiger S, Winter J, Pfenninger M, Torkamani A, Evans D, Burger J, Everschor-Sitte K, May-Simera HL, Gerber S. Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines. BMC Genomics 2021; 22:62. [PMID: 33468057] [PMCID: PMC7814447] [DOI: 10.1186/s12864-020-07362-8] [Received: 07/30/2020] [Accepted: 12/30/2020]
Abstract
Background: Next Generation Sequencing (NGS) is the foundation of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform's impact.
Results: The number of detected variants/variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in indel absolute numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups.
Conclusion: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-020-07362-8.
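The cross-platform concordance this study quantifies can be sketched in its simplest form as a Jaccard index over variant call sets keyed by position and allele. This is our own toy stand-in with made-up data, not the study's pipeline (which compares full call sets per individual and per variant class):

```python
def concordance(calls_a, calls_b):
    """Jaccard concordance between two variant call sets, with each
    variant keyed as (chrom, pos, ref, alt); 1.0 means identical
    call sets, 0.0 means no shared calls."""
    a, b = set(calls_a), set(calls_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical calls from two platforms: 2 shared of 4 total.
platform_a = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
              ("chr2", 50, "G", "GA")}
platform_b = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "GA"),
              ("chr3", 7, "T", "C")}
j = concordance(platform_a, platform_b)
```

Keying variants by (chrom, pos, ref, alt) is what makes indel representation differences between pipelines (e.g., left- vs right-alignment) show up as discordance, which is one reason the study reports lower concordance for indels than for SNPs.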
Affiliation(s)
- Stephan Weißbach
- Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Institute of Developmental Biology and Neurobiology, Johannes Gutenberg-University Mainz, Mainz, Germany
- Stanislav Sys
- Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Charlotte Hewel
- Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Hristo Todorov
- Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Susann Schweiger
- Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Leibniz Institute for Resilience Research, Mainz, Germany
- Jennifer Winter
- Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Leibniz Institute for Resilience Research, Mainz, Germany
- Markus Pfenninger
- Department of Molecular Ecology, Senckenberg Biodiversity and Climate Research Centre, Senckenberganlage 25, 60325 Frankfurt am Main, Germany
- Institute for Molecular and Organismic Evolution, Johannes Gutenberg-University Mainz, Johann-Joachim-Becher-Weg 7, 55128 Mainz, Germany
- LOEWE Centre for Translational Biodiversity Genomics, Senckenberg Biodiversity and Climate Research Centre, Senckenberganlage 25, 60325 Frankfurt am Main, Germany
- Ali Torkamani
- Department of Integrative Structural and Computational Biology, Scripps Research Translational Institute, California Campus, San Diego, USA
- Doug Evans
- Department of Integrative Structural and Computational Biology, Scripps Research Translational Institute, California Campus, San Diego, USA
- Joachim Burger
- Institute of Anthropology, Johannes Gutenberg-University Mainz, Mainz, Germany
- Susanne Gerber
- Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
11. Horenko I. On a Scalable Entropic Breaching of the Overfitting Barrier for Small Data Problems in Machine Learning. Neural Comput 2020; 32:1563-1579. [PMID: 32521216] [DOI: 10.1162/neco_a_01296]
Abstract
Overfitting and treatment of small data are among the most challenging problems in machine learning (ML), when a relatively small data statistics size T is not enough to provide a robust ML fit for a relatively large data feature dimension D. Deploying a massively parallel ML analysis of generic classification problems for different D and T, we demonstrate the existence of statistically significant linear overfitting barriers for common ML methods. The results reveal that for a robust classification of bioinformatics-motivated generic problems with the long short-term memory deep learning classifier (LSTM), one needs in the best case a statistics size T that is at least 13.8 times larger than the feature dimension D. We show that this overfitting barrier can be breached at a 10^-12 fraction of the computational cost by means of the entropy-optimal scalable probabilistic approximations algorithm (eSPA), performing a joint solution of the entropy-optimal Bayesian network inference and feature space segmentation problems. Application of eSPA to experimental single-cell RNA sequencing data exhibits a 30-fold classification performance boost when compared to standard bioinformatics tools and a 7-fold boost when compared to the deep learning LSTM classifier.
Affiliation(s)
- Illia Horenko
- Università della Svizzera Italiana, Faculty of Informatics, TI-6900 Lugano, Switzerland