1. Valdés JJ, Tchagang AB. Novel machine learning insights into the QM7b and QM9 quantum mechanics datasets. J Comput Chem 2024; 45:1193-1214. [PMID: 38329198 DOI: 10.1002/jcc.27295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 12/06/2023] [Accepted: 12/12/2023] [Indexed: 02/09/2024]
Abstract
This paper (i) explores the internal structure of two quantum mechanics datasets (QM7b, QM9), each composed of several thousand organic molecules described in terms of electronic properties, and (ii) further explores an inverse approach to molecular design that uses machine learning methods to approximate the atomic composition of molecules, using QM9 data. Understanding the structure and characteristics of this kind of data is important when predicting the atomic composition from physical-chemical properties in inverse molecular design. Intrinsic dimension analysis, clustering, and outlier detection methods were used in the study. They revealed that for both datasets the intrinsic dimensionality is several times smaller than the descriptive dimension. The QM7b data is composed of well-defined clusters related to atomic composition. The QM9 data consists of an outer region predominantly composed of outliers, and an inner, core region that concentrates clustered inlier objects. A significant relationship exists between the number of atoms in the molecule and its outlier/inlier nature. The spatial structure exhibits a relationship with molecular weight. Despite the structural differences between the two datasets, the predictability of variables of interest for inverse molecular design is high. This is exemplified by models estimating the number of atoms of the molecule both from the original properties and from lower-dimensional embedding spaces. In the generative approach, the input is a set of desired properties of the molecule and the output is an approximation of the atomic composition in terms of its constituent chemical elements. This could serve as the starting region for further search in the huge space determined by the set of possible chemical compounds. The quantum mechanics dataset QM9, composed of 133,885 small organic molecules and 19 electronic properties, is used in the study.
Different multi-target regression approaches were considered for predicting the atomic composition from the properties, including feature engineering techniques in an auto-machine learning framework. High-quality models were found that predict the atomic composition of the molecules from their electronic properties, as well as from a subset only 52.6% of the full size. Feature selection worked better than feature generation. The results validate the generative approach to inverse molecular design.
Affiliation(s)
- Julio J Valdés
  - National Research Council Canada, Digital Technologies Research Centre, Ottawa, Canada
- Alain B Tchagang
  - National Research Council Canada, Digital Technologies Research Centre, Ottawa, Canada
2. The generalized ratios intrinsic dimension estimator. Sci Rep 2022; 12:20005. [PMID: 36411305 PMCID: PMC9678878 DOI: 10.1038/s41598-022-20991-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2021] [Accepted: 09/21/2022] [Indexed: 11/23/2022]
Abstract
Modern datasets are characterized by numerous features related by complex dependency structures. To deal with these data, dimensionality reduction techniques are essential. Many of these techniques rely on the concept of intrinsic dimension (id), a measure of the complexity of the dataset. However, the estimation of this quantity is not trivial: often, the id depends rather dramatically on the scale of the distances among data points. At short distances, the id can be grossly overestimated due to the presence of noise, becoming smaller and approximately scale-independent only at large distances. An immediate approach to examining the scale dependence consists in decimating the dataset, which unavoidably induces non-negligible statistical errors at large scale. This article introduces a novel statistical method, Gride, that allows estimating the id as an explicit function of the scale without performing any decimation. Our approach is based on rigorous distributional results that enable the quantification of uncertainty of the estimates. Moreover, our method is simple and computationally efficient since it relies only on the distances among data points. Through simulation studies, we show that Gride is asymptotically unbiased, provides comparable estimates to other state-of-the-art methods, and is more robust to short-scale noise than other likelihood-based approaches.
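As an illustration of the ratio-based idea in the abstract above, the following minimal sketch implements only the simplest member of the family: the ratio of the second to the first nearest-neighbor distance (the two-nearest-neighbor case), with the closed-form maximum likelihood estimate that follows from modeling that ratio as Pareto-distributed. It is not the full Gride estimator, which uses neighbors of higher order to probe larger scales. The function names and the synthetic test data (a two-dimensional sheet linearly embedded in four dimensions) are assumptions made for the example.

```python
import math
import random


def two_nn_id(points):
    """Estimate the intrinsic dimension from nearest-neighbor distance ratios.

    For each point, mu = r2 / r1 is the ratio of the distances to its second
    and first nearest neighbors. Under a locally uniform density, log(mu) is
    exponentially distributed with rate d, so the MLE is N / sum(log(mu)).
    Assumes no duplicate points (r1 > 0).
    """
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


def sample_sheet(n, seed=0):
    """Hypothetical test data: a 2-D sheet linearly embedded in 4-D."""
    rng = random.Random(seed)
    sheet = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        sheet.append([u, v, u + v, u - v])
    return sheet


points = sample_sheet(400)
estimate = two_nn_id(points)
print(f"ambient dimension: 4, estimated intrinsic dimension: {estimate:.2f}")
```

The brute-force neighbor search is quadratic in the number of points and is used here only to keep the sketch dependency-free.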
3. Bailey J, Houle ME, Ma X. Local Intrinsic Dimensionality, Entropy and Statistical Divergences. Entropy (Basel) 2022; 24:1220. [PMID: 36141105 PMCID: PMC9497584 DOI: 10.3390/e24091220] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2022] [Revised: 08/22/2022] [Accepted: 08/26/2022] [Indexed: 06/16/2023]
Abstract
Properties of data distributions can be assessed at both global and local scales. At a highly localized scale, a fundamental measure is the local intrinsic dimensionality (LID), which assesses growth rates of the cumulative distribution function within a restricted neighborhood and characterizes properties of the geometry of a local neighborhood. In this paper, we explore the connection of LID to other well-known measures for complexity assessment and comparison, namely, entropy and statistical distances or divergences. In an asymptotic context, we develop new analytical expressions for these quantities in terms of LID. This reveals the fundamental nature of LID as a building block for characterizing and comparing data distributions, opening the door to new methods for distributional analysis at a local scale.
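For readers unfamiliar with the quantity, the growth-rate characterization mentioned in the abstract can be written out as follows. This is the standard formulation of LID from the literature, given here as a reminder; the symbols $F$, $r$, and $m$ are notation chosen for this note and may differ from the paper's own.

```latex
% F is the cumulative distribution function of distances from a reference
% point, assumed positive and continuously differentiable at r > 0.
\[
  \mathrm{ID}_F(r)
  = \lim_{\epsilon \to 0^+}
    \frac{\ln\bigl(F((1+\epsilon)\,r) / F(r)\bigr)}{\ln(1+\epsilon)}
  = \frac{r\,F'(r)}{F(r)},
  \qquad
  \mathrm{LID}^*_F = \lim_{r \to 0^+} \mathrm{ID}_F(r).
\]
% Worked example: if F(r) = c\,r^m near zero (as for a uniform density on an
% m-dimensional manifold), then
\[
  \frac{r\,F'(r)}{F(r)} = \frac{r \cdot c\,m\,r^{m-1}}{c\,r^{m}} = m .
\]
```

In words, the LID recovers the exponent with which the probability mass of a small ball grows with its radius.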
Affiliation(s)
- James Bailey
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC 3010, Australia
- Michael E. Houle
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC 3010, Australia
- Xingjun Ma
  - School of Computer Science, Fudan University, Shanghai 200437, China
4. Benkő Z, Stippinger M, Rehus R, Bencze A, Fabó D, Hajnal B, Eröss LG, Telcs A, Somogyvári Z. Manifold-adaptive dimension estimation revisited. PeerJ Comput Sci 2022; 8:e790. [PMID: 35111907 PMCID: PMC8771813 DOI: 10.7717/peerj-cs.790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2021] [Accepted: 11/01/2021] [Indexed: 06/14/2023]
Abstract
Data dimensionality informs us about data complexity and sets limits on the structure of successful signal processing pipelines. In this work we revisit and improve the manifold-adaptive Farahmand-Szepesvári-Audibert (FSA) dimension estimator, making it one of the best nearest neighbor-based dimension estimators available. We compute the probability density function of local FSA estimates, assuming that the local manifold density is uniform. Based on the probability density function, we propose to use the median of local estimates as a basic global measure of intrinsic dimensionality, and we demonstrate the advantages of this asymptotically unbiased estimator over the previously proposed statistics: the mode and the mean. Additionally, from the probability density function, we derive the maximum likelihood formula for global intrinsic dimensionality, assuming that the i.i.d. condition holds. We tackle edge and finite-sample effects with an exponential correction formula, calibrated on hypercube datasets. We compare the performance of the corrected median-FSA estimator with kNN estimators: maximum likelihood (Levina-Bickel), the 2NN and two implementations of DANCo (R and MATLAB). We show that the corrected median-FSA estimator beats the maximum likelihood estimator and is on an equal footing with DANCo for standard synthetic benchmarks according to mean percentage error and error rate metrics. With the median-FSA algorithm, we reveal diverse changes in the neural dynamics during resting state and during epileptic seizures. We identify brain areas with lower-dimensional dynamics that are possible causal sources and candidates for being seizure onset zones.
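The median-of-local-estimates idea described in the abstract above can be sketched as follows. The local FSA estimate at a point is commonly written as ln 2 / ln(R_2k / R_k), where R_k is the distance to the k-th nearest neighbor; the global estimate is then the median over all points. This sketch deliberately omits the exponential edge and finite-sample correction mentioned in the abstract, and the function names, the choice k = 5, and the synthetic data are assumptions made for the example.

```python
import math
import random
import statistics


def median_fsa(points, k=5):
    """Uncorrected median-FSA estimate of the global intrinsic dimension.

    For each point, the local estimate is ln(2) / ln(R_2k / R_k), with R_k
    the distance to the k-th nearest neighbor. The median of the local
    estimates is returned. Needs at least 2k + 1 distinct points.
    """
    local_estimates = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r_k, r_2k = dists[k - 1], dists[2 * k - 1]
        local_estimates.append(math.log(2) / math.log(r_2k / r_k))
    return statistics.median(local_estimates)


def sample_sheet(n, seed=0):
    """Hypothetical test data: a 2-D sheet linearly embedded in 4-D."""
    rng = random.Random(seed)
    sheet = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        sheet.append([u, v, u + v, u - v])
    return sheet


fsa_estimate = median_fsa(sample_sheet(400), k=5)
print(f"median-FSA estimate (uncorrected): {fsa_estimate:.2f}")
```

Using the median rather than the mean keeps the global estimate insensitive to the occasional very large local value that arises when R_2k and R_k are nearly equal.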
Affiliation(s)
- Zsigmond Benkő
  - Department of Computational Sciences, Wigner Research Centre for Physics, Budapest, Hungary
  - János Szentágothai Doctoral School of Neurosciences, Semmelweis University, Budapest, Hungary
- Marcell Stippinger
  - Department of Computational Sciences, Wigner Research Centre for Physics, Budapest, Hungary
- Roberta Rehus
  - Department of Computational Sciences, Wigner Research Centre for Physics, Budapest, Hungary
- Attila Bencze
  - Department of Computational Sciences, Wigner Research Centre for Physics, Budapest, Hungary
- Dániel Fabó
  - Epilepsy Center, Department of Neurology, National Institute of Clinical Neurosciences, Budapest, Hungary
- Boglárka Hajnal
  - János Szentágothai Doctoral School of Neurosciences, Semmelweis University, Budapest, Hungary
  - Epilepsy Center, Department of Neurology, National Institute of Clinical Neurosciences, Budapest, Hungary
- Loránd G. Eröss
  - Department of Functional Neurosurgery, National Institute of Clinical Neurosciences, Budapest, Hungary
  - Faculty of Information Technology and Bionics, Péter Pázmány Catholic University, Budapest, Hungary
- András Telcs
  - Department of Computational Sciences, Wigner Research Centre for Physics, Budapest, Hungary
  - Department of Computer Science and Information Theory, Faculty of Electrical Engineering and Informatics, Budapest University of Technology and Economics, Budapest, Hungary
  - Department of Quantitative Methods, Faculty of Business and Economics, University of Pannonia, Veszprém, Hungary
- Zoltán Somogyvári
  - Department of Computational Sciences, Wigner Research Centre for Physics, Budapest, Hungary
  - Neuromicrosystems ltd., Budapest, Hungary
5. Qiu H, Yang Y, Rezakhah S. Intrinsic dimension estimation method based on correlation dimension and kNN method. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
6. Thordsen E, Schubert E. ABID: Angle Based Intrinsic Dimensionality — Theory and analysis. Inform Syst 2022. [DOI: 10.1016/j.is.2022.101989] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
7. Aumüller M, Ceccarello M. The role of local dimensionality measures in benchmarking nearest neighbor search. Inform Syst 2021. [DOI: 10.1016/j.is.2021.101807] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/26/2022]
8. Bac J, Mirkes EM, Gorban AN, Tyukin I, Zinovyev A. Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation. Entropy (Basel) 2021; 23:1368. [PMID: 34682092 PMCID: PMC8534554 DOI: 10.3390/e23101368] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 09/17/2021] [Revised: 10/10/2021] [Accepted: 10/16/2021] [Indexed: 02/07/2023]
Abstract
Dealing with uncertainty in applications of machine learning to real-life data critically depends on the knowledge of intrinsic dimensionality (ID). A number of methods have been suggested for the purpose of estimating ID, but no standard package to easily apply them one by one or all at once has been implemented in Python. This technical note introduces scikit-dimension, an open-source Python package for intrinsic dimension estimation. The scikit-dimension package provides a uniform implementation of most of the known ID estimators based on the scikit-learn application programming interface to evaluate the global and local intrinsic dimension, as well as generators of synthetic toy and benchmark datasets widespread in the literature. The package is developed with tools assessing the code quality, coverage, unit testing and continuous integration. We briefly describe the package and demonstrate its use in a large-scale (more than 500 datasets) benchmarking of methods for ID estimation for real-life and synthetic data.
Affiliation(s)
- Jonathan Bac
  - Institut Curie, PSL Research University, 75248 Paris, France
  - INSERM, U900, 75248 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75272 Paris, France
- Evgeny M. Mirkes
  - Department of Mathematics, University of Leicester, Leicester LE1 7RH, UK
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhniy Novgorod, Russia
- Alexander N. Gorban
  - Department of Mathematics, University of Leicester, Leicester LE1 7RH, UK
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhniy Novgorod, Russia
- Ivan Tyukin
  - Department of Mathematics, University of Leicester, Leicester LE1 7RH, UK
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhniy Novgorod, Russia
- Andrei Zinovyev
  - Institut Curie, PSL Research University, 75248 Paris, France
  - INSERM, U900, 75248 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75272 Paris, France
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhniy Novgorod, Russia
9
|
|