1
Hozana GN, Díaz Mirón G, Hassanali A. Data-Driven Discovery of the Origins of UV Absorption in the Alpha-3C Protein. J Phys Chem B 2025; 129:4728-4737. [PMID: 40312142 DOI: 10.1021/acs.jpcb.5c00532] [Indexed: 05/03/2025]
Abstract
Over the past decade, there has been a growing body of experimental work showing that proteins devoid of aromatic and conjugated groups can absorb light in the near-UV beyond 300 nm and emit visible light. Understanding the origins of this phenomenon offers the possibility of designing noninvasive spectroscopic probes for local interactions in biological systems. It was recently found that the synthetic protein α3C displays UV-vis absorption between 250 and 800 nm, which was shown to arise from charge-transfer excitations between charged amino acids. In this work, we use a data-driven approach to re-examine the origins of these features using a combination of molecular dynamics and excited-state simulations. Specifically, an unsupervised learning approach beginning with encoding protein environments with local atomic descriptors is employed to automatically detect relevant structural motifs. We identify three main motifs corresponding to different hydrogen-bonding patterns that are subsequently used to perform QM/MM simulations, including the entire protein and solvent bath with the density-functional tight-binding (DFTB) approach. Hydrogen-bonding structures involving arginine and carboxylate groups appear to be the most prone to near-UV absorption. We show that the magnitude of the UV-vis absorption predicted from the simulations is rather sensitive to the size of the QM region employed as well as to the inclusion of explicit solvation.
Affiliation(s)
- Germaine Neza Hozana
  - International Centre for Theoretical Physics (ICTP), Strada Costiera 11, Trieste 34151, Italy
  - Dipartimento di Fisica, Università degli Studi di Trieste, Via Alfonso Valerio 2, Trieste 34127, Italy
- Gonzalo Díaz Mirón
  - International Centre for Theoretical Physics (ICTP), Strada Costiera 11, Trieste 34151, Italy
- Ali Hassanali
  - International Centre for Theoretical Physics (ICTP), Strada Costiera 11, Trieste 34151, Italy
2
Mazza F, Dalfovo D, Bartocci A, Lattanzi G, Romanel A. Integrative Computational Analysis of Common EXO5 Haplotypes: Impact on Protein Dynamics, Genome Stability, and Cancer Progression. J Chem Inf Model 2025; 65:3640-3654. [PMID: 40115981 PMCID: PMC12004521 DOI: 10.1021/acs.jcim.5c00067] [Received: 01/13/2025] [Revised: 02/19/2025] [Accepted: 03/13/2025] [Indexed: 03/23/2025]
Abstract
Understanding the impact of common germline variants on protein structure, function, and disease progression is crucial in cancer research. This study presents a comprehensive analysis of the EXO5 gene, which encodes a DNA exonuclease involved in DNA repair that was previously associated with cancer susceptibility. We employed an integrated approach combining genomic and clinical data analysis, deep learning variant effect prediction, and molecular dynamics (MD) simulations to investigate the effects of common EXO5 haplotypes on protein structure, dynamics, and cancer outcomes. We characterized the haplotype structure of EXO5 across diverse human populations, identifying five common haplotypes, and studied their impact on the EXO5 protein. Extensive, all-atom MD simulations revealed significant structural and dynamic differences among the EXO5 protein variants, particularly in their catalytic region. The L151P EXO5 protein variant exhibited the most substantial conformational changes, potentially disruptive for EXO5's function and nuclear localization. Analysis of The Cancer Genome Atlas data showed that cancer patients carrying L151P EXO5 had significantly shorter progression-free survival in prostate and pancreatic cancers and exhibited increased genomic instability. This study highlights the strength of our methodology in uncovering the effects of common genetic variants on protein function and their implications for disease outcomes.
Affiliation(s)
- Fabio Mazza
  - Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Via Sommarive 9, Trento 38123, Italy
- Davide Dalfovo
  - Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Via Sommarive 9, Trento 38123, Italy
- Alessio Bartocci
  - Department of Physics, University of Trento, Via Sommarive 9, Trento 38123, Italy
  - INFN-TIFPA, Trento Institute for Fundamental Physics and Applications, Via Sommarive 14, Trento 38123, Italy
- Gianluca Lattanzi
  - Department of Physics, University of Trento, Via Sommarive 9, Trento 38123, Italy
  - INFN-TIFPA, Trento Institute for Fundamental Physics and Applications, Via Sommarive 14, Trento 38123, Italy
- Alessandro Romanel
  - Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Via Sommarive 9, Trento 38123, Italy
3
Wild R, Wodaczek F, Del Tatto V, Cheng B, Laio A. Automatic feature selection and weighting in molecular systems using Differentiable Information Imbalance. Nat Commun 2025; 16:270. [PMID: 39747013 PMCID: PMC11696465 DOI: 10.1038/s41467-024-55449-7] [Received: 05/21/2024] [Accepted: 12/12/2024] [Indexed: 01/04/2025]
Abstract
Feature selection is essential in the analysis of molecular systems and many other fields, but several uncertainties remain: What is the optimal number of features for a simplified, interpretable model that retains essential information? How should features with different units be aligned, and how should their relative importance be weighted? Here, we introduce the Differentiable Information Imbalance (DII), an automated method to rank information content between sets of features. Using distances in a ground truth feature space, DII identifies a low-dimensional subset of features that best preserves these relationships. Each feature is scaled by a weight, which is optimized by minimizing the DII through gradient descent. This allows simultaneously performing unit alignment and relative importance scaling, while preserving interpretability. DII can also produce sparse solutions and determine the optimal size of the reduced feature space. We demonstrate the usefulness of this approach on two benchmark molecular problems: (1) identifying collective variables that describe conformations of a biomolecule, and (2) selecting features for training a machine-learning force field. These results show the potential of DII in addressing feature selection challenges and optimizing dimensionality in various applications. The method is available in the Python library DADApy.
Affiliation(s)
- Romina Wild
  - International School for Advanced Studies (SISSA), Trieste, Italy
- Felix Wodaczek
  - The Institute of Science and Technology Austria (ISTA), Klosterneuburg, Austria
- Bingqing Cheng
  - The Institute of Science and Technology Austria (ISTA), Klosterneuburg, Austria
  - Department of Chemistry, University of California, Berkeley, CA, USA
- Alessandro Laio
  - International School for Advanced Studies (SISSA), Trieste, Italy
  - The Abdus Salam International Centre for Theoretical Physics (ICTP), Trieste, Italy
4
Semelak JA, Gallo M, González Flecha FL, Di Pino S, Pertinhez TA, Zeida A, Gout I, Estrin DA, Trujillo M. Mg2+ binding to coenzyme A. Arch Biochem Biophys 2025; 763:110202. [PMID: 39536960 DOI: 10.1016/j.abb.2024.110202] [Received: 06/27/2024] [Revised: 09/16/2024] [Accepted: 10/30/2024] [Indexed: 11/16/2024]
Abstract
Magnesium (Mg2+), the second most abundant intracellular cation, plays a crucial role in cellular functions. In this study, we investigate the interaction between Mg2+ and coenzyme A (CoA), a thiol-containing cofactor central to cellular metabolism and also involved in protein modifications. Isothermal titration calorimetry revealed a 1:1 binding stoichiometry between Mg2+ and free CoA under biologically relevant conditions. Association constants of (537 ± 20) M⁻¹ and (312 ± 7) M⁻¹ were determined at 25 °C and pH 7.2 and 7.8, respectively, suggesting that a significant fraction of CoA is likely bound to Mg2+ both in the cytosol and in the mitochondrial matrix. Additionally, the process is entropically driven, and our results support that the origin of the entropy gain is solvent-related. On the other hand, the combination of 1- and 2-dimensional nuclear magnetic resonance spectroscopy with molecular dynamics simulations and unsupervised learning demonstrates a direct coordination between Mg2+ and the phosphate groups belonging to the 4-phosphopantothenate unit and bound to position 5' of the adenosine ring. Interestingly, the phosphate in position 3' only indirectly contributes to Mg2+ coordination. Finally, we discuss how the binding of Mg2+ to CoA perturbs the chemical environment of different CoA atoms, regardless of their apparent proximity to the coordination site, through the modulation of the CoA conformational landscape. This insight holds implications for understanding the impact on both CoA and Mg2+ functions in physiological and pathological processes.
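To illustrate how a 1:1 association constant translates into a bound fraction, here is a minimal sketch using the standard single-site binding expression, fraction = Ka·[Mg2+] / (1 + Ka·[Mg2+]). The two Ka values are taken from the abstract; the free Mg2+ concentration of 1 mM is an illustrative assumption, not a value reported there, and the function name is hypothetical.

```python
def fraction_bound(ka_per_molar: float, free_ligand_molar: float) -> float:
    """Fraction of CoA bound for simple 1:1 binding at a given free ligand concentration."""
    x = ka_per_molar * free_ligand_molar
    return x / (1.0 + x)


# Ka values quoted in the abstract (25 °C), keyed by pH.
KA_BY_PH = {7.2: 537.0, 7.8: 312.0}

# ASSUMPTION: 1 mM free Mg2+ is a placeholder chosen for illustration only.
ASSUMED_FREE_MG_MOLAR = 1.0e-3

if __name__ == "__main__":
    for ph, ka in KA_BY_PH.items():
        frac = fraction_bound(ka, ASSUMED_FREE_MG_MOLAR)
        print(f"pH {ph}: fraction of CoA bound = {frac:.2f}")
```

Under this assumed concentration the bound fraction comes out at roughly a third at pH 7.2 and roughly a quarter at pH 7.8; a different free Mg2+ level would shift both numbers.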
Affiliation(s)
- Jonathan A Semelak
  - CONICET-Universidad de Buenos Aires, Instituto de Química-Física de los Materiales, Medio Ambiente y Energía (INQUIMAE), Buenos Aires, Argentina; Facultad de Ciencias Exactas y Naturales, Departamento de Química Inorgánica, Analítica y Química Física, Universidad de Buenos Aires, C1428EHA Buenos Aires, Argentina
- Mariana Gallo
  - Laboratory of Biochemistry and Metabolomics, Department of Medicine and Surgery, University of Parma, Italy
- F Luis González Flecha
  - Laboratorio de Biofísica Molecular, Instituto de Química y Fisicoquímica Biológicas, Universidad de Buenos Aires, CONICET, Buenos Aires, Argentina
- Solana Di Pino
  - CONICET-Universidad de Buenos Aires, Instituto de Química-Física de los Materiales, Medio Ambiente y Energía (INQUIMAE), Buenos Aires, Argentina; Facultad de Ciencias Exactas y Naturales, Departamento de Química Inorgánica, Analítica y Química Física, Universidad de Buenos Aires, C1428EHA Buenos Aires, Argentina
- Thelma A Pertinhez
  - Laboratory of Biochemistry and Metabolomics, Department of Medicine and Surgery, University of Parma, Italy
- Ari Zeida
  - Departamento de Bioquímica, Facultad de Medicina, Universidad de la República, Montevideo, 11800, Uruguay; Centro de Investigaciones Biomédicas (CEINBIO), Universidad de la República, Montevideo, 11800, Uruguay
- Ivan Gout
  - Department of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK; Institute of Molecular Biology and Genetics, National Academy of Sciences of Ukraine, 03680, Kyiv, Ukraine
- Dario A Estrin
  - CONICET-Universidad de Buenos Aires, Instituto de Química-Física de los Materiales, Medio Ambiente y Energía (INQUIMAE), Buenos Aires, Argentina; Facultad de Ciencias Exactas y Naturales, Departamento de Química Inorgánica, Analítica y Química Física, Universidad de Buenos Aires, C1428EHA Buenos Aires, Argentina
- Madia Trujillo
  - Departamento de Bioquímica, Facultad de Medicina, Universidad de la República, Montevideo, 11800, Uruguay; Centro de Investigaciones Biomédicas (CEINBIO), Universidad de la República, Montevideo, 11800, Uruguay
5
Zadoks A, Marrazzo A, Marzari N. Spectral operator representations. NPJ COMPUTATIONAL MATERIALS 2024; 10:278. [PMID: 39634056 PMCID: PMC11611740 DOI: 10.1038/s41524-024-01446-9] [Received: 04/29/2024] [Accepted: 10/24/2024] [Indexed: 12/07/2024]
Abstract
Machine learning in atomistic materials science has grown to become a powerful tool, with most approaches focusing on atomic geometry, typically decomposed into local atomic environments. This approach, while well-suited for machine-learned interatomic potentials, is conceptually at odds with learning complex intrinsic properties of materials, often driven by spectral properties commonly represented in reciprocal space (e.g., band gaps or mobilities) which cannot be readily partitioned in real space. For such applications, methods that represent the electronic rather than the atomic structure could be more promising. In this work, we present a general framework focused on electronic-structure descriptors that take advantage of the natural symmetries and inherent interpretability of physical models. We apply this framework first to material similarity and then to accelerated screening, where a model trained on 217 materials correctly labels 75% of the entries in the Materials Cloud 3D database that meet common screening criteria for promising transparent-conducting materials.
Affiliation(s)
- Austin Zadoks
  - Theory and Simulation of Materials (THEOS), École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- Antimo Marrazzo
  - Dipartimento di Fisica, Università di Trieste, I-34151 Trieste, Italy
  - Scuola Internazionale Superiore di Studi Avanzati (SISSA), I-34136 Trieste, Italy
- Nicola Marzari
  - Theory and Simulation of Materials (THEOS), École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
  - National Centre for Computational Design and Discovery of Novel Materials (MARVEL), École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
  - Laboratory for Materials Simulations (LMS), Paul Scherrer Institut, CH-5232 Villigen, Switzerland
6
Omwansu W, Musembi R, Derese S. Graph-based analysis of H-bond networks and unsupervised learning reveal conformational coupling in prion peptide segments. Phys Chem Chem Phys 2024. [PMID: 39291469 DOI: 10.1039/d4cp02123a] [Indexed: 09/19/2024]
Abstract
In this study, we employed a comprehensive computational approach to investigate the physical chemistry of the water networks surrounding hydrated peptide segments, as derived from molecular dynamics simulations. Our analysis uncovers a complex interplay of direct and water-mediated hydrogen bonds that intricately weave through the peptides. We demonstrate that these hydrogen bond networks encode critical information about the peptides' conformational behavior, with the dimensionality of these networks showing sensitivity to the peptides' conformations. Additionally, we estimated the free-energy landscape of the peptides across various conformations, revealing that their structures are predominantly characterized by unfolded, partially folded, and folded configurations, resulting in broad and rugged free-energy surfaces due to the numerous degrees of freedom contributed by the surrounding solvent. Importantly, the structured nature of this free-energy landscape becomes obscured when conventional collective variables, such as the number of hydrogen bonds, are used. Our findings provide new insights into the molecular mechanisms that couple protein and solvent degrees of freedom, highlighting their significance in the functioning of biological systems.
Affiliation(s)
- Wycliffe Omwansu
  - Department of Physics, University of Nairobi, P.O. Box 30197-00100, Nairobi, Kenya
  - The Abdus Salam International Centre for Theoretical Physics, Strada Costiera 11, 34151 Trieste, Italy
- Robinson Musembi
  - Department of Physics, University of Nairobi, P.O. Box 30197-00100, Nairobi, Kenya
- Solomon Derese
  - Department of Chemistry, University of Nairobi, P.O. Box 30197-00100, Nairobi, Kenya
7
Tamagnone S, Laio A, Gabrié M. Coarse-Grained Molecular Dynamics with Normalizing Flows. J Chem Theory Comput 2024. [PMID: 39223750 DOI: 10.1021/acs.jctc.4c00700] [Indexed: 09/04/2024]
Abstract
We propose a sampling algorithm relying on a collective variable (CV) of midsize dimension modeled by a normalizing flow and using nonequilibrium dynamics to propose full configurational moves from the proposition of a refreshed value of the CV made by the flow. The algorithm takes the form of a Markov chain with nonlocal updates, allowing jumps through energy barriers across metastable states. The flow is trained throughout the algorithm to reproduce the free energy landscape of the CV. The output of the algorithm is a sample of thermalized configurations and the trained network that can be used to efficiently produce more configurations. We show the functioning of the algorithm first in a test case with a mixture of Gaussians. Then, we successfully tested it on a higher-dimensional system consisting of a polymer in solution with a compact state and an extended stable state separated by a high free energy barrier.
Affiliation(s)
- Samuel Tamagnone
  - International School for Advanced Studies (SISSA), Via Bonomea 265, Trieste 34136, Italy
- Alessandro Laio
  - International School for Advanced Studies (SISSA), Via Bonomea 265, Trieste 34136, Italy
  - The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, Trieste 34151, Italy
- Marylou Gabrié
  - CMAP, CNRS, Institut Polytechnique de Paris, École Polytechnique, 91120 Palaiseau, France
8
Macocco I, Mira A, Laio A. Intrinsic dimension as a multi-scale summary statistics in network modeling. Sci Rep 2024; 14:17756. [PMID: 39085320 PMCID: PMC11291743 DOI: 10.1038/s41598-024-68113-3] [Received: 05/17/2024] [Accepted: 07/19/2024] [Indexed: 08/02/2024]
Abstract
Complex networks are powerful mathematical tools for modelling and understanding the behaviour of highly interconnected systems. However, existing methods for analyzing these networks focus on local properties (e.g. degree distribution, clustering coefficient) or global properties (e.g. diameter, modularity) and fail to characterize the network structure across multiple scales. In this paper, we introduce a rigorous method for calculating the intrinsic dimension of unweighted networks. The intrinsic dimension is a feature that describes the network structure at all scales, from local to global. We propose using this measure as a summary statistic within an Approximate Bayesian Computation framework to infer the parameters of flexible and multi-purpose mechanistic models that generate complex networks. Furthermore, we present a new mechanistic model that can reproduce the intrinsic dimension of networks with large diameters, a task that has been challenging for existing models.
Affiliation(s)
- Iuri Macocco
  - International School for Advanced Studies (SISSA), Via Bonomea 265, 34136, Trieste, Italy
- Antonietta Mira
  - Faculty of Economics, Euler Institute, Università della Svizzera italiana, Via Buffi 13, 6900, Lugano, Switzerland
  - Department of Science and High Technology, Università degli Studi dell'Insubria, Via Valleggio 11, 22100, Como, Italy
- Alessandro Laio
  - International School for Advanced Studies (SISSA), Via Bonomea 265, 34136, Trieste, Italy
  - The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, 34014, Trieste, Italy
9
Del Tatto V, Fortunato G, Bueti D, Laio A. Robust inference of causality in high-dimensional dynamical processes from the Information Imbalance of distance ranks. Proc Natl Acad Sci U S A 2024; 121:e2317256121. [PMID: 38687797 PMCID: PMC11087807 DOI: 10.1073/pnas.2317256121] [Received: 10/05/2023] [Accepted: 03/01/2024] [Indexed: 05/02/2024]
Abstract
We introduce an approach which allows detecting causal relationships between variables for which the time evolution is available. Causality is assessed by a variational scheme based on the Information Imbalance of distance ranks, a statistical test capable of inferring the relative information content of different distance measures. We test whether the predictability of a putative driven system Y can be improved by incorporating information from a potential driver system X, without explicitly modeling the underlying dynamics and without the need to compute probability densities of the dynamic variables. This framework makes causality detection possible even between high-dimensional systems where only few of the variables are known or measured. Benchmark tests on coupled chaotic dynamical systems demonstrate that our approach outperforms other model-free causality detection methods, successfully handling both unidirectional and bidirectional couplings. We also show that the method can be used to robustly detect causality in human electroencephalography data.
Affiliation(s)
- Vittorio Del Tatto
  - Physics Section, Scuola Internazionale Superiore di Studi Avanzati, Trieste 34136, Italy
- Gianfranco Fortunato
  - Physics Section, Scuola Internazionale Superiore di Studi Avanzati, Trieste 34136, Italy
- Domenica Bueti
  - Physics Section, Scuola Internazionale Superiore di Studi Avanzati, Trieste 34136, Italy
- Alessandro Laio
  - Physics Section, Scuola Internazionale Superiore di Studi Avanzati, Trieste 34136, Italy
  - Condensed Matter and Statistical Physics Section, International Centre for Theoretical Physics, Trieste 34151, Italy
10
Lee SC, Z Y. Interpretation of autoencoder-learned collective variables using Morse-Smale complex and sublevelset persistent homology: An application on molecular trajectories. J Chem Phys 2024; 160:144104. [PMID: 38591676 DOI: 10.1063/5.0191446] [Received: 12/13/2023] [Accepted: 03/22/2024] [Indexed: 04/10/2024]
Abstract
Dimensionality reduction often serves as the first step toward a minimalist understanding of physical systems as well as their accelerated simulation. In particular, neural network-based nonlinear dimensionality reduction methods, such as autoencoders, have shown promising outcomes in uncovering collective variables (CVs). However, the physical meaning of these CVs remains largely elusive. In this work, we constructed a framework that (1) determines the optimal number of CVs needed to capture the essential molecular motions using an ensemble of hierarchical autoencoders and (2) provides topology-based interpretations of the autoencoder-learned CVs with Morse-Smale complex and sublevelset persistent homology. This approach was exemplified using a series of n-alkanes and can be regarded as a general, explainable nonlinear dimensionality reduction method.
Affiliation(s)
- Shao-Chun Lee
  - Department of Nuclear, Plasma, and Radiological Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
  - Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
- Y Z
  - Department of Nuclear, Plasma, and Radiological Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
  - Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
  - Department of Nuclear Engineering and Radiological Sciences, Department of Materials Science and Engineering, Department of Robotics, and Applied Physics Program, University of Michigan, Ann Arbor, Michigan 48105, USA
11
Swinburne TD. Coarse-Graining and Forecasting Atomic Material Simulations with Descriptors. PHYSICAL REVIEW LETTERS 2023; 131:236101. [PMID: 38134806 DOI: 10.1103/physrevlett.131.236101] [Received: 03/10/2023] [Revised: 07/21/2023] [Accepted: 11/13/2023] [Indexed: 12/24/2023]
Abstract
Atomic simulations of materials require significant resources to generate, store, and analyze. Here, descriptor functions are proposed as a general, metric latent space for atomic structures, ideal for use in large-scale simulations. Descriptors can regress a broad range of properties, including character-dependent dislocation densities, stress states, or radial distribution functions. A vector autoregressive model can generate trajectories over yield points, resample from new initial conditions and forecast trajectory futures. A forecast confidence, essential for practical application, is derived by propagating forecasts through the Mahalanobis outlier distance, providing a powerful tool to assess coarse-grained models. Application to nanoparticles and yielding of nanoscale dislocation networks confirms low uncertainty forecasts are accurate and resampling allows for the propagation of smooth property distributions. Yielding is associated with a collapse in the intrinsic dimension of the descriptor manifold, which is discussed in relation to the yield surface.
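As a reminder of what a Mahalanobis outlier distance measures, the following minimal sketch computes the distance of a query descriptor vector from a reference set, using the textbook definition sqrt((x − μ)ᵀ Σ⁻¹ (x − μ)). It is not the paper's forecasting pipeline; the function name and the toy data are hypothetical.

```python
import numpy as np


def mahalanobis_distance(x, reference):
    """Mahalanobis distance of vector x from the rows of `reference`."""
    reference = np.asarray(reference, dtype=float)
    mu = reference.mean(axis=0)             # sample mean of the reference set
    cov = np.cov(reference, rowvar=False)   # sample covariance (columns are features)
    diff = np.asarray(x, dtype=float) - mu
    # Solve cov @ y = diff instead of forming the explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


if __name__ == "__main__":
    # Hypothetical 2-feature reference descriptors (corners of a square).
    ref = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
    print(mahalanobis_distance([1.0, 1.0], ref))  # query at the sample mean
    print(mahalanobis_distance([3.0, 1.0], ref))  # query outside the reference cloud
```

A large distance flags a query that lies far from the reference distribution in units of its own spread, which is the sense in which such a quantity can serve as a confidence measure.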
Affiliation(s)
- Thomas D Swinburne
  - Aix-Marseille Université, CNRS, CINaM UMR 7325, Campus de Luminy, 13288 Marseille, France
12
Zdybał K, Parente A, Sutherland JC. Improving reduced-order models through nonlinear decoding of projection-dependent outputs. PATTERNS (NEW YORK, N.Y.) 2023; 4:100859. [PMID: 38035196 PMCID: PMC10682754 DOI: 10.1016/j.patter.2023.100859] [Received: 06/09/2023] [Revised: 07/12/2023] [Accepted: 09/14/2023] [Indexed: 12/02/2023]
Abstract
A fundamental hindrance to building data-driven reduced-order models (ROMs) is the poor topological quality of a low-dimensional data projection. This includes behavior such as overlapping, twisting, or large curvatures or uneven data density that can generate nonuniqueness and steep gradients in quantities of interest (QoIs). Here, we employ an encoder-decoder neural network architecture for dimensionality reduction. We find that nonlinear decoding of projection-dependent QoIs, when embedded in a dimensionality reduction technique, promotes improved low-dimensional representations of complex multiscale and multiphysics datasets. When data projection (encoding) is affected by forcing accurate nonlinear reconstruction of the QoIs (decoding), we minimize nonuniqueness and gradients in representing QoIs on a projection. This in turn leads to enhanced predictive accuracy of a ROM. Our findings are relevant to a variety of disciplines that develop data-driven ROMs of dynamical systems such as reacting flows, plasma physics, atmospheric physics, or computational neuroscience.
Affiliation(s)
- Kamila Zdybał
  - Université Libre de Bruxelles, École Polytechnique de Bruxelles, Aero-Thermo-Mechanics Laboratory, Brussels, Belgium
  - BRITE: Brussels Institute for Thermal-Fluid Systems and Clean Energy, Brussels, Belgium
- Alessandro Parente
  - Université Libre de Bruxelles, École Polytechnique de Bruxelles, Aero-Thermo-Mechanics Laboratory, Brussels, Belgium
  - BRITE: Brussels Institute for Thermal-Fluid Systems and Clean Energy, Brussels, Belgium
- James C. Sutherland
  - Department of Chemical Engineering, University of Utah, Salt Lake City, UT, USA
13
Macocco I, Glielmo A, Grilli J, Laio A. Intrinsic Dimension Estimation for Discrete Metrics. PHYSICAL REVIEW LETTERS 2023; 130:067401. [PMID: 36827575 DOI: 10.1103/physrevlett.130.067401] [Received: 05/13/2022] [Revised: 09/16/2022] [Accepted: 01/20/2023] [Indexed: 06/18/2023]
Abstract
Real-world datasets characterized by discrete features are ubiquitous: from categorical surveys to clinical questionnaires, from unweighted networks to DNA sequences. Nevertheless, the most common unsupervised dimensional reduction methods are designed for continuous spaces, and their use for discrete spaces can lead to errors and biases. In this Letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting, finding a surprisingly small ID, of order 2. This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.
Affiliation(s)
- Iuri Macocco
  - International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy
- Aldo Glielmo
  - International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy
  - Bank of Italy, DG for Information Technology, 00044 Rome, Italy
- Jacopo Grilli
  - The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, 34014 Trieste, Italy
- Alessandro Laio
  - International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy
  - The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, 34014 Trieste, Italy
14
The generalized ratios intrinsic dimension estimator. Sci Rep 2022; 12:20005. [PMID: 36411305 PMCID: PMC9678878 DOI: 10.1038/s41598-022-20991-1] [Received: 11/23/2021] [Accepted: 09/21/2022] [Indexed: 11/23/2022]
Abstract
Modern datasets are characterized by numerous features related by complex dependency structures. To deal with these data, dimensionality reduction techniques are essential. Many of these techniques rely on the concept of intrinsic dimension (id), a measure of the complexity of the dataset. However, the estimation of this quantity is not trivial: often, the id depends rather dramatically on the scale of the distances among data points. At short distances, the id can be grossly overestimated due to the presence of noise, becoming smaller and approximately scale-independent only at large distances. An immediate approach to examining the scale dependence consists in decimating the dataset, which unavoidably induces non-negligible statistical errors at large scale. This article introduces a novel statistical method, Gride, that allows estimating the id as an explicit function of the scale without performing any decimation. Our approach is based on rigorous distributional results that enable the quantification of uncertainty of the estimates. Moreover, our method is simple and computationally efficient since it relies only on the distances among data points. Through simulation studies, we show that Gride is asymptotically unbiased, provides comparable estimates to other state-of-the-art methods, and is more robust to short-scale noise than other likelihood-based approaches.
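To make the idea of estimating an intrinsic dimension from neighbour distances alone concrete, here is a minimal sketch of the simpler two-nearest-neighbour (TwoNN) maximum-likelihood estimator, not Gride itself. It assumes the ratio μ = r2/r1 of second- to first-neighbour distances follows a Pareto law with shape d, which gives d ≈ N / Σ log μ. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def twonn_dimension(X):
    """TwoNN maximum-likelihood estimate of the intrinsic dimension of the rows of X."""
    X = np.asarray(X, dtype=float)
    # Full pairwise Euclidean distance matrix (fine for a few thousand points).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    # After partitioning at k=1, columns 0 and 1 hold the two smallest distances in order.
    two_smallest = np.partition(dist, 1, axis=1)[:, :2]
    r1, r2 = two_smallest[:, 0], two_smallest[:, 1]
    mu = r2 / r1
    return len(X) / np.sum(np.log(mu))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: a 3-dimensional uniform cloud embedded in 5 dimensions.
    latent = rng.uniform(size=(1000, 3))
    X = np.hstack([latent, np.zeros((1000, 2))])
    print(f"estimated intrinsic dimension: {twonn_dimension(X):.2f}")
```

Because this estimator uses only the two closest neighbours, it probes the shortest length scale in the data, which is where the abstract notes that noise can inflate the estimate; using ratios of more distant neighbours is one way to examine larger scales without decimating the dataset.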
15
Glielmo A, Zeni C, Cheng B, Csányi G, Laio A. Ranking the information content of distance measures. PNAS NEXUS 2022; 1:pgac039. [PMID: 36713323 PMCID: PMC9802303 DOI: 10.1093/pnasnexus/pgac039] [Received: 09/03/2021] [Revised: 03/04/2022] [Accepted: 03/31/2022] [Indexed: 06/18/2023]
Abstract
Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and also units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Finding a small set of features that still retains sufficient information about the dataset is important for the successful application of many statistical learning approaches. We introduce a statistical test that can assess the relative information retained when using two different distance measures, and determine if they are equivalent, independent, or if one is more informative than the other. This ranking can in turn be used to identify the most informative distance measure and, therefore, the most informative set of features, out of a pool of candidates. To illustrate the general applicability of our approach, we show that it reproduces the known importance ranking of policy variables for Covid-19 control, and also identifies compact yet informative descriptors for atomic structures. We further provide initial evidence that the information asymmetry measured by the proposed test can be used to infer relationships of causality between the features of a dataset. The method is general and should be applicable to many branches of science.
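The rank-based comparison of two distance measures can be sketched as follows. The abstract does not state the formula; this sketch assumes the commonly quoted form of the Information Imbalance, Δ(A→B) = (2/N) · ⟨rank in B of each point's nearest neighbour in A⟩, with Euclidean distances in each feature space. Function names and the toy data are hypothetical, and this is not the DADApy implementation.

```python
import numpy as np


def distance_ranks(X):
    """ranks[i, j] = rank of point j among the neighbours of point i (1 = nearest)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)           # push each point itself to the last rank
    order = np.argsort(dist, axis=1)         # neighbours sorted by distance
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, n + 1)[None, :]
    return ranks


def information_imbalance(XA, XB):
    """Assumed definition: 2/N times the mean B-rank of each point's nearest neighbour in A."""
    ranks_a = distance_ranks(XA)
    ranks_b = distance_ranks(XB)
    n = ranks_a.shape[0]
    nearest_in_a = np.argmin(ranks_a, axis=1)  # column holding rank 1 in space A
    return 2.0 / n * np.mean(ranks_b[np.arange(n), nearest_in_a])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    full = rng.normal(size=(200, 2))   # hypothetical two-feature dataset
    partial = full[:, :1]              # keeps only the first feature
    print("full -> partial:", information_imbalance(full, partial))
    print("partial -> full:", information_imbalance(partial, full))
```

A value near zero means space A predicts neighbourhoods in space B well, while a value near one means it carries essentially no information about them. In the toy case the full feature set predicts the single retained feature better than the reverse, which is the asymmetry the test relies on to rank candidates.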
Affiliation(s)
- Aldo Glielmo
  - Physics Department, International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy
  - Bank of Italy, 00187, Italy
- Claudio Zeni
  - Physics Department, International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy
- Bingqing Cheng
  - The Institute of Science and Technology Austria, Am Campus 1, 3400 Klosterneuburg, Austria
- Gábor Csányi
  - Engineering Laboratory, University of Cambridge, Trumpington St, CB21PZ Cambridge, UK
- Alessandro Laio
  - Physics Department, International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy