1
Weymuth T, Unsleber JP, Türtscher PL, Steiner M, Sobez JG, Müller CH, Mörchen M, Klasovita V, Grimmel SA, Eckhoff M, Csizi KS, Bosia F, Bensberg M, Reiher M. SCINE-Software for chemical interaction networks. J Chem Phys 2024; 160:222501. [PMID: 38857173 DOI: 10.1063/5.0206974] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Accepted: 05/09/2024] [Indexed: 06/12/2024] Open
Abstract
The software for chemical interaction networks (SCINE) project aims at pushing the frontier of quantum chemical calculations on molecular structures to a new level. While calculations on individual structures as well as on simple relations between them have become routine in chemistry, new developments have pushed the frontier in the field to high-throughput calculations. Chemical relations may be created by a search for specific molecular properties in a molecular design attempt, or they can be defined by a set of elementary reaction steps that form a chemical reaction network. The software modules of SCINE have been designed to facilitate such studies. The features of the modules are (i) general applicability of the applied methodologies, ranging from electronic structure (no restriction to specific elements of the periodic table) to microkinetic modeling (with few restrictions on molecularity), (ii) full modularity, so that SCINE modules can also be applied as stand-alone programs or be exchanged for external software packages that fulfill a similar purpose (to increase options for computational campaigns and to provide alternatives for tasks that are hard or impossible to accomplish with certain programs), (iii) high stability and autonomous operation, so that control and steering by an operator are as easy as possible, and (iv) easy embedding into complex heterogeneous environments for molecular structures taken individually or in the context of a reaction network. A graphical user interface unites all modules and ensures interoperability. All components of the software have been made available as open source and free of charge.
Affiliation(s)
- Thomas Weymuth
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Jan P Unsleber
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Paul L Türtscher
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Miguel Steiner
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Jan-Grimo Sobez
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Charlotte H Müller
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Maximilian Mörchen
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Veronika Klasovita
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Stephanie A Grimmel
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Marco Eckhoff
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Katja-Sophia Csizi
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Francesco Bosia
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Moritz Bensberg
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Markus Reiher
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
2
López-Pérez K, Kim TD, Miranda-Quintana RA. iSIM: instant similarity. DIGITAL DISCOVERY 2024; 3:1160-1171. [PMID: 38873032 PMCID: PMC11167700 DOI: 10.1039/d4dd00041b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Accepted: 05/06/2024] [Indexed: 06/15/2024]
Abstract
The quantification of molecular similarity has been part of cheminformatics since its beginning. Although several similarity indices and molecular representations have been reported, all of them ultimately reduce to the calculation of molecular similarities of only two objects at a time. Hence, to obtain the average similarity of a set of molecules, all the pairwise comparisons need to be computed, at a computational cost that scales quadratically with the number of molecules. Here we propose an exact alternative to this problem: iSIM (instant similarity). iSIM performs comparisons of multiple molecules at the same time and yields the same value as the average pairwise comparisons of molecules represented by binary fingerprints and real-value descriptors. In this work, we introduce the mathematical framework and several applications of iSIM in chemical sampling, visualization, diversity selection, and clustering.
Affiliation(s)
- Kenneth López-Pérez
  - Department of Chemistry and Quantum Theory Project, University of Florida, Gainesville, Florida 32611, USA
- Taewon D Kim
  - Department of Chemistry and Quantum Theory Project, University of Florida, Gainesville, Florida 32611, USA
3
Csizi KS, Reiher M. Automated preparation of nanoscopic structures: Graph-based sequence analysis, mismatch detection, and pH-consistent protonation with uncertainty estimates. J Comput Chem 2024; 45:761-776. [PMID: 38124290 DOI: 10.1002/jcc.27276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2023] [Accepted: 11/14/2023] [Indexed: 12/23/2023]
Abstract
Structure and function in nanoscale atomistic assemblies are tightly coupled, and every atom with its specific position and even every electron will have a decisive effect on the electronic structure, and hence, on the molecular properties. Molecular simulations of nanoscopic atomistic structures therefore require accurately resolved three-dimensional input structures. If extracted from experiment, these structures often suffer from severe uncertainties, of which the lack of information on hydrogen atoms is a prominent example. Hence, experimental structures require careful review and curation, which is a time-consuming and error-prone process. Here, we present a fast and robust protocol for the automated structure analysis and pH-consistent protonation, in short, ASAP. For biomolecules as a target, the ASAP protocol integrates sequence analysis and error assessment of a given input structure. ASAP allows for pKa prediction from reference data through Gaussian process regression including uncertainty estimation and connects to system-focused atomistic modeling described in Brunken and Reiher (J. Chem. Theory Comput. 16, 2020, 1646). Although focused on biomolecules, ASAP can be extended to other nanoscopic objects, because most of its design elements rely on a general graph-based foundation guaranteeing transferability. The modular character of the underlying pipeline supports different degrees of automation, which allows for (i) efficient feedback loops for human-machine interaction with a low entrance barrier and for (ii) integration into autonomous procedures such as automated force field parametrizations. This facilitates fast switching of the pH-state through on-the-fly system-focused reparametrization during a molecular simulation at virtually no extra computational cost.
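Gaussian process regression with uncertainty estimation, as named in this abstract, can be sketched generically in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the ASAP model: the one-dimensional input, the squared-exponential kernel, the fixed hyperparameters, and the synthetic sine data are all placeholders chosen for the example, and no pKa reference data are involved.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two 1D arrays of inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)


def gp_predict(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_st = rbf_kernel(x_test, x_train)
    k_ss = rbf_kernel(x_test, x_test)

    # Cholesky factorization avoids forming an explicit matrix inverse.
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_st @ alpha

    v = np.linalg.solve(chol, k_st.T)
    var = np.diag(k_ss) - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))


# Synthetic reference data (illustrative only).
x_train = np.linspace(0.0, 5.0, 8)
y_train = np.sin(x_train)

# One query inside the training range, one far outside it.
x_test = np.array([2.5, 20.0])
mean, std = gp_predict(x_train, y_train, x_test)
print(mean, std)
```

The predicted standard deviation grows for queries far from the reference data, which is the property that makes such a model useful for flagging unreliable predictions.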
Affiliation(s)
- Katja-Sophia Csizi
  - Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
- Markus Reiher
  - Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
4
Briling K, Calvino Alonso Y, Fabrizio A, Corminboeuf C. SPAHM(a,b): Encoding the Density Information from Guess Hamiltonian in Quantum Machine Learning Representations. J Chem Theory Comput 2024; 20:1108-1117. [PMID: 38227222 PMCID: PMC10867806 DOI: 10.1021/acs.jctc.3c01040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Revised: 12/20/2023] [Accepted: 12/26/2023] [Indexed: 01/17/2024]
Abstract
Recently, we introduced a class of molecular representations for kernel-based regression methods, the spectrum of approximated Hamiltonian matrices (SPAHM), that takes advantage of lightweight one-electron Hamiltonians traditionally used as a self-consistent field initial guess. The original SPAHM variant is built from occupied-orbital energies (i.e., eigenvalues) and naturally contains all of the information about nuclear charges, atomic positions, and symmetry requirements. Its advantages were demonstrated on data sets featuring a wide variation of charge and spin, for which traditional structure-based representations commonly fail. SPAHM(a,b), as introduced here, expand the eigenvalue SPAHM into local and transferable representations. They rely upon one-electron density matrices to build fingerprints from atomic and bond density overlap contributions inspired by preceding state-of-the-art representations. The performance and efficiency of SPAHM(a,b) are assessed on the predictions for data sets of prototypical organic molecules (QM7) of different charges and azoheteroarene dyes in an excited state. Overall, both SPAHM(a) and SPAHM(b) outperform state-of-the-art representations on difficult prediction tasks such as the atomic properties of charged open-shell species and of π-conjugated systems.
Affiliation(s)
- Ksenia R. Briling
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
- Yannick Calvino Alonso
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
- Alberto Fabrizio
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
  - National Centre for Computational Design and Discovery of Novel Materials (MARVEL), École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
- Clemence Corminboeuf
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
  - National Centre for Computational Design and Discovery of Novel Materials (MARVEL), École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
5
Yan W, Tan L, Meng-Shan L, Sheng S, Jun W, Fu-an W. SaPt-CNN-LSTM-AR-EA: a hybrid ensemble learning framework for time series-based multivariate DNA sequence prediction. PeerJ 2023; 11:e16192. [PMID: 37810796 PMCID: PMC10559882 DOI: 10.7717/peerj.16192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Accepted: 09/06/2023] [Indexed: 10/10/2023] Open
Abstract
Biological sequence data mining is a hot spot in bioinformatics. A biological sequence can be regarded as a set of characters. Time series are similar to biological sequences in terms of both representation and mechanism. Therefore, in this article, biological sequences are represented as time series to obtain biological time sequences (BTS). A hybrid ensemble learning framework (SaPt-CNN-LSTM-AR-EA) for BTS is proposed. Single-sequence and multi-sequence models are constructed with a self-adaption pre-training one-dimensional convolutional recurrent neural network and an autoregressive fractional integrated moving average fused with an evolutionary algorithm, respectively. In DNA sequence experiments with six viruses, SaPt-CNN-LSTM-AR-EA achieved good overall prediction performance, with prediction accuracy and correlation reaching 1.7073 and 0.9186, respectively. SaPt-CNN-LSTM-AR-EA was compared with five other benchmark models to verify its effectiveness and stability. SaPt-CNN-LSTM-AR-EA increased the average accuracy by about 30%. The framework proposed in this article is significant in biology, biomedicine, and computer science, and can be widely applied in sequence splicing, computational biology, bioinformation, and other fields.
Affiliation(s)
- Wu Yan
  - School of Biotechnology, Jiangsu University of Science & Technology, Zhenjiang, China
  - School of Mathematics and Computer Science, Gannan Normal University, Ganzhou, Jiangxi, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, China
- Li Tan
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, China
- Li Meng-Shan
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, China
- Sheng Sheng
  - School of Biotechnology, Jiangsu University of Science & Technology, Zhenjiang, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, China
- Wang Jun
  - School of Biotechnology, Jiangsu University of Science & Technology, Zhenjiang, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, China
- Wu Fu-an
  - School of Biotechnology, Jiangsu University of Science & Technology, Zhenjiang, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, China
6
Eckhoff M, Reiher M. Lifelong Machine Learning Potentials. J Chem Theory Comput 2023; 19:3509-3525. [PMID: 37288932 PMCID: PMC10308836 DOI: 10.1021/acs.jctc.3c00279] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/10/2023] [Indexed: 06/09/2023]
Abstract
Machine learning potentials (MLPs) trained on accurate quantum chemical data can retain that high accuracy while imposing little computational demand. On the downside, they need to be trained for each individual system. In recent years, a vast number of MLPs have been trained from scratch because learning additional data typically requires retraining on all data so as not to forget previously acquired knowledge. Additionally, most common structural descriptors of MLPs cannot efficiently represent a large number of different chemical elements. In this work, we tackle these problems by introducing element-embracing atom-centered symmetry functions (eeACSFs), which combine structural properties and element information from the periodic table. These eeACSFs are key for our development of a lifelong machine learning potential (lMLP). Uncertainty quantification can be exploited to go beyond a fixed, pretrained MLP and arrive at a continuously adapting lMLP, because a predefined level of accuracy can be ensured. To extend the applicability of an lMLP to new systems, we apply continual learning strategies to enable autonomous and on-the-fly training on a continuous stream of new data. For the training of deep neural networks, we propose the continual resilient (CoRe) optimizer and incremental learning strategies relying on rehearsal of data, regularization of parameters, and the architecture of the model.
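Rehearsal of data, one of the incremental learning strategies mentioned in this abstract, is commonly realized by keeping a bounded memory of earlier samples and mixing some of them into each new training batch. The sketch below shows that general idea with reservoir sampling. It is an assumption-laden illustration, not the strategy of the paper: the `RehearsalBuffer` class, its capacity, and the integer stand-ins for training structures are all hypothetical.

```python
import random


class RehearsalBuffer:
    """Bounded memory of past samples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # samples offered so far
        self._rng = random.Random(seed)

    def add(self, sample):
        """Offer one sample; every sample seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            slot = self._rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = sample

    def mixed_batch(self, new_samples, n_old):
        """Return the new samples followed by up to n_old rehearsed ones."""
        n_old = min(n_old, len(self.items))
        return list(new_samples) + self._rng.sample(self.items, n_old)


# Integers stand in for training structures arriving as a stream.
buffer = RehearsalBuffer(capacity=100)
for structure_id in range(1000):
    buffer.add(structure_id)

# Train on two new structures together with eight rehearsed ones.
batch = buffer.mixed_batch(new_samples=[1000, 1001], n_old=8)
print(len(buffer.items), len(batch))
```

Because the memory is bounded, the cost of each update stays constant as the data stream grows, at the price of rehearsing only a subsample of the past.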
Affiliation(s)
- Marco Eckhoff
  - ETH Zürich, Departement Chemie und Angewandte Biowissenschaften, 8093 Zürich, Switzerland
- Markus Reiher
  - ETH Zürich, Departement Chemie und Angewandte Biowissenschaften, 8093 Zürich, Switzerland