1
Heyndrickx W, Mervin L, Morawietz T, Sturm N, Friedrich L, Zalewski A, Pentina A, Humbeck L, Oldenhof M, Niwayama R, Schmidtke P, Fechner N, Simm J, Arany A, Drizard N, Jabal R, Afanasyeva A, Loeb R, Verma S, Harnqvist S, Holmes M, Pejo B, Telenczuk M, Holway N, Dieckmann A, Rieke N, Zumsande F, Clevert DA, Krug M, Luscombe C, Green D, Ertl P, Antal P, Marcus D, Do Huu N, Fuji H, Pickett S, Acs G, Boniface E, Beck B, Sun Y, Gohier A, Rippmann F, Engkvist O, Göller AH, Moreau Y, Galtier MN, Schuffenhauer A, Ceulemans H. MELLODDY: Cross-pharma Federated Learning at Unprecedented Scale Unlocks Benefits in QSAR without Compromising Proprietary Information. J Chem Inf Model 2024; 64:2331-2344. [PMID: 37642660 PMCID: PMC11005050 DOI: 10.1021/acs.jcim.3c00799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Indexed: 08/31/2023]
Abstract
Federated multipartner machine learning has been touted as an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource-intensive. In the landmark MELLODDY project, indeed, each of ten pharmaceutical companies realized aggregated improvements on its own classification or regression models through federated learning. To this end, they leveraged a novel implementation extending multitask learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma data set of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate the predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point toward an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performance, albeit with a saturating return. Markedly higher improvements were observed for the pharmacokinetics and safety panel assay-based task subsets.
Affiliation(s)
- Lewis Mervin
- AstraZeneca R&D, Biomedical Campus, 1 Francis Crick Ave, Cambridge CB2 0SL, U.K.
- Tobias Morawietz
- Bayer Pharma AG, Global Drug Discovery, Chemical Research, Computational Chemistry, Aprather Weg 18 a, Wuppertal 42096, Germany
- Noé Sturm
- Novartis Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
- Lukas Friedrich
- Merck KGaA, Global Research & Development, Frankfurter Strasse 250, Darmstadt 64293, Germany
- Adam Zalewski
- Amgen Research (Munich) GmbH, Staffelseestraße 2, Munich 81477, Germany
- Anastasia Pentina
- Bayer AG, Machine Learning Research, Research & Development, Pharmaceuticals, Berlin 10117, Germany
- Lina Humbeck
- BI Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, Biberach an der Riss 88397, Germany
- Martijn Oldenhof
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
- Ritsuya Niwayama
- Institut de recherches Servier, 125 chemin de ronde Croissy-sur-Seine, Île-de-France 78290, France
- Nikolas Fechner
- Novartis Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
- Jaak Simm
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
- Adam Arany
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
- Rama Jabal
- Iktos, 65 rue de Prony, Paris 75017, France
- Arina Afanasyeva
- Modality Informatics Group, Digital Research Solutions, Advanced Informatics & Analytics, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan
- Regis Loeb
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
- Shlok Verma
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Simon Harnqvist
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Matthew Holmes
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Balazs Pejo
- Budapest University of Technology and Economics, Department of Networked Systems and Services, Műegyetem rkp. 3, Budapest 1111, Hungary
- Nicholas Holway
- Novartis Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
- Arne Dieckmann
- Bayer AG, API Production, Product Supply, Pharmaceuticals, Ernst-Schering-Straße 14, Bergkamen 59192, Germany
- Nicola Rieke
- NVIDIA GmbH, Floessergasse 2, Munich 81369, Germany
- Djork-Arné Clevert
- Bayer AG, Machine Learning Research, Research & Development, Pharmaceuticals, Berlin 10117, Germany
- Michael Krug
- Merck KGaA, Global Research & Development, Frankfurter Strasse 250, Darmstadt 64293, Germany
- Christopher Luscombe
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Darren Green
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Peter Ertl
- Novartis Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
- Peter Antal
- Budapest University of Technology and Economics, Department of Measurement and Information Systems, Műegyetem rkp. 3, Budapest 1111, Hungary
- David Marcus
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Hideyoshi Fuji
- Modality Informatics Group, Digital Research Solutions, Advanced Informatics & Analytics, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan
- Stephen Pickett
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Gergely Acs
- Budapest University of Technology and Economics, Department of Networked Systems and Services, Műegyetem rkp. 3, Budapest 1111, Hungary
- Eric Boniface
- Substra Foundation - Labelia Labs, 4 rue Voltaire, Nantes 44000, France
- Bernd Beck
- BI Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, Biberach an der Riss 88397, Germany
- Yax Sun
- Amgen Research, 1 Amgen Center Drive, Thousand Oaks, California 92130, United States
- Arnaud Gohier
- Institut de recherches Servier, 125 chemin de ronde Croissy-sur-Seine, Île-de-France 78290, France
- Friedrich Rippmann
- Merck KGaA, Global Research & Development, Frankfurter Strasse 250, Darmstadt 64293, Germany
- Ola Engkvist
- AstraZeneca, Molecular AI, Discovery Sciences, R&D, Pepparedsleden 1, Mölndal 431 50, Sweden
- Andreas H. Göller
- Bayer Pharma AG, Global Drug Discovery, Chemical Research, Computational Chemistry, Aprather Weg 18 a, Wuppertal 42096, Germany
- Yves Moreau
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
- Ansgar Schuffenhauer
- Novartis Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
- Hugo Ceulemans
- Janssen Pharmaceutica NV, Turnhoutseweg 30, Beerse 2340, Belgium
2
Simm J, Humbeck L, Zalewski A, Sturm N, Heyndrickx W, Moreau Y, Beck B, Schuffenhauer A. Splitting chemical structure data sets for federated privacy-preserving machine learning. J Cheminform 2021; 13:96. [PMID: 34876230 PMCID: PMC8650276 DOI: 10.1186/s13321-021-00576-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/28/2021] [Accepted: 11/22/2021] [Indexed: 11/10/2022] Open
Abstract
With the increase in applications of machine learning methods in drug design and related fields, the challenge of designing sound test sets becomes more and more prominent. The goal is a realistic split of chemical structures (compounds) between training, validation and test sets, such that performance on the test set is meaningful for inferring performance in a prospective application. This challenge is interesting and relevant in its own right, but it is even more complex in a federated machine learning approach, where multiple partners jointly train a model under privacy-preserving conditions and chemical structures must not be shared between the participating parties. In this work we discuss three methods which provide a splitting of a data set and are applicable in a federated privacy-preserving setting, namely: (a) locality-sensitive hashing (LSH), (b) sphere exclusion clustering, and (c) scaffold-based binning (scaffold network). For evaluation of these splitting methods we consider the following quality criteria (compared to random splitting): bias in prediction performance, classification label and data imbalance, and similarity distance between the test and training set compounds. The main findings of the paper are that (a) both sphere exclusion clustering and scaffold-based binning result in high-quality splitting of the data sets, and (b) in terms of compute costs, sphere exclusion clustering is very expensive in a federated privacy-preserving setting.
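To make the idea of scaffold-based binning under privacy constraints more concrete, here is a minimal sketch. It is not the paper's implementation: it assumes each partner has already derived a canonical scaffold string per compound with its own cheminformatics toolkit, and that all partners agree on a secret key. The names `assign_fold`, `split_by_scaffold` and `shared_secret` are hypothetical. Because the fold depends only on the keyed hash of the scaffold, every partner places the same scaffold in the same fold without exchanging structures.

```python
import hashlib
import hmac

def assign_fold(scaffold_key: str, shared_secret: bytes, n_folds: int = 5) -> int:
    """Map a scaffold string to a fold index using a keyed hash (assumed scheme)."""
    digest = hmac.new(shared_secret, scaffold_key.encode("utf-8"), hashlib.sha256).digest()
    # Use the first 8 bytes of the digest as an integer and reduce it to a fold index.
    return int.from_bytes(digest[:8], "big") % n_folds

def split_by_scaffold(compounds, shared_secret: bytes, n_folds: int = 5):
    """Group (compound_id, scaffold_key) pairs into folds; same scaffold, same fold."""
    folds = {i: [] for i in range(n_folds)}
    for compound_id, scaffold_key in compounds:
        folds[assign_fold(scaffold_key, shared_secret, n_folds)].append(compound_id)
    return folds

if __name__ == "__main__":
    # Hypothetical local data: scaffold strings are placeholders, not real output.
    local_compounds = [
        ("cmpd-1", "c1ccccc1"),
        ("cmpd-2", "c1ccccc1"),
        ("cmpd-3", "c1ccncc1"),
    ]
    print(split_by_scaffold(local_compounds, shared_secret=b"agreed-between-partners"))
```

Compounds sharing a scaffold never straddle the train/test boundary in this scheme, which is the property that distinguishes it from a random split.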
Affiliation(s)
- Jaak Simm
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, 3001, Heverlee, Belgium
- Lina Humbeck
- Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397, Biberach an der Riss, Germany
- Adam Zalewski
- Amgen Research (Munich) GmbH, Staffelseestraße 2, 81477, Munich, Germany
- Noe Sturm
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002, Basel, Switzerland
- Wouter Heyndrickx
- Janssen Pharmaceutica N.V., Janssen Pharmaceutica, Turnhoutseweg 30, 2340, Beerse, Belgium
- Yves Moreau
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, 3001, Heverlee, Belgium
- Bernd Beck
- Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397, Biberach an der Riss, Germany
- Ansgar Schuffenhauer
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002, Basel, Switzerland
3
Humbeck L, Morawietz T, Sturm N, Zalewski A, Harnqvist S, Heyndrickx W, Holmes M, Beck B. Don't Overweight Weights: Evaluation of Weighting Strategies for Multi-Task Bioactivity Classification Models. Molecules 2021; 26:6959. [PMID: 34834051 PMCID: PMC8620420 DOI: 10.3390/molecules26226959] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/21/2021] [Revised: 11/11/2021] [Accepted: 11/12/2021] [Indexed: 11/17/2022] Open
Abstract
Machine learning models predicting the bioactivity of chemical compounds are now standard tools for cheminformaticians and computational medicinal chemists. Multi-task and federated learning are promising machine learning approaches that allow privacy-preserving usage of large amounts of data from diverse sources, which is crucial for achieving good generalization and high-performance results. Using large, real-world data sets from six pharmaceutical companies, here we investigate different strategies for averaging weighted task loss functions to train multi-task bioactivity classification models. The weighting strategies should be suitable for federated learning and ensure that learning effort is well distributed even if data are diverse. Comparing several approaches using weights that depend on the number of sub-tasks per assay, task size, and class balance, respectively, we find that a simple sub-task weighting approach leads to robust model performance for all investigated data sets and is especially suited for federated learning.
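The notion of averaging weighted task losses can be illustrated with a small sketch. The abstract does not give the exact formula, so the scheme below is an assumption: each sub-task (for example, one activity threshold of an assay) receives weight 1 divided by the number of sub-tasks of its parent assay, so every assay contributes the same total weight regardless of how many sub-tasks it has. The function names and the example task identifiers are hypothetical.

```python
from collections import Counter

def subtask_weights(task_to_assay: dict) -> dict:
    """Assumed scheme: weight each sub-task by 1 / (number of sub-tasks in its assay)."""
    counts = Counter(task_to_assay.values())
    return {task: 1.0 / counts[assay] for task, assay in task_to_assay.items()}

def weighted_mean_loss(task_losses: dict, weights: dict) -> float:
    """Weighted average of per-task losses, normalised by the total weight used."""
    total_weight = sum(weights[task] for task in task_losses)
    return sum(weights[task] * loss for task, loss in task_losses.items()) / total_weight

if __name__ == "__main__":
    # Hypothetical setup: assay_1 has two sub-tasks, assay_2 has one.
    task_to_assay = {"a1_t1": "assay_1", "a1_t2": "assay_1", "a2_t1": "assay_2"}
    weights = subtask_weights(task_to_assay)
    losses = {"a1_t1": 0.2, "a1_t2": 0.4, "a2_t1": 0.6}
    print(weights, weighted_mean_loss(losses, weights))
```

A weight that depends only on the assay's own sub-task count can be computed locally by each partner, which is one reason such a scheme would fit a federated setting; weights based on task size or class balance would follow the same pattern with a different `subtask_weights` function.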
Affiliation(s)
- Lina Humbeck
- Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397 Biberach an der Riss, Germany
- Tobias Morawietz
- Bayer AG, Pharmaceuticals, R&D, Digital Technologies, Computational Molecular Design, 42096 Wuppertal, Germany
- Noe Sturm
- Novartis Institutes for BioMedical Research, CH-4002 Basel, Switzerland
- Adam Zalewski
- Amgen Research (Munich) GmbH, Staffelseestraße 2, 81477 Munich, Germany
- Simon Harnqvist
- Computational Sciences, GlaxoSmithKline, Gunnels Wood Road, Stevenage SG1 2NY, UK
- Matthew Holmes
- Computational Sciences, GlaxoSmithKline, Gunnels Wood Road, Stevenage SG1 2NY, UK
- Bernd Beck
- Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397 Biberach an der Riss, Germany
6
Jasper JB, Humbeck L, Brinkjost T, Koch O. A novel interaction fingerprint derived from per atom score contributions: exhaustive evaluation of interaction fingerprint performance in docking based virtual screening. J Cheminform 2018; 10:15. [PMID: 29549526 PMCID: PMC5856854 DOI: 10.1186/s13321-018-0264-0] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 05/09/2017] [Accepted: 02/17/2018] [Indexed: 01/28/2023] Open
Abstract
Protein ligand interaction fingerprints are a powerful approach for the analysis and assessment of docking poses to improve docking performance in virtual screening. In this study, a novel interaction fingerprint approach (PADIF, protein per atom score contributions derived interaction fingerprint) is presented which was specifically designed for utilising the GOLD scoring functions’ atom contributions together with a specific scoring scheme. This allows the incorporation of known protein–ligand complex structures for a target-specific scoring. Unlike many other methods, this approach uses weighting factors reflecting the relative frequency of a specific interaction in the references and penalizes destabilizing interactions. In addition, and for the first time, an exhaustive validation study was performed that assesses the performance of PADIF and two other interaction fingerprints in virtual screening. Here, PADIF shows superior results, and some rules of thumb for a successful use of interaction fingerprints could be identified.
Affiliation(s)
- Julia B Jasper
- Faculty of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Str. 6, 44227, Dortmund, Germany
- Lina Humbeck
- Faculty of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Str. 6, 44227, Dortmund, Germany
- Tobias Brinkjost
- Faculty of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Str. 6, 44227, Dortmund, Germany; Department of Computer Science, TU Dortmund University, Otto-Hahn-Str. 14, 44227, Dortmund, Germany
- Oliver Koch
- Faculty of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Str. 6, 44227, Dortmund, Germany
7
Humbeck L, Weigang S, Schäfer T, Mutzel P, Koch O. CHIPMUNK: A Virtual Synthesizable Small-Molecule Library for Medicinal Chemistry, Exploitable for Protein-Protein Interaction Modulators. ChemMedChem 2018; 13:532-539. [PMID: 29392860 DOI: 10.1002/cmdc.201700689] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 11/03/2017] [Revised: 01/27/2018] [Indexed: 02/05/2023]
Abstract
A common issue during drug design and development is the discovery of novel scaffolds for protein targets. On the one hand the chemical space of purchasable compounds is rather limited; on the other hand artificially generated molecules suffer from a grave lack of accessibility in practice. Therefore, we generated a novel virtual library of small molecules which are synthesizable from purchasable educts, called CHIPMUNK (CHemically feasible In silico Public Molecular UNiverse Knowledge base). Altogether, CHIPMUNK covers over 95 million compounds and encompasses regions of the chemical space that are not covered by existing databases. The coverage of CHIPMUNK exceeds the chemical space spanned by the Lipinski rule of five to foster the exploration of novel and difficult target classes. The analysis of the generated property space reveals that CHIPMUNK is well suited for the design of protein-protein interaction inhibitors (PPIIs). Furthermore, a recently developed structural clustering algorithm (StruClus) for big data was used to partition the sub-libraries into meaningful subsets and assist scientists to process the large amount of data. These clustered subsets also contain the target space based on ChEMBL data which was included during clustering.
Affiliation(s)
- Lina Humbeck
- Faculty of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Straße 6, Dortmund, 44227, Germany
- Sebastian Weigang
- Faculty of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Straße 6, Dortmund, 44227, Germany
- Till Schäfer
- Department of Computer Science, TU Dortmund University, Otto-Hahn-Straße 14, Dortmund, 44227, Germany
- Petra Mutzel
- Department of Computer Science, TU Dortmund University, Otto-Hahn-Straße 14, Dortmund, 44227, Germany
- Oliver Koch
- Faculty of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Straße 6, Dortmund, 44227, Germany
8
Schäfer T, Kriege N, Humbeck L, Klein K, Koch O, Mutzel P. Scaffold Hunter: a comprehensive visual analytics framework for drug discovery. J Cheminform 2017; 9:28. [PMID: 29086162 PMCID: PMC5425364 DOI: 10.1186/s13321-017-0213-3] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Received: 10/16/2016] [Accepted: 04/10/2017] [Indexed: 01/31/2023] Open
Abstract
The era of big data is influencing how rational drug discovery and the development of bioactive molecules are performed, and versatile tools are needed to assist in molecular design workflows. Scaffold Hunter is a flexible visual analytics framework for the analysis of chemical compound data and combines techniques from several fields such as data mining and information visualization. The framework allows analyzing high-dimensional chemical compound data in an interactive fashion, combining intuitive visualizations with automated analysis methods including versatile clustering methods. Originally designed to analyze the scaffold tree, Scaffold Hunter is continuously revised and extended. We describe recent extensions that significantly increase the applicability for a variety of tasks.
Affiliation(s)
- Till Schäfer
- Department of Computer Science, TU Dortmund University, Otto-Hahn-Str. 14, Dortmund, 44227, Germany
- Nils Kriege
- Department of Computer Science, TU Dortmund University, Otto-Hahn-Str. 14, Dortmund, 44227, Germany
- Lina Humbeck
- Faculty of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Str. 6, Dortmund, 44227, Germany
- Karsten Klein
- Department of Computer and Information Science, University of Konstanz, Universitaetsstrasse 10, Konstanz, 78464, Germany
- Oliver Koch
- Faculty of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Str. 6, Dortmund, 44227, Germany
- Petra Mutzel
- Department of Computer Science, TU Dortmund University, Otto-Hahn-Str. 14, Dortmund, 44227, Germany