1
|
Kanemaru K, Cranley J, Muraro D, Miranda AMA, Ho SY, Wilbrey-Clark A, Patrick Pett J, Polanski K, Richardson L, Litvinukova M, Kumasaka N, Qin Y, Jablonska Z, Semprich CI, Mach L, Dabrowska M, Richoz N, Bolt L, Mamanova L, Kapuge R, Barnett SN, Perera S, Talavera-López C, Mulas I, Mahbubani KT, Tuck L, Wang L, Huang MM, Prete M, Pritchard S, Dark J, Saeb-Parsy K, Patel M, Clatworthy MR, Hübner N, Chowdhury RA, Noseda M, Teichmann SA. Author Correction: Spatially resolved multiomics of human cardiac niches. Nature 2025; 640:E4. [PMID: 40113895 PMCID: PMC11981930 DOI: 10.1038/s41586-025-08886-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/22/2025]
Affiliation(s)
- Kazumasa Kanemaru
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - James Cranley
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Daniele Muraro
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | | | - Siew Yen Ho
- Cardiac Morphology Unit, Royal Brompton Hospital and Imperial College London, London, UK
| | - Anna Wilbrey-Clark
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Jan Patrick Pett
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Krzysztof Polanski
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Laura Richardson
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Monika Litvinukova
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany
| | - Natsuhiko Kumasaka
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Yue Qin
- National Heart and Lung Institute, Imperial College London, London, UK
| | - Zuzanna Jablonska
- National Heart and Lung Institute, Imperial College London, London, UK
| | - Claudia I Semprich
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Lukas Mach
- National Heart and Lung Institute, Imperial College London, London, UK
- Royal Brompton Hospital, London, UK
| | - Monika Dabrowska
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Nathan Richoz
- Molecular Immunity Unit, Department of Medicine, University of Cambridge, MRC Laboratory of Molecular Biology, Cambridge, UK
| | - Liam Bolt
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Lira Mamanova
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Rakeshlal Kapuge
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Sam N Barnett
- National Heart and Lung Institute, Imperial College London, London, UK
| | - Shani Perera
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Carlos Talavera-López
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Würzburg Institute for Systems Immunology, Max Planck Research Group, Julius-Maximilian-Universität, Würzburg, Germany
| | - Ilaria Mulas
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Krishnaa T Mahbubani
- Department of Surgery, University of Cambridge, and Cambridge Biorepository for Translational Medicine, NIHR Cambridge Biomedical Centre, Cambridge, UK
| | - Liz Tuck
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Lu Wang
- Translational and Clinical Research Institute, Newcastle University, Newcastle upon Tyne, UK
| | - Margaret M Huang
- Department of Surgery, University of Cambridge, and Cambridge Biorepository for Translational Medicine, NIHR Cambridge Biomedical Centre, Cambridge, UK
| | - Martin Prete
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Sophie Pritchard
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - John Dark
- Translational and Clinical Research Institute, Newcastle University, Newcastle upon Tyne, UK
| | - Kourosh Saeb-Parsy
- Department of Surgery, University of Cambridge, and Cambridge Biorepository for Translational Medicine, NIHR Cambridge Biomedical Centre, Cambridge, UK
| | - Minal Patel
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Menna R Clatworthy
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Molecular Immunity Unit, Department of Medicine, University of Cambridge, MRC Laboratory of Molecular Biology, Cambridge, UK
| | - Norbert Hübner
- Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany
- Charité-Universitätsmedizin, Berlin, Germany
- German Centre for Cardiovascular Research (DZHK), Partner Site Berlin, Berlin, Germany
| | | | - Michela Noseda
- National Heart and Lung Institute, Imperial College London, London, UK.
| | - Sarah A Teichmann
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK.
- Department of Physics, Cavendish Laboratory, University of Cambridge, Cambridge, UK.
| |
Collapse
|
2
|
Zdrazil B. Fifteen years of ChEMBL and its role in cheminformatics and drug discovery. J Cheminform 2025; 17:32. [PMID: 40065463 PMCID: PMC11895189 DOI: 10.1186/s13321-025-00963-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2024] [Accepted: 01/20/2025] [Indexed: 03/14/2025] Open
Abstract
In October 2024 we celebrated the 15th anniversary of the first launch of ChEMBL, Europe's most impactful, open-access drug discovery database, hosted by EMBL's European Bioinformatics Institute (EMBL-EBI). This is a good moment to reflect on ChEMBL's history, the role that ChEMBL plays in Cheminformatics and Drug Discovery as well as innovations accelerated using data extracted from it. The review closes by discussing current challenges and possible directions that need to be taken to guarantee that ChEMBL continues to be the pioneering resource for highly curated, open bioactivity data on the European continent and beyond.
Collapse
Affiliation(s)
- Barbara Zdrazil
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB101SD, UK.
| |
Collapse
|
3
|
Thakur M, Brooksbank C, Finn R, Firth H, Foreman J, Freeberg M, Gurwitz K, Harrison M, Hulcoop D, Hunt S, R. Leach A, Levchenko M, Marques D, McDonagh E, Mithani A, Parkinson H, Perez-Riverol Y, Perova Z, Sarkans U, Tirunagari S, Tzampatzopoulou E, Venkatesan A, Vizcaino JA, Wingfield B, Zdrazil B, McEntyre J. EMBL's European Bioinformatics Institute (EMBL-EBI) in 2024. Nucleic Acids Res 2025; 53:D10-D19. [PMID: 39607697 PMCID: PMC11701561 DOI: 10.1093/nar/gkae1089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2024] [Revised: 10/16/2024] [Accepted: 10/28/2024] [Indexed: 11/29/2024] Open
Abstract
The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) is one of the world's leading sources of public biomolecular data. Based at the Wellcome Genome Campus in Hinxton, UK, EMBL-EBI is one of six sites of the European Molecular Biology Laboratory, Europe's only intergovernmental life sciences organization. This overview summarizes the latest developments in services that EMBL-EBI data resources provide to scientific communities globally (https://www.ebi.ac.uk/services).
Collapse
Affiliation(s)
- Matthew Thakur
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Catherine Brooksbank
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Helen V Firth
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Cambridge University Hospitals NHS Foundation Trust, East Anglian Medical Genetics Service, Hills Road, Cambridge, CB2 0QQ, UK
| | - Julia Foreman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Mallory Freeberg
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Kim T Gurwitz
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Melissa Harrison
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - David Hulcoop
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Open Targets, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Sarah E Hunt
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Andrew R. Leach
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Mariia Levchenko
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Diana Marques
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Ellen M McDonagh
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Open Targets, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Aziz Mithani
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Helen Parkinson
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Zinaida Perova
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Ugis Sarkans
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Santosh Tirunagari
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Eleni Tzampatzopoulou
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Aravind Venkatesan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Juan-Antonio Vizcaino
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Benjamin Wingfield
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Barbara Zdrazil
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Johanna McEntyre
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| |
Collapse
|
4
|
Morin L, Weber V, Meijer GI, Yu F, Staar PWJ. PatCID: an open-access dataset of chemical structures in patent documents. Nat Commun 2024; 15:6532. [PMID: 39095357 PMCID: PMC11297020 DOI: 10.1038/s41467-024-50779-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2024] [Accepted: 07/19/2024] [Indexed: 08/04/2024] Open
Abstract
The automatic analysis of patent publications has potential to accelerate research across various domains, including drug discovery and material science. Within patent documents, crucial information often resides in visual depictions of molecule structures. PatCID (Patent-extracted Chemical-structure Images database for Discovery) allows to access such information at scale. It enables users to search which molecules are displayed in which documents. PatCID contains 81M chemical-structure images and 14M unique chemical structures. Here, we compare PatCID with state-of-the-art chemical patent-databases. On a random set, PatCID retrieves 56.0% of molecules, which is higher than automatically-created databases, Google Patents (41.5%) and SureChEMBL (23.5%), as well as manually-created databases, Reaxys (53.5%) and SciFinder (49.5%). Leveraging state-of-the-art methods of document understanding, PatCID high-quality data outperforms currently available automatically-generated patent-databases. PatCID even competes with proprietary manually-created patent-databases. This enables promising applications for automatic literature review and learning-based molecular generation methods. The dataset is freely accessible for download.
Collapse
Affiliation(s)
- Lucas Morin
- IBM Research, Säumerstrasse 4, 8803, Rüschlikon, Switzerland.
- Department of Information Technology and Electrical Engineering, ETH Zürich, Sternwartstrasse 7, 8092, Zürich, Switzerland.
| | - Valéry Weber
- IBM Research, Säumerstrasse 4, 8803, Rüschlikon, Switzerland
| | | | - Fisher Yu
- Department of Information Technology and Electrical Engineering, ETH Zürich, Sternwartstrasse 7, 8092, Zürich, Switzerland
| | - Peter W J Staar
- IBM Research, Säumerstrasse 4, 8803, Rüschlikon, Switzerland.
| |
Collapse
|
5
|
Oprea TI, Bologa C, Holmes J, Mathias S, Metzger VT, Waller A, Yang JJ, Leach AR, Jensen LJ, Kelleher KJ, Sheils TK, Mathé E, Avram S, Edwards JS. Overview of the Knowledge Management Center for Illuminating the Druggable Genome. Drug Discov Today 2024; 29:103882. [PMID: 38218214 PMCID: PMC10939799 DOI: 10.1016/j.drudis.2024.103882] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2023] [Revised: 12/22/2023] [Accepted: 01/09/2024] [Indexed: 01/15/2024]
Abstract
The Knowledge Management Center (KMC) for the Illuminating the Druggable Genome (IDG) project aims to aggregate, update, and articulate protein-centric data knowledge for the entire human proteome, with emphasis on the understudied proteins from the three IDG protein families. KMC collates and analyzes data from over 70 resources to compile the Target Central Resource Database (TCRD), which is the web-based informatics platform (Pharos). These data include experimental, computational, and text-mined information on protein structures, compound interactions, and disease and phenotype associations. Based on this knowledge, proteins are classified into different Target Development Levels (TDLs) for identification of understudied targets. Additional work by the KMC focuses on enriching target knowledge and producing DrugCentral and other data visualization tools for expanding investigation of understudied targets.
Collapse
Affiliation(s)
- Tudor I Oprea
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico, Albuquerque, NM, USA
| | - Cristian Bologa
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico, Albuquerque, NM, USA
| | - Jayme Holmes
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico, Albuquerque, NM, USA
| | - Stephen Mathias
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico, Albuquerque, NM, USA
| | - Vincent T Metzger
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico, Albuquerque, NM, USA
| | - Anna Waller
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico, Albuquerque, NM, USA
| | - Jeremy J Yang
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico, Albuquerque, NM, USA
| | - Andrew R Leach
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
| | - Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Keith J Kelleher
- National Center for Advancing Translational Sciences (NCATS), NIH, Bethesda, MD, USA
| | - Timothy K Sheils
- National Center for Advancing Translational Sciences (NCATS), NIH, Bethesda, MD, USA
| | - Ewy Mathé
- National Center for Advancing Translational Sciences (NCATS), NIH, Bethesda, MD, USA
| | - Sorin Avram
- Coriolan Dragulescu Institute of Chemistry, Timisoara, Romania
| | - Jeremy S Edwards
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico, Albuquerque, NM, USA; Department of Chemistry and Chemical Biology, University of New Mexico, Albuquerque, NM, USA.
| |
Collapse
|
6
|
Zdrazil B, Felix E, Hunter F, Manners EJ, Blackshaw J, Corbett S, de Veij M, Ioannidis H, Lopez DM, Mosquera J, Magarinos M, Bosc N, Arcila R, Kizilören T, Gaulton A, Bento A, Adasme M, Monecke P, Landrum G, Leach A. The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res 2024; 52:D1180-D1192. [PMID: 37933841 PMCID: PMC10767899 DOI: 10.1093/nar/gkad1004] [Citation(s) in RCA: 266] [Impact Index Per Article: 266.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2023] [Revised: 10/09/2023] [Accepted: 10/23/2023] [Indexed: 11/08/2023] Open
Abstract
ChEMBL (https://www.ebi.ac.uk/chembl/) is a manually curated, high-quality, large-scale, open, FAIR and Global Core Biodata Resource of bioactive molecules with drug-like properties, previously described in the 2012, 2014, 2017 and 2019 Nucleic Acids Research Database Issues. Since its introduction in 2009, ChEMBL's content has changed dramatically in size and diversity of data types. Through incorporation of multiple new datasets from depositors since the 2019 update, ChEMBL now contains slightly more bioactivity data from deposited data vs data extracted from literature. In collaboration with the EUbOPEN consortium, chemical probe data is now regularly deposited into ChEMBL. Release 27 made curated data available for compounds screened for potential anti-SARS-CoV-2 activity from several large-scale drug repurposing screens. In addition, new patent bioactivity data have been added to the latest ChEMBL releases, and various new features have been incorporated, including a Natural Product likeness score, updated flags for Natural Products, a new flag for Chemical Probes, and the initial annotation of the action type for ∼270 000 bioactivity measurements.
Collapse
Affiliation(s)
- Barbara Zdrazil
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Eloy Felix
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Fiona Hunter
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Emma J Manners
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - James Blackshaw
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Sybilla Corbett
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Marleen de Veij
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Harris Ioannidis
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - David Mendez Lopez
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Juan F Mosquera
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Maria Paula Magarinos
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Nicolas Bosc
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Ricardo Arcila
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Tevfik Kizilören
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Anna Gaulton
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - A Patrícia Bento
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Melissa F Adasme
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Peter Monecke
- Sanofi, R&D, Preclinical Safety, Industriepark Höchst, 65926 Frankfurt am Main, Germany
| | - Gregory A Landrum
- Department of Chemistry and Applied Biosciences, ETH Zürich, 8093 Zürich, Switzerland
| | - Andrew R Leach
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| |
Collapse
|
7
|
Kosonocky CW, Wilke CO, Marcotte EM, Ellington AD. Mining Patents with Large Language Models Elucidates the Chemical Function Landscape. ARXIV 2023:arXiv:2309.08765v2. [PMID: 38196747 PMCID: PMC10775343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 01/11/2024]
Abstract
The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of orthogonal methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631K molecule-function pairs, was created using an LLM- and embedding-based method to obtain functional labels for approximately 100K molecules from their corresponding 188K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an orthogonal approach to traditional structure-based methods in the pursuit of designing novel functional molecules.
Collapse
|