1
Zhu TF, Qian R, Wei X, Lu AP, Cao DS. PatentNetML: A Novel Framework for Predicting Key Compounds in Patents Using Network Science and Machine Learning. J Med Chem 2024; 67:1347-1359. [PMID: 38181431 DOI: 10.1021/acs.jmedchem.3c01893] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/07/2024]
Abstract
Patents play a crucial role in drug research and development, providing early access to unpublished data and offering unique insights. Identifying key compounds in patents is essential to finding novel lead compounds. This study collected a comprehensive data set comprising 1555 patents, encompassing 1000 key compounds, to explore innovative approaches for predicting these key compounds. Our novel PatentNetML framework integrated network science and machine learning algorithms, combining network measures, ADMET properties, and physicochemical properties, to construct robust classification models to identify key compounds. Through a model interpretation and an analysis of three compelling case studies, we showcase the potential of PatentNetML in unveiling hidden patterns and connections within diverse patents. While our framework is pioneering, we acknowledge its limitations when applied to patents that deviate from the assumed central pattern. This work serves as a promising foundation for future research endeavors aimed at efficiently identifying promising drug candidates and expediting drug discovery in the pharmaceutical industry.
Affiliation(s)
- Ting-Fei Zhu
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410003, Hunan, China
  - School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR 999077, China
- Rong Qian
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410003, Hunan, China
  - School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR 999077, China
- Xiao Wei
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410003, Hunan, China
- Ai-Ping Lu
  - School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR 999077, China
  - Guangdong-Hong Kong-Macau Joint Lab on Chinese Medicine and Immune Disease Research, Guangzhou 510000, China
- Dong-Sheng Cao
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410003, Hunan, China
  - School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR 999077, China
2
Machi K, Akiyama S, Nagata Y, Yoshioka M. OSPAR: A Corpus for Extraction of Organic Synthesis Procedures with Argument Roles. J Chem Inf Model 2023; 63:6619-6628. [PMID: 37859303 PMCID: PMC10647022 DOI: 10.1021/acs.jcim.3c01449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2023] [Revised: 10/05/2023] [Accepted: 10/06/2023] [Indexed: 10/21/2023]
Abstract
There is a pressing need for the automated extraction of chemical reaction information because of the rapid growth of scientific documents. The previously reported works in the literature for the procedure extraction either (a) did not consider the semantic relations between the action and argument or (b) defined a detailed schema for the extraction. The former method was insufficient for reproducing the reaction, while the latter methods were too specific to their own schema and did not consider the general semantic relation between the verb and argument. In addition, they did not provide an annotated text that aligned with the structured procedure. Along these lines, in this work, we propose a corpus named organic synthesis procedures with argument roles (OSPAR) that is annotated with rolesets to consider the semantic relation between the verb and argument. We also provide rolesets for chemical reactions, especially for organic synthesis, which represent the argument roles of actions in the corpus. More specifically, we annotated 112 organic synthesis procedures in journal articles from Organic Syntheses and defined 19 new rolesets in addition to 29 rolesets from an existing language resource (Proposition Bank). After that, we constructed a simple deep learning system trained on OSPAR and discussed the usefulness of the corpus by comparing it with chemical description language (XDL) generated by a natural language processing tool, namely, SynthReader. While our system's output required more detailed parsing, it covered comparable information against XDL. Moreover, we confirmed that the validation of the output action sequence was easy as it was aligned with the original text.
Affiliation(s)
- Kojiro Machi
  - Graduate School of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan
- Seiji Akiyama
  - Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21, Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0021, Japan
- Yuuya Nagata
  - Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21, Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0021, Japan
- Masaharu Yoshioka
  - Graduate School of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan
  - Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21, Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0021, Japan
  - Faculty of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan
3
Vaškevičius M, Kapočiūtė-Dzikienė J, Vaškevičius A, Šlepikas L. Deep learning-based automatic action extraction from structured chemical synthesis procedures. PeerJ Comput Sci 2023; 9:e1511. [PMID: 37705639 PMCID: PMC10495970 DOI: 10.7717/peerj-cs.1511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2023] [Accepted: 07/07/2023] [Indexed: 09/15/2023]
Abstract
This article proposes a methodology that uses machine learning algorithms to extract actions from structured chemical synthesis procedures, thereby bridging the gap between chemistry and natural language processing. The proposed pipeline combines ML algorithms and scripts to extract relevant data from USPTO and EPO patents, which helps transform experimental procedures into structured actions. This pipeline includes two primary tasks: classifying patent paragraphs to select chemical procedures and converting chemical procedure sentences into a structured, simplified format. We employ artificial neural networks such as long short-term memory, bidirectional LSTMs, transformers, and fine-tuned T5. Our results show that the bidirectional LSTM classifier achieved the highest accuracy of 0.939 in the first task, while the Transformer model attained the highest BLEU score of 0.951 in the second task. The developed pipeline enables the creation of a dataset of chemical reactions and their procedures in a structured format, facilitating the application of AI-based approaches to streamline synthetic pathways, predict reaction outcomes, and optimize experimental conditions. Furthermore, the developed pipeline allows for creating a structured dataset of chemical reactions and procedures, making it easier for researchers to access and utilize the valuable information in synthesis procedures.
Affiliation(s)
- Mantas Vaškevičius
  - Department of Applied Informatics, Vytautas Magnus University, Kaunas, Lithuania
  - JSC Synhet, Kaunas, Lithuania
- Arnas Vaškevičius
  - Faculty of Mechanical Engineering and Design, Kaunas University of Technology, Kaunas, Lithuania
4
Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B, Zaslavsky L, Zhang J, Bolton EE. PubChem 2023 update. Nucleic Acids Res 2022; 51:D1373-D1380. [PMID: 36305812 PMCID: PMC9825602 DOI: 10.1093/nar/gkac956] [Citation(s) in RCA: 458] [Impact Index Per Article: 229.0] [Received: 09/15/2022] [Revised: 10/06/2022] [Accepted: 10/13/2022] [Indexed: 01/30/2023] Open
Abstract
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical information resource that serves a wide range of use cases. In the past two years, a number of changes were made to PubChem. Data from more than 120 data sources was added to PubChem. Some major highlights include: the integration of Google Patents data into PubChem, which greatly expanded the coverage of the PubChem Patent data collection; the creation of the Cell Line and Taxonomy data collections, which provide quick and easy access to chemical information for a given cell line and taxon, respectively; and the update of the bioassay data model. In addition, new functionalities were added to the PubChem programmatic access protocols, PUG-REST and PUG-View, including support for target-centric data download for a given protein, gene, pathway, cell line, and taxon and the addition of the 'standardize' option to PUG-REST, which returns the standardized form of an input chemical structure. A significant update was also made to PubChemRDF. The present paper provides an overview of these changes.
Affiliation(s)
- Sunghwan Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
- Jie Chen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
- Tiejun Cheng
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
- Asta Gindulyte
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
- Jia He
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
- Siqian He
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
- Qingliang Li
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
- Benjamin A Shoemaker
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
- Paul A Thiessen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
- Bo Yu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
- Leonid Zaslavsky
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
- Jian Zhang
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
- Evan E Bolton
  - To whom correspondence should be addressed. Tel: +1 301 451 1811; Fax: +1 301 480 4559;
5
Wang J, Shen Z, Liao Y, Yuan Z, Li S, He G, Lan M, Qian X, Zhang K, Li H. Multi-modal chemical information reconstruction from images and texts for exploring the near-drug space. Brief Bioinform 2022; 23:6761958. [PMID: 36252922 PMCID: PMC9677486 DOI: 10.1093/bib/bbac461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2022] [Revised: 09/21/2022] [Accepted: 09/26/2022] [Indexed: 12/14/2022] Open
Abstract
Identification of new chemical compounds with desired structural diversity and biological properties plays an essential role in drug discovery, yet the construction of such a potential space with elements of 'near-drug' properties is still a challenging task. In this work, we proposed a multimodal chemical information reconstruction system to automatically process, extract and align heterogeneous information from the text descriptions and structural images of chemical patents. Our key innovation lies in a heterogeneous data generator that produces cross-modality training data in the form of text descriptions and Markush structure images, from which a two-branch model with image- and text-processing units can then learn to both recognize heterogeneous chemical entities and simultaneously capture their correspondence. In particular, we have collected chemical structures from ChEMBL database and chemical patents from the European Patent Office and the US Patent and Trademark Office using keywords 'A61P, compound, structure' in the years from 2010 to 2020, and generated heterogeneous chemical information datasets with 210K structural images and 7818 annotated text snippets. Based on the reconstructed results and substituent replacement rules, structural libraries of a huge number of near-drug compounds can be generated automatically. In quantitative evaluations, our model can correctly reconstruct 97% of the molecular images into structured format and achieve an F1-score around 97-98% in the recognition of chemical entities, which demonstrated the effectiveness of our model in automatic information extraction from chemical patents, and hopefully transforming them to a user-friendly, structured molecular database enriching the near-drug space to realize the intelligent retrieval technology of chemical knowledge.
Affiliation(s)
- Yichen Liao
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China
- Zhen Yuan
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China
- Shiliang Li
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China
- Gaoqi He
  - School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
- Man Lan
  - School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
- Xuhong Qian
  - Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
- Kai Zhang
  - Corresponding author. School of Computer Science and Technology, Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
- Honglin Li
  - Corresponding author. Shanghai Key Laboratory of New Drug Design, East China University of Science & Technology, Shanghai 200237, China
  - Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
6
Old drugs, new tricks: leveraging known compounds to disrupt coronavirus-induced cytokine storm. NPJ Syst Biol Appl 2022; 8:38. [PMID: 36216820 PMCID: PMC9549818 DOI: 10.1038/s41540-022-00250-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2022] [Accepted: 09/27/2022] [Indexed: 11/11/2022] Open
Abstract
A major complication in COVID-19 infection consists in the onset of acute respiratory distress fueled by a dysregulation of the host immune network that leads to a run-away cytokine storm. Here, we present an in silico approach that captures the host immune system’s complex regulatory dynamics, allowing us to identify and rank candidate drugs and drug pairs that engage with minimal subsets of immune mediators such that their downstream interactions effectively disrupt the signaling cascades driving cytokine storm. Drug–target regulatory interactions are extracted from peer-reviewed literature using automated text-mining for over 5000 compounds associated with COVID-induced cytokine storm and elements of the underlying biology. The targets and mode of action of each compound, as well as combinations of compounds, were scored against their functional alignment with sets of competing model-predicted optimal intervention strategies, as well as the availability of like-acting compounds and known off-target effects. Top-ranking individual compounds identified included a number of known immune suppressors such as calcineurin and mTOR inhibitors as well as compounds less frequently associated for their immune-modulatory effects, including antimicrobials, statins, and cholinergic agonists. Pairwise combinations of drugs targeting distinct biological pathways tended to perform significantly better than single drugs with dexamethasone emerging as a frequent high-ranking companion. While these predicted drug combinations aim to disrupt COVID-induced acute respiratory distress syndrome, the approach itself can be applied more broadly to other diseases and may provide a standard tool for drug discovery initiatives in evaluating alternative targets and repurposing approved drugs.
7
Wang J, Ren Y, Zhang Z, Xu H, Zhang Y. From Tokenization to Self-Supervision: Building a High-Performance Information Extraction System for Chemical Reactions in Patents. Front Res Metr Anal 2022; 6:691105. [PMID: 35005421 PMCID: PMC8727901 DOI: 10.3389/frma.2021.691105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2021] [Accepted: 11/02/2021] [Indexed: 11/28/2022] Open
Abstract
Chemical reactions and experimental conditions are fundamental information for chemical research and pharmaceutical applications. However, the latest information of chemical reactions is usually embedded in the free text of patents. The rapidly accumulating chemical patents urge automatic tools based on natural language processing (NLP) techniques for efficient and accurate information extraction. This work describes the participation of the Melax Tech team in the CLEF 2020—ChEMU Task of Chemical Reaction Extraction from Patent. The task consisted of two subtasks: (1) named entity recognition to identify compounds and different semantic roles in the chemical reaction and (2) event extraction to identify event triggers of chemical reaction and their relations with the semantic roles recognized in subtask 1. To build an end-to-end system with high performance, multiple strategies tailored to chemical patents were applied and evaluated, ranging from optimizing the tokenization, pre-training patent language models based on self-supervision, to domain knowledge-based rules. Our hybrid approaches combining different strategies achieved state-of-the-art results in both subtasks, with the top-ranked F1 of 0.957 for entity recognition and the top-ranked F1 of 0.9536 for event extraction, indicating that the proposed approaches are promising.
Affiliation(s)
- Jingqi Wang
  - Melax Technologies, Inc., Houston, TX, United States
- Yuankai Ren
  - School of Medicine, Nantong University, Nantong, China
- Zhi Zhang
  - School of Medicine, Nantong University, Nantong, China
- Hua Xu
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, United States
- Yaoyun Zhang
  - Melax Technologies, Inc., Houston, TX, United States
8
Zhai Z, Druckenbrodt C, Thorne C, Akhondi SA, Nguyen DQ, Cohn T, Verspoor K. ChemTables: a dataset for semantic classification on tables in chemical patents. J Cheminform 2021; 13:97. [PMID: 34895295 PMCID: PMC8665561 DOI: 10.1186/s13321-021-00568-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/07/2021] [Accepted: 11/06/2021] [Indexed: 11/10/2022] Open
Abstract
Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called CHEMTABLES, which consists of 788 chemical patent tables with labels of their content type. We introduce this data set in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT on CHEMTABLES. The best performing model, Table-BERT, achieves a performance of 88.66 micro-averaged F1 score on the table classification task. The CHEMTABLES dataset is publicly available at https://doi.org/10.17632/g7tjh7tbrj.3, subject to the CC BY NC 3.0 license. Code/models evaluated in this work are in a GitHub repository: https://github.com/zenanz/ChemTables.
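For readers unfamiliar with the metric, here is a minimal sketch of micro-averaging, assuming the score reported in this abstract is a micro-averaged F1. It is not the authors' evaluation code: the function name `micro_f1` and the toy table-type labels are hypothetical and chosen only for illustration.

```python
from collections import Counter

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1 once."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    if total_tp == 0:
        return 0.0
    precision = total_tp / (total_tp + total_fp)
    recall = total_tp / (total_tp + total_fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold and predicted labels (not taken from the dataset).
gold = ["spectroscopic", "physical", "pharmacological", "physical"]
pred = ["spectroscopic", "physical", "physical", "physical"]
score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.4f}")
```

When every table receives exactly one label, each error counts once as a false positive and once as a false negative, so micro-averaged F1 coincides with plain accuracy.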
Affiliation(s)
- Zenan Zhai
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Camilo Thorne
  - Elsevier-Data Science, Life Science, Amsterdam, The Netherlands
- Dat Quoc Nguyen
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
  - VinAI Research, Hanoi, Vietnam
- Trevor Cohn
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
  - Present Address: School of Computing Technologies, RMIT University, Melbourne, Australia
9
10
Ohms J. Current methodologies for chemical compound searching in patents: A case study. World Patent Information 2021. [DOI: 10.1016/j.wpi.2021.102055] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/20/2022]
11
Congenericity of Claimed Compounds in Patent Applications. Molecules 2021; 26:molecules26175253. [PMID: 34500686 PMCID: PMC8433967 DOI: 10.3390/molecules26175253] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/23/2021] [Revised: 08/17/2021] [Accepted: 08/18/2021] [Indexed: 12/04/2022] Open
Abstract
A method is presented to analyze quantitatively the degree of congenericity of claimed compounds in patent applications. The approach successfully differentiates patents exemplified with highly congeneric compounds of a structurally compact and well defined chemical series from patents containing a more diverse set of compounds around a more vaguely described patent claim. An application to 750 common patents available in SureChEMBL, SureChEMBLccs and ChEMBL is presented and the congenericity of patent compounds in those different sources discussed.
12
Falaguera MJ, Mestres J. Identification of the Core Chemical Structure in SureChEMBL Patents. J Chem Inf Model 2021; 61:2241-2247. [PMID: 33929850 DOI: 10.1021/acs.jcim.1c00151] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 11/28/2022]
Abstract
The SureChEMBL database provides open access to 17 million chemical entities mentioned in 14 million patents published since 1970. However, alongside with molecules covered by patent claims, the database is full of starting materials and intermediate products of little pharmacological relevance. Herein, we introduce a new filtering protocol to automatically select the core chemical structures best representing a congeneric series of pharmacologically relevant molecules in patents. The protocol is first validated against a selection of 890 SureChEMBL patents for which a total of 51,738 manually curated molecules are deposited in ChEMBL. Our protocol was able to select 92.5% of the molecules in ChEMBL from all 270,968 molecules in SureChEMBL for those patents. Subsequently, the protocol was applied to all 240,988 US pharmacological patents for which 9,111,706 molecules are available in SureChEMBL. The unsupervised filtering process selected 5,949,214 molecules (65.3% of the total number of molecules) that form highly congeneric chemical series in 188,795 of those patents (78.3% of the total number of patents). A SureChEMBL version enriched with molecules of pharmacological relevance is available for download at https://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBLccs.
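The two percentages quoted in this abstract follow directly from the counts it gives. The short sketch below reproduces them using only figures stated above; the helper `percent` and the variable names are illustrative, not part of the authors' protocol.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Counts as stated in the abstract.
molecules_selected = 5_949_214
molecules_total = 9_111_706
patents_selected = 188_795
patents_total = 240_988

molecule_share = percent(molecules_selected, molecules_total)
patent_share = percent(patents_selected, patents_total)

print(f"molecules retained: {molecule_share}%")  # 65.3%
print(f"patents covered:    {patent_share}%")    # 78.3%
```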
Affiliation(s)
- Maria J Falaguera
  - Research Group on Systems Pharmacology, Research Program on Biomedical Informatics (GRIB), IMIM Hospital del Mar Medical Research Institute and University Pompeu Fabra, Parc de Recerca Biomèdica (PRBB), Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain
- Jordi Mestres
  - Research Group on Systems Pharmacology, Research Program on Biomedical Informatics (GRIB), IMIM Hospital del Mar Medical Research Institute and University Pompeu Fabra, Parc de Recerca Biomèdica (PRBB), Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain
13
Milman BL, Zhurkovich IK. Statistics of the Popularity of Chemical Compounds in Relation to the Non-Target Analysis. Molecules 2021; 26:molecules26082394. [PMID: 33924131 PMCID: PMC8074313 DOI: 10.3390/molecules26082394] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/06/2021] [Revised: 04/14/2021] [Accepted: 04/18/2021] [Indexed: 11/25/2022] Open
Abstract
The idea of popularity/abundance of chemical compounds is widely used in non-target chemical analysis involving environmental studies. To have a clear quantitative basis for this idea, frequency distributions of chemical compounds over indicators of their popularity/abundance are obtained and discussed. Popularity indicators are the number of information sources, the number of chemical vendors, counts of data records, and other variables assessed from two large databases, namely ChemSpider and PubChem. Distributions are approximated by power functions, special cases of Zipf distributions, which are characteristic of the results of human/social activity. A relatively small group of the most popular compounds has been denoted, conventionally accounting for a few percent (several million) of compounds. These compounds are most often explored in scientific research and are practically used. Accordingly, popular compounds have been taken into account as first analyte candidates for identification in non-target analysis.
Affiliation(s)
- Boris L. Milman
  - Institute of Experimental Medicine, Ul. Akad. Pavlova 12, 197376 Saint Petersburg, Russia
  - Correspondence: Tel.: +7-921-766-5296
- Inna K. Zhurkovich
  - Institute of Toxicology, Ul. Bekhtereva 1, 192019 Saint Petersburg, Russia
14
Islamaj R, Leaman R, Kim S, Kwon D, Wei CH, Comeau DC, Peng Y, Cissel D, Coss C, Fisher C, Guzman R, Kochar PG, Koppel S, Trinh D, Sekiya K, Ward J, Whitman D, Schmidt S, Lu Z. NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature. Sci Data 2021; 8:91. [PMID: 33767203 PMCID: PMC7994842 DOI: 10.1038/s41597-021-00875-1] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 11/26/2020] [Accepted: 01/19/2021] [Indexed: 11/13/2022] Open
Abstract
Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects and interactions with diseases, genes and other chemicals. We therefore present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations, mapped to ~2000 MeSH identifiers. We also describe a substantially improved chemical entity tagger, with automated annotations for all of PubMed and PMC freely accessible through the PubTator web-based interface and API. The NLM-Chem corpus is freely available.
Affiliation(s)
- Rezarta Islamaj
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Robert Leaman
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Sun Kim
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Dongseop Kwon
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Chih-Hsuan Wei
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Donald C Comeau
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Yifan Peng
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- David Cissel
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Cathleen Coss
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Carol Fisher
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Rob Guzman
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Preeti Gokal Kochar
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Stella Koppel
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Dorothy Trinh
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Keiko Sekiya
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Janice Ward
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Deborah Whitman
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Susan Schmidt
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Zhiyong Lu
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
|
15
|
He J, Nguyen DQ, Akhondi SA, Druckenbrodt C, Thorne C, Hoessel R, Afzal Z, Zhai Z, Fang B, Yoshikawa H, Albahem A, Cavedon L, Cohn T, Baldwin T, Verspoor K. ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents. Front Res Metr Anal 2021; 6:654438. [PMID: 33870071 PMCID: PMC8028406 DOI: 10.3389/frma.2021.654438] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 01/16/2021] [Accepted: 02/24/2021] [Indexed: 11/21/2022] Open
Abstract
Chemical patents represent a valuable source of information about new chemical compounds, which is critical to the drug discovery process. Automated information extraction over chemical patents is, however, a challenging task due to the large volume of existing patents and the complex linguistic properties of chemical patents. The Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF2020), was introduced to support the development of advanced text mining techniques for chemical patents. The ChEMU 2020 lab proposed two fundamental information extraction tasks focusing on chemical reaction processes described in chemical patents: (1) chemical named entity recognition, requiring identification of essential chemical entities and their roles in chemical reactions, as well as reaction conditions; and (2) event extraction, which aims at identification of event steps relating the entities involved in chemical reactions. The ChEMU 2020 lab received 37 team registrations and 46 runs. Overall, the performance of submissions for these tasks exceeded our expectations, with the top systems outperforming strong baselines. We further show the methods to be robust to variations in sampling of the test data. We provide a detailed overview of the ChEMU 2020 corpus and its annotation, showing that inter-annotator agreement is very strong. We also present the methods adopted by participants, provide a detailed analysis of their performance, and carefully consider the potential impact of data leakage on interpretation of the results. The ChEMU 2020 Lab has shown the viability of automated methods to support information extraction of key information in chemical patents.
Affiliation(s)
- Jiayuan He
- The University of Melbourne, Parkville, VIC, Australia; RMIT University, Melbourne, VIC, Australia
- Dat Quoc Nguyen
- The University of Melbourne, Parkville, VIC, Australia; VinAI Research, Hanoi, Vietnam
- Camilo Thorne
- Elsevier Information Systems GmbH, Frankfurt, Germany
- Ralph Hoessel
- Elsevier Information Systems GmbH, Frankfurt, Germany
- Zenan Zhai
- The University of Melbourne, Parkville, VIC, Australia
- Biaoyan Fang
- The University of Melbourne, Parkville, VIC, Australia
- Hiyori Yoshikawa
- The University of Melbourne, Parkville, VIC, Australia; Fujitsu Laboratories Ltd., Tokyo, Japan
- Ameer Albahem
- The University of Melbourne, Parkville, VIC, Australia; RMIT University, Melbourne, VIC, Australia
- Trevor Cohn
- The University of Melbourne, Parkville, VIC, Australia
- Karin Verspoor
- The University of Melbourne, Parkville, VIC, Australia; RMIT University, Melbourne, VIC, Australia
|
16
|
Drug repurposing patent documents vs peer review: patent information comes more than 600 days earlier on average. Future Drug Discovery 2020. [DOI: 10.4155/fdd-2020-0001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/04/2023] Open
|
17
|
Jose JM, Yilmaz E, Magalhães J, Castells P, Ferro N, Silva MJ, Martins F, Akhondi SA, Cohn T, Baldwin T, Verspoor K. ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. Advances in Information Retrieval 2020; 12036. [PMCID: PMC7148043 DOI: 10.1007/978-3-030-45442-5_74] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 12/23/2022]
Abstract
We introduce a new evaluation lab named ChEMU (Cheminformatics Elsevier Melbourne University), part of the 11th Conference and Labs of the Evaluation Forum (CLEF-2020). ChEMU involves two key information extraction tasks over chemical reactions from patents. Task 1—Named entity recognition—involves identifying chemical compounds as well as their types in context, i.e., to assign the label of a chemical compound according to the role which the compound plays within a chemical reaction. Task 2—Event extraction over chemical reactions—involves event trigger detection and argument recognition. We briefly present the motivations and goals of the ChEMU tasks, as well as resources and evaluation methodology.
|