1
Li DZ, Xu X, Pan JH, Gao W, Zhang SR. Image2InChI: Automated Molecular Optical Image Recognition. J Chem Inf Model 2024; 64:3640-3649. [PMID: 38359459 DOI: 10.1021/acs.jcim.3c02082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/17/2024]
Abstract
The accurate identification and analysis of chemical structures in molecular images are prerequisites of artificial intelligence for drug discovery. It is important to efficiently and automatically convert molecular images into machine-readable representations. Therefore, in this paper, we propose an automated molecular optical image recognition model based on deep learning, called Image2InChI. Additionally, the proposed Image2InChI introduces a novel feature fusion network with attention to integrate image patch and InChI prediction. The improved SwinTransformer as an encoder and the Transformer Decoder as a decoder with patch embedding are applied to predict the image features for the corresponding InChI. The experimental results showed that the Image2InChI model achieves an accuracy of InChI (InChI acc) of 99.8%, a Morgan FP of 94.1%, an accuracy of maximum common structures (MCS acc) of 94.8%, and an accuracy of longest common subsequence (LCS acc) of 96.2%. The experiments demonstrated that the proposed Image2InChI model improves the accuracy and efficiency of molecular image recognition and provided a valuable reference about optical chemical structure recognition for InChI.
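The abstract reports a longest-common-subsequence accuracy but does not define it. As a rough sketch only, one plausible reading is the LCS length between the predicted and reference InChI strings, normalized by the longer string. The normalization and the function names `lcs_length` and `lcs_score` are assumptions for illustration, not the authors' definition.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(predicted: str, reference: str) -> float:
    """LCS length divided by the longer string's length (assumed normalization)."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0
    return lcs_length(predicted, reference) / longest


if __name__ == "__main__":
    # Ethanol (reference) versus methanol (hypothetical wrong prediction).
    print(lcs_score("InChI=1S/CH4O/c1-2/h2H,1H3", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

A score of 1.0 means the two strings are identical; averaging the score over a test set would give one dataset-level figure under this assumed definition.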
Affiliation(s)
- Da-Zhou Li
- College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang 110000, China
- Xin Xu
- College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang 110000, China
- Jia-Heng Pan
- College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang 110000, China
- Wei Gao
- College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang 110000, China
- Shi-Rui Zhang
- College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang 110000, China
2
Tang B, Niu Z, Wang X, Huang J, Ma C, Peng J, Jiang Y, Ge R, Hu H, Lin L, Yang G. Automated molecular structure segmentation from documents using ChemSAM. J Cheminform 2024; 16:29. [PMID: 38475916 DOI: 10.1186/s13321-024-00823-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 03/03/2024] [Indexed: 03/14/2024] [Open Access]
Abstract
Chemical structure segmentation constitutes a pivotal task in cheminformatics, involving the extraction and abstraction of structural information of chemical compounds from text-based sources, including patents and scientific articles. This study introduces a deep learning approach to chemical structure segmentation, employing a Vision Transformer (ViT) to discern the structural patterns of chemical compounds from their graphical representations. The Chemistry-Segment Anything Model (ChemSAM) achieves state-of-the-art results on publicly available benchmark datasets and real-world tasks, underscoring its effectiveness in accurately segmenting chemical structures from text-based sources. Moreover, this deep learning-based approach obviates the need for handcrafted features and demonstrates robustness against variations in image quality and style. During the detection phase, a ViT-based encoder-decoder model is used to identify and locate chemical structure depictions on the input page. This model generates masks to ascertain whether each pixel belongs to a chemical structure, thereby offering a pixel-level classification and indicating the presence or absence of chemical structures at each position. Subsequently, the generated masks are clustered based on their connectivity, and each mask cluster is updated to encapsulate a single structure in the post-processing workflow. This two-step process facilitates the effective automatic extraction of chemical structure depictions from documents. By utilizing the deep learning approach described herein, it is demonstrated that effective performance on low-resolution and densely arranged molecular structural layouts in journal articles and patents is achievable.
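The post-processing step described above (clustering mask pixels by connectivity so that each cluster covers one structure) can be illustrated with a small flood-fill sketch. This is not the ChemSAM code: the 4-connectivity rule, the list-of-lists mask format, and the function name `mask_to_boxes` are assumptions chosen to keep the example dependency-free.

```python
from collections import deque


def mask_to_boxes(mask):
    """Group foreground pixels (1s) into 4-connected clusters and return
    one bounding box (top, left, bottom, right) per cluster."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


if __name__ == "__main__":
    page_mask = [
        [1, 1, 0, 0],
        [1, 0, 0, 1],
        [0, 0, 0, 1],
    ]
    print(mask_to_boxes(page_mask))
```

Each returned box could then be used to crop one structure depiction from the page image before passing it to a recognition model.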
Affiliation(s)
- Bowen Tang
- College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Hangzhou Institute of Advanced Technology, Hangzhou, 310000, China
- Zhangming Niu
- MindRank AI Ltd., Hangzhou, 310000, China
- National Heart and Lung Institute, Imperial College London, London, SW7 2AZ, UK
- Chao Ma
- MindRank AI Ltd., Hangzhou, 310000, China
- Jing Peng
- Hunan University of Medicine, Huaihua, 4180001, Hunan, China
- Ruiquan Ge
- Hangzhou Dianzi University, Hangzhou, 310000, China
- Hongyu Hu
- Xingzhi College, Zhejiang Normal University, Lanxi, China
- Luhao Lin
- Department of Pharmacy, The 910th Hospital of the Joint Logistics Support Force of the Chinese PLA, Quanzhou, 362000, Fujian, China
- Guang Yang
- Bioengineering Department and Imperial-X, Imperial College London, London, W12 7SL, UK
- National Heart and Lung Institute, Imperial College London, London, SW7 2AZ, UK
- School of Biomedical Engineering & Imaging Sciences, King's College London, London, WC2R 2LS, UK
3
Wilary D, Cole JM. ReactionDataExtractor 2.0: A Deep Learning Approach for Data Extraction from Chemical Reaction Schemes. J Chem Inf Model 2023; 63:6053-6067. [PMID: 37729111 PMCID: PMC10565829 DOI: 10.1021/acs.jcim.3c00422] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/16/2023] [Indexed: 09/22/2023]
Abstract
Knowledge in the chemical domain is often disseminated graphically via chemical reaction schemes. The task of describing chemical transformations is greatly simplified by introducing reaction schemes that are composed of chemical diagrams and symbols. While intuitively understood by any chemist, like most graphical representations, such drawings are not easily understood by machines; this poses a challenge in the context of data extraction. Currently available tools are limited in their scope of extraction and require manual preprocessing, thus slowing down the speed of data extraction. We present a new tool, ReactionDataExtractor v2.0, which uses a combination of neural networks and symbolic artificial intelligence to effectively remove this barrier. We have evaluated our tool on a test set composed of reaction schemes that were taken from open-source journal articles and realized F1 score metrics between 75 and 96%. These evaluation metrics can be further improved by tuning our object-detection models to a specific chemical subdomain thanks to a data-driven approach that we have adopted with synthetically generated data. The system architecture of our tool is modular, which allows it to balance speed and accuracy to afford an autonomous, high-throughput solution for image-based chemical data extraction.
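For readers unfamiliar with the metric quoted above, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the counts in the usage line are made up for illustration and are not taken from the paper's evaluation.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted = true_pos + false_pos   # everything the tool extracted
    actual = true_pos + false_neg      # everything it should have extracted
    if predicted == 0 or actual == 0 or true_pos == 0:
        return 0.0
    precision = true_pos / predicted
    recall = true_pos / actual
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 8 correct extractions, 2 spurious, 2 missed.
    print(round(f1_score(8, 2, 2), 3))
```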
Affiliation(s)
- Damian M. Wilary
- Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
- Jacqueline M. Cole
- Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
- ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
4
Wang Y, Zhang R, Zhang S, Guo L, Zhou Q, Zhao B, Mo X, Yang Q, Huang Y, Li K, Fan Y, Huang L, Zhou F. OCMR: A comprehensive framework for optical chemical molecular recognition. Comput Biol Med 2023; 163:107187. [PMID: 37393787 DOI: 10.1016/j.compbiomed.2023.107187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Revised: 06/10/2023] [Accepted: 06/19/2023] [Indexed: 07/04/2023]
Abstract
Artificial intelligence (AI) has achieved significant progress in the field of drug discovery. AI-based tools have been used in all aspects of drug discovery, including chemical structure recognition. We propose a chemical structure recognition framework, Optical Chemical Molecular Recognition (OCMR), to improve the data extraction capability in practical scenarios compared with the rule-based and end-to-end deep learning models. The proposed OCMR framework enhances the recognition performances via the integration of local information in the topology of molecular graphs. OCMR handles complex tasks like non-canonical drawing and atomic group abbreviation and substantially improves the current state-of-the-art results on multiple public benchmark datasets and one internally curated dataset.
Affiliation(s)
- Yan Wang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; School of Artificial Intelligence, Jilin University, Changchun, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Ruochi Zhang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; School of Artificial Intelligence, Jilin University, Changchun, 130012, China
- Shengde Zhang
- Machine Learning Department, Silexon AI Technology Co, Ltd, Beijing, 100084, China
- Liming Guo
- Machine Learning Department, Silexon AI Technology Co, Ltd, Beijing, 100084, China
- Qiong Zhou
- School of Computer Science and Technology, Changchun University of Science and Technology, Changchun, 130012, China
- Bowen Zhao
- Machine Learning Department, Silexon AI Technology Co, Ltd, Beijing, 100084, China
- Xiaotong Mo
- Machine Learning Department, Silexon AI Technology Co, Ltd, Beijing, 100084, China
- Qian Yang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; School of Artificial Intelligence, Jilin University, Changchun, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Yajuan Huang
- Machine Learning Department, Silexon AI Technology Co, Ltd, Beijing, 100084, China
- Kewei Li
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Yusi Fan
- College of Software, Jilin University, Changchun, Jilin, 130012, China
- Lan Huang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Fengfeng Zhou
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; School of Artificial Intelligence, Jilin University, Changchun, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
5
Qian Y, Guo J, Tu Z, Li Z, Coley CW, Barzilay R. MolScribe: Robust Molecular Structure Recognition with Image-to-Graph Generation. J Chem Inf Model 2023; 63:1925-1934. [PMID: 36971363 DOI: 10.1021/acs.jcim.2c01480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/29/2023]
Abstract
Molecular structure recognition is the task of translating a molecular image into its graph structure. Significant variation in drawing styles and conventions exhibited in chemical literature poses a significant challenge for automating this task. In this paper, we propose MolScribe, a novel image-to-graph generation model that explicitly predicts atoms and bonds, along with their geometric layouts, to construct the molecular structure. Our model flexibly incorporates symbolic chemistry constraints to recognize chirality and expand abbreviated structures. We further develop data augmentation strategies to enhance the model robustness against domain shifts. In experiments on both synthetic and realistic molecular images, MolScribe significantly outperforms previous models, achieving 76-93% accuracy on public benchmarks. Chemists can also easily verify MolScribe's prediction, informed by its confidence estimation and atom-level alignment with the input image. MolScribe is publicly available through Python and web interfaces: https://github.com/thomas0809/MolScribe.
Affiliation(s)
- Yujie Qian
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Jiang Guo
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Zhengkai Tu
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Zhening Li
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
6
Musazade F, Jamalova N, Hasanov J. Review of techniques and models used in optical chemical structure recognition in images and scanned documents. J Cheminform 2022; 14:61. [PMID: 36076301 PMCID: PMC9461257 DOI: 10.1186/s13321-022-00642-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2022] [Accepted: 08/20/2022] [Indexed: 11/10/2022] [Open Access]
Abstract
Extraction of chemical formulas from images was not in the top priority of Computer Vision tasks for a while. The complexity both on the input and prediction sides has made this task challenging for the conventional Artificial Intelligence and Machine Learning problems. A binary input image which might seem trivial for convolutional analysis was not easy to classify, since the provided sample was not representative of the given molecule: to describe the same formula, a variety of graphical representations which do not resemble each other can be used. Considering the variety of molecules, the problem shifted from classification to that of formula generation, which makes Natural Language Processing (NLP) a good candidate for an effective solution. This paper describes the evolution of approaches from rule-based structure analyses to complex statistical models, and compares the efficiency of models and methodologies used in the recent years. Although the latest achievements deliver ideal results on particular datasets, the authors mention possible problems for various scenarios and provide suggestions for further development.
Affiliation(s)
- Fidan Musazade
- School of Engineering and Applied Science, The George Washington University, Washington, DC, United States
- Narmin Jamalova
- School of Engineering and Applied Science, The George Washington University, Washington, DC, United States
- Jamaladdin Hasanov
- School of Engineering and Applied Science, The George Washington University, Washington, DC, United States; School of IT and Engineering, ADA University, Baku, Azerbaijan
7
Xu Z, Li J, Yang Z, Li S, Li H. SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer. J Cheminform 2022; 14:41. [PMID: 35778754 PMCID: PMC9248127 DOI: 10.1186/s13321-022-00624-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2022] [Accepted: 06/12/2022] [Indexed: 11/26/2022] [Open Access]
Abstract
Optical chemical structure recognition from scientific publications is essential for rediscovering a chemical structure. It is an extremely challenging problem, and current rule-based and deep-learning methods cannot achieve satisfactory recognition rates. Herein, we propose SwinOCSR, an end-to-end model based on a Swin Transformer. This model uses the Swin Transformer as the backbone to extract image features and introduces Transformer models to convert chemical information from publications into DeepSMILES. A novel chemical structure dataset was constructed to train and verify our method. Our proposed Swin Transformer-based model was extensively tested against the backbone of existing publicly available deep learning methods. The experimental results show that our model significantly outperforms the compared methods, demonstrating the model’s effectiveness. Moreover, we used a focal loss to address the token imbalance problem in the text representation of the chemical structure diagram, and our model achieved an accuracy of 98.58%.
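The abstract mentions a focal loss for the token imbalance problem without giving its parameters. As a hedged sketch, the standard per-token form is FL(p) = -(1 - p)^gamma * log(p), where p is the probability assigned to the correct token. The value gamma = 2.0 and the omission of a class-weighting term are assumptions here, not settings confirmed by the paper.

```python
import math


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss for one token: -(1 - p)^gamma * log(p), where p is the
    probability the model assigned to the correct token."""
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must be in (0, 1]")
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


if __name__ == "__main__":
    # A confidently predicted (typically frequent) token contributes far less
    # to the loss than a poorly predicted (typically rare) one.
    for p in (0.9, 0.5, 0.1):
        print(p, focal_loss(p))
```

With gamma set to 0 the expression reduces to ordinary cross-entropy, which makes the down-weighting of easy tokens straightforward to compare.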
Affiliation(s)
- Zhanpeng Xu
- School of Information Science and Engineering, East China University of Science and Technology, 130 Mei Long Road, Shanghai, 200237, China
- Jianhua Li
- School of Information Science and Engineering, East China University of Science and Technology, 130 Mei Long Road, Shanghai, 200237, China
- Zhaopeng Yang
- School of Information Science and Engineering, East China University of Science and Technology, 130 Mei Long Road, Shanghai, 200237, China
- Shiliang Li
- State Key Laboratory of Bioreactor Engineering, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Honglin Li
- State Key Laboratory of Bioreactor Engineering, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
8
Brinkhaus HO, Zielesny A, Steinbeck C, Rajan K. DECIMER-hand-drawn molecule images dataset. J Cheminform 2022; 14:36. [PMID: 35681226 PMCID: PMC9185882 DOI: 10.1186/s13321-022-00620-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2022] [Accepted: 05/25/2022] [Indexed: 12/01/2022] [Open Access]
Abstract
The translation of images of chemical structures into machine-readable representations of the depicted molecules is known as optical chemical structure recognition (OCSR). There has been a lot of progress over the last three decades in this field, but the development of systems for the recognition of complex hand-drawn structure depictions is still at the beginning. Currently, there is no data for the systematic evaluation of OCSR methods on hand-drawn structures available. Here we present DECIMER — Hand-drawn molecule images, a standardised, openly available benchmark dataset of 5088 hand-drawn depictions of diversely picked chemical structures. Every structure depiction in the dataset is mapped to a machine-readable representation of the underlying molecule. The dataset is openly available and published under the CC-BY 4.0 licence which applies very few limitations. We hope that it will contribute to the further development of the field.
Affiliation(s)
- Henning Otto Brinkhaus
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany
- Achim Zielesny
- Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665, Recklinghausen, Germany
- Christoph Steinbeck
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany
- Kohulan Rajan
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany
9
Weir H, Thompson K, Woodward A, Choi B, Braun A, Martínez TJ. ChemPix: automated recognition of hand-drawn hydrocarbon structures using deep learning. Chem Sci 2021; 12:10622-10633. [PMID: 34447555 PMCID: PMC8365825 DOI: 10.1039/d1sc02957f] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 06/01/2021] [Accepted: 06/28/2021] [Indexed: 11/21/2022] [Open Access]
Abstract
Inputting molecules into chemistry software, such as quantum chemistry packages, currently requires domain expertise, expensive software and/or cumbersome procedures. Leveraging recent breakthroughs in machine learning, we develop ChemPix: an offline, hand-drawn hydrocarbon structure recognition tool designed to remove these barriers. A neural image captioning approach consisting of a convolutional neural network (CNN) encoder and a long short-term memory (LSTM) decoder learned a mapping from photographs of hand-drawn hydrocarbon structures to machine-readable SMILES representations. We generated a large auxiliary training dataset, based on RDKit molecular images, by combining image augmentation, image degradation and background addition. Additionally, a small dataset of ∼600 hand-drawn hydrocarbon chemical structures was crowd-sourced using a phone web application. These datasets were used to train the image-to-SMILES neural network with the goal of maximizing the hand-drawn hydrocarbon recognition accuracy. By forming a committee of the trained neural networks where each network casts one vote for the predicted molecule, we achieved a nearly 10 percentage point improvement of the molecule recognition accuracy and were able to assign a confidence value for the prediction based on the number of agreeing votes. The ensemble model achieved an accuracy of 76% on hand-drawn hydrocarbons, increasing to 86% if the top 3 predictions were considered.
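The committee idea described above (each trained network casts one vote, and the number of agreeing votes gives a confidence value) can be sketched in a few lines. This is an illustration rather than the ChemPix implementation: the function name `committee_vote`, the use of the agreeing fraction as the confidence, and the SMILES strings in the usage line are all assumptions.

```python
from collections import Counter


def committee_vote(predictions):
    """Return (winning prediction, fraction of models that agree with it)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    # most_common(1) yields the prediction with the highest vote count.
    winner, votes = Counter(predictions).most_common(1)[0]
    return winner, votes / len(predictions)


if __name__ == "__main__":
    # Hypothetical SMILES outputs from five independently trained networks.
    print(committee_vote(["CCO", "CCO", "CC", "CCO", "CCC"]))
```

A low agreeing fraction could be used to flag a prediction for manual review instead of accepting it automatically.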
Affiliation(s)
- Hayley Weir
- Department of Chemistry, Stanford University Stanford CA 94305 USA
- SLAC National Accelerator Laboratory 2575 Sand Hill Road Menlo Park CA 94025 USA
- Keiran Thompson
- Department of Chemistry, Stanford University Stanford CA 94305 USA
- SLAC National Accelerator Laboratory 2575 Sand Hill Road Menlo Park CA 94025 USA
- Amelia Woodward
- Department of Chemistry, Stanford University Stanford CA 94305 USA
- Benjamin Choi
- Department of Electrical Engineering, Stanford University Stanford CA 94305 USA
- Augustin Braun
- Department of Chemistry, Stanford University Stanford CA 94305 USA
- Todd J Martínez
- Department of Chemistry, Stanford University Stanford CA 94305 USA
- SLAC National Accelerator Laboratory 2575 Sand Hill Road Menlo Park CA 94025 USA
10
Rajan K, Zielesny A, Steinbeck C. DECIMER: towards deep learning for chemical image recognition. J Cheminform 2020; 12:65. [PMID: 33372621 PMCID: PMC7590713 DOI: 10.1186/s13321-020-00469-w] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 06/11/2020] [Accepted: 10/13/2020] [Indexed: 01/18/2023] [Open Access]
Abstract
The automatic recognition of chemical structure diagrams from the literature is an indispensable component of workflows to re-discover information about chemicals and to make it available in open-access databases. Here we report preliminary findings in our development of Deep lEarning for Chemical ImagE Recognition (DECIMER), a deep learning method based on existing show-and-tell deep neural networks, which makes very few assumptions about the structure of the underlying problem. It translates a bitmap image of a molecule, as found in publications, into a SMILES. The training state reported here does not yet rival the performance of existing traditional approaches, but we present evidence that our method will reach a comparable detection power with sufficient training time. Training success of DECIMER depends on the input data representation: DeepSMILES are superior over SMILES and we have a preliminary indication that the recently reported SELFIES outperform DeepSMILES. An extrapolation of our results towards larger training data sizes suggests that we might be able to achieve near-accurate prediction with 50 to 100 million training structures. This work is entirely based on open-source software and open data and is available to the general public for any purpose.
Affiliation(s)
- Kohulan Rajan
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany
- Achim Zielesny
- Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665, Recklinghausen, Germany
- Christoph Steinbeck
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany
11
Abstract
Structural information about chemical compounds is typically conveyed as 2D images of molecular structures in scientific documents. Unfortunately, these depictions are not a machine-readable representation of the molecules. With a backlog of decades of chemical literature in printed form not properly represented in open-access databases, there is a high demand for the translation of graphical molecular depictions into machine-readable formats. This translation process is known as Optical Chemical Structure Recognition (OCSR). Today, we are looking back on nearly three decades of development in this demanding research field. Most OCSR methods follow a rule-based approach where the key step of vectorization of the depiction is followed by the interpretation of vectors and nodes as bonds and atoms. Opposed to that, some of the latest approaches are based on deep neural networks (DNN). This review provides an overview of all methods and tools that have been published in the field of OCSR. Additionally, a small benchmark study was performed with the available open-source OCSR tools in order to examine their performance.
12
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
- Obdulia Rabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Anália Lourenço
- ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain; Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain; CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
- Julen Oyarzabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Alfonso Valencia
- Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain; Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
13
Karthikeyan M, Vyas R. ChemEngine: harvesting 3D chemical structures of supplementary data from PDF files. J Cheminform 2016; 8:73. [PMID: 28090216 PMCID: PMC5195924 DOI: 10.1186/s13321-016-0175-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/02/2016] [Accepted: 10/18/2016] [Indexed: 11/10/2022] [Open Access]
Abstract
Digital access to chemical journals resulted in a vast array of molecular information that is now available in the supplementary material files in PDF format. However, extracting this molecular information from a PDF document is generally a daunting task. Here we present an approach to harvest 3D molecular data from the supporting information of scientific research articles that are normally available from publisher's resources. In order to demonstrate the feasibility of extracting truly computable molecules from PDF file formats in a fast and efficient manner, we have developed a Java based application, namely ChemEngine. This program recognizes textual patterns from the supplementary data and generates standard molecular structure data (bond matrix, atomic coordinates) that can be subjected to a multitude of computational processes automatically. The methodology has been demonstrated via several case studies on different formats of coordinates data stored in supplementary information files, wherein ChemEngine selectively harvested the atomic coordinates and interpreted them as molecules with high accuracy. The reusability of extracted molecular coordinate data was demonstrated by computing Single Point Energies that were in close agreement with the original computed data provided with the articles. It is envisaged that the methodology will enable large scale conversion of molecular information from supplementary files available in the PDF format into a collection of ready-to-compute molecular data to create an automated workflow for advanced computational processes. Software along with source codes and instructions is available at https://sourceforge.net/projects/chemengine/files/?source=navbar.
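To make the idea of recognizing textual patterns for atomic coordinates concrete, here is a minimal sketch in Python (ChemEngine itself is a Java application, and this is not its code). The assumed line format, an element symbol followed by three decimal numbers, and the names `ATOM_LINE` and `harvest_coordinates` are hypothetical; real supplementary files use many other layouts.

```python
import re

# Assumed line format: element symbol followed by three decimal numbers,
# e.g. "O   0.000   0.000   0.117". Other layouts would need other patterns.
ATOM_LINE = re.compile(
    r"^\s*([A-Z][a-z]?)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s*$"
)


def harvest_coordinates(text):
    """Return a list of (symbol, x, y, z) tuples for lines that look like atoms."""
    atoms = []
    for line in text.splitlines():
        match = ATOM_LINE.match(line)
        if match:
            symbol, x, y, z = match.groups()
            atoms.append((symbol, float(x), float(y), float(z)))
    return atoms


if __name__ == "__main__":
    # Hypothetical text as it might appear after PDF-to-text conversion.
    sample = (
        "Optimized geometry\n"
        "O 0.000 0.000 0.117\n"
        "H 0.000 0.757 -0.471\n"
        "H 0.000 -0.757 -0.471\n"
        "Energy: -76.4\n"
    )
    for atom in harvest_coordinates(sample):
        print(atom)
```

Lines that do not match the pattern, such as headings or energy values, are skipped, which mirrors the selective harvesting described in the abstract in a very simplified way.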
Affiliation(s)
- Muthukumarasamy Karthikeyan
- Chemical Engineering and Process Development (CEPD), CSIR-National Chemical Laboratory, Pashan Road, Pune, Maharashtra 411008 India
- Renu Vyas
- MIT School of Bioengineering Sciences and Research, ADT (Art, Design and Technology) University, Loni Kalbhor, Pune, Maharashtra 412201 India
14
Maeda MH. Current Challenges in Development of a Database of Three-Dimensional Chemical Structures. Front Bioeng Biotechnol 2015; 3:66. [PMID: 26075200 PMCID: PMC4443773 DOI: 10.3389/fbioe.2015.00066] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 01/07/2015] [Accepted: 04/30/2015] [Indexed: 11/21/2022]
Abstract
We are developing a database named 3DMET, a three-dimensional structure database of natural metabolites. There are two major impediments to the creation of 3D chemical structures from a set of planar structure drawings: the limited accuracy of computer programs and insufficient human resources for manual curation. We have tested some 2D–3D converters to convert 2D structure files from external databases. These automatic conversion processes yielded an excessive number of improper conversions. To ascertain the quality of the conversions, we compared IUPAC Chemical Identifier and canonical SMILES notations before and after conversion. Structures whose notations correspond to each other were regarded as a correct conversion in our present work. We found that chiral inversion is the most serious factor during the improper conversion. In the current stage of our database construction, published books or articles have been resources for additions to our database. Chemicals are usually drawn as pictures on the paper. To save human resources, an optical structure reader was introduced. The program was quite useful but some particular errors were observed during our operation. We hope our trials for producing correct 3D structures will help other developers of chemical programs and curators of chemical databases.
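The before/after identifier comparison described in this abstract can be sketched in a few lines of Python. The sketch assumes the InChI strings have already been generated by an external toolkit, and it only inspects the standard InChI stereo layer prefixes (`/b`, `/t`, `/m`, `/s`) to separate stereochemical mismatches, such as chiral inversion, from differences in constitution. The function names and result labels are hypothetical and are not part of 3DMET.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")  # InChI stereo-related layer prefixes

def split_layers(inchi):
    """Split an InChI string into (non-stereo layers, stereo layers)."""
    layers = inchi.split("/")
    # layers[0] is the version prefix and layers[1] the formula; keep both as-is.
    core = layers[:2] + [l for l in layers[2:] if l[:1] not in STEREO_PREFIXES]
    stereo = [l for l in layers[2:] if l[:1] in STEREO_PREFIXES]
    return core, stereo

def classify_conversion(before, after):
    """Label a 2D-to-3D conversion by comparing identifiers before and after."""
    if before == after:
        return "match"
    core_before, _ = split_layers(before)
    core_after, _ = split_layers(after)
    if core_before != core_after:
        return "constitution differs"
    # Same formula, connectivity and hydrogens: only stereo layers changed.
    return "stereo differs"

# L-alanine before conversion, D-alanine after (an inverted stereocentre).
l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(classify_conversion(l_alanine, l_alanine))
print(classify_conversion(l_alanine, d_alanine))
```

Counting how many conversions fall into each label gives a quick picture of whether stereochemistry or constitution is the dominant failure mode of a converter.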
Affiliation(s)
- Miki H Maeda
- Biomolecular Research Unit, National Institute of Agrobiological Sciences, Tsukuba, Japan
15
Clark AM, Williams AJ, Ekins S. Machines first, humans second: on the importance of algorithmic interpretation of open chemistry data. J Cheminform 2015; 7:9. [PMID: 25798198 PMCID: PMC4369291 DOI: 10.1186/s13321-015-0057-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 11/24/2014] [Accepted: 02/23/2015] [Indexed: 11/12/2022]
Abstract
The current rise in the use of open lab notebook techniques means that there are an increasing number of scientists who make chemical information freely and openly available to the entire community as a series of micropublications that are released shortly after the conclusion of each experiment. We propose that this trend be accompanied by a thorough examination of data sharing priorities. We argue that the most significant immediate beneficiary of open data is in fact chemical algorithms, which are capable of absorbing vast quantities of data and using them to present concise insights to working chemists, on a scale that could not be achieved by traditional publication methods. Making this goal practically achievable will require a paradigm shift in the way individual scientists translate their data into digital form, since most contemporary methods of data entry are designed for presentation to humans rather than consumption by machine learning algorithms. We discuss some of the complex issues involved in fixing current methods, as well as some of the immediate benefits that can be gained when open data is published correctly using unambiguous machine-readable formats. Lab notebook entries must target both visualisation by scientists and use by machine learning algorithms.
Affiliation(s)
- Alex M Clark
- Molecular Materials Informatics, 1900 St. Jacques #302, Montreal, H3J 2S1, QC Canada
- Antony J Williams
- Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC 27587 USA
- Sean Ekins
- Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526 USA; Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94010 USA
16
Frasconi P, Gabbrielli F, Lippi M, Marinai S. Markov logic networks for optical chemical structure recognition. J Chem Inf Model 2014; 54:2380-90. [PMID: 25068386 DOI: 10.1021/ci5002197] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 11/29/2022]
Abstract
Optical chemical structure recognition is the problem of converting a bitmap image containing a chemical structure formula into a standard structured representation of the molecule. We introduce a novel approach to this problem based on the pipelined integration of pattern recognition techniques with probabilistic knowledge representation and reasoning. Basic entities and relations (such as textual elements, points, lines, etc.) are first extracted by a low-level processing module. A probabilistic reasoning engine based on Markov logic, embodying chemical and graphical knowledge, is subsequently used to refine these pieces of information. An annotated connection table of atoms and bonds is finally assembled and converted into a standard chemical exchange format. We report a successful evaluation on two large image data sets, showing that the method compares favorably with the current state-of-the-art, especially on degraded low-resolution images. The system is available as a web server at http://mlocsr.dinfo.unifi.it.
Affiliation(s)
- Paolo Frasconi
- Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Firenze, Via di Santa Marta, 3, 50139 Firenze, Italy
17
Gurulingappa H, Mudi A, Toldo L, Hofmann-Apitius M, Bhate J. Challenges in mining the literature for chemical information. RSC Adv 2013. [DOI: 10.1039/c3ra40787j] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 11/21/2022]
18
Tharatipyakul A, Numnark S, Wichadakul D, Ingsriswang S. ChemEx: information extraction system for chemical data curation. BMC Bioinformatics 2012; 13 Suppl 17:S9. [PMID: 23282330 PMCID: PMC3521388 DOI: 10.1186/1471-2105-13-s17-s9] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 11/20/2022]
Abstract
Background: Manual chemical data curation from publications is error-prone and time-consuming, and the resulting data sets are hard to keep up to date. Automatic information extraction can be used as a tool to reduce these problems. Since chemical structures are usually described in images, information extraction needs to combine structure image recognition with text mining. Results: We have developed ChemEx, a chemical information extraction system. ChemEx processes both text and images in publications. The text annotator is able to extract compound, organism, and assay entities from text content, while structure image recognition enables translation of chemical raster images to a machine-readable format. A user can view annotated text along with summarized information on compounds, the organisms that produce those compounds, and assay tests. Conclusions: ChemEx facilitates and speeds up chemical data curation by extracting compounds, organisms, and assays from a large collection of publications. The software and corpus can be downloaded from http://www.biotec.or.th/isl/ChemEx.
Affiliation(s)
- Atima Tharatipyakul
- Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani, Thailand
19
Lounnas V, Vriend G. AsteriX: A Web Server To Automatically Extract Ligand Coordinates from Figures in PDF Articles. J Chem Inf Model 2012; 52:568-76. [DOI: 10.1021/ci2004303] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Affiliation(s)
- V. Lounnas
- CMBI NCMLS Radboud University, Nijmegen Medical Centre, Geert Grooteplein 26-28, 6525 GA Nijmegen, The Netherlands
- G. Vriend
- CMBI NCMLS Radboud University, Nijmegen Medical Centre, Geert Grooteplein 26-28, 6525 GA Nijmegen, The Netherlands
20
Direct use of information extraction from scientific text for modeling and simulation in the life sciences. Library Hi Tech 2009. [DOI: 10.1108/07378830911007637] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
21
Park J, Rosania GR, Saitou K. Tunable machine vision-based strategy for automated annotation of chemical databases. J Chem Inf Model 2009; 49:1993-2001. [PMID: 19621901 PMCID: PMC2907084 DOI: 10.1021/ci900029v] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Abstract
We present a tunable, machine vision-based strategy for automated annotation of virtual small molecule databases. The proposed strategy is based on the use of a machine vision-based tool for extracting structure diagrams in research articles and converting them into connection tables, a virtual "Chemical Expert" system for screening the converted structures based on the adjustable levels of estimated conversion accuracy, and a fragment-based measure for calculating intermolecular similarity. For annotation, calculated chemical similarity between the converted structures and entries in a virtual small molecule database is used to establish the links. The overall annotation performances can be tuned by adjusting the cutoff threshold of the estimated conversion accuracy. We perform an annotation test which attempts to link 121 journal articles registered in PubMed to entries in PubChem which is the largest, publicly accessible chemical database. Two cases of tests are performed, and their results are compared to see how the overall annotation performances are affected by the different threshold levels of the estimated accuracy of the converted structure. Our work demonstrates that over 45% of the articles could have true positive links to entries in the PubChem database with promising recall and precision rates in both tests. Furthermore, we illustrate that the Chemical Expert system which can screen converted structures based on the adjustable levels of estimated conversion accuracy is a key factor impacting the overall annotation performance. We propose that this machine vision-based strategy can be incorporated with the text-mining approach to facilitate extraction of contextual scientific knowledge about a chemical structure, from the scientific literature.
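The two tunable pieces of this strategy, screening by estimated conversion accuracy and linking by fragment-based similarity, can be illustrated with a small Python sketch. The abstract does not say which similarity measure is used, so the Tanimoto coefficient over fragment sets is an assumption here, as are the function names, the record layout, the fragment labels, and the cutoff values.

```python
def tanimoto(frags_a, frags_b):
    """Tanimoto (Jaccard) similarity between two fragment sets (assumed measure)."""
    a, b = set(frags_a), set(frags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def link_structures(converted, database, accuracy_cutoff, similarity_cutoff):
    """Return (structure id, database id, similarity) links.

    converted: list of dicts with 'id', 'estimated_accuracy', 'fragments'
    database:  dict mapping a database id to its fragment set
    """
    links = []
    for structure in converted:
        # Screening step: skip conversions judged too unreliable to annotate.
        if structure["estimated_accuracy"] < accuracy_cutoff:
            continue
        for db_id, db_frags in database.items():
            score = tanimoto(structure["fragments"], db_frags)
            if score >= similarity_cutoff:
                links.append((structure["id"], db_id, score))
    return links

# Hypothetical converted structures and database entries.
converted = [
    {"id": "fig1", "estimated_accuracy": 0.9, "fragments": {"c1ccccc1", "C(=O)O", "CO"}},
    {"id": "fig2", "estimated_accuracy": 0.4, "fragments": {"c1ccccc1", "N"}},
]
database = {
    "entry-A": {"c1ccccc1", "C(=O)O", "CO"},
    "entry-B": {"c1ccccc1", "N"},
}

strict = link_structures(converted, database, accuracy_cutoff=0.5, similarity_cutoff=0.8)
lenient = link_structures(converted, database, accuracy_cutoff=0.3, similarity_cutoff=0.8)
print(strict)
print(lenient)
```

Lowering `accuracy_cutoff` admits more converted structures and therefore more candidate links, which is the kind of trade-off between recall and precision that the abstract describes as tunable.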
Affiliation(s)
- Jungkap Park
- Department of Mechanical Engineering, University of Michigan, Ann Arbor, Michigan 48109
- Gus R. Rosania
- Department of Pharmaceutical Sciences, University of Michigan, Ann Arbor, Michigan 48109
- Kazuhiro Saitou
- Department of Mechanical Engineering, University of Michigan, Ann Arbor, Michigan 48109
22
Valko AT, Johnson AP. CLiDE Pro: the latest generation of CLiDE, a tool for optical chemical structure recognition. J Chem Inf Model 2009; 49:780-7. [PMID: 19298076 DOI: 10.1021/ci800449t] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.3] [Indexed: 11/28/2022]
Abstract
We present CLiDE Pro, the latest version of the output of the long-term CLiDE project for the development of tools for automatic extraction of chemical information from the literature. CLiDE Pro is concerned with the extraction of chemical structure and generic structure information from electronic images of chemical molecules available online as well as pages of scanned chemical documents. The information is extracted in three phases: first, the image is segmented into text and graphical regions; then the graphical regions are analyzed and, where possible, the connection tables are reconstructed; and finally any generic structures are interpreted by matching R-groups found in structure diagrams with the ones located in the text. The program has been tested on a large set of images of chemical structures originating from various sources. The results demonstrate good performance in the reconstruction of connection tables with few errors in the interpretation of the individual drawing features found in the structure diagrams. This full test set is presented for use in the validation of other similar systems.
Affiliation(s)
- Aniko T Valko
- Keymodule Ltd., Hobberley Lodge, Hobberley Lane, Leeds LS17 8JQ, United Kingdom
23
Filippov IV, Nicklaus MC. Optical structure recognition software to recover chemical information: OSRA, an open source solution. J Chem Inf Model 2009; 49:740-3. [PMID: 19434905 DOI: 10.1021/ci800067r] [Citation(s) in RCA: 78] [Impact Index Per Article: 5.2] [Indexed: 11/28/2022]
Abstract
Until recently most scientific and patent documents dealing with chemistry have described molecular structures either with systematic names or with graphical images of Kekulé structures. The latter method poses inherent problems in the automated processing that is needed when the number of documents ranges in the hundreds of thousands or even millions since graphical representations cannot be directly interpreted by a computer. To recover this structural information, which is otherwise all but lost, we have built an optical structure recognition application based on modern advances in image processing implemented in open source tools, OSRA. OSRA can read documents in over 90 graphical formats including GIF, JPEG, PNG, TIFF, PDF, and PS, automatically recognizes and extracts the graphical information representing chemical structures in such documents, and generates the SMILES or SD representation of the encountered molecular structure images.
Affiliation(s)
- Igor V Filippov
- Laboratory of Medicinal Chemistry, SAIC-Frederick, Inc., NCI-Frederick, Frederick, Maryland 21702, USA.
24
Kind T, Scholz M, Fiehn O. How large is the metabolome? A critical analysis of data exchange practices in chemistry. PLoS One 2009; 4:e5440. [PMID: 19415114 PMCID: PMC2673031 DOI: 10.1371/journal.pone.0005440] [Citation(s) in RCA: 97] [Impact Index Per Article: 6.5] [Received: 12/16/2008] [Accepted: 04/15/2009] [Indexed: 11/18/2022]
Abstract
BACKGROUND: Calculating the metabolome size of species by genome-guided reconstruction of metabolic pathways misses all products from orphan genes and from enzymes lacking annotated genes. Hence, metabolomes need to be determined experimentally. Annotations by mass spectrometry would greatly benefit if peer-reviewed public databases could be queried to compile target lists of structures that already have been reported for a given species. We detail current obstacles to compiling such a knowledge base of metabolites. RESULTS: As an example, results are presented for rice. Two rice (Oryza sativa) subspecies have been fully sequenced, Oryza japonica and Oryza indica. Several major small molecule databases were compared for listing known rice metabolites, comprising PubChem, Chemical Abstracts, Beilstein, Patent databases, Dictionary of Natural Products, SetupX/BinBase, KNApSAcK DB, and finally those databases which were obtained by computational approaches, i.e. RiceCyc, KEGG, and Reactome. More than 5,000 small molecules were retrieved when searching these databases. Unfortunately, most often, genuine rice metabolites were retrieved together with non-metabolite database entries such as pesticides. Overlaps from database compound lists were very difficult to compare because structures were either not encoded in machine-readable format or because compound identifiers were not cross-referenced between databases. CONCLUSIONS: We conclude that present databases are not capable of comprehensively retrieving all known metabolites. Metabolome lists are as yet mostly restricted to genome-reconstructed pathways. We suggest that providers of (bio)chemical databases enrich their database identifiers with PubChem IDs and InChIKeys to enable cross-database queries. In addition, peer-reviewed journal repositories need to mandate submission of structures and spectra in machine-readable format to allow automated semantic annotation of articles containing chemical structures. Such changes in publication standards and database architectures will enable researchers to compile current knowledge about the metabolome of species, which may extend to derived information such as spectral libraries, organ-specific metabolites, and cross-study comparisons.
Affiliation(s)
- Tobias Kind
- University of California Davis, Genome Center – Metabolomics, Davis, California, United States of America
- Martin Scholz
- University of California Davis, Genome Center – Metabolomics, Davis, California, United States of America
- Oliver Fiehn
- University of California Davis, Genome Center – Metabolomics, Davis, California, United States of America
25
Park J, Rosania GR, Shedden KA, Nguyen M, Lyu N, Saitou K. Automated extraction of chemical structure information from digital raster images. Chem Cent J 2009; 3:4. [PMID: 19196483 PMCID: PMC2648963 DOI: 10.1186/1752-153x-3-4] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.4] [Received: 10/25/2008] [Accepted: 02/05/2009] [Indexed: 11/16/2022]
Abstract
Background: To search for chemical structures in research articles, diagrams or text representing molecules need to be translated to a standard chemical file format compatible with cheminformatic search engines. Nevertheless, chemical information contained in research articles is often referenced as analog diagrams of chemical structures embedded in digital raster images. To automate analog-to-digital conversion of chemical structure diagrams in scientific research articles, several software systems have been developed, but their algorithmic performance and utility in cheminformatic research have not been investigated. Results: This paper aims to provide critical reviews of these systems and also reports our recent development of ChemReader, a fully automated tool for extracting chemical structure diagrams in research articles and converting them into standard, searchable chemical file formats. Basic algorithms for recognizing lines and letters representing bonds and atoms in chemical structure diagrams can be independently run in sequence from a graphical user interface, and the algorithm parameters can be readily changed, to facilitate additional development specifically tailored to a chemical database annotation scheme. Compared with existing software programs such as OSRA, Kekule, and CLiDE, our results indicate that ChemReader outperforms other software systems on several sets of sample images from diverse sources in terms of the rate of correct outputs and the accuracy of extracting molecular substructure patterns. Conclusion: The availability of ChemReader as a cheminformatic tool for extracting chemical structure information from digital raster images allows research and development groups to enrich their chemical structure databases by annotating the entries with published research articles. Based on its stable performance and high accuracy, ChemReader may be sufficiently accurate for annotating the chemical database with links to scientific research articles.
Affiliation(s)
- Jungkap Park
- Michigan Alliance for Cheminformatic Exploration, Ann Arbor, MI, USA.
26
Algorri ME, Zimmermann M, Friedrich CM, Akle S, Hofmann-Apitius M. Reconstruction of chemical molecules from images. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2008; 2007:4609-12. [PMID: 18003032 DOI: 10.1109/iembs.2007.4353366] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/05/2022]
Abstract
We have developed a system for the automatic reconstruction of chemical molecules from images. The system takes as input an electronically produced image of a chemical molecule and produces an SDF file containing the complete chemical description of the molecule. The SDF file can then be read and used by most chemical computer programs. Our system finds extensive application in information extraction problems where the molecule images contained in chemical documents need to be rendered computer readable. We have benchmarked our system against a commercially available product and have also tested it using chemical databases of several thousand images. The system can be parameterized to reconstruct images of different sources and different characteristics.
Affiliation(s)
- Maria-Elena Algorri
- Department of Digital Systems, Instituto Tecnologico Autonomo de Mexico, Mexico City.
27
Rosania GR, Crippen G, Woolf P, States D, Shedden K. A Cheminformatic Toolkit for Mining Biomedical Knowledge. Pharm Res 2007; 24:1791-802. [PMID: 17385012 DOI: 10.1007/s11095-007-9285-5] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Received: 01/11/2007] [Accepted: 02/27/2007] [Indexed: 01/31/2023]
Abstract
PURPOSE: Cheminformatics can be broadly defined to encompass any activity related to the application of information technology to the study of properties, effects and uses of chemical agents. One of the most important current challenges in cheminformatics is to allow researchers to search databases of biomedical knowledge, using chemical structures as input. MATERIALS AND METHODS: An important step towards this goal was the establishment of PubChem, an open, centralized database of small molecules accessible through the World Wide Web. While PubChem is primarily intended to serve as a repository for high throughput screening data from federally-funded screening centers and academic research laboratories, the major impact of PubChem could also reside in its ability to serve as a chemical gateway to biomedical databases such as PubMed. CONCLUSION: This article will review cheminformatic tools that can be applied to facilitate annotation of PubChem through links to the scientific literature; to integrate PubChem with transcriptomic, proteomic, and metabolomic datasets; to incorporate results of numerical simulations of physiological systems into PubChem annotation; and ultimately, to translate data of chemical genomics screening efforts into information that will benefit biomedical researchers and physician scientists across all therapeutic areas.
Affiliation(s)
- Gus R Rosania
- Department of Pharmaceutical Sciences, University of Michigan College of Pharmacy, 428 Church Street, Ann Arbor, MI 48109, USA.
28
Abstract
It is easier to find too many documents on a life science topic than to find the right information inside these documents. With the application of text data mining to biological documents, it is no surprise that researchers are starting to look at applications that mine out chemical information. The mining of chemical entities (names and structures) brings with it some unique challenges, which commercial and academic efforts are beginning to address. Ultimately, life science text data mining applications need to focus on the marriage of biological and chemical information.
Affiliation(s)
- Debra L Banville
- AstraZeneca Pharmaceuticals, 1800 Concord Pike, Wilmington, DE 19850, USA.
29
Fluck J, Zimmermann M, Kurapkat G, Hofmann M. Information extraction technologies for the life science industry. Drug Discovery Today: Technologies 2005; 2:217-224. [PMID: 24981939 DOI: 10.1016/j.ddtec.2005.08.013] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 06/03/2023]
Abstract
Access to relevant information and knowledge is essential for all steps of the drug discovery process. However, keeping track of relevant information in publications and patents becomes a real challenge for scientists and managers in industrial research. Computer-aided information extraction (IE) systems have been developed to support the work of scientists by extracting relevant information from scientific publications and presenting it in an aggregated, condensed form. In this review, we will give an overview on current information extraction strategies in the life sciences with a special focus on biological entity recognition and more recent developments towards the identification and extraction of chemical compound names and structures.
Affiliation(s)
- Juliane Fluck
- Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754 Sankt Augustin, Germany.
- Marc Zimmermann
- Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754 Sankt Augustin, Germany
- Günther Kurapkat
- TEMIS Deutschland GmbH, Kurfürstenanlage 3, 69115 Heidelberg, Germany
- Martin Hofmann
- Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754 Sankt Augustin, Germany
30
Gkoutos GV, Rzepa H, Clark RM, Adjei O, Johal H. Chemical machine vision: automated extraction of chemical metadata from raster images. J Chem Inf Comput Sci 2004; 43:1342-55. [PMID: 14502466 DOI: 10.1021/ci034017n] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/29/2022]
Abstract
We present a novel application of machine vision methods for the identification of chemical composition diagrams from two-dimensional digital raster images. The method is based on the use of Gabor wavelets and an energy function to derive feature vectors from digital images. These are used for training and classification purposes using a Kohonen network for classification with the Euclidean distance norm. We compare this method with previous approaches to transforming such images to a molecular connection table, which are designed to achieve complete atom connection table fidelity but at the expense of requiring human interaction. The present texture-based approach is complementary in attempting to recognize higher order features such as the presence of a chemical representation in the original raster image. This information can be used for providing chemical metadata descriptors of the original image as part of a robot-based Internet resource discovery tool.
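The classification stage described in this abstract, a Kohonen network that assigns a feature vector to its nearest node under the Euclidean distance norm, can be sketched as follows. The Gabor-wavelet feature extraction is not reproduced: feature vectors are assumed to be precomputed, and the two-dimensional vectors, node labels, and learning rate below are illustrative assumptions. The neighbourhood update of a full self-organizing map is omitted for brevity.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def best_matching_unit(nodes, features):
    """Index of the node whose weight vector is closest to the features."""
    return min(range(len(nodes)), key=lambda i: euclidean(nodes[i], features))

def train_step(nodes, features, learning_rate):
    """Move the winning node toward the input (neighbourhood update omitted)."""
    winner = best_matching_unit(nodes, features)
    nodes[winner] = [
        w + learning_rate * (x - w) for w, x in zip(nodes[winner], features)
    ]
    return winner

# Two hypothetical nodes with hypothetical class labels.
nodes = [[0.0, 0.0], [1.0, 1.0]]
labels = ["no chemical diagram", "chemical diagram"]

image_features = [0.8, 0.9]  # stand-in for a texture feature vector
predicted = labels[best_matching_unit(nodes, image_features)]
print(predicted)

winner = train_step(nodes, image_features, learning_rate=0.5)
print(winner, nodes[winner])
```

Because the output is only a class label for the whole image, this kind of classifier answers whether a chemical diagram is present, not what the connection table is, which matches the complementary role the abstract describes.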
Affiliation(s)
- Georgios V Gkoutos
- Department of Chemistry, Imperial College of Science, Technology & Medicine, Exhibition Road, South Kensington, London, England SW7 2AY
31
Kemp N, Lynch M. Extraction of Information from the Text of Chemical Patents. 1. Identification of Specific Chemical Names. J Chem Inf Comput Sci 1998. [DOI: 10.1021/ci980324v] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]
Affiliation(s)
- Nick Kemp
- Department of Information Studies, University of Sheffield, Sheffield S10 2TN, United Kingdom
- Michael Lynch
- Department of Information Studies, University of Sheffield, Sheffield S10 2TN, United Kingdom
32
Simon A, Johnson AP. Recent Advances in the CLiDE Project: Logical Layout Analysis of Chemical Documents. J Chem Inf Comput Sci 1997. [DOI: 10.1021/ci9601022] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Affiliation(s)
- Anikó Simon
- ICAMS, School of Chemistry, University of Leeds, Leeds LS2 9JT, U.K.
- A. Peter Johnson
- ICAMS, School of Chemistry, University of Leeds, Leeds LS2 9JT, U.K.