151
|
Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Wei CH, Comeau DC, Antunes R, Matos S, Chen Q, Elangovan A, Panyam NC, Verspoor K, Liu H, Wang Y, Liu Z, Altinel B, Hüsünbeyi ZM, Özgür A, Fergadis A, Wang CK, Dai HJ, Tran T, Kavuluru R, Luo L, Steppi A, Zhang J, Qu J, Lu Z. Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine. Database (Oxford) 2019; 2019:5303240. [PMID: 30689846 PMCID: PMC6348314 DOI: 10.1093/database/bay147] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2018] [Accepted: 12/19/2018] [Indexed: 12/16/2022]
Abstract
The Precision Medicine Initiative is a multicenter effort aiming at formulating personalized treatments leveraging on individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of BioCreative VI Precision Medicine Track was to extract this particular type of information and was organized in two tasks: (i) document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations and (ii) relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods. Given the level of participation and the individual team results we find the precision medicine track to be successful in engaging the text-mining research community. In the meantime, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPI affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.
Collapse
Affiliation(s)
- Rezarta Islamaj Dogan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | | | - Chih-Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Rui Antunes
- Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
| | - Sérgio Matos
- Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
| | - Qingyu Chen
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Aparna Elangovan
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Nagesh C Panyam
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Hongfang Liu
- Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
| | - Yanshan Wang
- Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
| | - Zhuang Liu
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Berna Altinel
- Department of Computer Engineering, Marmara University, Istanbul, Turkey
| | | | | | - Aris Fergadis
- School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, Athens, Greece
| | - Chen-Kai Wang
- Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan
| | - Hong-Jie Dai
- Department of Electrical Engineering, National Kaousiung University of Science and Technology, Kaohsiung, Taiwan
| | - Tung Tran
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
| | - Ramakanth Kavuluru
- Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
| | - Ling Luo
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Albert Steppi
- Department of Statistics, Florida State University, Florida, USA
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Florida, USA
| | - Jinchan Qu
- Department of Statistics, Florida State University, Florida, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| |
Collapse
|
152
|
Lim S, Kang J. Chemical-gene relation extraction using recursive neural network. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018; 2018:5042822. [PMID: 29961818 PMCID: PMC6014134 DOI: 10.1093/database/bay060] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/28/2018] [Accepted: 05/25/2018] [Indexed: 11/14/2022]
Abstract
In this article, we describe our system for the CHEMPROT task of the BioCreative VI challenge. Although considerable research on the named entity recognition of genes and drugs has been conducted, there is limited research on extracting relationships between them. Extracting relations between chemical compounds and genes from the literature is an important element in pharmacological and clinical research. The CHEMPROT task of BioCreative VI aims to promote the development of text mining systems that can be used to automatically extract relationships between chemical compounds and genes. We tested three recursive neural network approaches to improve the performance of relation extraction. In the BioCreative VI challenge, we developed a tree-Long Short-Term Memory networks (tree-LSTM) model with several additional features including a position feature and a subtree containment feature, and we also applied an ensemble method. After the challenge, we applied additional pre-processing steps to the tree-LSTM model, and we tested the performance of another recursive neural network model called Stack-augmented Parser Interpreter Neural Network (SPINN). Our tree-LSTM model achieved an F-score of 58.53% in the BioCreative VI challenge. Our tree-LSTM model with additional pre-processing and the SPINN model obtained F-scores of 63.7 and 64.1%, respectively. Database URL: https://github.com/arwhirang/recursive_chemprot
Collapse
Affiliation(s)
- Sangrak Lim
- Department of Computer Science and Engineering, Korea University, Anam-dong 5-ga, Seongbuk-gu, Seoul, South Korea
| | - Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Anam-dong 5-ga, Seongbuk-gu, Seoul, South Korea
| |
Collapse
|
153
|
Zheng W, Lin H, Liu X, Xu B. A document level neural model integrated domain knowledge for chemical-induced disease relations. BMC Bioinformatics 2018; 19:328. [PMID: 30223767 PMCID: PMC6142695 DOI: 10.1186/s12859-018-2316-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2018] [Accepted: 08/14/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The effective combination of texts and knowledge may improve performances of natural language processing tasks. For the recognition of chemical-induced disease (CID) relations which may span sentence boundaries in an article, although existing CID systems explored the utilization for knowledge bases, the effects of different knowledge on the identification of a special CID haven't been distinguished by these systems. Moreover, systems based on neural network only constructed sentence or mention level models. RESULTS In this work, we proposed an effective document level neural model integrated domain knowledge to extract CID relations from biomedical articles. Basic semantic information of an article with respect to a special CID candidate pair was learned from the document level sub-network module. Furthermore, knowledge attention depending on the representation of the article was proposed to distinguish the influences of different knowledge on the special CID pair and then the final representation of knowledge was formed by aggregating weighed knowledge. Finally, the integrated representations of texts and knowledge were passed to a softmax classifier to perform the CID recognition. Experimental results on the chemical-disease relation corpus proposed by BioCreative V show that our proposed system integrated knowledge achieves a good overall performance compared with other state-of-the-art systems. CONCLUSIONS Experimental analyses demonstrate that the introduced attention mechanism on domain knowledge plays a significant role in distinguishing influences of different knowledge on the judgment for a special CID relation.
Collapse
Affiliation(s)
- Wei Zheng
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China.,College of Software, Dalian JiaoTong University, Dalian, China
| | - Hongfei Lin
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China.
| | - Xiaoxia Liu
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Bo Xu
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China.
| |
Collapse
|
154
|
Thompson P, Daikou S, Ueno K, Batista-Navarro R, Tsujii J, Ananiadou S. Annotation and detection of drug effects in text for pharmacovigilance. J Cheminform 2018; 10:37. [PMID: 30105604 PMCID: PMC6089860 DOI: 10.1186/s13321-018-0290-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2018] [Accepted: 07/20/2018] [Indexed: 02/02/2023] Open
Abstract
Pharmacovigilance (PV) databases record the benefits and risks of different drugs, as a means to ensure their safe and effective use. Creating and maintaining such resources can be complex, since a particular medication may have divergent effects in different individuals, due to specific patient characteristics and/or interactions with other drugs being administered. Textual information from various sources can provide important evidence to curators of PV databases about the usage and effects of drug targets in different medical subjects. However, the efficient identification of relevant evidence can be challenging, due to the increasing volume of textual data. Text mining (TM) techniques can support curators by automatically detecting complex information, such as interactions between drugs, diseases and adverse effects. This semantic information supports the quick identification of documents containing information of interest (e.g., the different types of patients in which a given adverse drug reaction has been observed to occur). TM tools are typically adapted to different domains by applying machine learning methods to corpora that are manually labelled by domain experts using annotation guidelines to ensure consistency. We present a semantically annotated corpus of 597 MEDLINE abstracts, PHAEDRA, encoding rich information on drug effects and their interactions, whose quality is assured through the use of detailed annotation guidelines and the demonstration of high levels of inter-annotator agreement (e.g., 92.6% F-Score for identifying named entities and 78.4% F-Score for identifying complex events, when relaxed matching criteria are applied). To our knowledge, the corpus is unique in the domain of PV, according to the level of detail of its annotations. To illustrate the utility of the corpus, we have trained TM tools based on its rich labels to recognise drug effects in text automatically. The corpus and annotation guidelines are available at: http://www.nactem.ac.uk/PHAEDRA/ .
Collapse
Affiliation(s)
- Paul Thompson
- National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
| | - Sophia Daikou
- National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
| | - Kenju Ueno
- Artificial Intelligence Research Center, National Research and Development Agency (AIST), Tokyo Waterfront 2-3-2 Aomi, Koto-ku, Tokyo, 135-0064 Japan
| | - Riza Batista-Navarro
- National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
| | - Jun’ichi Tsujii
- National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
- Artificial Intelligence Research Center, National Research and Development Agency (AIST), Tokyo Waterfront 2-3-2 Aomi, Koto-ku, Tokyo, 135-0064 Japan
| | - Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
| |
Collapse
|
155
|
Li H, Yang M, Chen Q, Tang B, Wang X, Yan J. Chemical-induced disease extraction via recurrent piecewise convolutional neural networks. BMC Med Inform Decis Mak 2018; 18:60. [PMID: 30066652 PMCID: PMC6069297 DOI: 10.1186/s12911-018-0629-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND Extracting relationships between chemicals and diseases from unstructured literature have attracted plenty of attention since the relationships are very useful for a large number of biomedical applications such as drug repositioning and pharmacovigilance. A number of machine learning methods have been proposed for chemical-induced disease (CID) extraction due to some publicly available annotated corpora. Most of them suffer from time-consuming feature engineering except deep learning methods. In this paper, we propose a novel document-level deep learning method, called recurrent piecewise convolutional neural networks (RPCNN), for CID extraction. RESULTS Experimental results on a benchmark dataset, the CDR (Chemical-induced Disease Relation) dataset of the BioCreative V challenge for CID extraction show that the highest precision, recall and F-score of our RPCNN-based CID extraction system are 65.24, 77.21 and 70.77%, which is competitive with other state-of-the-art systems. CONCLUSIONS A novel deep learning method is proposed for document-level CID extraction, where domain knowledge, piecewise strategy, attention mechanism, and multi-instance learning are combined together. The effectiveness of the method is proved by experiments conducted on a benchmark dataset.
Collapse
Affiliation(s)
- Haodi Li
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, Guangdong, China.,Shenzhen Calligraphy Digital Simulation Technology Engineering Laboratory, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Ming Yang
- Pharmacy Department, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, Guandong, Shenzhen, China
| | - Qingcai Chen
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, Guangdong, China. .,Shenzhen Calligraphy Digital Simulation Technology Engineering Laboratory, Harbin Institute of Technology, Shenzhen, Guangdong, China.
| | - Buzhou Tang
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, Guangdong, China. .,Shenzhen Calligraphy Digital Simulation Technology Engineering Laboratory, Harbin Institute of Technology, Shenzhen, Guangdong, China.
| | - Xiaolong Wang
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, Guangdong, China.,Shenzhen Calligraphy Digital Simulation Technology Engineering Laboratory, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Jun Yan
- Yidu Cloud (Beijing) Technology Co., Ltd, Beijing, China
| |
Collapse
|
156
|
Rakhi NK, Tuwani R, Mukherjee J, Bagler G. Data-driven analysis of biomedical literature suggests broad-spectrum benefits of culinary herbs and spices. PLoS One 2018; 13:e0198030. [PMID: 29813110 PMCID: PMC5973616 DOI: 10.1371/journal.pone.0198030] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2018] [Accepted: 05/11/2018] [Indexed: 01/11/2023] Open
Abstract
Spices and herbs are key dietary ingredients used across cultures worldwide. Beyond their use as flavoring and coloring agents, the popularity of these aromatic plant products in culinary preparations has been attributed to their antimicrobial properties. Last few decades have witnessed an exponential growth of biomedical literature investigating the impact of spices and herbs on health, presenting an opportunity to mine for patterns from empirical evidence. Systematic investigation of empirical evidence to enumerate the health consequences of culinary herbs and spices can provide valuable insights into their therapeutic utility. We implemented a text mining protocol to assess the health impact of spices by assimilating, both, their positive and negative effects. We conclude that spices show broad-spectrum benevolence across a range of disease categories in contrast to negative effects that are comparatively narrow-spectrum. We also implement a strategy for disease-specific culinary recommendations of spices based on their therapeutic tradeoff against adverse effects. Further by integrating spice-phytochemical-disease associations, we identify bioactive spice phytochemicals potentially involved in their therapeutic effects. Our study provides a systems perspective on health effects of culinary spices and herbs with applications for dietary recommendations as well as identification of phytochemicals potentially involved in underlying molecular mechanisms.
Collapse
Affiliation(s)
- N. K. Rakhi
- Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India
- Department of Bioscience and Bioengineering, Indian Institute of Technology Jodhpur, Jodhpur, India
| | - Rudraksh Tuwani
- Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India
| | - Jagriti Mukherjee
- Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India
| | - Ganesh Bagler
- Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India
- * E-mail: ,
| |
Collapse
|
157
|
Warikoo N, Chang YC, Hsu WL. LPTK: a linguistic pattern-aware dependency tree kernel approach for the BioCreative VI CHEMPROT task. Database (Oxford) 2018; 2018:5139652. [PMID: 30346607 PMCID: PMC6196310 DOI: 10.1093/database/bay108] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2018] [Revised: 08/30/2018] [Accepted: 09/24/2018] [Indexed: 11/14/2022]
Abstract
Identifying the interactions between chemical compounds and genes from biomedical literatures is one of the frequently discussed topics of text mining in the life science field. In this paper, we describe Linguistic Pattern-Aware Dependency Tree Kernel, a linguistic interaction pattern learning method developed for CHEMPROT task-BioCreative VI, to capture chemical-protein interaction (CPI) patterns within biomedical literatures. We also introduce a framework to integrate these linguistic patterns with smooth partial tree kernel to extract the CPIs. This new method of feature representation models aspects of linguistic probability in geometric representation, which not only optimizes the sufficiency of feature dimension for classification, but also defines features as interpretable contexts rather than long vectors of numbers. In order to test the robustness and efficiency of our system in identifying different kinds of biological interactions, we evaluated our framework on three separate data sets, i.e. CHEMPROT corpus, Chemical-Disease Relation corpus and Protein-Protein Interaction corpus. Corresponding experiment results demonstrate that our method is effective and outperforms several compared systems for each data set.
Collapse
Affiliation(s)
- Neha Warikoo
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan
- Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Yung-Chun Chang
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
| | - Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| |
Collapse
|
158
|
Zhao Z, Yang Z, Luo L, Wang L, Zhang Y, Lin H, Wang J. Disease named entity recognition from biomedical literature using a novel convolutional neural network. BMC Med Genomics 2017; 10:73. [PMID: 29297367 PMCID: PMC5751782 DOI: 10.1186/s12920-017-0316-8] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Background Automatic disease named entity recognition (DNER) is of utmost importance for development of more sophisticated BioNLP tools. However, most conventional CRF based DNER systems rely on well-designed features whose selection is labor intensive and time-consuming. Though most deep learning methods can solve NER problems with little feature engineering, they employ additional CRF layer to capture the correlation information between labels in neighborhoods which makes them much complicated. Methods In this paper, we propose a novel multiple label convolutional neural network (MCNN) based disease NER approach. In this approach, instead of the CRF layer, a multiple label strategy (MLS) first introduced by us, is employed. First, the character-level embedding, word-level embedding and lexicon feature embedding are concatenated. Then several convolutional layers are stacked over the concatenated embedding. Finally, MLS strategy is applied to the output layer to capture the correlation information between neighboring labels. Results As shown by the experimental results, MCNN can achieve the state-of-the-art performance on both NCBI and CDR corpora. Conclusions The proposed MCNN based disease NER method achieves the state-of-the-art performance with little feature engineering. And the experimental results show the MLS strategy’s effectiveness of capturing the correlation information between labels in the neighborhood.
Collapse
Affiliation(s)
- Zhehuan Zhao
- College of Computer Science and Technology, Dalian University of Technology, Dalian, 116023, China
| | - Zhihao Yang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, 116023, China.
| | - Ling Luo
- College of Computer Science and Technology, Dalian University of Technology, Dalian, 116023, China
| | - Lei Wang
- Beijing Institute of Health Administration and Medical Information, Beijing, 100850, China.
| | - Yin Zhang
- Beijing Institute of Health Administration and Medical Information, Beijing, 100850, China
| | - Hongfei Lin
- College of Computer Science and Technology, Dalian University of Technology, Dalian, 116023, China
| | - Jian Wang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, 116023, China
| |
Collapse
|
159
|
Habibi M, Weber L, Neves M, Wiegandt DL, Leser U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 2017; 33:i37-i48. [PMID: 28881963 PMCID: PMC5870729 DOI: 10.1093/bioinformatics/btx228] [Citation(s) in RCA: 205] [Impact Index Per Article: 25.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/04/2022] Open
Abstract
MOTIVATION Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. RESULTS We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. AVAILABILITY AND IMPLEMENTATION The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/ . CONTACT habibima@informatik.hu-berlin.de.
Collapse
Affiliation(s)
- Maryam Habibi
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Leon Weber
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Mariana Neves
- Enterprise Platform and Integration Concepts, Hasso-Plattner-Institute, Potsdam, Germany
| | - David Luis Wiegandt
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Ulf Leser
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
| |
Collapse
|
160
|
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 129] [Impact Index Per Article: 16.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Collapse
Affiliation(s)
- Martin Krallinger
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre , C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
| | - Obdulia Rabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra , Avenida Pio XII 55, Pamplona E-31008, Spain
| | - Anália Lourenço
- ESEI - Department of Computer Science, University of Vigo , Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain.,Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia) , Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain.,CEB-Centre of Biological Engineering, University of Minho , Campus de Gualtar, Braga 4710-057, Portugal
| | - Julen Oyarzabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra , Avenida Pio XII 55, Pamplona E-31008, Spain
| | - Alfonso Valencia
- Life Science Department, Barcelona Supercomputing Centre (BSC-CNS) , C/Jordi Girona, 29-31, Barcelona E-08034, Spain.,Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona , C/ Baldiri Reixac 10, Barcelona E-08028, Spain.,Institució Catalana de Recerca i Estudis Avançats (ICREA) , Passeig de Lluís Companys 23, Barcelona E-08010, Spain
| |
Collapse
|
161
|
Gu J, Sun F, Qian L, Zhou G. Chemical-induced disease relation extraction via convolutional neural network. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017; 2017:3098440. [PMID: 28415073 PMCID: PMC5467558 DOI: 10.1093/database/bax024] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/13/2016] [Accepted: 03/01/2017] [Indexed: 01/08/2023]
Abstract
This article describes our work on the BioCreative-V chemical–disease relation (CDR) extraction task, which employed a maximum entropy (ME) model and a convolutional neural network model for relation extraction at inter- and intra-sentence level, respectively. In our work, relation extraction between entity concepts in documents was simplified to relation extraction between entity mentions. We first constructed pairs of chemical and disease mentions as relation instances for training and testing stages, then we trained and applied the ME model and the convolutional neural network model for inter- and intra-sentence level, respectively. Finally, we merged the classification results from mention level to document level to acquire the final relations between chemical and disease concepts. The evaluation on the BioCreative-V CDR corpus shows the effectiveness of our proposed approach. Database URL:http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/
Collapse
Affiliation(s)
- Jinghang Gu
- School of Computer Science and Technology, Soochow University, 1 Shizi Street, Suzhou, China
| | - Fuqing Sun
- Department of Gynecology Minimally Invasive Center, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, 17 Qihelou Street, Beijing, China
| | - Longhua Qian
- School of Computer Science and Technology, Soochow University, 1 Shizi Street, Suzhou, China
| | - Guodong Zhou
- School of Computer Science and Technology, Soochow University, 1 Shizi Street, Suzhou, China
| |
Collapse
|
162
|
Pérez-Pérez M, Pérez-Rodríguez G, Fdez-Riverola F, Lourenço A. Collaborative relation annotation and quality analysis in Markyt environment. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017; 2017:4693828. [PMID: 29220479 PMCID: PMC5737204 DOI: 10.1093/database/bax090] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/18/2017] [Accepted: 11/09/2017] [Indexed: 11/30/2022]
Abstract
Text mining is showing potential to help in biomedical knowledge integration and discovery at various levels. However, results depend largely on the specifics of the knowledge problem and, in particular, on the ability to produce high-quality benchmarking corpora that may support the training and evaluation of automatic prediction systems. Annotation tools enabling the flexible and customizable production of such corpora are thus pivotal. The open-source Markyt annotation environment brings together the latest web technologies to offer a wide range of annotation capabilities in a domain-agnostic way. It enables the management of multi-user and multi-round annotation projects, including inter-annotator agreement and consensus assessments. Also, Markyt supports the description of entity and relation annotation guidelines on a project basis, being flexible to partial word tagging and the occurrence of annotation overlaps. This paper describes the current release of Markyt, namely new annotation perspectives, which enable the annotation of relations among entities, and enhanced analysis capabilities. Several demos, inspired by public biomedical corpora, are presented as means to better illustrate such functionalities. Markyt aims to bring together annotation capabilities of broad interest to those producing annotated corpora. Markyt demonstration projects describe 20 different annotation tasks of varied document sources (e.g. abstracts, twitters or drug labels) and languages (e.g. English, Spanish or Chinese). Continuous development is based on feedback from practical applications as well as community reports on short- and medium-term mining challenges. Markyt is freely available for non-commercial use at http://markyt.org. Database URL:http://markyt.org
Collapse
Affiliation(s)
- Martín Pérez-Pérez
- ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain.,CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
| | - Gael Pérez-Rodríguez
- ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain.,CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
| | - Florentino Fdez-Riverola
- ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain.,CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
| | - Anália Lourenço
- ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain.,CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain.,CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
| |
Collapse
|
163
|
Davis AP, Grondin CJ, Johnson RJ, Sciaky D, King BL, McMorran R, Wiegers J, Wiegers TC, Mattingly CJ. The Comparative Toxicogenomics Database: update 2017. Nucleic Acids Res 2016; 45:D972-D978. [PMID: 27651457 PMCID: PMC5210612 DOI: 10.1093/nar/gkw838] [Citation(s) in RCA: 397] [Impact Index Per Article: 44.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2016] [Accepted: 09/09/2016] [Indexed: 12/19/2022] Open
Abstract
The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) provides information about interactions between chemicals and gene products, and their relationships to diseases. Core CTD content (chemical-gene, chemical-disease and gene-disease interactions manually curated from the literature) are integrated with each other as well as with select external datasets to generate expanded networks and predict novel associations. Today, core CTD includes more than 30.5 million toxicogenomic connections relating chemicals/drugs, genes/proteins, diseases, taxa, Gene Ontology (GO) annotations, pathways, and gene interaction modules. In this update, we report a 33% increase in our core data content since 2015, describe our new exposure module (that harmonizes exposure science information with core toxicogenomic data) and introduce a novel dataset of GO-disease inferences (that identify common molecular underpinnings for seemingly unrelated pathologies). These advancements centralize and contextualize real-world chemical exposures with molecular pathways to help scientists generate testable hypotheses in an effort to understand the etiology and mechanisms underlying environmentally influenced diseases.
Collapse
Affiliation(s)
- Allan Peter Davis
- Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA
| | - Cynthia J Grondin
- Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA
| | - Robin J Johnson
- Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA
| | - Daniela Sciaky
- Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA
| | - Benjamin L King
- Department of Bioinformatics, The Mount Desert Island Biological Laboratory, Salisbury Cove, ME 04672, USA
| | - Roy McMorran
- Department of Bioinformatics, The Mount Desert Island Biological Laboratory, Salisbury Cove, ME 04672, USA
| | - Jolene Wiegers
- Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA
| | - Thomas C Wiegers
- Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA
| | - Carolyn J Mattingly
- Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA.,Center for Human Health and the Environment, North Carolina State University, Raleigh, NC 27695, USA
| |
Collapse
|
164
|
Gu J, Qian L, Zhou G. Chemical-induced disease relation extraction with various linguistic features. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw042. [PMID: 27052618 PMCID: PMC4822558 DOI: 10.1093/database/baw042] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/04/2015] [Accepted: 03/04/2016] [Indexed: 01/06/2023]
Abstract
Understanding the relations between chemicals and diseases is crucial in various biomedical tasks such as new drug discoveries and new therapy developments. While manually mining these relations from the biomedical literature is costly and time-consuming, such a procedure is often difficult to keep up-to-date. To address these issues, the BioCreative-V community proposed a challenging task of automatic extraction of chemical-induced disease (CID) relations in order to benefit biocuration. This article describes our work on the CID relation extraction task on the BioCreative-V tasks. We built a machine learning based system that utilized simple yet effective linguistic features to extract relations with maximum entropy models. In addition to leveraging various features, the hypernym relations between entity concepts derived from the Medical Subject Headings (MeSH)-controlled vocabulary were also employed during both training and testing stages to obtain more accurate classification models and better extraction performance, respectively. We demoted relation extraction between entities in documents to relation extraction between entity mentions. In our system, pairs of chemical and disease mentions at both intra- and inter-sentence levels were first constructed as relation instances for training and testing, then two classification models at both levels were trained from the training examples and applied to the testing examples. Finally, we merged the classification results from mention level to document level to acquire final relations between chemicals and diseases. Our system achieved promising F-scores of 60.4% on the development dataset and 58.3% on the test dataset using gold-standard entity annotations, respectively. Database URL: https://github.com/JHnlp/BC5CIDTask
Collapse
Affiliation(s)
- Jinghang Gu
- Natural Language Processing Lab, School of Computer Science and Technology, Soochow University, 1 Shizi Street, Suzhou, China, 215006
| | - Longhua Qian
- Natural Language Processing Lab, School of Computer Science and Technology, Soochow University, 1 Shizi Street, Suzhou, China, 215006
| | - Guodong Zhou
- Natural Language Processing Lab, School of Computer Science and Technology, Soochow University, 1 Shizi Street, Suzhou, China, 215006
| |
Collapse
|