Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Lu Z, Kao HY, Wei CH, Huang M, Liu J, Kuo CJ, Hsu CN, Tsai RTH, Dai HJ, Okazaki N, Cho HC, Gerner M, Solt I, Agarwal S, Liu F, Vishnyakova D, Ruch P, Romacker M, Rinaldi F, Bhattacharya S, Srinivasan P, Liu H, Torii M, Matos S, Campos D, Verspoor K, Livingston KM, Wilbur WJ. The gene normalization task in BioCreative III. BMC Bioinformatics 2011;12 Suppl 8:S2. [PMID: 22151901 PMCID: PMC3269937 DOI: 10.1186/1471-2105-12-s8-s2] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

For:	Lu Z, Kao HY, Wei CH, Huang M, Liu J, Kuo CJ, Hsu CN, Tsai RTH, Dai HJ, Okazaki N, Cho HC, Gerner M, Solt I, Agarwal S, Liu F, Vishnyakova D, Ruch P, Romacker M, Rinaldi F, Bhattacharya S, Srinivasan P, Liu H, Torii M, Matos S, Campos D, Verspoor K, Livingston KM, Wilbur WJ. The gene normalization task in BioCreative III. BMC Bioinformatics 2011;12 Suppl 8:S2. [PMID: 22151901 PMCID: PMC3269937 DOI: 10.1186/1471-2105-12-s8-s2] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Number

Cited by Other Article(s)

From Machine Learning to Patient Outcomes: A Comprehensive Review of AI in Pancreatic Cancer. Diagnostics (Basel) 2024;14:174. [PMID: 38248051 PMCID: PMC10814554 DOI: 10.3390/diagnostics14020174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2023] [Revised: 12/28/2023] [Accepted: 12/29/2023] [Indexed: 01/23/2024] Open

An analysis of entity normalization evaluation biases in specialized domains. BMC Bioinformatics 2023;24:227. [PMID: 37268890 DOI: 10.1186/s12859-023-05350-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2022] [Accepted: 05/24/2023] [Indexed: 06/04/2023] Open

Assigning species information to corresponding genes by a sequence labeling framework. Database (Oxford) 2022;2022:6760187. [PMID: 36227127 PMCID: PMC9558450 DOI: 10.1093/database/baac090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2022] [Revised: 08/26/2022] [Accepted: 10/11/2022] [Indexed: 01/24/2023]

Overview of the COVID-19 text mining tool interactive demonstration track in BioCreative VII. Database (Oxford) 2022;2022:6748864. [PMID: 36197453 PMCID: PMC9534061 DOI: 10.1093/database/baac084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2022] [Revised: 08/18/2022] [Accepted: 09/08/2022] [Indexed: 11/06/2022]

Abstract

The coronavirus disease 2019 (COVID-19) pandemic has compelled biomedical researchers to communicate data in real time to establish more effective medical treatments and public health policies. Nontraditional sources such as preprint publications, i.e. articles not yet validated by peer review, have become crucial hubs for the dissemination of scientific results. Natural language processing (NLP) systems have been recently developed to extract and organize COVID-19 data in reasoning systems. Given this scenario, the BioCreative COVID-19 text mining tool interactive demonstration track was created to assess the landscape of the available tools and to gauge user interest, thereby providing a two-way communication channel between NLP system developers and potential end users. The goal was to inform system designers about the performance and usability of their products and to suggest new additional features. Considering the exploratory nature of this track, the call for participation solicited teams to apply for the track, based on their system's ability to perform COVID-19-related tasks and interest in receiving user feedback. We also recruited volunteer users to test systems. Seven teams registered systems for the track, and >30 individuals volunteered as test users; these volunteer users covered a broad range of specialties, including bench scientists, bioinformaticians and biocurators. The users, who had the option to participate anonymously, were provided with written and video documentation to familiarize themselves with the NLP tools and completed a survey to record their evaluation. Additional feedback was also provided by NLP system developers. The track was well received as shown by the overall positive feedback from the participating teams and the users. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-4/.

Collapse

Artificial Intelligence in Action: Addressing the COVID-19 Pandemic with Natural Language Processing. Annu Rev Biomed Data Sci 2021;4:313-339. [PMID: 34465169 DOI: 10.1146/annurev-biodatasci-021821-061045] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]

NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition. J Biomed Inform 2021;118:103779. [PMID: 33839304 PMCID: PMC11037554 DOI: 10.1016/j.jbi.2021.103779] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2021] [Revised: 03/14/2021] [Accepted: 04/05/2021] [Indexed: 10/21/2022]

ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents. Front Res Metr Anal 2021;6:654438. [PMID: 33870071 PMCID: PMC8028406 DOI: 10.3389/frma.2021.654438] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2021] [Accepted: 02/24/2021] [Indexed: 11/21/2022] Open

The 2019 National Natural language processing (NLP) Clinical Challenges (n2c2)/Open Health NLP (OHNLP) shared task on clinical concept normalization for clinical records. J Am Med Inform Assoc 2020;27:1529-1537. [PMID: 32968800 PMCID: PMC7647359 DOI: 10.1093/jamia/ocaa106] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2020] [Revised: 05/01/2020] [Accepted: 05/14/2020] [Indexed: 01/19/2023] Open

Biomedical named entity recognition and linking datasets: survey and our recent development. Brief Bioinform 2020;21:2219-2238. [DOI: 10.1093/bib/bbaa054] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2020] [Revised: 02/29/2020] [Accepted: 03/31/2020] [Indexed: 11/14/2022] Open

Abstract AbstractNatural language processing (NLP) is widely applied in biological domains to retrieve information from publications. Systems to address numerous applications exist, such as biomedical named entity recognition (BNER), named entity normalization (NEN) and protein–protein interaction extraction (PPIE). High-quality datasets can assist the development of robust and reliable systems; however, due to the endless applications and evolving techniques, the annotations of benchmark datasets may become outdated and inappropriate. In this study, we first review commonlyused BNER datasets and their potential annotation problems such as inconsistency and low portability. Then, we introduce a revised version of the JNLPBA dataset that solves potential problems in the original and use state-of-the-art named entity recognition systems to evaluate its portability to different kinds of biomedical literature, including protein–protein interaction and biology events. Lastly, we introduce an ensembled biomedical entity dataset (EBED) by extending the revised JNLPBA dataset with PubMed Central full-text paragraphs, figure captions and patent abstracts. This EBED is a multi-task dataset that covers annotations including gene, disease and chemical entities. In total, it contains 85000 entity mentions, 25000 entity mentions with database identifiers and 5000 attribute tags. To demonstrate the usage of the EBED, we review the BNER track from the AI CUP Biomedical Paper Analysis challenge. Availability: The revised JNLPBA dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/Re vised_JNLPBA.zip. The EBED dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/AICUP _EBED_dataset.rar. Contact: Email: thtsai@g.ncu.edu.tw, Tel. 886-3-4227151 ext. 35203, Fax: 886-3-422-2681 Email: hsu@iis.sinica.edu.tw, Tel. 886-2-2788-3799 ext. 2211, Fax: 886-2-2782-4814 Supplementary information: Supplementary data are available at Briefings in Bioinformatics online. Collapse

Building a PubMed knowledge graph. Sci Data 2020;7:205. [PMID: 32591513 PMCID: PMC7320186 DOI: 10.1038/s41597-020-0543-2] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2019] [Accepted: 05/26/2020] [Indexed: 01/08/2023] Open

PPR-SSM: personalized PageRank and semantic similarity measures for entity linking. BMC Bioinformatics 2019;20:534. [PMID: 31664891 PMCID: PMC6819326 DOI: 10.1186/s12859-019-3157-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2019] [Accepted: 10/14/2019] [Indexed: 11/21/2022] Open

Abstract

BACKGROUND

Biomedical literature concerns a wide range of concepts, requiring controlled vocabularies to maintain a consistent terminology across different research groups. However, as new concepts are introduced, biomedical literature is prone to ambiguity, specifically in fields that are advancing more rapidly, for example, drug design and development. Entity linking is a text mining task that aims at linking entities mentioned in the literature to concepts in a knowledge base. For example, entity linking can help finding all documents that mention the same concept and improve relation extraction methods. Existing approaches focus on the local similarity of each entity and the global coherence of all entities in a document, but do not take into account the semantics of the domain.

RESULTS

We propose a method, PPR-SSM, to link entities found in documents to concepts from domain-specific ontologies. Our method is based on Personalized PageRank (PPR), using the relations of the ontology to generate a graph of candidate concepts for the mentioned entities. We demonstrate how the knowledge encoded in a domain-specific ontology can be used to calculate the coherence of a set of candidate concepts, improving the accuracy of entity linking. Furthermore, we explore weighting the edges between candidate concepts using semantic similarity measures (SSM). We show how PPR-SSM can be used to effectively link named entities to biomedical ontologies, namely chemical compounds, phenotypes, and gene-product localization and processes.

CONCLUSIONS

We demonstrated that PPR-SSM outperforms state-of-the-art entity linking methods in four distinct gold standards, by taking advantage of the semantic information contained in ontologies. Moreover, PPR-SSM is a graph-based method that does not require training data. Our method improved the entity linking accuracy of chemical compounds by 0.1385 when compared to a method that does not use SSMs.

Collapse

LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res 2019;46:W530-W536. [PMID: 29762787 PMCID: PMC6030971 DOI: 10.1093/nar/gky355] [Citation(s) in RCA: 66] [Impact Index Per Article: 13.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2018] [Accepted: 05/08/2018] [Indexed: 01/10/2023] Open

Linking entities through an ontology using word embeddings and syntactic re-ranking. BMC Bioinformatics 2019;20:156. [PMID: 30917789 PMCID: PMC6437991 DOI: 10.1186/s12859-019-2678-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2018] [Accepted: 02/13/2019] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Although there is an enormous number of textual resources in the biomedical domain, currently, manually curated resources cover only a small part of the existing knowledge. The vast majority of these information is in unstructured form which contain nonstandard naming conventions. The task of named entity recognition, which is the identification of entity names from text, is not adequate without a standardization step. Linking each identified entity mention in text to an ontology/dictionary concept is an essential task to make sense of the identified entities. This paper presents an unsupervised approach for the linking of named entities to concepts in an ontology/dictionary. We propose an approach for the normalization of biomedical entities through an ontology/dictionary by using word embeddings to represent semantic spaces, and a syntactic parser to give higher weight to the most informative word in the named entity mentions.

RESULTS

We applied the proposed method to two different normalization tasks: the normalization of bacteria biotope entities through the Onto-Biotope ontology and the normalization of adverse drug reaction entities through the Medical Dictionary for Regulatory Activities (MedDRA). The proposed method achieved a precision score of 65.9%, which is 2.9 percentage points above the state-of-the-art result on the BioNLP Shared Task 2016 Bacteria Biotope test data and a macro-averaged precision score of 68.7% on the Text Analysis Conference 2017 Adverse Drug Reaction test data.

CONCLUSIONS

The core contribution of this paper is a syntax-based way of combining the individual word vectors to form vectors for the named entity mentions and ontology concepts, which can then be used to measure the similarity between them. The proposed approach is unsupervised and does not require labeled data, making it easily applicable to different domains.

Collapse

CRFVoter: gene and protein related object recognition using a conglomerate of CRF-based tools. J Cheminform 2019;11:21. [PMID: 30874918 PMCID: PMC6419804 DOI: 10.1186/s13321-019-0343-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2018] [Accepted: 03/01/2019] [Indexed: 11/10/2022] Open

LSTMVoter: chemical named entity recognition using a conglomerate of sequence labeling tools. J Cheminform 2019;11:3. [PMID: 30631966 PMCID: PMC6689880 DOI: 10.1186/s13321-018-0327-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2018] [Accepted: 12/27/2018] [Indexed: 11/10/2022] Open

Identification of conclusive association entities in biomedical articles. J Biomed Semantics 2019;10:1. [PMID: 30616688 PMCID: PMC6322258 DOI: 10.1186/s13326-018-0194-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2018] [Accepted: 12/20/2018] [Indexed: 11/23/2022] Open

Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems. Database (Oxford) 2018;2018:5255130. [PMID: 30576485 PMCID: PMC6301375 DOI: 10.1093/database/bay110] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2018] [Revised: 08/22/2018] [Accepted: 09/24/2018] [Indexed: 11/12/2022]

Abstract

Natural language descriptions of organismal phenotypes, a principal object of study in biology, are abundant in the biological literature. Expressing these phenotypes as logical statements using ontologies would enable large-scale analysis on phenotypic information from diverse systems. However, considerable human effort is required to make these phenotype descriptions amenable to machine reasoning. Natural language processing tools have been developed to facilitate this task, and the training and evaluation of these tools depend on the availability of high quality, manually annotated gold standard data sets. We describe the development of an expert-curated gold standard data set of annotated phenotypes for evolutionary biology. The gold standard was developed for the curation of complex comparative phenotypes for the Phenoscape project. It was created by consensus among three curators and consists of entity-quality expressions of varying complexity. We use the gold standard to evaluate annotations created by human curators and those generated by the Semantic CharaParser tool. Using four annotation accuracy metrics that can account for any level of relationship between terms from two phenotype annotations, we found that machine-human consistency, or similarity, was significantly lower than inter-curator (human-human) consistency. Surprisingly, allowing curatorsaccess to external information did not significantly increase the similarity of their annotations to the gold standard or have a significant effect on inter-curator consistency. We found that the similarity of machine annotations to the gold standard increased after new relevant ontology terms had been added. Evaluation by the original authors of the character descriptions indicated that the gold standard annotations came closer to representing their intended meaning than did either the curator or machine annotations. These findings point toward ways to better design software to augment human curators and the use of the gold standard corpus will allow training and assessment of new tools to improve phenotype annotation accuracy at scale.

Collapse

BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017;2017:3053439. [PMID: 28365720 PMCID: PMC5467463 DOI: 10.1093/database/baw156] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/14/2016] [Accepted: 11/07/2016] [Indexed: 12/22/2022]

Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017;117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023]

Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database (Oxford) 2016;2016:baw161. [PMID: 28025348 PMCID: PMC5199160 DOI: 10.1093/database/baw161] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2016] [Revised: 11/10/2016] [Accepted: 11/11/2016] [Indexed: 12/24/2022]

Metabolic Pathway Mining. Methods Mol Biol 2016. [PMID: 27896740 DOI: 10.1007/978-1-4939-6613-4_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register]

Mapping Phenotypic Information in Heterogeneous Textual Sources to a Domain-Specific Terminological Resource. PLoS One 2016;11:e0162287. [PMID: 27643689 PMCID: PMC5028053 DOI: 10.1371/journal.pone.0162287] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2016] [Accepted: 08/19/2016] [Indexed: 02/02/2023] Open

Abstract

Biomedical literature articles and narrative content from Electronic Health Records (EHRs) both constitute rich sources of disease-phenotype information. Phenotype concepts may be mentioned in text in multiple ways, using phrases with a variety of structures. This variability stems partly from the different backgrounds of the authors, but also from the different writing styles typically used in each text type. Since EHR narrative reports and literature articles contain different but complementary types of valuable information, combining details from each text type can help to uncover new disease-phenotype associations. However, the alternative ways in which the same concept may be mentioned in each source constitutes a barrier to the automatic integration of information. Accordingly, identification of the unique concepts represented by phrases in text can help to bridge the gap between text types. We describe our development of a novel method, PhenoNorm, which integrates a number of different similarity measures to allow automatic linking of phenotype concept mentions to known concepts in the UMLS Metathesaurus, a biomedical terminological resource. PhenoNorm was developed using the PhenoCHF corpus—a collection of literature articles and narratives in EHRs, annotated for phenotypic information relating to congestive heart failure (CHF). We evaluate the performance of PhenoNorm in linking CHF-related phenotype mentions to Metathesaurus concepts, using a newly enriched version of PhenoCHF, in which each phenotype mention has an expert-verified link to a concept in the UMLS Metathesaurus. We show that PhenoNorm outperforms a number of alternative methods applied to the same task. Furthermore, we demonstrate PhenoNorm’s wider utility, by evaluating its ability to link mentions of various other types of medically-related information, occurring in texts covering wider subject areas, to concepts in different terminological resources. We show that PhenoNorm can maintain performance levels, and that its accuracy compares favourably to other methods applied to these tasks.

Collapse

Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016. CEUR WORKSHOP PROCEEDINGS 2016;1609:28-42. [PMID: 29308065 PMCID: PMC5756095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]

BioCreative V track 4: a shared task for the extraction of causal network information using the Biological Expression Language. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016;2016:baw067. [PMID: 27402677 PMCID: PMC4940434 DOI: 10.1093/database/baw067] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/24/2015] [Accepted: 04/11/2016] [Indexed: 12/27/2022]

Mining chemical patents with an ensemble of open systems. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016;2016:baw065. [PMID: 27173521 PMCID: PMC4865327 DOI: 10.1093/database/baw065] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/10/2015] [Accepted: 04/11/2016] [Indexed: 11/30/2022]

EXTRACT: interactive extraction of environment metadata and term suggestion for metagenomic sample annotation. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016;2016:baw005. [PMID: 26896844 PMCID: PMC4761108 DOI: 10.1093/database/baw005] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/02/2015] [Accepted: 01/11/2016] [Indexed: 12/11/2022]

NOBLE - Flexible concept recognition for large-scale biomedical natural language processing. BMC Bioinformatics 2016;17:32. [PMID: 26763894 PMCID: PMC4712516 DOI: 10.1186/s12859-015-0871-y] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2015] [Accepted: 12/22/2015] [Indexed: 11/24/2022] Open

Abstract

Background

Natural language processing (NLP) applications are increasingly important in biomedical data analysis, knowledge engineering, and decision support. Concept recognition is an important component task for NLP pipelines, and can be either general-purpose or domain-specific. We describe a novel, flexible, and general-purpose concept recognition component for NLP pipelines, and compare its speed and accuracy against five commonly used alternatives on both a biological and clinical corpus.

NOBLE Coder implements a general algorithm for matching terms to concepts from an arbitrary vocabulary set. The system’s matching options can be configured individually or in combination to yield specific system behavior for a variety of NLP tasks. The software is open source, freely available, and easily integrated into UIMA or GATE. We benchmarked speed and accuracy of the system against the CRAFT and ShARe corpora as reference standards and compared it to MMTx, MGrep, Concept Mapper, cTAKES Dictionary Lookup Annotator, and cTAKES Fast Dictionary Lookup Annotator.

Results

We describe key advantages of the NOBLE Coder system and associated tools, including its greedy algorithm, configurable matching strategies, and multiple terminology input formats. These features provide unique functionality when compared with existing alternatives, including state-of-the-art systems. On two benchmarking tasks, NOBLE’s performance exceeded commonly used alternatives, performing almost as well as the most advanced systems. Error analysis revealed differences in error profiles among systems.

Conclusion

NOBLE Coder is comparable to other widely used concept recognition systems in terms of accuracy and speed. Advantages of NOBLE Coder include its interactive terminology builder tool, ease of configuration, and adaptability to various domains and tasks. NOBLE provides a term-to-concept matching system suitable for general concept recognition in biomedical NLP pipelines.

Collapse

Concept Recognition in French Biomedical Text Using Automatic Translation. LECTURE NOTES IN COMPUTER SCIENCE 2016. [DOI: 10.1007/978-3-319-44564-9_13] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/09/2023]

GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains. BIOMED RESEARCH INTERNATIONAL 2015;2015:918710. [PMID: 26380306 PMCID: PMC4561873 DOI: 10.1155/2015/918710] [Citation(s) in RCA: 110] [Impact Index Per Article: 12.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/15/2015] [Revised: 04/03/2015] [Accepted: 04/04/2015] [Indexed: 02/01/2023]

pGenN, a gene normalization tool for plant genes and proteins in scientific literature. PLoS One 2015;10:e0135305. [PMID: 26258475 PMCID: PMC4530884 DOI: 10.1371/journal.pone.0135305] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2015] [Accepted: 07/20/2015] [Indexed: 11/18/2022] Open

SimConcept: a hybrid approach for simplifying composite named entities in biomedical text. IEEE J Biomed Health Inform 2015;19:1385-91. [PMID: 25879978 PMCID: PMC4543296 DOI: 10.1109/jbhi.2015.2422651] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]

A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC. J Am Med Inform Assoc 2015;22:948-56. [PMID: 25948699 PMCID: PMC4986661 DOI: 10.1093/jamia/ocv037] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2014] [Accepted: 03/29/2015] [Indexed: 12/01/2022] Open

Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2015;17:132-44. [PMID: 25935162 DOI: 10.1093/bib/bbv024] [Citation(s) in RCA: 97] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2015] [Indexed: 11/13/2022] Open

Crowdsourcing in biomedicine: challenges and opportunities. Brief Bioinform 2015;17:23-32. [PMID: 25888696 DOI: 10.1093/bib/bbv021] [Citation(s) in RCA: 71] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open

Scaling drug indication curation through crowdsourcing. Database (Oxford) 2015;2015:bav016. [PMID: 25797061 PMCID: PMC4369375 DOI: 10.1093/database/bav016] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2014] [Revised: 02/04/2015] [Accepted: 02/09/2015] [Indexed: 01/24/2023]

A document processing pipeline for annotating chemical entities in scientific documents. J Cheminform 2015;7:S7. [PMID: 25810778 PMCID: PMC4331697 DOI: 10.1186/1758-2946-7-s1-s7] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open

tmChem: a high performance approach for chemical named entity recognition and normalization. J Cheminform 2015;7:S3. [PMID: 25810774 PMCID: PMC4331693 DOI: 10.1186/1758-2946-7-s1-s3] [Citation(s) in RCA: 126] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open

Computer-assisted curation of a human regulatory core network from the biological literature. Bioinformatics 2014;31:1258-66. [DOI: 10.1093/bioinformatics/btu795] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2014] [Accepted: 11/26/2014] [Indexed: 12/20/2022] Open

Quantifying the impact and extent of undocumented biomedical synonymy. PLoS Comput Biol 2014;10:e1003799. [PMID: 25255227 PMCID: PMC4177665 DOI: 10.1371/journal.pcbi.1003799] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2013] [Accepted: 06/26/2014] [Indexed: 12/14/2022] Open

Abstract

Synonymous relationships among biomedical terms are extensively annotated within specialized terminologies, implying that synonymy is important for practical computational applications within this field. It remains unclear, however, whether text mining actually benefits from documented synonymy and whether existing biomedical thesauri provide adequate coverage of these linguistic relationships. In this study, we examine the impact and extent of undocumented synonymy within a very large compendium of biomedical thesauri. First, we demonstrate that missing synonymy has a significant negative impact on named entity normalization, an important problem within the field of biomedical text mining. To estimate the amount synonymy currently missing from thesauri, we develop a probabilistic model for the construction of synonym terminologies that is capable of handling a wide range of potential biases, and we evaluate its performance using the broader domain of near-synonymy among general English words. Our model predicts that over 90% of these relationships are currently undocumented, a result that we support experimentally through “crowd-sourcing.” Finally, we apply our model to biomedical terminologies and predict that they are missing the vast majority (>90%) of the synonymous relationships they intend to document. Overall, our results expose the dramatic incompleteness of current biomedical thesauri and suggest the need for “next-generation,” high-coverage lexical terminologies.

Automated systems that extract and integrate information from the research literature have become common in biomedicine. As the same meaning can be expressed in many distinct but synonymous ways, access to comprehensive thesauri may enable such systems to maximize their performance. Here, we establish the importance of synonymy for a specific text-mining task (named-entity normalization), and we suggest that current thesauri may be woefully inadequate in their documentation of this linguistic phenomenon. To test this claim, we develop a model for estimating the amount of missing synonymy. We apply our model to both biomedical terminologies and general-English thesauri, predicting massive amounts of missing synonymy for both lexicons. Furthermore, we verify some of our predictions for the latter domain through “crowd-sourcing.” Overall, our work highlights the dramatic incompleteness of current biomedical thesauri, and to mitigate this issue, we propose the creation of “living” terminologies, which would automatically harvest undocumented synonymy and help smart machines enrich biomedicine.

Collapse

Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014;2014:bau094. [PMID: 25246425 PMCID: PMC4170591 DOI: 10.1093/database/bau094] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]

Overview of the gene ontology task at BioCreative IV. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014;2014:bau086. [PMID: 25157073 PMCID: PMC4142793 DOI: 10.1093/database/bau086] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]

Abstract

Gene Ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered as one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With the support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade while much progress is still needed for computer-assisted GO curation. Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation.

Database URL:http://www.biocreative.org/tasks/biocreative-iv/track-4-GO/.

Collapse

De-identification of clinical notes in French: towards a protocol for reference corpus development. J Biomed Inform 2014;50:151-61. [DOI: 10.1016/j.jbi.2013.12.014] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2013] [Revised: 12/13/2013] [Accepted: 12/22/2013] [Indexed: 11/30/2022]

tmBioC: improving interoperability of text-mining tools with BioC. Database (Oxford) 2014;2014:bau073. [PMID: 25062914 PMCID: PMC4110697 DOI: 10.1093/database/bau073] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2014] [Revised: 06/30/2014] [Accepted: 07/01/2014] [Indexed: 02/05/2023]

Abstract

The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial efforts and time owing to heterogeneity and variety in data formats. In response, BioC is a recent proposal that offers a minimalistic approach to tool interoperability by stipulating minimal changes to existing tools and applications. BioC is a family of XML formats that define how to present text documents and annotations, and also provides easy-to-use functions to read/write documents in the BioC format. In this study, we introduce our text-mining toolkit, which is designed to perform several challenging and significant tasks in the biomedical domain, and repackage the toolkit into BioC to enhance its interoperability. Our toolkit consists of six state-of-the-art tools for named-entity recognition, normalization and annotation (PubTator) of genes (GenNorm), diseases (DNorm), mutations (tmVar), species (SR4GN) and chemicals (tmChem). Although developed within the same group, each tool is designed to process input articles and output annotations in a different format. We modify these tools and enable them to read/write data in the proposed BioC format. We find that, using the BioC family of formats and functions, only minimal changes were required to build the newer versions of the tools. The resulting BioC wrapped toolkit, which we have named tmBioC, consists of our tools in BioC, an annotated full-text corpus in BioC, and a format detection and conversion tool. Furthermore, through participation in the 2013 BioCreative IV Interoperability Track, we empirically demonstrate that the tools in tmBioC can be more efficiently integrated with each other as well as with external tools: Our experimental results show that using BioC reduces >60% in lines of code for text-mining tool integration. The tmBioC toolkit is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/.

Collapse

Micropublications: a semantic model for claims, evidence, arguments and annotations in biomedical communications. J Biomed Semantics 2014;5:28. [PMID: 26261718 PMCID: PMC4530550 DOI: 10.1186/2041-1480-5-28] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2013] [Accepted: 06/16/2014] [Indexed: 11/10/2022] Open

Abstract

Background

Scientific publications are documentary representations of defeasible arguments, supported by data and repeatable methods. They are the essential mediating artifacts in the ecosystem of scientific communications. The institutional “goal” of science is publishing results. The linear document publication format, dating from 1665, has survived transition to the Web.

Intractable publication volumes; the difficulty of verifying evidence; and observed problems in evidence and citation chains suggest a need for a web-friendly and machine-tractable model of scientific publications. This model should support: digital summarization, evidence examination, challenge, verification and remix, and incremental adoption. Such a model must be capable of expressing a broad spectrum of representational complexity, ranging from minimal to maximal forms.

Results

The micropublications semantic model of scientific argument and evidence provides these features. Micropublications support natural language statements; data; methods and materials specifications; discussion and commentary; challenge and disagreement; as well as allowing many kinds of statement formalization.

The minimal form of a micropublication is a statement with its attribution. The maximal form is a statement with its complete supporting argument, consisting of all relevant evidence, interpretations, discussion and challenges brought forward in support of or opposition to it. Micropublications may be formalized and serialized in multiple ways, including in RDF. They may be added to publications as stand-off metadata.

An OWL 2 vocabulary for micropublications is available at http://purl.org/mp. A discussion of this vocabulary along with RDF examples from the case studies, appears as OWL Vocabulary and RDF Examples in Additional file 1.

Conclusion

Micropublications, because they model evidence and allow qualified, nuanced assertions, can play essential roles in the scientific communications ecosystem in places where simpler, formalized and purely statement-based models, such as the nanopublications model, will not be sufficient. At the same time they will add significant value to, and are intentionally compatible with, statement-based formalizations.

We suggest that micropublications, generated by useful software tools supporting such activities as writing, editing, reviewing, and discussion, will be of great value in improving the quality and tractability of biomedical communications.

Collapse

BioCreative-IV virtual issue. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014;2014:bau039. [PMID: 24852177 PMCID: PMC4030502 DOI: 10.1093/database/bau039] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]

A literature search tool for intelligent extraction of disease-associated genes. J Am Med Inform Assoc 2014;21:399-405. [PMID: 23999671 PMCID: PMC3994846 DOI: 10.1136/amiajnl-2012-001563] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2012] [Revised: 07/15/2013] [Accepted: 08/08/2013] [Indexed: 12/27/2022] Open

Adapting a natural language processing tool to facilitate clinical trial curation for personalized cancer therapy. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2014;2014:126-31. [PMID: 25717412 PMCID: PMC4333699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]

tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014;2014:bau033. [PMID: 24715220 PMCID: PMC3978375 DOI: 10.1093/database/bau033] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]

Mining images in biomedical publications: Detection and analysis of gel diagrams. J Biomed Semantics 2014;5:10. [PMID: 24568573 PMCID: PMC4190668 DOI: 10.1186/2041-1480-5-10] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2013] [Accepted: 02/05/2014] [Indexed: 11/10/2022] Open

NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform 2014;47:1-10. [PMID: 24393765 DOI: 10.1016/j.jbi.2013.12.006] [Citation(s) in RCA: 214] [Impact Index Per Article: 21.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2013] [Revised: 11/06/2013] [Accepted: 12/07/2013] [Indexed: 10/25/2022]

Abstract

Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information, however, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency. The public release of the NCBI disease corpus contains 6892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. In order to help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. To demonstrate its utility, we conducted a benchmarking experiment where we compared three different knowledge-based disease normalization methods with a best performance in F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state-of-the-art in disease name recognition and normalization research, by providing a high-quality gold standard thus enabling the development of machine-learning based approaches for such tasks. The NCBI disease corpus, guidelines and other associated resources are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/.

Collapse