1
Ding L, Colavizza G, Zhang Z. Partial Annotation Learning for Biomedical Entity Recognition. IEEE J Biomed Health Inform 2025; 29:1409-1418. [PMID: 39312441 DOI: 10.1109/jbhi.2024.3466294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/25/2024]
Abstract
Named Entity Recognition (NER) is a key task to support biomedical research. In Biomedical Named Entity Recognition (BioNER), obtaining high-quality expert-annotated data is laborious and expensive, leading to the development of automatic approaches such as distant supervision. However, manually and automatically generated data often suffer from the unlabeled entity problem, whereby many entity annotations are missing, degrading the performance of full annotation NER models. To address this issue, we systematically explore the efficacy of partial annotation learning methods for BioNER, with a comprehensive evaluation across a range of simulated scenarios of missing entity annotations. Furthermore, we propose a TS-PubMedBERT-Partial-CRF partial annotation learning model. We standardize a compilation of 16 BioNER corpora, encompassing five distinct entity types, to establish a gold standard. We compare against the state-of-the-art partial annotation model EER-PubMedBERT, the widely acknowledged partial annotation model BiLSTM-Partial-CRF, and the state-of-the-art full annotation learning BioNER model, the PubMedBERT tagger. Results show that partial annotation learning-based methods can effectively learn from biomedical corpora with missing entity annotations. Our proposed model outperforms alternatives and, specifically, the PubMedBERT tagger by 38% in F1-score under high missing entity rates. Moreover, the recall of entity mentions in our model is competitive with the upper bound observed on the fully annotated dataset.
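The simulated missing-annotation scenarios mentioned in the abstract can be pictured with a small sketch. The abstract does not say how annotations were removed, so the following is only an assumption: each entity mention in a BIO-tagged sentence is dropped independently with a fixed probability. The function name `drop_entities`, the `missing_rate` parameter, and the example tags are hypothetical and not taken from the paper.

```python
import random


def drop_entities(tags, missing_rate, seed=0):
    """Simulate the unlabeled entity problem on one BIO-tagged sentence.

    Each entity mention (a "B-" tag plus the "I-" tags that follow it) is
    relabeled as "O" with probability `missing_rate`. This is an assumed
    simulation scheme, not the procedure used in the cited paper.
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the simulation reproducible
    out = list(tags)
    i = 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            # Find the end of this mention.
            j = i + 1
            while j < len(tags) and tags[j].startswith("I-"):
                j += 1
            # Drop the whole mention, never just part of it.
            if rng.random() < missing_rate:
                out[i:j] = ["O"] * (j - i)
            i = j
        else:
            i += 1
    return out


# Hypothetical sentence with one two-token disease mention and one gene mention.
example = ["O", "B-Disease", "I-Disease", "O", "B-Gene", "O"]
partially_annotated = drop_entities(example, missing_rate=0.5, seed=1)
```

Dropping whole mentions rather than single tokens keeps the remaining annotations well-formed, which is the situation partial annotation learning is meant to handle: every surviving label is trusted, while an "O" may hide an unlabeled entity.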
2
Binkheder S, Wu HY, Quinney SK, Zhang S, Zitu MM, Chiang CW, Wang L, Jones J, Li L. PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature. J Biomed Semantics 2022; 13:17. [PMID: 35690873 PMCID: PMC9188713 DOI: 10.1186/s13326-022-00272-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/07/2019] [Accepted: 05/18/2022] [Indexed: 12/28/2022]
Abstract
BACKGROUND Adverse events induced by drug-drug interactions are a major concern in the United States. Current research is moving toward using electronic health record (EHR) data, including for adverse drug event discovery. One of the first steps in EHR-based studies is to define a phenotype for establishing a cohort of patients. However, phenotype definitions are not readily available for all phenotypes. One of the first steps in developing automated text mining tools is building a corpus. Therefore, this study aimed to develop annotation guidelines and a gold standard corpus to facilitate building future automated approaches for mining phenotype definitions contained in the literature. Furthermore, our aim is to improve the understanding of how these published phenotype definitions are presented in the literature and how we annotate them for future text mining tasks. RESULTS Two annotators manually annotated the corpus at the sentence level for the presence of evidence for phenotype definitions. Three major categories (inclusion, intermediate, and exclusion) with a total of ten dimensions were proposed, characterizing major contextual patterns and cues for presenting phenotype definitions in published literature. The developed annotation guidelines were used to annotate the corpus, which contained 3971 sentences: 1923 out of 3971 (48.4%) for the inclusion category, 1851 out of 3971 (46.6%) for the intermediate category, and 2273 out of 3971 (57.2%) for the exclusion category. The highest number of annotated sentences was 1449 out of 3971 (36.5%) for the "Biomedical & Procedure" dimension. The lowest number of annotated sentences was 49 out of 3971 (1.2%) for "The use of NLP". The overall percent inter-annotator agreement was 97.8%. Percent and Kappa statistics also showed high inter-annotator agreement across all dimensions.
CONCLUSIONS The corpus and annotation guidelines can serve as a foundational informatics approach for annotating and mining phenotype definitions in literature, and can be used later for text mining applications.
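The abstract reports percent agreement and Kappa statistics without giving formulas. As a hedged illustration, the sketch below assumes the standard two-annotator Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance; the source does not state which kappa variant was used. The function names and the toy labels are hypothetical and are not data from the corpus.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_o = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    p_e = sum(counts_a[lab] * counts_b[lab]
              for lab in set(counts_a) | set(counts_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy sentence-level labels (1 = sentence carries phenotype-definition evidence).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]
agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
```

Reporting both numbers is useful because percent agreement can look high when one label dominates, whereas kappa discounts the agreement that would occur by chance.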
Affiliation(s)
- Samar Binkheder
  - Department of Biohealth Informatics, Indiana University School of Informatics and Computing, Indianapolis, IN, USA
  - Medical Informatics Unit, Department of Medical Education, College of Medicine, King Saud University, Riyadh, Saudi Arabia
- Heng-Yi Wu
  - Development Science Informatics, Genentech, South San Francisco, CA, USA
- Sara K Quinney
  - Department of Obstetrics and Gynecology, Indiana University School of Medicine, Indianapolis, IN, USA
- Shijun Zhang
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
- Md Muntasir Zitu
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
- Chien-Wei Chiang
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
- Lei Wang
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
- Josette Jones
  - Department of Biohealth Informatics, Indiana University School of Informatics and Computing, Indianapolis, IN, USA
- Lang Li
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
  - 250 Lincoln Tower, 1800 Cannon Drive, Columbus, OH, 43210, USA
3
Giachelle F, Irrera O, Silvello G. MedTAG: a portable and customizable annotation tool for biomedical documents. BMC Med Inform Decis Mak 2021; 21:352. [PMID: 34922517 PMCID: PMC8684237 DOI: 10.1186/s12911-021-01706-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/27/2021] [Accepted: 12/01/2021] [Indexed: 01/08/2023]
Abstract
BACKGROUND Semantic annotators and Natural Language Processing (NLP) methods for Named Entity Recognition and Linking (NER+L) require plenty of training and test data, especially in the biomedical domain. Despite the abundance of unstructured biomedical data, the lack of richly annotated biomedical datasets hinders the further development of NER+L algorithms for any effective secondary use. In addition, manual annotation of biomedical documents performed by physicians and experts is a costly and time-consuming task. To support, organize and speed up the annotation process, we introduce MedTAG, a collaborative biomedical annotation tool that is open-source, platform-independent, and free to use/distribute. RESULTS We present the main features of MedTAG and how it has been employed in the histopathology domain by physicians and experts to annotate more than seven thousand clinical reports manually. We compare MedTAG with a set of well-established biomedical annotation tools, including BioQRator, ezTag, MyMiner, and tagtog, weighing their pros and cons against those of MedTAG. We highlight that MedTAG is one of the very few open-source tools provided with an open license and a straightforward installation procedure supporting cross-platform use. CONCLUSIONS MedTAG has been designed according to five requirements (i.e. available, distributable, installable, workable and schematic) defined in a recent extensive review of manual annotation tools. Moreover, MedTAG satisfies 20 out of 22 criteria specified in the same study.
Affiliation(s)
- Fabio Giachelle
  - Department of Information Engineering, University of Padua, Padua, Italy
- Ornella Irrera
  - Department of Information Engineering, University of Padua, Padua, Italy
- Gianmaria Silvello
  - Department of Information Engineering, University of Padua, Padua, Italy
4
Song B, Li F, Liu Y, Zeng X. Deep learning methods for biomedical named entity recognition: a survey and qualitative comparison. Brief Bioinform 2021; 22:6326536. [PMID: 34308472 DOI: 10.1093/bib/bbab282] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 04/27/2021] [Revised: 06/07/2021] [Accepted: 07/02/2021] [Indexed: 11/13/2022]
Abstract
The biomedical literature is growing rapidly, and the extraction of meaningful information from the large amount of literature is increasingly important. Biomedical named entity (BioNE) identification is one of the critical and fundamental tasks in biomedical text mining. Accurate identification of entities in the literature facilitates the performance of other tasks. Given that an end-to-end neural network can automatically extract features, several deep learning-based methods have been proposed for BioNE recognition (BioNER), yielding state-of-the-art performance. In this review, we comprehensively summarize deep learning-based methods for BioNER and datasets used in training and testing. The deep learning methods are classified into four categories: single neural network-based, multitask learning-based, transfer learning-based and hybrid model-based methods. They can be applied to BioNER in multiple domains, and the results are determined by the dataset size and type. Lastly, we discuss the future development and opportunities of BioNER methods.
Affiliation(s)
- Bosheng Song
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Fen Li
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Yuansheng Liu
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Xiangxiang Zeng
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
5
Becker TE, Jakobsson E. ResidueFinder: extracting individual residue mentions from protein literature. J Biomed Semantics 2021; 12:14. [PMID: 34289903 PMCID: PMC8293528 DOI: 10.1186/s13326-021-00243-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2019] [Accepted: 05/07/2021] [Indexed: 11/10/2022]
Abstract
Background The revolution in molecular biology has shown how protein function and structure are based on specific sequences of amino acids. Thus, an important feature in many papers is the mention of the significance of individual amino acids in the context of the entire sequence of the protein. MutationFinder is a widely used program for finding mentions of specific mutations in texts. We report on augmenting the positive attributes of MutationFinder with a more inclusive regular expression list to create ResidueFinder, which finds mentions of native amino acids as well as mutations. We also consider parameter options for both ResidueFinder and MutationFinder to explore trade-offs between precision, recall, and computational efficiency. We test our methods and software in full text as well as abstracts. Results We find there is much more variety of formats for mentioning residues in the entire text of papers than in abstracts alone. Failure to take these multiple formats into account results in many false negatives in the program. Since MutationFinder, like several other programs, was primarily tested on abstracts, we found it necessary to build an expanded regular expression list to achieve acceptable recall in full text searches. We also discovered a number of artifacts arising from PDF to text conversion, which we addressed by adding elements to the regular expression library. Taking those factors into account resulted in high recall on randomly selected primary research articles. We also developed a streamlined regular expression (called “cut”) which enables a several hundredfold speedup in both MutationFinder and ResidueFinder with only a modest compromise of recall. All regular expressions were tested using expanded F-measure statistics, i.e., we compute Fβ for various values of β, where the larger the value of β, the more recall is weighted, and the smaller the value of β, the more precision is weighted.
Conclusions ResidueFinder is a simple, effective, and efficient program for finding individual residue mentions in primary literature starting with text files, implemented in Python, and available in SourceForge.net. The most computationally efficient versions of ResidueFinder could enable creation and maintenance of a database of residue mentions encompassing all articles in PubMed. Supplementary Information The online version contains supplementary material available at 10.1186/s13326-021-00243-3.
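The Fβ weighting described in the Results can be made concrete with a short sketch. It assumes the standard definition Fβ = (1 + β²)·P·R / (β²·P + R), which matches the behaviour the abstract describes but is not quoted from the paper; the function name `f_beta` and the precision/recall values are hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more
    heavily, and beta == 1 gives the usual F1 score.
    """
    if precision < 0 or recall < 0 or beta <= 0:
        raise ValueError("precision and recall must be >= 0 and beta > 0")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta_sq) * precision * recall / denominator


# Two hypothetical extractors: one favours recall, the other precision.
recall_heavy = {"precision": 0.6, "recall": 0.9}
precision_heavy = {"precision": 0.9, "recall": 0.6}

for beta in (0.5, 1.0, 2.0):
    scores = (f_beta(beta=beta, **recall_heavy),
              f_beta(beta=beta, **precision_heavy))
    print(f"beta={beta}: recall-heavy={scores[0]:.3f}, "
          f"precision-heavy={scores[1]:.3f}")
```

With β = 2 the recall-heavy extractor scores higher, with β = 0.5 the precision-heavy one does, and at β = 1 the two tie, which is why sweeping β exposes the precision/recall trade-off between regular expression variants.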
Affiliation(s)
- Ton E Becker
  - Department of Molecular and Integrative Physiology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA
- Eric Jakobsson
  - Department of Molecular and Integrative Physiology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA
  - Department of Biochemistry, Program in Biophysics and Computational Biology, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA
6
Lee K, Wei CH, Lu Z. Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. Brief Bioinform 2021; 22:bbaa142. [PMID: 32770181 PMCID: PMC8138883 DOI: 10.1093/bib/bbaa142] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/01/2020] [Revised: 06/07/2020] [Accepted: 06/25/2020] [Indexed: 12/28/2022]
Abstract
MOTIVATION To obtain key information for personalized medicine and cancer research, clinicians and researchers in the biomedical field need to search the biomedical literature for genomic variant information now more than ever before. Due to the various written forms of genomic variants, however, it is difficult to locate the right information in the literature when using a general literature search system. To address this difficulty, researchers have suggested various solutions based on automated literature-mining techniques. There is, however, no study that summarizes and compares existing tools for genomic variant literature mining in terms of how easily they let users search the literature for information on genomic variants. RESULTS In this article, we systematically compared currently available genomic variant recognition and normalization tools as well as the literature search engines that adopted these literature-mining techniques. First, we explain the problems that are caused by the use of non-standard formats of genomic variants in the PubMed literature by considering examples from the literature and show the prevalence of the problem. Second, we review literature-mining tools that address the problem by recognizing and normalizing the various forms of genomic variants in the literature and systematically compare them. Third, we present and compare existing literature search engines that are designed for genomic variant search using these literature-mining techniques. We expect this work to be helpful for researchers who seek information about genomic variants from the literature and for developers who integrate genomic variant information from the literature and beyond.
Affiliation(s)
- Kyubum Lee
  - National Center for Biotechnology Information
- Zhiyong Lu
  - National Center for Biotechnology Information
7
Kaushik V, Plazzer J, Macrae F. Evaluation of literature searching tools for curation of mismatch repair gene variants in hereditary colon cancer. ADVANCED GENETICS (HOBOKEN, N.J.) 2021; 2:e10039. [PMID: 36618447 PMCID: PMC9744508 DOI: 10.1002/ggn2.10039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/28/2020] [Revised: 01/12/2021] [Accepted: 01/14/2021] [Indexed: 01/11/2023]
Abstract
Pathogenic constitutional genomic variants in the mismatch repair (MMR) genes are the drivers of Lynch syndrome; optimal variant interpretation is required for the management of suspected and confirmed cases. The International Society for Hereditary Gastrointestinal Tumours (InSiGHT) provides expert classifications for MMR variants for the US National Human Genome Research Institute's (NHGRI) ClinGen initiative and interprets variants with discordant classifications and those of uncertain significance (VUSs). Given the onerous nature of extracting information related to variants, literature searching tools which harness artificial intelligence may aid in retrieving information to allow optimum variant classification. In this study, we described the nature of discordance in a sample of 80 variants from a list of variants requiring updating by InSiGHT for ClinGen by comparing their existing InSiGHT classifications with the various submissions for each variant on the US National Centre for Biotechnology Information's (NCBI) ClinVar database. To identify the potential value of a literature searching tool in extracting information related to classification, all variants were searched for using a traditional method (Google Scholar) and a literature searching tool (Mastermind) independently. Descriptive statistics were used to compare the number of articles before and after screening for relevance and the number of relevant articles unique to either method. Relevance was defined as containing the variant in question as well as data informing variant interpretation. A total of 916 articles were returned by both methods, and Mastermind averaged four relevant articles per search compared to Google Scholar's three. Of relevant Mastermind articles, 193/308 (62.7%) were unique to it, compared to 87/202 (43.0%) for Google Scholar. For 24 variants, either or both methods found no information.
All 6/80 (20%) variants with pathogenic or likely pathogenic InSiGHT classifications have newer VUS assertions on ClinVar. Our study demonstrated that for a sample of variants with varying discordant interpretations, Mastermind was able to return on average, a more relevant and unique literature search. Google Scholar was able to retrieve information that Mastermind did not, which supports a conclusion that Mastermind could play a complementary role in literature searching for classification. This work will aid InSiGHT in its role of classifying MMR variants.
Affiliation(s)
- Varun Kaushik
  - Melbourne Medical School, The University of Melbourne, Parkville, Victoria, Australia
  - Department of Colorectal Medicine and Genetics, The Royal Melbourne Hospital, Parkville, Victoria, Australia
- John‐Paul Plazzer
  - Department of Colorectal Medicine and Genetics, The Royal Melbourne Hospital, Parkville, Victoria, Australia
- Finlay Macrae
  - Department of Colorectal Medicine and Genetics, The Royal Melbourne Hospital, Parkville, Victoria, Australia
  - Department of Medicine, The Royal Melbourne Hospital, The University of Melbourne, Parkville, Victoria, Australia
8
Neves M, Ševa J. An extensive review of tools for manual annotation of documents. Brief Bioinform 2021; 22:146-163. [PMID: 31838514 PMCID: PMC7820865 DOI: 10.1093/bib/bbz130] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 10/05/2019] [Indexed: 12/16/2022]
Abstract
MOTIVATION Annotation tools are applied to build training and test corpora, which are essential for the development and evaluation of new natural language processing algorithms. Further, annotation tools are also used to extract new information for a particular use case. However, owing to the high number of existing annotation tools, finding the one that best fits particular needs is a demanding task that requires searching the scientific literature followed by installing and trying various tools. METHODS We searched for annotation tools and selected a subset of them according to five requirements with which they should comply, such as being Web-based or supporting the definition of a schema. We installed the selected tools (when necessary), carried out hands-on experiments and evaluated them using 26 criteria that covered functional and technical aspects. We defined each criterion on three levels of matches and a score for the final evaluation of the tools. RESULTS We evaluated 78 tools and selected the following 15 for a detailed evaluation: BioQRator, brat, Catma, Djangology, ezTag, FLAT, LightTag, MAT, MyMiner, PDFAnno, prodigy, tagtog, TextAE, WAT-SL and WebAnno. Full compliance with our 26 criteria ranged from only 9 up to 20 criteria, which demonstrated that some tools are comprehensive and mature enough to be used on most annotation projects. The highest score of 0.81 was obtained by WebAnno (of a maximum value of 1.0).
Affiliation(s)
- Mariana Neves
  - German Centre for the Protection of Laboratory Animals (BfR), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Jurica Ševa
  - German Centre for the Protection of Laboratory Animals (BfR), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
9
Jose JM, Yilmaz E, Magalhães J, Castells P, Ferro N, Silva MJ, Martins F, Akhondi SA, Cohn T, Baldwin T, Verspoor K. ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. ADVANCES IN INFORMATION RETRIEVAL 2020; 12036. [PMCID: PMC7148043 DOI: 10.1007/978-3-030-45442-5_74] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/23/2022]
Abstract
We introduce a new evaluation lab named ChEMU (Cheminformatics Elsevier Melbourne University), part of the 11th Conference and Labs of the Evaluation Forum (CLEF-2020). ChEMU involves two key information extraction tasks over chemical reactions from patents. Task 1—Named entity recognition—involves identifying chemical compounds as well as their types in context, i.e., to assign the label of a chemical compound according to the role which the compound plays within a chemical reaction. Task 2—Event extraction over chemical reactions—involves event trigger detection and argument recognition. We briefly present the motivations and goals of the ChEMU tasks, as well as resources and evaluation methodology.
10
Giorgi JM, Bader GD. Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics 2019; 34:4087-4094. [PMID: 29868832 PMCID: PMC6247938 DOI: 10.1093/bioinformatics/bty449] [Citation(s) in RCA: 62] [Impact Index Per Article: 10.3] [Received: 02/12/2018] [Accepted: 05/29/2018] [Indexed: 01/08/2023]
Abstract
Motivation The explosive increase of biomedical literature has made information extraction an increasingly important tool for biomedical research. A fundamental task is the recognition of biomedical named entities in text (BNER) such as genes/proteins, diseases and species. Recently, a domain-independent method based on deep learning and statistical word embeddings, called long short-term memory network-conditional random field (LSTM-CRF), has been shown to outperform state-of-the-art entity-specific BNER tools. However, this method is dependent on gold-standard corpora (GSCs) consisting of hand-labeled entities, which tend to be small but highly reliable. Silver-standard corpora (SSCs), which are generated by harmonizing the annotations made by several automatic annotation systems, are an alternative to GSCs. SSCs typically contain more noise than GSCs but have the advantage of containing many more training examples. Ideally, these corpora could be combined to achieve the benefits of both, which is an opportunity for transfer learning. In this work, we analyze to what extent transfer learning improves upon state-of-the-art results for BNER. Results We demonstrate that transferring a deep neural network (DNN) trained on a large, noisy SSC to a smaller, but more reliable GSC significantly improves upon state-of-the-art results for BNER. Compared to a state-of-the-art baseline evaluated on 23 GSCs covering four different entity classes, transfer learning results in an average reduction in error of approximately 11%. We found transfer learning to be especially beneficial for target datasets with a small number of labels (approximately 6000 or fewer). Availability and implementation Source code for the LSTM-CRF is available at https://github.com/Franck-Dernoncourt/NeuroNER/ and links to the corpora are available at https://github.com/BaderLab/Transfer-Learning-BNER-Bioinformatics-2018/. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- John M Giorgi
  - Department of Computer Science, University of Toronto, Toronto, Canada
  - The Donnelly Centre, University of Toronto, Toronto, Canada
- Gary D Bader
  - Department of Computer Science, University of Toronto, Toronto, Canada
  - The Donnelly Centre, University of Toronto, Toronto, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
11
Badal VD, Wright D, Katsis Y, Kim HC, Swafford AD, Knight R, Hsu CN. Challenges in the construction of knowledge bases for human microbiome-disease associations. MICROBIOME 2019; 7:129. [PMID: 31488215 PMCID: PMC6728997 DOI: 10.1186/s40168-019-0742-2] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 01/03/2019] [Accepted: 08/20/2019] [Indexed: 05/05/2023]
Abstract
The last few years have seen tremendous growth in human microbiome research, with a particular focus on the links to both mental and physical health and disease. Medical and experimental settings provide initial sources of information about these links, but individual studies produce disconnected pieces of knowledge bounded in context by the perspective of expert researchers reading full-text publications. Building a knowledge base (KB) consolidating these disconnected pieces is an essential first step to democratize and accelerate the process of accessing the collective discoveries of human disease connections to the human microbiome. In this article, we survey the existing tools and development efforts that have been produced to capture portions of the information needed to construct a KB of all known human microbiome-disease associations and highlight the need for additional innovations in natural language processing (NLP), text mining, taxonomic representations, and field-wide vocabulary standardization in human microbiome research. Addressing these challenges will enable the construction of KBs that help identify new insights amenable to experimental validation and potentially clinical decision support.
Affiliation(s)
- Varsha Dave Badal
  - Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
- Dustin Wright
  - Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
  - Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
- Yannis Katsis
  - Scalable Knowledge Intelligence, IBM Research-Almaden, 650 Harry Road, San Jose, CA 95120 USA
- Ho-Cheol Kim
  - Scalable Knowledge Intelligence, IBM Research-Almaden, 650 Harry Road, San Jose, CA 95120 USA
- Austin D. Swafford
  - Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
- Rob Knight
  - Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
  - Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
  - UCSD Health Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
  - Department of Bioengineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
- Chun-Nan Hsu
  - Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
  - Department of Neurosciences and Center for Research in Biological Systems, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
12
HUNER: improving biomedical NER with pretraining. Bioinformatics 2019; 36:295-302. [DOI: 10.1093/bioinformatics/btz528] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 04/05/2019] [Revised: 06/13/2019] [Accepted: 06/24/2019] [Indexed: 02/04/2023]
Abstract
Motivation
Several recent studies showed that the application of deep neural networks advanced the state-of-the-art in named entity recognition (NER), including biomedical NER. However, the impact on performance and the robustness of improvements crucially depends on the availability of sufficiently large training corpora, which is a problem in the biomedical domain with its often rather small gold standard corpora.
Results
We evaluate different methods for alleviating the data sparsity problem by pretraining a deep neural network (LSTM-CRF), followed by a rather short fine-tuning phase focusing on a particular corpus. Experiments were performed using 34 different corpora covering five different biomedical entity types, yielding an average increase in F1-score of ∼2 pp compared to learning without pretraining. We experimented both with supervised and semi-supervised pretraining, leading to interesting insights into the precision/recall trade-off. Based on our results, we created the stand-alone NER tool HUNER incorporating fully trained models for five entity types. On the independent CRAFT corpus, which was not used for creating HUNER, it outperforms the state-of-the-art tools GNormPlus and tmChem by 5–13 pp on the entity types chemicals, species and genes.
Availability and implementation
HUNER is freely available at https://hu-ner.github.io. HUNER comes in containers, making it easy to install and use, and it can be applied off-the-shelf to arbitrary texts. We also provide an integrated tool for obtaining and converting all 34 corpora used in our evaluation, including fixed training, development and test splits to enable fair comparisons in the future.
Supplementary information
Supplementary data are available at Bioinformatics online.
|
13
|
Yoon W, So CH, Lee J, Kang J. CollaboNet: collaboration of deep neural networks for biomedical named entity recognition. BMC Bioinformatics 2019; 20:249. [PMID: 31138109 PMCID: PMC6538547 DOI: 10.1186/s12859-019-2813-6] [Citation(s) in RCA: 48] [Impact Index Per Article: 8.0] [Indexed: 11/28/2022] Open
Abstract
Background: Finding biomedical named entities is one of the most essential tasks in biomedical text mining. Recently, deep learning-based approaches have been applied to biomedical named entity recognition (BioNER) and showed promising results. However, as deep learning approaches need an abundant amount of training data, a lack of data can hinder performance. BioNER datasets are scarce resources and each dataset covers only a small subset of entity types. Furthermore, many bio entities are polysemous, which is one of the major obstacles in named entity recognition. Results: To address the lack of data and the entity type misclassification problem, we propose CollaboNet, which utilizes a combination of multiple NER models. In CollaboNet, models trained on different datasets are connected to each other so that a target model obtains information from other collaborator models to reduce false positives. Every model is an expert on its target entity type and takes turns serving as a target and a collaborator model during training time. The experimental results show that CollaboNet can be used to greatly reduce the number of false positives and misclassified entities, including polysemous words. CollaboNet achieved state-of-the-art performance in terms of precision, recall and F1 score. Conclusions: We demonstrated the benefits of combining multiple models for BioNER. Our model has successfully reduced the number of misclassified entities and improved the performance by leveraging multiple datasets annotated for different entity types. Given the state-of-the-art performance of our model, we believe that CollaboNet can improve the accuracy of downstream biomedical text mining applications such as bio-entity relation extraction.
Affiliation(s)
- Wonjin Yoon: Department of Computer Science and Engineering, Korea University, Seoul, 02841, Republic of Korea
- Chan Ho So: Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, 02841, Republic of Korea
- Jinhyuk Lee: Department of Computer Science and Engineering, Korea University, Seoul, 02841, Republic of Korea
- Jaewoo Kang: Department of Computer Science and Engineering, Korea University, Seoul, 02841, Republic of Korea; Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, 02841, Republic of Korea
|
14
|
Yepes AJ, MacKinlay A, Gunn N, Schieber C, Faux N, Downton M, Goudey B, Martin RL. A hybrid approach for automated mutation annotation of the extended human mutation landscape in scientific literature. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2018; 2018:616-623. [PMID: 30815103 PMCID: PMC6371299] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
As the cost of DNA sequencing continues to fall, an increasing amount of information on human genetic variation is being produced that could help progress precision medicine. However, information about such mutations is typically first made available in the scientific literature, and is then later manually curated into more standardized genomic databases. This curation process is expensive and time-consuming, and many variants are never fully curated, if at all. Detecting mutations in the literature is the first key step towards automating this process. However, most of the current methods have focused on identifying mutations that follow existing nomenclatures. In this work, we show that a large number of mutations are missed by this standard approach. Furthermore, we implement the first mutation annotator to cover an extended mutation landscape, and we show that its F1 performance is comparable to that of human annotation (F1 78.29 for manual annotation vs F1 79.56 for automatic annotation).
Affiliation(s)
- Noel Faux: IBM Research, Southbank, VIC, Australia
|
15
|
Abstract
Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most of the resources used in biomedical NLP tasks are in English. Little effort has been reported for other languages, including Romanian, and thus access to such language resources is poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed across three medical subdomains (cardiology, diabetes and endocrinology), extracted from books, journals and blog posts. To annotate the corpus automatically with POS tags, we used a Romanian tag set of 715 labels. Labels for diseases, anatomy, procedures, and chemicals and drugs were manually annotated for bioNER with a Cohen's kappa coefficient of 92.8%, yielding 1877 medical named entities. The automatic annotation of the corpus has been manually checked. The corpus is publicly available and can be used to facilitate the development of NLP algorithms for the Romanian language.
|
16
|
Kordopati V, Salhi A, Razali R, Radovanovic A, Tifratene F, Uludag M, Li Y, Bokhari A, AlSaieedi A, Bin Raies A, Van Neste C, Essack M, Bajic VB. DES-Mutation: System for Exploring Links of Mutations and Diseases. Sci Rep 2018; 8:13359. [PMID: 30190574 PMCID: PMC6127254 DOI: 10.1038/s41598-018-31439-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 12/18/2017] [Accepted: 08/17/2018] [Indexed: 12/17/2022] Open
Abstract
During cellular division DNA replicates and this process is the basis for passing genetic information to the next generation. However, the DNA copy process sometimes produces a copy that is not perfect, that is, one with mutations. The collection of all such mutations in the DNA copy of an organism makes it unique and determines the organism’s phenotype. However, mutations are often the cause of diseases. Thus, it is useful to have the capability to explore links between mutations and disease. We approached this problem by analyzing a vast amount of published information linking mutations to disease states. Based on such information, we developed the DES-Mutation knowledgebase which allows for exploration of not only mutation-disease links, but also links between mutations and concepts from 27 topic-specific dictionaries such as human genes/proteins, toxins, pathogens, etc. This allows for a more detailed insight into mutation-disease links and context. On a sample of 600 mutation-disease associations predicted and curated, our system achieves precision of 72.83%. To demonstrate the utility of DES-Mutation, we provide case studies related to known or potentially novel information involving disease mutations. To our knowledge, this is the first mutation-disease knowledgebase dedicated to the exploration of this topic through text-mining and data-mining of different mutation types and their associations with terms from multiple thematic dictionaries.
Affiliation(s)
- Vasiliki Kordopati: King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
- Adil Salhi: King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
- Rozaimi Razali: King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
- Aleksandar Radovanovic: King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
- Faroug Tifratene: King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
- Mahmut Uludag: King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
- Yu Li: King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
- Ameerah Bokhari: King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
- Ahdab AlSaieedi: King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia; King Abdulaziz University (KAU), Faculty of Applied Medical Sciences (FAMS), Department of Medical Laboratory Technology (MLT), Jeddah, 21589-80324, Saudi Arabia
- Arwa Bin Raies: King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
- Christophe Van Neste: King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia; Ghent University, Center for Medical Genetics Ghent (CMGG), B-9000, Ghent, Belgium
- Magbubah Essack: King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
- Vladimir B Bajic: King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
|
17
|
Mao J, Cui H. Identifying bacterial biotope entities using sequence labeling: Performance and feature analysis. J Assoc Inf Sci Technol 2018. [DOI: 10.1002/asi.24032] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/07/2022]
Affiliation(s)
- Jin Mao: Center for Studies of Information Resources, Wuhan University, 299 Bayi St., Wuhan, Hubei Province 430072, China; School of Information, University of Arizona, 1103 E 2nd St., Tucson, AZ 85721
- Hong Cui: School of Information, University of Arizona, 1103 E 2nd St., Tucson, AZ 85721
|
18
|
Cejuela JM, Bojchevski A, Uhlig C, Bekmukhametov R, Kumar Karn S, Mahmuti S, Baghudana A, Dubey A, Satagopam VP, Rost B. nala: text mining natural language mutation mentions. Bioinformatics 2018; 33:1852-1858. [PMID: 28200120 PMCID: PMC5870606 DOI: 10.1093/bioinformatics/btx083] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 10/17/2016] [Accepted: 02/08/2017] [Indexed: 01/30/2023] Open
Abstract
Motivation: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. ‘E6V’), leaving relevant natural language (NL) mentions largely untapped (e.g. ‘glutamic acid was substituted by valine at residue 6’). Results: We introduced three new corpora suggesting named-entity recognition (NER) to be more challenging than anticipated: 28–77% of all articles contained mentions only available in NL. Our new method nala captured NL and ST by combining conditional random fields with word embedding features learned unsupervised from the entire PubMed. In our hands, nala substantially outperformed the state-of-the-art. For instance, we compared all unique mentions in new discoveries correctly detected by any of three methods (SETH, tmVar, or nala). Neither SETH nor tmVar discovered anything missed by nala, while nala uniquely tagged 33% of mentions. For NL mentions the corresponding value shot up to 100% nala-only. Availability and Implementation: Source code, API and corpora freely available at: http://tagtog.net/-corpora/IDP4+. Supplementary information: Supplementary data are available at Bioinformatics online.
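To make the ST/NL distinction in this abstract concrete, the following is a minimal sketch of why pattern-based matching catches a mention such as ‘E6V’ but not its natural language paraphrase. The regular expression and the function name `find_st_mentions` are illustrative assumptions; they are not taken from nala, SETH or tmVar, and they cover only simple one-letter protein substitutions.

```python
import re
from typing import List

# Assumption: a deliberately narrow pattern for one-letter protein substitutions
# (reference residue, position, variant residue), e.g. "E6V".
# Real tools handle far more notations; this is illustration only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ST_SUBSTITUTION = re.compile(rf"\b[{AMINO_ACIDS}]\d+[{AMINO_ACIDS}]\b")


def find_st_mentions(text: str) -> List[str]:
    """Return all substrings that look like standard substitution mentions."""
    return ST_SUBSTITUTION.findall(text)


st_sentence = "The E6V variant was reported in the patient."
nl_sentence = "glutamic acid was substituted by valine at residue 6"

print(find_st_mentions(st_sentence))  # the standard mention is found
print(find_st_mentions(nl_sentence))  # the NL paraphrase yields no match
```

The second sentence describes the same substitution but produces an empty result, which is the gap that a learned sequence labeler is meant to close.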
Affiliation(s)
- Juan Miguel Cejuela: TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
- Aleksandar Bojchevski: TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
- Carsten Uhlig: TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany
- Rustem Bekmukhametov: TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany; Microsoft, WA, Bellevue, USA
- Sanjeev Kumar Karn: TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany; Ludwig Maximilian University, 80538 Munich & Siemens AG, Corporate Technology, Munich, Germany
- Shpend Mahmuti: TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany
- Ashish Baghudana: TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany; BITS-Pilani K. K. Birla Goa Campus, Goa, India
- Ankit Dubey: TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany; Concur (Germany) GmbH, Frankfurt am Main, Germany
- Venkata P Satagopam: Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Belvaux, Luxembourg
- Burkhard Rost: TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany; Institute of Advanced Study (TUM-IAS) & Institute for Food and Plant Sciences WZW - Weihenstephan & New York Consortium on Membrane Protein Structure (NYCOMPS) & Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
|
19
|
Chen Q, Panyam NC, Elangovan A, Verspoor K. BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics. Database (Oxford) 2018; 2018:5255181. [PMID: 30576491 PMCID: PMC6301335 DOI: 10.1093/database/bay122] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 04/02/2018] [Revised: 09/24/2018] [Accepted: 10/16/2018] [Indexed: 01/01/2023]
Abstract
Precision medicine aims to provide personalized treatments based on individual patient profiles. One critical step towards precision medicine is leveraging knowledge derived from biomedical publications, a tremendous literature resource presenting the latest scientific discoveries on genes, mutations and diseases. Biomedical natural language processing (BioNLP) plays a vital role in supporting automation of this process. BioCreative VI Track 4 brings community effort to the task of automatically identifying and extracting protein-protein interactions (PPi) affected by mutations (PPIm), important in the precision medicine context for capturing individual genotype variation related to disease. We present the READ-BioMed team's approach to identifying PPIm-related publications and to extracting specific PPIm information from those publications in the context of the BioCreative VI PPIm track. We observe that current BioNLP tools are insufficient to recognise entities for these two tasks; the best existing mutation recognition tool achieves only 55% recall in the document triage training set, while relation extraction performance is limited by the low recall performance of gene entity recognition. We develop the models accordingly: for document triage, we develop term lists capturing interactions and mutations to complement BioNLP tools, and select effective features via a feature contribution study, whereas an ensemble of BioNLP tools is employed for relation extraction. Our best document triage model achieves an F-score of 66.77%, while our best model for relation extraction achieved an F-score of 35.09% over the final (updated post-task) test set. Impacting the document triage task, the characteristics of mutations are statistically different in the training and testing sets.
While a vital new direction for biomedical text mining research, this early attempt to tackle the problem of identifying genetic variation of substantial biological significance highlights the importance of representative training data and the cascading impact of tool limitations in a modular system.
Affiliation(s)
- Qingyu Chen: School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
- Nagesh C Panyam: School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
- Aparna Elangovan: School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
- Karin Verspoor: School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
|
20
|
Habibi M, Weber L, Neves M, Wiegandt DL, Leser U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 2017; 33:i37-i48. [PMID: 28881963 PMCID: PMC5870729 DOI: 10.1093/bioinformatics/btx228] [Citation(s) in RCA: 205] [Impact Index Per Article: 25.6] [Indexed: 12/04/2022] Open
Abstract
MOTIVATION: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. RESULTS: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. AVAILABILITY AND IMPLEMENTATION: The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/. CONTACT: habibima@informatik.hu-berlin.de.
Affiliation(s)
- Maryam Habibi: Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
- Leon Weber: Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
- Mariana Neves: Enterprise Platform and Integration Concepts, Hasso-Plattner-Institute, Potsdam, Germany
- David Luis Wiegandt: Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
- Ulf Leser: Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
|
21
|
Bokharaeian B, Diaz A, Taghizadeh N, Chitsaz H, Chavoshinejad R. SNPPhenA: a corpus for extracting ranked associations of single-nucleotide polymorphisms and phenotypes from literature. J Biomed Semantics 2017; 8:14. [PMID: 28388928 PMCID: PMC5383945 DOI: 10.1186/s13326-017-0116-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 07/05/2016] [Accepted: 01/13/2017] [Indexed: 11/17/2022] Open
Abstract
Background: Single Nucleotide Polymorphisms (SNPs) are among the most important types of genetic variations influencing common diseases and phenotypes. Recently, some corpora and methods have been developed with the purpose of extracting mutations and diseases from texts. However, there is no available corpus for extracting associations from texts that is annotated with linguistic-based negation, modality markers, neutral candidates, and confidence level of associations. Method: In this research, different steps were presented so as to produce the SNPPhenA corpus. They include automatic Named Entity Recognition (NER) followed by the manual annotation of SNP and phenotype names, annotation of the SNP-phenotype associations and their level of confidence, as well as modality markers. Moreover, the produced corpus was annotated with negation scopes and cues as well as neutral candidates, which play a crucial role with respect to negation and the modality phenomenon in extraction tasks. Result: The agreement between annotators was measured by Cohen’s Kappa coefficient, where the resulting scores indicated the reliability of the corpus. The Kappa score was 0.79 for annotating the associations and 0.80 for the confidence degree of associations. Further presented were the basic statistics of the annotated features of the corpus in addition to the results of our first experiments related to the extraction of ranked SNP-Phenotype associations. The prepared guideline documents make the corpus more convenient to use. The corpus, guidelines and inter-annotator agreement analysis are available on the website of the corpus: http://nil.fdi.ucm.es/?q=node/639. Conclusion: Specifying the confidence degree of SNP-phenotype associations from articles helps identify the strength of associations that could in turn assist genomics scientists in determining phenotypic plasticity and the importance of environmental factors.
What is more, our first experiments with the corpus show that linguistic-based confidence alongside other non-linguistic features can be utilized in order to estimate the strength of the observed SNP-phenotype associations. Trial Registration: Not applicable. Electronic supplementary material: The online version of this article (doi:10.1186/s13326-017-0116-2) contains supplementary material, which is available to authorized users.
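The inter-annotator agreement figures quoted in this abstract are Cohen's kappa values. As a minimal sketch of how such a score is computed for two annotators, the function below uses the textbook definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name and the toy labels are illustrative assumptions and are unrelated to the SNPPhenA data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy example (hypothetical labels): agreement on 3 of 4 items.
annotator_1 = ["positive", "positive", "negative", "negative"]
annotator_2 = ["positive", "negative", "negative", "negative"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In the toy example p_o is 0.75 and p_e is 0.5, so kappa is 0.5: raw agreement overstates reliability when chance agreement is high.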
Affiliation(s)
- Behrouz Bokharaeian: Facultad informatica, Complutense University of Madrid, Calle Profesor José García Santesmases, 9, 28040, Madrid, Spain
- Alberto Diaz: Facultad informatica, Complutense University of Madrid, Calle Profesor José García Santesmases, 9, 28040, Madrid, Spain
- Nasrin Taghizadeh: School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
- Hamidreza Chitsaz: Department of Computer Science, Colorado State University, Fort Collins, CO, 80523, USA
- Ramyar Chavoshinejad: External Collaborator, Reproductive Biomedicine Research Center, Royan Institute for Reproductive Biomedicine, Tehran, Iran
|
22
|
A French clinical corpus with comprehensive semantic annotations: development of the Medical Entity and Relation LIMSI annOtated Text corpus (MERLOT). LANG RESOUR EVAL 2017. [DOI: 10.1007/s10579-017-9382-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 10/20/2022]
|
23
|
Singhal A, Simmons M, Lu Z. Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine. PLoS Comput Biol 2016; 12:e1005017. [PMID: 27902695 PMCID: PMC5130168 DOI: 10.1371/journal.pcbi.1005017] [Citation(s) in RCA: 66] [Impact Index Per Article: 7.3] [Received: 02/09/2016] [Accepted: 06/04/2016] [Indexed: 11/23/2022] Open
Abstract
The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting such triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of disease-gene-variant triplets from biomedical literature. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer’s disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors. 
Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships.
To provide personalized health care it is important to understand patients’ genomic variations and the effect these variants have in protecting or predisposing patients to disease. Several projects aim at providing this information by manually curating such genotype-phenotype relationships in organized databases using data from clinical trials and biomedical literature. However, the exponentially increasing size of biomedical literature and the limited ability of manual curators to discover the genotype-phenotype relationships “hidden” in text has led to delays in keeping such databases updated with the current findings. The result is a bottleneck in leveraging valuable information that is currently available to develop personalized health care solutions. In the past, a few computational techniques have attempted to speed up the curation efforts by using text mining techniques to automatically mine genotype-phenotype information from biomedical literature. However, such computational approaches have not been able to achieve accuracy levels sufficient to make them appealing for practical use. In this work, we present a highly accurate machine-learning-based text mining approach for mining complete genotype-phenotype relationships from biomedical literature. We test the performance of this approach on ten well-known diseases and demonstrate the validity of our approach and its potential utility for practical purposes.
We are currently working towards generating genotype-phenotype relationships for all PubMed data with the goal of developing an exhaustive database of all the known diseases in life science. We believe that this work will provide very important and needed support for implementation of personalized health care using genomic data.
Affiliation(s)
- Ayush Singhal: National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Michael Simmons: National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Zhiyong Lu: National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
|
24
|
Névéol A, Cohen KB, Grouin C, Hamon T, Lavergne T, Kelly L, Goeuriot L, Rey G, Robert A, Tannier X, Zweigenbaum P. Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016. CEUR WORKSHOP PROCEEDINGS 2016; 1609:28-42. [PMID: 29308065 PMCID: PMC5756095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
This paper reports on Task 2 of the 2016 CLEF eHealth evaluation lab, which extended the previous information extraction tasks of ShARe/CLEF eHealth evaluation labs. The task continued with named entity recognition and normalization in French narratives, as offered in CLEF eHealth 2015. Named entity recognition involved ten types of entities including disorders that were defined according to Semantic Groups in the Unified Medical Language System® (UMLS®), which was also used for normalizing the entities. In addition, we introduced a large-scale classification task in French death certificates, which consisted of extracting causes of death as coded in the International Classification of Diseases, tenth revision (ICD10). Participant systems were evaluated against a blind reference standard of 832 titles of scientific articles indexed in MEDLINE, 4 drug monographs published by the European Medicines Agency (EMEA) and 27,850 death certificates using Precision, Recall and F-measure. In total, seven teams participated, including five in the entity recognition and normalization task, and five in the death certificate coding task. Three teams submitted their systems to our newly offered reproducibility track. For entity recognition, the highest performance was achieved on the EMEA corpus, with an overall F-measure of 0.702 for plain entity recognition and 0.529 for normalized entity recognition. For entity normalization, the highest performance was achieved on the MEDLINE corpus, with an overall F-measure of 0.552. For death certificate coding, the highest performance was 0.848 F-measure.
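Precision, Recall and F-measure recur throughout these abstracts. The sketch below shows one common way to compute them for entity recognition, assuming exact matching of (start, end, type) spans; the abstract does not state the matching criterion used in the evaluation lab, so that choice, the function name and the example spans are all assumptions made for illustration.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity type)


def precision_recall_f1(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Entity-level scores under exact span-and-type matching (an assumption)."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F-measure here is the balanced harmonic mean (F1) of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: the second prediction has the right span but the wrong type.
gold_spans = [(0, 5, "DISO"), (10, 18, "CHEM"), (25, 30, "ANAT")]
pred_spans = [(0, 5, "DISO"), (10, 18, "DISO"), (25, 30, "ANAT"), (40, 44, "CHEM")]
p, r, f = precision_recall_f1(gold_spans, pred_spans)
print(p, r, f)
```

Here two of four predictions are correct (precision 1/2) and two of three gold entities are found (recall 2/3), giving an F1 of 4/7. Relaxed or overlap-based matching would score the mistyped span differently.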
Affiliation(s)
- K Bretonnel Cohen
- LIMSI, CNRS, Université Paris-Saclay, Orsay, France
- University of Colorado, USA
- Cyril Grouin
- LIMSI, CNRS, Université Paris-Saclay, Orsay, France
- Thierry Hamon
- LIMSI, CNRS, Université Paris-Saclay, Orsay, France
- Université Paris Nord, Villetaneuse, France
- Thomas Lavergne
- LIMSI, CNRS, Université Paris-Saclay, Orsay, France
- Univ. Paris-Sud, Orsay, France
- Liadh Kelly
- ADAPT Centre, Trinity College, Dublin, Ireland
- Xavier Tannier
- LIMSI, CNRS, Université Paris-Saclay, Orsay, France
- Univ. Paris-Sud, Orsay, France
|
25
|
Fluck J, Madan S, Ansari S, Kodamullil AT, Karki R, Rastegar-Mojarad M, Catlett NL, Hayes W, Szostak J, Hoeng J, Peitsch M. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database: The Journal of Biological Databases and Curation 2016; 2016:baw113. [PMID: 27554092 PMCID: PMC4995071 DOI: 10.1093/database/baw113] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 12/23/2015] [Accepted: 07/07/2016] [Indexed: 01/21/2023]
Abstract
Success in extracting biological relationships is mainly dependent on the complexity of the task as well as the availability of high-quality training data. Here, we describe the new corpora in the systems biology modeling language BEL for training and testing biological relationship extraction systems that we prepared for the BioCreative V BEL track. BEL was designed to capture relationships not only between proteins or chemicals, but also complex events such as biological processes or disease states. A BEL nanopub is the smallest unit of information and represents a biological relationship with its provenance. In BEL relationships (called BEL statements), the entities are normalized to defined namespaces mainly derived from public repositories, such as sequence databases, MeSH or publicly available ontologies. In the BEL nanopubs, the BEL statements are associated with citation information and supportive evidence such as a text excerpt. To enable the training of extraction tools, we prepared BEL resources and made them available to the community. We selected a subset of these resources focusing on a reduced set of namespaces, namely, human and mouse genes, ChEBI chemicals, MeSH diseases and GO biological processes, as well as relationship types ‘increases’ and ‘decreases’. The published training corpus contains 11 000 BEL statements from over 6000 supportive text excerpts. For method evaluation, we selected and re-annotated two smaller subcorpora containing 100 text excerpts. For this re-annotation, the inter-annotator agreement was measured by the BEL track evaluation environment and resulted in a maximal F-score of 91.18% for full statement agreement. In addition, for a set of 100 BEL statements, we do not only provide the gold standard expert annotations, but also text excerpts pre-selected by two automated systems. Those text excerpts were evaluated and manually annotated as true or false supportive in the course of the BioCreative V BEL track task. 
Database URL: http://wiki.openbel.org/display/BIOC/Datasets
Affiliation(s)
- Juliane Fluck
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- Sumit Madan
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- Sam Ansari
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
- Alpha T Kodamullil
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- Reagon Karki
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- William Hayes
- Selventa, One Alewife Center, Cambridge, MA 02140, USA
- Justyna Szostak
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
- Julia Hoeng
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
- Manuel Peitsch
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
|
26
|
Verspoor KM, Heo GE, Kang KY, Song M. Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts. BMC Med Inform Decis Mak 2016; 16 Suppl 1:68. [PMID: 27454860 PMCID: PMC4959367 DOI: 10.1186/s12911-016-0294-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems. METHODS In this paper, we focus on assessing performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for extraction of relations important for extraction of genetic variant information from the literature. We test the application of the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for information extraction of entities and relations in text, on the relation extraction task using this corpus. RESULTS For the relations which are attested at least 100 times in the Variome corpus, we realise a performance ranging from 0.78-0.84 Precision-weighted F-score, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora, although the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations. CONCLUSIONS This work represents the first attempt to apply relation extraction methods to the Variome corpus. 
The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.
Affiliation(s)
- Karin M Verspoor
- Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Go Eun Heo
- Department of Library and Information Science, Yonsei University, Seoul, Korea
- Keun Young Kang
- Department of Library and Information Science, Yonsei University, Seoul, Korea
- Min Song
- Department of Library and Information Science, Yonsei University, Seoul, Korea
|
27
|
Matos S, Campos D, Pinho R, Silva RM, Mort M, Cooper DN, Oliveira JL. Mining clinical attributes of genomic variants through assisted literature curation in Egas. Database: The Journal of Biological Databases and Curation 2016; 2016:baw096. [PMID: 27278817 PMCID: PMC4897594 DOI: 10.1093/database/baw096] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 12/04/2015] [Accepted: 05/15/2016] [Indexed: 01/08/2023]
Abstract
The veritable deluge of biological data over recent years has led to the establishment of a considerable number of knowledge resources that compile curated information extracted from the literature and store it in structured form, facilitating its use and exploitation. In this article, we focus on the curation of inherited genetic variants and associated clinical attributes, such as zygosity, penetrance or inheritance mode, and describe the use of Egas for this task. Egas is a web-based platform for text-mining assisted literature curation that focuses on usability through modern design solutions and simple user interactions. Egas offers a flexible and customizable tool that allows defining the concept types and relations of interest for a given annotation task, as well as the ontologies used for normalizing each concept type. Further, annotations may be performed on raw documents or on the results of automated concept identification and relation extraction tools. Users can inspect, correct or remove automatic text-mining results, manually add new annotations, and export the results to standard formats. Egas is compatible with the most recent versions of Google Chrome, Mozilla Firefox, Internet Explorer and Safari and is available for use at https://demo.bmd-software.com/egas/. Database URL: https://demo.bmd-software.com/egas/
Affiliation(s)
- Sérgio Matos
- IEETA/DETI, University of Aveiro, Aveiro, 3810-193, Portugal
- Renato Pinho
- IEETA/DETI, University of Aveiro, Aveiro, 3810-193, Portugal
- Raquel M Silva
- IEETA/DETI, University of Aveiro, Aveiro, 3810-193, Portugal
- Department of Medical Sciences, iBiMED, University of Aveiro, Aveiro, 3810-193, Portugal
- David N Cooper
- Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, UK
|
28
|
Mahmood ASMA, Wu TJ, Mazumder R, Vijay-Shanker K. DiMeX: A Text Mining System for Mutation-Disease Association Extraction. PLoS One 2016; 11:e0152725. [PMID: 27073839 PMCID: PMC4830514 DOI: 10.1371/journal.pone.0152725] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.8] [Received: 04/17/2015] [Accepted: 03/19/2016] [Indexed: 11/22/2022] Open
Abstract
The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation to disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has been also evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied on a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases.
Affiliation(s)
- A. S. M. Ashique Mahmood
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
- Tsung-Jung Wu
- Department of Biochemistry and Molecular Medicine, George Washington University, Washington, District of Columbia, United States of America
- Raja Mazumder
- Department of Biochemistry and Molecular Medicine, George Washington University, Washington, District of Columbia, United States of America
- McCormick Genomic and Proteomic Center, George Washington University, Washington, District of Columbia, United States of America
- K. Vijay-Shanker
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
|
29
|
Lee K, Lee S, Park S, Kim S, Kim S, Choi K, Tan AC, Kang J. BRONCO: Biomedical entity Relation ONcology COrpus for extracting gene-variant-disease-drug relations. Database: The Journal of Biological Databases and Curation 2016; 2016:baw043. [PMID: 27074804 PMCID: PMC4830473 DOI: 10.1093/database/baw043] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 10/03/2015] [Accepted: 03/09/2016] [Indexed: 12/31/2022]
Abstract
Comprehensive knowledge of genomic variants in a biological context is key for precision medicine. As next-generation sequencing technologies improve, the amount of literature containing genomic variant data, such as new functions or related phenotypes, rapidly increases. Because numerous articles are published every day, it is almost impossible to manually curate all the variant information from the literature. Many researchers focus on creating an improved automated biomedical natural language processing (BioNLP) method that extracts useful variants and their functional information from the literature. However, there is no gold-standard data set that contains texts annotated with variants and their related functions. To overcome these limitations, we introduce a Biomedical entity Relation ONcology COrpus (BRONCO) that contains more than 400 variants and their relations with genes, diseases, drugs and cell lines in the context of cancer and anti-tumor drug screening research. The variants and their relations were manually extracted from 108 full-text articles. BRONCO can be utilized to evaluate and train new methods used for extracting biomedical entity relations from full-text publications, and thus be a valuable resource to the biomedical text mining research community. Using BRONCO, we quantitatively and qualitatively evaluated the performance of three state-of-the-art BioNLP methods. We also identified their shortcomings, and suggested remedies for each method. We implemented post-processing modules for the three BioNLP methods, which improved their performance. Database URL: http://infos.korea.ac.kr/bronco
Affiliation(s)
- Kyubum Lee
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
- Sunwon Lee
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
- Sungjoon Park
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
- Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
- Suhkyung Kim
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
- Kwanghun Choi
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
- Aik Choon Tan
- Translational Bioinformatics and Cancer Systems Biology Laboratory, Division of Medical Oncology, Department of Medicine, University of Colorado Anschutz Medical Campus, 12801 East 17th Avenue Aurora, CO 80045, USA
- Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
|
30
|
Abstract
OBJECTIVES To summarise current research that takes advantage of "Big Data" in health and biomedical informatics applications. METHODS Survey of trends in this work, and exploration of literature describing how large-scale structured and unstructured data sources are being used to support applications from clinical decision making and health policy, to drug design and pharmacovigilance, and further to systems biology and genetics. RESULTS The survey highlights ongoing development of powerful new methods for turning that large-scale, and often complex, data into information that provides new insights into human health, in a range of different areas. Consideration of this body of work identifies several important paradigm shifts that are facilitated by Big Data resources and methods: in clinical and translational research, from hypothesis-driven research to data-driven research, and in medicine, from evidence-based practice to practice-based evidence. CONCLUSIONS The increasing scale and availability of large quantities of health data require strategies for data management, data linkage, and data integration beyond the limits of many existing information systems, and substantial effort is underway to meet those needs. As our ability to make sense of that data improves, the value of the data will continue to increase. Health systems, genetics and genomics, population and public health; all areas of biomedicine stand to benefit from Big Data and the associated technologies.
Affiliation(s)
- F Martin-Sanchez
- Fernando Martin-Sanchez, Health and Biomedical Informatics Centre, The University of Melbourne, Parkville VIC 3010, Australia
|
31
|
Pham AD, Névéol A, Lavergne T, Yasunaga D, Clément O, Meyer G, Morello R, Burgun A. Natural language processing of radiology reports for the detection of thromboembolic diseases and clinically relevant incidental findings. BMC Bioinformatics 2014; 15:266. [PMID: 25099227 PMCID: PMC4133634 DOI: 10.1186/1471-2105-15-266] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Received: 01/17/2014] [Accepted: 07/19/2014] [Indexed: 12/21/2022] Open
Abstract
Background Natural Language Processing (NLP) has been shown effective to analyze the content of radiology reports and identify diagnosis or patient characteristics. We evaluate the combination of NLP and machine learning to detect thromboembolic disease diagnosis and incidental clinically relevant findings from angiography and venography reports written in French. We model thromboembolic diagnosis and incidental findings as a set of concepts, modalities and relations between concepts that can be used as features by a supervised machine learning algorithm. A corpus of 573 radiology reports was de-identified and manually annotated with the support of NLP tools by a physician for relevant concepts, modalities and relations. A machine learning classifier was trained on the dataset interpreted by a physician for diagnosis of deep-vein thrombosis, pulmonary embolism and clinically relevant incidental findings. Decision models accounted for the imbalanced nature of the data and exploited the structure of the reports. Results The best model achieved an F measure of 0.98 for pulmonary embolism identification, 1.00 for deep vein thrombosis, and 0.80 for incidental clinically relevant findings. The use of concepts, modalities and relations improved performances in all cases. Conclusions This study demonstrates the benefits of developing an automated method to identify medical concepts, modality and relations from radiology reports in French. An end-to-end automatic system for annotation and classification which could be applied to other radiology reports databases would be valuable for epidemiological surveillance, performance monitoring, and accreditation in French hospitals.
Affiliation(s)
- Anne-Dominique Pham
- Department of Biostatistics and Clinical Research, CHU de Caen, Caen F-14000, France.
|
32
|
Comeau DC, Batista-Navarro RT, Dai HJ, Doğan RI, Yepes AJ, Khare R, Lu Z, Marques H, Mattingly CJ, Neves M, Peng Y, Rak R, Rinaldi F, Tsai RTH, Verspoor K, Wiegers TC, Wu CH, Wilbur WJ. BioC interoperability track overview. Database: The Journal of Biological Databases and Curation 2014; 2014:bau053. [PMID: 24980129 PMCID: PMC4074764 DOI: 10.1093/database/bau053] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 11/13/2022]
Abstract
BioC is a new simple XML format for sharing biomedical text and annotations and libraries to read and write that format. This promotes the development of interoperable tools for natural language processing (NLP) of biomedical text. The interoperability track at the BioCreative IV workshop featured contributions using or highlighting the BioC format. These contributions included additional implementations of BioC, many new corpora in the format, biomedical NLP tools consuming and producing the format and online services using the format. The ease of use, broad support and rapidly growing number of tools demonstrate the need for and value of the BioC format. Database URL:http://bioc.sourceforge.net/
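To make the idea of a simple XML format for text plus annotations concrete, here is a minimal sketch that reads a BioC-style document with Python's standard library. The element names (`collection`, `document`, `passage`, `annotation`, `infon`, `location`) reflect a simplified recollection of the BioC layout and should be checked against the official DTD. The sample document and the `read_annotations` helper are invented for illustration and are not part of the BioC libraries.

```python
import xml.etree.ElementTree as ET

# Invented, simplified BioC-style sample (assumption: not a validated BioC file)
SAMPLE = """<collection>
  <source>example</source>
  <date>2014</date>
  <key>example.key</key>
  <document>
    <id>doc1</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 mutations raise cancer risk.</text>
      <annotation id="T1">
        <infon key="type">Gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return (document id, annotation type, covered text) for each annotation."""
    root = ET.fromstring(xml_text)
    found = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            passage_offset = int(passage.findtext("offset"))
            passage_text = passage.findtext("text")
            for annotation in passage.iter("annotation"):
                location = annotation.find("location")
                # Location offsets are document-level; shift them into the passage text
                start = int(location.get("offset")) - passage_offset
                length = int(location.get("length"))
                covered = passage_text[start:start + length]
                found.append((doc_id, annotation.findtext("infon"), covered))
    return found


print(read_annotations(SAMPLE))
```

Recovering the covered text from the stored offset and length, rather than trusting the annotation's own `text` element, is a cheap consistency check when exchanging corpora between tools.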
Affiliation(s)
- Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
- Riza Theresa Batista-Navarro
- Hong-Jie Dai
- Rezarta Islamaj Doğan
- Antonio Jimeno Yepes
- Ritu Khare
- Zhiyong Lu
- Hernani Marques
- Carolyn J Mattingly
| | - Mariana Neves
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Yifan Peng
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Rafal Rak
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Fabio Rinaldi
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Richard Tzong-Han Tsai
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Karin Verspoor
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Thomas C Wiegers
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Cathy H Wu
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
|
33
|
Abstract
Collections of documents annotated with semantic entities and relationships are crucial resources to support the development and evaluation of text mining solutions for the biomedical domain. Here I present an overview of 36 corpora and an analysis of the semantic annotations they contain. Annotations for entity types were classified into six semantic groups, and an overview of the semantic entities found in each corpus is shown. Results show that while some semantic entities, such as genes, proteins and chemicals, are consistently annotated in many collections, corpora available for diseases, variations and mutations are still few, in spite of their importance in the biological domain.
Affiliation(s)
- Mariana Neves
- Hasso-Plattner-Institut, Potsdam Universität, Potsdam, Germany
|
34
|
Jimeno Yepes A, Verspoor K. Literature mining of genetic variants for curation: quantifying the importance of supplementary material. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014; 2014:bau003. [PMID: 24520105 PMCID: PMC3920087 DOI: 10.1093/database/bau003] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 11/14/2022]
Abstract
A major focus of modern biological research is the understanding of how genomic variation relates to disease. Although there are significant ongoing efforts to capture this understanding in curated resources, much of the information remains locked in unstructured sources, in particular, the scientific literature. Thus, there have been several text mining systems developed to target extraction of mutations and other genetic variation from the literature. We have performed the first study of the use of text mining for the recovery of genetic variants curated directly from the literature. We consider two curated databases, COSMIC (Catalogue Of Somatic Mutations In Cancer) and InSiGHT (International Society for Gastro-intestinal Hereditary Tumours), that contain explicit links to the source literature for each included mutation. Our analysis shows that the recall of the mutations catalogued in the databases using a text mining tool is very low, despite the well-established good performance of the tool and even when the full text of the associated article is available for processing. We demonstrate that this discrepancy can be explained by considering the supplementary material linked to the published articles, not previously considered by text mining tools. Although it is anecdotally known that supplementary material contains 'all of the information', and some researchers have speculated about the role of supplementary material (Schenck et al. Extraction of genetic mutations associated with cancer from public literature. J Health Med Inform 2012;S2:2.), our analysis substantiates the significant extent to which this material is critical. Our results highlight the need for literature mining tools to consider not only the narrative content of a publication but also the full set of material related to a publication.
Affiliation(s)
- Antonio Jimeno Yepes
- National ICT Australia, Victoria Research Laboratory, Melbourne, Australia and Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
|
35
|
Jimeno Yepes A, Verspoor K. Mutation extraction tools can be combined for robust recognition of genetic variants in the literature. F1000Res 2014; 3:18. [PMID: 25285203 PMCID: PMC4176422 DOI: 10.12688/f1000research.3-18.v2] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Accepted: 05/27/2014] [Indexed: 11/20/2022]
Abstract
As the cost of genomic sequencing continues to fall, the amount of data being collected and studied for the purpose of understanding the genetic basis of disease is increasing dramatically. Much of the source information relevant to such efforts is available only from unstructured sources such as the scientific literature, and significant resources are expended in manually curating and structuring the information in the literature. As such, there have been a number of systems developed to target automatic extraction of mutations and other genetic variation from the literature using text mining tools. We have performed a broad survey of the existing publicly available tools for extraction of genetic variants from the scientific literature. We consider not just one tool but a number of different tools, individually and in combination, and apply the tools in two scenarios. First, they are compared in an intrinsic evaluation context, where the tools are tested for their ability to identify specific mentions of genetic variants in a corpus of manually annotated papers, the Variome corpus. Second, they are compared in an extrinsic evaluation context based on our previous study of text mining support for curation of the COSMIC and InSiGHT databases. Our results demonstrate that no single tool covers the full range of genetic variants mentioned in the literature. Rather, several tools have complementary coverage and can be used together effectively. In the intrinsic evaluation on the Variome corpus, the combined performance is above 0.95 in F-measure, while in the extrinsic evaluation the combined recall performance is above 0.71 for COSMIC and above 0.62 for InSiGHT, a substantial improvement over the performance of any individual tool. Based on the analysis of these results, we suggest several directions for the improvement of text mining tools for genetic variant extraction from the literature.
Affiliation(s)
- Antonio Jimeno Yepes
- National ICT Australia, Victoria Research Laboratory, Melbourne, Australia; Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
| | - Karin Verspoor
- National ICT Australia, Victoria Research Laboratory, Melbourne, Australia; Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
|
36
|
Biomedical Text Mining: State-of-the-Art, Open Problems and Future Challenges. INTERACTIVE KNOWLEDGE DISCOVERY AND DATA MINING IN BIOMEDICAL INFORMATICS 2014. [DOI: 10.1007/978-3-662-43968-5_16] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Indexed: 12/24/2022]
|
37
|
Lee HJ, Shim SH, Song MR, Lee H, Park JC. CoMAGC: a corpus with multi-faceted annotations of gene-cancer relations. BMC Bioinformatics 2013; 14:323. [PMID: 24225062 PMCID: PMC3833657 DOI: 10.1186/1471-2105-14-323] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 05/23/2013] [Accepted: 11/05/2013] [Indexed: 11/10/2022]
Abstract
BACKGROUND In order to access the large amount of information in biomedical literature about genes implicated in various cancers both efficiently and accurately, the aid of text mining (TM) systems is invaluable. Current TM systems do target either gene-cancer relations or biological processes involving genes and cancers, but the former type produces information not comprehensive enough to explain how a gene affects a cancer, and the latter does not provide a concise summary of gene-cancer relations. RESULTS In this paper, we present a corpus for the development of TM systems that are specifically targeting gene-cancer relations but are still able to capture complex information in biomedical sentences. We describe CoMAGC, a corpus with multi-faceted annotations of gene-cancer relations. In CoMAGC, a piece of annotation is composed of four semantically orthogonal concepts that together express 1) how a gene changes, 2) how a cancer changes and 3) the causality between the gene and the cancer. The multi-faceted annotations are shown to have high inter-annotator agreement. In addition, we show that the annotations in CoMAGC allow us to infer the prospective roles of genes in cancers and to classify the genes into three classes according to the inferred roles. We encode the mapping between multi-faceted annotations and gene classes into 10 inference rules. The inference rules produce results with high accuracy as measured against human annotations. CoMAGC consists of 821 sentences on prostate, breast and ovarian cancers. Currently, we deal with changes in gene expression levels among other types of gene changes. The corpus is available at http://biopathway.org/CoMAGC under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0). CONCLUSIONS The corpus will be an important resource for the development of advanced TM systems on gene-cancer relations.
Affiliation(s)
- Jong C Park
- Department of Computer Science, KAIST, 291 Daehak-ro, Daejeon, Republic of Korea.
|
38
|
Jimeno-Yepes AJ, Sticco JC, Mork JG, Aronson AR. GeneRIF indexing: sentence selection based on machine learning. BMC Bioinformatics 2013; 14:171. [PMID: 23725347 PMCID: PMC3687823 DOI: 10.1186/1471-2105-14-171] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Received: 10/24/2012] [Accepted: 05/22/2013] [Indexed: 11/16/2022]
Abstract
BACKGROUND A Gene Reference Into Function (GeneRIF) describes novel functionality of genes. GeneRIFs are available from the National Center for Biotechnology Information (NCBI) Gene database. GeneRIF indexing is performed manually, and the intention of our work is to provide methods to support creating the GeneRIF entries. The creation of GeneRIF entries involves the identification of the genes mentioned in MEDLINE® citations and the sentences describing a novel function. RESULTS We have compared several learning algorithms and several features extracted or derived from MEDLINE sentences to determine if a sentence should be selected for GeneRIF indexing. Features are derived from the sentences or using mechanisms to augment the information provided by them: assigning a discourse label using a previously trained model, for example. We show that machine learning approaches with specific feature combinations achieve results close to one of the annotators. We have evaluated different feature sets and learning algorithms. In particular, Naïve Bayes achieves better performance with a selection of features similar to one used in related work, which considers the location of the sentence, the discourse of the sentence and the functional terminology in it. CONCLUSIONS The current performance is at a level similar to human annotation and it shows that machine learning can be used to automate the task of sentence selection for GeneRIF annotation. The current experiments are limited to the human species. We would like to see how the methodology can be extended to other species, specifically the normalization of gene mentions in other species.
Affiliation(s)
- Antonio J Jimeno-Yepes
- National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
- NICTA Victoria Research Lab, Melbourne VIC 3010, Australia
| | - J Caitlin Sticco
- National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - James G Mork
- National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Alan R Aronson
- National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
|