1
Zhang Z, Chen ALP. Biomedical named entity recognition with the combined feature attention and fully-shared multi-task learning. BMC Bioinformatics 2022; 23:458. [PMID: 36329384 PMCID: PMC9632084 DOI: 10.1186/s12859-022-04994-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/18/2022] [Accepted: 10/19/2022] [Indexed: 11/06/2022] Open
Abstract
Background: Biomedical named entity recognition (BioNER) is a basic and important task for biomedical text mining with the purpose of automatically recognizing and classifying biomedical entities. The performance of BioNER systems directly impacts downstream applications. Recently, deep neural networks, especially pre-trained language models, have made great progress for BioNER. However, because of the lack of high-quality and large-scale annotated data and relevant external knowledge, the capability of the BioNER system remains limited.
Results: In this paper, we propose a novel fully-shared multi-task learning model based on the pre-trained language model in biomedical domain, namely BioBERT, with a new attention module to integrate the auto-processed syntactic information for the BioNER task. We have conducted numerous experiments on seven benchmark BioNER datasets. The proposed best multi-task model obtains F1 score improvements of 1.03% on BC2GM, 0.91% on NCBI-disease, 0.81% on Linnaeus, 1.26% on JNLPBA, 0.82% on BC5CDR-Chemical, 0.87% on BC5CDR-Disease, and 1.10% on Species-800 compared to the single-task BioBERT model.
Conclusion: The results demonstrate our model outperforms previous studies on all datasets. Further analysis and case studies are also provided to prove the importance of the proposed attention module and fully-shared multi-task learning method used in our model.
Affiliation(s)
- Zhiyu Zhang
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
- Arbee L P Chen
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan; Department of Computer Science and Information Engineering, Asia University, Taichung, Taiwan
2
Rodriguez-Esteban R, Duarte J, Teixeira PC, Richard F, Koltsova S, So WV. Prediction of standard cell types and functional markers from textual descriptions of flow cytometry gating definitions using machine learning. Cytometry. Part B, Clinical Cytometry 2022; 102:220-227. [PMID: 35253974 DOI: 10.1002/cyto.b.22065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2021] [Revised: 02/02/2022] [Accepted: 02/28/2022] [Indexed: 06/14/2023]
Abstract
BACKGROUND: A key step in clinical flow cytometry data analysis is gating, which involves the identification of cell populations. The process of gating produces a set of reportable results, which are typically described by gating definitions. The non-standardized, non-interpreted nature of gating definitions represents a hurdle for data interpretation and data sharing across and within organizations. Interpreting and standardizing gating definitions for subsequent analysis of gating results requires a curation effort from experts. Machine learning approaches have the potential to help in this process by predicting expert annotations associated with gating definitions.
METHODS: We created a gold-standard dataset by manually annotating thousands of gating definitions with cell type and functional marker annotations. We used this dataset to train and test a machine learning pipeline able to predict standard cell types and functional marker genes associated with gating definitions.
RESULTS: The machine learning pipeline predicted annotations with high accuracy for both cell types and functional marker genes. Accuracy was lower for gating definitions from assays belonging to laboratories from which limited or no prior data was available in the training. Manual error review ensured that resulting predicted annotations could be reused subsequently as additional gold-standard training data.
CONCLUSIONS: Machine learning methods are able to consistently predict annotations associated with gating definitions from flow cytometry assays. However, a hybrid automatic and manual annotation workflow would be recommended to achieve optimal results.
Affiliation(s)
- Raul Rodriguez-Esteban
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Basel, Switzerland
- José Duarte
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Basel, Switzerland
- Priscila C Teixeira
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Basel, Switzerland
- Fabien Richard
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Basel, Switzerland
- Svetlana Koltsova
- Curation Department, Rancho BioSciences LLC, San Diego, California, USA
- W Venus So
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center New York, New York, USA
3
Wang Y, Fan X, Chen L, Chang EIC, Ananiadou S, Tsujii J, Xu Y. Mapping anatomical related entities to human body parts based on wikipedia in discharge summaries. BMC Bioinformatics 2019; 20:430. [PMID: 31419946 PMCID: PMC6697955 DOI: 10.1186/s12859-019-3005-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2019] [Accepted: 07/23/2019] [Indexed: 11/16/2022] Open
Abstract
Background: Consisting of dictated free-text documents such as discharge summaries, medical narratives are widely used in medical natural language processing. Relationships between anatomical entities and human body parts are crucial for building medical text mining applications. To achieve this, we establish a mapping system consisting of a Wikipedia-based scoring algorithm and a named entity normalization method (NEN). The mapping system makes full use of information available on Wikipedia, which is a comprehensive Internet medical knowledge base. We also built a new ontology, Tree of Human Body Parts (THBP), from core anatomical parts by referring to anatomical experts and Unified Medical Language Systems (UMLS) to make the mapping system efficacious for clinical treatments.
Result: The gold standard is derived from 50 discharge summaries from our previous work, in which 2,224 anatomical entities are included. The F1-measure of the baseline system is 70.20%, while our algorithm based on Wikipedia achieves 86.67% with the assistance of NEN.
Conclusions: We construct a framework to map anatomical entities to THBP ontology using normalization and a scoring algorithm based on Wikipedia. The proposed framework is proven to be much more effective and efficient than the main baseline system.
Affiliation(s)
- Yipei Wang
- State Key Laboratory of Software Development Environment and Key Laboratory of Biomechanics and Mechanobiology of Ministry of Education and Research Institute of Beihang University in Shenzhen, Beijing Advanced Innovation Center for Biomedical Engineering, Beihang University, Xueyuan Road No.37, Beijing, 100191 China
- Xingyu Fan
- Bioengineering College of Chongqing University, Shazheng Street No. 174, Chongqing, 400044 China
- Luoxin Chen
- State Key Laboratory of Software Development Environment and Key Laboratory of Biomechanics and Mechanobiology of Ministry of Education and Research Institute of Beihang University in Shenzhen, Beijing Advanced Innovation Center for Biomedical Engineering, Beihang University, Xueyuan Road No.37, Beijing, 100191 China
- Sophia Ananiadou
- The National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
- Junichi Tsujii
- The National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
- Artificial Intelligence Research Center (AIRC), Tokyo, Japan
- Yan Xu
- State Key Laboratory of Software Development Environment and Key Laboratory of Biomechanics and Mechanobiology of Ministry of Education and Research Institute of Beihang University in Shenzhen, Beijing Advanced Innovation Center for Biomedical Engineering, Beihang University, Xueyuan Road No.37, Beijing, 100191 China
- Microsoft Research, Danling Street No. 5, Beijing, 100080 China
4
Song HJ, Jo BC, Park CY, Kim JD, Kim YS. Comparison of named entity recognition methodologies in biomedical documents. Biomed Eng Online 2018; 17:158. [PMID: 30396340 PMCID: PMC6219049 DOI: 10.1186/s12938-018-0573-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 11/20/2022] Open
Abstract
Background: Biomedical named entity recognition (Bio-NER) is a fundamental task in handling biomedical text terms, such as RNA, protein, cell type, cell line, and DNA. Bio-NER is one of the most elementary and core tasks in biomedical knowledge discovery from texts. The system described here is developed by using the BioNLP/NLPBA 2004 shared task. Experiments are conducted on a training and evaluation set provided by the task organizers.
Results: Our results show that, compared with a baseline having a 70.09% F1 score, the RNN Jordan- and Elman-type algorithms have F1 scores of approximately 60.53% and 58.80%, respectively. When we use CRF as a machine learning algorithm, CCA, GloVe, and Word2Vec have F1 scores of 72.73%, 72.74%, and 72.82%, respectively.
Conclusions: By using the word embedding constructed through the unsupervised learning, the time and cost required to construct the learning data can be saved.
Affiliation(s)
- Hye-Jeong Song
- School of Software, Hallym University, Chuncheon, South Korea; Bio-IT Research Center, Hallym University, Chuncheon, South Korea
- Byeong-Cheol Jo
- School of Software, Hallym University, Chuncheon, South Korea; Bio-IT Research Center, Hallym University, Chuncheon, South Korea
- Chan-Young Park
- School of Software, Hallym University, Chuncheon, South Korea; Bio-IT Research Center, Hallym University, Chuncheon, South Korea
- Jong-Dae Kim
- School of Software, Hallym University, Chuncheon, South Korea; Bio-IT Research Center, Hallym University, Chuncheon, South Korea
- Yu-Seop Kim
- School of Software, Hallym University, Chuncheon, South Korea; Bio-IT Research Center, Hallym University, Chuncheon, South Korea
5
Zhai X, Li Z, Gao K, Huang Y, Lin L, Wang L. Research status and trend analysis of global biomedical text mining studies in recent 10 years. Scientometrics 2015. [DOI: 10.1007/s11192-015-1700-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 02/02/2023]
6
Yan E, Zhu Y. Identifying entities from scientific publications: A comparison of vocabulary- and model-based methods. J Informetr 2015. [DOI: 10.1016/j.joi.2015.04.003] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 11/26/2022]
7
Rebholz-Schuhmann D, Kafkas S, Kim JH, Li C, Jimeno Yepes A, Hoehndorf R, Backofen R, Lewin I. Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources. J Biomed Semantics 2013; 4:28. [PMID: 24112383 PMCID: PMC4021975 DOI: 10.1186/2041-1480-4-28] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Received: 12/21/2012] [Accepted: 09/11/2013] [Indexed: 11/10/2022] Open
Abstract
Motivation: The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: Terminological and lexical resources deliver the term candidates into PGN tagging solutions and the gold standard corpora (GSC) train them to identify term parameters and contextual features. Ideally all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge, and thus support identification of the same types of PGNs and cover all of them. Unfortunately, none of the three serves as a predominant standard and for this reason it is worth exploring, how these three resources comply with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource in their performance. In particular, we determine the performance gains through false positive filtering, which contributes to the disambiguation of identified PGNs.
Results: In general, machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance against the BioCreative-II and Jnlpba GSCs (exact matching), whereas the lexicon based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions certainly perform best, if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora and – on the other hand – the profiles of the false positive mistakes characterize the tagging solutions. Lex-Tag solutions that are based on a large terminological resource in combination with false positive filtering produce better results, which, in addition, provide concept identifiers from a knowledge source in contrast to ML-Tag solutions.
Conclusion: The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for their precision and recall performance, but with similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards, but cope differently with the term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results by increasing their precision without compromising significantly their recall. The harmonisation of the annotation schemes in combination with standardized lexical resources in the tagging solutions will enable their comparability and will pave the way for a shared standard.
8
Lu Z, Kao HY, Wei CH, Huang M, Liu J, Kuo CJ, Hsu CN, Tsai RTH, Dai HJ, Okazaki N, Cho HC, Gerner M, Solt I, Agarwal S, Liu F, Vishnyakova D, Ruch P, Romacker M, Rinaldi F, Bhattacharya S, Srinivasan P, Liu H, Torii M, Matos S, Campos D, Verspoor K, Livingston KM, Wilbur WJ. The gene normalization task in BioCreative III. BMC Bioinformatics 2011; 12 Suppl 8:S2. [PMID: 22151901 PMCID: PMC3269937 DOI: 10.1186/1471-2105-12-s8-s2] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.6] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).
RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively.
CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
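The percentage improvements quoted in the results can be checked from the TAP-k scores given in the abstract. The Python sketch below assumes the improvements are relative gains over the best single-team score at each k; the function and variable names are illustrative and do not come from the paper.

```python
# TAP-k scores quoted in the abstract (gold-standard evaluation), keyed by k.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    return (improved - baseline) / baseline * 100.0


# Assumption: "improvement" means relative (not absolute) gain at the same k.
improvements = {
    k: round(relative_improvement(best_team[k], composite[k]), 1)
    for k in best_team
}

for k, pct in improvements.items():
    print(f"k={k}: {pct}%")
```

Under that assumption the computed values match the reported 12.4%, 21.8%, and 26.6%.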
Affiliation(s)
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, Maryland 20894, USA
- Hung-Yu Kao
- Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Chih-Hsuan Wei
- Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Minlie Huang
- Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
- Jingchen Liu
- Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
- Cheng-Ju Kuo
- Institute of Information Science, Academia Sinica, Taipei 115, Taiwan
- Chun-Nan Hsu
- Institute of Information Science, Academia Sinica, Taipei 115, Taiwan
- Information Science Institute, University of Southern California, Marina del Rey, California, USA
- Richard Tzong-Han Tsai
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan, R.O.C
- Hong-Jie Dai
- Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan, R.O.C
- Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C
- Naoaki Okazaki
- Interfaculty Initiative in Information Studies, University of Tokyo, Japan
- Han-Cheol Cho
- Graduate School of Information Science and Technology, University of Tokyo, Japan
- Martin Gerner
- Faculty of Life Sciences, University of Manchester, Manchester, M13 9PT, UK
- Illes Solt
- Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1117 Budapest, Hungary
- Shashank Agarwal
- Medical Informatics, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA
- Feifan Liu
- Medical Informatics, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA
- Dina Vishnyakova
- BiTeM Group, Division of Medical Information Sciences, University of Geneva, Switzerland
- Patrick Ruch
- BiTeM Group, Information Science Department, University of Applied Science, Geneva, Switzerland
- Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland
- Padmini Srinivasan
- Department of Computer Science, The University of Iowa, Iowa City, Iowa 52242, USA
- Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN 55905 USA
- Manabu Torii
- Lab of Text Intelligence in Biomedicine, Georgetown University Medical Center, 4000 Reservoir Rd., NW, Washington, DC 20057 USA
- Sergio Matos
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- David Campos
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Karin Verspoor
- Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
- Kevin M Livingston
- Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
- W John Wilbur
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, Maryland 20894, USA
9
Torii M, Wagholikar K, Liu H. Using machine learning for concept extraction on clinical documents from multiple data sources. J Am Med Inform Assoc 2011; 18:580-7. [PMID: 21709161 DOI: 10.1136/amiajnl-2011-000155] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.4] [Indexed: 11/04/2022] Open
Abstract
OBJECTIVE: Concept extraction is a process to identify phrases referring to concepts of interests in unstructured text. It is a critical component in automated text processing. We investigate the performance of machine learning taggers for clinical concept extraction, particularly the portability of taggers across documents from multiple data sources.
METHODS: We used BioTagger-GM to train machine learning taggers, which we originally developed for the detection of gene/protein names in the biology domain. Trained taggers were evaluated using the annotated clinical documents made available in the 2010 i2b2/VA Challenge workshop, consisting of documents from four data sources.
RESULTS: As expected, performance of a tagger trained on one data source degraded when evaluated on another source, but the degradation of the performance varied depending on data sources. A tagger trained on multiple data sources was robust, and it achieved an F score as high as 0.890 on one data source. The results also suggest that performance of machine learning taggers is likely to improve if more annotated documents are available for training.
CONCLUSION: Our study shows how the performance of machine learning taggers is degraded when they are ported across clinical documents from different sources. The portability of taggers can be enhanced by training on datasets from multiple sources. The study also shows that BioTagger-GM can be easily extended to detect clinical concept mentions with good performance.
Affiliation(s)
- Manabu Torii
- Lab of Text Intelligence in Biomedicine, Georgetown University Medical Center, Washington, DC 20007, USA.
10
Han L, Suzek TO, Wang Y, Bryant SH. The Text-mining based PubChem Bioassay neighboring analysis. BMC Bioinformatics 2010; 11:549. [PMID: 21059237 PMCID: PMC3098095 DOI: 10.1186/1471-2105-11-549] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Received: 05/14/2010] [Accepted: 11/08/2010] [Indexed: 11/10/2022] Open
Abstract
Background: In recent years, the number of High Throughput Screening (HTS) assays deposited in PubChem has grown quickly. As a result, the volume of both the structured information (i.e. molecular structure, bioactivities) and the unstructured information (such as descriptions of bioassay experiments), has been increasing exponentially. As a result, it has become even more demanding and challenging to efficiently assemble the bioactivity data by mining the huge amount of information to identify and interpret the relationships among the diversified bioassay experiments. In this work, we propose a text-mining based approach for bioassay neighboring analysis from the unstructured text descriptions contained in the PubChem BioAssay database.
Results: The neighboring analysis is achieved by evaluating the cosine scores of each bioassay pair and fraction of overlaps among the human-curated neighbors. Our results from the cosine score distribution analysis and assay neighbor clustering analysis on all PubChem bioassays suggest that strong correlations among the bioassays can be identified from their conceptual relevance. A comparison with other existing assay neighboring methods suggests that the text-mining based bioassay neighboring approach provides meaningful linkages among the PubChem bioassays, and complements the existing methods by identifying additional relationships among the bioassay entries.
Conclusions: The text-mining based bioassay neighboring analysis is efficient for correlating bioassays and studying different aspects of a biological process, which are otherwise difficult to achieve by existing neighboring procedures due to the lack of specific annotations and structured information. It is suggested that the text-mining based bioassay neighboring analysis can be used as a standalone or as a complementary tool for the PubChem bioassay neighboring process to enable efficient integration of assay results and generate hypotheses for the discovery of bioactivities of the tested reagents.
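As a rough illustration of the cosine score mentioned in the results, the Python sketch below compares two free-text descriptions using raw term counts. The abstract does not state the tokenization or term weighting used, so both are assumptions here, and the function names and example strings are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens (assumed tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_score(a: str, b: str) -> float:
    """Cosine similarity between the raw term-count vectors of two texts."""
    va, vb = term_counts(a), term_counts(b)
    # Dot product over the terms the two texts share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty description has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical assay descriptions, not taken from PubChem.
score = cosine_score(
    "luciferase reporter assay for kinase inhibitors",
    "cell-based kinase inhibitor assay",
)
print(f"cosine score: {score:.3f}")
```

A score near 1 indicates descriptions that share most of their vocabulary, and a score of 0 indicates no shared terms; a weighted scheme such as TF-IDF would replace the raw counts without changing the rest of the calculation.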
Affiliation(s)
- Lianyi Han
- National Center for Biotechnology Information, US National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
11
Harmston N, Filsell W, Stumpf MPH. What the papers say: text mining for genomics and systems biology. Hum Genomics 2010; 5:17-29. [PMID: 21106487 PMCID: PMC3500154 DOI: 10.1186/1479-7364-5-1-17] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.6] [Received: 08/06/2010] [Accepted: 08/06/2010] [Indexed: 12/11/2022] Open
Abstract
Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining - the automated extraction of information from (electronically) published sources - could potentially fulfil an important role - but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward.
Affiliation(s)
- Nathan Harmston
- Division of Molecular Biosciences, Centre for Bioinformatics, Imperial College London, 303, Wolfson Building, South Kensington Campus, London, SW7 2AZ, UK
- Wendy Filsell
- Unilever R&D, Colworth Science Park, Sharnbrook, Bedford MK44 1LQ, UK
- Michael PH Stumpf
- Division of Molecular Biosciences, Centre for Bioinformatics, Imperial College London, 303, Wolfson Building, South Kensington Campus, London, SW7 2AZ, UK
12
Yang Y, Gilbert D, Kim S. Annotation confidence score for genome annotation: a genome comparison approach. Bioinformatics 2009; 26:22-9. [PMID: 19855104 DOI: 10.1093/bioinformatics/btp613] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION: The massively parallel sequencing technology can be used by small research labs to generate genome sequences of their research interest. However, annotation of genomes still relies on the manual process, which becomes a serious bottleneck to the high-throughput genome projects. Recently, automatic annotation methods are increasingly more accurate, but there are several issues. One important challenge in using automatic annotation methods is to distinguish annotation quality of ORFs or genes. The availability of such annotation quality of genes can reduce the human labor cost dramatically since manual inspection can focus only on genes with low-annotation quality scores.
RESULTS: In this article, we propose a novel annotation quality or confidence scoring scheme, called Annotation Confidence Score (ACS), using a genome comparison approach. The scoring scheme is computed by combining sequence and textual annotation similarity using a modified version of a logistic curve. The most important feature of the proposed scoring scheme is to generate a score that reflects the excellence in annotation quality of genes by automatically adjusting the number of genomes used to compute the score and their phylogenetic distance. Extensive experiments with bacterial genomes showed that the proposed scoring scheme generated scores for annotation quality according to the quality of annotation regardless of the number of reference genomes and their phylogenetic distance.
AVAILABILITY: http://microbial.informatics.indiana.edu/acs
Affiliation(s)
- Youngik Yang
- School of Informatics and Computing, Indiana University, Bloomington, IN 47408, USA
13
Lourenço A, Carreira R, Carneiro S, Maia P, Glez-Peña D, Fdez-Riverola F, Ferreira EC, Rocha I, Rocha M. @Note: a workbench for biomedical text mining. J Biomed Inform 2009; 42:710-20. [PMID: 19393341 DOI: 10.1016/j.jbi.2009.04.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Received: 11/06/2008] [Revised: 02/16/2009] [Accepted: 04/07/2009] [Indexed: 10/20/2022]
Abstract
Biomedical Text Mining (BioTM) is providing valuable approaches to the automated curation of scientific literature. However, most efforts have addressed the benchmarking of new algorithms rather than users' operational needs. Bridging the gap between BioTM researchers and biologists' needs is crucial to solving real-world problems and promoting further research. We present @Note, a platform for BioTM that aims at the effective translation of advances between three distinct classes of users: biologists, text miners and software developers. Its main functional contributions are the ability to process abstracts and full texts; an information retrieval module enabling PubMed search and journal crawling; a pre-processing module with PDF-to-text conversion, tokenisation and stopword removal; a semantic annotation schema; a lexicon-based annotator; a user-friendly annotation view that allows users to correct annotations; and a Text Mining Module supporting dataset preparation and algorithm evaluation. @Note improves interoperability, modularity and flexibility when integrating in-house and open-source third-party components. Its component-based architecture allows the rapid development of new applications, emphasizing the principles of transparency and simplicity of use. Although development is still ongoing, it has already allowed the development of applications that are currently in use.
Affiliation(s)
- Anália Lourenço
- IBB - Institute for Biotechnology and Bioengineering, Centre of Biological Engineering, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal.
|
14
|
Sarntivijai S, Ade AS, Athey BD, States DJ. A bioinformatics analysis of the cell line nomenclature. Bioinformatics 2008; 24:2760-6. [PMID: 18849319 DOI: 10.1093/bioinformatics/btn502] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Indexed: 11/14/2022]
Abstract
MOTIVATION Cell lines are used extensively in biomedical research, but the nomenclature describing cell lines has not been standardized. The problems are both linguistic and experimental. Many ambiguous cell line names appear in the published literature. Users of the same cell line may refer to it in different ways, and cell lines may mutate or become contaminated without the knowledge of the user. As a first step towards rationalizing this nomenclature, we created a cell line knowledgebase (CLKB) with a well-structured collection of names and descriptive data for cell lines cultured in vitro. The objectives of this work are: (i) to assist users in extracting useful information from biomedical text and (ii) to highlight the importance of standardizing cell line names in biomedical research. This CLKB contains a broad collection of cell line names compiled from ATCC, Hyper CLDB and MeSH. In addition to names, the knowledgebase specifies relationships between cell lines. We analyze the use of cell line names in biomedical text. Issues include ambiguous names, polymorphisms in the use of names and the fact that some cell line names are also common English words. Linguistic patterns associated with the occurrence of cell line names are analyzed. Applying these patterns to find additional cell line names in the literature identifies only a small number of additional names. Annotation of microarray gene expression studies is used as a test case. The CLKB facilitates data exploration and comparison of different cell lines in support of clinical and experimental research. AVAILABILITY The web ontology file for this cell line collection can be downloaded at http://www.stateslab.org/data/celllineOntology/cellline.zip.
Affiliation(s)
- Sirarat Sarntivijai
- National Center for Integrative Biomedical Informatics and the Center for Computational Medicine and Biology, University of Michigan, Ann Arbor, MI 48109, USA
|
15
|
Nordquist L. Physiology education and the linguistic jungle of science. Advances in Physiology Education 2008; 32:173-174. [PMID: 18794235 DOI: 10.1152/advan.90115.2008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/26/2023]
Affiliation(s)
- Lina Nordquist
- Division of Integrative Physiology, Department of Medical Cell Biology, Uppsala University, Uppsala, Sweden.
|
16
|
Wren JD, Wilkins D, Fuscoe JC, Bridges S, Winters-Hilt S, Gusev Y. Proceedings of the 2008 MidSouth Computational Biology and Bioinformatics Society (MCBIOS) Conference. BMC Bioinformatics 2008; 9 Suppl 9:S1. [PMID: 18793454 PMCID: PMC2537572 DOI: 10.1186/1471-2105-9-s9-s1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/21/2022] Open
Affiliation(s)
- Jonathan D Wren
- Arthritis and Immunology Research Program, Oklahoma Medical Research Foundation; 825 N.E. 13th Street, Oklahoma City, OK 73104-5005, USA.
|
17
|
Abstract
Background One of the difficulties in mapping biomedical named entities, e.g. genes, proteins, chemicals and diseases, to their concept identifiers stems from the potential variability of the terms. Soft string matching is a possible solution to the problem, but its inherently heavy computational cost discourages its use when the dictionaries are large or when real-time processing is required. A less computationally demanding approach is to normalize the terms by using heuristic rules, which enables us to look up a dictionary in constant time regardless of its size. The development of good heuristic rules, however, requires extensive knowledge of the terminology in question and is thus the bottleneck of the normalization approach. Results We present a novel framework for discovering a list of normalization rules from a dictionary in a fully automated manner. The rules are discovered in such a way that they minimize the ambiguity and variability of the terms in the dictionary. We evaluated our algorithm using two large dictionaries: a human gene/protein name dictionary built from BioThesaurus and a disease name dictionary built from UMLS. Conclusions The experimental results showed that automatically discovered rules can perform comparably to carefully crafted heuristic rules in term mapping tasks, and the computational overhead of rule application is small enough that a very fast implementation is possible. This work will help improve the performance of term-concept mapping tasks in biomedical information extraction, especially when good normalization heuristics for the target terminology are not fully known.
|
18
|
Marinić I, Supek F, Kovačić Z, Rukavina L, Jendričko T, Kozarić-Kovačić D. Posttraumatic stress disorder: diagnostic data analysis by data mining methodology. Croat Med J 2007; 48:185-97. [PMID: 17436383 PMCID: PMC2080528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/14/2023] Open
Abstract
AIM To use data mining methods in assessing diagnostic symptoms in posttraumatic stress disorder (PTSD). METHODS The study included 102 inpatients: 51 with a diagnosis of PTSD and 51 with psychiatric diagnoses other than PTSD. Several models for predicting diagnosis were built using the random forest classifier, one of the intelligent data analysis methods. The first prediction model was based on a structured psychiatric interview, the second on psychiatric scales (Clinician-administered PTSD Scale, CAPS; Positive and Negative Syndrome Scale, PANSS; Hamilton Anxiety Scale, HAMA; and Hamilton Depression Scale, HAMD), and the third on combined data from both sources. Additional models placing more weight on one of the classes (PTSD or non-PTSD) were trained, and prototypes representing subgroups in the classes constructed. RESULTS The first model was the most relevant for distinguishing PTSD diagnosis from comorbid diagnoses such as neurotic, stress-related, and somatoform disorders. The second model pointed out the scores obtained on the CAPS scale and additional PANSS scales, together with comorbid diagnoses of neurotic, stress-related, and somatoform disorders as most relevant. In the third model, psychiatric scales and the same group of comorbid diagnoses were found to be most relevant. Specialized models placing more weight on either the PTSD or non-PTSD class were able to better predict their targeted diagnoses at some expense of overall accuracy. Class subgroup prototypes mainly differed in values achieved on psychiatric scales and frequency of comorbid diagnoses. CONCLUSION Our work demonstrated the applicability of data mining methods for the analysis of structured psychiatric data for PTSD. In all models, the group of comorbid diagnoses, including neurotic, stress-related, and somatoform disorders, surfaced as important. The important attributes of the data, based on the structured psychiatric interview, were the current symptoms and conditions such as presence and degree of disability, hospitalizations, and duration of military service during the war, while CAPS total scores, symptoms of increased arousal, and PANSS additional criteria scores were indicated as relevant from the psychiatric symptom scales.
Affiliation(s)
- Igor Marinić
- Dubrava University Hospital, Department of Psychiatry, Referral Center of the Ministry of Health and Social Welfare for Stress-related Disorders, Zagreb, Croatia
- Fran Supek
- Laboratory for information systems, Department of Electronics, Ruđer Bošković Institute, Zagreb, Croatia
- Zrnka Kovačić
- Croatian Institute for Brain Research, University of Zagreb School of Medicine, Croatia
- Lea Rukavina
- "Bonifarm", Polyclinic for Clinical Pharmacology and Toxicology, Zagreb, Croatia
- Tihana Jendričko
- Dubrava University Hospital, Department of Psychiatry, Referral Center of the Ministry of Health and Social Welfare for Stress-related Disorders, Zagreb, Croatia
- Dragica Kozarić-Kovačić
- Dubrava University Hospital, Department of Psychiatry, Referral Center of the Ministry of Health and Social Welfare for Stress-related Disorders, Zagreb, Croatia
|
19
|
Beisvag V, Jünge FKR, Bergum H, Jølsum L, Lydersen S, Günther CC, Ramampiaro H, Langaas M, Sandvik AK, Lægreid A. GeneTools--application for functional annotation and statistical hypothesis testing. BMC Bioinformatics 2006; 7:470. [PMID: 17062145 PMCID: PMC1630634 DOI: 10.1186/1471-2105-7-470] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.6] [Received: 07/19/2006] [Accepted: 10/24/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Modern biology has shifted from "one gene" approaches to methods for genomic-scale analysis like microarray technology, which allow simultaneous measurement of thousands of genes. This has created a need for tools facilitating interpretation of biological data in "batch" mode. However, such tools often leave the investigator with large volumes of apparently unorganized information. To meet this interpretation challenge, gene-set, or cluster, testing has become a popular analytical tool. Many gene-set testing methods and software packages are now available, most of which use a variety of statistical tests to assess the genes in a set for biological information. However, the field is still evolving, and there is a great need for "integrated" solutions. RESULTS GeneTools is a web service providing access to a database that brings together information from a broad range of resources. The annotation data are updated weekly, guaranteeing that users get the most recently available data. Data submitted by the user are stored in the database, where they can easily be updated, shared between users and exported in various formats. GeneTools provides three different tools: i) NMC Annotation Tool, which offers annotations from several databases like UniGene, Entrez Gene, SwissProt and GeneOntology, in both single and batch search mode. ii) GO Annotator Tool, where users can add new gene ontology (GO) annotations to genes of interest. These user-defined GO annotations can be used in further analysis or exported for public distribution. iii) eGOn, a tool for visualization and statistical hypothesis testing of GO category representation. As the first GO tool, eGOn supports hypothesis testing for three different situations (master-target situation, mutually exclusive target-target situation and intersecting target-target situation). An important additional function is an evidence-code filter that allows users to select the GO annotations for the analysis. CONCLUSION GeneTools is the first "all in one" annotation tool, providing users with rapid extraction of highly relevant gene annotation data for, e.g., thousands of genes or clones at once. It allows a user to define and archive new GO annotations and it supports hypothesis testing related to GO category representations. GeneTools is freely available through www.genetools.no
Affiliation(s)
- Vidar Beisvag
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
- Frode KR Jünge
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
- Hallgeir Bergum
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
- Lars Jølsum
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
- Stian Lydersen
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
- Clara-Cecilie Günther
- Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway
- Heri Ramampiaro
- Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
- Mette Langaas
- Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway
- Arne K Sandvik
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
- Department of Medicine, St. Olav's University Hospital, Trondheim, Norway
- Astrid Lægreid
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
|