1
Yu B, Zhao Y, Jiang L, Zhou J, Xu H, Lei L, Xu L, Wang X, Bu S. Network pharmacology and experimental validation of Compound Kushen Powder for the treatment of diarrhea in vivo. Vet Anim Sci 2025; 28:100443. PMID: 40206406; PMCID: PMC11979447; DOI: 10.1016/j.vas.2025.100443.
Abstract
To explore the mechanism of sophora flavescens, cortex fraxini, and pomegranate peel complex powder (Compound Kushen Powder) in the treatment of animal diarrhea, a network pharmacology approach leveraging databases like TCMSP and SwissTarget was applied in this study. Molecular docking was executed between the primary constituents and pivotal targets, enabling an additional refinement of main targets and key medications. Subsequently, a rat diarrhea model induced by folium sennae leaves was established for in vivo validation. The rats were divided into four groups: negative control group, positive control group, positive drug treatment group, and Compound Kushen Powder treatment group. Key protein targets, such as Caspase-3, IL-1β, IL-10, MMP9, STAT3, TNF, TP53, and VEGFA, essential for mitigating diarrhea in response to the composite medication were found through network pharmacology. Additionally, the results of molecular docking analysis unveiled fundamental constituents of Compound Kushen Powder, namely beta-sitosterol, ursolic acid, formononetin, and matrine, which demonstrated significant binding affinities with those identified key protein targets. The results of mRNA and protein expression analyses of rat colonic tissue validated the in vivo alterations of core genes identified through network screening. Except for IL-10 and STAT3, the expression of all targets exhibited noteworthy reductions when compared to the positive control group (P < 0.05). These results demonstrated that Compound Kushen Powder can inhibit inflammation and regulate cell apoptosis by modulating signaling pathways such as IL-17, TNF-α, MAPK, and NF-κB. Collectively, this study sheds light on the traditional application of complex powder for the prevention and treatment of diarrhea.
Affiliation(s)
- Bo Yu
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- College of Veterinary Medicine, Yangzhou University, Yangzhou 225009, China
- Jiangsu Co-innovation Center for Prevention and Control of Important Animal Infectious Diseases and Zoonoses, Yangzhou 225009, China
- Yuanfeng Zhao
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Lingling Jiang
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Jingrui Zhou
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Haoxiang Xu
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Lu Lei
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Longxin Xu
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Xin Wang
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Shijin Bu
- College of Veterinary Medicine, Yangzhou University, Yangzhou 225009, China
- Jiangsu Co-innovation Center for Prevention and Control of Important Animal Infectious Diseases and Zoonoses, Yangzhou 225009, China
2
Saad E, Kishk S, Ali-Eldin A, Saleh AI. SB-AGT: A stochastic beam search-enhanced attention-based Gumbel tree framework for drug-drug interaction extraction from biomedical literature. Comput Biol Med 2025; 189:110011. PMID: 40086288; DOI: 10.1016/j.compbiomed.2025.110011.
Abstract
Detection of drug-drug interactions (DDI) and chemical-protein interactions (CPI) is crucial for patient safety, as unidentified interactions may lead to severe adverse drug reactions (ADRs). While extensive DDI and CPI information exists within the biomedical literature, manual extraction by experts is time-intensive and resource-demanding. This study presents SB-AGT (Stochastic Beam Search-enhanced Attention-based Gumbel Tree), an approach to automated DDI and CPI extraction built on an attention-based architecture that incorporates a modified Gumbel-Tree method with stochastic beam search optimization. SB-AGT leverages multi-head attention mechanisms and enhanced latent tree structures to capture complex syntactic features and drug relationship patterns. The methodology also includes comprehensive preprocessing protocols and addresses dataset imbalance through hybrid sampling techniques. Experimental validation on the DDIExtraction 2013 and CHEMPROT datasets demonstrates the model's effectiveness, with precision surpassing traditional approaches and competitive performance compared with contemporary pre-trained models. Through parametric analysis, we establish optimal beam search configurations that maximize the model's extraction accuracy.
Affiliation(s)
- Eman Saad
- Communication and Information Technology Center, Mansoura University, Egypt.
- Sherif Kishk
- Electronics and Communication Engineering Dept., Mansoura University, Egypt.
- Amr Ali-Eldin
- Computer Engineering & Control Systems Dept., Mansoura University, Egypt.
- Ahmed I Saleh
- Computer Engineering & Control Systems Dept., Mansoura University, Egypt.
3
Zhang Y, Sui X, Pan F, Yu K, Li K, Tian S, Erdengasileng A, Han Q, Wang W, Wang J, Wang J, Sun D, Chung H, Zhou J, Zhou E, Lee B, Zhang P, Qiu X, Zhao T, Zhang J. A comprehensive large scale biomedical knowledge graph for AI powered data driven biomedical research. bioRxiv [Preprint] 2025:2023.10.13.562216. PMID: 38168218; PMCID: PMC10760044; DOI: 10.1101/2023.10.13.562216.
Abstract
To address the rapid growth of scientific publications and data in biomedical research, knowledge graphs (KGs) have become a critical tool for integrating large volumes of heterogeneous data to enable efficient information retrieval and automated knowledge discovery (AKD). However, transforming unstructured scientific literature into KGs remains a significant challenge, with previous methods unable to achieve human-level accuracy. In this study, we utilized an information extraction pipeline that won first place in the LitCoin NLP Challenge (2022) to construct a large-scale KG named iKraph using all PubMed abstracts. The extracted information matches human expert annotations and significantly exceeds the content of manually curated public databases. To enhance the KG's comprehensiveness, we integrated relation data from 40 public databases and relation information inferred from high-throughput genomics data. This KG facilitates rigorous performance evaluation of AKD, which was infeasible in previous studies. We designed an interpretable, probabilistic-based inference method to identify indirect causal relations and applied it to real-time COVID-19 drug repurposing from March 2020 to May 2023. Our method identified 600-1400 candidate drugs per month, with one-third of those discovered in the first two months later supported by clinical trials or PubMed publications. These outcomes are very challenging to attain through alternative approaches that lack a thorough understanding of the existing literature. A cloud-based platform (https://biokde.insilicom.com) was developed for academic users to access this rich structured data and associated tools.
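The indirect-relation idea can be illustrated with a small, hedged sketch: per-edge confidences along two-hop paths are combined with a noisy-OR rule. The graph content and the aggregation rule below are illustrative assumptions, not the scoring function used to build iKraph.

```python
# Illustrative sketch of scoring an indirect (two-hop) causal relation in a
# knowledge graph by combining per-edge confidences with a noisy-OR rule.
# Both the toy edges and the noisy-OR aggregation are assumptions for
# illustration only.
from collections import defaultdict

# edges[source][target] = confidence that "source causally affects target"
edges = defaultdict(dict)
edges["drugX"]["geneA"] = 0.9
edges["geneA"]["covid19"] = 0.7
edges["drugX"]["geneB"] = 0.6
edges["geneB"]["covid19"] = 0.5

def indirect_score(src: str, dst: str) -> float:
    """Probability that at least one two-hop path supports src -> dst."""
    p_no_support = 1.0
    for mid, p1 in edges[src].items():
        p2 = edges[mid].get(dst)
        if p2 is not None:
            p_no_support *= 1.0 - p1 * p2  # the path fails unless both hops hold
    return 1.0 - p_no_support

print(indirect_score("drugX", "covid19"))  # 0.741
```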
Affiliation(s)
- Yuan Zhang
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Insilicom LLC, Tallahassee, FL 32303
- Xin Sui
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Feng Pan
- Insilicom LLC, Tallahassee, FL 32303
- Keqiao Li
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Shubo Tian
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Qing Han
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Wanjing Wang
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Jian Wang
- 977 Wisteria Ter., Sunnyvale, CA 94086
- Jun Zhou
- Insilicom LLC, Tallahassee, FL 32303
- Eric Zhou
- Insilicom LLC, Tallahassee, FL 32303
- Ben Lee
- Insilicom LLC, Tallahassee, FL 32303
- Peili Zhang
- Forward Informatics, Winchester, Massachusetts, 01890
- Xing Qiu
- Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14642
- Tingting Zhao
- Insilicom LLC, Tallahassee, FL 32303
- Department of Geography, Florida State University, Tallahassee, FL 32306
- Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Insilicom LLC, Tallahassee, FL 32303
4
He J, Li F, Li J, Hu X, Nian Y, Xiang Y, Wang J, Wei Q, Li Y, Xu H, Tao C. Prompt Tuning in Biomedical Relation Extraction. Journal of Healthcare Informatics Research 2024; 8:206-224. PMID: 38681754; PMCID: PMC11052745; DOI: 10.1007/s41666-024-00162-9.
Abstract
Biomedical relation extraction (RE) is critical in constructing high-quality knowledge graphs and databases as well as supporting many downstream text mining applications. This paper explores prompt tuning on biomedical RE and its few-shot scenarios, aiming to propose a simple yet effective model for this specific task. Prompt tuning reformulates natural language processing (NLP) downstream tasks into masked language problems by embedding specific text prompts into the original input, facilitating the adaption of pre-trained language models (PLMs) to better address these tasks. This study presents a customized prompt tuning model designed explicitly for biomedical RE, including its applicability in few-shot learning contexts. The model's performance was rigorously assessed using the chemical-protein relation (CHEMPROT) dataset from BioCreative VI and the drug-drug interaction (DDI) dataset from SemEval-2013, showcasing its superior performance over conventional fine-tuned PLMs across both datasets, encompassing few-shot scenarios. This observation underscores the effectiveness of prompt tuning in enhancing the capabilities of conventional PLMs, though the extent of enhancement may vary by specific model. Additionally, the model demonstrated a harmonious balance between simplicity and efficiency, matching state-of-the-art performance without needing external knowledge or extra computational resources. The pivotal contribution of our study is the development of a suitably designed prompt tuning model, highlighting prompt tuning's effectiveness in biomedical RE. It offers a robust, efficient approach to the field's challenges and represents a significant advancement in extracting complex relations from biomedical texts. Supplementary Information The online version contains supplementary material available at 10.1007/s41666-024-00162-9.
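As a rough illustration of the prompt-tuning idea described above, the sketch below wraps a sentence in a template with a [MASK] slot and lets a masked language model score a handful of verbalizer words that map to relation labels. The model name, template, and verbalizer are assumptions for illustration, not the configuration reported in the paper.

```python
# Minimal sketch of prompt-based relation classification with a masked LM.
# The checkpoint, template, and verbalizer are illustrative assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "bert-base-uncased"  # a biomedical PLM would be used in practice
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# Verbalizer: one vocabulary word per relation label. For this simple scoring
# to be meaningful, each word should be a single token in the PLM vocabulary.
verbalizer = {"blocks": "inhibition", "activates": "activation", "ignores": "no-relation"}

def classify(sentence: str, chemical: str, protein: str) -> str:
    # Wrap the original input in a prompt that turns RE into a cloze task.
    prompt = f"{sentence} In this sentence, {chemical} {tokenizer.mask_token} {protein}."
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_index]
    # Score only the verbalizer words and map the best one to its label.
    scores = {w: logits[tokenizer.convert_tokens_to_ids(w)].item() for w in verbalizer}
    return verbalizer[max(scores, key=scores.get)]

print(classify("Aspirin irreversibly inhibits COX-1.", "aspirin", "COX-1"))
```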
Affiliation(s)
- Jianping He
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Fang Li
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL USA
- Jianfu Li
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL USA
- Xinyue Hu
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL USA
- Yi Nian
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Yang Xiang
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Jingqi Wang
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Qiang Wei
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Yiming Li
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Hua Xu
- Department of Bioinformatics and Data Science, Yale School of Medicine, New Haven, CT USA
- Cui Tao
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL USA
5
Huang MS, Han JC, Lin PY, You YT, Tsai RTH, Hsu WL. Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource. Brief Bioinform 2024; 25:bbae132. PMID: 38609331; PMCID: PMC11014787; DOI: 10.1093/bib/bbae132.
Abstract
Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD's compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate several language models' performances on the PEDD. This paper's outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.
Affiliation(s)
- Ming-Siang Huang
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- National Institute of Cancer Research, National Health Research Institutes, Tainan, Taiwan
- Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
- Jen-Chieh Han
- Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Pei-Yen Lin
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Yu-Ting You
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Richard Tzong-Han Tsai
- Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Center for Geographic Information Science, Research Center for Humanities and Social Sciences, Academia Sinica, Taipei, Taiwan
- Wen-Lian Hsu
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
6
Whitton J, Hunter A. Automated tabulation of clinical trial results: A joint entity and relation extraction approach with transformer-based language representations. Artif Intell Med 2023; 144:102661. PMID: 37783549; DOI: 10.1016/j.artmed.2023.102661.
Abstract
Evidence-based medicine, the practice in which healthcare professionals refer to the best available evidence when making decisions, forms the foundation of modern healthcare. However, it relies on labour-intensive systematic reviews, where domain specialists must aggregate and extract information from thousands of publications, primarily of randomised controlled trial (RCT) results, into evidence tables. This paper investigates automating evidence table generation by decomposing the problem across two language processing tasks: named entity recognition, which identifies key entities within text, such as drug names, and relation extraction, which maps their relationships for separating them into ordered tuples. We focus on the automatic tabulation of sentences from published RCT abstracts that report the results of the study outcomes. Two deep neural net models were developed as part of a joint extraction pipeline, using the principles of transfer learning and transformer-based language representations. To train and test these models, a new gold-standard corpus was developed, comprising over 550 result sentences from six disease areas. This approach demonstrated significant advantages, with our system performing well across multiple natural language processing tasks and disease areas, as well as in generalising to disease domains unseen during training. Furthermore, we show these results were achievable through training our models on as few as 170 example sentences. The final system is a proof of concept that the generation of evidence tables can be semi-automated, representing a step towards fully automating systematic reviews.
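A minimal sketch of the tabulation step is given below: entities produced by an NER model and pairwise links produced by an RE model are joined into ordered tuples (table rows). The entity types, relation pairs, and example sentence are hypothetical, not the authors' schema.

```python
# Sketch of assembling evidence-table rows from hypothetical NER and RE output.
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    text: str
    etype: str  # e.g. "intervention", "outcome", "measure" (assumed types)

# Hypothetical NER output for one result sentence.
entities = [
    Entity("semaglutide", "intervention"),
    Entity("HbA1c", "outcome"),
    Entity("-1.5%", "measure"),
]

# Hypothetical RE output: (head, tail) pairs judged to be related.
relations = {("semaglutide", "HbA1c"), ("HbA1c", "-1.5%")}

def to_rows(entities, relations):
    """Join intervention -> outcome -> measure chains into ordered table rows."""
    rows = []
    for iv in (e for e in entities if e.etype == "intervention"):
        for oc in (e for e in entities if e.etype == "outcome"):
            if (iv.text, oc.text) not in relations:
                continue
            for ms in (e for e in entities if e.etype == "measure"):
                if (oc.text, ms.text) in relations:
                    rows.append((iv.text, oc.text, ms.text))
    return rows

print(to_rows(entities, relations))  # [('semaglutide', 'HbA1c', '-1.5%')]
```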
Affiliation(s)
- Jetsun Whitton
- Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK.
- Anthony Hunter
- Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK.
7
Ai X, Kavuluru R. End-to-End Models for Chemical-Protein Interaction Extraction: Better Tokenization and Span-Based Pipeline Strategies. IEEE International Conference on Healthcare Informatics 2023; 2023:610-618. PMID: 38274947; PMCID: PMC10809256; DOI: 10.1109/ichi57859.2023.00108.
Abstract
End-to-end relation extraction (E2ERE) is an important task in information extraction, more so for biomedicine as scientific literature continues to grow exponentially. E2ERE typically involves identifying entities (or named entity recognition (NER)) and associated relations, while most RE tasks simply assume that the entities are provided upfront and end up performing relation classification. E2ERE is inherently more difficult than RE alone given the potential snowball effect of errors from NER leading to more errors in RE. A complex dataset in biomedical E2ERE is the ChemProt dataset (BioCreative VI, 2017) that identifies relations between chemical compounds and genes/proteins in scientific literature. ChemProt is included in all recent biomedical natural language processing benchmarks including BLUE, BLURB, and BigBio. However, its treatment in these benchmarks and in other separate efforts is typically not end-to-end, with few exceptions. In this effort, we employ a span-based pipeline approach to produce a new state-of-the-art E2ERE performance on the ChemProt dataset, resulting in > 4% improvement in F1-score over the prior best effort. Our results indicate that a straightforward fine-grained tokenization scheme helps span-based approaches excel in E2ERE, especially with regards to handling complex named entities. Our error analysis also identifies a few key failure modes in E2ERE for ChemProt.
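One plausible reading of the fine-grained tokenization scheme is sketched below: splitting on every transition between letters, digits, and punctuation so that the boundaries of complex chemical and gene names fall on token boundaries. The exact scheme used in the paper may differ.

```python
# Rough sketch of fine-grained tokenization for span-based NER: separate
# alphabetic runs, digit runs, and individual punctuation characters.
import re

TOKEN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")

def fine_grained_tokenize(text):
    return TOKEN.findall(text)

print(fine_grained_tokenize("IL-2/IL-15Rbeta agonist ALT-803"))
# ['IL', '-', '2', '/', 'IL', '-', '15', 'Rbeta', 'agonist', 'ALT', '-', '803']
```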
Affiliation(s)
- Xuguang Ai
- Department of Computer Science, University of Kentucky, Lexington, USA
- Ramakanth Kavuluru
- Division of Biomedical Informatics, Dept. of Internal Medicine, University of Kentucky, Lexington, USA
8
Bokharaeian B, Dehghani M, Diaz A. Automatic extraction of ranked SNP-phenotype associations from text using a BERT-LSTM-based method. BMC Bioinformatics 2023; 24:144. PMID: 37046202; PMCID: PMC10099837; DOI: 10.1186/s12859-023-05236-w.
Abstract
Extraction of associations between single nucleotide polymorphisms (SNPs) and phenotypes from biomedical literature is a vital task in BioNLP. Recently, some methods have been developed to extract mutation-disease associations, but no accessible method for extracting SNP-phenotype associations from text considers their degree of certainty. In this paper, several machine learning methods were developed to extract ranked SNP-phenotype associations from biomedical abstracts and were compared with one another. These included shallow machine learning methods (random forest, logistic regression, and decision tree), two kernel-based methods (subtree and local context kernels), a rule-based method, a deep CNN-LSTM-based method, and two BERT-based methods. The experiments indicated that although the linguistic features used could support an association extraction method that outperforms the kernel-based counterparts, the deep learning and BERT-based methods exhibited the best performance, with the PubMedBERT-LSTM model performing best among the developed methods. Moreover, similar experiments were conducted to estimate the degree of certainty of the extracted associations, which can be used to assess the strength of a reported association. These experiments revealed that the proposed PubMedBERT-CNN-LSTM method outperformed the other sophisticated methods on this task.
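A hedged sketch of a BERT-CNN-LSTM classifier of the general shape described above is shown below; the checkpoint id, layer sizes, and label count are assumptions, not the authors' configuration.

```python
# Minimal BERT-CNN-LSTM sentence classifier sketch (assumed hyperparameters).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertCnnLstmClassifier(nn.Module):
    def __init__(self, encoder_name: str, num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.conv = nn.Conv1d(hidden, 128, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(128, 64, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 64, num_labels)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Conv1d expects (batch, channels, sequence), so transpose around it.
        feats = torch.relu(self.conv(states.transpose(1, 2))).transpose(1, 2)
        lstm_out, _ = self.lstm(feats)
        return self.classifier(lstm_out[:, 0, :])  # classify from the first position

name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(name)
model = BertCnnLstmClassifier(name)
batch = tokenizer(["rs12345 is associated with hypertension."],
                  return_tensors="pt", padding=True)
print(model(batch["input_ids"], batch["attention_mask"]).shape)  # torch.Size([1, 2])
```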
Affiliation(s)
- Mohammad Dehghani
- School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
- Alberto Diaz
- Facultad Informatica, Complutense University of Madrid, Madrid, Spain
9
Kumar A, Sharaff A. ABEE: automated bio entity extraction from biomedical text documents. Data Technologies and Applications 2022. DOI: 10.1108/dta-04-2022-0151.
Abstract
Purpose: The purpose of this study was to design a multitask learning model so that biomedical entities can be extracted from biomedical texts without any ambiguity. Design/methodology/approach: In the proposed automated bio entity extraction (ABEE) model, a multitask learning model has been introduced as a combination of single-task learning models. The model used Bidirectional Encoder Representations from Transformers to train the single-task learning models and then combined their outputs to identify the variety of entities in biomedical text. Findings: The proposed ABEE model targeted unique gene/protein, chemical and disease entities in biomedical text. The findings are important for biomedical research such as drug finding and clinical trials, as this work helps to reduce not only the effort of researchers but also the cost of new drug discoveries and new treatments. Research limitations/implications: As such, there are no limitations with the model, but the research team plans to test the model with gigabytes of data and to establish a knowledge graph so that researchers can easily estimate the entities of similar groups. Practical implications: The ABEE model will be helpful in various natural language processing tasks: in information extraction (IE) it plays an important role in biomedical named entity recognition and biomedical relation extraction, and it also supports information retrieval tasks such as literature-based knowledge discovery. Social implications: During the COVID-19 pandemic, the demand for this type of work increased because of the increase in clinical trials at that time. If this type of research had been introduced previously, it would have reduced the time and effort needed for new drug discoveries in this area. Originality/value: In this work we proposed a novel multitask learning model that is capable of extracting biomedical entities from biomedical text without any ambiguity. The proposed model achieved state-of-the-art performance in terms of precision, recall and F1 score.
10
Hou Y, Xia Y, Wu L, Xie S, Fan Y, Zhu J, Qin T, Liu TY. Discovering drug-target interaction knowledge from biomedical literature. Bioinformatics 2022; 38:5100-5107. PMID: 36205562; DOI: 10.1093/bioinformatics/btac648.
Abstract
MOTIVATION The interaction between drugs and targets (DTI) in human body plays a crucial role in biomedical science and applications. As millions of papers come out every year in the biomedical domain, automatically discovering DTI knowledge from biomedical literature, which are usually triplets about drugs, targets and their interaction, becomes an urgent demand in the industry. Existing methods of discovering biological knowledge are mainly extractive approaches that often require detailed annotations (e.g. all mentions of biological entities, relations between every two entity mentions, etc.). However, it is difficult and costly to obtain sufficient annotations due to the requirement of expert knowledge from biomedical domains. RESULTS To overcome these difficulties, we explore an end-to-end solution for this task by using generative approaches. We regard the DTI triplets as a sequence and use a Transformer-based model to directly generate them without using the detailed annotations of entities and relations. Further, we propose a semi-supervised method, which leverages the aforementioned end-to-end model to filter unlabeled literature and label them. Experimental results show that our method significantly outperforms extractive baselines on DTI discovery. We also create a dataset, KD-DTI, to advance this task and release it to the community. AVAILABILITY AND IMPLEMENTATION Our code and data are available at https://github.com/bert-nmt/BERT-DTI. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
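The generate-then-parse pattern behind this kind of generative extraction can be sketched as follows; the separator tokens are assumptions, not the exact target format used for KD-DTI.

```python
# Sketch of linearizing DTI triplets into a seq2seq target string and parsing
# a generated string back into (drug, target, interaction) tuples.
def linearize(triplets):
    return " <sep> ".join(f"{d} | {t} | {i}" for d, t, i in triplets)

def parse(generated: str):
    triplets = []
    for chunk in generated.split("<sep>"):
        parts = [p.strip() for p in chunk.split("|")]
        if len(parts) == 3 and all(parts):  # drop malformed generations
            triplets.append(tuple(parts))
    return triplets

target = linearize([("imatinib", "ABL1", "inhibitor")])
print(target)         # "imatinib | ABL1 | inhibitor"
print(parse(target))  # [('imatinib', 'ABL1', 'inhibitor')]
```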
Affiliation(s)
- Yutai Hou
- Harbin Institute of Technology, Harbin 150001, China
- Yingce Xia
- Microsoft Research, Beijing 100080, China
- Lijun Wu
- Microsoft Research, Beijing 100080, China
- Yang Fan
- University of Science and Technology of China, Hefei 230027, China
- Jinhua Zhu
- University of Science and Technology of China, Hefei 230027, China
- Tao Qin
- Microsoft Research, Beijing 100080, China
11
Nicholson DN, Himmelstein DS, Greene CS. Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts. BioData Min 2022; 15:26. PMID: 36258252; PMCID: PMC9578183; DOI: 10.1186/s13040-022-00311-z.
Abstract
BACKGROUND Knowledge graphs support biomedical research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an exponentially rising publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to annotate textual data automatically. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This bottleneck makes populating a knowledge graph with multiple nodes and edge types practically infeasible. Thus, we sought to accelerate the label function creation process by evaluating how label functions can be re-used across multiple edge types. RESULTS We obtained entity-tagged abstracts and subsetted these entities to only contain compounds, genes, and disease mentions. We extracted sentences containing co-mentions of certain biomedical entities contained in a previously described knowledge graph, Hetionet v1. We trained a baseline model that used database-only label functions and then used a sampling approach to measure how well adding edge-specific or edge-mismatch label function combinations improved over our baseline. Next, we trained a discriminator model to detect sentences that indicated a biomedical relationship and then estimated the number of edge types that could be recalled and added to Hetionet v1. We found that adding edge-mismatch label functions rarely improved relationship extraction, while control edge-specific label functions did. There were two exceptions to this trend, Compound-binds-Gene and Gene-interacts-Gene, which both indicated physical relationships and showed signs of transferability. Across the scenarios tested, discriminative model performance strongly depends on generated annotations. Using the best discriminative model for each edge type, we recalled close to 30% of established edges within Hetionet v1. CONCLUSIONS Our results show that this framework can incorporate novel edges into our source knowledge graph. However, results with label function transfer were mixed. Only label functions describing very similar edge types supported improved performance when transferred. We expect that the continued development of this strategy may provide essential building blocks to populating biomedical knowledge graphs with discoveries, ensuring that these resources include cutting-edge results.
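The label-function idea can be sketched briefly: simple programs vote on whether a sentence expresses an edge, and the votes are aggregated into a noisy label. The rules and the majority-vote aggregation below are illustrative assumptions, not the label functions or the generative model used in the study.

```python
# Toy label functions for Compound-binds-Gene sentences plus a naive
# majority-vote aggregator (illustrative only).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

KNOWN_CBG_PAIRS = {("imatinib", "ABL1")}  # stand-in for a database-backed check

def lf_database(sentence, compound, gene):
    return POSITIVE if (compound, gene) in KNOWN_CBG_PAIRS else ABSTAIN

def lf_binds_keyword(sentence, compound, gene):
    return POSITIVE if "binds" in sentence.lower() else ABSTAIN

def lf_no_interaction(sentence, compound, gene):
    return NEGATIVE if "does not interact" in sentence.lower() else ABSTAIN

LABEL_FUNCTIONS = [lf_database, lf_binds_keyword, lf_no_interaction]

def weak_label(sentence, compound, gene):
    votes = [lf(sentence, compound, gene) for lf in LABEL_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return POSITIVE if votes.count(POSITIVE) >= votes.count(NEGATIVE) else NEGATIVE

print(weak_label("Imatinib binds the ABL1 kinase domain.", "imatinib", "ABL1"))  # 1
```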
Affiliation(s)
- David N. Nicholson
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Daniel S. Himmelstein
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Casey S. Greene
- Department of Biomedical Informatics, University of Colorado School of Medicine and Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, USA
12
Kim H, Sung M, Yoon W, Park S, Kang J. Full-text chemical identification with improved generalizability and tagging consistency. Database (Oxford) 2022; 2022:6726385. PMID: 36170114; PMCID: PMC9518746; DOI: 10.1093/database/baac074.
Abstract
Chemical identification involves finding chemical entities in text (i.e. named entity recognition) and assigning unique identifiers to the entities (i.e. named entity normalization). While current models are developed and evaluated based on article titles and abstracts, their effectiveness has not been thoroughly verified in full text. In this paper, we identify two limitations of models in tagging full-text articles: (1) low generalizability to unseen mentions and (2) tagging inconsistency. We use simple training and post-processing methods to address the limitations such as transfer learning and mention-wise majority voting. We also present a hybrid model for the normalization task that utilizes the high recall of a neural model while maintaining the high precision of a dictionary model. In the BioCreative VII NLM-Chem track challenge, our best model achieves 86.72 and 78.31 F1 scores in named entity recognition and normalization, significantly outperforming the median (83.73 and 77.49 F1 scores) and taking first place in named entity recognition. In a post-challenge evaluation, we re-implement our model and obtain 84.70 F1 score in the normalization task, outperforming the best score in the challenge by 3.34 F1 score. Database URL: https://github.com/dmis-lab/bc7-chem-id
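Mention-wise majority voting, one of the post-processing steps mentioned above, can be sketched as follows; the data structures are illustrative assumptions.

```python
# Sketch of mention-wise majority voting for tagging consistency: every
# occurrence of the same mention string in a document receives the label most
# often predicted for that string.
from collections import Counter, defaultdict

# Hypothetical per-occurrence predictions: (mention text, predicted identifier)
predictions = [
    ("aspirin", "D001241"),
    ("aspirin", "D001241"),
    ("aspirin", "D000893"),   # one inconsistent prediction
    ("ibuprofen", "D007052"),
]

def majority_vote(preds):
    by_mention = defaultdict(Counter)
    for mention, label in preds:
        by_mention[mention][label] += 1
    consensus = {m: c.most_common(1)[0][0] for m, c in by_mention.items()}
    return [(m, consensus[m]) for m, _ in preds]

print(majority_vote(predictions))
# every "aspirin" occurrence is re-labelled D001241
```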
Affiliation(s)
- Hyunjae Kim
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Mujeen Sung
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Wonjin Yoon
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Sungjoon Park
- Department of Medicine, University of California, San Diego, CA, USA
- Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- AIGEN Sciences, Seoul, South Korea
13
He J, Li F, Hu X, Li J, Nian Y, Wang J, Xiang Y, Wei Q, Xu H, Tao C. Chemical-Protein Relation Extraction with Pre-trained Prompt Tuning. Proceedings of the IEEE International Conference on Healthcare Informatics 2022; 2022:608-609. PMID: 37664001; PMCID: PMC10474649; DOI: 10.1109/ichi54592.2022.00120.
Abstract
Biomedical relation extraction plays a critical role in the construction of high-quality knowledge graphs and databases, which can further support many downstream applications. Pre-trained prompt tuning, as a new paradigm, has shown great potential in many natural language processing (NLP) tasks. Through inserting a piece of text into the original input, prompt converts NLP tasks into masked language problems, which could be better addressed by pre-trained language models (PLMs). In this study, we applied pre-trained prompt tuning to chemical-protein relation extraction using the BioCreative VI CHEMPROT dataset. The experiment results showed that the pre-trained prompt tuning outperformed the baseline approach in chemical-protein interaction classification. We conclude that the prompt tuning can improve the efficiency of the PLMs on chemical-protein relation extraction tasks.
Affiliation(s)
- Jianping He
- School of Biomedical Informatics UTHealth Houston, USA
- Fang Li
- School of Biomedical Informatics UTHealth Houston, USA
- Xinyue Hu
- School of Biomedical Informatics UTHealth Houston, USA
- Jianfu Li
- School of Biomedical Informatics UTHealth Houston, USA
- Yi Nian
- School of Biomedical Informatics UTHealth Houston, USA
- Jingqi Wang
- School of Biomedical Informatics UTHealth Houston, USA
- Yang Xiang
- School of Biomedical Informatics UTHealth Houston, USA
- Qiang Wei
- School of Biomedical Informatics UTHealth Houston, USA
- Hua Xu
- School of Biomedical Informatics UTHealth Houston, USA
- Cui Tao
- School of Biomedical Informatics UTHealth Houston, USA
14
Brincat A, Hofmann M. Automated extraction of genes associated with antibiotic resistance from the biomedical literature. Database (Oxford) 2022; 2022:6520791. PMID: 35134132; PMCID: PMC9263533; DOI: 10.1093/database/baab077.
Abstract
The detection of bacterial antibiotic resistance phenotypes is important when carrying out clinical decisions for patient treatment. Conventional phenotypic testing involves culturing bacteria which requires a significant amount of time and work. Whole-genome sequencing is emerging as a fast alternative to resistance prediction, by considering the presence/absence of certain genes. A lot of research has focused on determining which bacterial genes cause antibiotic resistance and efforts are being made to consolidate these facts in knowledge bases (KBs). KBs are usually manually curated by domain experts to be of the highest quality. However, this limits the pace at which new facts are added. Automated relation extraction of gene-antibiotic resistance relations from the biomedical literature is one solution that can simplify the curation process. This paper reports on the development of a text mining pipeline that takes in English biomedical abstracts and outputs genes that are predicted to cause resistance to antibiotics. To test the generalisability of this pipeline it was then applied to predict genes associated with Helicobacter pylori antibiotic resistance, that are not present in common antibiotic resistance KBs or publications studying H. pylori. These genes would be candidates for further lab-based antibiotic research and inclusion in these KBs. For relation extraction, state-of-the-art deep learning models were used. These models were trained on a newly developed silver corpus which was generated by distant supervision of abstracts using the facts obtained from KBs. The top performing model was superior to a co-occurrence model, achieving a recall of 95%, a precision of 60% and F1-score of 74% on a manually annotated holdout dataset. To our knowledge, this project was the first attempt at developing a complete text mining pipeline that incorporates deep learning models to extract gene-antibiotic resistance relations from the literature. Additional related data can be found at https://github.com/AndreBrincat/Gene-Antibiotic-Resistance-Relation-Extraction
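The distant-supervision step used to build such a silver corpus can be sketched as follows; the KB facts and sentences are illustrative assumptions.

```python
# Sketch of distant supervision: sentences co-mentioning a gene and an
# antibiotic that form a known resistance pair in the KB become positive
# examples, other co-mentions become negative examples.
KB_RESISTANCE_PAIRS = {("gyrA", "ciprofloxacin"), ("rpoB", "rifampicin")}

sentences = [
    "Mutations in gyrA confer resistance to ciprofloxacin in E. coli.",
    "Expression of gyrA was measured after amoxicillin exposure.",
]

def silver_label(sentence, gene, antibiotic):
    if gene in sentence and antibiotic in sentence:
        label = 1 if (gene, antibiotic) in KB_RESISTANCE_PAIRS else 0
        return (sentence, gene, antibiotic, label)
    return None  # no co-mention, no silver example

corpus = [ex for s in sentences
          for ex in [silver_label(s, "gyrA", "ciprofloxacin"),
                     silver_label(s, "gyrA", "amoxicillin")]
          if ex is not None]
print(corpus)  # one positive and one negative example
```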
Affiliation(s)
- Andre Brincat
- Department of Informatics, TU Dublin, Blanchardstown Campus, Dublin D15 YV78, Ireland
- Markus Hofmann
- Department of Informatics, TU Dublin, Blanchardstown Campus, Dublin D15 YV78, Ireland
15
Zanoli R, Lavelli A, Löffler T, Perez Gonzalez NA, Rinaldi F. An annotated dataset for extracting gene-melanoma relations from scientific literature. J Biomed Semantics 2022; 13:2. PMID: 35045882; PMCID: PMC8772125; DOI: 10.1186/s13326-021-00251-3.
Abstract
Background
Melanoma is one of the least common but the deadliest of skin cancers. This cancer begins when the genes of a cell suffer damage or fail, and identifying the genes involved in melanoma is crucial for understanding the melanoma tumorigenesis. Thousands of publications about human melanoma appear every year. However, while biological curation of data is costly and time-consuming, to date the application of machine learning for gene-melanoma relation extraction from text has been severely limited by the lack of annotated resources.
Results
To overcome this lack of resources for melanoma, we have exploited the information of the Melanoma Gene Database (MGDB, a manually curated database of genes involved in human melanoma) to automatically build an annotated dataset of binary relations between gene and melanoma entities occurring in PubMed abstracts. The entities were automatically annotated by state-of-the-art text-mining tools. Their annotation includes both the mention text spans and normalized concept identifiers. The relations among the entities were annotated at concept- and mention-level. The concept-level annotation was produced using the information of the genes in MGDB to decide if a relation holds between a gene and melanoma concept in the whole abstract. The exploitability of this dataset was tested with both traditional machine learning, and neural network-based models like BERT. The models were then used to automatically extract gene-melanoma relations from the biomedical literature. Most of the current models use context-aware representations of the target entities to establish relations between them. To facilitate researchers in their experiments we generated a mention-level annotation in support to the concept-level annotation. The mention-level annotation was generated by automatically linking gene and melanoma mentions co-occurring within the sentences that in MGDB establish the association of the gene with melanoma.
Conclusions
This paper presents a corpus containing gene-melanoma annotated relations. Additionally, it discusses experiments which show the usefulness of such a corpus for training a system capable of mining gene-melanoma relationships from the literature. Researchers can use the corpus to develop and compare their own models, and produce results which might be integrated with existing structured knowledge databases, which in turn might facilitate medical research.
16
Bhasuran B. BioBERT and Similar Approaches for Relation Extraction. Methods Mol Biol 2022; 2496:221-235. PMID: 35713867; DOI: 10.1007/978-1-0716-2305-3_12.
Abstract
In biomedicine, facts about relations between entities (disease, gene, drug, etc.) are hidden in the large trove of 30 million scientific publications. Curated information of this kind has proven to play an important role in various applications such as drug repurposing and precision medicine. Recently, owing to advances in deep learning, a transformer architecture named BERT (Bidirectional Encoder Representations from Transformers) was proposed. This pretrained language model, trained on the BooksCorpus (800M words) and English Wikipedia (2500M words), reported state-of-the-art results in various NLP (Natural Language Processing) tasks, including relation extraction. It is a widely accepted notion that, due to word distribution shift, general-domain models exhibit poor performance on information extraction tasks in the biomedical domain. For this reason, the architecture was later adapted to the biomedical domain by training the language models on 28 million scientific articles from PubMed and PubMed Central. This chapter presents a protocol for relation extraction using BERT and discusses the state of the art for BERT versions in the biomedical domain, such as BioBERT. The protocol emphasizes the general BERT architecture, pretraining and fine-tuning, leveraging biomedical information, and finally knowledge graph infusion into the BERT model layers.
Affiliation(s)
- Balu Bhasuran
- DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India.
- Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA.
17
Pourreza Shahri M, Kahanda I. Deep semi-supervised learning ensemble framework for classifying co-mentions of human proteins and phenotypes. BMC Bioinformatics 2021; 22:500. PMID: 34656098; PMCID: PMC8520253; DOI: 10.1186/s12859-021-04421-z.
Abstract
Background Identifying human protein-phenotype relationships has attracted researchers in bioinformatics and biomedical natural language processing due to its importance in uncovering rare and complex diseases. Since experimental validation of protein-phenotype associations is prohibitive, automated tools capable of accurately extracting these associations from the biomedical text are in high demand. However, while the manual annotation of protein-phenotype co-mentions required for training such models is highly resource-consuming, extracting millions of unlabeled co-mentions is straightforward. Results In this study, we propose a novel deep semi-supervised ensemble framework that combines deep neural networks, semi-supervised, and ensemble learning for classifying human protein-phenotype co-mentions with the help of unlabeled data. This framework allows the ability to incorporate an extensive collection of unlabeled sentence-level co-mentions of human proteins and phenotypes with a small labeled dataset to enhance overall performance. We develop PPPredSS, a prototype of our proposed semi-supervised framework that combines sophisticated language models, convolutional networks, and recurrent networks. Our experimental results demonstrate that the proposed approach provides a new state-of-the-art performance in classifying human protein-phenotype co-mentions by outperforming other supervised and semi-supervised counterparts. Furthermore, we highlight the utility of PPPredSS in powering a curation assistant system through case studies involving a group of biologists. Conclusions This article presents a novel approach for human protein-phenotype co-mention classification based on deep, semi-supervised, and ensemble learning. The insights and findings from this work have implications for biomedical researchers, biocurators, and the text mining community working on biomedical relationship extraction.
Affiliation(s)
- Indika Kahanda
- School of Computing, University of North Florida, Jacksonville, USA.
18
Warikoo N, Chang YC, Hsu WL. LBERT: Lexically aware Transformer-based Bidirectional Encoder Representation model for learning universal bio-entity relations. Bioinformatics 2021; 37:404-412. PMID: 32810217; DOI: 10.1093/bioinformatics/btaa721.
Abstract
MOTIVATION Natural Language Processing techniques are constantly being advanced to accommodate the influx of data as well as to provide exhaustive and structured knowledge dissemination. Within the biomedical domain, relation detection between bio-entities known as the Bio-Entity Relation Extraction (BRE) task has a critical function in knowledge structuring. Although recent advances in deep learning-based biomedical domain embedding have improved BRE predictive analytics, these works are often task selective or use external knowledge-based pre-/post-processing. In addition, deep learning-based models do not account for local syntactic contexts, which have improved data representation in many kernel classifier-based models. In this study, we propose a universal BRE model, i.e. LBERT, which is a Lexically aware Transformer-based Bidirectional Encoder Representation model, and which explores both local and global contexts representations for sentence-level classification tasks. RESULTS This article presents one of the most exhaustive BRE studies ever conducted over five different bio-entity relation types. Our model outperforms state-of-the-art deep learning models in protein-protein interaction (PPI), drug-drug interaction and protein-bio-entity relation classification tasks by 0.02%, 11.2% and 41.4%, respectively. LBERT representations show a statistically significant improvement over BioBERT in detecting true bio-entity relation for large corpora like PPI. Our ablation studies clearly indicate the contribution of the lexical features and distance-adjusted attention in improving prediction performance by learning additional local semantic context along with bi-directionally learned global context. AVAILABILITY AND IMPLEMENTATION Github. https://github.com/warikoone/LBERT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Neha Warikoo
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei 112, Taiwan
- Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei 115, Taiwan
- Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
- Yung-Chun Chang
- Graduate Institute of Data Science, College of Management, Taipei Medical University, Taipei 106, Taiwan
- Clinical Big Data Research Center, Taipei Medical University, Taipei 110, Taiwan
- Pervasive AI Research Labs, Ministry of Science and Technology, Hsinchu City 300, Taiwan
- Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
- Pervasive AI Research Labs, Ministry of Science and Technology, Hsinchu City 300, Taiwan
19
Shao Y, Li H, Gu J, Qian L, Zhou G. Extraction of causal relations based on SBEL and BERT model. Database (Oxford) 2021; 2021:6133143. PMID: 33570092; PMCID: PMC7904051; DOI: 10.1093/database/baab005.
Abstract
Extraction of causal relations between biomedical entities in the form of Biological Expression Language (BEL) poses a new challenge to the community of biomedical text mining due to the complexity of BEL statements. We propose a simplified form of BEL statements [Simplified Biological Expression Language (SBEL)] to facilitate BEL extraction and employ BERT (Bidirectional Encoder Representation from Transformers) to improve the performance of causal relation extraction (RE). On the one hand, BEL statement extraction is transformed into the extraction of an intermediate form—SBEL statement, which is then further decomposed into two subtasks: entity RE and entity function detection. On the other hand, we use a powerful pretrained BERT model to both extract entity relations and detect entity functions, aiming to improve the performance of two subtasks. Entity relations and functions are then combined into SBEL statements and finally merged into BEL statements. Experimental results on the BioCreative-V Track 4 corpus demonstrate that our method achieves the state-of-the-art performance in BEL statement extraction with F1 scores of 54.8% in Stage 2 evaluation and of 30.1% in Stage 1 evaluation, respectively. Database URL: https://github.com/grapeff/SBEL_datasets
Affiliation(s)
- Yifan Shao
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
- Haoru Li
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
- Jinghang Gu
- Department of Chinese & Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China, 999077
- Longhua Qian
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
- Guodong Zhou
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
20
Abstract
Chemometrics plays a critical role in biosensor-based detection, analysis, and diagnosis. Nowadays, as a branch of artificial intelligence (AI), machine learning (ML) has achieved impressive advances. However, novel advanced ML methods, especially deep learning, which is famous for image analysis, facial recognition, and speech recognition, have remained relatively elusive to the biosensor community. Herein, how ML can benefit biosensors is systematically discussed. The advantages and drawbacks of the most popular ML algorithms are summarized on the basis of sensing data analysis. In particular, deep learning methods such as the convolutional neural network (CNN) and recurrent neural network (RNN) are emphasized. Diverse ML-assisted electrochemical biosensors, wearable electronics, SERS and other spectra-based biosensors, fluorescence biosensors, and colorimetric biosensors are comprehensively discussed. Furthermore, biosensor networks and multibiosensor data fusion are introduced. This review will nicely bridge ML with biosensors and greatly expand chemometrics for detection, analysis, and diagnosis.
Affiliation(s)
- Feiyun Cui
- Department of Chemical Engineering, Worcester Polytechnic Institute, 100 Institute Road, Worcester, Massachusetts 01609, United States
- Yun Yue
- Department of Electrical & Computer Engineering, Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
- Yi Zhang
- Department of Biomedical Engineering, University of Connecticut, Storrs, Connecticut 06269, United States
- Ziming Zhang
- Department of Electrical & Computer Engineering, Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
- H. Susan Zhou
- Department of Chemical Engineering, Worcester Polytechnic Institute, 100 Institute Road, Worcester, Massachusetts 01609, United States
21
22
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36:1234-1240. PMID: 31501885; PMCID: PMC7703786; DOI: 10.1093/bioinformatics/btz682.
Abstract
MOTIVATION Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. RESULTS We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts. AVAILABILITY AND IMPLEMENTATION We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.
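A minimal sketch of reusing a BioBERT checkpoint for a downstream relation-classification task through the Hugging Face transformers API is shown below; the hub id and label count are assumptions, and the official weights and fine-tuning code are in the repositories linked above.

```python
# Loading a BioBERT-style checkpoint with a fresh classification head.
# Fine-tuning on labelled RE data (e.g. ChemProt) is required before the
# scores below are meaningful.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # assumed hub id for BioBERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

text = "Aspirin inhibits cyclooxygenase-1."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))
```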
Affiliation(s)
- Jinhyuk Lee
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Wonjin Yoon
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Sungdong Kim
- Clova AI Research, Naver Corp, Seong-Nam 13561, Korea
- Donghyeon Kim
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Chan Ho So
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul 02841, Korea
- Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul 02841, Korea
23
Liu X, Fan J, Dong S. Document-Level Biomedical Relation Extraction Leveraging Pretrained Self-Attention Structure and Entity Replacement: Algorithm and Pretreatment Method Validation Study. JMIR Med Inform 2020; 8:e17644. [PMID: 32469325 PMCID: PMC7314385 DOI: 10.2196/17644] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2019] [Revised: 03/02/2020] [Accepted: 03/19/2020] [Indexed: 01/26/2023] Open
Abstract
Background Most current methods for intrasentence relation extraction in the biomedical literature are inadequate for document-level relation extraction, in which the relationship may cross sentence boundaries. Hence, some approaches have been proposed that extract relations by splitting document-level datasets through heuristic rules and learning methods. However, these approaches can introduce additional noise and do not really solve the problem of intersentence relation extraction. It is challenging to avoid noise while extracting cross-sentence relations. Objective This study aimed to avoid the errors introduced by dividing the document-level dataset, to verify that a self-attention structure can extract biomedical relations from a document with long-distance dependencies and complex semantics, and to discuss the relative benefits of different entity pretreatment methods for biomedical relation extraction. Methods This paper proposes a new data preprocessing method and attempts to apply a pretrained self-attention structure to document-level biomedical relation extraction, using an entity replacement method to capture very long-distance dependencies and complex semantics. Results Compared with state-of-the-art approaches, our method greatly improved precision and increased the F1 score. Through experiments with biomedical entity pretreatments, we found that a model using an entity replacement method can improve performance. Conclusions When all target entity pairs in a document-level dataset are considered as a whole, a pretrained self-attention structure is suitable for capturing very long-distance dependencies and learning the textual context and complicated semantics. A replacement method for biomedical entities is conducive to biomedical relation extraction, especially document-level relation extraction.
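A minimal sketch of the entity replacement pretreatment described here: mentions of the two candidate entities are substituted with generic placeholder tokens before the text is passed to a pretrained self-attention encoder. The placeholder strings, helper function, and toy sentences are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of the "entity replacement" idea: swap every mention of the head/tail
# candidate entities for generic placeholder tokens so the encoder sees the pair
# positions rather than the surface names. Placeholders and sentences are invented.
import re

def replace_entities(text, head_mentions, tail_mentions,
                     head_token="@CHEMICAL$", tail_token="@DISEASE$"):
    """Replace every listed mention of the head/tail entities with placeholder tokens."""
    for mention in sorted(head_mentions, key=len, reverse=True):
        text = re.sub(re.escape(mention), head_token, text)
    for mention in sorted(tail_mentions, key=len, reverse=True):
        text = re.sub(re.escape(mention), tail_token, text)
    return text

doc = ("Tamoxifen is widely used in breast cancer treatment. "
       "Long-term tamoxifen use has been associated with endometrial cancer.")
print(replace_entities(doc, ["Tamoxifen", "tamoxifen"], ["endometrial cancer"]))
```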
Affiliation(s)
- Xiaofeng Liu
- Communication and Computer Network Key Laboratory of Guangdong, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Jianye Fan
- Communication and Computer Network Key Laboratory of Guangdong, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Shoubin Dong
- Communication and Computer Network Key Laboratory of Guangdong, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
24
Döring K, Qaseem A, Becer M, Li J, Mishra P, Gao M, Kirchner P, Sauter F, Telukunta KK, Moumbock AFA, Thomas P, Günther S. Automated recognition of functional compound-protein relationships in literature. PLoS One 2020; 15:e0220925. [PMID: 32126064 PMCID: PMC7053725 DOI: 10.1371/journal.pone.0220925] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2019] [Accepted: 01/29/2020] [Indexed: 11/18/2022] Open
Abstract
MOTIVATION Much effort has been invested in identifying protein-protein interactions using text mining and machine learning methods. The extraction of functional relationships between chemical compounds and proteins from the literature has received much less attention, and no ready-to-use open-source software has so far been available for this task. METHOD We created a new benchmark dataset of 2,613 sentences from abstracts containing annotations of proteins, small molecules, and their relationships. Two kernel methods, the shallow linguistic kernel and the all-paths graph kernel, were applied to classify these relationships as functional or non-functional. Furthermore, the benefit of interaction verbs in sentences was evaluated. RESULTS In cross-validation on our benchmark dataset, the all-paths graph kernel (AUC value: 84.6%, F1 score: 79.0%) performs slightly better than the shallow linguistic kernel (AUC value: 82.5%, F1 score: 77.2%). Both models achieve state-of-the-art performance in the research area of relation extraction. Furthermore, combining the shallow linguistic and all-paths graph kernels could further increase overall performance slightly. We used each of the two kernels to identify functional relationships in all 29 million PubMed abstracts and provide the results, including the recorded processing time. AVAILABILITY The software for the tested kernels, the benchmark, the processed 29 million PubMed abstracts, all evaluation scripts, as well as the scripts for processing the complete PubMed database are freely available at https://github.com/KerstenDoering/CPI-Pipeline.
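The sketch below gives a heavily simplified picture of this classification setup: sentences containing a compound-protein pair are mapped to shallow lexical features and labeled functional or non-functional with an SVM. It is a bag-of-words stand-in, far simpler than the shallow linguistic and all-paths graph kernels evaluated in the paper, and the toy sentences and labels are invented for illustration.

```python
# Reduced sketch: classify candidate sentences as functional (1) or non-functional (0)
# compound-protein relationships using shallow word/bigram features and a linear SVM.
# The real kernels in the paper use richer linguistic and dependency-graph information.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

sentences = [
    "Imatinib inhibits the kinase activity of ABL1.",       # functional
    "Aspirin irreversibly acetylates COX-1.",               # functional
    "Glucose and insulin were both measured in plasma.",    # non-functional
    "The buffer contained EDTA and bovine serum albumin.",  # non-functional
]
labels = [1, 1, 0, 0]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), SVC(kernel="linear"))
clf.fit(sentences, labels)
print(clf.predict(["Dasatinib blocks SRC kinase signalling."]))
```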
Affiliation(s)
- Kersten Döring
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Ammar Qaseem
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Michael Becer
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Jianyu Li
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Pankaj Mishra
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Mingjie Gao
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Pascal Kirchner
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Florian Sauter
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Kiran K. Telukunta
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Aurélien F. A. Moumbock
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Stefan Günther
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
25
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36:1234-1240. [PMID: 31501885 DOI: 10.48550/arxiv.1901.08746] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Received: 05/16/2019] [Revised: 07/29/2019] [Accepted: 09/05/2019] [Indexed: 05/20/2023]
26
Lung PY, He Z, Zhao T, Yu D, Zhang J. Extracting chemical-protein interactions from literature using sentence structure analysis and feature engineering. Database (Oxford) 2019; 2019:5280305. [PMID: 30624652 PMCID: PMC6323317 DOI: 10.1093/database/bay138] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2018] [Revised: 12/04/2018] [Accepted: 12/06/2018] [Indexed: 12/14/2022]
Abstract
Information about the interactions between chemical compounds and proteins is indispensable for understanding the regulation of biological processes and the development of therapeutic drugs. Manually extracting such information from the biomedical literature is very time- and resource-consuming. In this study, we propose a computational method to automatically extract chemical-protein interactions (CPIs) from a given text. Our method extracts CPI pairs and CPI triplets from sentences, where a CPI pair consists of a chemical compound and a protein name, and a CPI triplet consists of a CPI pair along with an interaction word describing their relationship. We extracted a diverse set of features from sentences that were used to build multiple machine learning models. Our models contain both simple features, which can be computed directly from sentences, and more sophisticated features derived using sentence structure analysis techniques. For example, one set of features was extracted based on the shortest paths between the CPI pairs or among the CPI triplets in the dependency graphs obtained from sentence parsing. We designed a three-stage approach to predict the multiple categories of CPIs. Our method performed best among systems that use non-deep-learning methods and outperformed several deep-learning-based systems in track 5 of the BioCreative VI challenge. The features we designed in this study are informative and can be applied to other machine learning methods, including deep learning.
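One of the feature families mentioned here, the shortest path between a chemical and a protein mention in the sentence's dependency graph, can be sketched as follows with spaCy and networkx. The example sentence, the chosen tokens, and the small English model are assumptions for illustration, not the authors' actual pipeline.

```python
# Sketch: extract the shortest dependency-graph path between a chemical and a protein
# mention. Requires spaCy plus the small English model
# (python -m spacy download en_core_web_sm); sentence and target tokens are invented.
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Imatinib potently inhibits the BCR-ABL tyrosine kinase.")

# Build an undirected graph over token indices using head -> child dependency edges.
graph = nx.Graph()
for token in doc:
    for child in token.children:
        graph.add_edge(token.i, child.i)

chemical = next(t.i for t in doc if t.text == "Imatinib")
protein = next(t.i for t in doc if t.text == "kinase")
path = nx.shortest_path(graph, source=chemical, target=protein)
print([doc[i].text for i in path])   # tokens on the shortest dependency path
```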
Affiliation(s)
- Pei-Yau Lung
- Department of Statistics, Florida State University, Tallahassee, FL, USA
- Zhe He
- School of Information, Florida State University, Tallahassee, FL, USA
- Tingting Zhao
- Department of Geography, Florida State University, Tallahassee, FL, USA
- Disa Yu
- Department of Statistics, Florida State University, Tallahassee, FL, USA
- Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL, USA
27
Antunes R, Matos S. Extraction of chemical-protein interactions from the literature using neural networks and narrow instance representation. Database (Oxford) 2019; 2019:baz095. [PMID: 31622463 PMCID: PMC6796919 DOI: 10.1093/database/baz095] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2018] [Revised: 06/28/2019] [Accepted: 07/01/2019] [Indexed: 01/21/2023]
Abstract
The scientific literature contains large amounts of information on genes, proteins, chemicals and their interactions. Extraction and integration of this information in curated knowledge bases help researchers support their experimental results, leading to new hypotheses and discoveries. This is especially relevant for precision medicine, which aims to understand the individual variability across patient groups in order to select the most appropriate treatments. Methods for improved retrieval and automatic relation extraction from biomedical literature are therefore required for collecting structured information from the growing number of published works. In this paper, we follow a deep learning approach for extracting mentions of chemical-protein interactions from biomedical articles, based on various enhancements over our participation in the BioCreative VI CHEMPROT task. A significant aspect of our best method is the use of a simple deep learning model together with a very narrow representation of the relation instances, using only up to 10 words from the shortest dependency path and the respective dependency edges. Bidirectional long short-term memory recurrent networks or convolutional neural networks are used to build the deep learning models. We report the results of several experiments and show that our best model is competitive with more complex sentence representations or network structures, achieving an F1-score of 0.6306 on the test set. The source code of our work, along with detailed statistics, is publicly available.
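A minimal sketch of the narrow-instance idea described here: only the tokens on the shortest dependency path (capped at 10) are embedded and run through a bidirectional LSTM whose final states feed a relation classifier. Vocabulary size, dimensions, and the number of relation classes are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: a bidirectional LSTM over at most 10 shortest-dependency-path token ids,
# followed by a linear relation classifier. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class PathBiLSTM(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=64, n_classes=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, path_ids):              # path_ids: (batch, <=10) token ids
        embedded = self.embed(path_ids)
        _, (h_n, _) = self.lstm(embedded)     # h_n: (2, batch, hidden_dim)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.out(h)

if __name__ == "__main__":
    model = PathBiLSTM()
    dummy_paths = torch.randint(1, 5000, (4, 10))   # 4 instances, 10 path tokens each
    print(model(dummy_paths).shape)                 # torch.Size([4, 6])
```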
Affiliation(s)
- Rui Antunes
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Sérgio Matos
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal