Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Lee K, Lee S, Park S, Kim S, Kim S, Choi K, Tan AC, Kang J. BRONCO: Biomedical entity Relation ONcology COrpus for extracting gene-variant-disease-drug relations. Database (Oxford) 2016;2016:baw043. [PMID: 27074804 PMCID: PMC4830473 DOI: 10.1093/database/baw043] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 10/03/2015] [Accepted: 03/09/2016] [Indexed: 12/31/2022]

Cited by Other Article(s)

Li J, Pan D, Yang Z, Sun Y, Lin H, Wang J. JTIS: enhancing biomedical document-level relation extraction through joint training with intermediate steps. Database (Oxford) 2024;2024:baae125. [PMID: 39700498 PMCID: PMC11658465 DOI: 10.1093/database/baae125] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2024] [Revised: 11/18/2024] [Accepted: 12/08/2024] [Indexed: 12/21/2024]

Nastou K, Mehryary F, Ohta T, Luoma J, Pyysalo S, Jensen LJ. RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature. Database (Oxford) 2024;2024:baae095. [PMID: 39265993 PMCID: PMC11394941 DOI: 10.1093/database/baae095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Revised: 07/31/2024] [Accepted: 08/16/2024] [Indexed: 09/14/2024]

Huang DL, Zeng Q, Xiong Y, Liu S, Pang C, Xia M, Fang T, Ma Y, Qiang C, Zhang Y, Zhang Y, Li H, Yuan Y. A Combined Manual Annotation and Deep-Learning Natural Language Processing Study on Accurate Entity Extraction in Hereditary Disease Related Biomedical Literature. Interdiscip Sci 2024;16:333-344. [PMID: 38340264 PMCID: PMC11289304 DOI: 10.1007/s12539-024-00605-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 01/02/2024] [Accepted: 01/03/2024] [Indexed: 02/12/2024]

Huang MS, Han JC, Lin PY, You YT, Tsai RTH, Hsu WL. Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource. Brief Bioinform 2024;25:bbae132. [PMID: 38609331 PMCID: PMC11014787 DOI: 10.1093/bib/bbae132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 11/06/2023] [Accepted: 03/02/2023] [Indexed: 04/14/2024]

Yao X, He Z, Liu Y, Wang Y, Ouyang S, Xia J. Cancer-Alterome: a literature-mined resource for regulatory events caused by genetic alterations in cancer. Sci Data 2024;11:265. [PMID: 38431735 PMCID: PMC10908799 DOI: 10.1038/s41597-024-03083-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Accepted: 02/20/2024] [Indexed: 03/05/2024]

Lyons EL, Watson D, Alodadi MS, Haugabook SJ, Tawa GJ, Hannah-Shmouni F, Porter FD, Collins JR, Ottinger EA, Mudunuri US. Rare disease variant curation from literature: assessing gaps with creatine transport deficiency in focus. BMC Genomics 2023;24:460. [PMID: 37587458 PMCID: PMC10433598 DOI: 10.1186/s12864-023-09561-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/04/2023] [Accepted: 08/08/2023] [Indexed: 08/18/2023]

Abstract

BACKGROUND

Approximately 4-8% of the world suffers from a rare disease. Rare diseases are often difficult to diagnose, and many do not have approved therapies. Genetic sequencing has the potential to shorten the current diagnostic process, increase mechanistic understanding, and facilitate research on therapeutic approaches but is limited by the difficulty of novel variant pathogenicity interpretation and the communication of known causative variants. It is unknown how many published rare disease variants are currently accessible in the public domain.

RESULTS

This study investigated the translation of knowledge of variants reported in published manuscripts to publicly accessible variant databases. Variants, symptoms, biochemical assay results, and protein function from literature on the SLC6A8 gene associated with X-linked Creatine Transporter Deficiency (CTD) were curated and reported as a highly annotated dataset of variants with clinical context and functional details. Variants were harmonized, their availability in existing variant databases was analyzed and pathogenicity assignments were compared with impact algorithm predictions. 24% of the pathogenic variants found in PubMed articles were not captured in any database used in this analysis while only 65% of the published variants received an accurate pathogenicity prediction from at least one impact prediction algorithm.

CONCLUSIONS

Despite being published in the literature, pathogenicity data on patient variants may remain inaccessible for genetic diagnosis, therapeutic target identification, mechanistic understanding, or hypothesis generation. Clinical and functional details presented in the literature are important to make pathogenicity assessments. Impact predictions remain imperfect but are improving, especially for single nucleotide exonic variants; however, such predictions are less accurate or unavailable for intronic and multi-nucleotide variants. Developing text mining workflows that use natural language processing to identify diseases, genes and variants, combined with impact prediction algorithms and integrated with details on clinical phenotypes and functional assessments, might be a promising approach to scaling literature mining of variants and assigning correct pathogenicity. The curated variant list created by this effort includes context details to improve any such efforts on variant curation for rare diseases.

Luo L, Lai PT, Wei CH, Arighi CN, Lu Z. BioRED: a rich biomedical relation extraction dataset. Brief Bioinform 2022;23:6645993. [PMID: 35849818 PMCID: PMC9487702 DOI: 10.1093/bib/bbac282] [Citation(s) in RCA: 48] [Impact Index Per Article: 16.0] [Received: 04/17/2022] [Revised: 06/02/2022] [Accepted: 06/19/2022] [Indexed: 11/13/2022]

Becker TE, Jakobsson E. ResidueFinder: extracting individual residue mentions from protein literature. J Biomed Semantics 2021;12:14. [PMID: 34289903 PMCID: PMC8293528 DOI: 10.1186/s13326-021-00243-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2019] [Accepted: 05/07/2021] [Indexed: 11/10/2022]

Abstract

Background

The revolution in molecular biology has shown how protein function and structure are based on specific sequences of amino acids. Thus, an important feature in many papers is the mention of the significance of individual amino acids in the context of the entire sequence of the protein. MutationFinder is a widely used program for finding mentions of specific mutations in texts. We report on augmenting the positive attributes of MutationFinder with a more inclusive regular expression list to create ResidueFinder, which finds mentions of native amino acids as well as mutations. We also consider parameter options for both ResidueFinder and MutationFinder to explore trade-offs between precision, recall, and computational efficiency. We test our methods and software in full text as well as abstracts.
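
As a rough illustration of the regular-expression approach described above, the following minimal sketch matches two common mention formats: point mutations such as V600E and native residues such as Lys483. The patterns, the names `MUTATION`, `RESIDUE` and `find_mentions`, and the sample sentence are all hypothetical; they are not taken from the ResidueFinder or MutationFinder expression lists, which cover many more formats.

```python
import re

# Hypothetical, deliberately small patterns for illustration only.
# They are NOT the ResidueFinder or MutationFinder regular expression lists.
AA1 = "ACDEFGHIKLMNPQRSTVWY"  # one-letter amino acid codes
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
       "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")  # three-letter codes

# Point mutation: wild-type residue, position, mutant residue (e.g. V600E).
MUTATION = re.compile(rf"\b([{AA1}])(\d+)([{AA1}])\b")
# Native residue mention: three-letter code plus position (e.g. Lys483, Glu-501).
RESIDUE = re.compile(rf"\b({AA3})[- ]?(\d+)\b")


def find_mentions(text):
    """Return (mutation mentions, residue mentions) found in the text."""
    mutations = [m.group(0) for m in MUTATION.finditer(text)]
    residues = [m.group(0) for m in RESIDUE.finditer(text)]
    return mutations, residues


sample = "The V600E substitution alters contacts with Lys483 and Glu-501."
print(find_mentions(sample))
```

A pattern set this small would miss most of the format variety the abstract reports for full text, which is the motivation for the expanded list.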

Results

We find there is much more variety of formats for mentioning residues in the entire text of papers than in abstracts alone. Failure to take these multiple formats into account results in many false negatives in the program. Since MutationFinder, like several other programs, was primarily tested on abstracts, we found it necessary to build an expanded regular expression list to achieve acceptable recall in full text searches. We also discovered a number of artifacts arising from PDF to text conversion, which we wrote elements in the regular expression library to address. Taking into account those factors resulted in high recall on randomly selected primary research articles. We also developed a streamlined regular expression (called “cut”) which enables a several hundredfold speedup in both MutationFinder and ResidueFinder with only a modest compromise of recall. All regular expressions were tested using expanded F-measure statistics, i.e., we compute F_β for various values of β, where the larger the value of β, the more recall is weighted, and the smaller the value of β, the more precision is weighted.
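
The F_β statistic mentioned above is conventionally defined as F_β = (1 + β²)·P·R / (β²·P + R), where P is precision and R is recall. The sketch below assumes that standard definition; the function name `f_beta` and the precision and recall values are hypothetical and are not results from the paper.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical precision and recall, chosen only to show the effect of beta.
precision, recall = 0.9, 0.6
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F_beta={f_beta(precision, recall, beta):.3f}")
```

With precision higher than recall, as in this toy case, the score falls as β grows, because larger β gives more weight to the weaker recall.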

Conclusions

ResidueFinder is a simple, effective, and efficient program for finding individual residue mentions in primary literature starting with text files, implemented in Python, and available in SourceForge.net. The most computationally efficient versions of ResidueFinder could enable creation and maintenance of a database of residue mentions encompassing all articles in PubMed.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13326-021-00243-3.

Lee K, Wei CH, Lu Z. Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. Brief Bioinform 2021;22:bbaa142. [PMID: 32770181 PMCID: PMC8138883 DOI: 10.1093/bib/bbaa142] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/01/2020] [Revised: 06/07/2020] [Accepted: 06/25/2020] [Indexed: 12/28/2022]

Wei CH, Allot A, Leaman R, Lu Z. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res 2020;47:W587-W593. [PMID: 31114887 DOI: 10.1093/nar/gkz389] [Citation(s) in RCA: 228] [Impact Index Per Article: 45.6] [Received: 02/17/2019] [Revised: 04/08/2019] [Accepted: 04/30/2019] [Indexed: 11/12/2022]

Bugnon LA, Yones C, Raad J, Gerard M, Rubiolo M, Merino G, Pividori M, Di Persia L, Milone DH, Stegmayer G. DL4papers: a deep learning approach for the automatic interpretation of scientific articles. Bioinformatics 2020;36:3499-3506. [DOI: 10.1093/bioinformatics/btaa111] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/09/2019] [Revised: 12/27/2019] [Accepted: 02/14/2020] [Indexed: 01/26/2023]

Abstract

Motivation

In precision medicine, next-generation sequencing and novel preclinical reports have led to an increasingly large amount of results, published in the scientific literature. However, identifying novel treatments or predicting a drug response in, for example, cancer patients, from the huge amount of papers available remains a laborious and challenging work. This task can be considered a text mining problem that requires reading a lot of academic documents for identifying a small set of papers describing specific relations between key terms. Due to the infeasibility of the manual curation of these relations, computational methods that can automatically identify them from the available literature are urgently needed.

Results

We present DL4papers, a new method based on deep learning that is capable of analyzing and interpreting papers in order to automatically extract relevant relations between specific keywords. DL4papers receives as input a query with the desired keywords, and it returns a ranked list of papers that contain meaningful associations between the keywords. The comparison against related methods showed that our proposal outperformed them in a cancer corpus. The reliability of the DL4papers output list was also measured, revealing that 100% of the first two documents retrieved for a particular search have relevant relations, in average. This shows that our model can guarantee that in the top-2 papers of the ranked list, the relation can be effectively found. Furthermore, the model is capable of highlighting, within each document, the specific fragments that have the associations of the input keywords. This can be very useful in order to pay attention only to the highlighted text, instead of reading the full paper. We believe that our proposal could be used as an accurate tool for rapidly identifying relationships between genes and their mutations, drug responses and treatments in the context of a certain disease. This new approach can certainly be a very useful and valuable resource for the advancement of the precision medicine field.

Availability and implementation

A web-demo is available at: http://sinc.unl.edu.ar/web-demo/dl4papers/. Full source code and data are available at: https://sourceforge.net/projects/sourcesinc/files/dl4papers/.

Contact

lbugnon@sinc.unl.edu.ar

Supplementary information

Supplementary data are available at Bioinformatics online.

Affiliation(s)

L A Bugnon: Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
C Yones: Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
J Raad: Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
M Gerard: Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
M Rubiolo: Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
G Merino: Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina; Bioengineering and Bioinformatics Research and Development Institute, IBB, FIUNER-CONICET, Ruta Prov 11, Km 10.5, Oro Verde 3100, Argentina
M Pividori: Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina; Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA, USA
L Di Persia: Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
D H Milone: Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
G Stegmayer: Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina

Pesaranghader A, Matwin S, Sokolova M, Pesaranghader A. deepBioWSD: effective deep neural word sense disambiguation of biomedical text data. J Am Med Inform Assoc 2020;26:438-446. [PMID: 30811548 DOI: 10.1093/jamia/ocy189] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 09/21/2018] [Revised: 12/03/2018] [Accepted: 12/19/2018] [Indexed: 01/05/2023]

Abstract

OBJECTIVE

In biomedicine, there is a wealth of information hidden in unstructured narratives such as research articles and clinical reports. To exploit these data properly, a word sense disambiguation (WSD) algorithm prevents downstream difficulties in the natural language processing applications pipeline. Supervised WSD algorithms largely outperform un- or semisupervised and knowledge-based methods; however, they train 1 separate classifier for each ambiguous term, necessitating a large number of expert-labeled training data, an unattainable goal in medical informatics. To alleviate this need, a single model that shares statistical strength across all instances and scales well with the vocabulary size is desirable.

MATERIALS AND METHODS

Built on recent advances in deep learning, our deepBioWSD model leverages 1 single bidirectional long short-term memory network that makes sense prediction for any ambiguous term. In the model, first, the Unified Medical Language System sense embeddings will be computed using their text definitions; and then, after initializing the network with these embeddings, it will be trained on all (available) training data collectively. This method also considers a novel technique for automatic collection of training data from PubMed to (pre)train the network in an unsupervised manner.

RESULTS

We use the MSH WSD dataset to compare WSD algorithms, with macro and micro accuracies employed as evaluation metrics. deepBioWSD outperforms existing models in biomedical text WSD by achieving the state-of-the-art performance of 96.82% for macro accuracy.
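
The abstract does not define its two metrics; the sketch below assumes the usual convention, in which micro accuracy pools all test instances and macro accuracy averages the per-term accuracies so that every ambiguous term counts equally. The function name `micro_macro_accuracy` and the toy data are hypothetical and unrelated to the MSH WSD dataset.

```python
def micro_macro_accuracy(results):
    """Compute (micro, macro) accuracy.

    `results` maps each ambiguous term to a non-empty list of
    (gold_sense, predicted_sense) pairs.
    """
    total = correct = 0
    per_term_accuracy = []
    for pairs in results.values():
        hits = sum(gold == predicted for gold, predicted in pairs)
        per_term_accuracy.append(hits / len(pairs))
        total += len(pairs)
        correct += hits
    micro = correct / total  # every instance weighted equally
    macro = sum(per_term_accuracy) / len(per_term_accuracy)  # every term weighted equally
    return micro, macro


# Hypothetical toy data: one frequent term handled well, one rare term missed.
toy = {
    "cold": [("temperature", "temperature")] * 4,
    "discharge": [("release", "secretion")],
}
print(micro_macro_accuracy(toy))
```

In the toy data the frequent term dominates the micro figure, while the macro figure is pulled down by the single rare term, which is why the two are usually reported together.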

CONCLUSIONS

Apart from the disambiguation improvement and unsupervised training, deepBioWSD depends on considerably fewer expert-labeled data, as it learns the target and the context terms jointly. These merits make deepBioWSD conveniently deployable in real-time biomedical applications.

Legrand J, Gogdemir R, Bousquet C, Dalleau K, Devignes MD, Digan W, Lee CJ, Ndiaye NC, Petitpain N, Ringot P, Smaïl-Tabbone M, Toussaint Y, Coulet A. PGxCorpus, a manually annotated corpus for pharmacogenomics. Sci Data 2020;7:3. [PMID: 31896797 PMCID: PMC6940385 DOI: 10.1038/s41597-019-0342-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/29/2019] [Accepted: 12/02/2019] [Indexed: 11/09/2022]

Gachloo M, Wang Y, Xia J. A review of drug knowledge discovery using BioNLP and tensor or matrix decomposition. Genomics Inform 2019;17:e18. [PMID: 31307133 PMCID: PMC6808632 DOI: 10.5808/gi.2019.17.2.e18] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 02/23/2019] [Revised: 05/30/2019] [Accepted: 05/30/2019] [Indexed: 12/12/2022]

Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language. DATA 2018. [DOI: 10.3390/data3040053] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]

Wei CH, Phan L, Feltz J, Maiti R, Hefferon T, Lu Z. tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine. Bioinformatics 2018;34:80-87. [PMID: 28968638 DOI: 10.1093/bioinformatics/btx541] [Citation(s) in RCA: 56] [Impact Index Per Article: 8.0] [Received: 05/17/2017] [Accepted: 08/31/2017] [Indexed: 11/12/2022]

Abstract

Motivation

Despite significant efforts in expert curation, clinical relevance about most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the variant biological function/disease impact is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques but are of limited use because their mutation extraction results are not standardized or integrated with curated data.

Results

We propose an automatic method to extract and normalize variant mentions to unique identifiers (dbSNP RSIDs). Our method, in benchmarking results, demonstrates a high F-measure of ∼90% and compared favorably to the state of the art. Next, we applied our approach to the entire PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes which are presumed to be deleterious and not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating/prioritizing variants in genomic research.

Availability and implementation

The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/.

Contact

zhiyong.lu@nih.gov.

Kveler K, Starosvetsky E, Ziv-Kenet A, Kalugny Y, Gorelik Y, Shalev-Malul G, Aizenbud-Reshef N, Dubovik T, Briller M, Campbell J, Rieckmann JC, Asbeh N, Rimar D, Meissner F, Wiser J, Shen-Orr SS. Immune-centric network of cytokines and cells in disease context identified by computational mining of PubMed. Nat Biotechnol 2018;36:651-659. [PMID: 29912209 PMCID: PMC6035104 DOI: 10.1038/nbt.4152] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Received: 05/16/2017] [Accepted: 04/05/2018] [Indexed: 02/07/2023]

Affiliation(s)

Ksenya Kveler: Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa 3525433, Israel
Elina Starosvetsky: Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa 3525433, Israel
Amit Ziv-Kenet: Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa 3525433, Israel
Yuval Kalugny: Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa 3525433, Israel; CytoReason, Tel-Aviv, 67012, Israel
Yuri Gorelik: Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa 3525433, Israel
Gali Shalev-Malul: Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa 3525433, Israel
Netta Aizenbud-Reshef: Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa 3525433, Israel
Tania Dubovik: Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa 3525433, Israel
Mayan Briller: Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa 3525433, Israel
John Campbell: Northrop Grumman IT Health Solutions, Rockville, MD 20850, USA
Jan C. Rieckmann: Experimental Systems Immunology, Max Planck Institute of Biochemistry, Bayern, 82152, Germany
Nuaman Asbeh: Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa 3525433, Israel
Doron Rimar: Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa 3525433, Israel; Rheumatology Unit, Bnai Zion Medical Center, Haifa 31048, Israel
Felix Meissner: Experimental Systems Immunology, Max Planck Institute of Biochemistry, Bayern, 82152, Germany
Jeff Wiser: Northrop Grumman IT Health Solutions, Rockville, MD 20850, USA
Shai S. Shen-Orr: Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa 3525433, Israel; Faculty of Biology, Technion-Israel Institute of Technology, Haifa 3200003, Israel

Lee K, Kim B, Choi Y, Kim S, Shin W, Lee S, Park S, Kim S, Tan AC, Kang J. Deep learning of mutation-gene-drug relations from the literature. BMC Bioinformatics 2018;19:21. [PMID: 29368597 PMCID: PMC5784504 DOI: 10.1186/s12859-018-2029-1] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 04/25/2017] [Accepted: 01/17/2018] [Indexed: 12/31/2022]

Abstract

Background

Molecular biomarkers that can predict drug efficacy in cancer patients are crucial components for the advancement of precision medicine. However, identifying these molecular biomarkers remains a laborious and challenging task. Next-generation sequencing of patients and preclinical models have increasingly led to the identification of novel gene-mutation-drug relations, and these results have been reported and published in the scientific literature.

Results

Here, we present two new computational methods that utilize all the PubMed articles as domain specific background knowledge to assist in the extraction and curation of gene-mutation-drug relations from the literature. The first method uses the Biomedical Entity Search Tool (BEST) scoring results as some of the features to train the machine learning classifiers. The second method uses not only the BEST scoring results, but also word vectors in a deep convolutional neural network model that are constructed from and trained on numerous documents such as PubMed abstracts and Google News articles. Using the features obtained from both the BEST search engine scores and word vectors, we extract mutation-gene and mutation-drug relations from the literature using machine learning classifiers such as random forest and deep convolutional neural networks.

Our methods achieved better results compared with the state-of-the-art methods. We used our proposed features in a simple machine learning model, and obtained F1-scores of 0.96 and 0.82 for mutation-gene and mutation-drug relation classification, respectively. We also developed a deep learning classification model using convolutional neural networks, BEST scores, and the word embeddings that are pre-trained on PubMed or Google News data. Using deep learning, the classification accuracy improved, and F1-scores of 0.96 and 0.86 were obtained for the mutation-gene and mutation-drug relations, respectively.

Conclusion

We believe that our computational methods described in this research could be used as an important tool in identifying molecular biomarkers that predict drug responses in cancer patients. We also built a database of these mutation-gene-drug relations that were extracted from all the PubMed abstracts. We believe that our database can prove to be a valuable resource for precision medicine researchers.

Electronic supplementary material

The online version of this article (10.1186/s12859-018-2029-1) contains supplementary material, which is available to authorized users.

Bokharaeian B, Diaz A, Taghizadeh N, Chitsaz H, Chavoshinejad R. SNPPhenA: a corpus for extracting ranked associations of single-nucleotide polymorphisms and phenotypes from literature. J Biomed Semantics 2017;8:14. [PMID: 28388928 PMCID: PMC5383945 DOI: 10.1186/s13326-017-0116-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 07/05/2016] [Accepted: 01/13/2017] [Indexed: 11/17/2022]

Abstract

Background

Single Nucleotide Polymorphisms (SNPs) are among the most important types of genetic variations influencing common diseases and phenotypes. Recently, some corpora and methods have been developed with the purpose of extracting mutations and diseases from texts. However, there is no available corpus, for extracting associations from texts, that is annotated with linguistic-based negation, modality markers, neutral candidates, and confidence level of associations.

Method

In this research, several steps were carried out to produce the SNPPhenA corpus. They include automatic Named Entity Recognition (NER) followed by the manual annotation of SNP and phenotype names, annotation of the SNP-phenotype associations and their level of confidence, as well as modality markers. Moreover, the produced corpus was annotated with negation scopes and cues as well as neutral candidates, which play a crucial role in handling negation and modality in relation extraction tasks.

Result

The agreement between annotators was measured by Cohen’s Kappa coefficient where the resulting scores indicated the reliability of the corpus. The Kappa score was 0.79 for annotating the associations and 0.80 for the confidence degree of associations. Further presented were the basic statistics of the annotated features of the corpus in addition to the results of our first experiments related to the extraction of ranked SNP-Phenotype associations. The prepared guideline documents render the corpus more convenient and facile to use. The corpus, guidelines and inter-annotator agreement analysis are available on the website of the corpus: http://nil.fdi.ucm.es/?q=node/639.
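
Cohen's Kappa, used above to measure inter-annotator agreement, is conventionally computed as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The following minimal sketch assumes that standard two-annotator definition; the function name `cohens_kappa` and the labels are hypothetical and are not drawn from the SNPPhenA corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four candidate associations.
annotator_1 = ["positive", "positive", "negative", "negative"]
annotator_2 = ["positive", "negative", "negative", "negative"]
print(cohens_kappa(annotator_1, annotator_2))
```

Here the annotators agree on three of four items (p_o = 0.75) while chance agreement is 0.5, so the score is well below the raw agreement rate.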

Conclusion

Specifying the confidence degree of SNP-phenotype associations from articles helps identify the strength of associations that could in turn assist genomics scientists in determining phenotypic plasticity and the importance of environmental factors. What is more, our first experiments with the corpus show that linguistic-based confidence alongside other non-linguistic features can be utilized in order to estimate the strength of the observed SNP-phenotype associations. Trial Registration: Not Applicable

Electronic supplementary material

The online version of this article (doi:10.1186/s13326-017-0116-2) contains supplementary material, which is available to authorized users.

Chen CC, Ho CL. StemTextSearch: Stem cell gene database with evidence from abstracts. J Biomed Inform 2017;69:150-159. [PMID: 28315408 DOI: 10.1016/j.jbi.2017.03.008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/21/2016] [Revised: 03/08/2017] [Accepted: 03/10/2017] [Indexed: 11/29/2022]

Singhal A, Simmons M, Lu Z. Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine. PLoS Comput Biol 2016;12:e1005017. [PMID: 27902695 PMCID: PMC5130168 DOI: 10.1371/journal.pcbi.1005017] [Citation(s) in RCA: 66] [Impact Index Per Article: 7.3] [Received: 02/09/2016] [Accepted: 06/04/2016] [Indexed: 11/23/2022]

Abstract

The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting such triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of disease-gene-variant triplets from biomedical literature. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer’s disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F₁-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors. Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships.

To provide personalized health care it is important to understand patients’ genomic variations and the effect these variants have in protecting or predisposing patients to disease. Several projects aim at providing this information by manually curating such genotype-phenotype relationships in organized databases using data from clinical trials and biomedical literature. However, the exponentially increasing size of biomedical literature and the limited ability of manual curators to discover the genotype-phenotype relationships “hidden” in text has led to delays in keeping such databases updated with the current findings. The result is a bottleneck in leveraging valuable information that is currently available to develop personalized health care solutions. In the past, a few computational techniques have attempted to speed up the curation efforts by using text mining techniques to automatically mine genotype-phenotype information from biomedical literature. However, such computational approaches have not been able to achieve accuracy levels sufficient to make them appealing for practical use. In this work, we present a highly accurate machine-learning-based text mining approach for mining complete genotype-phenotype relationships from biomedical literature. We test the performance of this approach on ten well-known diseases and demonstrate the validity of our approach and its potential utility for practical purposes. We are currently working towards generating genotype-phenotype relationships for all PubMed data with the goal of developing an exhaustive database of all the known diseases in life science. We believe that this work will provide very important and needed support for implementation of personalized health care using genomic data.


BEST: Next-Generation Biomedical Entity Search Tool for Knowledge Discovery from Biomedical Literature. PLoS One 2016;11:e0164680. [PMID: 27760149 PMCID: PMC5070740 DOI: 10.1371/journal.pone.0164680] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2016] [Accepted: 09/29/2016] [Indexed: 01/08/2023]

Lee K, Shin W, Kim B, Lee S, Choi Y, Kim S, Jeon M, Tan AC, Kang J. HiPub: translating PubMed and PMC texts to networks for knowledge discovery. Bioinformatics 2016;32:2886-8. [PMID: 27485446 DOI: 10.1093/bioinformatics/btw511] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2016] [Accepted: 07/28/2016] [Indexed: 11/14/2022]

Verspoor KM, Heo GE, Kang KY, Song M. Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts. BMC Med Inform Decis Mak 2016;16 Suppl 1:68. [PMID: 27454860 PMCID: PMC4959367 DOI: 10.1186/s12911-016-0294-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]

Abstract

BACKGROUND

The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems.

METHODS

In this paper, we focus on assessing performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for extraction of relations important for extraction of genetic variant information from the literature. We test the application of the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for information extraction of entities and relations in text, on the relation extraction task using this corpus.

RESULTS

For the relations which are attested at least 100 times in the Variome corpus, we realise a performance ranging from 0.78-0.84 Precision-weighted F-score, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora, although the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations.

CONCLUSIONS

This work represents the first attempt to apply relation extraction methods to the Variome corpus. The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.
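
The results above are reported as a "Precision-weighted F-score". One common way to weight an F-score toward precision is the F-beta measure with beta below 1, sketched here. The abstract does not state which weighting was used, so beta = 0.5 and the example precision and recall values are assumptions chosen purely for illustration.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score: weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily, beta > 1 weights recall more
    heavily, and beta == 1 gives the usual F1. The default of 0.5 is an
    assumption, not a value taken from the abstract.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical system with higher precision than recall.
p, r = 0.9, 0.6
f1 = f_beta(p, r, beta=1.0)    # balanced F1
f_half = f_beta(p, r, beta=0.5)  # precision-weighted
```

With these illustrative numbers the balanced F1 is 0.72, while the precision-weighted score is about 0.82, showing how a precision-weighted measure rewards a system whose precision exceeds its recall. A co-occurrence baseline typically shows the opposite profile (high recall, low precision), so it is penalised by the same weighting.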
