1
Wang W, Shuai Y, Zeng M, Fan W, Li M. DPFunc: accurately predicting protein function via deep learning with domain-guided structure information. Nat Commun 2025; 16:70. [PMID: 39746897 PMCID: PMC11697396 DOI: 10.1038/s41467-024-54816-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/20/2024] [Accepted: 11/21/2024] [Indexed: 01/04/2025] Open
Abstract
Computational methods for predicting protein function are of great significance in understanding biological mechanisms and treating complex diseases. However, existing computational approaches of protein function prediction lack interpretability, making it difficult to understand the relations between protein structures and functions. In this study, we propose a deep learning-based solution, named DPFunc, for accurate protein function prediction with domain-guided structure information. DPFunc can detect significant regions in protein structures and accurately predict corresponding functions under the guidance of domain information. It outperforms current state-of-the-art methods and achieves a significant improvement over existing structure-based methods. Detailed analyses demonstrate that the guidance of domain information contributes to DPFunc for protein function prediction, enabling our method to detect key residues or regions in protein structures, which are closely related to their functions. In summary, DPFunc serves as an effective tool for large-scale protein function prediction, which pushes the border of protein understanding in biological systems.
Affiliation(s)
- Wenkang Wang
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Yunyan Shuai
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Min Zeng
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Wei Fan
  - Nuffield Department of Women's and Reproductive Health, University of Oxford, Oxford, OX39DU, UK
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
2
Li PH, Sun YY, Juan HF, Chen CY, Tsai HK, Huang JH. A large language model framework for literature-based disease-gene association prediction. Brief Bioinform 2024; 26:bbaf070. [PMID: 39998433 PMCID: PMC11851487 DOI: 10.1093/bib/bbaf070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2024] [Revised: 01/09/2025] [Accepted: 02/06/2025] [Indexed: 02/26/2025] Open
Abstract
With the exponential growth of biomedical literature, leveraging Large Language Models (LLMs) for automated medical knowledge understanding has become increasingly critical for advancing precision medicine. However, current approaches face significant challenges in reliability, verifiability, and scalability when extracting complex biological relationships from scientific literature using LLMs. To overcome the obstacles of LLM development in biomedical literature understanding, we propose LORE, a novel unsupervised two-stage reading methodology with LLMs that models literature as a knowledge graph of verifiable factual statements and, in turn, as semantic embeddings in Euclidean space. LORE captured essential gene pathogenicity information when applied to PubMed abstracts for large-scale understanding of disease-gene relationships. We demonstrated that modeling a latent pathogenic flow in the semantic embedding with supervision from the ClinVar database led to a 90% mean average precision in identifying relevant genes across 2097 diseases. This work provides a scalable and reproducible approach for leveraging LLMs in biomedical literature analysis, offering new opportunities for researchers to identify therapeutic targets efficiently.
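For readers unfamiliar with the metric quoted in this abstract, the following is a minimal sketch of how mean average precision is conventionally computed over ranked lists, one list per disease. It is not the LORE evaluation code; the function names, disease labels, and gene identifiers are hypothetical toy values chosen only for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision for one query: the mean of precision@k taken at
    every rank k where a relevant item appears, divided by the number of
    relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], truth: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision values."""
    scores = [average_precision(rankings[q], truth[q]) for q in rankings]
    return sum(scores) / len(scores)


# Hypothetical toy data: ranked gene lists per disease and the relevant genes.
rankings = {
    "disease_a": ["G1", "G2", "G3", "G4"],
    "disease_b": ["G5", "G6", "G7"],
}
truth = {
    "disease_a": {"G1", "G3"},
    "disease_b": {"G6"},
}

map_score = mean_average_precision(rankings, truth)
print(round(map_score, 4))
```

In this toy case the per-disease values are (1/1 + 2/3) / 2 and 1/2, so the mean is 2/3.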
Affiliation(s)
- Peng-Hsuan Li
  - Taiwan AI Labs, 6F., No. 70, Sec. 1, Chengde Road, Datong Dist., Taipei 10355, Taiwan
- Yih-Yun Sun
  - Taiwan AI Labs, 6F., No. 70, Sec. 1, Chengde Road, Datong Dist., Taipei 10355, Taiwan
- Hsueh-Fen Juan
  - Taiwan AI Labs, 6F., No. 70, Sec. 1, Chengde Road, Datong Dist., Taipei 10355, Taiwan
  - Department of Life Science, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei 10617, Taiwan
  - Center for Computational and Systems Biology, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
  - Center for Advanced Computing and Imaging in Biomedicine, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
- Chien-Yu Chen
  - Taiwan AI Labs, 6F., No. 70, Sec. 1, Chengde Road, Datong Dist., Taipei 10355, Taiwan
  - Center for Computational and Systems Biology, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
  - Center for Advanced Computing and Imaging in Biomedicine, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
  - Department of Biomechatronics Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
- Huai-Kuang Tsai
  - Taiwan AI Labs, 6F., No. 70, Sec. 1, Chengde Road, Datong Dist., Taipei 10355, Taiwan
  - Institute of Information Science, Academia Sinica, No. 128, Academia Road, Section 2, Nankang, Taipei 11529, Taiwan
- Jia-Hsin Huang
  - Taiwan AI Labs, 6F., No. 70, Sec. 1, Chengde Road, Datong Dist., Taipei 10355, Taiwan
3
Metzger VT, Cannon DC, Yang JJ, Mathias SL, Bologa CG, Waller A, Schürer SC, Vidović D, Kelleher KJ, Sheils TK, Jensen LJ, Lambert CG, Oprea TI, Edwards JS. TIN-X version 3: update with expanded dataset and modernized architecture for enhanced illumination of understudied targets. PeerJ 2024; 12:e17470. [PMID: 38948230 PMCID: PMC11212617 DOI: 10.7717/peerj.17470] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2023] [Accepted: 05/06/2024] [Indexed: 07/02/2024] Open
Abstract
TIN-X (Target Importance and Novelty eXplorer) is an interactive visualization tool for illuminating associations between diseases and potential drug targets and is publicly available at newdrugtargets.org. TIN-X uses natural language processing to identify disease and protein mentions within PubMed content using previously published tools for named entity recognition (NER) of gene/protein and disease names. Target data is obtained from the Target Central Resource Database (TCRD). Two important metrics, novelty and importance, are computed from this data and, when plotted as log(importance) vs. log(novelty), aid the user in visually exploring the novelty of drug targets and their associated importance to diseases. TIN-X Version 3.0 has been significantly improved with an expanded dataset, modernized architecture including a REST API, and an improved user interface (UI). The dataset has been expanded to include not only PubMed publication titles and abstracts, but also full-text articles when available. This results in approximately 9-fold more target/disease associations compared to previous versions of TIN-X. Additionally, the TIN-X database containing this expanded dataset is now hosted in the cloud via Amazon RDS. Recent enhancements to the UI focus on making it more intuitive for users to find diseases or drug targets of interest while providing a new, sortable table-view mode to accompany the existing plot-view mode. UI improvements also help the user browse the associated PubMed publications to explore and understand the basis of TIN-X's predicted association between a specific disease and a target of interest. While implementing these upgrades, computational resources are balanced between the webserver and the user's web browser to achieve adequate performance while accommodating the expanded dataset. Together, these advances aim to extend the duration that users can benefit from TIN-X while providing both an expanded dataset and new features that researchers can use to better illuminate understudied proteins.
Affiliation(s)
- Vincent T. Metzger
  - Translational Informatics Division, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, New Mexico, United States
- Jeremy J. Yang
  - Translational Informatics Division, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, New Mexico, United States
- Stephen L. Mathias
  - Translational Informatics Division, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, New Mexico, United States
- Cristian G. Bologa
  - Translational Informatics Division, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, New Mexico, United States
- Anna Waller
  - Center for Molecular Discovery, University of New Mexico Comprehensive Cancer Center, Albuquerque, New Mexico, United States
- Stephan C. Schürer
  - Department of Molecular and Cellular Pharmacology, Miller School of Medicine, University of Miami, Miami, Florida, United States
- Dušica Vidović
  - Department of Molecular and Cellular Pharmacology, Miller School of Medicine, University of Miami, Miami, Florida, United States
- Keith J. Kelleher
  - National Center for Advancing Translational Science, Rockville, Maryland, United States
- Timothy K. Sheils
  - National Center for Advancing Translational Science, Rockville, Maryland, United States
- Lars Juhl Jensen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Christophe G. Lambert
  - Translational Informatics Division, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, New Mexico, United States
- Tudor I. Oprea
  - Translational Informatics Division, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, New Mexico, United States
- Jeremy S. Edwards
  - Translational Informatics Division, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, New Mexico, United States
  - Department of Chemistry and Chemical Biology, University of New Mexico, Albuquerque, New Mexico, United States
4
Narvaez-Rojas A, Arnaout MM, Hoz SS, Agrawal A, Lee A, Moscote-Salazar LR, Deora H. Info-pollution: a word of caution for the neurosurgical community. Egyptian Journal of Neurosurgery 2022. [DOI: 10.1186/s41984-022-00179-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022] Open
Abstract
The medical-patient relationship is facing pollution of information all over the internet; for physicians and patients, it is becoming tougher to keep updated with the highest quality of information. During the last 20 years, multiple evaluation tools have been developed in an attempt to find the best way to assess high-quality information; to date, the DISCERN tool is the most widely used. Information can be found on the surface internet and in the deep web, which constitutes the biggest chunk of the internet, so informing and controlling the quality of information is a formidable task. PubMed and Google Scholar are the most important tools for a physician to find information, although multiple others are available; awareness must be raised about improving current strategies for data mining high-quality information for patients and the healthcare community.
5
Su Y, Wang M, Wang P, Zheng C, Liu Y, Zeng X. Deep learning joint models for extracting entities and relations in biomedical: a survey and comparison. Brief Bioinform 2022; 23:6686739. [PMID: 36125190 DOI: 10.1093/bib/bbac342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Revised: 07/20/2022] [Accepted: 07/25/2022] [Indexed: 12/14/2022] Open
Abstract
The rapid development of biomedicine has produced a large number of biomedical written materials. These unstructured text data create serious challenges for biomedical researchers trying to find information. Biomedical named entity recognition (BioNER) and biomedical relation extraction (BioRE) are the two most fundamental tasks of biomedical text mining. Accurately and efficiently identifying entities and extracting relations have become very important. Methods that perform the two tasks separately are called pipeline models, and they have shortcomings such as insufficient interaction, low extraction quality and easy redundancy. To overcome these shortcomings, many deep learning-based joint named entity recognition and relation extraction models have been proposed, and they have achieved advanced performance. This paper comprehensively summarizes deep learning models for joint named entity recognition and relation extraction for biomedicine. The joint BioNER and BioRE models are discussed in the light of the challenges existing in the BioNER and BioRE tasks. Five joint BioNER and BioRE models and one pipeline model are selected for comparative experiments on four biomedical public datasets, and the experimental results are analyzed. Finally, we discuss the opportunities for future development of deep learning-based joint BioNER and BioRE models.
Affiliation(s)
- Yansen Su
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Artificial Intelligence, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
- Minglu Wang
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
- Pengpeng Wang
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
- Chunhou Zheng
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Artificial Intelligence, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
- Yuansheng Liu
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Xiangxiang Zeng
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
6
Cheng X, Cao Q, Liao SS. An overview of literature on COVID-19, MERS and SARS: Using text mining and latent Dirichlet allocation. J Inf Sci 2022; 48:304-320. [PMID: 38603038 PMCID: PMC7464068 DOI: 10.1177/0165551520954674] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Indexed: 02/06/2023]
Abstract
The unprecedented outbreak of COVID-19 is one of the most serious global threats to public health in this century. During this crisis, specialists in information science could play key roles in supporting the efforts of scientists in the health and medical community to combat COVID-19. In this article, we demonstrate that information specialists can support the health and medical community by applying a text mining technique with the latent Dirichlet allocation procedure to produce an overview of a mass of coronavirus literature. This overview presents the generic research themes of the coronavirus diseases COVID-19, MERS and SARS, reveals the representative literature per main research theme and displays a network visualisation to explore the overlap, similarities and differences among these themes. The overview can help the health and medical communities to extract useful information and interrelationships from coronavirus-related studies.
Affiliation(s)
- Xian Cheng
  - Business School, Sichuan University, China
- Qiang Cao
  - Department of Information Systems, City University of Hong Kong, China
7
Auto-generated database of semiconductor band gaps using ChemDataExtractor. Sci Data 2022; 9:193. [PMID: 35504897 PMCID: PMC9065101 DOI: 10.1038/s41597-022-01294-6] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 05/18/2021] [Accepted: 02/08/2022] [Indexed: 11/16/2022] Open
Abstract
Large-scale databases of band gap information about semiconductors that are curated from the scientific literature have significant usefulness for computational databases and general semiconductor materials research. This work presents an auto-generated database of 100,236 semiconductor band gap records, extracted from 128,776 journal articles with their associated temperature information. The database was produced using ChemDataExtractor version 2.0, a ‘chemistry-aware’ software toolkit that uses Natural Language Processing (NLP) and machine-learning methods to extract chemical data from scientific documents. The modified Snowball algorithm of ChemDataExtractor has been extended to incorporate nested models, optimized by hyperparameter analysis, and used together with the default NLP parsers to achieve optimal quality of the database. Evaluation of the database shows a weighted precision of 84% and a weighted recall of 65%. To the best of our knowledge, this is the largest open-source non-computational band gap database to date. Database records are available in CSV, JSON, and MongoDB formats, which are machine readable and can assist data mining and semiconductor materials discovery.
Measurement(s): semiconductor band gaps
Technology Type(s): natural language processing
8
Song B, Li F, Liu Y, Zeng X. Deep learning methods for biomedical named entity recognition: a survey and qualitative comparison. Brief Bioinform 2021; 22:6326536. [PMID: 34308472 DOI: 10.1093/bib/bbab282] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 04/27/2021] [Revised: 06/07/2021] [Accepted: 07/02/2021] [Indexed: 11/13/2022] Open
Abstract
The biomedical literature is growing rapidly, and the extraction of meaningful information from the large amount of literature is increasingly important. Biomedical named entity (BioNE) identification is one of the critical and fundamental tasks in biomedical text mining. Accurate identification of entities in the literature facilitates the performance of other tasks. Given that an end-to-end neural network can automatically extract features, several deep learning-based methods have been proposed for BioNE recognition (BioNER), yielding state-of-the-art performance. In this review, we comprehensively summarize deep learning-based methods for BioNER and datasets used in training and testing. The deep learning methods are classified into four categories: single neural network-based, multitask learning-based, transfer learning-based and hybrid model-based methods. They can be applied to BioNER in multiple domains, and the results are determined by the dataset size and type. Lastly, we discuss the future development and opportunities of BioNER methods.
Affiliation(s)
- Bosheng Song
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Fen Li
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Yuansheng Liu
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Xiangxiang Zeng
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
9
de Boer ML. Epistemic in/justice in patient participation. A discourse analysis of the Dutch ME/CFS Health Council advisory process. Sociology of Health & Illness 2021; 43:1335-1354. [PMID: 34137042 PMCID: PMC8453904 DOI: 10.1111/1467-9566.13301] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/05/2020] [Revised: 03/25/2021] [Accepted: 05/05/2021] [Indexed: 05/28/2023]
Abstract
In healthcare settings, patient participation is increasingly adopted as a possible remedy to ill people suffering from 'epistemic injustices' - that is to their unfair harming as knowers. In exploring and interpreting patient participation discourses within the 2013-2018 Dutch Myalgic Encephalomyelitis (ME)/Chronic Fatigue Syndrome (CFS) Health Council advisory process, this paper assesses the epistemological emancipatory value of this participatory practice. It reveals that in the analysed case, patient representatives predominantly offer biomedical knowledge about ME/CFS. They frame this condition as primarily somatic, and accordingly, perceive appropriate diagnostic criteria, research avenues and treatment options as quantifiable, objectifiable and explicitly non-psychogenic. This paper argues that such a dominant biomedical patient participatory practice is ambiguous in terms of its ability to correct epistemic injustices towards ill people. Biomedicalized patient participation may enhance people's credibility and their ability to make sense of their illness, but it may also undermine their valid position within participatory practices as well as lead to (sustaining) biased and reductive ideas about who ill people are and what kind of knowledge they hold. The final section of this paper offers a brief reflection on how to navigate such biomedicalized participatory practices in order to attain more emancipatory ones.
Affiliation(s)
- Marjolein Lotte de Boer
  - Department of Culture Studies, School of Humanities, Tilburg University, Tilburg, The Netherlands
10
Henry S, Wijesinghe DS, Myers A, McInnes BT. Using Literature Based Discovery to Gain Insights Into the Metabolomic Processes of Cardiac Arrest. Front Res Metr Anal 2021; 6:644728. [PMID: 34250435 PMCID: PMC8267364 DOI: 10.3389/frma.2021.644728] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/21/2020] [Accepted: 05/07/2021] [Indexed: 12/19/2022] Open
Abstract
In this paper, we describe how we applied literature-based discovery (LBD) techniques to discover lecithin cholesterol acyltransferase (LCAT) as a druggable target for cardiac arrest. We fully describe our process, which includes the use of high-throughput metabolomic analysis to identify metabolites significantly related to cardiac arrest, and how we used LBD to gain insights into how these metabolites relate to cardiac arrest. These insights led to our proposal (for the first time) of LCAT as a druggable target, the effects of which are supported by in vivo studies that were brought forth by this work. Metabolites are the end product of many biochemical pathways within the human body. Observed changes in metabolite levels are indicative of changes in these pathways, and provide valuable insights toward the cause, progression, and treatment of diseases. Following cardiac arrest, we observed changes in metabolite levels pre- and post-resuscitation. We used LBD to help discover diseases implicitly linked via these metabolites of interest. Results of LBD indicated a strong link between fish eye disease and cardiac arrest. Since fish eye disease is characterized by an LCAT deficiency, we began an investigation into the relationship between LCAT and cardiac arrest survival. In the investigation, we found that decreased LCAT activity may increase cardiac arrest survival rates by increasing ω-3 polyunsaturated fatty acid availability in circulation. We verified the effects of ω-3 polyunsaturated fatty acids on increasing survival rate following cardiac arrest in vivo with rat models.
Affiliation(s)
- Sam Henry
  - Department of Physics, Computer Science and Engineering, Christopher Newport University, Newport News, VA, United States
- D. Shanaka Wijesinghe
  - Department of Pharmacotherapy and Outcomes Science, Virginia Commonwealth University, Richmond, VA, United States
- Aidan Myers
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
- Bridget T. McInnes
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
11
Zhang Y, Zhang Y, Qi P, Manning CD, Langlotz CP. Biomedical and clinical English model packages for the Stanza Python NLP library. J Am Med Inform Assoc 2021; 28:1892-1899. [PMID: 34157094 PMCID: PMC8363782 DOI: 10.1093/jamia/ocab090] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 01/05/2021] [Revised: 04/05/2021] [Accepted: 05/03/2021] [Indexed: 11/13/2022] Open
Abstract
Objective: The study sought to develop and evaluate neural natural language processing (NLP) packages for the syntactic analysis and named entity recognition of biomedical and clinical English text.
Materials and Methods: We implement and train biomedical and clinical English NLP pipelines by extending the widely used Stanza library originally designed for general NLP tasks. Our models are trained with a mix of public datasets such as the CRAFT treebank as well as with a private corpus of radiology reports annotated with 5 radiology-domain entities. The resulting pipelines are fully based on neural networks, and are able to perform tokenization, part-of-speech tagging, lemmatization, dependency parsing, and named entity recognition for both biomedical and clinical text. We compare our systems against popular open-source NLP libraries such as CoreNLP and scispaCy, state-of-the-art models such as the BioBERT models, and winning systems from the BioNLP CRAFT shared task.
Results: For syntactic analysis, our systems achieve much better performance compared with the released scispaCy models and CoreNLP models retrained on the same treebanks, and are on par with the winning system from the CRAFT shared task. For NER, our systems substantially outperform scispaCy, and are better or on par with the state-of-the-art performance from BioBERT, while being much more computationally efficient.
Conclusions: We introduce biomedical and clinical NLP packages built for the Stanza library. These packages offer performance that is similar to the state of the art, and are also optimized for ease of use. To facilitate research, we make all our models publicly available. We also provide an online demonstration (http://stanza.run/bio).
Affiliation(s)
- Yuhao Zhang
  - Biomedical Informatics Training Program, Stanford University, Stanford, California, USA
- Yuhui Zhang
  - Computer Science Department, Stanford University, Stanford, California, USA
- Peng Qi
  - Computer Science Department, Stanford University, Stanford, California, USA
- Christopher D Manning
  - Computer Science and Linguistics Departments, Stanford University, Stanford, California, USA
- Curtis P Langlotz
  - Department of Radiology, Stanford University, Stanford, California, USA
12
Liu C, Peres Kury FS, Li Z, Ta C, Wang K, Weng C. Doc2Hpo: a web application for efficient and accurate HPO concept curation. Nucleic Acids Res 2020; 47:W566-W570. [PMID: 31106327 PMCID: PMC6602487 DOI: 10.1093/nar/gkz386] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 02/28/2019] [Revised: 04/26/2019] [Accepted: 04/30/2019] [Indexed: 01/18/2023] Open
Abstract
We present Doc2Hpo, an interactive web application that enables interactive and efficient phenotype concept curation from clinical text with automated concept normalization using the Human Phenotype Ontology (HPO). Users can edit the HPO concepts automatically extracted by Doc2Hpo in real time, and export the extracted HPO concepts into gene prioritization tools. Our evaluation showed that Doc2Hpo significantly reduced manual effort while achieving high accuracy in HPO concept curation. Doc2Hpo is freely available at https://impact2.dbmi.columbia.edu/doc2hpo/. The source code is available at https://github.com/stormliucong/doc2hpo for local installation for protected health data.
Affiliation(s)
- Cong Liu
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Ziran Li
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Casey Ta
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Kai Wang
  - Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
13
Crichton G, Baker S, Guo Y, Korhonen A. Neural networks for open and closed Literature-based Discovery. PLoS One 2020; 15:e0232891. [PMID: 32413059 PMCID: PMC7228051 DOI: 10.1371/journal.pone.0232891] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 11/02/2019] [Accepted: 04/23/2020] [Indexed: 12/18/2022] Open
Abstract
Literature-based Discovery (LBD) aims to discover new knowledge automatically from large collections of literature. Scientific literature is growing at an exponential rate, making it difficult for researchers to stay current in their discipline and easy to miss knowledge necessary to advance their research. LBD can facilitate hypothesis testing and generation and thus accelerate scientific progress. Neural networks have demonstrated improved performance on LBD-related tasks but are yet to be applied to it. We propose four graph-based, neural network methods to perform open and closed LBD. We compared our methods with those used by the state-of-the-art LION LBD system on the same evaluations to replicate recently published findings in cancer biology. We also applied them to a time-sliced dataset of human-curated peer-reviewed biological interactions. These evaluations and the metrics they employ represent performance on real-world knowledge advances and are thus robust indicators of approach efficacy. In the first experiments, our best methods performed 2-4 times better than the baselines in closed discovery and 2-3 times better in open discovery. In the second, our best methods performed almost 2 times better than the baselines in open discovery. These results are strong indications that neural LBD is potentially a very effective approach for generating new scientific discoveries from existing literature. The code for our models and other information can be found at: https://github.com/cambridgeltl/nn_for_LBD.
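To make the terms "open" and "closed" discovery concrete, the sketch below uses the classic co-occurrence (A-B-C) formulation of LBD on a tiny undirected graph. It is not one of the neural methods proposed in the paper above; the function names are hypothetical, and the edges are toy values loosely modeled on the well-known fish oil and Raynaud disease example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_graph(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected co-occurrence graph from concept pairs."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def open_discovery(graph: Dict[str, Set[str]], start: str) -> List[Tuple[str, int]]:
    """Open discovery: from a start concept A, rank concepts C that are not
    directly linked to A by the number of intermediate B concepts."""
    direct = graph.get(start, set())
    counts: Dict[str, int] = defaultdict(int)
    for b in direct:
        for c in graph.get(b, set()):
            if c != start and c not in direct:
                counts[c] += 1
    # Sort by descending support, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def closed_discovery(graph: Dict[str, Set[str]], start: str, target: str) -> List[str]:
    """Closed discovery: given A and C, list the B concepts linked to both."""
    return sorted(graph.get(start, set()) & graph.get(target, set()))


# Toy co-occurrence pairs (illustrative only, not extracted from any corpus).
pairs = [
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("blood viscosity", "Raynaud disease"),
    ("platelet aggregation", "Raynaud disease"),
    ("platelet aggregation", "aspirin"),
]
graph = build_graph(pairs)

candidates = open_discovery(graph, "fish oil")
bridges = closed_discovery(graph, "fish oil", "Raynaud disease")
print(candidates)
print(bridges)
```

Ranking by the count of shared intermediates is only one simple scoring choice; real systems typically weight links by co-occurrence statistics or, as in the paper, learn the scoring function.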
Affiliation(s)
- Gamal Crichton
- Language Technology Laboratory, TAL, University of Cambridge, Cambridge, United Kingdom
| | - Simon Baker
- Language Technology Laboratory, TAL, University of Cambridge, Cambridge, United Kingdom
| | - Yufan Guo
- Language Technology Laboratory, TAL, University of Cambridge, Cambridge, United Kingdom
| | - Anna Korhonen
- Language Technology Laboratory, TAL, University of Cambridge, Cambridge, United Kingdom
| |
|
14
|
Zhu H, Zeng Y, Wang D, Huangfu C. Species Classification for Neuroscience Literature Based on Span of Interest Using Sequence-to-Sequence Learning Model. Front Hum Neurosci 2020; 14:128. [PMID: 32372933 PMCID: PMC7187631 DOI: 10.3389/fnhum.2020.00128] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/09/2020] [Accepted: 03/19/2020] [Indexed: 11/13/2022] Open
Abstract
Large-scale neuroscience literature calls for effective methods to mine knowledge from a species perspective in order to link the brain and neuroscience communities, neurorobotics, computing devices, and AI research communities. Structured knowledge can motivate researchers to better understand the functionality and structure of the brain and link the related resources and components. However, the abstracts of a massive number of scientific works do not explicitly mention the species. Therefore, in addition to dictionary-based methods, we need to mine species using cognitive computing models that are more like the human reading process, and these methods can take advantage of the rich information in the literature. We also enable the model to automatically distinguish whether the mentioned species is the main research subject. Distinguishing the two situations can generate value at different levels of knowledge management. We propose the SpecExplorer project, which is used to explore the knowledge associations of different species for brain and neuroscience. This project frees humans from the tedious task of classifying neuroscience literature by species. The species classification task is a multi-label classification problem, which is more complex than single-label classification due to the correlation between labels. To resolve this problem, we present a sequence-to-sequence classification framework to adaptively assign multiple species to the literature. To model the structure information of documents, we propose hierarchical attentive decoding (HAD) to extract the span of interest (SOI) for predicting each species. We create three datasets from the PubMed and PMC corpora. We present two versions of annotation criteria (mention-based annotation and semantic-based annotation) for species research. Experiments demonstrate that our approach achieves improvements in the final results. Finally, we perform species-based analysis of brain diseases, brain cognitive functions, and proteins related to the hippocampus and provide potential research directions for certain species.
Affiliation(s)
- Hongyin Zhu
- Research Center for Brain-Inspired Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China
- School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
| | - Yi Zeng
- Research Center for Brain-Inspired Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China
- School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
- Center for Excellence in Brain Science and Intelligence Technology Chinese Academy of Sciences, Shanghai, China
- National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Science, Beijing, China
| | - Dongsheng Wang
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
| | - Cunqing Huangfu
- Research Center for Brain-Inspired Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China
| |
|
15
|
Abstract
Bioinformatics plays a key role in supporting the life sciences. In this work, we examine bioinformatics in Jordan, beginning with the current status of bioinformatics education and research, then exploring the challenges of advancing bioinformatics, and finally looking to the future for how Jordanian bioinformatics research may develop.
Affiliation(s)
- Qanita Bani Baker
- Department of Computer Science, Jordan University of Science and Technology, Irbid, Jordan
| | - Maryam S. Nuser
- Department of Computer Science, Jordan University of Science and Technology, Irbid, Jordan
- Department of Information Systems, Yarmouk University, Irbid, Jordan
| |
|
16
|
Cañada A, Capella-Gutierrez S, Rabal O, Oyarzabal J, Valencia A, Krallinger M. LimTox: a web tool for applied text mining of adverse event and toxicity associations of compounds, drugs and genes. Nucleic Acids Res 2019; 45:W484-W489. [PMID: 28531339 PMCID: PMC5570141 DOI: 10.1093/nar/gkx462] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 03/13/2017] [Accepted: 05/16/2017] [Indexed: 01/03/2023] Open
Abstract
Considerable effort has been devoted to systematically retrieving information on genes and proteins as well as the relationships between them. Despite the importance of chemical compounds and drugs as a central bio-entity in pharmacological and biological research, only a limited number of freely available chemical text-mining/search engine technologies are currently accessible. Here we present LimTox (Literature Mining for Toxicology), a web-based online biomedical search tool with special focus on adverse hepatobiliary reactions. It integrates a range of text mining, named entity recognition and information extraction components. LimTox relies on machine-learning, rule-based, pattern-based and term lookup strategies. This system processes scientific abstracts, a set of full text articles and medical agency assessment reports. Although the main focus of LimTox is on adverse liver events, it also enables basic searches for other organ level toxicity associations (nephrotoxicity, cardiotoxicity, thyrotoxicity and phospholipidosis). This tool supports specialized search queries for: chemical compounds/drugs, genes (with additional emphasis on key enzymes in drug metabolism, namely P450 cytochromes, CYPs) and biochemical liver markers. The LimTox website is free and open to all users and there is no login requirement. LimTox can be accessed at: http://limtox.bioinfo.cnio.es
Affiliation(s)
- Andres Cañada
- Spanish National Bioinformatics Institute Unit, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid 28029, Spain
| | - Salvador Capella-Gutierrez
- Spanish National Bioinformatics Institute Unit, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid 28029, Spain
| | - Obdulia Rabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Pamplona 31008, Spain
| | - Julen Oyarzabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Pamplona 31008, Spain
| | - Alfonso Valencia
- Barcelona Supercomputing Center (BSC), Joint BSC-CRG-IRB, Research Program in Computational Biology, BSC-CRG-IRB, Barcelona 08028, Spain; Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), 08034 Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain
| | - Martin Krallinger
- Biological Text Mining Unit, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid 28029, Spain
| |
|
17
|
Yadav S, Ekbal A, Saha S, Kumar A, Bhattacharyya P. Feature assisted stacked attentive shortest dependency path based Bi-LSTM model for protein–protein interaction. Knowl Based Syst 2019. [DOI: 10.1016/j.knosys.2018.11.020] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 10/27/2022]
|
18
|
Kirschnick J, Thomas P, Roller R, Hennig L. SIA: a scalable interoperable annotation server for biomedical named entities. J Cheminform 2018; 10:63. [PMID: 30552534 PMCID: PMC6755617 DOI: 10.1186/s13321-018-0319-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/17/2018] [Accepted: 12/05/2018] [Indexed: 11/29/2022] Open
Abstract
Recent years showed a strong increase in biomedical sciences and an inherent increase in publication volume. Extraction of specific information from these sources requires highly sophisticated text mining and information extraction tools. However, the integration of freely available tools into customized workflows is often cumbersome and difficult. We describe SIA (Scalable Interoperable Annotation Server), our contribution to the BeCalm-Technical interoperability and performance of annotation servers (BeCalm-TIPS) task, a scalable, extensible, and robust annotation service. The system currently covers six named entity types (i.e., chemicals, diseases, genes, miRNA, mutations, and organisms) and is freely available under Apache 2.0 license at https://github.com/Erechtheus/sia.
Affiliation(s)
| | - Philippe Thomas
- DFKI Language Technology Lab, Alt-Moabit 91c, Berlin, Germany
| | - Roland Roller
- DFKI Language Technology Lab, Alt-Moabit 91c, Berlin, Germany
| | - Leonhard Hennig
- DFKI Language Technology Lab, Alt-Moabit 91c, Berlin, Germany.
| |
|
19
|
Literature-based automated discovery of tumor suppressor p53 phosphorylation and inhibition by NEK2. Proc Natl Acad Sci U S A 2018; 115:10666-10671. [PMID: 30266789 PMCID: PMC6196525 DOI: 10.1073/pnas.1806643115] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Indexed: 12/21/2022] Open
Abstract
Scientific progress depends on formulating testable hypotheses informed by the literature. In many domains, however, this model is strained because the number of research papers exceeds human readability. Here, we developed computational assistance to analyze the biomedical literature by reading PubMed abstracts to suggest new hypotheses. The approach was tested experimentally on the tumor suppressor p53 by ranking its most likely kinases, based on all available abstracts. Many of the best-ranked kinases were found to bind and phosphorylate p53 (P value = 0.005), suggesting six likely p53 kinases so far. One of these, NEK2, was studied in detail. A known mitosis promoter, NEK2 was shown to phosphorylate p53 at Ser315 in vitro and in vivo and to functionally inhibit p53. These bona fide validations of text-based predictions of p53 phosphorylation, and the discovery of an inhibitory p53 kinase of pharmaceutical interest, suggest that automated reasoning using a large body of literature can generate valuable molecular hypotheses and has the potential to accelerate scientific discovery.
|
20
|
Vilar S, Friedman C, Hripcsak G. Detection of drug-drug interactions through data mining studies using clinical sources, scientific literature and social media. Brief Bioinform 2018; 19:863-877. [PMID: 28334070 PMCID: PMC6454455 DOI: 10.1093/bib/bbx010] [Citation(s) in RCA: 90] [Impact Index Per Article: 12.9] [Received: 10/14/2016] [Revised: 12/28/2016] [Indexed: 11/13/2022] Open
Abstract
Drug-drug interactions (DDIs) constitute an important concern in drug development and postmarketing pharmacovigilance. They are considered the cause of many adverse drug effects exposing patients to higher risks and increasing public health system costs. Methods to follow-up and discover possible DDIs causing harm to the population are a primary aim of drug safety researchers. Here, we review different methodologies and recent advances using data mining to detect DDIs with impact on patients. We focus on data mining of different pharmacovigilance sources, such as the US Food and Drug Administration Adverse Event Reporting System and electronic health records from medical institutions, as well as on the diverse data mining studies that use narrative text available in the scientific biomedical literature and social media. We pay attention to the strengths but also further explain challenges related to these methods. Data mining has important applications in the analysis of DDIs showing the impact of the interactions as a cause of adverse effects, extracting interactions to create knowledge data sets and gold standards and in the discovery of novel and dangerous DDIs.
Affiliation(s)
- Santiago Vilar
- Department of Biomedical Informatics, Columbia University, New York, USA
- Department of Organic Chemistry, University of Santiago de Compostela, Spain
| | - Carol Friedman
- Department of Biomedical Informatics, Columbia University, New York, USA
| | - George Hripcsak
- Department of Biomedical Informatics, Columbia University, New York, USA
| |
|
21
|
Koci O, Logan M, Svolos V, Russell RK, Gerasimidis K, Ijaz UZ. An automated identification and analysis of ontological terms in gastrointestinal diseases and nutrition-related literature provides useful insights. PeerJ 2018; 6:e5047. [PMID: 30065857 PMCID: PMC6064635 DOI: 10.7717/peerj.5047] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 04/18/2018] [Accepted: 05/31/2018] [Indexed: 12/20/2022] Open
Abstract
With an unprecedented growth in the biomedical literature, keeping up to date with the new developments presents an immense challenge. Publications are often studied in isolation of the established literature, with interpretation being subjective and often introducing human bias. With ontology-driven annotation of biomedical data gaining popularity in recent years and online databases offering metatags with rich textual information, it is now possible to automatically text-mine ontological terms and complement the laborious task of manual management, interpretation, and analysis of the accumulated literature with downstream statistical analysis. In this paper, we have formulated an automated workflow through which we have identified ontological information, including nutrition-related terms in PubMed abstracts (from 1991 to 2016) for two main types of Inflammatory Bowel Diseases: Crohn’s Disease and Ulcerative Colitis; and two other gastrointestinal (GI) diseases, namely, Coeliac Disease and Irritable Bowel Syndrome. Our analysis reveals unique clustering patterns as well as spatial and temporal trends inherent to the considered GI diseases in terms of literature that has been accumulated so far. Although automated interpretation cannot replace human judgement, the developed workflow shows promising results and can be a useful tool in systematic literature reviews. The workflow is available at https://github.com/KociOrges/pytag.
Affiliation(s)
- Orges Koci
- Human Nutrition, School of Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, UK
| | - Michael Logan
- Infrastructure and Environment Research Division, School of Engineering, University of Glasgow, Glasgow, UK
| | - Vaios Svolos
- Human Nutrition, School of Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, UK
| | - Richard K Russell
- Department of Paediatric Gastroenterology, Hepatology and Nutrition, Royal Hospital for Children, Glasgow, UK
| | - Konstantinos Gerasimidis
- Human Nutrition, School of Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, UK
| | - Umer Zeeshan Ijaz
- Infrastructure and Environment Research Division, School of Engineering, University of Glasgow, Glasgow, UK
| |
|
22
|
Baladrón C, Santos-Lozano A, Aguiar JM, Lucia A, Martín-Hernández J. Tool for filtering PubMed search results by sample size. J Am Med Inform Assoc 2018; 25:774-779. [PMID: 29409012 DOI: 10.1093/jamia/ocx155] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/05/2017] [Accepted: 12/20/2017] [Indexed: 11/13/2022] Open
Abstract
Objective The most used search engine for scientific literature, PubMed, provides tools to filter results by several fields. When searching for reports on clinical trials, sample size can be among the most important factors to consider. However, PubMed does not currently provide any means of filtering search results by sample size. Such a filtering tool would be useful in a variety of situations, including meta-analyses or state-of-the-art analyses to support experimental therapies. In this work, a tool was developed to filter articles identified by PubMed based on their reported sample sizes. Materials and Methods A search engine was designed to send queries to PubMed, retrieve results, and compute estimates of reported sample sizes using a combination of syntactical and machine learning methods. The sample size search tool is publicly available for download at http://ihealth.uemc.es. Its accuracy was assessed against a manually annotated database of 750 random clinical trials returned by PubMed. Results Validation tests show that the sample size search tool is able to accurately (1) estimate sample size for 70% of abstracts and (2) classify 85% of abstracts into sample size quartiles. Conclusions The proposed tool was validated as useful for advanced PubMed searches of clinical trials when the user is interested in identifying trials of a given sample size.
Affiliation(s)
- Carlos Baladrón
- i+HeALTH Research Group, Miguel de Cervantes European University, Higher Polytechnic School, Department of Technical Teachings, Valladolid, Spain
| | - Alejandro Santos-Lozano
- i+HeALTH Research Group, Miguel de Cervantes European University, Faculty of Health Sciences, Department of Health Sciences, Valladolid, Spain
| | - Javier M Aguiar
- Data Engineering Research Group, Universidad de Valladolid, Higher Technical School of Telecommunications Engineering, TSyCeIT Department, Valladolid, Spain
| | - Alejandro Lucia
- Research Institute of Hospital 12 de Octubre and European University, Madrid, Spain
| | - Juan Martín-Hernández
- i+HeALTH Research Group, Miguel de Cervantes European University, Faculty of Health Sciences, Department of Health Sciences, Valladolid, Spain
| |
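The abstract of the sample size tool above mentions a syntactical component for estimating reported sample sizes. As a rough illustration only, the sketch below shows what a purely pattern-based estimator might look like. The patterns, the noun list, and the function name `estimate_sample_size` are assumptions made for this example; the published tool's actual rules and its machine learning component are not described in the abstract.

```python
import re
from typing import Optional

# Hypothetical patterns (not taken from the published tool).
_NUMBER = r"(\d{1,3}(?:,\d{3})+|\d+)"
_NOUNS = r"(?:patients|participants|subjects|individuals|volunteers|women|men|children|adults)"
_PATTERNS = [
    # Explicit notation such as "n = 625" or "N=40".
    re.compile(rf"\bn\s*=\s*{_NUMBER}", re.IGNORECASE),
    # A count followed, within a few words, by a population noun,
    # e.g. "1,250 patients" or "40 healthy adult volunteers".
    re.compile(rf"\b{_NUMBER}\s+(?:\w+\s+){{0,3}}?{_NOUNS}\b", re.IGNORECASE),
]


def estimate_sample_size(abstract: str) -> Optional[int]:
    """Return the largest candidate count found in the text, or None."""
    candidates = []
    for pattern in _PATTERNS:
        for match in pattern.finditer(abstract):
            # Strip thousands separators before converting to an integer.
            candidates.append(int(match.group(1).replace(",", "")))
    return max(candidates) if candidates else None


text = "A total of 1,250 patients were randomized (n = 625 per arm)."
print(estimate_sample_size(text))
```

Taking the maximum is a simplifying heuristic: it favours the overall enrolment figure over per-arm counts, but it can be misled by other large numbers (for example a year directly followed by a population noun), which is presumably one reason a real system would combine such rules with a learned model.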
|
23
|
Research Trend Visualization by MeSH Terms from PubMed. Int J Environ Res Public Health 2018; 15:ijerph15061113. [PMID: 29848974 PMCID: PMC6025283 DOI: 10.3390/ijerph15061113] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 04/17/2018] [Revised: 05/28/2018] [Accepted: 05/29/2018] [Indexed: 11/17/2022]
Abstract
Motivation: PubMed is a primary source of biomedical information, comprising a search tool and the biomedical literature from MEDLINE (the US National Library of Medicine's premier bibliographic database), life science journals, and online books. Complementary tools to PubMed have been developed to help users search for literature and acquire knowledge. However, these tools are insufficient to overcome the difficulties users face due to the proliferation of biomedical literature. A new method is needed for searching knowledge in the biomedical field. Methods: A new method is proposed in this study for visualizing recent research trends based on the documents retrieved for a search query given by the user. The Medical Subject Headings (MeSH) are used as the primary analytical element. MeSH terms are extracted from the literature and the correlations between them are calculated. A MeSH network, called MeSH Net, is generated as the final result based on the Pathfinder Network algorithm. Results: A case study to verify the proposed method was carried out on a research area defined by the search query (immunotherapy and cancer and "tumor microenvironment"). The MeSH Net generated by the method is in good agreement with the actual research activities in the research area (immunotherapy). Conclusion: A prototype application generating MeSH Net was developed. The application, which could be used as a "guide map for travelers", allows users to quickly and easily acquire knowledge of research trends. The combination of PubMed and MeSH Net is expected to be an effective complementary system for researchers in the biomedical field who experience difficulties with search and information analysis.
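The Methods part of the abstract above describes extracting MeSH terms per document and calculating correlations between them. A minimal sketch of that step is given below, under clearly stated assumptions: the association measure used here (co-occurrence count divided by the geometric mean of the two term frequencies) is a stand-in chosen for illustration, since the abstract does not name the paper's correlation measure, and the Pathfinder Network pruning step is not implemented. The function names and the toy input are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def mesh_cooccurrence(docs):
    """Count per-term document frequency and per-pair co-occurrence.

    docs: iterable of iterables, each holding the MeSH terms of one document.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for terms in docs:
        unique = sorted(set(terms))  # each term counted once per document
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return term_counts, pair_counts


def association_scores(term_counts, pair_counts):
    """Assumed measure: co-occurrence normalized by sqrt(freq_a * freq_b)."""
    return {
        (a, b): n / sqrt(term_counts[a] * term_counts[b])
        for (a, b), n in pair_counts.items()
    }


# Toy input; real input would be MeSH terms from retrieved PubMed records.
docs = [
    ["Immunotherapy", "Neoplasms"],
    ["Immunotherapy", "Neoplasms", "Tumor Microenvironment"],
    ["Neoplasms"],
]
term_counts, pair_counts = mesh_cooccurrence(docs)
scores = association_scores(term_counts, pair_counts)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

The resulting weighted pairs form a dense term graph; a network-scaling step such as Pathfinder would then keep only the most salient links before visualization.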
|
24
|
Chen L, Friedman C, Finkelstein J. Automated Metabolic Phenotyping of Cytochrome Polymorphisms Using PubMed Abstract Mining. AMIA Annu Symp Proc 2018; 2017:535-544. [PMID: 29854118 PMCID: PMC5977704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
Pharmacogenetics-related publications, which are increasing rapidly, provide important new pharmacogenetics knowledge. Automated approaches to extract information of new alleles and to identify their impact on metabolic phenotypes from publications are urgently needed to facilitate personalized medicine and improve clinical outcomes. Cytochrome polymorphisms, responsible for a wide variation of drug pharmacodynamics, individual efficacy and adverse effects, have significant potential for optimizing drug therapy. A few studies have addressed specialized efforts to automatically extract cytochrome polymorphisms and their characterizations regarding metabolic phenotypes from the literature. In this paper, we present a novel rule-based text-mining system to extract metabolic phenotypes of polymorphisms from PubMed abstracts with a focus on cytochrome P450. This system is promising as it achieved a precision of 85.71% in a preliminary proof-of-concept evaluation and is expected to automatically provide up-to-date metabolic information for cytochrome polymorphisms, which is critical to advance personalized medicine and improve clinical care.
Affiliation(s)
- Luoxin Chen
- Department of Biomedical Informatics, Columbia University, New York, NY, US
| | - Carol Friedman
- Department of Biomedical Informatics, Columbia University, New York, NY, US
| | - Joseph Finkelstein
- Department of Biomedical Informatics, Columbia University, New York, NY, US
| |
|
25
|
Cannon DC, Yang JJ, Mathias SL, Ursu O, Mani S, Waller A, Schürer SC, Jensen LJ, Sklar LA, Bologa CG, Oprea TI. TIN-X: target importance and novelty explorer. Bioinformatics 2018; 33:2601-2603. [PMID: 28398460 PMCID: PMC5870731 DOI: 10.1093/bioinformatics/btx200] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 01/19/2017] [Accepted: 04/06/2017] [Indexed: 11/14/2022] Open
Abstract
Motivation The increasing amount of peer-reviewed manuscripts requires the development of specific mining tools to facilitate the visual exploration of evidence linking diseases and proteins. Results We developed TIN-X, the Target Importance and Novelty eXplorer, to visualize the association between proteins and diseases, based on text mining data processed from scientific literature. In the current implementation, TIN-X supports exploration of data for G-protein coupled receptors, kinases, ion channels, and nuclear receptors. TIN-X supports browsing and navigating across proteins and diseases based on ontology classes, and displays a scatter plot with two proposed new bibliometric statistics: Importance and Novelty. Availability and Implementation http://www.newdrugtargets.org. Contact cbologa@salud.unm.edu.
Affiliation(s)
- Daniel C Cannon
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, NM 87131, USA
| | - Jeremy J Yang
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, NM 87131, USA
| | - Stephen L Mathias
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, NM 87131, USA
| | - Oleg Ursu
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, NM 87131, USA
| | - Subramani Mani
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, NM 87131, USA
| | - Anna Waller
- UNM Center for Molecular Discovery, University of New Mexico Comprehensive Cancer Center, University of New Mexico, Albuquerque, NM 87131, USA
| | - Stephan C Schürer
- Department of Molecular and Cellular Pharmacology, Miller School of Medicine, University of Miami, Miami, FL 33136, USA
| | - Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N 2200, Denmark
| | - Larry A Sklar
- UNM Center for Molecular Discovery, University of New Mexico Comprehensive Cancer Center, University of New Mexico, Albuquerque, NM 87131, USA; Department of Pathology, University of New Mexico, NM 87131, USA
| | - Cristian G Bologa
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, NM 87131, USA
| | - Tudor I Oprea
- Translational Informatics Division, Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, NM 87131, USA
| |
|
26
|
Jiang X, Ringwald M, Blake J, Shatkay H. Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD). Database (Oxford) 2017; 2017:3084695. [PMID: 28365740 PMCID: PMC5467553 DOI: 10.1093/database/bax017] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 11/01/2016] [Accepted: 02/13/2017] [Indexed: 12/16/2022]
Abstract
The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area. Database URL:www.informatics.jax.org
Affiliation(s)
- Xiangying Jiang
- Department of Computer and Information Sciences, University of Delaware, 101 Smith Hall, Newark, DE, USA
| | - Martin Ringwald
- Department of Computer and Information Sciences, The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA
| | - Judith Blake
- Department of Computer and Information Sciences, The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA
| | - Hagit Shatkay
- Department of Computer and Information Sciences, University of Delaware, 101 Smith Hall, Newark, DE, USA
| |
|
27
|
Seco de Herrera AG, Schaer R, Müller H. Shangri-La: A medical case-based retrieval tool. J Assoc Inf Sci Technol 2017. [DOI: 10.1002/asi.23858] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/08/2022]
Affiliation(s)
- Alba G. Seco de Herrera
- University of Applied Sciences Western Switzerland (HES-SO), Sierre, Switzerland; National Library of Medicine (NLM/NIH); Bethesda MD USA
| | - Roger Schaer
- University of Applied Sciences Western Switzerland (HES-SO); Sierre Switzerland
| | - Henning Müller
- University of Applied Sciences Western Switzerland (HES-SO); Sierre Switzerland
| |
|
28
|
Henry S, McInnes BT. Literature Based Discovery: Models, methods, and trends. J Biomed Inform 2017; 74:20-32. [PMID: 28838802 DOI: 10.1016/j.jbi.2017.08.011] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.6] [Received: 03/03/2017] [Revised: 07/21/2017] [Accepted: 08/20/2017] [Indexed: 01/25/2023]
Abstract
OBJECTIVES This paper provides an introduction and overview of literature based discovery (LBD) in the biomedical domain. It introduces the reader to modern and historical LBD models, key system components, evaluation methodologies, and current trends. After completion, the reader will be familiar with the challenges and methodologies of LBD. The reader will be capable of distinguishing between recent LBD systems and publications, and be capable of designing an LBD system for a specific application. TARGET AUDIENCE From biomedical researchers curious about LBD, to someone looking to design an LBD system, to an LBD expert trying to catch up on trends in the field. The reader need not be familiar with LBD, but knowledge of biomedical text processing tools is helpful. SCOPE This paper describes a unifying framework for LBD systems. Within this framework, different models and methods are presented to both distinguish and show overlap between systems. Topics include term and document representation, system components, and an overview of models including co-occurrence models, semantic models, and distributional models. Other topics include uninformative term filtering, term ranking, results display, system evaluation, an overview of the application areas of drug development, drug repurposing, and adverse drug event prediction, and challenges and future directions. A timeline showing contributions to LBD, and a table summarizing the works of several authors is provided. Topics are presented from a high level perspective. References are given if more detailed analysis is required.
Affiliation(s)
- Sam Henry
- Department of Computer Science, Virginia Commonwealth University, 401 S. Main St., Rm E4222, Richmond, VA 23284, USA.
| | - Bridget T McInnes
- Department of Computer Science, Virginia Commonwealth University, 401 S. Main St., Rm E4222, Richmond, VA 23284, USA
| |
|
29
|
Almeida H, Jean-Louis L, Meurs MJ. An open source and modular search engine for biomedical literature retrieval. Comput Intell 2017. [DOI: 10.1111/coin.12125] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Affiliation(s)
- Hayda Almeida
- Université du Québec à Montréal; Montréal QC Canada
- Centre for Structural and Functional Genomics; Concordia University; Montréal QC Canada
| | | | - Marie-Jean Meurs
- Université du Québec à Montréal; Montréal QC Canada
- Centre for Structural and Functional Genomics; Concordia University; Montréal QC Canada
| |
|
30
|
Harris DR, Kavuluru R, Jaromczyk JW, Johnson TR. Rapid and Reusable Text Visualization and Exploration Development with DELVE. AMIA Joint Summits on Translational Science Proceedings 2017; 2017:139-148. [PMID: 28815123 PMCID: PMC5543346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2022]
Abstract
We present DELVE (Document ExpLoration and Visualization Engine), a framework for developing interactive visualizations as modular Web-applications to assist researchers with exploratory literature search. The goal for web-applications driven by DELVE is to better satisfy the information needs of researchers and to help explore and understand the state of research in scientific literature by providing immersive visualizations that both contain facets and are driven by facets derived from the literature. We base our framework on principles from user-centered design and human-computer interaction (HCI). Preliminary evaluations demonstrate the usefulness of DELVE's techniques: (1) a clinical researcher immediately saw that her original query was inappropriate simply due to the frequencies displayed via generalized clouds and (2) a muscle biologist quickly learned of vocabulary differences found between two disciplines that were referencing the same idea, which we feel is critical for interdisciplinary work. We discuss the underlying category-theoretic model of our framework and show that it naturally encourages the development of reusable visualizations by emphasizing interoperability.
Affiliation(s)
- Daniel R. Harris
- Center for Clinical and Translational Sciences, University of Kentucky, Lexington, KY 40506; Department of Computer Science, University of Kentucky, Lexington, KY 40506
| | - Ramakanth Kavuluru
- Department of Computer Science, University of Kentucky, Lexington, KY 40506; Institute of Biomedical Informatics, University of Kentucky, Lexington, KY 40506
| | - Jerzy W. Jaromczyk
- Department of Computer Science, University of Kentucky, Lexington, KY 40506
| | - Todd R. Johnson
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030
| |
|
31
|
Larsson K, Baker S, Silins I, Guo Y, Stenius U, Korhonen A, Berglund M. Text mining for improved exposure assessment. PLoS One 2017; 12:e0173132. [PMID: 28257498 PMCID: PMC5336247 DOI: 10.1371/journal.pone.0173132] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 08/16/2016] [Accepted: 02/15/2017] [Indexed: 01/24/2023] Open
Abstract
Chemical exposure assessments are based on information collected via different methods, such as biomonitoring, personal monitoring, environmental monitoring and questionnaires. The vast amount of chemical-specific exposure information available from web-based databases, such as PubMed, is undoubtedly a great asset to the scientific community. However, manual retrieval of relevant published information is an extremely time consuming task and overviewing the data is nearly impossible. Here, we present the development of an automatic classifier for chemical exposure information. First, nearly 3700 abstracts were manually annotated by an expert in exposure sciences according to a taxonomy exclusively created for exposure information. Natural Language Processing (NLP) techniques were used to extract semantic and syntactic features relevant to chemical exposure text. Using these features, we trained a supervised machine learning algorithm to automatically classify PubMed abstracts according to the exposure taxonomy. The resulting classifier demonstrates good performance in the intrinsic evaluation. We also show that the classifier improves information retrieval of chemical exposure data compared to keyword-based PubMed searches. Case studies demonstrate that the classifier can be used to assist researchers by facilitating information retrieval and classification, enabling data gap recognition and overviewing available scientific literature using chemical-specific publication profiles. Finally, we identify challenges to be addressed in future development of the system.
Affiliation(s)
- Kristin Larsson
- Institute of Environmental Medicine, Karolinska Institute, Stockholm, Sweden
| | - Simon Baker
- Computer Laboratory, University of Cambridge, Cambridge, United Kingdom
| | - Ilona Silins
- Institute of Environmental Medicine, Karolinska Institute, Stockholm, Sweden
| | - Yufan Guo
- Computer Laboratory, University of Cambridge, Cambridge, United Kingdom
| | - Ulla Stenius
- Institute of Environmental Medicine, Karolinska Institute, Stockholm, Sweden
| | - Anna Korhonen
- Computer Laboratory, University of Cambridge, Cambridge, United Kingdom
- Language Technology Lab, DTAL, University of Cambridge, Cambridge, United Kingdom
| | - Marika Berglund
- Institute of Environmental Medicine, Karolinska Institute, Stockholm, Sweden
| |
|
32
|
Roos A, Hedlund T. Using the domain analytical approach in the study of information practices in biomedicine. Journal of Documentation 2016. [DOI: 10.1108/jd-11-2015-0139] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/17/2022]
Abstract
Purpose
The purpose of this paper is to analyze the information practices of the researchers in biomedicine using the domain analytical approach.
Design/methodology/approach
The domain analytical research approach used in the study of the scientific domain of biomedicine leads to studies into the organization of sciences. By using Whitley’s dimensions of “mutual dependence” and “task uncertainty” in scientific work as a starting point the authors were able to reanalyze previously collected data. By opening up these concepts in the biomedical research work context, the authors analyzed the distinguishing features of the biomedical domain and the way these features affected researchers’ information practices.
Findings
Several indicators representing “task uncertainty” and “mutual dependence” in the scientific domain of biomedicine were identified. This study supports the view that in biomedicine the task uncertainty is low and researchers are highly dependent on each other. Hard competition seems to be one feature behind the explosion of data and publications in this domain. This, in turn, is directly related to the ways information is searched, followed, used and produced. The need for new easy-to-use services or tools for searching and following information on so-called “hot” topics became apparent.
Originality/value
The study highlights new information about information practices in the biomedical domain. Whitley’s theory enabled a thorough analysis of the cultural and social nature of the biomedical domain and it proved to be useful in the examination of researchers’ information practices.
|
33
|
Ahmed Z, Zeeshan S, Dandekar T. Mining biomedical images towards valuable information retrieval in biomedical and life sciences. Database (Oxford) 2016; 2016:baw118. [PMID: 27538578 PMCID: PMC4990152 DOI: 10.1093/database/baw118] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 12/14/2015] [Revised: 06/07/2016] [Accepted: 07/19/2016] [Indexed: 12/22/2022]
Abstract
Biomedical images are helpful sources for scientists and practitioners in drawing significant hypotheses, exemplifying approaches and describing experimental results in published biomedical literature. In the last decades, there has been an enormous increase in the amount of heterogeneous biomedical image production and publication, which results in a need for bioimaging platforms for feature extraction and analysis of text and content in biomedical images, to be used in implementing effective information retrieval systems. In this review, we summarize technologies related to data mining of figures. We describe and compare the potential of different approaches in terms of their developmental aspects, used methodologies, produced results, achieved accuracies and limitations. Our comparative conclusions include current challenges for bioimaging software with selective image mining, embedded text extraction and processing of complex natural language queries.
Affiliation(s)
- Zeeshan Ahmed
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Saman Zeeshan
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Thomas Dandekar
- Department of Bioinformatics, Biocenter, University of Wuerzburg, Wuerzburg, Germany; EMBL, Computational Biology and Structures Program, Heidelberg, Germany
| |
|
34
|
Gupta R, Mantri SS. Biomolecular Relationships Discovered from Biological Labyrinth and Lost in Ocean of Literature: Community Efforts Can Rescue Until Automated Artificial Intelligence Takes Over. Front Genet 2016; 7:46. [PMID: 27066067 PMCID: PMC4814459 DOI: 10.3389/fgene.2016.00046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2016] [Accepted: 03/15/2016] [Indexed: 11/30/2022] Open
|
35
|
Ahmed Z, Dandekar T. MSL: Facilitating automatic and physical analysis of published scientific literature in PDF format. F1000Res 2015; 4:1453. [PMID: 29721305 PMCID: PMC5897790 DOI: 10.12688/f1000research.7329.3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Accepted: 03/26/2018] [Indexed: 01/12/2023] Open
Abstract
Published scientific literature contains millions of figures, including information about the results obtained from different scientific experiments e.g. PCR-ELISA data, microarray analysis, gel electrophoresis, mass spectrometry data, DNA/RNA sequencing, diagnostic imaging (CT/MRI and ultrasound scans), and medicinal imaging like electroencephalography (EEG), magnetoencephalography (MEG), echocardiography (ECG), positron-emission tomography (PET) images. The importance of biomedical figures has been widely recognized in scientific and medicine communities, as they play a vital role in providing major original data, experimental and computational results in concise form. One major challenge for implementing a system for scientific literature analysis is extracting and analyzing text and figures from published PDF files by physical and logical document analysis. Here we present a product line architecture based bioinformatics tool ‘Mining Scientific Literature (MSL)’, which supports the extraction of text and images by interpreting all kinds of published PDF files using advanced data mining and image processing techniques. It provides modules for the marginalization of extracted text based on different coordinates and keywords, visualization of extracted figures and extraction of embedded text from all kinds of biological and biomedical figures using applied Optical Character Recognition (OCR). Moreover, for further analysis and usage, it generates the system’s output in different formats including text, PDF, XML and image files. Hence, MSL is an easy to install and use analysis tool to interpret published scientific literature in PDF format.
Affiliation(s)
- Zeeshan Ahmed
- Genetics and Genome Sciences, School of Medicine, University of Connecticut Health Center, Farmington, CT, 06032, USA; Institute for Systems Genomics, University of Connecticut Health Center, Farmington, CT, 06032, USA
| | - Thomas Dandekar
- Department of Bioinformatics, Biocenter, University of Wuerzburg, Wuerzburg, 97074, Germany
| |
|
36
|
Abstract
BACKGROUND Identifying relevant studies for inclusion in a systematic review (i.e. screening) is a complex, laborious and expensive task. Recently, a number of studies have shown that the use of machine learning and text mining methods to automatically identify relevant studies has the potential to drastically decrease the workload involved in the screening phase. The vast majority of these machine learning methods exploit the same underlying principle, i.e. a study is modelled as a bag-of-words (BOW). METHODS We explore the use of topic modelling methods to derive a more informative representation of studies. We apply Latent Dirichlet allocation (LDA), an unsupervised topic modelling approach, to automatically identify topics in a collection of studies. We then represent each study as a distribution of LDA topics. Additionally, we enrich topics derived using LDA with multi-word terms identified by using an automatic term recognition (ATR) tool. For evaluation purposes, we carry out automatic identification of relevant studies using support vector machine (SVM)-based classifiers that employ both our novel topic-based representation and the BOW representation. RESULTS Our results show that the SVM classifier is able to identify a greater number of relevant studies when using the LDA representation than the BOW representation. These observations hold for two systematic reviews of the clinical domain and three reviews of the social science domain. CONCLUSIONS A topic-based feature representation of documents outperforms the BOW representation when applied to the task of automatic citation screening. The proposed term-enriched topics are more informative and less ambiguous to systematic reviewers.
|
37
|
Yu M, Selvaraj SK, Liang-Chu MMY, Aghajani S, Busse M, Yuan J, Lee G, Peale F, Klijn C, Bourgon R, Kaminker JS, Neve RM. A resource for cell line authentication, annotation and quality control. Nature 2015; 520:307-11. [PMID: 25877200 DOI: 10.1038/nature14397] [Citation(s) in RCA: 161] [Impact Index Per Article: 16.1] [Received: 05/27/2014] [Accepted: 03/09/2015] [Indexed: 01/25/2023]
Abstract
Cell line misidentification, contamination and poor annotation affect scientific reproducibility. Here we outline simple measures to detect or avoid cross-contamination, present a framework for cell line annotation linked to short tandem repeat and single nucleotide polymorphism profiles, and provide a catalogue of synonymous cell lines. This resource will enable our community to eradicate the use of misidentified lines and generate credible cell-based data.
Affiliation(s)
- Mamie Yu
- Department of Discovery Oncology, Genentech Inc., South San Francisco, California 94080, USA
| | - Suresh K Selvaraj
- Department of Discovery Oncology, Genentech Inc., South San Francisco, California 94080, USA
| | - May M Y Liang-Chu
- Department of Discovery Oncology, Genentech Inc., South San Francisco, California 94080, USA
| | - Sahar Aghajani
- Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, California 94080, USA
| | - Matthew Busse
- Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, California 94080, USA
| | - Jean Yuan
- Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, California 94080, USA
| | - Genee Lee
- Department of Discovery Oncology, Genentech Inc., South San Francisco, California 94080, USA
| | - Franklin Peale
- Department of Pathology, Genentech Inc., South San Francisco, California 94080, USA
| | - Christiaan Klijn
- Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, California 94080, USA
| | - Richard Bourgon
- Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, California 94080, USA
| | - Joshua S Kaminker
- Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, California 94080, USA
| | - Richard M Neve
- Department of Discovery Oncology, Genentech Inc., South San Francisco, California 94080, USA
| |
|
38
|
Gobeill J, Gaudinat A, Pasche E, Vishnyakova D, Gaudet P, Bairoch A, Ruch P. Deep Question Answering for protein annotation. Database (Oxford) 2015; 2015:bav081. [PMID: 26384372 PMCID: PMC4572360 DOI: 10.1093/database/bav081] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 11/16/2014] [Revised: 07/06/2015] [Accepted: 08/08/2015] [Indexed: 11/14/2022]
Abstract
Biomedical professionals have access to a huge amount of literature, but when they use a search engine, they often have to deal with too many documents to efficiently find the appropriate information in a reasonable time. In this perspective, question-answering (QA) engines are designed to display answers, which were automatically extracted from the retrieved documents. Standard QA engines in the literature process a user question, then retrieve relevant documents and finally extract some possible answers out of these documents using various named-entity recognition processes. In our study, we try to answer complex genomics questions, which can be adequately answered only using Gene Ontology (GO) concepts. Such complex answers cannot be found using state-of-the-art dictionary- and redundancy-based QA engines. We compare the effectiveness of two dictionary-based classifiers for extracting correct GO answers from a large set of 100 retrieved abstracts per question. In the same way, we also investigate the power of GOCat, a GO supervised classifier. GOCat exploits the GOA database to propose GO concepts that were annotated by curators for similar abstracts. This approach is called deep QA, as it adds an original classification step, and exploits curated biological data to infer answers, which are not explicitly mentioned in the retrieved documents. We show that for complex answers such as protein functional descriptions, the redundancy phenomenon has a limited effect. Similarly, usual dictionary-based approaches are relatively ineffective. In contrast, we demonstrate how existing curated data, beyond information extraction, can be exploited by a supervised classifier, such as GOCat, to massively improve both the quantity and the quality of the answers with a +100% improvement for both recall and precision. Database URL: http://eagl.unige.ch/DeepQA4PA/.
Affiliation(s)
- Julien Gobeill
- BiTeM group, University of Applied Sciences-HEG, Library and Information Sciences, SIBTex group, Swiss Institute of Bioinformatics
| | - Arnaud Gaudinat
- BiTeM group, University of Applied Sciences-HEG, Library and Information Sciences
| | - Emilie Pasche
- BiTeM group, University of Applied Sciences-HEG, Library and Information Sciences, SIBTex group, Swiss Institute of Bioinformatics, University and Hospitals of Geneva, Division of Medical Information Sciences, Geneva, Switzerland
| | - Dina Vishnyakova
- University and Hospitals of Geneva, Division of Medical Information Sciences, Geneva, Switzerland
| | | | - Amos Bairoch
- Calipho group, Swiss Institute of Bioinformatics
| | - Patrick Ruch
- BiTeM group, University of Applied Sciences-HEG, Library and Information Sciences, SIBTex group, Swiss Institute of Bioinformatics
| |
|
39
|
Abstract
Natural language processing employs computational techniques for the purpose of learning, understanding, and producing human language content. Early computational approaches to language research focused on automating the analysis of the linguistic structure of language and developing basic technologies such as machine translation, speech recognition, and speech synthesis. Today's researchers refine and make use of such tools in real-world applications, creating spoken dialogue systems and speech-to-speech translation engines, mining social media for information about health or finance, and identifying sentiment and emotion toward products and services. We describe successes and challenges in this rapidly advancing area.
Affiliation(s)
- Julia Hirschberg
- Department of Computer Science, Columbia University, New York, NY 10027, USA.
| | - Christopher D Manning
- Department of Linguistics, Stanford University, Stanford, CA 94305-2150, USA. Department of Computer Science, Stanford University, Stanford, CA 94305-9020, USA
| |
|
40
|
|
41
|
Zheng JG, Howsmon D, Zhang B, Hahn J, McGuinness D, Hendler J, Ji H. Entity linking for biomedical literature. BMC Med Inform Decis Mak 2015; 15 Suppl 1:S4. [PMID: 26045232 PMCID: PMC4460707 DOI: 10.1186/1472-6947-15-s1-s4] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND The Entity Linking (EL) task links entity mentions from an unstructured document to entities in a knowledge base. Although this problem is well-studied in news and social media, it has not received much attention in the life science domain. One outcome of tackling the EL problem in the life sciences domain is to enable scientists to build computational models of biological processes with more efficiency. However, simply applying a news-trained entity linker produces inadequate results. METHODS Since existing supervised approaches require a large amount of manually-labeled training data, which is currently unavailable for the life science domain, we propose a novel unsupervised collective inference approach to link entities from unstructured full texts of biomedical literature to 300 ontologies. The approach leverages the rich semantic information and structures in ontologies for similarity computation and entity ranking. RESULTS Without using any manual annotation, our approach significantly outperforms the state-of-the-art supervised EL method (9% absolute gain in linking accuracy). Furthermore, the state-of-the-art supervised EL method requires 15,000 manually annotated entity mentions for training. These promising results establish a benchmark for the EL task in the life science domain. We also provide in-depth analysis and discussion on both challenges and opportunities on automatic knowledge enrichment for scientific literature. CONCLUSIONS In this paper, we propose a novel unsupervised collective inference approach to address the EL problem in a new domain. We show that our unsupervised approach is able to outperform a current state-of-the-art supervised approach that has been trained with a large amount of manually labeled data. Life science presents an underrepresented domain for applying EL techniques. By providing a small benchmark data set and identifying opportunities, we hope to stimulate discussions across natural language processing and bioinformatics and motivate others to develop techniques for this largely untapped domain.
|
42
|
Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2015; 17:132-44. [PMID: 25935162 DOI: 10.1093/bib/bbv024] [Citation(s) in RCA: 107] [Impact Index Per Article: 10.7] [Received: 01/29/2015] [Indexed: 11/13/2022] Open
Abstract
One effective way to improve the state of the art is through competitions. Following the success of the Critical Assessment of protein Structure Prediction (CASP) in bioinformatics research, a number of challenge evaluations have been organized by the text-mining research community to assess and advance natural language processing (NLP) research for biomedicine. In this article, we review the different community challenge evaluations held from 2002 to 2014 and their respective tasks. Furthermore, we examine these challenge tasks through their targeted problems in NLP research and biomedical applications, respectively. Next, we describe the general workflow of organizing a Biomedical NLP (BioNLP) challenge and involved stakeholders (task organizers, task data producers, task participants and end users). Finally, we summarize the impact and contributions by taking into account different BioNLP challenges as a whole, followed by a discussion of their limitations and difficulties. We conclude with future trends in BioNLP challenge evaluations.
|
43
|
Blanc X, Collet TH, Auer R, Iriarte P, Krause J, Légaré F, Cornuz J, Clair C. Retrieval of publications addressing shared decision making: an evaluation of full-text searches on medical journal websites. JMIR Res Protoc 2015; 4:e38. [PMID: 25854180 PMCID: PMC4405619 DOI: 10.2196/resprot.3615] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/17/2014] [Revised: 01/21/2015] [Accepted: 02/04/2015] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Full-text searches of articles increase the recall, defined by the proportion of relevant publications that are retrieved. However, this method is rarely used in medical research due to resource constraints. For the purpose of a systematic review of publications addressing shared decision making, a full-text search method was required to retrieve publications where shared decision making does not appear in the title or abstract. OBJECTIVE The objective of our study was to assess the efficiency and reliability of full-text searches in major medical journals for identifying shared decision making publications. METHODS A full-text search was performed on the websites of 15 high-impact journals in general internal medicine to look up publications of any type from 1996-2011 containing the phrase "shared decision making". The search method was compared with a PubMed search of titles and abstracts only. The full-text search was further validated by requesting all publications from the same time period from the individual journal publishers and searching through the collected dataset. RESULTS The full-text search for "shared decision making" on journal websites identified 1286 publications in 15 journals compared to 119 through the PubMed search. The search within the publisher-provided publications of 6 journals identified 613 publications compared to 646 with the full-text search on the respective journal websites. The concordance rate was 94.3% between both full-text searches. CONCLUSIONS Full-text searching on medical journal websites is an efficient and reliable way to identify relevant articles in the field of shared decision making for review or other purposes. It may be more widely used in biomedical research in other fields in the future, with the collaboration of publishers and journals toward open-access data.
Affiliation(s)
- Xavier Blanc
- Department of Ambulatory Care and Community Medicine, University of Lausanne, Lausanne, Switzerland.
|
44
|
Ma Y, Dong M, Mita C, Sun S, Peng CK, Yang AC. Publication analysis on insomnia: how much has been done in the past two decades? Sleep Med 2015; 16:820-6. [PMID: 25979182 DOI: 10.1016/j.sleep.2014.12.028] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Received: 10/10/2014] [Revised: 12/07/2014] [Accepted: 12/29/2014] [Indexed: 12/19/2022]
Abstract
Insomnia has been a rising public concern in recent years. As one example of a multidisciplinary topic, the theme of insomnia research has gradually shifted over time; however, there is very little quantitative characterization of the research trends in insomnia. The current study aims to quantitatively analyze trends in insomnia publications for the past 20 years. We retrospectively analyzed insomnia-related publications retrieved from PubMed and Google Scholar between 1994 and from a number of different perspectives. We investigated the major areas of research focus for insomnia, journal characteristics, as well as trends in clinical management and treatment modalities. The resulting 5841 publications presented an exponential growth trend over the past two decades, with mean annual growth rates at nearly 10% for each publication type. Analysis of major research focuses indicated that depression, hypnotics and sedatives, questionnaires, and polysomnography are the most common topics at present. Furthermore, we found that while studies on drug therapy and adverse effects decreased in the most recent five years, the greatest expansion of insomnia publications were in the areas of cognitive behavioral therapy for insomnia (CBT-I) and alternative therapies. Collectively, insomnia publications present a continuous trend of increase. While sedative and hypnotic drugs dominated the treatment of insomnia, non-pharmacological therapies may have great potential for advancement in future years. Future research effort is warranted for novel tools and clinical trials, especially on insomnia treatments with inadequate evidence or not-yet-clear efficacy and side effects.
Affiliation(s)
- Yan Ma
- Division of Interdisciplinary Medicine and Biotechnology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA; Sleep Center, Eye Hospital, China Academy of Chinese Medical Sciences, Beijing, China
| | - Ming Dong
- IBM, Software Development Lab, Littleton, Massachusetts, USA
| | - Carol Mita
- Reference & Education Services, Countway Library of Medicine, Harvard Medical School, Boston, Massachusetts, USA
| | - Shuchen Sun
- Department of Otolaryngology, Guang'anmen Hospital, China Academy of Chinese Medical Sciences, Beijing, China
| | - Chung-Kang Peng
- Division of Interdisciplinary Medicine and Biotechnology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
| | - Albert C Yang
- Division of Interdisciplinary Medicine and Biotechnology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA; Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA; Department of Psychiatry, Taipei Veterans General Hospital, Taipei City, Taiwan.
|
45
|
Almeida H, Meurs MJ, Kosseim L, Butler G, Tsang A. Machine learning for biomedical literature triage. PLoS One 2014; 9:e115892. [PMID: 25551575 PMCID: PMC4281078 DOI: 10.1371/journal.pone.0115892] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 09/04/2014] [Accepted: 11/27/2014] [Indexed: 11/30/2022] Open
Abstract
This paper presents a machine learning system for supporting the first task of the biological literature manual curation process, called triage. We compare the performance of various classification models, by experimenting with dataset sampling factors and a set of features, as well as three different machine learning algorithms (Naive Bayes, Support Vector Machine and Logistic Model Trees). The results show that the most fitting model to handle the imbalanced datasets of the triage classification task is obtained by using domain relevant features, an under-sampling technique, and the Logistic Model Trees algorithm.
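The under-sampling idea named in this abstract can be sketched as follows: keep every example of the rare class and only a random subset of the frequent class. This is a minimal illustration, not the authors' implementation; the function name, the `majority_per_minority` parameter (standing in for a "sampling factor"), and the labels are assumptions.

```python
import random
from collections import Counter


def undersample(labels, majority_per_minority=1.0, seed=0):
    """Return sorted indices that keep all minority items and a random
    subset of the remaining items, sized relative to the minority class."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    keep = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    n_major = min(len(majority_idx), int(round(len(keep) * majority_per_minority)))
    keep.extend(rng.sample(majority_idx, n_major))
    return sorted(keep)


# Hypothetical triage labels: few relevant papers, many irrelevant ones.
labels = ["relevant"] * 5 + ["irrelevant"] * 95
kept = undersample(labels, majority_per_minority=2.0)
balanced = Counter(labels[i] for i in kept)
print(balanced)
```

The returned indices can be used to select the matching feature rows before training any classifier; only the training split should be resampled, so that evaluation still reflects the original class imbalance.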
Affiliation(s)
- Hayda Almeida
- Department of Computer Science and Software Engineering, Concordia University, Montreal, QC, Canada
| | - Marie-Jean Meurs
- Centre for Structural and Functional Genomics, Concordia University, Montreal, QC, Canada
| | - Leila Kosseim
- Department of Computer Science and Software Engineering, Concordia University, Montreal, QC, Canada
| | - Greg Butler
- Department of Computer Science and Software Engineering, Concordia University, Montreal, QC, Canada; Centre for Structural and Functional Genomics, Concordia University, Montreal, QC, Canada
| | - Adrian Tsang
- Centre for Structural and Functional Genomics, Concordia University, Montreal, QC, Canada
|
46
|
Nim HT, Boyd SE, Rosenthal NA. Systems approaches in integrative cardiac biology: illustrations from cardiac heterocellular signalling studies. Progress in Biophysics and Molecular Biology 2014; 117:69-77. [PMID: 25499442 DOI: 10.1016/j.pbiomolbio.2014.11.006] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 08/20/2014] [Revised: 11/26/2014] [Accepted: 11/28/2014] [Indexed: 12/27/2022]
Abstract
Understanding the complexity of cardiac physiology requires system-level studies of multiple cardiac cell types. Frequently, however, the end result of published research lacks the detail of the collaborative and integrative experimental design process, and the underlying conceptual framework. We review the recent progress in systems modelling and omics analysis of the heterocellular heart environment through complementary forward and inverse approaches, illustrating these conceptual and experimental frameworks with case studies from our own research program. The forward approach begins by collecting curated information from the niche cardiac biology literature, and connecting the dots to form mechanistic network models that generate testable system-level predictions. The inverse approach starts from the vast pool of public omics data in recent cardiac biological research, and applies bioinformatics analysis to produce novel candidates for further investigation. We also discuss the possibility of combining these two approaches into a hybrid framework, together with the benefits and challenges. These interdisciplinary research frameworks illustrate the interplay between computational models, omics analysis, and wet lab experiments, which holds the key to making real progress in improving human cardiac wellbeing.
Affiliation(s)
- Hieu T Nim
- Systems Biology Institute (SBI) Australia, Level 1, Building 75, Monash University, VIC 3800, Australia; Australian Regenerative Medicine Institute, Level 1, Building 75, Monash University, VIC 3800, Australia.
| | - Sarah E Boyd
- Systems Biology Institute (SBI) Australia, Level 1, Building 75, Monash University, VIC 3800, Australia; Australian Regenerative Medicine Institute, Level 1, Building 75, Monash University, VIC 3800, Australia
| | - Nadia A Rosenthal
- Australian Regenerative Medicine Institute, Level 1, Building 75, Monash University, VIC 3800, Australia
|
47
|
Wu C, Schwartz JM, Brabant G, Nenadic G. Molecular profiling of thyroid cancer subtypes using large-scale text mining. BMC Med Genomics 2014; 7 Suppl 3:S3. [PMID: 25521965 PMCID: PMC4290788 DOI: 10.1186/1755-8794-7-s3-s3] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 11/20/2022] Open
Abstract
Background: Thyroid cancer is the most common endocrine tumor with a steady increase in incidence. It is classified into multiple histopathological subtypes with potentially distinct molecular mechanisms. Identifying the most relevant genes and biological pathways reported in the thyroid cancer literature is vital for understanding the disease and developing targeted therapeutics. Results: We developed a large-scale text mining system to generate a molecular profiling of thyroid cancer subtypes. The system first uses a subtype classification method for the thyroid cancer literature, which employs a scoring scheme to assign different subtypes to articles. We evaluated the classification method on a gold standard derived from the PubMed Supplementary Concept annotations, achieving a micro-average F1-score of 85.9% for primary subtypes. We then used the subtype classification results to extract genes and pathways associated with different thyroid cancer subtypes and successfully unveiled important genes and pathways, including some instances that are missing from current manually annotated databases or most recent review articles. Conclusions: Identification of key genes and pathways plays a central role in understanding the molecular biology of thyroid cancer. An integration of subtype context can allow prioritized screening for diagnostic biomarkers and novel molecular targeted therapeutics. Source code used for this study is made freely available online at https://github.com/chengkun-wu/GenesThyCan.
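The micro-average F1-score quoted in this abstract pools true positives, false positives, and false negatives over all articles before computing precision and recall. The sketch below shows that calculation for multi-label subtype assignments; the function name and the example gold and predicted sets are hypothetical, and nothing here is taken from the GenesThyCan code.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over parallel lists of label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels assigned and correct
        fp += len(p - g)   # labels assigned but not in the gold standard
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted subtype labels for three articles.
gold = [{"papillary"}, {"follicular"}, {"medullary", "anaplastic"}]
pred = [{"papillary"}, {"papillary"}, {"medullary"}]
score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.3f}")
```

Because counts are pooled before averaging, frequent subtypes weigh more heavily in the micro-average than rare ones, which is the usual reason to report it alongside per-class scores.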
|
48
|
Micropublications: a semantic model for claims, evidence, arguments and annotations in biomedical communications. J Biomed Semantics 2014; 5:28. [PMID: 26261718 PMCID: PMC4530550 DOI: 10.1186/2041-1480-5-28] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.5] [Received: 03/12/2013] [Accepted: 06/16/2014] [Indexed: 11/10/2022] Open
Abstract
Background: Scientific publications are documentary representations of defeasible arguments, supported by data and repeatable methods. They are the essential mediating artifacts in the ecosystem of scientific communications. The institutional “goal” of science is publishing results. The linear document publication format, dating from 1665, has survived transition to the Web. Intractable publication volumes; the difficulty of verifying evidence; and observed problems in evidence and citation chains suggest a need for a web-friendly and machine-tractable model of scientific publications. This model should support: digital summarization, evidence examination, challenge, verification and remix, and incremental adoption. Such a model must be capable of expressing a broad spectrum of representational complexity, ranging from minimal to maximal forms. Results: The micropublications semantic model of scientific argument and evidence provides these features. Micropublications support natural language statements; data; methods and materials specifications; discussion and commentary; challenge and disagreement; as well as allowing many kinds of statement formalization. The minimal form of a micropublication is a statement with its attribution. The maximal form is a statement with its complete supporting argument, consisting of all relevant evidence, interpretations, discussion and challenges brought forward in support of or opposition to it. Micropublications may be formalized and serialized in multiple ways, including in RDF. They may be added to publications as stand-off metadata. An OWL 2 vocabulary for micropublications is available at http://purl.org/mp. A discussion of this vocabulary, along with RDF examples from the case studies, appears as OWL Vocabulary and RDF Examples in Additional file 1. Conclusion: Micropublications, because they model evidence and allow qualified, nuanced assertions, can play essential roles in the scientific communications ecosystem in places where simpler, formalized and purely statement-based models, such as the nanopublications model, will not be sufficient. At the same time they will add significant value to, and are intentionally compatible with, statement-based formalizations. We suggest that micropublications, generated by useful software tools supporting such activities as writing, editing, reviewing, and discussion, will be of great value in improving the quality and tractability of biomedical communications.
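The minimal and maximal forms described in this abstract can be pictured as a small data structure: a statement plus its attribution, optionally extended with supporting and challenging items. The sketch below is only an illustration in Python; the class and field names are hypothetical and are not the terms of the OWL 2 vocabulary at http://purl.org/mp, and the example content is invented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attribution:
    author: str  # who asserted the statement
    date: str    # when it was asserted


@dataclass
class Micropublication:
    statement: str
    attribution: Attribution
    supports: list = field(default_factory=list)    # evidence, data, methods
    challenges: list = field(default_factory=list)  # disagreement, counter-evidence

    def is_minimal(self):
        """True when only the statement and its attribution are present."""
        return not self.supports and not self.challenges


# Hypothetical example content (illustrative only).
mp = Micropublication(
    statement="Compound X reduces marker Y in model Z.",
    attribution=Attribution(author="A. Researcher", date="2014-01-01"),
)
was_minimal = mp.is_minimal()
mp.supports.append("Figure 2: dose-response data")
print(was_minimal, mp.is_minimal())
```

Starting from the minimal form and adding support or challenges later mirrors the incremental adoption the abstract asks of such a model.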
|
49
|
Xu R, Li L, Wang Q. dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text. BMC Bioinformatics 2014; 15:105. [PMID: 24725842 PMCID: PMC3998061 DOI: 10.1186/1471-2105-15-105] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 10/22/2013] [Accepted: 04/07/2014] [Indexed: 11/14/2022] Open
Abstract
BACKGROUND: Discerning the genetic contributions to complex human diseases is a challenging mandate that demands new types of data and calls for new avenues for advancing the state-of-the-art in computational approaches to uncovering disease etiology. Systems approaches to studying observable phenotypic relationships among diseases are emerging as an active area of research for both novel disease gene discovery and drug repositioning. Currently, systematic study of disease relationships on a phenome-wide scale is limited due to the lack of large-scale machine understandable disease phenotype relationship knowledge bases. Our study innovates a semi-supervised iterative pattern learning approach that is used to build a precise, large-scale disease-disease risk relationship (D1 → D2) knowledge base (dRiskKB) from a vast corpus of free-text published biomedical literature. RESULTS: 21,354,075 MEDLINE records comprised the text corpus under study. First, we used one typical disease risk-specific syntactic pattern (i.e. "D1 due to D2") as a seed to automatically discover other patterns specifying similar semantic relationships among diseases. We then extracted D1 → D2 risk pairs from MEDLINE using the learned patterns. We manually evaluated the precisions of the learned patterns and extracted pairs. Finally, we analyzed the correlations between disease-disease risk pairs and their associated genes and drugs. The newly created dRiskKB consists of a total of 34,448 unique D1 → D2 pairs, representing the risk-specific semantic relationships among 12,981 diseases with each disease linked to its associated genes and drugs. The identified patterns are highly precise (average precision of 0.99) in specifying the risk-specific relationships among diseases. The precisions of extracted pairs are 0.919 for those that are exactly matched and 0.988 for those that are partially matched. By comparing the iterative pattern approach starting from different seeds, we demonstrated that our algorithm is robust in terms of seed choice. We show that diseases and their risk diseases as well as diseases with similar risk profiles tend to share both genes and drugs. CONCLUSIONS: This unique dRiskKB, when combined with existing phenotypic, genetic, and genomic datasets, can have profound implications in our deeper understanding of disease etiology and in drug repositioning.
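The seed-and-expand loop outlined in this abstract can be sketched in a few lines: a seed pattern ("D1 due to D2") yields disease pairs, the pairs reveal other connecting phrases, and those phrases yield more pairs. The code below is a simplified stand-in using regular expressions over a tiny, hypothetical disease lexicon and invented sentences; it omits the pattern ranking and manual precision checks the authors describe, and it is not their algorithm.

```python
import re

# Hypothetical disease lexicon (a real system would use a large vocabulary).
DISEASES = ["anemia", "chronic kidney disease", "blindness", "diabetic retinopathy"]


def _disease_alternation():
    # Longest names first so multi-word diseases are matched whole.
    return "|".join(re.escape(d) for d in sorted(DISEASES, key=len, reverse=True))


def extract_pairs(sentences, patterns):
    """Find (D1, D2) pairs connected by any of the given phrases."""
    alt = _disease_alternation()
    pairs = set()
    for pat in patterns:
        rx = re.compile(rf"\b({alt}) {re.escape(pat)} ({alt})\b", re.IGNORECASE)
        for s in sentences:
            for d1, d2 in rx.findall(s):
                pairs.add((d1.lower(), d2.lower()))
    return pairs


def learn_patterns(sentences, pairs):
    """Collect the phrases that connect already-known pairs."""
    found = set()
    for d1, d2 in pairs:
        rx = re.compile(rf"\b{re.escape(d1)} ([a-z ]+?) {re.escape(d2)}\b", re.IGNORECASE)
        for s in sentences:
            for m in rx.finditer(s):
                found.add(m.group(1).lower())
    return found


# Invented example sentences (illustrative only).
sentences = [
    "Anemia due to chronic kidney disease is common.",
    "Anemia secondary to chronic kidney disease was treated.",
    "Blindness secondary to diabetic retinopathy is preventable.",
]
seed_pairs = extract_pairs(sentences, {"due to"})
patterns = learn_patterns(sentences, seed_pairs)
all_pairs = extract_pairs(sentences, patterns)
print(sorted(patterns), sorted(all_pairs))
```

In this toy run the seed finds one pair, that pair exposes "secondary to" as a second pattern, and the second pattern recovers a pair the seed alone would have missed. Repeating the two steps until no new patterns appear gives the iterative version.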
Affiliation(s)
- Rong Xu
- Medical Informatics Division, Case Western Reserve University, Cleveland, OH, USA
| | - Li Li
- Departments of Family Medicine and Community Health, Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
|
50
|
Das S, McCaffrey PG, Talkington MWT, Andrews NA, Corlosquet S, Ivinson AJ, Clark T. Pain Research Forum: application of scientific social media frameworks in neuroscience. Front Neuroinform 2014; 8:21. [PMID: 24653693 PMCID: PMC3949323 DOI: 10.3389/fninf.2014.00021] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 12/24/2013] [Accepted: 02/19/2014] [Indexed: 01/21/2023] Open
Abstract
BACKGROUND: Social media has the potential to accelerate the pace of biomedical research through online collaboration, discussions, and faster sharing of information. Focused web-based scientific social collaboratories such as the Alzheimer Research Forum have been successful in engaging scientists in open discussions of the latest research and identifying gaps in knowledge. However, until recently, tools to rapidly create such communities and provide high-bandwidth information exchange between collaboratories in related fields did not exist. METHODS: We have addressed this need by constructing a reusable framework to build online biomedical communities, based on Drupal, an open-source content management system. The framework incorporates elements of Semantic Web technology combined with social media. Here we present, as an exemplar of a web community built on our framework, the Pain Research Forum (PRF) (http://painresearchforum.org). PRF is a community of chronic pain researchers, established with the goal of fostering collaboration and communication among pain researchers. RESULTS: Launched in 2011, PRF has over 1300 registered members with permission to submit content. It currently hosts over 150 topical news articles on research; more than 30 active or archived forum discussions and journal club features; a webinar series; an editor-curated weekly updated listing of relevant papers; and several other resources for the pain research community. All content is licensed for reuse under a Creative Commons license; the software is freely available. The framework was reused to develop other sites, notably the Multiple Sclerosis Discovery Forum (http://msdiscovery.org) and StemBook (http://stembook.org). DISCUSSION: Web-based collaboratories are a crucial integrative tool supporting rapid information transmission and translation in several important research areas. In this article, we discuss the success factors, lessons learned, and ongoing challenges in using PRF as a driving force to develop tools for online collaboration in neuroscience. We also indicate ways these tools can be applied to other areas and uses.
Affiliation(s)
- Sudeshna Das
- MassGeneral Institute for Neurodegenerative Disease, Massachusetts General Hospital, Cambridge, MA, USA; Department of Neurology, Harvard Medical School, Boston, MA, USA
| | - Neil A Andrews
- Harvard NeuroDiscovery Center, Harvard Medical School, Boston, MA, USA
| | - Stéphane Corlosquet
- MassGeneral Institute for Neurodegenerative Disease, Massachusetts General Hospital, Cambridge, MA, USA
| | - Adrian J Ivinson
- Harvard NeuroDiscovery Center, Harvard Medical School, Boston, MA, USA
| | - Tim Clark
- MassGeneral Institute for Neurodegenerative Disease, Massachusetts General Hospital, Cambridge, MA, USA; Department of Neurology, Harvard Medical School, Boston, MA, USA; School of Computer Science, University of Manchester, Manchester, UK
|