1
Vithanage D, Yu P, Wang L, Deng C. Contextual Word Embedding for Biomedical Knowledge Extraction: a Rapid Review and Case Study. J Healthc Inform Res 2024; 8:158-179. [PMID: 38273979 PMCID: PMC10805696 DOI: 10.1007/s41666-023-00157-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2022] [Revised: 11/27/2023] [Accepted: 12/09/2023] [Indexed: 01/27/2024]
Abstract
Recent advancements in natural language processing (NLP), particularly contextual word embedding models, have improved knowledge extraction from biomedical and healthcare texts. However, few comprehensive studies compare these models. This study conducts a scoping review and compares the performance of the major contextual word embedding models for biomedical knowledge extraction. From 26 articles identified from Scopus, PubMed, PubMed Central, and Google Scholar between 2017 and 2021, 18 notable contextual word embedding models were identified. These include ELMo, BERT, BioBERT, BlueBERT, CancerBERT, DDS-BERT, RuBERT, LABSE, EhrBERT, MedBERT, Clinical BERT, Clinical BioBERT, Discharge Summary BERT, Discharge Summary BioBERT, GPT, GPT-2, GPT-3, and GPT2-Bio-Pt. A case study compared the performance of six representative models (ELMo, BERT, BioBERT, BlueBERT, Clinical BioBERT, and GPT-3) across text classification, named entity recognition, and question answering. The evaluation used datasets comprising biomedical text from tweets, NCBI, and PubMed, together with clinical notes sourced from two electronic health record datasets. Performance was measured with accuracy and F1 score. The results of this case study reveal that BioBERT performs best in analyzing biomedical text, while Clinical BioBERT excels in analyzing clinical notes. These findings offer crucial insights into word embedding models for researchers, practitioners, and stakeholders utilizing NLP in biomedical and clinical document analysis. Supplementary Information: The online version contains supplementary material available at 10.1007/s41666-023-00157-y.
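The abstract above reports accuracy and F1 score as its evaluation metrics. As a minimal sketch of how these two metrics are usually computed for a single positive label, the following may help; the function names and the toy labels are illustrative assumptions and are not taken from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive):
    """Binary F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, for illustration only.
gold = ["disease", "other", "disease", "disease", "other"]
pred = ["disease", "disease", "other", "disease", "other"]
acc = accuracy(gold, pred)
f1 = f1_score(gold, pred, positive="disease")
```

Accuracy counts every label equally, whereas F1 focuses on one class, which is why the two are often reported together for imbalanced biomedical datasets.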
Affiliation(s)
- Dinithi Vithanage
  - School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2522 Australia
- Ping Yu
  - School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2522 Australia
- Lei Wang
  - School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2522 Australia
- Chao Deng
  - School of Medical, Indigenous and Health Sciences, University of Wollongong, Wollongong, NSW 2522 Australia
2
Sarabi S, Han Q, de Vries B, Romme AGL, Almassy D. The Nature-Based Solutions Case-Based System: A hybrid expert system. J Environ Manage 2022; 324:116413. [PMID: 36352717 DOI: 10.1016/j.jenvman.2022.116413] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/03/2022] [Revised: 08/17/2022] [Accepted: 09/28/2022] [Indexed: 06/16/2023]
Abstract
Deriving knowledge and learning from past experiences is essential for the successful adoption of Nature-Based Solutions (NBS) as novel integrative solutions that involve many uncertainties. Past experiences in implementing NBS have been collected in a number of repositories; however, it is a major challenge to derive knowledge from the huge amount of information provided by these repositories. This calls for information systems that can facilitate the knowledge extraction process. This paper introduces the NBS Case-Based System (NBS-CBS), an expert system that uses a hybrid architecture to derive information and recommendations from an NBS experience repository. The NBS-CBS combines a 'black-box' artificial neural networks model with a 'white-box' case-based reasoning model to deliver an intelligent, adaptive, and explainable system. Experts have tested this system to assess its functionality and accuracy. Accordingly, the NBS-CBS appears to provide inspirational recommendations and information for the NBS planning and design process.
Affiliation(s)
- Shahryar Sarabi
  - Information Systems in the Built Environment (ISBE) Group, Department of Built Environment, Eindhoven University of Technology, Groene Loper 3, 5612 AE Eindhoven, Netherlands
- Qi Han
  - Information Systems in the Built Environment (ISBE) Group, Department of Built Environment, Eindhoven University of Technology, Groene Loper 3, 5612 AE Eindhoven, Netherlands
- Bauke de Vries
  - Information Systems in the Built Environment (ISBE) Group, Department of Built Environment, Eindhoven University of Technology, Groene Loper 3, 5612 AE Eindhoven, Netherlands
- A Georges L Romme
  - Department of Industrial Engineering & Innovation Sciences, Eindhoven University of Technology, Groene Loper 3, 5612 AE Eindhoven, Netherlands
- Dora Almassy
  - Department of Environmental Sciences and Policy, Central European University, Austria
3
Alakent B, Kaya-Özkiper K, Soyer-Uzun S. Global interpretation and generalizability of boosted regression models for the prediction of methylene blue adsorption by different clay minerals and alkali activated materials. Chemosphere 2022; 308:136248. [PMID: 36057344 DOI: 10.1016/j.chemosphere.2022.136248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Revised: 08/15/2022] [Accepted: 08/25/2022] [Indexed: 06/15/2023]
Abstract
In this study, Gradient Boosted Regression Trees are applied, for the first time, to predict governing factors for methylene blue (MB) adsorption on a variety of adsorbents involving clay minerals, such as kaolinite and sepiolite, together with the industrial wastes red mud and fly ash, and alkali activated materials synthesized from the aforementioned raw materials. The dataset was constructed using electronic databases, such as ScienceDirect, Scopus, Elsevier, and Google, and covered experimental studies published between 2005 and 2022. The final dataset included experimental conditions, such as adsorbent type, adsorbent properties (surface characteristics, density, and chemical modifications), pH of the medium, adsorbent dosage, and temperature; it involved 914 datapoints, which were extracted from 75 papers (out of ∼1360 initially screened). Among distinct parameters, initial adsorbate concentration was found to be the most dominant factor affecting the MB uptake. Concordantly, pH of the solution medium, raw material selection, and modification types were also found to be significant in MB adsorption. Results showed that, in terms of raw material and modification types, sepiolite and chemical (acid and/or alkaline modification) and thermal treatments, respectively, come forward as the most powerful candidates for enhanced MB adsorption performance. Modifications applied to adsorbents should be evaluated separately, as there is no general rule applicable to all experimental conditions, and the strength of the contribution of modification type also depends on initial adsorbate concentration. Implementation of various imputation methods showed the importance of reporting experimental factors, such as surface area, in the literature. The range of applicability of the suggested modeling procedure was assessed to help experimenters in testing MB uptake under novel experimental conditions.
Affiliation(s)
- Burak Alakent
  - Department of Chemical Engineering, Bogazici University, Bebek, 34342 Istanbul, Turkey
- Kardelen Kaya-Özkiper
  - Department of Chemical Engineering, Bogazici University, Bebek, 34342 Istanbul, Turkey
- Sezen Soyer-Uzun
  - Department of Chemical Engineering, Bogazici University, Bebek, 34342 Istanbul, Turkey
4
Andreadis S, Antzoulatos G, Mavropoulos T, Giannakeris P, Tzionis G, Pantelidis N, Ioannidis K, Karakostas A, Gialampoukidis I, Vrochidis S, Kompatsiaris I. A social media analytics platform visualising the spread of COVID-19 in Italy via exploitation of automatically geotagged tweets. Online Soc Netw Media 2021; 23:100134. [PMID: 36570037 PMCID: PMC9767437 DOI: 10.1016/j.osnem.2021.100134] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/31/2020] [Revised: 03/31/2021] [Accepted: 04/03/2021] [Indexed: 12/27/2022]
Abstract
Social media play an important role in the daily life of people around the globe, and users have emerged as an active part of news distribution as well as production. The threatening pandemic of COVID-19 has been the lead subject in online discussions and posts, resulting in large amounts of related social media data, which can be utilised to reinforce crisis management in several ways. To this end, we propose a novel framework to collect, analyse, and visualise Twitter posts, which has been tailored specifically to monitor the virus spread in severely affected Italy. We present and evaluate a deep learning localisation technique that geotags posts based on the locations mentioned in their text, a face detection algorithm to estimate the number of people appearing in posted images, and a community detection approach to identify communities of Twitter users. Moreover, we propose further analysis of the collected posts to predict their reliability and to detect trending topics and events. Finally, we demonstrate an online platform that comprises an interactive map to display and filter analysed posts, utilising the outcome of the localisation technique, and a visual analytics dashboard that visualises the results of the topic, community, and event detection methodologies.
5
Abstract
Background Diabetes has become one of the hot topics in life science research. To support the analytical procedures, researchers and analysts expend considerable labor collecting experimental data, a process that is also error-prone. To reduce the cost and to ensure data quality, there is a growing trend of extracting clinical events in the form of knowledge from electronic medical records (EMRs). To do so, we first need a high-coverage knowledge base (KB) of a specific disease to support these extraction tasks, called KB-based Extraction. Methods We propose an approach to build a diabetes-centric knowledge base (a.k.a. DKB) via mining the Web. In particular, we first extract knowledge from semi-structured contents of vertical portals, fuse individual knowledge from each site, and further map them to a unified KB. The target DKB is then extracted from the overall KB based on a distance-based Expectation-Maximization (EM) algorithm. Results During the experiments, we selected eight popular vertical portals in China as data sources to construct DKB. There are 7703 instances and 96,041 edges in the final diabetes KB covering diseases, symptoms, western medicines, traditional Chinese medicines, examinations, departments, and body structures. The accuracy of DKB is 95.91%. Besides the quality assessment of extracted knowledge from vertical portals, we also carried out detailed experiments for evaluating the knowledge fusion performance as well as the convergence of the distance-based EM algorithm, with positive results. Conclusions In this paper, we introduced an approach to constructing DKB. A knowledge extraction and fusion pipeline was first used to extract semi-structured data from vertical portals, and individual KBs were further fused into a unified knowledge base. After that, we developed a distance-based Expectation-Maximization algorithm to extract a subset from the overall knowledge base, forming the target DKB. Experiments showed that the data in DKB are rich and of high quality.
Affiliation(s)
- Fan Gong
  - Shanghai Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, Pu'an Road, Shanghai, China
- Yilei Chen
  - Shanghai Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, Pu'an Road, Shanghai, China
- Haofen Wang
  - Shanghai Leyan Technologies Co. Ltd, No. 1028 Panyu Road, Shanghai, China
- Hao Lu
  - Shanghai Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, Pu'an Road, Shanghai, China
6
Abstract
BACKGROUND The right dataset is essential to obtain the right insights in data science; therefore, it is important for data scientists to have a good understanding of the availability of relevant datasets as well as the content, structure, and existing analyses of these datasets. While a number of efforts are underway to integrate the large amount and variety of datasets, the lack of an information resource that focuses on the specific needs of target users of datasets has been a problem for years. To address this gap, we have developed a Dataset Information Resource (DIR), using a user-oriented approach, which gathers relevant dataset knowledge for specific user types. In the present version, we specifically address the challenges of entry-level data scientists in learning to identify, understand, and analyze major datasets in healthcare. We emphasize that the DIR does not contain actual data from the datasets but aims to provide comprehensive knowledge about the datasets and their analyses. METHODS The DIR leverages Semantic Web technologies and the W3C Dataset Description Profile as the standard for knowledge integration and representation. To extract tailored knowledge for target users, we have developed methods for manual extraction from dataset documentation as well as semi-automatic extraction from related publications, using natural language processing (NLP)-based approaches. A semantic query component is available for knowledge retrieval, and a parameterized question-answering functionality is provided to make searching easier. RESULTS The DIR prototype is composed of four major components: dataset metadata and related knowledge, search modules, question answering for frequently asked questions, and blogs. The current implementation includes information on 12 commonly used large and complex healthcare datasets. The initial usage evaluation based on health informatics novices indicates that the DIR is helpful and beginner-friendly. CONCLUSIONS We have developed a novel user-oriented DIR that provides dataset knowledge specialized for target user groups. Knowledge about datasets is effectively represented in the Semantic Web. At this initial stage, the DIR has already been able to provide sophisticated and relevant knowledge of 12 datasets to help entry-level health informaticians learn healthcare data analysis using suitable datasets. Further development at both the content and function levels is underway.
Affiliation(s)
- Jingyi Shi
  - Department of Software and Information Systems, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, 28223 NC USA
- Mingna Zheng
  - Department of Software and Information Systems, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, 28223 NC USA
- Lixia Yao
  - Department of Health Sciences Research, Mayo Clinic, 200 First Street SW, Rochester, 55905 MN USA
- Yaorong Ge
  - Department of Software and Information Systems, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, 28223 NC USA
7
Weitschek E, Lauro SD, Cappelli E, Bertolazzi P, Felici G. CamurWeb: a classification software and a large knowledge base for gene expression data of cancer. BMC Bioinformatics 2018; 19:354. [PMID: 30367574 PMCID: PMC6191971 DOI: 10.1186/s12859-018-2299-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 12/12/2022] [Open access]
Abstract
BACKGROUND The rapid growth of Next Generation Sequencing data currently demands new knowledge extraction methods. In particular, the RNA sequencing gene expression experimental technique stands out for case-control studies on cancer, which can be addressed with supervised machine learning techniques able to extract human-interpretable models composed of genes and their relation to the investigated disease. State-of-the-art rule-based classifiers are designed to extract a single classification model, possibly composed of few relevant genes. Conversely, we aim to create a large knowledge base composed of many rule-based models, and thus determine which genes could potentially be involved in the analyzed tumor. This comprehensive and open access knowledge base is required to disseminate novel insights about cancer. RESULTS We propose CamurWeb, a new method and web-based software that is able to extract multiple and equivalent classification models in the form of logic formulas ("if then" rules) and to create a knowledge base of these rules that can be queried and analyzed. The method is based on an iterative classification procedure and an adaptive feature elimination technique that enables the computation of many rule-based models related to the cancer under study. Additionally, CamurWeb includes a user-friendly interface for running the software, querying the results, and managing the performed experiments. The user can create her profile, upload her gene expression data, run the classification analyses, and interpret the results with predefined queries. In order to validate the software, we apply it to all publicly available RNA sequencing datasets from The Cancer Genome Atlas database, obtaining a large open access knowledge base about cancer. CamurWeb is available at http://bioinformatics.iasi.cnr.it/camurweb. CONCLUSIONS The experiments prove the validity of CamurWeb, obtaining many classification models and thus several genes that are associated with 21 different cancer types. Finally, the comprehensive knowledge base about cancer and the software tool are released online; interested researchers have free access to them for further studies and to design biological experiments in cancer research.
Affiliation(s)
- Emanuel Weitschek
  - Department of Engineering, Uninettuno International University, Corso Vittorio Emanuele II 39, Rome, 00186, Italy
  - Institute of Systems Analysis and Computer Science "A. Ruberti", National Research Council, Via dei Taurini 19, Rome, 00185, Italy
- Silvia Di Lauro
  - Institute of Systems Analysis and Computer Science "A. Ruberti", National Research Council, Via dei Taurini 19, Rome, 00185, Italy
- Eleonora Cappelli
  - Department of Engineering, Roma Tre University, Via della Vasca Navale 79, Rome, 00146, Italy
- Paola Bertolazzi
  - Institute of Systems Analysis and Computer Science "A. Ruberti", National Research Council, Via dei Taurini 19, Rome, 00185, Italy
  - SYSBIO.IT Center for Systems Biology, Milano Bicocca University, Piazza della Scienza 2, Milan, 20126, Italy
- Giovanni Felici
  - Institute of Systems Analysis and Computer Science "A. Ruberti", National Research Council, Via dei Taurini 19, Rome, 00185, Italy
8
Michelini S, Balakrishnan B, Parolo S, Matone A, Mullaney JA, Young W, Gasser O, Wall C, Priami C, Lombardo R, Kussmann M. A reverse metabolic approach to weaning: in silico identification of immune-beneficial infant gut bacteria, mining their metabolism for prebiotic feeds and sourcing these feeds in the natural product space. Microbiome 2018; 6:171. [PMID: 30241567 PMCID: PMC6151060 DOI: 10.1186/s40168-018-0545-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 04/11/2018] [Accepted: 08/30/2018] [Indexed: 05/13/2023]
Abstract
BACKGROUND Weaning is a period of marked physiological change. The introduction of solid foods and the changes in milk consumption are accompanied by significant gastrointestinal, immune, developmental, and microbial adaptations. Defining a reduced number of infections as the desired health benefit for infants around weaning, we identified in silico (i.e., by advanced public domain mining) infant gut microbes as potential deliverers of this benefit. We then investigated the requirements of these bacteria for exogenous metabolites as potential prebiotic feeds that were subsequently searched for in the natural product space. RESULTS Using public domain literature mining and an in silico reverse metabolic approach, we constructed probiotic-prebiotic-food associations, which can guide targeted feeding of immune health-beneficial microbes by weaning food; analyzed competition and synergy for (prebiotic) nutrients between selected microbes; and translated this information into designing an experimental complementary feed for infants enrolled in a pilot clinical trial ( http://www.nourishtoflourish.auckland.ac.nz/ ). CONCLUSIONS In this study, we applied a benefit-oriented microbiome research strategy for enhanced early-life immune health. We extended from "classical" to molecular nutrition aiming to identify nutrients, bacteria, and mechanisms that point towards targeted feeding to improve immune health in infants around weaning. Here, we present the systems biology-based approach we used to inform us on the most promising prebiotic combinations known to support growth of beneficial gut bacteria ("probiotics") in the infant gut, thereby favorably promoting development of the immune system.
Affiliation(s)
- Samanta Michelini
  - The Microsoft Research–University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
- Biju Balakrishnan
  - The Liggins Institute, the University of Auckland, Auckland, New Zealand
- Silvia Parolo
  - The Microsoft Research–University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
- Alice Matone
  - The Microsoft Research–University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
- Jane A. Mullaney
  - AgResearch, Food & Bio-based Products, Palmerston North, New Zealand
  - Riddet Institute, Palmerston North, New Zealand
- Wayne Young
  - AgResearch, Food & Bio-based Products, Palmerston North, New Zealand
  - Riddet Institute, Palmerston North, New Zealand
- Olivier Gasser
  - Malaghan Institute of Medical Research, Wellington, New Zealand
- Clare Wall
  - Discipline of Nutrition, School of Medical Science, University of Auckland, Auckland, New Zealand
- Corrado Priami
  - The Microsoft Research–University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
  - Department of Computer Science, University of Pisa, Pisa, Italy
- Rosario Lombardo
  - The Microsoft Research–University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
- Martin Kussmann
  - The Liggins Institute, the University of Auckland, Auckland, New Zealand
  - National Science Challenge “High Value Nutrition”, Auckland, New Zealand
9
Lazzarini N, Bacardit J. RGIFE: a ranked guided iterative feature elimination heuristic for the identification of biomarkers. BMC Bioinformatics 2017; 18:322. [PMID: 28666416 PMCID: PMC5493069 DOI: 10.1186/s12859-017-1729-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 12/06/2016] [Accepted: 06/13/2017] [Indexed: 12/13/2022] [Open access]
Abstract
BACKGROUND Current -omics technologies are able to sense the state of a biological sample in a very wide variety of ways. Given the high dimensionality that typically characterises these data, relevant knowledge is often hidden and hard to identify. Machine learning methods, and particularly feature selection algorithms, have proven very effective over the years at identifying small but relevant subsets of variables from a variety of application domains, including -omics data. Many methods exist, with varying trade-offs between the size of the identified variable subsets and the predictive power of such subsets. In this paper we focus on a heuristic for the identification of biomarkers called RGIFE: Rank Guided Iterative Feature Elimination. RGIFE is guided in its biomarker identification process by the information extracted from machine learning models and incorporates several mechanisms to ensure that it creates minimal and highly predictive feature sets. RESULTS We compare RGIFE against five well-known feature selection algorithms using both synthetic and real (cancer-related transcriptomics) datasets. First, we assess the ability of the methods to identify relevant and highly predictive features. Then, using a prostate cancer dataset as a case study, we look at the biological relevance of the identified biomarkers. CONCLUSIONS We propose RGIFE, a heuristic for the inference of reduced panels of biomarkers that obtains similar predictive performance to widely adopted feature selection methods while selecting significantly fewer features. Furthermore, focusing on the case study, we show the higher biological relevance of the biomarkers selected by our approach. The RGIFE source code is available at: http://ico2s.org/software/rgife.html.
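The abstract above describes an elimination loop that is guided by a feature ranking taken from a learned model. The sketch below illustrates that general idea only and is not the RGIFE algorithm: the nearest-centroid classifier, the ranking by spread of class means, the resubstitution accuracy, and all function names and data are assumptions chosen to keep the example self-contained.

```python
def centroids(rows, labels, features):
    """Per-class mean of each selected feature."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, f in enumerate(features):
            acc[i] += row[f]
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def evaluate(rows, labels, features):
    """Resubstitution accuracy of a nearest-centroid classifier (toy estimate)."""
    cents = centroids(rows, labels, features)
    correct = 0
    for row, label in zip(rows, labels):
        point = [row[f] for f in features]
        predicted = min(
            cents,
            key=lambda c: sum((p - m) ** 2 for p, m in zip(point, cents[c])),
        )
        correct += predicted == label
    return correct / len(rows)


def rank_features(rows, labels, features):
    """Order features from least to most important (spread of class means)."""
    cents = centroids(rows, labels, features)

    def spread(i):
        values = [means[i] for means in cents.values()]
        return max(values) - min(values)

    return [features[i] for i in sorted(range(len(features)), key=spread)]


def iterative_elimination(rows, labels):
    """Drop the lowest-ranked feature whose removal does not hurt accuracy."""
    features = list(range(len(rows[0])))
    best = evaluate(rows, labels, features)
    while len(features) > 1:
        for candidate in rank_features(rows, labels, features):
            trial = [f for f in features if f != candidate]
            score = evaluate(rows, labels, trial)
            if score >= best:
                features, best = trial, score
                break
        else:
            break  # no feature could be removed without losing accuracy
    return features, best


# Hypothetical data: only feature 0 separates the two classes.
rows = [[0.1, 5.0, 1.0], [0.2, 1.0, 2.0], [0.0, 3.0, 3.0],
        [0.9, 4.0, 1.0], [1.0, 2.0, 3.0], [0.8, 3.0, 2.0]]
labels = [0, 0, 0, 1, 1, 1]
selected, score = iterative_elimination(rows, labels)
```

A realistic setting would replace the resubstitution accuracy with cross-validation, since scoring on the training rows tends to overstate how well a reduced feature set generalises.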
Affiliation(s)
- Nicola Lazzarini
  - ICOS research group, School of Computing Science, Newcastle-upon-Tyne, UK
- Jaume Bacardit
  - ICOS research group, School of Computing Science, Newcastle-upon-Tyne, UK
10
Cumbo F, Fiscon G, Ceri S, Masseroli M, Weitschek E. TCGA2BED: extracting, extending, integrating, and querying The Cancer Genome Atlas. BMC Bioinformatics 2017; 18:6. [PMID: 28049410 DOI: 10.1186/s12859-016-1419-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 08/03/2016] [Accepted: 12/10/2016] [Indexed: 01/05/2023] [Open access]
Abstract
Background Data extraction and integration methods are becoming essential to effectively access and take advantage of the huge amounts of heterogeneous genomics and clinical data increasingly available. In this work, we focus on The Cancer Genome Atlas, a comprehensive archive of tumoral data containing the results of high-throughput experiments, mainly Next Generation Sequencing, for more than 30 cancer types. Results We propose TCGA2BED, a software tool to search and retrieve TCGA data and convert them into the structured BED format for their seamless use and integration. Additionally, it supports conversion into the CSV, GTF, JSON, and XML standard formats. Furthermore, TCGA2BED extends TCGA data with information extracted from other genomic databases (i.e., NCBI Entrez Gene, HGNC, UCSC, and miRBase). We also provide and maintain an automatically updated data repository with publicly available Copy Number Variation, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq (V1, V2) experimental data of TCGA converted into the BED format, and their associated clinical and biospecimen metadata in attribute-value text format. Conclusions The availability of the valuable TCGA data in BED format reduces the time spent in taking advantage of them: it is possible to deal efficiently and effectively with huge amounts of cancer genomic data in an integrated way, and to search, retrieve, and extend them with additional information. The BED format helps investigators by allowing several knowledge discovery analyses on all tumor types in TCGA, with the final aim of understanding pathological mechanisms and aiding cancer treatments. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1419-5) contains supplementary material, which is available to authorized users.
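For readers unfamiliar with the BED layout mentioned above, the following minimal sketch formats one genomic interval as a six-column BED line. It is not the TCGA2BED tool: the function name is hypothetical, and the assumption that the input uses 1-based inclusive coordinates is made only for illustration. BED itself is tab-delimited with a zero-based start and an exclusive end.

```python
def to_bed_line(chrom, start_1based, end_1based, name=".", score=0, strand="."):
    """Format one interval as a six-column BED line.

    Assumes (for illustration) that the input coordinates are 1-based and
    inclusive; BED uses a 0-based start and an exclusive end.
    """
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    fields = [chrom, str(start_1based - 1), str(end_1based), name, str(score), strand]
    return "\t".join(fields)


# Hypothetical region, not taken from any real dataset.
line = to_bed_line("chr1", 1001, 2000, name="region1")
```

Only the start coordinate shifts in this conversion, because a 1-based inclusive end and a 0-based exclusive end are the same number.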
11
Abstract
BACKGROUND Semantic relatedness is a measure that quantifies the strength of a semantic link between two concepts. Often, it can be efficiently approximated with methods that operate on the words that represent these concepts. Approximating semantic relatedness between texts and the concepts represented by these texts is an important part of many text and knowledge processing tasks of crucial importance in the ever-growing domain of biomedical informatics. The problem with most state-of-the-art methods for calculating semantic relatedness is their dependence on highly specialized, structured knowledge resources, which makes these methods poorly adaptable for many usage scenarios. On the other hand, domain knowledge in the Life Sciences has become more and more accessible, but mostly in its unstructured form, as texts in large document collections, which makes its use more challenging for automated processing. In this paper we present tESA, an extension to the well-known Explicit Semantic Analysis (ESA) method. RESULTS In our extension we use two separate sets of vectors, corresponding to different sections of the articles from the underlying corpus of documents, as opposed to the original method, which only uses a single vector space. We present an evaluation of the Life Sciences domain-focused applicability of both tESA and domain-adapted Explicit Semantic Analysis. The methods are tested against a set of standard benchmarks established for the evaluation of biomedical semantic relatedness quality. Our experiments show that the proposed method achieves results comparable with or superior to the current state-of-the-art methods. Additionally, a comparative discussion of the results obtained with tESA and ESA is presented, together with a study of the adaptability of the methods to different corpora and their performance with different input parameters. CONCLUSIONS Our findings suggest that combined use of the semantics from different sections of the documents of scientific corpora (i.e. extending the original ESA methodology with the use of title vectors) may be used to enhance the performance of distributional semantic relatedness measures, which can be observed in the largest reference datasets. We also present the impact of the proposed extension on the size of distributional representations.
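To make the vector-space idea above concrete, here is a minimal ESA-style sketch: each corpus document acts as a "concept", a text is mapped to a vector of TF-IDF weights over those concepts, and relatedness is the cosine between two such vectors. The abstract does not say how tESA combines its title and body vectors, so the sketch simply builds one index per section and scores each space separately; the function names and the tiny corpus are illustrative assumptions.

```python
import math
from collections import Counter


def build_index(documents):
    """Map each word to {doc_id: tf-idf weight} over the given documents."""
    n = len(documents)
    term_counts = [Counter(doc.lower().split()) for doc in documents]
    doc_freq = Counter(word for counts in term_counts for word in counts)
    index = {}
    for doc_id, counts in enumerate(term_counts):
        for word, tf in counts.items():
            idf = math.log(n / doc_freq[word])
            index.setdefault(word, {})[doc_id] = tf * idf
    return index


def concept_vector(text, index):
    """Sum the per-word concept weights of a text into one sparse vector."""
    vector = Counter()
    for word in text.lower().split():
        for doc_id, weight in index.get(word, {}).items():
            vector[doc_id] += weight
    return vector


def relatedness(text_a, text_b, index):
    """Cosine similarity between the concept vectors of two texts."""
    u, v = concept_vector(text_a, index), concept_vector(text_b, index)
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical three-article corpus, split into title and body sections.
titles = ["insulin and glucose", "diabetes overview", "tumor biology"]
bodies = [
    "insulin regulates blood glucose",
    "glucose metabolism in diabetes",
    "tumor growth and metastasis",
]
body_index = build_index(bodies)
title_index = build_index(titles)
body_score = relatedness("insulin", "glucose", body_index)
title_score = relatedness("insulin", "glucose", title_index)
unrelated = relatedness("insulin", "tumor", body_index)
```

In this toy corpus the two sections give different scores for the same word pair, which is the kind of complementary signal that motivates keeping separate vector sets.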
Affiliation(s)
- Maciej Rybinski
  - Departamento LCC, University of Malaga, Campus Teatinos, Malaga, 29010, Spain
12
Wang L, Bray BE, Shi J, Del Fiol G, Haug PJ. A method for the development of disease-specific reference standards vocabularies from textual biomedical literature resources. Artif Intell Med 2016; 68:47-57. [PMID: 26971304 DOI: 10.1016/j.artmed.2016.02.003] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 06/05/2015] [Revised: 02/22/2016] [Accepted: 02/25/2016] [Indexed: 10/22/2022]
Abstract
OBJECTIVE Disease-specific vocabularies are fundamental to many knowledge-based intelligent systems and applications like text annotation, cohort selection, disease diagnostic modeling, and therapy recommendation. Reference standards are critical in the development and validation of automated methods for disease-specific vocabularies. The goal of the present study is to design and test a generalizable method for the development of vocabulary reference standards from expert-curated, disease-specific biomedical literature resources. METHODS We formed disease-specific corpora from literature resources like textbooks, evidence-based synthesized online sources, clinical practice guidelines, and journal articles. Medical experts annotated and adjudicated disease-specific terms in four classes (i.e., causes or risk factors, signs or symptoms, diagnostic tests or results, and treatment). Annotations were mapped to UMLS concepts. We assessed source variation, the contribution of each source to build disease-specific vocabularies, the saturation of the vocabularies with respect to the number of used sources, and the generalizability of the method with different diseases. RESULTS The study resulted in 2588 string-unique annotations for heart failure in four classes, and 193 and 425 respectively for pulmonary embolism and rheumatoid arthritis in treatment class. Approximately 80% of the annotations were mapped to UMLS concepts. The agreement among heart failure sources ranged between 0.28 and 0.46. The contribution of these sources to the final vocabulary ranged between 18% and 49%. With the sources explored, the heart failure vocabulary reached near saturation in all four classes with the inclusion of minimal six sources (or between four to seven sources if only counting terms occurred in two or more sources). It took fewer sources to reach near saturation for the other two diseases in terms of the treatment class. 
CONCLUSIONS We developed a method for the development of disease-specific reference vocabularies. Expert-curated biomedical literature resources are substantial for acquiring disease-specific medical knowledge. It is feasible to reach near saturation in a disease-specific vocabulary using a relatively small number of literature sources.
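The saturation analysis described in the abstract can be illustrated with a minimal sketch. The abstract does not state how near saturation was defined, so the 95% threshold, the function names, and the toy term sets below are assumptions made for illustration only, not the authors' procedure. Note that the result depends on the order in which sources are added.

```python
from typing import List, Set


def cumulative_vocabulary_sizes(sources: List[Set[str]]) -> List[int]:
    """Return the number of unique terms seen after adding each source in order."""
    seen: Set[str] = set()
    sizes: List[int] = []
    for terms in sources:
        seen |= terms
        sizes.append(len(seen))
    return sizes


def sources_to_saturation(sources: List[Set[str]], threshold: float = 0.95) -> int:
    """Smallest number of sources whose union reaches `threshold` of the final vocabulary.

    The threshold is a hypothetical definition of "near saturation".
    """
    sizes = cumulative_vocabulary_sizes(sources)
    if not sizes or sizes[-1] == 0:
        return 0
    target = threshold * sizes[-1]
    for count, size in enumerate(sizes, start=1):
        if size >= target:
            return count
    return len(sizes)


# Toy example with made-up term sets standing in for annotated sources.
toy_sources = [
    {"dyspnea", "edema"},
    {"edema", "fatigue"},
    {"fatigue"},
    {"dyspnea"},
]
print(cumulative_vocabulary_sizes(toy_sources))
print(sources_to_saturation(toy_sources))
```

In this toy case the vocabulary stops growing after the second source, which is the kind of plateau the abstract refers to as near saturation.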
Affiliation(s)
- Liqin Wang
  - Department of Biomedical Informatics, University of Utah, 421 Wakara Way, Salt Lake City, UT 84108, USA; Homer Warner Research Center, Intermountain Healthcare, 5121 South Cottonwood Street, Murray, UT 84107, USA
- Bruce E Bray
  - Department of Biomedical Informatics, University of Utah, 421 Wakara Way, Salt Lake City, UT 84108, USA; Department of Internal Medicine, University of Utah, 30 North 1900 East, Salt Lake City, UT 84132, USA
- Jianlin Shi
  - Department of Biomedical Informatics, University of Utah, 421 Wakara Way, Salt Lake City, UT 84108, USA
- Guilherme Del Fiol
  - Department of Biomedical Informatics, University of Utah, 421 Wakara Way, Salt Lake City, UT 84108, USA
- Peter J Haug
  - Department of Biomedical Informatics, University of Utah, 421 Wakara Way, Salt Lake City, UT 84108, USA; Homer Warner Research Center, Intermountain Healthcare, 5121 South Cottonwood Street, Murray, UT 84107, USA
|
13
|
Scharl A, Hubmann-Haidvogel A, Jones A, Fischl D, Kamolov R, Weichselbraun A, Rafelsberger W. Analyzing the public discourse on works of fiction - Detection and visualization of emotion in online coverage about HBO's Game of Thrones. Inf Process Manag 2016; 52:129-138. [PMID: 27065510 PMCID: PMC4804387 DOI: 10.1016/j.ipm.2015.02.003] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Indexed: 11/19/2022]
Abstract
- “Westeros Sentinel” – a visual analytics dashboard for Game of Thrones.
- Extraction of affective and factual knowledge from news and social media coverage.
- Emotional categories from semantic knowledge bases.
- Automated annotation services for contextualized information spaces.
- Interactive visualizations to explore context features.
This paper presents a Web intelligence portal that captures and aggregates news and social media coverage about “Game of Thrones”, an American drama television series created for the HBO television network based on George R.R. Martin’s series of fantasy novels. The system collects content from the Web sites of Anglo-American news media as well as from four social media platforms: Twitter, Facebook, Google+ and YouTube. An interactive dashboard with trend charts and synchronized visual analytics components not only shows how often Game of Thrones events and characters are being mentioned by journalists and viewers, but also provides a real-time account of concepts that are being associated with the unfolding storyline and each new episode. Positive or negative sentiment is computed automatically, which sheds light on the perception of actors and new plot elements.
Affiliation(s)
- Arno Scharl
  - Department of New Media Technology, MODUL University Vienna, Am Kahlenberg 1, 1190 Vienna, Austria
  - WebLyzard Technology, Puechlgasse 2/44, 1190 Vienna, Austria
  - Corresponding author at: Department of New Media Technology, MODUL University Vienna, Am Kahlenberg 1, 1190 Vienna, Austria. Tel.: +43 1 320 3555 500; fax: +43 1 320 3555 903.
- Alexander Hubmann-Haidvogel
  - Department of New Media Technology, MODUL University Vienna, Am Kahlenberg 1, 1190 Vienna, Austria
  - WebLyzard Technology, Puechlgasse 2/44, 1190 Vienna, Austria
- Alistair Jones
  - Department of New Media Technology, MODUL University Vienna, Am Kahlenberg 1, 1190 Vienna, Austria
- Daniel Fischl
  - Department of New Media Technology, MODUL University Vienna, Am Kahlenberg 1, 1190 Vienna, Austria
- Ruslan Kamolov
  - Department of New Media Technology, MODUL University Vienna, Am Kahlenberg 1, 1190 Vienna, Austria
- Albert Weichselbraun
  - WebLyzard Technology, Puechlgasse 2/44, 1190 Vienna, Austria
  - University of Applied Sciences HTW Chur, Faculty of Information Sciences, Pulvermuehlestrasse 57, CH-7004, Chur, Switzerland
|
14
|
Abstract
This paper presents a novel method for contextualizing and enriching large semantic knowledge bases for opinion mining with a focus on Web intelligence platforms and other high-throughput big data applications. The method is not only applicable to traditional sentiment lexicons, but also to more comprehensive, multi-dimensional affective resources such as SenticNet. It comprises the following steps: (i) identify ambiguous sentiment terms, (ii) provide context information extracted from a domain-specific training corpus, and (iii) ground this contextual information to structured background knowledge sources such as ConceptNet and WordNet. A quantitative evaluation shows a significant improvement when using an enriched version of SenticNet for polarity classification. Crowdsourced gold standard data in conjunction with a qualitative evaluation sheds light on the strengths and weaknesses of the concept grounding, and on the quality of the enrichment process.
Affiliation(s)
- A. Weichselbraun
  - Faculty of Information Science, University of Applied Sciences Chur, Pulvermühlestrasse 57, CH-7004 Chur, Switzerland
- S. Gindl
  - Department of New Media Technology, MODUL University Vienna, Am Kahlenberg 1, 1190 Vienna, Austria
- A. Scharl
  - Department of New Media Technology, MODUL University Vienna, Am Kahlenberg 1, 1190 Vienna, Austria
  - Corresponding author. Tel.: +43 1 3203555 500.
|
15
|
Sannino G, De Falco I, De Pietro G. Monitoring Obstructive Sleep Apnea by means of a real-time mobile system based on the automatic extraction of sets of rules through Differential Evolution. J Biomed Inform 2014; 49:84-100. [PMID: 24632080 DOI: 10.1016/j.jbi.2014.02.015] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 06/15/2013] [Revised: 02/04/2014] [Accepted: 02/28/2014] [Indexed: 10/25/2022]
Abstract
Real-time Obstructive Sleep Apnea (OSA) episode detection and monitoring are important for society in terms of an improvement in the health of the general population and of a reduction in mortality and healthcare costs. Currently, to diagnose OSA, patients undergo PolySomnoGraphy (PSG), a complicated and invasive test performed in a specialized center and involving many sensors and wires. Accordingly, each patient is required to stay in the same position throughout the duration of one night, thus restricting their movements. This paper proposes an easy, cheap, and portable approach for the monitoring of patients with OSA, which collects single-channel ElectroCardioGram (ECG) data only. It is easy to perform from the patient's point of view because only one wearable sensor is required, so the patient is not restricted to keeping the same position all night long, and the detection and monitoring can be carried out in any place through the use of a mobile device. Our approach is based on the automatic extraction, from a database containing information about the monitored patient, of explicit knowledge in the form of a set of IF…THEN rules containing typical parameters derived from Heart Rate Variability (HRV) analysis. The extraction is carried out off-line by means of a Differential Evolution algorithm. This set of rules can then be exploited in the real-time mobile monitoring system developed at our Laboratory: the ECG data is gathered by a wearable sensor and sent to a mobile device, where it is processed in real time. Subsequently, HRV-related parameters are computed from this data, and, if their values activate some of the rules describing the occurrence of OSA, an alarm is automatically produced. This approach has been tested on a well-known literature database of OSA patients. The numerical results show its effectiveness in terms of accuracy, sensitivity, and specificity, and the achieved sets of rules evidence the user-friendliness of the approach. Furthermore, the method is compared against other well-known classifiers, and its discrimination ability is shown to be higher.
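The monitoring step described above (compute HRV parameters, then raise an alarm if any IF…THEN rule fires) can be sketched roughly as follows. SDNN and RMSSD are standard time-domain HRV measures, but the abstract does not say which parameters the extracted rules use; the rule thresholds, the `RULES` list, and the function names are hypothetical placeholders, not the rules obtained by the authors' Differential Evolution algorithm.

```python
import math
from typing import Callable, Dict, List

HrvParams = Dict[str, float]


def hrv_parameters(rr_ms: List[float]) -> HrvParams:
    """Compute simple time-domain HRV parameters from RR intervals in milliseconds."""
    if len(rr_ms) < 2:
        raise ValueError("at least two RR intervals are required")
    mean_rr = sum(rr_ms) / len(rr_ms)
    # SDNN: sample standard deviation of the RR intervals.
    sdnn = math.sqrt(sum((x - mean_rr) ** 2 for x in rr_ms) / (len(rr_ms) - 1))
    # RMSSD: root mean square of successive RR differences.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return {"mean_rr": mean_rr, "sdnn": sdnn, "rmssd": rmssd}


# Hypothetical IF...THEN rules; the thresholds are illustrative only.
RULES: List[Callable[[HrvParams], bool]] = [
    lambda p: p["sdnn"] > 100.0 and p["mean_rr"] > 900.0,
    lambda p: p["rmssd"] > 80.0,
]


def alarm(rr_ms: List[float], rules: List[Callable[[HrvParams], bool]] = RULES) -> bool:
    """Return True if the HRV parameters of this window activate any rule."""
    params = hrv_parameters(rr_ms)
    return any(rule(params) for rule in rules)


steady_window = [800.0, 810.0, 790.0, 805.0, 795.0]
erratic_window = [700.0, 900.0, 700.0, 900.0, 700.0]
print(alarm(steady_window), alarm(erratic_window))
```

Keeping each rule as an explicit predicate over named parameters mirrors the human-readable rule sets the abstract emphasizes, since each condition can be inspected on its own.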
Affiliation(s)
- Giovanna Sannino
  - Institute of High Performance Computing and Networking (ICAR), National Research Council of Italy (CNR), Via Pietro Castellino 111, Naples, Italy; University of Naples "Parthenope", Department of Technology, Naples, Italy
- Ivanoe De Falco
  - Institute of High Performance Computing and Networking (ICAR), National Research Council of Italy (CNR), Via Pietro Castellino 111, Naples, Italy
- Giuseppe De Pietro
  - Institute of High Performance Computing and Networking (ICAR), National Research Council of Italy (CNR), Via Pietro Castellino 111, Naples, Italy
|