1. Barros M, Moitinho A, Couto FM. SeEn: Sequential enriched datasets for sequence-aware recommendations. Sci Data 2022; 9:478. [PMID: 35927282 PMCID: PMC9352715 DOI: 10.1038/s41597-022-01598-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2022] [Accepted: 07/27/2022] [Indexed: 11/09/2022] Open
Abstract
The recommendation of items based on users' sequential past preferences has evolved in the last few years, mostly due to deep learning approaches such as BERT4Rec. However, in scientific fields, recommender systems for recommending the next best item are not widely used. The main goal of this work is to improve the results for the recommendation of the next best item in scientific domains using sequence-aware datasets and algorithms. In the first part of this work, we present the adaptation of a previous method (LIBRETTI) for creating sequential recommendation datasets for scientific fields. The results were assessed in Astronomy and Chemistry. In the second part of this work, we propose a new approach that improves the datasets, not the algorithms, to obtain better recommendations. The new hybrid approach is called sequential enrichment (SeEn), which consists of adding to a sequence of items the n most similar items after each original item. The results show that the enriched sequences obtained better results than the original ones. The Chemistry dataset improved by approximately seven percentage points and the Astronomy dataset by 16 percentage points for Hit Ratio and Normalized Discounted Cumulative Gain.
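The enrichment step described in this abstract (inserting the n most similar items after each original item) can be illustrated with a minimal Python sketch. The function name `enrich_sequence`, the `most_similar` lookup table, and the example item names are hypothetical; the abstract does not say how similarity is computed or whether repeated items are filtered, so this sketch assumes a precomputed ranking of neighbours per item and keeps duplicates.

```python
from typing import Dict, Hashable, List, Sequence


def enrich_sequence(
    sequence: Sequence[Hashable],
    most_similar: Dict[Hashable, List[Hashable]],
    n: int,
) -> List[Hashable]:
    """Return a new sequence with up to n similar items after each original item.

    `most_similar` is an assumed, precomputed mapping from an item to its
    neighbours, ordered from most to least similar.
    """
    if n < 0:
        raise ValueError("n must be non-negative")

    enriched: List[Hashable] = []
    for item in sequence:
        # Keep the original item in its original position.
        enriched.append(item)
        # Append its n closest neighbours (fewer if not enough are known).
        enriched.extend(most_similar.get(item, [])[:n])
    return enriched


if __name__ == "__main__":
    # Hypothetical items and neighbours, for illustration only.
    neighbours = {"a": ["x", "y", "z"], "b": ["q"]}
    print(enrich_sequence(["a", "b"], neighbours, n=2))
    # ['a', 'x', 'y', 'b', 'q']
```

The original order of the user's items is preserved, so a sequence-aware model trained on the enriched data still sees each original item before its inserted neighbours.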
Affiliation(s)
- Marcia Barros
  - LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
  - CENTRA, Departamento de Física, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
- André Moitinho
  - CENTRA, Departamento de Física, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
- Francisco M Couto
  - LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
2. Conceição SIR, Couto FM. Text Mining for Building Biomedical Networks Using Cancer as a Case Study. Biomolecules 2021; 11:biom11101430. [PMID: 34680062 PMCID: PMC8533101 DOI: 10.3390/biom11101430] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/09/2021] [Revised: 09/24/2021] [Accepted: 09/27/2021] [Indexed: 12/15/2022] Open
Abstract
In the assembly of biological networks it is important to provide reliable interactions in an effort to obtain the most accurate possible representation of real-life systems. Commonly, the data used to build a network comes from diverse high-throughput assays; however, most of the interaction data is available through scientific literature. This has become a challenge with the notable increase in scientific literature being published, as it is hard for human curators to track all recent discoveries without efficient tools that help them identify these interactions automatically. This challenge can be overcome by using text mining approaches, which are capable of extracting knowledge from scientific documents. One of the most important tasks in text mining for biological network building is relation extraction, which identifies relations between the entities of interest. Many interaction databases already use text mining systems, and the development of these tools will lead to more reliable networks, as well as the possibility of personalizing the networks by selecting the desired relations. This review focuses on different approaches to automatic information extraction from biomedical text that can be used to enhance existing networks or create new ones, such as state-of-the-art deep learning approaches, using cancer as a case study.
3. Xie J, Zi W, Li Z, He Y. Ontology-based Precision Vaccinology for Deep Mechanism Understanding and Precision Vaccine Development. Curr Pharm Des 2021; 27:900-910. [PMID: 33238868 DOI: 10.2174/1381612826666201125112131] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/31/2020] [Accepted: 10/08/2020] [Indexed: 11/22/2022]
Abstract
Vaccination is one of the most important innovations in human history. It has also become a hot research area in a new application - the development of new vaccines against non-infectious diseases such as cancers. However, effective and safe vaccines still do not exist for many diseases, and where vaccines exist, their protective immune mechanisms are often unclear. Although licensed vaccines are generally safe, various adverse events, and sometimes severe adverse events, still occur in a small population. Precision medicine tailors medical intervention to the personal characteristics of individual patients or sub-populations of individuals with similar immunity-related characteristics. Precision vaccinology is a new strategy that applies precision medicine to the development, administration, and post-administration analysis of vaccines. Several conditions contribute to making this the right time to embark on the development of precision vaccinology. First, the increased level of research in vaccinology has generated voluminous "big data" repositories of vaccinology data. Second, new technologies such as multi-omics and immunoinformatics bring new methods for investigating vaccines and immunology. Finally, the advent of AI and machine learning software now makes possible the marriage of Big Data to the development of new vaccines in ways not possible before. However, something is missing in this marriage, and that is a common language that facilitates the correlation, analysis, and reporting nomenclature for the field of vaccinology. Solving this bioinformatics problem is the domain of applied biomedical ontology. Ontology in the informatics field is a human- and machine-interpretable representation of entities and the relations among entities in a specific domain. The Vaccine Ontology (VO) and Ontology of Vaccine Adverse Events (OVAE) have been developed to support the standard representation of vaccines, vaccine components, vaccinations, host responses, and vaccine adverse events. Many other biomedical ontologies have also been developed and can be applied in vaccine research. Here, we review the current status of precision vaccinology and how ontological development will enhance this field, and propose an ontology-based precision vaccinology strategy to support precision vaccine research and development.
Affiliation(s)
- Jiangan Xie
  - Chongqing Engineering Research Center of Medical Electronics and Information Technology, School of Bioinformatics, Chongqing University of Posts and Telecommunications, Chongqing, China
- Wenrui Zi
  - Chongqing Engineering Research Center of Medical Electronics and Information Technology, School of Bioinformatics, Chongqing University of Posts and Telecommunications, Chongqing, China
- Zhangyong Li
  - Chongqing Engineering Research Center of Medical Electronics and Information Technology, School of Bioinformatics, Chongqing University of Posts and Telecommunications, Chongqing, China
- Yongqun He
  - Unit of Laboratory Animal Medicine, Development of Microbiology and Immunology, Center of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan, United States
4. Barros M, Moitinho A, Couto FM. Hybrid semantic recommender system for chemical compounds in large-scale datasets. J Cheminform 2021; 13:15. [PMID: 33622374 PMCID: PMC7903631 DOI: 10.1186/s13321-021-00495-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/03/2020] [Accepted: 02/10/2021] [Indexed: 12/16/2022] Open
Abstract
The large and increasing number of chemical compounds poses challenges to the exploration of such datasets. In this work, we propose the use of recommender systems to identify compounds of interest to scientific researchers. Our approach consists of a hybrid recommender model suitable for implicit feedback datasets and focused on retrieving a ranked list according to the relevance of the items. The model integrates collaborative-filtering algorithms for implicit feedback (Alternating Least Squares and Bayesian Personalized Ranking) and a new content-based algorithm that uses the semantic similarity between the chemical compounds in the ChEBI ontology. The algorithms were assessed on an implicit dataset of chemical compounds, CheRM-20, with more than 16,000 items (chemical compounds). The hybrid model improved the results of the collaborative-filtering algorithms by more than ten percentage points in most of the assessed evaluation metrics.
Affiliation(s)
- Marcia Barros
  - LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
  - CENTRA, Departamento de Física, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
- Andre Moitinho
  - CENTRA, Departamento de Física, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
- Francisco M Couto
  - LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
5. Jose JM, Yilmaz E, Magalhães J, Castells P, Ferro N, Silva MJ, Martins F. Hybrid Semantic Recommender System for Chemical Compounds. Lecture Notes in Computer Science 2020. [PMCID: PMC7148023 DOI: 10.1007/978-3-030-45442-5_12] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/26/2022]
Abstract
Recommending Chemical Compounds of interest to a particular researcher is a poorly explored field. The few existing datasets with information about the preferences of the researchers use implicit feedback. The lack of Recommender Systems in this particular field presents a challenge for the development of new recommendation models. In this work, we propose a Hybrid recommender model for recommending Chemical Compounds. The model integrates collaborative-filtering algorithms for implicit feedback (Alternating Least Squares (ALS) and Bayesian Personalized Ranking (BPR)) and semantic similarity between the Chemical Compounds in the ChEBI ontology (ONTO). We evaluated the model on an implicit dataset of Chemical Compounds, CheRM. The Hybrid model was able to improve the results of state-of-the-art collaborative-filtering algorithms, especially for Mean Reciprocal Rank, with an increase of 6.7% when comparing the collaborative-filtering ALS and the Hybrid ALS_ONTO.
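For readers unfamiliar with the metric highlighted in this abstract, the following minimal Python sketch computes Mean Reciprocal Rank using its standard definition: the average, over users, of the reciprocal of the position of the first relevant item in the ranked list, counting 0 when no relevant item is returned. The function name and the toy data are hypothetical, and the abstract does not state the ranking cut-off or other protocol details used in the paper, so this is an illustration of the metric rather than a reproduction of the authors' evaluation.

```python
from typing import Hashable, List, Sequence, Set


def mean_reciprocal_rank(
    ranked_lists: Sequence[Sequence[Hashable]],
    relevant_sets: Sequence[Set[Hashable]],
) -> float:
    """Average of 1/rank of the first relevant item per user (0 if none)."""
    if len(ranked_lists) != len(relevant_sets):
        raise ValueError("one relevant set is required per ranked list")
    if not ranked_lists:
        return 0.0

    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for position, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


if __name__ == "__main__":
    # Two hypothetical users: first hit at rank 2 and at rank 1.
    rankings: List[List[str]] = [["a", "b", "c"], ["x", "y", "z"]]
    relevant: List[Set[str]] = [{"b"}, {"x"}]
    print(mean_reciprocal_rank(rankings, relevant))  # 0.75
```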
6. Couto FM, Krallinger M. Proposal of the First International Workshop on Semantic Indexing and Information Retrieval for Health from Heterogeneous Content Types and Languages (SIIRH). Lecture Notes in Computer Science 2020. [PMCID: PMC7148030 DOI: 10.1007/978-3-030-45442-5_87] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
7. Campos L, Pedro V, Couto F. Impact of translation on named-entity recognition in radiology texts. Database: The Journal of Biological Databases and Curation 2018; 2017:4097790. [PMID: 29220455 PMCID: PMC5737072 DOI: 10.1093/database/bax064] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 03/30/2017] [Accepted: 08/03/2017] [Indexed: 11/17/2022]
Abstract
Radiology reports describe the results of radiography procedures and have the potential of being a useful source of information which can bring benefits to health care systems around the world. One way to automatically extract information from the reports is by using Text Mining tools. The problem is that these tools are mostly developed for English and reports are usually written in the native language of the radiologist, which is not necessarily English. This creates an obstacle to the sharing of Radiology information between different communities. This work explores the solution of translating the reports to English before applying the Text Mining tools, probing the question of what translation approach should be used. We created MRRAD (Multilingual Radiology Research Articles Dataset), a parallel corpus of Portuguese research articles related to Radiology and a number of alternative translations (human, automatic and semi-automatic) to English. This is a novel corpus which can be used to move forward the research on this topic. Using MRRAD we studied which kind of automatic or semi-automatic translation approach is more effective on the named-entity recognition task of finding RadLex terms in the English version of the articles. Considering the terms extracted from human translations as our gold standard, we calculated how similar to this standard were the terms extracted using other translations. We found that a completely automatic translation approach using Google leads to F-scores (between 0.861 and 0.868, depending on the extraction approach) similar to the ones obtained through a more expensive semi-automatic translation approach using Unbabel (between 0.862 and 0.870). To better understand the results we also performed a qualitative analysis of the type of errors found in the automatic and semi-automatic translations. Database URL: https://github.com/lasigeBioTM/MRRAD
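The comparison described in this abstract (terms extracted from a candidate translation scored against terms extracted from the human translation) can be sketched as a set-based F-score. This is a minimal illustration under stated assumptions: it uses the balanced F1 measure and matches unique terms exactly, whereas the abstract does not say whether the authors counted unique terms or individual occurrences. The function name and the example terms are hypothetical.

```python
from typing import Iterable


def f_score(gold_terms: Iterable[str], extracted_terms: Iterable[str]) -> float:
    """Balanced F-score (F1) of extracted terms against a gold-standard set."""
    gold = set(gold_terms)
    extracted = set(extracted_terms)

    true_positives = len(gold & extracted)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0

    precision = true_positives / len(extracted)  # share of extracted terms that are correct
    recall = true_positives / len(gold)          # share of gold terms that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical terms from a human translation (gold) and a machine translation.
    gold = ["lung", "nodule", "fracture", "edema"]
    machine = ["lung", "nodule", "fracture", "mass"]
    print(f_score(gold, machine))  # 0.75
```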
Affiliation(s)
- Luís Campos
  - LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal
- Vasco Pedro
  - Unbabel Lda, Rua Visconde de Santarém, 67-B, 1000-286 Lisboa, Portugal
- Francisco Couto
  - LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal
8. Ferreira JD, Inácio B, Salek RM, Couto FM. Assessing Public Metabolomics Metadata, Towards Improving Quality. J Integr Bioinform 2017; 14:/j/jib.2017.14.issue-4/jib-2017-0054/jib-2017-0054.xml. [PMID: 29236679 PMCID: PMC6042808 DOI: 10.1515/jib-2017-0054] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 08/24/2017] [Accepted: 11/03/2017] [Indexed: 01/17/2023] Open
Abstract
Public resources need to be appropriately annotated with metadata in order to make them discoverable, reproducible and traceable, further enabling them to be interoperable or integrated with other datasets. While data-sharing policies exist to promote the annotation process by data owners, these guidelines are still largely ignored. In this manuscript, we analyse automatic measures of metadata quality, and suggest their application as a means to encourage data owners to increase the metadata quality of their resources and submissions, thereby contributing to higher quality data, improved data sharing, and the overall accountability of scientific publications. We analyse these metadata quality measures in the context of a real-world repository of metabolomics data (i.e. MetaboLights), including a manual validation of the measures, and an analysis of their evolution over time. Our findings suggest that the proposed measures can be used to mimic a manual assessment of metadata quality.
9.
Abstract
Computational manipulation of knowledge is an important, and often under-appreciated, aspect of biomedical Data Science. The first Data Science initiative from the US National Institutes of Health was entitled "Big Data to Knowledge (BD2K)." The main emphasis of the more than $200M allocated to that program has been on "Big Data;" the "Knowledge" component has largely been the implicit assumption that the work will lead to new biomedical knowledge. However, there is long-standing and highly productive work in computational knowledge representation and reasoning, and computational processing of knowledge has a role in the world of Data Science. Knowledge-based biomedical Data Science involves the design and implementation of computer systems that act as if they knew about biomedicine. There are many ways in which a computational approach might act as if it knew something: for example, it might be able to answer a natural language question about a biomedical topic, or pass an exam; it might be able to use existing biomedical knowledge to rank or evaluate hypotheses; it might explain or interpret data in light of prior knowledge, either in a Bayesian or other sort of framework. These are all examples of automated reasoning that act on computational representations of knowledge. After a brief survey of existing approaches to knowledge-based data science, this position paper argues that such research is ripe for expansion, and expanded application.
Affiliation(s)
- Lawrence E Hunter
  - Computational Bioscience, University of Colorado School of Medicine, Aurora, CO 80045, USA; ORCID: https://orcid.org/0000-0003-1455-3370
10. Identifying Human Phenotype Terms by Combining Machine Learning and Validation Rules. BioMed Research International 2017; 2017:8565739. [PMID: 29250549 PMCID: PMC5700471 DOI: 10.1155/2017/8565739] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 04/10/2017] [Revised: 09/20/2017] [Accepted: 10/15/2017] [Indexed: 11/18/2022]
Abstract
Named-Entity Recognition is commonly used to identify biological entities such as proteins, genes, and chemical compounds found in scientific articles. The Human Phenotype Ontology (HPO) is an ontology that provides a standardized vocabulary for phenotypic abnormalities found in human diseases. This article presents the Identifying Human Phenotypes (IHP) system, tuned to recognize HPO entities in unstructured text. IHP uses Stanford CoreNLP for text processing and applies Conditional Random Fields trained with a rich feature set, which includes linguistic, orthographic, morphologic, lexical, and context features created for the machine learning-based classifier. However, the main novelty of IHP is its validation step based on a set of carefully crafted manual rules, such as the negative connotation analysis, that combined with a dictionary can filter incorrectly identified entities, find missed entities, and combine adjacent entities. The performance of IHP was evaluated using the recently published HPO Gold Standardized Corpora (GSC), where the system Bio-LarK CR obtained the best F-measure of 0.56. IHP achieved an F-measure of 0.65 on the GSC. Due to inconsistencies found in the GSC, an extended version of the GSC was created, adding 881 entities and modifying 4 entities. IHP achieved an F-measure of 0.863 on the new GSC.
11. Eftimov T, Koroušić Seljak B, Korošec P. A rule-based named-entity recognition method for knowledge extraction of evidence-based dietary recommendations. PLoS One 2017. [PMID: 28644863 PMCID: PMC5482438 DOI: 10.1371/journal.pone.0179488] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Indexed: 02/06/2023] Open
Abstract
Evidence-based dietary information represented as unstructured text is crucial information that needs to be accessed in order to help dietitians follow the new knowledge that arrives daily with newly published scientific reports. Different named-entity recognition (NER) methods have been introduced previously to extract useful information from the biomedical literature. They focus on, for example, extracting gene mentions, protein mentions, relationships between genes and proteins, chemical concepts, and relationships between drugs and diseases. In this paper, we present a novel NER method, called drNER, for knowledge extraction of evidence-based dietary information. To the best of our knowledge, this is the first attempt at extracting dietary concepts. DrNER is a rule-based NER method that consists of two phases. The first one involves the detection and determination of the entity mentions, and the second one involves the selection and extraction of the entities. We evaluate the method by using text corpora from heterogeneous sources, including text from several scientifically validated web sites and text from scientific publications. Evaluation of the method showed that drNER gives good results and can be used for knowledge extraction of evidence-based dietary recommendations.
Affiliation(s)
- Tome Eftimov
  - Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
  - Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
- Peter Korošec
  - Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
  - Faculty of Mathematics, Natural Science and Information Technologies, Koper, Slovenia